I'm writing a Python scraper for this page: https://www.kinopoisk.ru/lists/series-top250/
The task is to extract the genres of each title (they appear on the page in a 'span' with class_ = 'selection-film-item-meta__meta-additional-item').
import requests
from bs4 import BeautifulSoup

URL = 'https://www.kinopoisk.ru/lists/series-top250/'
HEADERS = {
    'user-agent': 'Mozilla/5.1 (Windows NT 7.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
}

def get_html(url, params=''):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('span', class_='selection-film-item-meta__meta-additional-item')
    cards = []
    for item in items:
        cards.append(
            {
                'title': item.find('span', class_='title')
            }
        )
    return cards

html = get_html(URL)
print(get_content(html.text))
I can't understand why it returns: [{'title': None}, {'title': None}, {'title': None}, ... {'title': None}]
Answer
I'm definitely getting captcha blocks from my local machine, but running from Google Colab I was able to reproduce your error. Since you are running from a VPN, you probably will not encounter this issue.
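To tell a captcha or block page apart from the real listing before parsing, one rough check is to look for the class name from the question in the response body. This is only a sketch: looks_blocked and the two sample strings are hypothetical, and the text of an actual block page is an assumption, not something taken from the site.

```python
# Class name taken from the question; if it is absent, the response is
# probably not the film listing (for example a captcha page).
EXPECTED_MARKER = 'selection-film-item-meta__meta-additional-item'

def looks_blocked(html_text, marker=EXPECTED_MARKER):
    """Hypothetical helper: True when the expected film markup is missing."""
    return marker not in html_text

# Made-up sample bodies, used only to illustrate the check.
sample_ok = '<span class="selection-film-item-meta__meta-additional-item">США</span>'
sample_blocked = '<html><body>Please confirm you are not a robot</body></html>'

print(looks_blocked(sample_ok))
print(looks_blocked(sample_blocked))
```

In the original script this could be called on html.text right after get_html(URL), before the text is handed to get_content.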
The real issue here is that the elements in items don't contain anything with the class title, so item.find('span', class_='title') returns None every time and your dictionary is naturally filled with None. Since the class you are looking for is shared with a sibling (a span with the same class name holds the country), you would have to skip every other element of the result to get only the film genres.
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, фэнтези</span>
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, комедия</span>
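Both points can be checked on the four spans above without touching the network. The sketch below assumes the spans strictly alternate country, genre, country, genre; a card with a missing or extra span would shift the pattern.

```python
from bs4 import BeautifulSoup

# The sample markup shown above, inlined so the example runs offline.
SAMPLE = '''
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, фэнтези</span>
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, комедия</span>
'''

soup = BeautifulSoup(SAMPLE, 'html.parser')
items = soup.find_all('span', class_='selection-film-item-meta__meta-additional-item')

# The spans have no nested <span class="title">, so find() returns None.
first_lookup = items[0].find('span', class_='title')

# Assumption: odd indices hold the genres (country, genre, country, genre, ...).
genres = [item.get_text() for item in items[1::2]]

print(first_lookup)
print(genres)
```

The slice items[1::2] is what "skip every other element" amounts to; it works only as long as the alternation holds.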
I would suggest using a parent element so you can extract multiple pieces of information from each film card with more specificity.
def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    # One parent div per film card
    items = soup.find_all('div', class_='selection-film-item-meta selection-film-item-meta_theme_desktop')
    cards = []
    for item in items:
        title = item.find('p', {'class': 'selection-film-item-meta__name'})
        # Two spans share this class: the first is the country, the second the genres
        additional = item.find_all('span', {'class': 'selection-film-item-meta__meta-additional-item'})
        cards.append(
            {
                'title': title.get_text(),
                'country': additional[0].get_text(),
                'genre': additional[1].get_text(),
            }
        )
    return cards
[{
    'title': 'Аватар: Легенда об Аанге',
    'country': 'США',
    'genre': 'мультфильм, фэнтези'
}, {
    'title': 'Гравити Фолз',
    'country': 'США',
    'genre': 'мультфильм, комедия'
}, {
    'title': 'Друзья',
    'country': 'США',
    'genre': 'комедия, мелодрама'
}, {
    ...
    ...
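If some cards turn out to lack a title or one of the two spans, title.get_text() or additional[1] would raise an exception. A more defensive variant might look like the sketch below. The function name get_content_safe and the sample markup are hypothetical: the markup is assembled from the selectors used above and is not copied from the live page.

```python
from bs4 import BeautifulSoup

def get_content_safe(html):
    """Hypothetical defensive variant: skips cards that lack expected elements."""
    soup = BeautifulSoup(html, 'html.parser')
    cards = []
    # A single class name matches any div that carries it among its classes.
    for item in soup.find_all('div', class_='selection-film-item-meta'):
        title = item.find('p', class_='selection-film-item-meta__name')
        additional = item.find_all(
            'span', class_='selection-film-item-meta__meta-additional-item')
        if title is None or len(additional) < 2:
            continue  # incomplete card, nothing reliable to extract
        cards.append({
            'title': title.get_text(strip=True),
            'country': additional[0].get_text(strip=True),
            'genre': additional[1].get_text(strip=True),
        })
    return cards

# Assumed card structure, for illustration only.
SAMPLE = '''
<div class="selection-film-item-meta selection-film-item-meta_theme_desktop">
  <p class="selection-film-item-meta__name">Друзья</p>
  <span class="selection-film-item-meta__meta-additional-item">США</span>
  <span class="selection-film-item-meta__meta-additional-item">комедия, мелодрама</span>
</div>
<div class="selection-film-item-meta selection-film-item-meta_theme_desktop">
  <p class="selection-film-item-meta__name">Incomplete card</p>
</div>
'''

result = get_content_safe(SAMPLE)
print(result)
```

Skipping incomplete cards silently is a design choice; logging them instead may be preferable if you need to know how many were dropped.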