I'm writing a Python parser for the site: https://www.kinopoisk.ru/lists/series-top250/
The task is to extract the genres of the films (each one appears on the page as a 'span' with class_='selection-film-item-meta__meta-additional-item').
import requests
from bs4 import BeautifulSoup

URL = 'https://www.kinopoisk.ru/lists/series-top250/'
HEADERS = {'user-agent': 'Mozilla/5.1 (Windows NT 7.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0',
           'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

def get_html(url, params=''):
    r = requests.get(url, headers=HEADERS, params=params)
    return r

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('span', class_='selection-film-item-meta__meta-additional-item')
    cards = []
    for item in items:
        cards.append(
            {
                'title': item.find('span', class_='title')
            }
        )
    return cards

html = get_html(URL)
print(get_content(html.text))
I can't understand why it gives this result: [{'title': None}, {'title': None}, {'title': None}, … {'title': None}]
Answer
I'm definitely getting some captcha blocks from my local machine, but running from Google Colab I was able to reproduce your error. Since you are running from a VPN, you probably will not encounter this issue.
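Whether a request received the real list or a captcha page can be checked before parsing. The following is only a minimal sketch, assuming that a blocked response simply lacks the markup the parser expects; `looks_blocked` and `EXPECTED_MARKER` are hypothetical names, the two sample pages are stand-ins, and the actual captcha page may look different:

```python
# Hypothetical marker: the class name the parser searches for (taken from the question).
EXPECTED_MARKER = 'selection-film-item-meta__meta-additional-item'


def looks_blocked(html_text, marker=EXPECTED_MARKER):
    """Return True when the expected markup is missing from the page.

    Assumption: a captcha or block page does not contain the film list,
    so the marker class never appears in it.
    """
    return marker not in html_text


# Stand-in pages for illustration; a real page would come from requests.get(...).text
ok_page = '<span class="selection-film-item-meta__meta-additional-item">США</span>'
captcha_page = '<html><body><form>captcha stub</form></body></html>'

print(looks_blocked(ok_page))       # False
print(looks_blocked(captcha_page))  # True
```

If the check returns True for a live response, the parser would return an empty list rather than the list of None values from the question.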
The real issue here is that the elements in items don't contain any element with the class title, so naturally your dictionary is being filled with None. Since the span you are looking for has a sibling with the same class name (used for the country), you would have to skip every other element of the result to get only the film genres.
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, фэнтези</span>
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, комедия</span>
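Skipping every other element could be sketched as follows. This assumes that every card contributes exactly two matching spans, in country-then-genre order, as in the snippet above; a card with a missing span would shift the pairing. The `sample` markup and the `CLASS_NAME` constant are stand-ins for illustration:

```python
from bs4 import BeautifulSoup

CLASS_NAME = 'selection-film-item-meta__meta-additional-item'

# Stand-in for the downloaded page: the four spans shown above.
sample = '''
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, фэнтези</span>
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, комедия</span>
'''

soup = BeautifulSoup(sample, 'html.parser')
items = soup.find_all('span', class_=CLASS_NAME)

# Spans alternate country, genre, country, genre, ... so odd indices hold the genres.
genres = [item.get_text() for item in items[1::2]]
print(genres)  # ['мультфильм, фэнтези', 'мультфильм, комедия']
```

This works only as long as the alternation holds, so it is fragile compared with reading both spans from within each film card.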
I would suggest using a parent element so that you can extract several pieces of information from each film card with more specificity.
def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Each film card is wrapped in a div that holds the title and the extra spans
    items = soup.find_all('div', class_='selection-film-item-meta selection-film-item-meta_theme_desktop')
    cards = []
    for item in items:
        title = item.find('p', {'class': 'selection-film-item-meta__name'})
        # Within one card the first span is the country, the second is the genre
        additional = item.find_all('span', {'class': 'selection-film-item-meta__meta-additional-item'})
        cards.append(
            {
                'title': title.get_text(),
                'country': additional[0].get_text(),
                'genre': additional[1].get_text(),
            }
        )
    return cards
[{
'title': 'Аватар: Легенда об Аанге',
'country': 'США',
'genre': 'мультфильм, фэнтези'
}, {
'title': 'Гравити Фолз',
'country': 'США',
'genre': 'мультфильм, комедия'
}, {
'title': 'Друзья',
'country': 'США',
'genre': 'комедия, мелодрама'
}, {