
Python parser returns None instead of reading data from the site

I’m writing a Python parser for the site https://www.kinopoisk.ru/lists/series-top250/

The task is to extract the genres of each title (they appear on the page as span elements with class_='selection-film-item-meta__meta-additional-item').

import requests
from bs4 import BeautifulSoup

URL = 'https://www.kinopoisk.ru/lists/series-top250/'
HEADERS = {'user-agent': 'Mozilla/5.1 (Windows NT 7.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0',
           'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

def get_html(url, params = ''):
    r = requests.get(url, headers=HEADERS, params=params)
    return r


def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('span', class_='selection-film-item-meta__meta-additional-item')
    cards = []

    for item in items:
        cards.append(
            {
                'title': item.find('span', class_='title')
            }
        )
    return cards


html = get_html(URL)
print(get_content(html.text))

I can’t understand why it returns this result: [{'title': None}, {'title': None}, {'title': None}, ... {'title': None}]

Answer

I’m getting captcha blocks when I run this from my local machine; the request is redirected to:

https://www.kinopoisk.ru/showcaptcha?cc=1&retpath=https%3A//www.kinopoisk.ru/lists/series-top250%3F_ea4584...

Running from Google Colab, however, I was able to reproduce your error. Since you are running through a VPN, you probably will not encounter the captcha.
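
If you want to rule the captcha out before parsing, one option is to look at the final URL of the response. The sketch below is only an illustration: the helper name looks_like_captcha is hypothetical, and it assumes the block page can be recognized by 'showcaptcha' in the URL path, as in the redirect quoted above.

```python
from urllib.parse import urlparse


def looks_like_captcha(final_url):
    """Return True if the URL path points at a captcha page.

    Assumption: the block page is recognizable by 'showcaptcha' in the
    path, as in the redirect URL quoted in this answer.
    """
    return 'showcaptcha' in urlparse(final_url).path.lower()


# With requests, the URL after redirects is available as r.url:
#     r = requests.get(URL, headers=HEADERS)
#     if looks_like_captcha(r.url): ...
print(looks_like_captcha('https://www.kinopoisk.ru/showcaptcha?cc=1'))
print(looks_like_captcha('https://www.kinopoisk.ru/lists/series-top250/'))
```

Only the path is checked, so a normal page whose query string happens to mention the captcha URL is not flagged.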

The real issue is that the elements in items have no child with class title, so item.find('span', class_='title') returns None for each of them and your dictionary is filled with None. In addition, the class you are searching for is shared by a sibling (the country is a span with the same class name), so you would have to skip every other element of the result to get only the film genres.

<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, фэнтези</span>
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, комедия</span>
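
To make both points concrete, here is a minimal sketch that parses just the four spans quoted above. It assumes the spans strictly alternate country, genre, country, genre; if a card is missing one of them, the slice would silently shift, which is why this is shown only as an illustration.

```python
from bs4 import BeautifulSoup

# The four spans quoted above: country and genre share one class name.
SAMPLE = '''
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, фэнтези</span>
<span class="selection-film-item-meta__meta-additional-item">США</span>
<span class="selection-film-item-meta__meta-additional-item">мультфильм, комедия</span>
'''

soup = BeautifulSoup(SAMPLE, 'html.parser')
items = soup.find_all('span', class_='selection-film-item-meta__meta-additional-item')

# Each span holds only text, so searching inside it for a 'title' span finds nothing.
print(items[0].find('span', class_='title'))  # None

# Assumption: spans strictly alternate country, genre, country, genre...
genres = [item.get_text(strip=True) for item in items[1::2]]
print(genres)  # ['мультфильм, фэнтези', 'мультфильм, комедия']
```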

I would suggest using a parent element instead, so that you can extract several pieces of information from each film card more precisely.

def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='selection-film-item-meta selection-film-item-meta_theme_desktop')

    cards = []
    for item in items:
        title = item.find('p', {'class':'selection-film-item-meta__name'})
        additional = item.find_all('span', {'class':'selection-film-item-meta__meta-additional-item'})
        cards.append(
            {
                'title': title.get_text(),
                'country': additional[0].get_text(),
                'genre': additional[1].get_text(),
            }
        )
    return cards

Output:

[{
    'title': 'Аватар: Легенда об Аанге',
    'country': 'США',
    'genre': 'мультфильм, фэнтези'
}, {
    'title': 'Гравити Фолз',
    'country': 'США',
    'genre': 'мультфильм, комедия'
}, {
    'title': 'Друзья',
    'country': 'США',
    'genre': 'комедия, мелодрама'
}, {
...
...
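
One caveat with the function above: it indexes additional[0] and additional[1] directly, so a card with fewer than two spans would raise IndexError, and a class_ string containing a space is compared against the exact value of the class attribute. A more defensive variant might look like the sketch below. The name get_content_safe and the sample markup are hypothetical (the markup is modeled on the class names used in this answer, not copied from the live page); select and select_one are Beautiful Soup's CSS selector methods.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the class names used in this answer.
# The second card has its classes in a different order and no genre span.
SAMPLE = '''
<div class="selection-film-item-meta selection-film-item-meta_theme_desktop">
  <p class="selection-film-item-meta__name">Друзья</p>
  <span class="selection-film-item-meta__meta-additional-item">США</span>
  <span class="selection-film-item-meta__meta-additional-item">комедия, мелодрама</span>
</div>
<div class="selection-film-item-meta_theme_desktop selection-film-item-meta">
  <p class="selection-film-item-meta__name">Без жанра</p>
  <span class="selection-film-item-meta__meta-additional-item">США</span>
</div>
'''


def get_content_safe(html):
    soup = BeautifulSoup(html, 'html.parser')
    cards = []
    # A CSS class selector matches regardless of class order or extra classes.
    for item in soup.select('div.selection-film-item-meta'):
        title = item.select_one('p.selection-film-item-meta__name')
        extra = [
            span.get_text(strip=True)
            for span in item.select('span.selection-film-item-meta__meta-additional-item')
        ]
        cards.append({
            # Fall back to None instead of raising when a piece is missing.
            'title': title.get_text(strip=True) if title else None,
            'country': extra[0] if len(extra) > 0 else None,
            'genre': extra[1] if len(extra) > 1 else None,
        })
    return cards


print(get_content_safe(SAMPLE))
```

It is called the same way as before, for example get_content_safe(html.text), and returns an empty list when no cards are found, which is what you would see if the response was a captcha page.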
User contributions licensed under: CC BY-SA