I am doing web scraping on this URL which is a newspaper image with highlighted words. My purpose is to retrieve all those highlighted words in red. Inspecting the page gives the class: image-overlay hit-rect ng-star-inserted
in which attribute title
must be extract:
Using the following code snippet with BeautifulSoup
:
from bs4 import BeautifulSoup pg_snippet_highlighted_words = soup.find_all("div", class_="image-overlay hit-rect ng-star-inserted") print(pg_snippet_highlighted_words) # returns nothing: [] print(pg_snippet_highlighted_words.get("title")) # AttributeError: ("'NoneType' object has no attribute 'get'",) when soup.find() is executed!
However, I get []
as a result!
My expected result is a list with length of 17
in this specific example, containing all the highlighted words in this page, e.g., the ones identified with title
attribute in inspect as follows:
EXPECTED_RESULT = ["Katri", "Katrina", "Katri", "Katri", "Katri", "Katri", "Katri", "Katri", "Ikonen.", "Katrina", "Katri", "Ikonen.", "Katri", "Katrina", "Katri", "Katri", "Katri"]
Is BeautifulSoup a correct tool to extract information when dealing with dynamic content?
Cheers,
Advertisement
Answer
The data you’re looking for is loaded from external URL via JavaScript. To get the data you can use following example:
import requests api_url = "https://digi.kansalliskirjasto.fi/rest/binding-search/ocr-hits/761979" params = {"page": "12", "term": ["Katri", "Katrina", "Ikonen"]} data = [d["text"] for d in requests.get(api_url, params=params).json()] print(data)
Prints:
['Katri', 'Katrina', 'Katri', 'Katri', 'Katri', 'Katri', 'Katri', 'Katri', 'Ikonen.', 'Katrina', 'Katri', 'Ikonen.', 'Katri', 'Katrina', 'Katri', 'Katri', 'Katri']