Skip to content
Advertisement

webscraping an image with highlighted text

I am doing web scraping on this URL which is a newspaper image with highlighted words. My purpose is to retrieve all those highlighted words in red. Inspecting the page gives the class: image-overlay hit-rect ng-star-inserted in which attribute title must be extract:

enter image description here Using the following code snippet with BeautifulSoup:

from bs4 import BeautifulSoup
pg_snippet_highlighted_words = soup.find_all("div", class_="image-overlay hit-rect ng-star-inserted")
print(pg_snippet_highlighted_words) # returns nothing: []
print(pg_snippet_highlighted_words.get("title")) # AttributeError: ("'NoneType' object has no attribute 'get'",) when soup.find() is executed!

However, I get [] as a result!

My expected result is a list with length of 17 in this specific example, containing all the highlighted words in this page, e.g., the ones identified with title attribute in inspect as follows:

EXPECTED_RESULT = ["Katri", "Katrina", "Katri", "Katri", "Katri", "Katri", "Katri", "Katri", "Ikonen.", "Katrina", "Katri", "Ikonen.", "Katri", "Katrina", "Katri", "Katri", "Katri"]

Is BeautifulSoup a correct tool to extract information when dealing with dynamic content?

Cheers,

Advertisement

Answer

The data you’re looking for is loaded from external URL via JavaScript. To get the data you can use following example:

import requests

api_url = "https://digi.kansalliskirjasto.fi/rest/binding-search/ocr-hits/761979"
params = {"page": "12", "term": ["Katri", "Katrina", "Ikonen"]}

data = [d["text"] for d in requests.get(api_url, params=params).json()]
print(data)

Prints:

['Katri', 'Katrina', 'Katri', 'Katri', 'Katri', 'Katri', 'Katri', 'Katri', 'Ikonen.', 'Katrina', 'Katri', 'Ikonen.', 'Katri', 'Katrina', 'Katri', 'Katri', 'Katri']
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement