
Scraping all entries of a lazy-loading page using Python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.

I found the tag that contains the links ('//*[@id="lazyload-container"]'), but it only returns the most recent links.

How can I get the rest?

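from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# url: address of the ECB press releases page
driver = webdriver.Firefox(service=Service(executable_path=r'/usr/local/bin/geckodriver'))
driver.get(url)
# The container only holds the entries that have been lazy-loaded so far
element = driver.find_element(By.XPATH, '//*[@id="lazyload-container"]')
html = element.get_attribute('innerHTML')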


Answer

The data is loaded via JavaScript from another URL, one per year. This example shows how to load the releases for each year directly:

import requests
from bs4 import BeautifulSoup

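# Each year's releases are served from a separate include file
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    # Reverse the list so that each year prints oldest first
    for a in soup.select(".title a")[::-1]:
        # The date element precedes the title link in the markup
        print(a.find_previous(class_="date").text, a.text)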

Prints:

25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes

...

17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
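
The loop above stops at a hard-coded year. As a minimal sketch, the per-year URLs could instead be generated up to the current year. The helper name `yearly_urls` and its parameters are hypothetical, and the sketch assumes the ECB keeps the same URL pattern for later years:

```python
from datetime import date

URL_TEMPLATE = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"


def yearly_urls(first_year=1997, last_year=None):
    """Return one include-file URL per year, from first_year to last_year inclusive.

    yearly_urls is a hypothetical helper; it assumes the URL pattern
    used above also holds for years after 2022.
    """
    if last_year is None:
        # Default to the current calendar year
        last_year = date.today().year
    return [URL_TEMPLATE.format(year) for year in range(first_year, last_year + 1)]
```

Each returned URL can then be passed to `requests.get` in place of `url.format(year)`.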

EDIT: To print links:

import requests
from bs4 import BeautifulSoup

url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"

for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            # The href is site-relative, so prepend the host
            "https://www.ecb.europa.eu" + a["href"],
        )

Prints:

...

15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html

...
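
Prepending the host works as long as every `href` starts with `/`. A slightly more defensive sketch uses `urllib.parse.urljoin`, which also leaves already-absolute links untouched. The helper name `to_absolute` is hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.ecb.europa.eu"


def to_absolute(href):
    """Resolve a link taken from the page against the ECB host.

    to_absolute is a hypothetical helper: site-relative hrefs get the host
    prepended, while absolute URLs are returned unchanged.
    """
    return urljoin(BASE_URL, href)
```

In the loop above, `to_absolute(a["href"])` could replace the string concatenation.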
User contributions licensed under: CC BY-SA