Why do I run into trouble webscraping this website in Python?


I am new to Python and I am trying to scrape this website. I want to get just the dates and the article titles from it. I followed a procedure I found on SO, which is as follows:

from bs4 import BeautifulSoup
import requests


url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text)

movies = soup.select(".title a , .date")
print(movies)

movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu"+ title["href"] for title in movies]
print(movies_titles)
print(movies_links)

I got the selector .title a , .date by using SelectorGadget on the URL I shared. However, print(movies) prints an empty list. What am I doing wrong?

Can anyone help me?

Thanks!

Answer

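The content is not part of index.en.html itself. requests does not execute JavaScript, so the HTML you parse never contains those elements; the browser loads them via JS from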

https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html

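A combined selector gives you one flat list that mixes date and title elements, not pairs, and the .date elements presumably have no href, so title["href"] would fail on them. As far as I know, the simplest fix is to select titles and dates separately and pair them up: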

titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
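
As a rough sketch of how this pairing behaves, here is the same idea run against a small inline HTML snippet instead of the live site. The markup is made up for illustration (the real ECB page may be structured differently); it only assumes that each entry has one .date element and one .title element containing a link. The names sample_html, sample_soup and sample_rows are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one .date and one .title link per entry.
sample_html = """
<dl>
  <dt class="date">12 April 2021</dt>
  <dd><div class="title"><a href="/press/a.en.html">First interview</a></div></dd>
  <dt class="date">11 April 2021</dt>
  <dd><div class="title"><a href="/press/b.en.html">Second interview</a></div></dd>
</dl>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

sample_titles = sample_soup.select(".title a")
sample_dates = sample_soup.select(".date")

# zip() walks both lists in step, so entry i gets title i and date i.
sample_rows = [
    (date.get_text(strip=True), link.get_text(strip=True), link["href"])
    for link, date in zip(sample_titles, sample_dates)
]
print(sample_rows)
```

Note that zip() stops at the shorter list, so if some entry were missing a date or a title, the later pairs would silently shift. Comparing len(titles) and len(dates) first is a cheap sanity check.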

Then you can print them out like this:

movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)

movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)

Result:

['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']
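
The titles in the result contain a non-breaking space (\xa0) after the name, and some have trailing whitespace. If that gets in the way, one possible cleanup is sketched below, together with urljoin from the standard library as an alternative to string concatenation for the links. The two sample values are copied from the result above; the helper name clean_title and the https base URL are my own assumptions, not part of the original code.

```python
from urllib.parse import urljoin

# Assumed base URL (https variant of the host used in the answer).
BASE_URL = "https://www.ecb.europa.eu"


def clean_title(raw):
    """Replace non-breaking spaces and trim surrounding whitespace."""
    return raw.replace("\xa0", " ").strip()


# Sample values taken from the printed result.
raw_title = "Fabio Panetta:\xa0Interview with El País "
raw_href = "/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html"

print(clean_title(raw_title))
print(urljoin(BASE_URL, raw_href))
```

urljoin also copes with hrefs that are already absolute, which plain concatenation does not.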

Full code:

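from bs4 import BeautifulSoup
import requests

# The listing is served from this include file, not from index.en.html.
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
res.raise_for_status()  # stop early on an HTTP error
# Name the parser explicitly to avoid bs4's "no parser specified" warning.
soup = BeautifulSoup(res.text, "html.parser")

# Select titles and dates separately, then pair them by position.
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))

movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)

movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)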


Source: stackoverflow