I am new to Python and I am trying to scrape this website. I want to get just the dates and the articles' titles from it. I followed a procedure I found on SO, which is as follows:
from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text)
movies = soup.select(".title a , .date")
print(movies)
movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu"+ title["href"] for title in movies]
print(movies_titles)
print(movies_links)
I got the selector .title a , .date using SelectorGadget on the URL I shared. However, print(movies) prints an empty list. What am I doing wrong?
Can anyone help me?
Thanks!
Answer
The content is not part of index.en.html but is loaded in by JavaScript from https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html, so that is the URL you need to request.
As far as I know, you can't select title/date pairs with a single selector, so you need to select the titles and the dates separately and then pair them up:
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
Then you can print them out like this:
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
Result:
['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']
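The titles in that output contain a non-breaking space (U+00A0) after the colon, and some have trailing whitespace. If that gets in the way, something like the following could tidy them up and build the links with urljoin instead of string concatenation. This is only a sketch: clean_title, absolute_link and BASE_URL are names made up for illustration, and the sample strings are copied from the output above rather than fetched from the site.

```python
from urllib.parse import urljoin

# Assumption: the https form of the host used in the answer's links.
BASE_URL = "https://www.ecb.europa.eu"


def clean_title(raw):
    """Replace non-breaking spaces with ordinary spaces and trim the ends."""
    return raw.replace("\xa0", " ").strip()


def absolute_link(href):
    """Resolve a root-relative href against the site's base URL."""
    return urljoin(BASE_URL, href)


# Sample values taken from the printed result, standing in for scraped data.
raw_titles = [
    "Christine Lagarde:\xa0Interview with CNBC",
    "Fabio Panetta:\xa0Interview with El País ",
]
hrefs = [
    "/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html",
    "/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html",
]

clean_titles = [clean_title(t) for t in raw_titles]
links = [absolute_link(h) for h in hrefs]

print(clean_titles)
print(links)
```

urljoin also copes with hrefs that are already absolute, which plain concatenation would mangle.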
Full code:
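from bs4 import BeautifulSoup
import requests

# The listing is loaded by JavaScript from this include file, not from index.en.html
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
# Name the parser explicitly to avoid bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(res.text, "html.parser")

# Select titles and dates separately, then pair them up by position
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))

movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)
movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)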
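The question also asked for the dates, which the code above collects in pairs but never prints. Here is a minimal sketch of how the date text could be pulled out alongside each title. It runs on a small inline HTML string so it works offline; that markup is a hypothetical stand-in that only borrows the .title and .date class names, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not copied from the ECB page.
SAMPLE_HTML = """
<div class="date">12 April 2021</div>
<div class="title"><a href="/press/inter/a.en.html">First interview</a></div>
<div class="date">11 April 2021</div>
<div class="title"><a href="/press/inter/b.en.html">Second interview</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

titles = soup.select(".title a")
dates = soup.select(".date")

# zip() pairs items by position and silently stops at the shorter list,
# so make sure both selectors matched the same number of elements.
if len(titles) != len(dates):
    raise ValueError(f"{len(titles)} titles but {len(dates)} dates")

# Build (date, title) tuples of plain text.
rows = [
    (date.get_text(strip=True), title.get_text(strip=True))
    for title, date in zip(titles, dates)
]

for date_text, title_text in rows:
    print(date_text, "|", title_text)
```

Pairing by position assumes every entry has exactly one date and one title in the same order. If that does not hold on the real page, selecting each entry's container and reading both fields from it would be safer.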