I am new to Python and I am trying to web-scrape this website. What I am trying to do is to get just the dates and the articles' titles from this website. I followed a procedure I found on SO, which is as follows:
from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/inter/html/index.en.html"
res = requests.get(url)
soup = BeautifulSoup(res.text)

movies = soup.select(".title a , .date")
print(movies)

movies_titles = [title.text for title in movies]
movies_links = ["http://www.ecb.europa.eu" + title["href"] for title in movies]
print(movies_titles)
print(movies_links)
I got .title a , .date using SelectorGadget on the URL I shared. However, print(movies) returns an empty list. What am I doing wrong?
Can anyone help me?
Thanks!
Answer
The content is not part of index.en.html but is loaded in by JavaScript from:

https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html
Also, as far as I know you can't select pairs with a single selector, so you need to select the titles and dates separately and then zip them together:
titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))
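One caveat worth knowing: zip silently truncates to the shorter of its inputs, so if the two selectors ever return different counts, entries get dropped without any error. A small illustrative sketch (the lists here are hypothetical stand-ins for the select() results, not from the original answer):

```python
# Stand-ins for the lists that soup.select(".title a") and
# soup.select(".date") would return (hypothetical data).
titles = ["Interview A", "Interview B", "Interview C"]
dates = ["12 April 2021", "11 April 2021"]  # one date missing

# zip stops at the shorter input, so a mismatch drops entries silently;
# comparing the lengths first catches selector problems early.
if len(titles) != len(dates):
    print(f"warning: {len(titles)} titles but {len(dates)} dates")

pairs = list(zip(titles, dates))
print(len(pairs))  # 2, the shorter length
```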
Then you can print them out like this:
movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)

movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
Result:
['Christine Lagarde:\xa0Interview with CNBC', 'Fabio Panetta:\xa0Interview with El País ', 'Isabel Schnabel:\xa0Interview with Der Spiegel', 'Philip R. Lane:\xa0Interview with CNBC', 'Frank Elderson:\xa0Q&A on Twitter', 'Isabel Schnabel:\xa0Interview with Les Echos ', 'Philip R. Lane:\xa0Interview with the Financial Times', 'Luis de Guindos:\xa0Interview with Público', 'Philip R. Lane:\xa0Interview with Expansión', 'Isabel Schnabel:\xa0Interview with LETA', 'Fabio Panetta:\xa0Interview with Der Spiegel', 'Christine Lagarde:\xa0Interview with Le Journal du Dimanche ', 'Philip R. Lane:\xa0Interview with Süddeutsche Zeitung', 'Isabel Schnabel:\xa0Interview with Deutschlandfunk', 'Philip R. Lane:\xa0Interview with SKAI TV', 'Isabel Schnabel:\xa0Interview with Der Standard']
['http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210412~ccd1b7c9bf.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210411~44ade9c3b5.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210409~c8c348a12c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210323~e4026c61d1.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317_1~1d81212506.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210317~458636d643.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210316~930d09ce3c.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210302~c793ad7b68.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210226~79eba6f9fb.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210225~5f1be75a9f.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210209~af9c628e30.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210207~f6e34f3b90.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131_1~650f5ce5f7.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210131~13d84cb9b2.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210127~9ad88eb038.en.html', 'http://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in210112~1c3f989acd.en.html']
Full code:
from bs4 import BeautifulSoup
import requests

url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
res = requests.get(url)
# Passing an explicit parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(res.text, "html.parser")

titles = soup.select(".title a")
dates = soup.select(".date")
pairs = list(zip(titles, dates))

movies_titles = [pair[0].text for pair in pairs]
print(movies_titles)

movies_links = ["http://www.ecb.europa.eu" + pair[0]["href"] for pair in pairs]
print(movies_links)
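Since the original goal was dates plus titles, the zipped pairs can also be combined into one record per article. The \xa0 in the titles is a non-breaking space, which you can normalize with str.replace. A minimal offline sketch of that pairing step, using a hypothetical HTML snippet shaped like the ECB include page (the hrefs and texts here are made up):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet shaped like the include page: one .date and one
# .title element per article entry.
html = """
<div class="date">12 April 2021</div>
<div class="title"><a href="/press/inter/date/2021/html/a.en.html">Christine Lagarde:\xa0Interview with CNBC</a></div>
<div class="date">11 April 2021</div>
<div class="title"><a href="/press/inter/date/2021/html/b.en.html">Fabio Panetta:\xa0Interview with El País</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

titles = soup.select(".title a")
dates = soup.select(".date")

# Combine into (date, title, link) records; \xa0 is a non-breaking
# space in the scraped titles, so normalize it to a plain space.
records = [
    (date.text.strip(),
     title.text.replace("\xa0", " ").strip(),
     "http://www.ecb.europa.eu" + title["href"])
    for title, date in zip(titles, dates)
]

for date, title, link in records:
    print(f"{date} | {title} | {link}")
```

Applied to the real index_include page, the same comprehension would give you each article's date, cleaned title, and absolute link in one place.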