I'm trying to scrape www.albumoftheyear.org, but my code only produces an empty DataFrame.
I don't know whether the site is protected by something like Cloudflare and that is the cause, or whether I am selecting the wrong tags.
The basic idea is to iterate through the pages, collect the data (title, year, genre) for each album, and build a pandas DataFrame.
Here is the code I have so far:
    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

    url = 'https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}'

    title = []
    data = []
    genre = []

    for i in range(1,11):
        soup = BeautifulSoup(requests.get(url.format(i)).content, "html.parser")
        album_lists = soup.find_all(class_='albumListRow')
        for album_list in album_lists:
            album_title = album_list.find('h2',{'class':'albumListTitle'}).find('a').text
            album_data = album_list.find('div', {'class':'albumListDate'}).text
            album_genre = album_list.find('div', {'class': 'albumListGenre'}).find('a').text
            title.append(album_title)
            data.append(album_data)
            genre.append(album_genre)

    df = pd.DataFrame(list(zip(title,data,genre)), columns=['title', 'data','genre'])
Answer
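Before switching tools, it can help to check whether the empty result comes from the request being rejected or from the selectors. The sketch below is one way to do that; `diagnose` and `fetch_and_diagnose` are hypothetical helper names, the `User-Agent` header value is an assumption, and nothing here confirms how this particular site behaves.

```python
def diagnose(status_code, html, marker="albumListRow"):
    """Classify a response by its status code and by whether the
    class name used in the selectors appears in the raw HTML."""
    if status_code != 200:
        return f"blocked or failed (HTTP {status_code})"
    if marker not in html:
        return "page loaded, but marker not in raw HTML"
    return f"marker found {html.count(marker)} time(s)"


def fetch_and_diagnose(url):
    """Fetch url with requests and classify the response."""
    import requests  # imported here so diagnose() works without requests installed

    response = requests.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},  # assumed header value
        timeout=10,
    )
    return diagnose(response.status_code, response.text)


# Offline checks with made-up inputs:
print(diagnose(403, ""))
print(diagnose(200, "<div class='albumListRow'></div>"))
```

A non-200 status suggests the request itself is being refused, while a 200 response without the marker suggests the rows are not in the HTML that `requests` receives.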
Here is a working solution using Selenium. Note that you need the WebDriver for your browser installed on your system. I am using Chrome with ChromeDriver, which is a separate download. Yes, you need both the browser and the driver.
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.common.exceptions import NoSuchElementException

    driver = webdriver.Chrome(executable_path=r'C:**YOUR PATH**chromedriver.exe')
    driver.get(r"https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}")

    title_list = []
    date_list = []
    genre_list = []

    try:
        # Wait up to 10 seconds for the main content container to appear
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "centerContent"))
        )
        albumlistrow = element.find_elements_by_class_name('albumListRow')
        for a in albumlistrow:
            title = a.find_element_by_class_name('albumListTitle')
            date = a.find_element_by_class_name('albumListDate')
            try:
                genre = a.find_element_by_class_name('albumListGenre')
            except NoSuchElementException:
                # Note: genre keeps the element from the previous row here
                pass
            title_list.append(title.text)
            date_list.append(date.text)
            genre_list.append(genre.text)

    finally:
        # Always close the browser window, even if scraping fails
        driver.close()

    df = pd.DataFrame(list(zip(title_list,date_list,genre_list)), columns=['title', 'data','genre'])
    df.head()
Output:

       title                                           data                genre
    0  500. Arcade Fire - Funeral                      September 14, 2004  Indie Rock
    1  499. Rufus & Chaka Khan - Ask Rufus             January 19, 1977    Soul
    2  498. Suicide - Suicide                          December 28, 1977   Synth Punk
    3  497. Various Artists - The Indestructible Beat  January 1, 1985     Synth Punk
    4  496. Shakira - Dónde Están los Ladrones?        September 29, 1998  Pop Rock

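One thing to watch in the loop above: when a row has no `albumListGenre` element, the `except` branch does nothing, so `genre` still refers to the previous row's element and that genre is appended again. This may be why rows 2 and 3 both show Synth Punk, though the output alone does not confirm it. A minimal sketch of one way to avoid the carry-over follows; `text_or_default` is a hypothetical helper, and `FakeElement` and `MissingElement` are stand-ins so the example runs without a browser.

```python
def text_or_default(lookup, missing_exc, default=""):
    """Return lookup().text, or default if lookup raises missing_exc."""
    try:
        return lookup().text
    except missing_exc:
        return default


class FakeElement:
    """Stand-in for a Selenium WebElement (illustration only)."""
    def __init__(self, text):
        self.text = text


class MissingElement(Exception):
    """Stand-in for selenium's NoSuchElementException."""


def make_lookup(text):
    """Build a lookup that raises when the row has no genre."""
    def lookup():
        if text is None:
            raise MissingElement()
        return FakeElement(text)
    return lookup


rows = ["Soul", None, "Pop Rock"]  # made-up rows; the middle one has no genre
genres = [text_or_default(make_lookup(g), MissingElement) for g in rows]
print(genres)  # ['Soul', '', 'Pop Rock']
```

In the Selenium loop this could be used as `genre_list.append(text_or_default(lambda: a.find_element(By.CLASS_NAME, 'albumListGenre'), NoSuchElementException))`, so a row without a genre gets an empty string instead of the previous row's value.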
If you do not want the albumListRank in the title, change this line from

    title_list.append(title.text)

to

    title_list.append(title.text[4:])
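The fixed slice `[4:]` assumes a three-digit rank and leaves the space that follows the dot. As a sketch of an alternative that does not depend on the number of digits, a regular expression can strip the leading rank; `strip_rank` is a hypothetical helper name, and it assumes every title starts with digits, a dot, and optional whitespace.

```python
import re

# Leading rank such as "500. " or "7. ": digits, a dot, then optional spaces
RANK_PREFIX = re.compile(r"^\s*\d+\.\s*")


def strip_rank(title):
    """Remove a leading 'N. ' rank from an album title, if present."""
    return RANK_PREFIX.sub("", title, count=1)


print(repr("500. Arcade Fire - Funeral"[4:]))          # ' Arcade Fire - Funeral'
print(repr(strip_rank("500. Arcade Fire - Funeral")))  # 'Arcade Fire - Funeral'
print(repr(strip_rank("7. Example Artist - Example Album")))
```

With this helper the line becomes `title_list.append(strip_rank(title.text))`. Titles without a leading rank are returned unchanged.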