I'm scraping www.albumoftheyear.org, but my code only produces an empty DataFrame.
I don't know whether the site is protected by something like Cloudflare and that is the cause, or whether I'm selecting the wrong tags.
The basic idea is to iterate through the pages, collect the data (title, year, genre) for each album, and build a pandas DataFrame.
Here is the code I have so far:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}'
title = []
data = []
genre = []
for i in range(1,11):
    soup = BeautifulSoup(requests.get(url.format(i)).content, "html.parser")
    album_lists = soup.find_all(class_='albumListRow')
    for album_list in album_lists:
        album_title = album_list.find('h2',{'class':'albumListTitle'}).find('a').text
        album_data = album_list.find('div', {'class':'albumListDate'}).text
        album_genre = album_list.find('div', {'class': 'albumListGenre'}).find('a').text
        title.append(album_title)
        data.append(album_data)
        genre.append(album_genre)
df = pd.DataFrame(list(zip(title,data,genre)), columns=['title', 'data','genre'])
Answer
A working solution using Selenium. Note that you need the WebDriver for your browser installed on your system. I am using Chrome, and chromedriver can be downloaded from its official download page. Yes, you need both the browser and the driver.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

# Path to your local chromedriver. Selenium 4 deprecates executable_path in favour of a Service object.
driver = webdriver.Chrome(executable_path=r'C:**YOUR PATH**chromedriver.exe')
# Loads the first page of the list; the question's {} page placeholder is dropped here.
driver.get(r"https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/")

title_list = []
date_list = []
genre_list = []

try:
    # Wait up to 10 seconds for the main content container to appear.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "centerContent"))
    )
    albumlistrow = element.find_elements(By.CLASS_NAME, 'albumListRow')
    for a in albumlistrow:
        title = a.find_element(By.CLASS_NAME, 'albumListTitle')
        date = a.find_element(By.CLASS_NAME, 'albumListDate')
        try:
            genre_text = a.find_element(By.CLASS_NAME, 'albumListGenre').text
        except NoSuchElementException:
            # Some rows have no genre; store an empty string rather than reusing the previous row's value.
            genre_text = ''
        title_list.append(title.text)
        date_list.append(date.text)
        genre_list.append(genre_text)
finally:
    driver.close()

df = pd.DataFrame(list(zip(title_list,date_list,genre_list)), columns=['title', 'data','genre'])
df.head()
Output:
   title                                              data                genre
0  500. Arcade Fire - Funeral                         September 14, 2004  Indie Rock
1  499. Rufus & Chaka Khan - Ask Rufus                January 19, 1977    Soul
2  498. Suicide - Suicide                             December 28, 1977   Synth Punk
3  497. Various Artists - The Indestructible Beat...  January 1, 1985     Synth Punk
4  496. Shakira - Dónde Están los Ladrones?           September 29, 1998  Pop Rock
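The question asks for the year rather than the full release date. A minimal sketch of pulling it out of date strings shaped like the ones in the output above (month name, day, four-digit year). `extract_year` is a hypothetical helper name, and the date format is assumed from these five sample rows only:

```python
import re

def extract_year(date_text):
    """Return the trailing four-digit year as an int, or None if none is found."""
    match = re.search(r"(\d{4})\s*$", date_text)
    return int(match.group(1)) if match else None

# Two date strings taken from the sample output above.
dates = ["September 14, 2004", "January 19, 1977"]
print([extract_year(d) for d in dates])
```

With the DataFrame built above, something like `df['year'] = df['data'].map(extract_year)` should add a year column.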
If you do not want the rank (the albumListRank number) at the start of each title, change this line from
title_list.append(title.text)
to
title_list.append(title.text[4:])
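Note that a fixed slice assumes every rank prefix has the same width: `[4:]` fits two-digit ranks such as `42. `, leaves a leading space after three-digit ranks such as `500. `, and cuts into the artist name after single-digit ranks. A minimal sketch of a width-independent alternative, assuming every title starts with digits followed by a period as in the output above. `strip_rank` is a hypothetical helper name, not part of Selenium or pandas:

```python
import re

# Matches a leading rank such as "500. " or "7. ", whatever its width.
_RANK_PREFIX = re.compile(r"^\s*\d+\.\s*")

def strip_rank(title_text):
    """Return the title without its leading rank; titles without a rank are unchanged."""
    return _RANK_PREFIX.sub("", title_text, count=1)

print(strip_rank("500. Arcade Fire - Funeral"))
print(strip_rank("7. Some Artist - Some Album"))  # made-up title for illustration
```

In the loop, this would mean appending `strip_rank(title.text)` instead of `title.text[4:]`.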