I’m scraping www.albumoftheyear.org, but my code only produces an empty DataFrame.
I don’t know whether the site is protected by Cloudflare and that is the cause, or whether I’m selecting the wrong tags.
The basic idea is to iterate through the pages, collect the data (title, year, genre) for each album, and build a pandas DataFrame.
Here is the code so far:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}'

title = []
data = []
genre = []

for i in range(1,11):
    soup = BeautifulSoup(requests.get(url.format(i)).content, "html.parser")
    album_lists = soup.find_all(class_='albumListRow')
    for album_list in album_lists:
        album_title = album_list.find('h2',{'class':'albumListTitle'}).find('a').text
        album_data = album_list.find('div', {'class':'albumListDate'}).text
        album_genre = album_list.find('div', {'class': 'albumListGenre'}).find('a').text
        title.append(album_title)
        data.append(album_data)
        genre.append(album_genre)

df = pd.DataFrame(list(zip(title,data,genre)), columns=['title', 'data','genre'])
Answer
Here is a working solution using Selenium. Note that you need the WebDriver for your browser installed on your system. I am using Chrome, and chromedriver can be downloaded from here. Yes, you need both the browser and the driver.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

# Point this at your local chromedriver executable.
driver = webdriver.Chrome(executable_path=r'C:**YOUR PATH**chromedriver.exe')
driver.get(r"https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}")

title_list = []
date_list = []
genre_list = []

try:
    # Wait up to 10 seconds for the main content container to be present.
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "centerContent"))
    )
    albumlistrow = element.find_elements_by_class_name('albumListRow')
    for a in albumlistrow:
        title = a.find_element_by_class_name('albumListTitle')
        date = a.find_element_by_class_name('albumListDate')
        try:
            genre = a.find_element_by_class_name('albumListGenre')
        except NoSuchElementException:
            # No genre element in this row: `genre` keeps its value from the previous row.
            pass
        title_list.append(title.text)
        date_list.append(date.text)
        genre_list.append(genre.text)
finally:
    # Close the browser window even if scraping fails.
    driver.close()

df = pd.DataFrame(list(zip(title_list,date_list,genre_list)), columns=['title', 'data','genre'])
df.head()
Output:
                                               title                data      genre
0                         500. Arcade Fire - Funeral  September 14, 2004  Indie Rock
1                499. Rufus & Chaka Khan - Ask Rufus    January 19, 1977        Soul
2                             498. Suicide - Suicide   December 28, 1977  Synth Punk
3  497. Various Artists - The Indestructible Beat...     January 1, 1985  Synth Punk
4           496. Shakira - Dónde Están los Ladrones?  September 29, 1998    Pop Rock
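In the output above, row 3 shows the same genre as row 2. One possible cause is the `except NoSuchElementException: pass` branch in the loop: when a row has no `albumListGenre` element, the `genre` variable still holds the previous row's element, so its text is appended again. A minimal sketch of one way to avoid that is below. The helper `text_or_default` and the stand-ins `MissingElement`, `FakeElement`, `find_genre` and `find_nothing` are hypothetical names introduced only for illustration; they are not part of Selenium or of the original answer.

```python
def text_or_default(find, not_found, default=""):
    """Return find().text, or `default` if find() raises `not_found`."""
    try:
        return find().text
    except not_found:
        return default


# Hypothetical stand-ins so the sketch runs without a browser.
class MissingElement(Exception):
    """Plays the role of Selenium's NoSuchElementException."""


class FakeElement:
    def __init__(self, text):
        self.text = text


def find_genre():
    return FakeElement("Soul")


def find_nothing():
    raise MissingElement("no albumListGenre in this row")


print(text_or_default(find_genre, MissingElement))          # Soul
print(repr(text_or_default(find_nothing, MissingElement)))  # ''
```

In the loop, this could be used as `genre_list.append(text_or_default(lambda: a.find_element_by_class_name('albumListGenre'), NoSuchElementException))`, so a row without a genre gets an empty string and the three lists stay aligned.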
If you do not want the rank (albumListRank) in the title, change this line from
title_list.append(title.text)
to
title_list.append(title.text[4:])
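The fixed slice `[4:]` drops exactly four characters, so on a title like "500. Arcade Fire - Funeral" it leaves the space after the dot in place, and it would cut into the title itself if a rank has fewer than three digits. As a rough alternative, assuming every ranked title has the form "<number>. <title>", the prefix can be split off at the first ". " instead. The function name `strip_rank` is hypothetical and not part of the original answer.

```python
def strip_rank(title_text):
    """Drop a leading '<digits>. ' rank prefix; return the text unchanged otherwise."""
    rank, sep, rest = title_text.partition(". ")
    if sep and rank.isdigit():
        return rest
    return title_text


print(strip_rank("500. Arcade Fire - Funeral"))  # Arcade Fire - Funeral
print(strip_rank("9. Some Artist - Some Album"))  # Some Artist - Some Album
print(strip_rank("No rank here"))                 # No rank here
```

With this helper, the line would become `title_list.append(strip_rank(title.text))`.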