
Web scraping returns empty values: possibly a protected site

I’m scraping www.albumoftheyear.org, but my code only produces an empty DataFrame.

I don’t know whether the site is protected by something like Cloudflare and that is the cause, or whether I’m selecting the wrong tags.

The basic idea is to iterate through the pages, collect each album’s data (title, year, genre), and build a pandas DataFrame.

Here is the code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}'

title = []
data  = []
genre = []

for i in range(1, 11):
    soup = BeautifulSoup(requests.get(url.format(i)).content, "html.parser")
    album_lists = soup.find_all(class_='albumListRow')
    for album_list in album_lists:
        album_title = album_list.find('h2', {'class': 'albumListTitle'}).find('a').text
        album_data = album_list.find('div', {'class': 'albumListDate'}).text
        album_genre = album_list.find('div', {'class': 'albumListGenre'}).find('a').text
        title.append(album_title)
        data.append(album_data)
        genre.append(album_genre)

df = pd.DataFrame(list(zip(title,data,genre)), columns=['title', 'data','genre'])


Answer

A working solution using Selenium. Note that you need the WebDriver for your browser installed on your system. I am using Chrome, so I need chromedriver, which is downloaded separately from the browser. Yes, you need both the browser and the driver.

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException

# Selenium 4 style: the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(r'C:\**YOUR PATH**\chromedriver.exe'))
# The page number goes at the end of the URL, as in the question; this loads page 1
url = 'https://www.albumoftheyear.org/list/1500-rolling-stones-500-greatest-albums-of-all-time-2020/{}'
driver.get(url.format(1))

title_list = []
date_list  = []
genre_list = []

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "centerContent"))
    )
    albumlistrow = element.find_elements(By.CLASS_NAME, 'albumListRow')
    for a in albumlistrow:
        title = a.find_element(By.CLASS_NAME, 'albumListTitle')
        date = a.find_element(By.CLASS_NAME, 'albumListDate')
        try:
            genre_text = a.find_element(By.CLASS_NAME, 'albumListGenre').text
        except NoSuchElementException:
            # Row has no genre: store an empty string rather than
            # silently reusing the genre of the previous row
            genre_text = ''
        title_list.append(title.text)
        date_list.append(date.text)
        genre_list.append(genre_text)

finally:
    # quit() ends the whole browser session, not just the current window
    driver.quit()

df = pd.DataFrame(list(zip(title_list,date_list,genre_list)), columns=['title', 'data','genre'])
df.head()

Output:

    title                                               data                genre
0   500. Arcade Fire - Funeral                          September 14, 2004  Indie Rock
1   499. Rufus & Chaka Khan - Ask Rufus                 January 19, 1977    Soul
2   498. Suicide - Suicide                              December 28, 1977   Synth Punk
3   497. Various Artists - The Indestructible Beat...   January 1, 1985     Synth Punk
4   496. Shakira - Dónde Están los Ladrones?            September 29, 1998  Pop Rock
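As for the original question of whether the site blocks plain requests or the tags are wrong: one way to tell the two apart is to look at the HTTP status code and at how many elements actually carry the expected class. The sketch below is only an illustration, not something tested against the live site. `diagnose` and `ClassCounter` are hypothetical helper names, and the messages are best guesses, not definitive verdicts.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def diagnose(status_code, html, row_class="albumListRow"):
    """Return a short, best-guess explanation for an empty scrape."""
    if status_code != 200:
        # A non-200 answer (for example 403) suggests the request was refused
        return f"blocked or failed: HTTP {status_code}"
    counter = ClassCounter(row_class)
    counter.feed(html)
    if counter.count == 0:
        # The page loaded, but nothing matched the selector
        return "HTTP 200 but no matching rows: wrong selector or JavaScript/challenge page"
    return f"ok: {counter.count} rows found"
```

With the code from the question, this could be called as `r = requests.get(url.format(1))` followed by `print(diagnose(r.status_code, r.text))`. If the first message comes back, the empty DataFrame is probably not a tag problem, which is the situation where a real browser driven by Selenium helps.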

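If you do not want the rank prefix (the albumListRank) in the title column, change this line in the Selenium loop from

title_list.append(title.text)

to

title_list.append(title.text.split('. ', 1)[-1])

Splitting on the first '. ' removes the rank whatever its length. A fixed slice such as title.text[4:] leaves a leading space after a three-digit rank and cuts into the title once the rank drops below 100.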
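If the rank is still useful, it can be kept as its own column instead of being thrown away. The following is a minimal sketch, assuming every title follows the "rank. Artist - Album" pattern seen in the output above. `parse_title` is a hypothetical helper, and an artist name that itself contains " - " would be split in the wrong place.

```python
import re

# Leading digits, a dot, whitespace, then the rest of the title
RANK_RE = re.compile(r"^\s*(\d+)\.\s+(.*)$")


def parse_title(text):
    """Split '500. Artist - Album' into (rank, artist, album)."""
    match = RANK_RE.match(text)
    if match:
        rank, rest = int(match.group(1)), match.group(2)
    else:
        # No rank prefix found: keep the text as it is
        rank, rest = None, text.strip()
    artist, sep, album = rest.partition(" - ")
    if not sep:
        # No ' - ' separator: treat the whole string as the album name
        artist, album = None, rest
    return rank, artist, album
```

Inside the loop, `rank, artist, album = parse_title(title.text)` could then feed three separate lists, which become three DataFrame columns in the same way as `title_list`.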
User contributions licensed under: CC BY-SA