Skip to content
Advertisement

Why can’t I save my scraped html table to pandas dataframe?

I have a python script that scrapes a html table. When I try to save my scraped data to pandas dataframe, I get an error. Please help me check what am doing wrong?

Here is my codeblock

from selenium import webdriver 
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time
import pandas as pd

def HDI():
    url = 'https://worldpopulationreview.com/country-rankings/hdi-by-country'

    service = Service(executable_path="C:/driver/new/chromedriver_win32/chromedriver.exe")
    driver = webdriver.Chrome(service=service)
    driver.get(url)
    time.sleep(5)

    btn = driver.find_element(By.CLASS_NAME, '_3p_1XEZR')
    btn.click()
    time.sleep(5)

    temp_height=0

    while True:
        #Looping down the scroll bar
        driver.execute_script("window.scrollBy(0,500)")
        #sleep and let the scroll bar react
        time.sleep(5)
        #Get the distance of the current scroll bar from the top
        check_height = driver.execute_script("return document.documentElement.scrollTop || window.window.pageYOffset || document.body.scrollTop;")
        #If the two are equal to the end
        if check_height==temp_height:
           break
        temp_height=check_height
    time.sleep(3)

    row_headers = []
    tableheads = driver.find_elements(By.CLASS_NAME, 'datatable-th')
    for value in tableheads:
        thead_values = value.find_element(By.CLASS_NAME, 'has-tooltip-bottom').text.strip()

        row_headers.append(thead_values)

    tablebodies = driver.find_elements(By.TAG_NAME, 'tr')
    for row in tablebodies:
        tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td')

        row_data = []
        for data in tabledata:
            row_data.append(data.text)

    df = pd.DataFrame(row_data, columns=row_headers)
    df 

HDI()

Here is the error i get

File "c:UsersLPDocumentspythonHD1 2023HDI2023.py", line 49, in HDI
df = pd.DataFrame(row_data, columns=row_headers)
File "C:UsersLPAppDataLocalProgramsPythonPython310libsite-packagespandascoreinternalsconstruction.py", line 351, in ndarray_to_mgr
_check_values_indices_shape_match(values, index, columns)
File "C:UsersLPAppDataLocalProgramsPythonPython310libsite-packagespandascoreinternalsconstruction.py", line 422, in _check_values_indices_shape_match
  raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}")
ValueError: Shape of passed values is (9, 1), indices imply (9, 9)

I want to save the above scraped values into pandas dataframe. That’s my aim. Please help if you can. THANKS

Advertisement

Answer

In your variable row_data you are only saving one row and you are overwriting it in every iteration. You probably want to use all rows in your DataFrame. You can for example create a new variable row_data_all and pass that to your DataFrame

row_data_all = []
for row in tablebodies:
    tabledata = row.find_elements(selenium.webdriver.common.by.By.CSS_SELECTOR, 'tr, td')

    row_data = []
    for data in tabledata:
        row_data.append(data.text)
    row_data_all.append(row_data)

pd.DataFrame(row_data_all, columns = row_headers)

In case you really wanted to create a DataFrame from a single row you should use

pd.DataFrame(row_data, index = row_headers).T

Alternative

You can also use pandas’ read_html() method, which only needs the html source code. You can even pass it the source code of the entire page, and it will return a list of DataFrames of the tables found in the source code. This will also speed up your function a lot.

html_table = driver.find_element(By.TAG_NAME, "table").get_attribute("outerHTML")
df = pd.read_html(html_table)[0]
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement