I have a python script that scrapes a html table. When I try to save my scraped data to pandas dataframe, I get an error. Please help me check what am doing wrong?
Here is my codeblock
from selenium import webdriver from selenium.webdriver.common.by import By from selenium.webdriver.chrome.service import Service import time import pandas as pd def HDI(): url = 'https://worldpopulationreview.com/country-rankings/hdi-by-country' service = Service(executable_path="C:/driver/new/chromedriver_win32/chromedriver.exe") driver = webdriver.Chrome(service=service) driver.get(url) time.sleep(5) btn = driver.find_element(By.CLASS_NAME, '_3p_1XEZR') btn.click() time.sleep(5) temp_height=0 while True: #Looping down the scroll bar driver.execute_script("window.scrollBy(0,500)") #sleep and let the scroll bar react time.sleep(5) #Get the distance of the current scroll bar from the top check_height = driver.execute_script("return document.documentElement.scrollTop || window.window.pageYOffset || document.body.scrollTop;") #If the two are equal to the end if check_height==temp_height: break temp_height=check_height time.sleep(3) row_headers = [] tableheads = driver.find_elements(By.CLASS_NAME, 'datatable-th') for value in tableheads: thead_values = value.find_element(By.CLASS_NAME, 'has-tooltip-bottom').text.strip() row_headers.append(thead_values) tablebodies = driver.find_elements(By.TAG_NAME, 'tr') for row in tablebodies: tabledata = row.find_elements(By.CSS_SELECTOR, 'tr, td') row_data = [] for data in tabledata: row_data.append(data.text) df = pd.DataFrame(row_data, columns=row_headers) df HDI()
Here is the error i get
File "c:UsersLPDocumentspythonHD1 2023HDI2023.py", line 49, in HDI df = pd.DataFrame(row_data, columns=row_headers) File "C:UsersLPAppDataLocalProgramsPythonPython310libsite-packagespandascoreinternalsconstruction.py", line 351, in ndarray_to_mgr _check_values_indices_shape_match(values, index, columns) File "C:UsersLPAppDataLocalProgramsPythonPython310libsite-packagespandascoreinternalsconstruction.py", line 422, in _check_values_indices_shape_match raise ValueError(f"Shape of passed values is {passed}, indices imply {implied}") ValueError: Shape of passed values is (9, 1), indices imply (9, 9)
I want to save the above scraped values into pandas dataframe. That’s my aim. Please help if you can. THANKS
Advertisement
Answer
In your variable row_data
you are only saving one row and you are overwriting it in every iteration. You probably want to use all rows in your DataFrame. You can for example create a new variable row_data_all
and pass that to your DataFrame
row_data_all = [] for row in tablebodies: tabledata = row.find_elements(selenium.webdriver.common.by.By.CSS_SELECTOR, 'tr, td') row_data = [] for data in tabledata: row_data.append(data.text) row_data_all.append(row_data) pd.DataFrame(row_data_all, columns = row_headers)
In case you really wanted to create a DataFrame from a single row you should use
pd.DataFrame(row_data, index = row_headers).T
Alternative
You can also use pandas’ read_html()
method, which only needs the html source code. You can even pass it the source code of the entire page, and it will return a list of DataFrames of the tables found in the source code. This will also speed up your function a lot.
html_table = driver.find_element(By.TAG_NAME, "table").get_attribute("outerHTML") df = pd.read_html(html_table)[0]