To begin with, I am a beginner trying to achieve something that is currently out of my league, so I hope you guys can help me out. Much appreciated.
I am trying to scrape the table from spaclens.com. I already tried the out-of-the-box solution from Google Sheets, but the site is JavaScript-based, which Google Sheets cannot handle. I found some code online and altered it to fit my needs, but now I am stuck.
Python
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Step 1: Create a session and load the page
driver = webdriver.Chrome()
driver.get('https://www.spaclens.com/')

# Wait for the page to fully load
driver.implicitly_wait(5)

# Step 2: Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'lxml')

tables = soup.find_all('table')

# Step 3: Read tables with Pandas read_html()
dfs = pd.read_html(str(tables))

print(f'Total tables: {len(dfs)}')
print(dfs[0])

driver.close()
The code above gives me the following error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-a32c8dbcef38> in <module>
     16
     17 # Step 3: Read tables with Pandas read_html()
---> 18 dfs = pd.read_html(str(tables))
     19
     20 print(f'Total tables: {len(dfs)}')

~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297
    298         return wrapper

~\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1084     )
   1085     validate_header_arg(header)
-> 1086     return _parse(
   1087         flavor=flavor,
   1088         io=io,

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    915             break
    916     else:
--> 917         raise retained
    918
    919     ret = []

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    896
    897     try:
--> 898         tables = p.parse_tables()
    899     except ValueError as caught:
    900         # if `io` is an io-like object, check if it's seekable

~\anaconda3\lib\site-packages\pandas\io\html.py in parse_tables(self)
    215         list of parsed (header, body, footer) tuples from tables.
    216         """
--> 217         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    218         return (self._parse_thead_tbody_tfoot(table) for table in tables)
    219

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse_tables(self, doc, match, attrs)
    545
    546         if not tables:
--> 547             raise ValueError("No tables found")
    548
    549         result = []

ValueError: No tables found
Do I need to alter the argument to find the table? Can anyone shed some light on this?
Thanks!!
Answer
It'd be easier to just grab the data from the source, i.e. the endpoint the page itself calls (you can spot it in the browser's DevTools Network tab). The data comes back in a nice JSON format.
Python
import pandas as pd
import requests

# Endpoint the site's own table widget fetches its data from
url = 'https://www.spaclens.com/company/page'

# pageSize of 9999 pulls everything back in a single request
payload = {
    'pageIndex': '1',
    'pageSize': '9999',
    'query': '{}',
    'sort': '{}'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['data']['items'])
Output: 846 rows, 78 columns
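From there the usual pandas calls apply if you want to sanity-check the result or keep a local copy; the filename below is just a placeholder:

Python
# Confirm the shape matches the output above
print(df.shape)           # expected: (846, 78)
print(df.columns[:10])    # peek at the first few column names

# Save a copy you can open in Excel or import into Google Sheets
df.to_csv('spaclens_companies.csv', index=False)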
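And if you'd rather stick with your original Selenium route: the problem is that implicitly_wait() only applies to element lookups, so driver.page_source can be grabbed before the JavaScript has rendered the table. The usual fix is an explicit wait. A minimal sketch, assuming at least one <table> eventually appears in the DOM:

Python
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.spaclens.com/')

# Block for up to 30 seconds until a <table> element exists in the DOM
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.TAG_NAME, 'table'))
)

# read_html() accepts raw HTML directly, so the BeautifulSoup step isn't needed
dfs = pd.read_html(driver.page_source)
print(f'Total tables: {len(dfs)}')

driver.quit()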