Scraping tables from a JavaScript webpage using Selenium, BeautifulSoup, and Pandas

To begin with, I am a beginner and am trying to achieve something that is currently out of my league, but I hope you can help me out. Much appreciated.

I am trying to scrape the table from a web page. I already tried the out-of-the-box solution from Google Sheets; however, the site is JavaScript-based, which Google Sheets cannot handle. I found some code online and altered it to fit my needs, but I am stuck.

import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Step 1: Create a session and load the page
driver = webdriver.Chrome()

# Wait for the page to fully load

# Step 2: Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'lxml')

tables = soup.find_all('table')

# Step 3: Read tables with Pandas read_html()
dfs = pd.read_html(str(tables))

print(f'Total tables: {len(dfs)}')


The code above gives me the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-34-a32c8dbcef38> in <module>
     17 # Step 3: Read tables with Pandas read_html()
---> 18 dfs = pd.read_html(str(tables))
     20 print(f'Total tables: {len(dfs)}')

in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    298         return wrapper

in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1084         )
   1085     validate_header_arg(header)
-> 1086     return _parse(
   1087         flavor=flavor,
   1088         io=io,

in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    915             break
    916     else:
--> 917         raise retained
    919     ret = []

in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    897         try:
--> 898             tables = p.parse_tables()
    899         except ValueError as caught:
    900             # if `io` is an io-like object, check if it's seekable

in parse_tables(self)
    215         list of parsed (header, body, footer) tuples from tables.
    216         """
--> 217         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    218         return (self._parse_thead_tbody_tfoot(table) for table in tables)
    219

in _parse_tables(self, doc, match, attrs)
    546         if not tables:
--> 547             raise ValueError("No tables found")
    549         result = []

ValueError: No tables found

Do I need to alter the argument to find the table? Can anyone shed some light on this?
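
One thing worth checking before changing any arguments: `pd.read_html` raises "No tables found" when the HTML it is given contains no `<table>` element, which is what happens if `soup.find_all('table')` comes back empty. The sketch below is a minimal, standard-library-only way to count the `<table>` tags in a chunk of HTML such as `driver.page_source`. The names `TableCounter` and `count_tables` are made up for illustration, and the two sample HTML strings are invented. If the count is 0, the page had probably not finished rendering when the source was read, or it does not use `<table>` markup at all.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts opening <table> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names before calling this method
        if tag == "table":
            self.count += 1


def count_tables(html):
    """Return the number of <table> start tags found in an HTML string."""
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Invented sample inputs, standing in for driver.page_source
rendered = "<html><body><table><tr><td>1</td></tr></table></body></html>"
not_rendered = (
    "<html><body><div id='app'></div>"
    "<script>/* builds the grid later */</script></body></html>"
)

print(count_tables(rendered))
print(count_tables(not_rendered))
```

Calling `count_tables(driver.page_source)` right before the Beautiful Soup step would show whether there is anything for `read_html` to parse at that moment.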




It would be easier to just grab the data directly from the source; it comes back in a nice JSON format.

import pandas as pd
import requests

url = ''
payload = {
    'pageIndex': '1',
    'pageSize': '9999',
    'query': '{}',
    'sort': '{}',
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['data']['items'])

Output: 846 rows, 78 columns
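
Asking for 9999 rows in one request works here, but if the endpoint ever caps `pageSize`, the same parameters could be used to walk the pages one at a time. This is only a sketch under assumptions: that `pageIndex` starts at 1 (as in the payload above), that every response has the `['data']['items']` shape used above, and that a page shorter than `pageSize` (or an empty one) marks the end. `fetch_all_items` and `fetch_page` are hypothetical names; `fetch_page` stands in for a function wrapping the `requests.get(...)` call, and the fake version below exists only so the example runs offline.

```python
def fetch_all_items(fetch_page, page_size=100, max_pages=1000):
    """Collect items from a paged JSON API.

    fetch_page(params) must return the decoded JSON for one page,
    assumed to look like {'data': {'items': [...]}}.
    """
    items = []
    for page_index in range(1, max_pages + 1):
        params = {
            'pageIndex': str(page_index),
            'pageSize': str(page_size),
            'query': '{}',
            'sort': '{}',
        }
        page_items = fetch_page(params)['data']['items']
        items.extend(page_items)
        # Assumption: a short (or empty) page means there is nothing left
        if len(page_items) < page_size:
            break
    return items


# Offline stand-in for the real endpoint: 7 invented rows served in pages
FAKE_ROWS = [{'id': n} for n in range(7)]


def fake_fetch_page(params):
    size = int(params['pageSize'])
    start = (int(params['pageIndex']) - 1) * size
    return {'data': {'items': FAKE_ROWS[start:start + size]}}


rows = fetch_all_items(fake_fetch_page, page_size=3)
print(len(rows))
```

With the real endpoint, `fetch_page` could be something like `lambda params: requests.get(url, headers=headers, params=params).json()`, and `pd.DataFrame(rows)` would then build the same kind of frame as above.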

