To begin with, I am a beginner trying to achieve something that is currently out of my league, but I hope you guys can help me out. Much appreciated.
I am trying to scrape the table from spaclens.com. I already tried the out-of-the-box solution in Google Sheets, but the site is JavaScript-based, which Google Sheets cannot handle. I found some code online and altered it to fit my needs, but now I am stuck.
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup

# Step 1: Create a session and load the page
driver = webdriver.Chrome()
driver.get('https://www.spaclens.com/')

# Wait for the page to fully load
driver.implicitly_wait(5)

# Step 2: Parse HTML code and grab tables with Beautiful Soup
soup = BeautifulSoup(driver.page_source, 'lxml')
tables = soup.find_all('table')

# Step 3: Read tables with Pandas read_html()
dfs = pd.read_html(str(tables))

print(f'Total tables: {len(dfs)}')
print(dfs[0])

driver.close()
The code above gives me the following error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-a32c8dbcef38> in <module>
     16
     17 # Step 3: Read tables with Pandas read_html()
---> 18 dfs = pd.read_html(str(tables))
     19
     20 print(f'Total tables: {len(dfs)}')

~\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297
    298         return wrapper

~\anaconda3\lib\site-packages\pandas\io\html.py in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, thousands, encoding, decimal, converters, na_values, keep_default_na, displayed_only)
   1084     )
   1085     validate_header_arg(header)
-> 1086     return _parse(
   1087         flavor=flavor,
   1088         io=io,

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    915             break
    916     else:
--> 917         raise retained
    918
    919     ret = []

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse(flavor, io, match, attrs, encoding, displayed_only, **kwargs)
    896
    897         try:
--> 898             tables = p.parse_tables()
    899         except ValueError as caught:
    900             # if `io` is an io-like object, check if it's seekable

~\anaconda3\lib\site-packages\pandas\io\html.py in parse_tables(self)
    215             list of parsed (header, body, footer) tuples from tables.
    216         """
--> 217         tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
    218         return (self._parse_thead_tbody_tfoot(table) for table in tables)
    219

~\anaconda3\lib\site-packages\pandas\io\html.py in _parse_tables(self, doc, match, attrs)
    545
    546         if not tables:
--> 547             raise ValueError("No tables found")
    548
    549         result = []

ValueError: No tables found
Do I need to alter the argument to find the table? Can anyone shed some light on this?
Thanks!!
Answer
It would be easier to just grab the data from the source endpoint. It comes to you in a nice JSON format.
import pandas as pd
import requests

url = 'https://www.spaclens.com/company/page'
payload = {
    'pageIndex': '1',
    'pageSize': '9999',
    'query': '{}',
    'sort': '{}'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'}

jsonData = requests.get(url, headers=headers, params=payload).json()
df = pd.DataFrame(jsonData['data']['items'])
Output: 846 rows, 78 columns
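To show how that last line turns the response into a table, here is a minimal offline sketch. It assumes the response has the shape `{'data': {'items': [...]}}` as in the code above; the field names (`companyName`, `ticker`, `status`) are made up for illustration and will differ from the real ~78 columns:

```python
import pandas as pd

# Hand-made sample standing in for the live JSON response.
# Assumed shape: {'data': {'items': [...]}} as used in the answer;
# the keys below are hypothetical, not the site's real column names.
jsonData = {
    'data': {
        'items': [
            {'companyName': 'Example Corp', 'ticker': 'EXPL', 'status': 'Searching'},
            {'companyName': 'Sample Inc', 'ticker': 'SMPL', 'status': 'Completed'},
        ]
    }
}

# Each dict in 'items' becomes one row; each key becomes one column.
df = pd.DataFrame(jsonData['data']['items'])
print(df.shape)  # (2, 3) for this sample; (846, 78) for the live call

# Once it's a DataFrame, exporting for Google Sheets is one line:
df.to_csv('spaclens.csv', index=False)
```

The same `df.to_csv(...)` call works on the real response, giving a file you can import into Google Sheets directly.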