Goal: The goal of my project is to use BeautifulSoup (aka bs4) to scrape only the necessary data from an HTML file and export it to Excel. The HTML file is heavily formatted, so unfortunately I haven't been able to tailor the more common solutions to my needs.
What I have tried: I have been able to parse the HTML file to the point where I am only pulling the tables I need, and I can detect every column of data and print it. For example, if there are 18 columns and 3 rows of data, the code outputs 54 values, one piece of table data at a time, going from row 1, column 1 to row 3, column 18.
My code is as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd

url =
output =

#define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True

#read the html
with open(url) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

#table = soup.find_all('table')
all_tables = soup.find_all('table')
#print(len(table))

df = pd.DataFrame(columns=(pks_col_names))
col_list = []
table_list = []

for tnum, tables in enumerate(all_tables):
    if iserror(all_tables[tnum].tbody): #Finds table with data
        table = tables.findAll('tr')
        #Loops through rows of each data table
        for rnum, row in enumerate(table):
            table_row = table[rnum].findAll('td')
            if len(table_row)==17:
                #Loops through columns of each data table
                for col in table_row:
                    col_list.append(col.string)
            else:
                pass
    else:
        pass
Example of the data output currently achieved:
row 1 column 1 (first string in list)
row 1 column 2
row 1 column 3 …
row 3 column 17
row 3 column 18 (last string in list)
The current code creates a single flat list containing the data output above, but I haven't been able to figure out how to convert that list into a pandas DataFrame that ties each value to the appropriate row and column. Could anyone suggest how to do this, or how to otherwise rework my code to get this data into a DataFrame?
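For concreteness, what I think I'm after is something like slicing the flat list into row-sized chunks and handing those chunks to pandas. A minimal sketch of that idea, with made-up sample values and a hypothetical n_cols standing in for my real col_list and pks_col_names:

import pandas as pd

# Hypothetical example values; in my script col_list comes from the
# scraping loop and pks_col_names holds the real column names.
col_list = ['r1c1', 'r1c2', 'r2c1', 'r2c2', 'r3c1', 'r3c2']
pks_col_names = ['first', 'second']
n_cols = len(pks_col_names)

# Slice the flat list into consecutive row-sized chunks.
rows = [col_list[i:i + n_cols] for i in range(0, len(col_list), n_cols)]

df = pd.DataFrame(rows, columns=pks_col_names)
df.to_excel('output.xlsx', index=False)  # the Excel export step; filename is illustrative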
Answer
It's all messed up: your function iserror actually checks whether there is no error (and I don't think it works at all), and what you call table is really a list of rows, so you don't need enumerate at all.

As you haven't provided the data, I only ran rough tests, but this version is a bit cleaner:
row_list = []
for table in all_tables:
    if is_ok(table.tbody): #Finds table with data
        rows = table.findAll('tr')
        #Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols)==17:
                #Loops through columns of each data table
                for col in cols:
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=(pks_col_names))
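Note that is_ok is used above without being defined; a minimal sketch of what it could look like, simply inverting your iserror helper (the name and definition are an assumption on my part):

def is_ok(func, *args, **kw):
    # Inverse of the question's iserror: True when the call succeeds.
    # (Assumed definition; is_ok is referenced above but never defined.)
    try:
        func(*args, **kw)
        return True
    except Exception:
        return False

Since bs4 Tags are callable (tag() is shorthand for tag.find_all()), is_ok(table.tbody) ends up True exactly when the table actually has a tbody, because calling None raises a TypeError. A plainer equivalent check would be table.tbody is not None.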