Goal: The goal of my project is to use BeautifulSoup (aka bs4) to scrape only the necessary data from an HTML file and export it to Excel. The HTML file is heavily formatted, so unfortunately I haven't been able to tailor the more common solutions to my needs.
What I have tried: I have been able to parse the HTML file down to the point where I am only pulling the tables I need, and I can detect every column of data and print it. For example, if there are a total of 18 columns and 3 rows of data, the code outputs 54 values, one per table cell, going from row 1 col 1 to row 3 col 18.
My code is as follows:
from bs4 import BeautifulSoup
import csv
import pandas as pd

url =
output =

# define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True

# read the html
with open(url) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

#table = soup.find_all('table')
all_tables = soup.find_all('table')
#print(len(table))

df = pd.DataFrame(columns=pks_col_names)  # pks_col_names: column names list (defined elsewhere)
col_list = []
table_list = []

for tnum, tables in enumerate(all_tables):
    if iserror(all_tables[tnum].tbody):  # Finds table with data
        table = tables.findAll('tr')
        # Loops through rows of each data table
        for rnum, row in enumerate(table):
            table_row = table[rnum].findAll('td')
            if len(table_row) == 17:
                # Loops through columns of each data table
                for col in table_row:
                    col_list.append(col.string)
            else:
                pass
    else:
        pass
Example of the data output currently achieved:
row 1 column 1 (first string in list)
row 1 column 2
row 1 column 3 …
row 3 column 17
row 3 column 18 (last string in list)
The current code creates a single flat list containing the data shown above, but I haven't been able to figure out how to convert that list into a pandas DataFrame that ties each value to the appropriate row and column. Could anyone suggest how to do this, or how to otherwise rework my code so the data ends up in a DataFrame?
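(For reference, one generic way to regroup a flat, row-major list into a DataFrame is to chunk it by the column count. A minimal sketch, assuming every row contributed the same number of cells and reusing col_list, pks_col_names, and the pandas import from above:)

n_cols = len(pks_col_names)  # one name per column
# slice the flat row-major list into consecutive chunks of n_cols values
rows = [col_list[i:i + n_cols] for i in range(0, len(col_list), n_cols)]
df = pd.DataFrame(rows, columns=pks_col_names)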
Answer
It's all a bit messed up: your iserror function is used as if it checked that there is no error, when it actually returns True on an exception (and I don't think it works at all as written). What you name table is really the list of rows, and you don't need enumerate at all.
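(For context: bs4 resolves a missing child tag to None rather than raising, so the .tbody lookup itself can't signal anything. A quick illustration:)

from bs4 import BeautifulSoup

doc = BeautifulSoup('<table><tr><td>x</td></tr></table>', 'html.parser')
print(doc.table.tbody)  # -> None: a missing child tag is None, not an exception

What iserror ends up receiving is therefore either None or a Tag, not a function in the usual sense, which is why the check is unreliable.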
As you haven't provided the data, I could only run rough tests, but this version is a bit cleaner:
row_list = []
for table in all_tables:
    if is_ok(table.tbody):  # Finds table with data (is_ok: see note below)
        rows = table.findAll('tr')
        # Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols) == 17:
                # Loops through columns of each data table
                for col in cols:
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=pks_col_names)
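One thing the snippet leaves out is is_ok itself. A minimal sketch of what it might look like, assuming the intent is simply to keep tables that actually contain a tbody (the definition below is illustrative, not from the original answer):

def is_ok(tbody):
    # a table counts as having data if bs4 found a <tbody> child at all
    return tbody is not None

With that in place, each qualifying row lands in row_list as its own 17-element list, so pandas can line the values up against pks_col_names directly. (Note the question mentions 18 columns in places but the code filters on len(cols) == 17; make sure the length of pks_col_names matches whichever is correct.)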