Skip to content
Advertisement

Using a for loop with beautiful soup and if statements to populate a dataframe

Goal: The goal of my project is to use BeautifulSoup aka bs4 to scrape only necessary data from an HTML file and import it into excel. The html file is heavily formatted so unfortunately I haven’t been able to tailor more common solutions to my needs.

What I have tried: I have been able to parse the HTML file to the point where I am only pulling the tables I need, and I am able to detect every column of data and print it. In example, if there are a total of 18 columns and 3 rows of data, the code will output 54 times with each piece of table data going from row 1 col 1 to row 3 col 18.

My code is as follows:

from bs4 import BeautifulSoup
import csv
import pandas as pd


url =
output = 

#define table error to detect only tables with extractable data
def iserror(func, *args, **kw):
    try:
        func(*args, **kw)
        return False
    except Exception:
        return True

#read the html
with open(url) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')

#table = soup.find_all('table')
all_tables = soup.find_all('table')
#print(len(table))

df = pd.DataFrame( columns=(pks_col_names))
col_list = []
table_list = []

for tnum, tables in enumerate(all_tables):
    if iserror(all_tables[tnum].tbody): #Finds table with data
        table = tables.findAll('tr')
        #Loops through rows of each data table
        for rnum, row in enumerate(table):
            table_row = table[rnum].findAll('td')

            if len(table_row)==17: 
        #Loops through columns of each data table
                for col in table_row: 
                        col_list.append(col.string)
          
            else:
                pass
            

        
    else:
        pass

Example of data output currently achieved

row 1 column 1 (first string in list)
row 1 column 2
row 1 column 3 …
row 3 column 17
row 3 column 18 (last string in list)

The current code creates a single list with the data outputted above, though I am unable to figure out a way to convert that list into a pandas dataframe tying each list output to the appropriate row/column. Could anyone provide ideas on how to do this or how to otherwise rework my code to import this data into a dataframe?

Advertisement

Answer

it’s all messed up: your function iserror does in fact check if there’s no error (and i don’t think it works at all). what you call tables are rows and you don’t need to enumerate

as you haven’t provided the data, i made only rough tests. but it’s a bit cleaner

row_list = []
for table in all_tables:
    if is_ok(table.tbody): #Finds table with data
        rows = table.findAll('tr')
        #Loops through rows of each data table
        for row in rows:
            cols = row.findAll('td')
            col_list = []
            if len(cols)==17: 
                #Loops through columns of each data table
                for col in cols: 
                    col_list.append(col.string)
                row_list.append(col_list)

df = pd.DataFrame(row_list, columns=(pks_col_names))
User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement