Skip to content
Advertisement

Properly formatting the table after scraped by BeautifulSoup

I am new to Python.

I have been trying to scrape a table from http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm. The targeted table is titled as “Utilization by Body System”.

I was able to capture the table by using BeautifulSoup; however, the scraped dataframe has been driving me crazy and I could not find a way to address the issue.

My code:

import re
import bs4 as bs4
import urllib.request
source=urllib.request.urlopen('http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm').read()
soup=bs4.BeautifulSoup(source,'lxml')
#find the county utilization table by MDC 
#using the parental tag scrapling method, find the exact table index then save the parental table
table_mdc=soup.find(text=re.compile("Utilization by Body System")).findParent('table')
# print (table_mdc)
# #constuct the table
for row in table_mdc.find_all('tr'):
    for cell in row.find_all('td'):
        print(cell.text)
with open ('utilization.txt','w') as r:
    for row in table_mdc.find_all('tr'):
        for cell in row.find_all('td'):
            r.write(cell.text)

For instance, the scraped the dataframe is printed as:

Utilization by Body System 
MDC Description
Total Cases
Number
Percent
Total Charges
% of Charges
Avg. Charge
Total Days
% of Total Days
Avg. LOS

Total

 
2,594
 
 
100.0%
 
 
$101,757,824
 
 
100.0%
 
 
$39,228
 
 
11,972
 
 
100.0%
 
 
4.6

There are so many newlines in its output as well as the txt file. The ideal txt file should be look like this:

(with no “total cases” in the header)

What should I do to overcome these issues?

Advertisement

Answer

import pandas as pd


df = pd.read_html(
    "http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm", attrs={"id": "dgBodySystem"}, header=0)[0]

print(df)
df.to_csv("data.csv", index=False)

Output:

enter image description here

Advertisement