I am new to Python.
I have been trying to scrape a table from http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm. The targeted table is titled as “Utilization by Body System”.
I was able to capture the table by using BeautifulSoup; however, the scraped dataframe has been driving me crazy and I could not find a way to address the issue.
My code:
import re import bs4 as bs4 import urllib.request source=urllib.request.urlopen('http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm').read() soup=bs4.BeautifulSoup(source,'lxml') #find the county utilization table by MDC #using the parental tag scrapling method, find the exact table index then save the parental table table_mdc=soup.find(text=re.compile("Utilization by Body System")).findParent('table') # print (table_mdc) # #constuct the table for row in table_mdc.find_all('tr'): for cell in row.find_all('td'): print(cell.text) with open ('utilization.txt','w') as r: for row in table_mdc.find_all('tr'): for cell in row.find_all('td'): r.write(cell.text)
For instance, the scraped the dataframe is printed as:
Utilization by Body System MDC Description Total Cases Number Percent Total Charges % of Charges Avg. Charge Total Days % of Total Days Avg. LOS Total 2,594 100.0% $101,757,824 100.0% $39,228 11,972 100.0% 4.6
There are so many newlines in its output as well as the txt file. The ideal txt file should be look like this:
(with no “total cases” in the header)
What should I do to overcome these issues?
Advertisement
Answer
import pandas as pd df = pd.read_html( "http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm", attrs={"id": "dgBodySystem"}, header=0)[0] print(df) df.to_csv("data.csv", index=False)
Output: