I am new to Python.
I have been trying to scrape a table from http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm. The target table is titled “Utilization by Body System”.
I was able to capture the table using BeautifulSoup; however, the scraped output is badly formatted and I have not found a way to fix it.
My code:
    import re
    import bs4
    import urllib.request

    source = urllib.request.urlopen('http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm').read()
    soup = bs4.BeautifulSoup(source, 'lxml')

    # find the county utilization table by MDC
    # using the parent-tag scraping method, find the exact table, then save the parent table
    table_mdc = soup.find(text=re.compile("Utilization by Body System")).findParent('table')
    # print(table_mdc)

    # construct the table
    for row in table_mdc.find_all('tr'):
        for cell in row.find_all('td'):
            print(cell.text)

    with open('utilization.txt', 'w') as r:
        for row in table_mdc.find_all('tr'):
            for cell in row.find_all('td'):
                r.write(cell.text)
For instance, the scraped table is printed as:
    Utilization by Body System
    MDC Description
    Total Cases
    Number
    Percent
    Total Charges
    % of Charges
    Avg. Charge
    Total Days
    % of Total Days
    Avg. LOS

    Total


    2,594


    100.0%


    $101,757,824


    100.0%


    $39,228


    11,972


    100.0%


    4.6

There are many stray newlines in the printed output as well as in the txt file. The ideal txt file should look like this:
(with no “Total Cases” in the header)
What should I do to overcome these issues?
Answer
    import pandas as pd

    df = pd.read_html(
        "http://www.phc4.org/reports/utilization/inpatient/CountyReport20192C001.htm",
        attrs={"id": "dgBodySystem"},
        header=0,
    )[0]

    print(df)
    df.to_csv("data.csv", index=False)
Output:
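
If you would rather keep the BeautifulSoup loop from the question, the stray newlines can probably be avoided by cleaning each cell's text and writing one line per table row instead of one line per cell. The following is only a minimal sketch under assumptions: `clean_row` is a hypothetical helper, and `raw_rows` is made-up sample data standing in for the `cell.text` values of each `tr`, whose exact whitespace on the real page was not checked.

```python
import csv
import io


def clean_row(cell_texts, drop=()):
    """Collapse whitespace in each cell and discard empty or unwanted cells."""
    cleaned = (" ".join(text.split()) for text in cell_texts)
    return [text for text in cleaned if text and text not in drop]


# Hypothetical sample standing in for the cell.text values of each table row.
raw_rows = [
    ["MDC Description", "Total Cases", "Number", "Percent"],
    ["\nTotal\n", "\n\n2,594\n", "\n\n100.0%\n"],
    ["\n", "  "],  # a row holding only whitespace is skipped below
]

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open('utilization.txt', 'w', newline='') instead.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
for raw in raw_rows:
    row = clean_row(raw, drop={"Total Cases"})  # drop the unwanted header label
    if row:
        writer.writerow(row)

print(buffer.getvalue())
```

With the question's code, each row would be built as `clean_row(cell.text for cell in row.find_all('td'))`, so every `tr` becomes a single tab-separated line in the txt file.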