I want to convert a table on a website to a pandas DataFrame, but BeautifulSoup
doesn't recognize the table (screenshot below). Below is the code I tried, with no luck.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.ndbc.noaa.gov/ship_obs.php'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', rules = 'all')
#tables =soup.find_all("table",{"style":"color:#333399;"}) #instead of above line to specify table with no luck!
df = pd.read_html(table, skiprows=2, flavor='bs4')
df.head()
I also tried the code below, with no luck:
df = pd.read_html('https://www.ndbc.noaa.gov/ship_obs.php')
print(df)
Answer
Your data is not in a <table> tag but in multiple <span> tags inside a <pre> block, which is why neither BeautifulSoup's table search nor pd.read_html finds it.
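A quick way to confirm this kind of layout is to count the elements yourself. The sketch below uses a small, made-up HTML string (`SAMPLE_HTML` is a hypothetical stand-in, not the real NOAA markup) so that it runs without network access:

```python
import bs4

# Hypothetical, simplified stand-in for the page markup (not the real NOAA HTML).
SAMPLE_HTML = """
<html><body>
<pre>
<span>SHIP HOUR LAT LON</span>
<span>SHIP 19 46.5 -72.3</span>
</pre>
</body></html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# No <table> elements, so table-based parsers have nothing to find.
table_count = len(soup.find_all("table"))

# The rows live in <span> elements inside a <pre> block instead.
pre = soup.find("pre")
span_count = len(pre.find_all("span")) if pre is not None else 0

print(table_count, span_count)
```

Running the same two counts against the real response text should show whether the page you are scraping follows this pattern.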
You can parse these to a dataframe like so:
import pandas as pd
import requests
import bs4

url = "https://www.ndbc.noaa.gov/ship_obs.php"
# The rows are <span> elements inside the page's <pre> block
spans = bs4.BeautifulSoup(requests.get(url).text, 'html.parser').find('pre').find_all("span")
# Split each row on whitespace; shorter rows are padded with None
print(pd.DataFrame([r.getText().split() for r in spans]))
Output:
0 1 2 3 4 5 40 41 42 43 44 45
0 SHIP HOUR LAT LON WDIR WSPD °T ft sec °T Acc Ice
1 SHIP 19 46.5 -72.3 260 5.1 None None None None None None
2 SHIP 19 46.8 -71.2 110 2.9 None None None None None None
3 SHIP 19 47.4 -61.8 40 18.1 None None None None None None
4 SHIP 19 47.7 -53.2 40 8.0 None None None None None None
..
170 SHIP 19 17.6 -62.4 100 20.0 None None None None None None
171 SHIP 19 25.8 -78.0 40 24.1 None None None None None None
172 SHIP 19 1.5 104.8 20 22.0 None None None None None None
173 SHIP 19 57.9 1.2 180 - None None None None None None
174 SHIP 19 35.1 -10.0 310 24.1 None None None None None None

[175 rows x 46 columns]
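In the output above, the header ends up as row 0 and every value is a string. If you want named columns and numeric types, one possible follow-up is sketched below. It assumes a simplified layout where the first <span> holds exactly one name per column; the real page's header looks messier, so you would likely need to adjust it. `SAMPLE_HTML`, `spans_to_frame`, and `numeric_cols` are hypothetical names used only for illustration.

```python
import bs4
import pandas as pd

# Hypothetical sample mimicking the <pre>/<span> layout; not the real page.
SAMPLE_HTML = """
<pre>
<span>SHIP HOUR LAT LON WDIR WSPD</span>
<span>SHIP 19 46.5 -72.3 260 5.1</span>
<span>SHIP 19 57.9 1.2 180 -</span>
</pre>
"""


def spans_to_frame(html: str) -> pd.DataFrame:
    """Build a DataFrame from <span> rows, using the first row as the header."""
    pre = bs4.BeautifulSoup(html, "html.parser").find("pre")
    if pre is None:
        raise ValueError("no <pre> element found")

    # One list of whitespace-separated tokens per non-empty <span>.
    rows = [span.get_text().split() for span in pre.find_all("span")]
    rows = [r for r in rows if r]
    header, body = rows[0], rows[1:]

    # Pad short rows with None so every row matches the header length.
    # Assumption: no data row has more tokens than the header.
    body = [r + [None] * (len(header) - len(r)) for r in body]
    df = pd.DataFrame(body, columns=header)

    # Convert the measurement columns; placeholders such as "-" become NaN.
    numeric_cols = ["HOUR", "LAT", "LON", "WDIR", "WSPD"]
    df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
    return df


df = spans_to_frame(SAMPLE_HTML)
print(df)
```

Using `errors="coerce"` keeps the conversion from failing on missing-value markers like the "-" in row 173 above, at the cost of silently turning any unparseable token into NaN.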