I want to convert a table on a website to a pandas DataFrame, but BeautifulSoup
doesn't recognize the table. Below is the code I tried, with no luck.
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = 'https://www.ndbc.noaa.gov/ship_obs.php'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
tables = soup.find_all('table', rules = 'all')
#tables =soup.find_all("table",{"style":"color:#333399;"}) #instead of above line to specify table with no luck!
df = pd.read_html(table, skiprows=2, flavor='bs4')
df.head()
I also tried the code below, with no luck:
df = pd.read_html('https://www.ndbc.noaa.gov/ship_obs.php')
print(df)
Answer
Your table is not in a <table> tag but in multiple <span> tags inside a <pre> block, which is why pd.read_html finds nothing to parse.
You can parse these to a dataframe like so:
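import pandas as pd
import requests
import bs4

url = "https://www.ndbc.noaa.gov/ship_obs.php"
# Each row of the listing is a <span> inside the page's <pre> block.
soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser').find('pre').find_all("span")
# Split each row on whitespace to get one list of fields per row.
print(pd.DataFrame([r.getText().split() for r in soup]))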
Output:
        0     1     2      3     4     5  ...    40    41    42    43    44    45
0    SHIP  HOUR   LAT    LON  WDIR  WSPD  ...    °T    ft   sec    °T   Acc   Ice
1    SHIP    19  46.5  -72.3   260   5.1  ...  None  None  None  None  None  None
2    SHIP    19  46.8  -71.2   110   2.9  ...  None  None  None  None  None  None
3    SHIP    19  47.4  -61.8    40  18.1  ...  None  None  None  None  None  None
4    SHIP    19  47.7  -53.2    40   8.0  ...  None  None  None  None  None  None
..    ...   ...   ...    ...   ...   ...  ...   ...   ...   ...   ...   ...   ...
170  SHIP    19  17.6  -62.4   100  20.0  ...  None  None  None  None  None  None
171  SHIP    19  25.8  -78.0    40  24.1  ...  None  None  None  None  None  None
172  SHIP    19   1.5  104.8    20  22.0  ...  None  None  None  None  None  None
173  SHIP    19  57.9    1.2   180     -  ...  None  None  None  None  None  None
174  SHIP    19  35.1  -10.0   310  24.1  ...  None  None  None  None  None  None

[175 rows x 46 columns]
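In the output above, every value is still a string, the column names sit in row 0, and missing readings show up as "-". If you want named, numeric columns, something like the sketch below may help. It runs offline on a small made-up sample (SAMPLE_HTML is a hypothetical stand-in, not the real page markup), and spans_to_frame is a helper name invented for this example. It assumes the first span holds the column names, that any extra header tokens (such as the units seen in columns 40 to 45) come after the names, and that the first column is the text label. Check those assumptions against the live page before relying on it.

```python
import bs4
import pandas as pd

# Hypothetical stand-in for the page: one <span> per row inside a <pre> block.
SAMPLE_HTML = """
<pre>
<span>SHIP  HOUR  LAT   LON    WDIR  WSPD</span>
<span>SHIP  19    46.5  -72.3  260   5.1</span>
<span>SHIP  19    57.9  1.2    180   -</span>
</pre>
"""


def spans_to_frame(html: str) -> pd.DataFrame:
    """Parse <span> rows inside <pre> into a DataFrame with numeric columns."""
    spans = bs4.BeautifulSoup(html, "html.parser").find("pre").find_all("span")
    rows = [s.get_text().split() for s in spans]
    header, data = rows[0], rows[1:]

    df = pd.DataFrame(data)
    # Keep the first column as text; coerce the rest to numbers.
    # Anything that is not a number (e.g. "-") becomes NaN.
    numeric = df.iloc[:, 1:].apply(pd.to_numeric, errors="coerce")
    df = pd.concat([df.iloc[:, :1], numeric], axis=1)

    # Assumption: the leading header tokens are the column names; any
    # surplus tokens (e.g. a units line) are dropped.
    df.columns = header[: df.shape[1]]
    return df


df = spans_to_frame(SAMPLE_HTML)
print(df)
print(df.dtypes)
```

To try it on the real page, pass requests.get(url).text to spans_to_frame instead of SAMPLE_HTML.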