I’m trying to extract a table from a webpage and have tried a number of approaches, but the table always comes back empty.
Two of what I thought were the most promising attempts are attached below. Any way of extracting the data from the webpage would be helpful. I have also included a screenshot of the table I want to extract.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.set_window_size(1120, 550)

# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
browser.get(url)

element = WebDriverWait(browser, 3).until(
    EC.presence_of_element_located((By.ID, "tbl-datatable"))
)
data = element.get_attribute('tbl-datatable')
print(data)

browser.quit()
```
or alternatively,
```python
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'

# Create object page
page = requests.get(url)

# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')

# Obtain information from tag <table>
table1 = soup.find("table", id='tbl-datatable')

# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

# Create a dataframe
mydata = pd.DataFrame(columns=headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
```
Answer
Best practice, and a good first shot when scraping table data, is `pandas.read_html()`. It works in most cases, needs adjustments in some, and only fails in specific ones. The issue here is that a user-agent header is needed with `requests` to avoid a 403 response, so we have to help `pandas` out with that:
```python
pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0]
```
Now the table can be scraped, but it has to be transformed a bit, because that is what the browser would do while rendering: `.dropna(axis=1)` drops columns with NaN values, and `[:-1]` slices off the last row, which contains non-relevant information:
```python
pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0].dropna(axis=1)[:-1]
```
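To see what these two transformations do in isolation, here is a minimal, self-contained sketch on a made-up inline HTML table (no network involved); the column names and cell values are purely illustrative, not the real flightradar24 markup:

```python
from io import StringIO

import pandas as pd

# Hypothetical table mimicking the structure: one column of empty
# cells (parsed as NaN) and a footer row with non-relevant text.
html = """
<table>
  <tr><th>DATE</th><th>FLIGHT</th><th>EMPTY</th></tr>
  <tr><td>10 Dec 2022</td><td>JL3798</td><td></td></tr>
  <tr><td>10 Dec 2022</td><td>JL3843</td><td></td></tr>
  <tr><td>footer text</td><td>footer text</td><td></td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]
print(df.shape)             # (3, 3) - NaN column and footer still there

# .dropna(axis=1) drops the column containing NaN values,
# [:-1] slices off the footer row.
clean = df.dropna(axis=1)[:-1]
print(clean.shape)          # (2, 2)
print(list(clean.columns))  # ['DATE', 'FLIGHT']
```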
You could also use `selenium`, give it some `time.sleep(3)` while the browser renders the table into its final form, and then process `driver.page_source`, but in my opinion that is a bit too much in this case.
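For completeness, a minimal sketch of that `selenium` route might look like the following. It is untested here and assumes a local Chrome driver is available and that the table keeps its `tbl-datatable` id:

```python
import time

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get('https://www.flightradar24.com/data/aircraft/ja11jc')
    # Give the page's JavaScript time to fill in the table body.
    time.sleep(3)
    # read_html parses every <table> in the rendered source;
    # attrs narrows it down to the one we want.
    df = pd.read_html(driver.page_source,
                      attrs={'id': 'tbl-datatable'})[0]
finally:
    driver.quit()

print(df.head())
```

Note that the cleanup steps (`.dropna(axis=1)[:-1]`) may not be needed here, because the browser has already rendered the table.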
Example
```python
import pandas as pd
import requests

df = pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0].dropna(axis=1)[:-1]

df.columns = ['DATE', 'FROM', 'TO', 'FLIGHT', 'FLIGHT TIME', 'STD', 'ATD',
              'STA', 'STATUS']
df
```
Output
| | DATE | FROM | TO | FLIGHT | FLIGHT TIME | STD | ATD | STA | STATUS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 Dec 2022 | Tokunoshima (TKN) | Kagoshima (KOJ) | JL3798 | — | 10:00 | — | 11:10 | Scheduled |
| 1 | 10 Dec 2022 | Amami (ASJ) | Tokunoshima (TKN) | JL3843 | — | 08:55 | — | 09:30 | Scheduled |
| … | … | … | … | … | … | … | … | … | … |
| 58 | 03 Dec 2022 | Amami (ASJ) | Kagoshima (KOJ) | JL3724 | 0:56 | 01:45 | 02:02 | 02:50 | Landed 02:58 |
| 59 | 03 Dec 2022 | Kagoshima (KOJ) | Amami (ASJ) | JL3725 | 1:06 | 00:00 | 00:09 | 01:15 | Landed 01:14 |