
How to scrape table with flight data, avoiding an empty result?

I’m trying to extract a table from a webpage and have tried a number of alternatives, but the table always seems to remain empty.

Two of what I thought were the most promising sets of code are attached below. Any means of extracting the data from the webpage would be considered as helpful. I have also included a screenshot of the table I want to extract.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
browser = webdriver.Chrome()
browser.set_window_size(1120, 550)
# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
browser.get(url)
element = WebDriverWait(browser, 3).until(
   EC.presence_of_element_located((By.ID, "tbl-datatable"))
)
data = element.get_attribute('tbl-datatable')
print(data)
browser.quit()

or alternatively,

# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# Create an URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
# Create object page
page = requests.get(url)
 
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
soup
 
# Obtain information from tag <table>
table1 = soup.find("table", id='tbl-datatable')
table1
 
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
 title = i.text
 headers.append(title)
 
 
 # Create a dataframe
mydata = pd.DataFrame(columns = headers)
 
# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
 row_data = j.find_all('td')
 row = [i.text for i in row_data]
 length = len(mydata)
 mydata.loc[length] = row


Answer

A good first attempt when scraping table data is pandas.read_html(): it works in most cases, needs adjustments in some, and fails only in specific ones.

The issue here is that requests needs a User-Agent header to avoid a 403 response, so we fetch the HTML ourselves and hand it to pandas:

pd.read_html(
        requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
        headers={'User-Agent': 'some user agent string'}).text
     )[0]
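If the header is missing or rejected, it helps to see the 403 right away rather than end up with an empty table. Here is a minimal sketch of that idea; the helper names make_session and fetch_html are made up for illustration, and the User-Agent value is a placeholder you would replace with a real browser string:

```python
import requests


def make_session(user_agent):
    """Return a session that sends the given User-Agent with every request."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Fetch a page and raise requests.HTTPError on 4xx/5xx (for example 403)."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


# Placeholder value (assumption): use a real browser User-Agent string here.
session = make_session("some user agent string")

# Example call (not executed here, it needs network access):
# html = fetch_html(session, "http://www.flightradar24.com/data/aircraft/ja11jc")
```

Using a session keeps the header in one place if you want to fetch several aircraft pages.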

Now the table can be scraped, but it has to be transformed a bit, because that is what the browser does while rendering: .dropna(axis=1) drops the columns that contain NaN values and [:-1] slices off the last row, which contains irrelevant information:

pd.read_html(
        requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
        headers={'User-Agent': 'some user agent string'}).text
     )[0].dropna(axis=1)[:-1]
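To see what the two cleanup steps do in isolation, here is a small sketch on a made-up DataFrame. The column names and values are invented stand-ins, not the real structure of the scraped table:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the raw result of pd.read_html(...)[0]:
# one column that holds only NaN and a trailing row that is not flight data.
raw = pd.DataFrame(
    {
        "DATE": ["10 Dec 2022", "03 Dec 2022", "footer text"],
        "SPACER": [np.nan, np.nan, np.nan],
        "FLIGHT": ["JL3798", "JL3724", "footer text"],
    }
)

# dropna(axis=1) removes every column that contains at least one NaN;
# [:-1] keeps all rows except the last one.
cleaned = raw.dropna(axis=1)[:-1]

print(cleaned)
```

Note that dropna(axis=1) uses how='any' by default, so a column with a single NaN is dropped as well. If that removes too much on another page, dropna(axis=1, how='all') only drops columns that are completely empty.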

You could also use selenium, give it some time with time.sleep(3) while the browser renders the table in its final form, and then process driver.page_source, but in my opinion that is a bit too much in this case.
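For completeness, the selenium route could look roughly like this. It is only a sketch: the function names fetch_rendered_html and first_table are made up, it assumes Chrome with a matching driver is available, and the 3 second pause is the guess from above rather than a tested value:

```python
import time
from io import StringIO

import pandas as pd


def fetch_rendered_html(url, wait_seconds=3):
    """Open the page in Chrome, pause while scripts run, return the final HTML."""
    # Imported here so first_table() below also works without selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude pause while the browser renders the table
        return driver.page_source
    finally:
        driver.quit()


def first_table(html):
    """Parse the first <table> of an HTML string into a DataFrame."""
    # pd.read_html needs an HTML parser such as lxml to be installed.
    return pd.read_html(StringIO(html))[0]


# Example call (not executed here, it opens a browser window):
# html = fetch_rendered_html("https://www.flightradar24.com/data/aircraft/ja11jc")
# df = first_table(html)
```

Since the browser has already rendered the table here, the dropna and slicing steps may or may not still be needed; check the resulting DataFrame before applying them.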

Example

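import pandas as pd
import requests
from io import StringIO

# Fetch the page with a User-Agent header; without one the site answers with 403
html = requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
        headers={'User-Agent': 'some user agent string'}).text

# Newer pandas versions expect a file-like object instead of a raw HTML string
df = pd.read_html(StringIO(html))[0].dropna(axis=1)[:-1]

df.columns = ['DATE', 'FROM', 'TO', 'FLIGHT', 'FLIGHT TIME', 'STD', 'ATD', 'STA', 'STATUS']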
df

Output

    DATE         FROM               TO                 FLIGHT  FLIGHT TIME  STD    ATD    STA    STATUS
0   10 Dec 2022  Tokunoshima (TKN)  Kagoshima (KOJ)    JL3798               10:00         11:10  Scheduled
1   10 Dec 2022  Amami (ASJ)        Tokunoshima (TKN)  JL3843               08:55         09:30  Scheduled
58  03 Dec 2022  Amami (ASJ)        Kagoshima (KOJ)    JL3724  0:56         01:45  02:02  02:50  Landed 02:58
59  03 Dec 2022  Kagoshima (KOJ)    Amami (ASJ)        JL3725  1:06         00:00  00:09  01:15  Landed 01:14
User contributions licensed under: CC BY-SA