I’m trying to extract a table from a webpage and have tried a number of alternatives, but the table always comes back empty.
The two attempts I thought most promising are attached below; any way of extracting the data from the page would be helpful. I have also included a screenshot of the table I want to extract.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
 
browser = webdriver.Chrome()
browser.set_window_size(1120, 550)
# Create a URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
browser.get(url)
element = WebDriverWait(browser, 3).until(
   EC.presence_of_element_located((By.ID, "tbl-datatable"))
)
data = element.get_attribute('tbl-datatable')
print(data)
browser.quit()
or alternatively,
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
 
# Create a URL object
url = 'https://www.flightradar24.com/data/aircraft/ja11jc'
# Create object page
page = requests.get(url)
 
# parser-lxml = Change html to Python friendly format
# Obtain page's information
soup = BeautifulSoup(page.text, 'lxml')
soup
 
# Obtain information from tag <table>
table1 = soup.find("table", id='tbl-datatable')
table1
 
# Obtain every title of columns with tag <th>
headers = []
for i in table1.find_all('th'):
    title = i.text
    headers.append(title)

# Create a dataframe
mydata = pd.DataFrame(columns=headers)

# Create a for loop to fill mydata
for j in table1.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [i.text for i in row_data]
    length = len(mydata)
    mydata.loc[length] = row
Answer
Best practice, and the first thing to try when scraping table data, is pandas.read_html(); it works in most cases, needs adjustments in some, and fails only in specific ones.
The issue here is that requests needs a user-agent header to avoid the 403 response, so we have to help pandas with that:
pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0]
Now the table can be scraped, but it has to be transformed a bit, because that is what the browser would do while rendering: .dropna(axis=1) drops columns with NaN values, and [:-1] slices off the last row, which contains non-relevant information:
pd.read_html(
    requests.get('http://www.flightradar24.com/data/aircraft/ja11jc',
                 headers={'User-Agent': 'some user agent string'}).text
)[0].dropna(axis=1)[:-1]
You could also use Selenium, give it a time.sleep(3) while the browser renders the table into its final form, and process driver.page_source, but in my opinion that is a bit too much in this case.
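If you do want to go that route, a minimal sketch could look like the following. The parsing step is split into its own function so it can be tested without a browser; the sleep duration is an assumption, and the cleanup mirrors the requests-based example above.

```python
# Sketch of the Selenium alternative: render the page, then hand the
# resulting HTML to pandas. URL and sleep duration are assumptions.
import time
from io import StringIO

import pandas as pd


def parse_flight_table(html: str) -> pd.DataFrame:
    # read_html returns a list of DataFrames; the flight table is the first.
    df = pd.read_html(StringIO(html))[0]
    # Drop columns with NaN values and slice off the trailing non-data row,
    # as in the requests-based example.
    return df.dropna(axis=1)[:-1]


if __name__ == "__main__":
    from selenium import webdriver

    browser = webdriver.Chrome()
    try:
        browser.get("https://www.flightradar24.com/data/aircraft/ja11jc")
        time.sleep(3)  # crude wait for the table to finish rendering
        print(parse_flight_table(browser.page_source))
    finally:
        browser.quit()
```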
Example
import pandas as pd
import requests
df = pd.read_html(
        requests.get('http://www.flightradar24.com/data/aircraft/ja11jc', 
        headers={'User-Agent': 'some user agent string'}).text
     )[0].dropna(axis=1)[:-1]
df.columns = ['DATE','FROM', 'TO', 'FLIGHT', 'FLIGHT TIME', 'STD', 'ATD', 'STA','STATUS']
df
Output
|  | DATE | FROM | TO | FLIGHT | FLIGHT TIME | STD | ATD | STA | STATUS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 Dec 2022 | Tokunoshima (TKN) | Kagoshima (KOJ) | JL3798 | — | 10:00 | — | 11:10 | Scheduled | 
| 1 | 10 Dec 2022 | Amami (ASJ) | Tokunoshima (TKN) | JL3843 | — | 08:55 | — | 09:30 | Scheduled | 
| … | … | … | … | … | … | … | … | … | … | 
| 58 | 03 Dec 2022 | Amami (ASJ) | Kagoshima (KOJ) | JL3724 | 0:56 | 01:45 | 02:02 | 02:50 | Landed 02:58 | 
| 59 | 03 Dec 2022 | Kagoshima (KOJ) | Amami (ASJ) | JL3725 | 1:06 | 00:00 | 00:09 | 01:15 | Landed 01:14 | 
