How to clean data so that the correct arrival code is there for the city pair?
From the picture, the CSV is like column 1: City Pair (Departure – Arrival), column 2 is meant to be the Departure Code, and column 3 is meant to be the Arrival Code.
As you can see for row 319 in the first column, it is ‘London – Paris’, column 2 is CDG, and column 3 is LHR when it should be the opposite.
Also in row 324, Dusseldorf – Paris is CDG in column 2 and DUS in column 3, when it should be the opposite.
The majority of the CSV is correct, but there are a lot of errors like this.
Could someone help on how to sort this for all the rows? Maybe indexing through all of the rows in column 1 and making sure the text before and after the ‘-‘ matches the codes in columns 2 and 3?
old_index airports arr dep
0 319 London, United Kingdom - Paris, France CDG LHR
1 320 London, United Kingdom - Paris, France CDG LHR
2 321 London, United Kingdom - Paris, France CDG LHR
3 322 London, United Kingdom - Paris, France CDG LHR
4 323 London, United Kingdom - Paris, France CDG LHR
5 324 Dusseldorf, Germany - Paris, France CDG DUS
6 325 Amsterdam, Netherlands - Paris, France CDG AMS
7 326 Amsterdam, Netherlands - Paris, France CDG AMS
8 327 Amsterdam, Netherlands - Paris, France CDG AMS
9 328 Amsterdam, Netherlands - Paris, France CDG AMS
10 329 Amsterdam, Netherlands - Paris, France CDG AMS
Advertisement
Answer
As an example, and using the cleaner looking data from your other question…
Given:
..from your other question.
Try:
import pandas as pd
import numpy as np
import math
from math import sin, cos, sqrt, atan2, radians
def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
# approximate radius of earth in km
R = 6373.0
lat1 = radians(in_lat1)
lon1 = radians(in_lon1)
lat2 = radians(in_lat2)
lon2 = radians(in_lon2)
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance = R * c
return distance
df = pd.DataFrame({'Normalised City Pair': {0: 'London, United Kingdom - New York, United States',
1: 'Johannesburg, South Africa - London, United Kingdom',
2: 'London, United Kingdom - New York, United States',
3: 'Johannesburg, South Africa - London, United Kingdom',
4: 'London, United Kingdom - Singapore, Singapore'},
'Departure Code': {0: 'LHR', 1: 'JNB', 2: 'LHR', 3: 'JNB', 4: 'SIN'},
'Arrival Code': {0: 'JFK', 1: 'LHR', 2: 'JFK', 3: 'LHR', 4: 'LHR'},
'Departure_lat': {0: 51.5, 1: -26.1, 2: 51.5, 3: -26.1, 4: 1.3},
'Departure_lon': {0: -0.45, 1: 28.23, 2: -0.45, 3: 28.23, 4: 103.98},
'Arrival_lat': {0: 40.64, 1: 51.47, 2: 40.64, 3: 51.47, 4: 51.47},
'Arrival_lon': {0: -73.79, 1: -0.45, 2: -73.79, 3: -0.45, 4: -0.45}})
df_airports = pd.read_csv('https://ourairports.com/data/airports.csv')
df_airports = df_airports[['name', 'iata_code']].copy()
df_airports = df_airports[df_airports['iata_code'].notna()].reset_index(drop=True)
# df_airports.query('iata_code == "CDG" | iata_code == "LHR"')
df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)
#df[['ap_dep', 'ap_arr']] = df['Normalised City Pair'].str.split(' - ', expand=True)
df_airports = df_airports.sort_values('name')
df_airports = df_airports.drop_duplicates(subset ='iata_code', keep='first')
df['dep_ap_name'] = df['Departure Code'].map(df_airports.set_index('iata_code')['name'])
df['arr_ap_name'] = df['Arrival Code'].map(df_airports.set_index('iata_code')['name'])
Output:
If the frame now has too many columns and you want something cleaner and to reorder the columns try final_df = df[['a', 'b', 'c', 'd']]
where abc are the columns and ordering you would like.