Skip to content
Advertisement

How to clean data so that the correct arrival code is there for the city pair?

How to clean data so that the correct arrival code is there for the city pair?

From the picture, the CSV is like column 1: City Pair (Departure – Arrival), column 2 is meant to be the Departure Code, and column 3 is meant to be the Arrival Code.

As you can see for row 319 in the first column, it is ‘London – Paris’, column 2 is CDG, and column 3 is LHR when it should be the opposite.

Also in row 324, Dusseldorf – Paris is CDG in column 2 and DUS in column 3, when it should be the opposite.

The majority of the CSV is correct, but there are a lot of errors like this.

Could someone help on how to sort this for all the rows? Maybe indexing through all of the rows in column 1 and making sure the text before and after the ‘-‘ matches the codes in columns 2 and 3?

Columns example in CSV

    old_index   airports                                arr dep
0   319         London, United Kingdom - Paris, France  CDG LHR
1   320         London, United Kingdom - Paris, France  CDG LHR
2   321         London, United Kingdom - Paris, France  CDG LHR
3   322         London, United Kingdom - Paris, France  CDG LHR
4   323         London, United Kingdom - Paris, France  CDG LHR
5   324         Dusseldorf, Germany - Paris, France     CDG DUS
6   325         Amsterdam, Netherlands - Paris, France  CDG AMS
7   326         Amsterdam, Netherlands - Paris, France  CDG AMS
8   327         Amsterdam, Netherlands - Paris, France  CDG AMS
9   328         Amsterdam, Netherlands - Paris, France  CDG AMS
10  329         Amsterdam, Netherlands - Paris, France  CDG AMS

Advertisement

Answer

As an example, and using the cleaner looking data from your other question…

Given:

enter image description here

..from your other question.

Try:

import pandas as pd
import numpy as np
import math
from math import sin, cos, sqrt, atan2, radians

def get_distance(in_lat1, in_lon1, in_lat2, in_lon2):
    # approximate radius of earth in km
    R = 6373.0

    lat1 = radians(in_lat1)
    lon1 = radians(in_lon1)
    lat2 = radians(in_lat2)
    lon2 = radians(in_lon2)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance

df = pd.DataFrame({'Normalised City Pair': {0: 'London, United Kingdom - New York, United States',
  1: 'Johannesburg, South Africa - London, United Kingdom',
  2: 'London, United Kingdom - New York, United States',
  3: 'Johannesburg, South Africa - London, United Kingdom',
  4: 'London, United Kingdom - Singapore, Singapore'},
 'Departure Code': {0: 'LHR', 1: 'JNB', 2: 'LHR', 3: 'JNB', 4: 'SIN'},
 'Arrival Code': {0: 'JFK', 1: 'LHR', 2: 'JFK', 3: 'LHR', 4: 'LHR'},
 'Departure_lat': {0: 51.5, 1: -26.1, 2: 51.5, 3: -26.1, 4: 1.3},
 'Departure_lon': {0: -0.45, 1: 28.23, 2: -0.45, 3: 28.23, 4: 103.98},
 'Arrival_lat': {0: 40.64, 1: 51.47, 2: 40.64, 3: 51.47, 4: 51.47},
 'Arrival_lon': {0: -73.79, 1: -0.45, 2: -73.79, 3: -0.45, 4: -0.45}})

df_airports = pd.read_csv('https://ourairports.com/data/airports.csv')
df_airports = df_airports[['name', 'iata_code']].copy()
df_airports = df_airports[df_airports['iata_code'].notna()].reset_index(drop=True)
# df_airports.query('iata_code == "CDG" | iata_code == "LHR"')

df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1)

#df[['ap_dep', 'ap_arr']] = df['Normalised City Pair'].str.split(' - ', expand=True)

df_airports = df_airports.sort_values('name')

df_airports = df_airports.drop_duplicates(subset ='iata_code', keep='first')

df['dep_ap_name'] = df['Departure Code'].map(df_airports.set_index('iata_code')['name'])
df['arr_ap_name'] = df['Arrival Code'].map(df_airports.set_index('iata_code')['name'])

Output:

enter image description here

If the frame now has too many columns and you want something cleaner and to reorder the columns try final_df = df[['a', 'b', 'c', 'd']] where abc are the columns and ordering you would like.

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement