How to clean data so that the correct arrival code is there for the city pair?
From the picture, the CSV is like column 1: City Pair (Departure – Arrival), column 2 is meant to be the Departure Code, and column 3 is meant to be the Arrival Code.
As you can see for row 319 in the first column, it is ‘London – Paris’, column 2 is CDG, and column 3 is LHR when it should be the opposite.
Also in row 324, Dusseldorf – Paris is CDG in column 2 and DUS in column 3, when it should be the opposite.
The majority of the CSV is correct, but there are a lot of errors like this.
Could someone help on how to sort this for all the rows? Maybe indexing through all of the rows in column 1 and making sure the text before and after the ‘-‘ matches the codes in columns 2 and 3?
old_index airports arr dep 0 319 London, United Kingdom - Paris, France CDG LHR 1 320 London, United Kingdom - Paris, France CDG LHR 2 321 London, United Kingdom - Paris, France CDG LHR 3 322 London, United Kingdom - Paris, France CDG LHR 4 323 London, United Kingdom - Paris, France CDG LHR 5 324 Dusseldorf, Germany - Paris, France CDG DUS 6 325 Amsterdam, Netherlands - Paris, France CDG AMS 7 326 Amsterdam, Netherlands - Paris, France CDG AMS 8 327 Amsterdam, Netherlands - Paris, France CDG AMS 9 328 Amsterdam, Netherlands - Paris, France CDG AMS 10 329 Amsterdam, Netherlands - Paris, France CDG AMS
Advertisement
Answer
As an example, and using the cleaner looking data from your other question…
Given:
..from your other question.
Try:
import pandas as pd import numpy as np import math from math import sin, cos, sqrt, atan2, radians def get_distance(in_lat1, in_lon1, in_lat2, in_lon2): # approximate radius of earth in km R = 6373.0 lat1 = radians(in_lat1) lon1 = radians(in_lon1) lat2 = radians(in_lat2) lon2 = radians(in_lon2) dlon = lon2 - lon1 dlat = lat2 - lat1 a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2 c = 2 * atan2(sqrt(a), sqrt(1 - a)) distance = R * c return distance df = pd.DataFrame({'Normalised City Pair': {0: 'London, United Kingdom - New York, United States', 1: 'Johannesburg, South Africa - London, United Kingdom', 2: 'London, United Kingdom - New York, United States', 3: 'Johannesburg, South Africa - London, United Kingdom', 4: 'London, United Kingdom - Singapore, Singapore'}, 'Departure Code': {0: 'LHR', 1: 'JNB', 2: 'LHR', 3: 'JNB', 4: 'SIN'}, 'Arrival Code': {0: 'JFK', 1: 'LHR', 2: 'JFK', 3: 'LHR', 4: 'LHR'}, 'Departure_lat': {0: 51.5, 1: -26.1, 2: 51.5, 3: -26.1, 4: 1.3}, 'Departure_lon': {0: -0.45, 1: 28.23, 2: -0.45, 3: 28.23, 4: 103.98}, 'Arrival_lat': {0: 40.64, 1: 51.47, 2: 40.64, 3: 51.47, 4: 51.47}, 'Arrival_lon': {0: -73.79, 1: -0.45, 2: -73.79, 3: -0.45, 4: -0.45}}) df_airports = pd.read_csv('https://ourairports.com/data/airports.csv') df_airports = df_airports[['name', 'iata_code']].copy() df_airports = df_airports[df_airports['iata_code'].notna()].reset_index(drop=True) # df_airports.query('iata_code == "CDG" | iata_code == "LHR"') df['Distance'] = df.apply(lambda x: get_distance(x['Departure_lat'], x['Departure_lon'], x['Arrival_lat'], x['Arrival_lon']), axis=1) #df[['ap_dep', 'ap_arr']] = df['Normalised City Pair'].str.split(' - ', expand=True) df_airports = df_airports.sort_values('name') df_airports = df_airports.drop_duplicates(subset ='iata_code', keep='first') df['dep_ap_name'] = df['Departure Code'].map(df_airports.set_index('iata_code')['name']) df['arr_ap_name'] = df['Arrival Code'].map(df_airports.set_index('iata_code')['name'])
Output:
If the frame now has too many columns and you want something cleaner and to reorder the columns try final_df = df[['a', 'b', 'c', 'd']]
where abc are the columns and ordering you would like.