I have the following dataframe df with names of the travelling cities in route column of an aircraft with it’s ticket_price.
I want to obtain individual city names from route and one hot encode them.
Dataframe (df)
id| route | ticket_price 1 | Mumbai - Pune - Bangalore | 10000 2 | Pune - Delhi | 7000 3 | Delhi - Pune | 6500
Required Dataframe (df_encoded)
id | route_mumbai | route_pune | route_bangalore | route_delhi | ticket_price 1 | 1 | 1 | 1 | 0 | 10000 2 | 0 | 1 | 0 | 1 | 7000 3 | 0 | 1 | 0 | 1 | 6500
Code
I have performed some preprocessing on the route column using the following code but am unable to understand how to one hot encode it.
def location_preprocessing(text):
"""
Function to Preprocess the features having location names.
"""
text = text.replace(" ", "") # Remove whitespaces
text = text.split("|") # Obtain individual cities
lst_text = [x.lower() for x in text] # Lowercase city names
text = " ".join(lst_text) # Convert to string from list
return text
df['route'] = df['route'].apply(lambda x: location_preprocessing(x))
If I apply one hot encoding directly using the below code then all routes are considered unique and are one hot encoded individually which is not desired. I want individual cities to be one hot encoded not the route.
df = pd.get_dummies(df, columns = ['route']) # One-hot Encoding `route`
How to obtain the required dataframe?
Advertisement
Answer
If you have dataframe:
id route ticket_price 0 1 Mumbai - Pune - Bangalore 10000 1 2 Pune - Delhi 7000 2 3 Delhi - Pune 6500
Then:
df.route = df.route.str.split(" - ")
df_out = pd.concat(
[
df.explode("route")
.pivot_table(index="id", columns="route", aggfunc="size", fill_value=0)
.add_prefix("Route_"),
df.set_index("id").ticket_price,
],
axis=1,
)
print(df_out)
Prints:
Route_Bangalore Route_Delhi Route_Mumbai Route_Pune ticket_price id 1 1 0 1 1 10000 2 0 1 0 1 7000 3 0 1 0 1 6500