dedup records(window function pandas)

Question

Hi I am looking to dedup my records ordered by cancel date so I will only be interested in the most recent record. sample data id cancel_date type_of_fruit 1 2021-03-02 apple 1 2021-01-01 apple 2 2021-02-01 orange expected output id cancel_date type_of_fruit 1 2021-03-02 apple 2 2021-02-01 orange I wrote the …

Accepted Answer

Here is how you can achieve this.Below code will convert cancel_date column into datetime object, because you want to order it using cancel_date:#--if cancel_date is a string, then this code will convert to datetime--import pandas as pddf['cancel_date']= pd.to_datetime(df['cancel_date'])Next grouping the table on id (this is similar to partition in SQL), then using cancel_date column to be sorted in descending order. Below code will achieve the same:df["Rank"] = df.groupby("id")["cancel_date"].rank(method="first", ascending= False)Finally, filtering the data with rank as 1:filtered_df = df[df["Rank"] == 1]filtered_df.head()

Advertisement

Answer