Dedup records (window function in pandas)

Hi, I want to deduplicate my records by id, ordered by cancel date, so that I keep only the most recent record for each id.

sample data

id cancel_date type_of_fruit
1 2021-03-02 apple
1 2021-01-01 apple
2 2021-02-01 orange

expected output

id cancel_date type_of_fruit
1 2021-03-02 apple
2 2021-02-01 orange

I wrote this in SQL, but I have to implement the same logic in pandas. Please help.


Answer

Here is how you can achieve this.

First, convert the cancel_date column to a datetime type, because the rows will be ordered by cancel_date.
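The original snippet for this step is not shown here, so the following is only a minimal sketch of the conversion. It assumes the sample data from the question is loaded into a DataFrame named `df` (a name chosen for illustration) with cancel_date stored as strings.

```python
import pandas as pd

# Sample data from the question; the variable name `df` is an assumption.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "cancel_date": ["2021-03-02", "2021-01-01", "2021-02-01"],
    "type_of_fruit": ["apple", "apple", "orange"],
})

# Parse the string column into datetime values so ordering is chronological.
df["cancel_date"] = pd.to_datetime(df["cancel_date"])
```

ISO-formatted strings like these would also sort correctly as text, but converting makes the ordering safe for other date formats.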


Next, group the table by id (this is similar to PARTITION BY in SQL) and rank the rows within each group by cancel_date in descending order, so the most recent record in each group gets rank 1.
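One way to sketch this step is with `groupby(...).rank(...)`. The helper column name `rank` and the choice of `method="first"` (which breaks ties by row order, roughly like ROW_NUMBER in SQL) are assumptions, not necessarily what the original answer used.

```python
import pandas as pd

# Same sample data as in the question, with cancel_date already parsed.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "cancel_date": pd.to_datetime(["2021-03-02", "2021-01-01", "2021-02-01"]),
    "type_of_fruit": ["apple", "apple", "orange"],
})

# Rank rows within each id, newest cancel_date first.
# method="first" gives ties distinct ranks in order of appearance.
df["rank"] = df.groupby("id")["cancel_date"].rank(method="first", ascending=False)
```

With `method="first"`, exactly one row per id receives rank 1 even when two rows share the same cancel_date.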


Finally, keep only the rows whose rank is 1.
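Putting the steps together, a hedged end-to-end sketch might look like this. The names `df`, `rank`, and `latest` are illustrative, and dropping the helper column at the end is an optional cleanup step.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "cancel_date": pd.to_datetime(["2021-03-02", "2021-01-01", "2021-02-01"]),
    "type_of_fruit": ["apple", "apple", "orange"],
})

# Rank within each id, newest first (helper column name is an assumption).
df["rank"] = df.groupby("id")["cancel_date"].rank(method="first", ascending=False)

# Keep the top-ranked row per id, then drop the helper column.
latest = df[df["rank"] == 1].drop(columns="rank").reset_index(drop=True)
```

If the rank itself is not needed, `df.sort_values("cancel_date", ascending=False).drop_duplicates("id")` should give the same rows without the helper column, though in a different row order.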

User contributions licensed under: CC BY-SA