Hi I am looking to dedup my records ordered by cancel date so I will only be interested in the most recent record.
sample data
id | cancel_date | type_of_fruit |
---|---|---|
1 | 2021-03-02 | apple |
1 | 2021-01-01 | apple |
2 | 2021-02-01 | orange |
expected output
id | cancel_date | type_of_fruit |
---|---|---|
1 | 2021-03-02 | apple |
2 | 2021-02-01 | orange |
I wrote the SQL way but I have to implement this logic in pandas, please help
JavaScript
x
9
1
SELECT
2
*
3
FROM
4
(SELECT *,
5
rank() over(partition by id order by cancel_date desc) as rank
6
FROM df
7
ORDER BY id, cancel_date DESC) a
8
where rank = 1
9
Advertisement
Answer
Here is how you can achieve this.
Below code will convert cancel_date
column into datetime
object, because you want to order it using cancel_date
:
JavaScript
1
5
1
#--if cancel_date is a string, then this code will convert to datetime--
2
3
import pandas as pd
4
df['cancel_date']= pd.to_datetime(df['cancel_date'])
5
Next grouping the table on id
(this is similar to partition in SQL
), then using cancel_date
column to be sorted in descending
order. Below code will achieve the same:
JavaScript
1
2
1
df["Rank"] = df.groupby("id")["cancel_date"].rank(method="first", ascending= False)
2
Finally, filtering the data with rank as 1:
JavaScript
1
3
1
filtered_df = df[df["Rank"] == 1]
2
filtered_df.head()
3