I have a pandas DataFrame with a date column. For each month, I want to keep only the records with the latest date and remove any records that came before it. There will be duplicates, and I want to keep them. I also need a new column containing only the month and year.
Input:
Provider | date |
---|---|
Apple | 01/01/2022 |
Apple | 05/01/2022 |
Apple | 20/01/2022 |
Apple | 20/01/2022 |
Apple | 05/02/2022 |
Apple | 10/02/2022 |
Output:
Provider | date | month_year |
---|---|---|
Apple | 20/01/2022 | 01/2022 |
Apple | 20/01/2022 | 01/2022 |
Apple | 10/02/2022 | 02/2022 |
Answer
Create the `month_year` column with `Series.dt.strftime`, then use `GroupBy.transform` to get the maximum datetime per `Provider` and `month_year` group, compare it with the original `date` column, and filter with boolean indexing:
```python
import pandas as pd

# Parse the day-first date strings into datetimes
df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# Add a month/year label for each row
df = df.assign(month_year = df['date'].dt.strftime('%m/%Y'))

# Keep only rows whose date equals the latest date in their Provider/month group
df = df[df.groupby(['Provider', 'month_year'])['date'].transform('max').eq(df['date'])]
print (df)
#   Provider       date month_year
# 2    Apple 2022-01-20    01/2022
# 3    Apple 2022-01-20    01/2022
# 5    Apple 2022-02-10    02/2022
```
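To try this end to end, here is a minimal self-contained sketch that rebuilds the sample input from the question and applies the same steps. The helper name `keep_latest_per_month` is made up for illustration and is not part of pandas; the sketch assumes the dates arrive as day-first strings, as in the question.

```python
import pandas as pd

# Sample data copied from the question (day-first date strings).
df = pd.DataFrame({
    "Provider": ["Apple"] * 6,
    "date": ["01/01/2022", "05/01/2022", "20/01/2022",
             "20/01/2022", "05/02/2022", "10/02/2022"],
})

def keep_latest_per_month(frame):
    """Hypothetical helper: keep rows holding the latest date per Provider and month."""
    out = frame.copy()
    # Parse strings such as 20/01/2022 as 20 January 2022.
    out["date"] = pd.to_datetime(out["date"], dayfirst=True)
    # Month/year label used both as the new column and as a grouping key.
    out["month_year"] = out["date"].dt.strftime("%m/%Y")
    # Broadcast each group's latest date back onto every row of that group.
    latest = out.groupby(["Provider", "month_year"])["date"].transform("max")
    # Rows tied on the latest date are all kept, so duplicates survive.
    return out[out["date"].eq(latest)]

result = keep_latest_per_month(df)
print(result)
```

Because `transform('max')` returns a Series aligned with the original rows, the comparison keeps every row tied for the latest date, which is why both 20/01/2022 records remain.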