Skip to content
Advertisement

How to use pandas to create a column that stores count of first occurrences on a group-by?

Q1. Given data frame 1, I am trying to get group-by unique new occurrences & another column that gives me existing ID count per month

ID     Date
1    Jan-2020
2    Feb-2020
3    Feb-2020
1    Mar-2020
2    Mar-2020
3    Mar-2020
4    Apr-2020
5    Apr-2020

Expected output for unique newly added group-by ID values & for existing sum of ID values

Date       ID_Count   Existing_count
Jan-2020      1           0
Feb-2020      2           1  
Mar-2020      0           3
Apr-2020      2           3

Note: Mar-2020 ID_Count is ZERO because ID 1, 2, and 3 were present in previous months.

Note: Existing count is 0 for Jan-2020 because there were zero IDs before Jan. The existing count for Feb-2020 is 1 because before Feb there was only 1. Mar-2020 has 3 existing counts as it adds Jan + Feb and so on

Advertisement

Answer

I think you can do it like this:

df['month'] = pd.to_datetime(df['Date'], format='%b-%Y')

# Find new IDs
df['new'] = df.groupby('ID').cumcount()==0

# Count new IDs by month
df_ct = df.groupby('month')['new'].sum().to_frame(name='ID_Count')

# Count all previous new IDs
df_ct['Existing_cnt'] = df_ct['ID_Count'].shift().cumsum().fillna(0).astype(int) 
df_ct.index = df_ct.index.strftime('%b-%Y')
df_ct

Output:

          ID_Count  Existing_cnt
month                           
Jan-2020         1             0
Feb-2020         2             1
Mar-2020         0             3
Apr-2020         2             3
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement