I would like to count the frequency of a value over the past x days. In the example below, I want to count the frequency of each value in the Name column over the past 28 days. The data is already sorted by Date.
import pandas as pd
import time

d = {'Name': ['Jack', 'Jim', 'Jack', 'Jim', 'Jack', 'Jack', 'Jim', 'Jack', 'Jane', 'Jane'],
     'Date': ['08/01/2021', '27/01/2021', '05/02/2021', '10/02/2021', '17/02/2021',
              '18/02/2021', '20/02/2021', '21/02/2021', '22/02/2021', '29/03/2021']}
df = pd.DataFrame(data=d)
df['Date'] = pd.to_datetime(df.Date, format='%d/%m/%Y')

# Make sure the frame is sorted by Date
df = df.sort_values('Date')
I found some solutions on Stack Overflow, but none of them is both correct on this dataset and fast.
Approach 1 – not quite correct
df['count1'] = (df.set_index('Date')
                  .groupby('Name', sort=False)['Name']
                  .rolling('28d', closed='both')
                  .count().tolist())
Approach 2 – correct, but very slow (from this link)
df['count2'] = (df.assign(count=1)
                  .groupby(['Name'])
                  .apply(lambda x: x.rolling('28d', on='Date').sum())['count'])
Approach 3 – using sum – not correct
df['count3'] = (df.assign(count=1)
                  .groupby('Name')
                  .rolling('28d', on='Date')
                  .sum()
                  .reset_index()
                  .sort_values('Date')['count'])
Approach 4 – also using sum – not correct, as the indexes are not right (from this link)
df['count4'] = (df.set_index('Date')
                  .assign(count_last=1)
                  .groupby('Name')
                  .rolling('28d')
                  .sum()
                  .reset_index()["count_last"])
Output
   Name       Date  count1  count2  count3  count4
0  Jack 2021-01-08     1.0     1.0     1.0     1.0
1   Jim 2021-01-27     2.0     1.0     1.0     1.0
2  Jack 2021-02-05     2.0     1.0     2.0     2.0
3   Jim 2021-02-10     3.0     2.0     3.0     3.0
4  Jack 2021-02-17     4.0     2.0     4.0     4.0  # <~ all are wrong here except approach 2
5  Jack 2021-02-18     1.0     3.0     1.0     1.0
6   Jim 2021-02-20     2.0     3.0     1.0     1.0
7  Jack 2021-02-21     3.0     4.0     1.0     1.0
8  Jane 2021-02-22     1.0     1.0     2.0     2.0
9  Jane 2021-03-29     1.0     1.0     3.0     3.0
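The shuffling in count1 can be seen by inspecting the index of the grouped rolling result directly. A minimal sketch (not one of the original attempts):

rolled = (df.set_index('Date')
            .groupby('Name', sort=False)['Name']
            .rolling('28d', closed='both')
            .count())
# The result is ordered group by group (all Jack rows, then Jim, then Jane),
# not in the original row order, so .tolist() pairs counts with the wrong rows
print(rolled.index.tolist())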
Performance
Method 1: 0.0014538764953613281 s
Method 2: 0.0034720897674560547 s
Method 3: 0.002077817916870117 s
Method 4: 0.0035729408264160156 s
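For reference, a minimal sketch of how such a timing can be taken (the exact harness isn't shown above; time.perf_counter is my choice here, the original magnitudes suggest a plain wall-clock measurement):

start = time.perf_counter()
df.set_index('Date').groupby('Name', sort=False)['Name'] \
  .rolling('28d', closed='both').count()
print("Method 1: %s s" % (time.perf_counter() - start))  # elapsed wall-clock seconds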
Update – Solution
Based on the marked answer, I wrote this solution. I think it works when the data has NULL values and duplicates, and it does not change the size of the original dataset.
def frequency_of_last_n_days(df: pd.DataFrame, identifier: str,
                             timestamp: str, delta: int) -> pd.DataFrame:
    col_name = "count_%s" % identifier
    temp_df = (df.set_index(timestamp)
                 .groupby(identifier, sort=False)[identifier]
                 .rolling('%sd' % delta, closed='both')
                 .count()
                 .rename(col_name))
    # Drop duplicated (identifier, timestamp) pairs so the merge
    # does not grow the original frame
    temp_df = temp_df[~temp_df.index.duplicated(keep="first")]
    return df.merge(temp_df, how="left",
                    left_on=[identifier, timestamp], right_index=True)

frequency_of_last_n_days(df, "Name", "Date", 30)
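As a quick sanity check (hypothetical data, not from the original post), duplicates and NULLs behave as intended:

d2 = {'Name': ['Jack', 'Jack', None, 'Jim'],  # duplicated (Name, Date) pair and a NULL name
      'Date': ['01/02/2021', '01/02/2021', '02/02/2021', '03/02/2021']}
df2 = pd.DataFrame(data=d2)
df2['Date'] = pd.to_datetime(df2.Date, format='%d/%m/%Y')
# All four rows survive; both duplicated Jack rows get the first rolling
# count, and the NULL-name row gets NaN in count_Name
print(frequency_of_last_n_days(df2, "Name", "Date", 28))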
Answer
IIUC, the issue is coming from your tolist(), which messes up the index alignment and shuffles the output. Use a merge instead:
df2 = (df
       .merge(df.set_index('Date')
                .groupby('Name', sort=False)['Name']
                .rolling('28d', closed='both')  # do you really want closed="both"?
                .count().rename('count'),
              left_on=['Name', 'Date'],
              right_index=True)
      )
output:
   Name       Date  count
0  Jack 2021-01-08    1.0
1   Jim 2021-01-27    1.0
2  Jack 2021-02-05    2.0  # <- if you want 1 here, remove closed='both'
3   Jim 2021-02-10    2.0
4  Jack 2021-02-17    2.0
5  Jack 2021-02-18    3.0
6   Jim 2021-02-20    3.0
7  Jack 2021-02-21    4.0
8  Jane 2021-02-22    1.0
9  Jane 2021-03-29    1.0
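If the row exactly 28 days back should not be counted (Jack's 2021-02-05 is exactly 28 days after 2021-01-08), a variant of the same merge without closed='both' relies on the default closed='right' for offset-based windows:

df3 = (df
       .merge(df.set_index('Date')
                .groupby('Name', sort=False)['Name']
                .rolling('28d')  # default closed='right': window is (Date - 28d, Date]
                .count().rename('count'),
              left_on=['Name', 'Date'],
              right_index=True)
      )
# Jack's 2021-02-05 row now gets 1.0, since 2021-01-08 falls outside the window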