Skip to content
Advertisement

Creating a new column if a condition is satisfied in the last N days python pandas

I have a dataframe like this:

timestamp value id.
2020-12-16 25 1
2020-12-17 45 1
2020-12-31 40 1
2021-01-31 37 1
2020-12-15 12 2
2020-12-16 78. 2.

I want to create a new column outcome which takes a value yes is the id doesn’t have any entry for the last 25 days. For e.g., this is the expected output

timestamp value id. outcome
2020-12-16 25 1 yes
2020-12-17 45 1 yes.
2020-12-31 40 1 yes.
2021-01-31 37 1 no.
2020-12-15 12 2 yes.
2020-12-16 78. 2. yes.

In the above example, we can assume that the start date is 2020-12-15. So, our comparisons start after this date for all the ids. For id 1, starting from 2020-12-15, it has entries in the last 25 days except for 2021-01-31. The entry that happened before 2021-01-31 is on 2020-12-31 which is > 25 days in this example. So, the outcome variable takes a value no. For the rest of the rows, it is yes because for each of the dates, there’s a prior entry that falls within the 25 day difference.

I am very confused about how to write an algorithm for this.

I would have to groupby id, order by timestamp and then do rolling something on the timestamp to create the outcome column?

I am very confused about this and any help will be much appreciated!

Thanks in advance!

Advertisement

Answer

Because your calculation requires sorting we can avoid grouping. Sort, take a row-difference and use where to NaN the values that cross groups (i.e. the earliest row for every ID). Because you want the first difference to be relative to '2020-12-15' we can use fillna to find the difference from that date and use np.where to assign your strings values based on your condition.

import pandas as pd
import numpy as np

df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['id', 'timestamp'])

s = (df['timestamp'].diff()
       .where(df['id'].eq(df['id'].shift()))
       .fillna(df['timestamp'] - pd.to_datetime('2020-12-15')))
#0    1 days
#1    1 days
#2   14 days
#3   31 days
#4    0 days
#5    1 days

df['outcome'] = np.where(s <= pd.Timedelta(25, 'D'), 'yes', 'no')
#   timestamp  value  id outcome
#0 2020-12-16     25   1     yes
#1 2020-12-17     45   1     yes
#2 2020-12-31     40   1     yes
#3 2021-01-31     37   1      no
#4 2020-12-15     12   2     yes
#5 2020-12-16     78   2     yes
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement