I have a dataframe like this:
| timestamp | value | id. |
|---|---|---|
| 2020-12-16 | 25 | 1 |
| 2020-12-17 | 45 | 1 |
| 2020-12-31 | 40 | 1 |
| 2021-01-31 | 37 | 1 |
| 2020-12-15 | 12 | 2 |
| 2020-12-16 | 78. | 2. |
I want to create a new column outcome which takes a value yes is the id doesn’t have any entry for the last 25 days.
For e.g., this is the expected output
| timestamp | value | id. | outcome |
|---|---|---|---|
| 2020-12-16 | 25 | 1 | yes |
| 2020-12-17 | 45 | 1 | yes. |
| 2020-12-31 | 40 | 1 | yes. |
| 2021-01-31 | 37 | 1 | no. |
| 2020-12-15 | 12 | 2 | yes. |
| 2020-12-16 | 78. | 2. | yes. |
In the above example, we can assume that the start date is 2020-12-15. So, our comparisons start after this date for all the ids.
For id 1, starting from 2020-12-15, it has entries in the last 25 days except for 2021-01-31. The entry that happened before 2021-01-31 is on 2020-12-31 which is > 25 days in this example. So, the outcome variable takes a value no. For the rest of the rows, it is yes because for each of the dates, there’s a prior entry that falls within the 25 day difference.
I am very confused about how to write an algorithm for this.
I would have to groupby id, order by timestamp and then do rolling something on the timestamp to create the outcome column?
I am very confused about this and any help will be much appreciated!
Thanks in advance!
Advertisement
Answer
Because your calculation requires sorting we can avoid grouping. Sort, take a row-difference and use where to NaN the values that cross groups (i.e. the earliest row for every ID). Because you want the first difference to be relative to '2020-12-15' we can use fillna to find the difference from that date and use np.where to assign your strings values based on your condition.
import pandas as pd
import numpy as np
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.sort_values(['id', 'timestamp'])
s = (df['timestamp'].diff()
.where(df['id'].eq(df['id'].shift()))
.fillna(df['timestamp'] - pd.to_datetime('2020-12-15')))
#0 1 days
#1 1 days
#2 14 days
#3 31 days
#4 0 days
#5 1 days
df['outcome'] = np.where(s <= pd.Timedelta(25, 'D'), 'yes', 'no')
# timestamp value id outcome
#0 2020-12-16 25 1 yes
#1 2020-12-17 45 1 yes
#2 2020-12-31 40 1 yes
#3 2021-01-31 37 1 no
#4 2020-12-15 12 2 yes
#5 2020-12-16 78 2 yes