I have a dataframe like this:
timestamp | value | id. |
---|---|---|
2020-12-16 | 25 | 1 |
2020-12-17 | 45 | 1 |
2020-12-31 | 40 | 1 |
2021-01-31 | 37 | 1 |
2020-12-15 | 12 | 2 |
2020-12-16 | 78. | 2. |
I want to create a new column outcome
which takes a value yes
is the id
doesn’t have any entry for the last 25 days.
For e.g., this is the expected output
timestamp | value | id. | outcome |
---|---|---|---|
2020-12-16 | 25 | 1 | yes |
2020-12-17 | 45 | 1 | yes. |
2020-12-31 | 40 | 1 | yes. |
2021-01-31 | 37 | 1 | no. |
2020-12-15 | 12 | 2 | yes. |
2020-12-16 | 78. | 2. | yes. |
In the above example, we can assume that the start date is 2020-12-15
. So, our comparisons start after this date for all the id
s.
For id 1
, starting from 2020-12-15
, it has entries in the last 25 days
except for 2021-01-31
. The entry that happened before 2021-01-31
is on 2020-12-31
which is > 25 days
in this example. So, the outcome
variable takes a value no
. For the rest of the rows, it is yes
because for each of the dates, there’s a prior entry that falls within the 25 day difference.
I am very confused about how to write an algorithm for this.
I would have to groupby
id
, order by
timestamp
and then do rolling
something on the timestamp
to create the outcome
column?
I am very confused about this and any help will be much appreciated!
Thanks in advance!
Advertisement
Answer
Because your calculation requires sorting we can avoid grouping. Sort, take a row-difference and use where
to NaN
the values that cross groups (i.e. the earliest row for every ID). Because you want the first difference to be relative to '2020-12-15'
we can use fillna
to find the difference from that date and use np.where
to assign your strings values based on your condition.
import pandas as pd import numpy as np df['timestamp'] = pd.to_datetime(df['timestamp']) df = df.sort_values(['id', 'timestamp']) s = (df['timestamp'].diff() .where(df['id'].eq(df['id'].shift())) .fillna(df['timestamp'] - pd.to_datetime('2020-12-15'))) #0 1 days #1 1 days #2 14 days #3 31 days #4 0 days #5 1 days df['outcome'] = np.where(s <= pd.Timedelta(25, 'D'), 'yes', 'no') # timestamp value id outcome #0 2020-12-16 25 1 yes #1 2020-12-17 45 1 yes #2 2020-12-31 40 1 yes #3 2021-01-31 37 1 no #4 2020-12-15 12 2 yes #5 2020-12-16 78 2 yes