Skip to content
Advertisement

Create a Boolean Column that checks for uid elsehwere in a Dataframe matching a certain condition?

Suppose I have a dataframe:

Case_no Event_ID Date Type Done_by
1 1 2023-01-01 Email Sent Customer
1 2 2023-01-02 Response Agent
1 3 2023-01-03 Email Sent Customer
1 4 2023-01-10 Response Agent
2 5 2023-01-02 Email Sent Customer
2 6 2023-01-04 Email Sent Customer

And I wanted to add two boolean columns that check for each case if an email has been sent in, has there been a response within 5 days on that case. There is no way to link emails and responses but just having it flag up that a response has been sent and whether or not it was in 5 days after is good enough for what I am doing as such:

Case_no Event_ID Date Type Done_by Response? Response_Within_Five_Days
1 1 2023-01-01 Email Sent Customer TRUE TRUE
1 2 2023-01-02 Response Agent nan NAN
1 3 2023-01-03 Email Sent Customer TRUE FALSE
1 4 2023-01-10 Response Agent NAN NAN
2 5 2023-01-02 Email Sent Customer FALSE FALSE
2 6 2023-01-04 Email Sent Customer FALSE FALSE

How would I go about doing that?

Advertisement

Answer

Solution for compare per groups if next value after Email Sent is Response by DataFrameGroupBy.shift:

df['Date'] = pd.to_datetime(df['Date'])

shift = df.groupby('Case_no')[['Type','Date']].shift(-1)

m1 = df['Type'].eq('Email Sent')
m2 = shift['Type'].eq('Response')
m3 = shift['Date'].sub(df['Date']).dt.days.lt(5)

df['Response?'] = (m1 & m2).astype('boolean')
df['Response_Within_Five_Days'] = df['Response?'] & m3

df.loc[df['Type'].eq('Response'), ['Response?','Response_Within_Five_Days']] = np.nan
print (df)
   Case_no  Event_ID       Date        Type   Done_by  Response?  
0        1         1 2023-01-01  Email Sent  Customer       True   
1        1         2 2023-01-02    Response     Agent       <NA>   
2        1         3 2023-01-03  Email Sent  Customer       True   
3        1         4 2023-01-10    Response     Agent       <NA>   
4        2         5 2023-01-02  Email Sent  Customer      False   
5        2         6 2023-01-04  Email Sent  Customer      False   

   Response_Within_Five_Days  
0                       True  
1                       <NA>  
2                      False  
3                       <NA>  
4                      False  
5                      False  

EDIT: For check next rows after Email Sent use custom function:

def f(x):
    
    m = x['Type'].eq('Email Sent')
    for k, v in x.loc[m, 'Date'].items():
        s = x[~m].loc[k:, 'Date'].sub(v).dt.days
        x.loc[k, 'Response?'] = not s.empty
        x.loc[k, 'Response_Within_Five_Days'] = s.lt(5).any()

    return x

df = df.reset_index(drop=True)

df =  df.groupby('Case_no').apply(f)
print (df)
   Case_no  Event_ID       Date        Type   Done_by Response?  
0        1         1 2023-01-01  Email Sent  Customer      True   
1        1         2 2023-01-02    Response     Agent       NaN   
2        1         3 2023-01-03  Email Sent  Customer      True   
3        1         4 2023-01-10    Response     Agent       NaN   
4        2         5 2023-01-02  Email Sent  Customer     False   
5        2         6 2023-01-04  Email Sent  Customer     False   

  Response_Within_Five_Days  
0                      True  
1                       NaN  
2                     False  
3                       NaN  
4                     False  
5                     False  
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement