Imagine I have a dataframe with user events For each user after his last (by timestamp) event I want to add new row with 'End' event with the same timestamp as in previous event: I have no idea how to do that. In SQL I would do that with LAG() or LEAD(). But what about pandas? Answer Use DataFrame.drop_duplicates for

How to add a row based on last user event in pandas?

Imagine I have a dataframe with user events

+---------+------------------+---------------------+
| user_id | event_name       | timestamp           |
+---------+------------------+---------------------+
| 1       | HomeAppear       | 2020-12-13 06:38:14 |
+---------+------------------+---------------------+
| 1       | TariffsAppear    | 2020-12-13 06:40:13 |
+---------+------------------+---------------------+
| 1       | CheckoutPayClick | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
| 2       | HomeAppear       | 2020-12-13 11:38:33 |
+---------+------------------+---------------------+
| 2       | TariffsAppear    | 2020-12-13 11:39:18 |
+---------+------------------+---------------------+

JavaScript
 
+---------+------------------+---------------------+
| user_id | event_name       | timestamp           |
+---------+------------------+---------------------+
| 1       | HomeAppear       | 2020-12-13 06:38:14 |
+---------+------------------+---------------------+
| 1       | TariffsAppear    | 2020-12-13 06:40:13 |
+---------+------------------+---------------------+
| 1       | CheckoutPayClick | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
| 2       | HomeAppear       | 2020-12-13 11:38:33 |
+---------+------------------+---------------------+
| 2       | TariffsAppear    | 2020-12-13 11:39:18 |
+---------+------------------+---------------------+
​

For each user after his last (by timestamp) event I want to add new row with ‘End’ event with the same timestamp as in previous event:

+---------+------------------+---------------------+
| 1       | End              | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+

JavaScript
 
+---------+------------------+---------------------+
| 1       | End              | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
​

I have no idea how to do that. In SQL I would do that with LAG() or LEAD(). But what about pandas?

Answer

Use DataFrame.drop_duplicates for last row of User_id, change event_name to End and add to original by concat with sorting index (added safest sort mergesort):

#if necessary sorting
df = df.sort_values(['user_id', 'timestamp'], ignore_index=True)

df2 = df.drop_duplicates('user_id', keep='last').assign(event_name = 'End')

df = pd.concat([df, df2]).sort_index(kind='mergesort').reset_index(drop=True)
print (df)
   user_id        event_name            timestamp
0        1        HomeAppear  2020-12-13 06:38:14
1        1     TariffsAppear  2020-12-13 06:40:13
2        1  CheckoutPayClick  2020-12-13 06:50:12
3        1               End  2020-12-13 06:50:12
4        2        HomeAppear  2020-12-13 11:38:33
5        2     TariffsAppear  2020-12-13 11:39:18
6        2               End  2020-12-13 11:39:18

JavaScript
 
#if necessary sorting
df = df.sort_values(['user_id', 'timestamp'], ignore_index=True)
​
df2 = df.drop_duplicates('user_id', keep='last').assign(event_name = 'End')
​
df = pd.concat([df, df2]).sort_index(kind='mergesort').reset_index(drop=True)
print (df)
   user_id        event_name            timestamp
0        1        HomeAppear  2020-12-13 06:38:14
1        1     TariffsAppear  2020-12-13 06:40:13
2        1  CheckoutPayClick  2020-12-13 06:50:12
3        1               End  2020-12-13 06:50:12
4        2        HomeAppear  2020-12-13 11:38:33
5        2     TariffsAppear  2020-12-13 11:39:18
6        2               End  2020-12-13 11:39:18
​

Advertisement

Answer