Imagine I have a dataframe with user events:
+---------+------------------+---------------------+
| user_id | event_name       | timestamp           |
+---------+------------------+---------------------+
| 1       | HomeAppear       | 2020-12-13 06:38:14 |
+---------+------------------+---------------------+
| 1       | TariffsAppear    | 2020-12-13 06:40:13 |
+---------+------------------+---------------------+
| 1       | CheckoutPayClick | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
| 2       | HomeAppear       | 2020-12-13 11:38:33 |
+---------+------------------+---------------------+
| 2       | TariffsAppear    | 2020-12-13 11:39:18 |
+---------+------------------+---------------------+
For each user, I want to add a new row after their last event (by timestamp), with an 'End' event and the same timestamp as that last event:
+---------+------------------+---------------------+
| 1       | End              | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
I have no idea how to do that. In SQL I would do that with LAG() or LEAD(). But what about pandas?
Answer
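Use DataFrame.drop_duplicates to keep the last row for each user_id, change event_name to End, and append the result to the original with concat. Sorting by index then places each End row directly after the row it was copied from; kind='mergesort' is used because it is a stable sort, so the original row stays ahead of its copy: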
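import pandas as pd

# Sort so that each user's last row is their latest event (skip if already sorted)
df = df.sort_values(['user_id', 'timestamp'], ignore_index=True)
# Keep each user's last row and relabel it as the 'End' event
df2 = df.drop_duplicates('user_id', keep='last').assign(event_name='End')
# A stable index sort puts each 'End' row right after the row it was copied from
df = pd.concat([df, df2]).sort_index(kind='mergesort').reset_index(drop=True)
print(df)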
   user_id        event_name            timestamp
0        1        HomeAppear  2020-12-13 06:38:14
1        1     TariffsAppear  2020-12-13 06:40:13
2        1  CheckoutPayClick  2020-12-13 06:50:12
3        1               End  2020-12-13 06:50:12
4        2        HomeAppear  2020-12-13 11:38:33
5        2     TariffsAppear  2020-12-13 11:39:18
6        2               End  2020-12-13 11:39:18
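For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same approach. The helper name add_end_events and the variable names are made up for illustration, and the sample rows are the ones from the question, deliberately shuffled to show that the initial sort matters. It assumes pandas 1.0 or later (for ignore_index in sort_values).

```python
import pandas as pd


def add_end_events(events: pd.DataFrame) -> pd.DataFrame:
    """Append an 'End' row after each user's last event (hypothetical helper)."""
    # Order by user and time so the last row per user is their latest event.
    ordered = events.sort_values(['user_id', 'timestamp'], ignore_index=True)
    # One row per user: the latest event, relabelled as 'End'.
    ends = ordered.drop_duplicates('user_id', keep='last').assign(event_name='End')
    # Each 'End' row keeps the index label of the row it was copied from, so a
    # stable index sort places it directly after that row.
    combined = pd.concat([ordered, ends]).sort_index(kind='mergesort')
    return combined.reset_index(drop=True)


# Sample data from the question, in shuffled order.
events = pd.DataFrame({
    'user_id': [2, 1, 1, 2, 1],
    'event_name': ['TariffsAppear', 'HomeAppear', 'CheckoutPayClick',
                   'HomeAppear', 'TariffsAppear'],
    'timestamp': pd.to_datetime([
        '2020-12-13 11:39:18', '2020-12-13 06:38:14', '2020-12-13 06:50:12',
        '2020-12-13 11:38:33', '2020-12-13 06:40:13',
    ]),
})

result = add_end_events(events)
print(result)
```

Parsing the timestamp column with pd.to_datetime is a choice made here so the sort is chronological; with the fixed-width strings shown in the question, sorting the raw strings would give the same order.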