Imagine I have a dataframe with user events
+---------+------------------+---------------------+ | user_id | event_name | timestamp | +---------+------------------+---------------------+ | 1 | HomeAppear | 2020-12-13 06:38:14 | +---------+------------------+---------------------+ | 1 | TariffsAppear | 2020-12-13 06:40:13 | +---------+------------------+---------------------+ | 1 | CheckoutPayClick | 2020-12-13 06:50:12 | +---------+------------------+---------------------+ | 2 | HomeAppear | 2020-12-13 11:38:33 | +---------+------------------+---------------------+ | 2 | TariffsAppear | 2020-12-13 11:39:18 | +---------+------------------+---------------------+
For each user after his last (by timestamp) event I want to add new row with ‘End’ event with the same timestamp as in previous event:
+---------+------------------+---------------------+ | 1 | End | 2020-12-13 06:50:12 | +---------+------------------+---------------------+
I have no idea how to do that. In SQL I would do that with LAG() or LEAD(). But what about pandas?
Advertisement
Answer
Use DataFrame.drop_duplicates
for last row of User_id
, change event_name
to End
and add to original by concat
with sorting index (added safest sort mergesort
):
#if necessary sorting df = df.sort_values(['user_id', 'timestamp'], ignore_index=True) df2 = df.drop_duplicates('user_id', keep='last').assign(event_name = 'End') df = pd.concat([df, df2]).sort_index(kind='mergesort').reset_index(drop=True) print (df) user_id event_name timestamp 0 1 HomeAppear 2020-12-13 06:38:14 1 1 TariffsAppear 2020-12-13 06:40:13 2 1 CheckoutPayClick 2020-12-13 06:50:12 3 1 End 2020-12-13 06:50:12 4 2 HomeAppear 2020-12-13 11:38:33 5 2 TariffsAppear 2020-12-13 11:39:18 6 2 End 2020-12-13 11:39:18