Imagine I have a DataFrame with user events:
+---------+------------------+---------------------+
| user_id | event_name       | timestamp           |
+---------+------------------+---------------------+
|       1 | HomeAppear       | 2020-12-13 06:38:14 |
+---------+------------------+---------------------+
|       1 | TariffsAppear    | 2020-12-13 06:40:13 |
+---------+------------------+---------------------+
|       1 | CheckoutPayClick | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
|       2 | HomeAppear       | 2020-12-13 11:38:33 |
+---------+------------------+---------------------+
|       2 | TariffsAppear    | 2020-12-13 11:39:18 |
+---------+------------------+---------------------+
For each user, after their last event (by timestamp), I want to add a new row with an 'End' event that has the same timestamp as that last event. For user 1, for example:
+---------+------------------+---------------------+
|       1 | End              | 2020-12-13 06:50:12 |
+---------+------------------+---------------------+
I have no idea how to do that. In SQL I would do that with LAG() or LEAD(). But what about pandas?
Answer
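Use DataFrame.drop_duplicates with keep='last' to get the last row for each user_id, change event_name to End, and add the result to the original with concat. Then sort by index, using the stable mergesort so that each End row stays right after the row it was copied from: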
import pandas as pd

# sort first if the rows are not already ordered by user and time
df = df.sort_values(['user_id', 'timestamp'], ignore_index=True)

# last row per user, relabelled as the 'End' event
df2 = df.drop_duplicates('user_id', keep='last').assign(event_name='End')

# each End row keeps the index label of the row it copies, so a stable
# sort on the index places it directly after that row
df = pd.concat([df, df2]).sort_index(kind='mergesort').reset_index(drop=True)
print(df)

   user_id        event_name            timestamp
0        1        HomeAppear  2020-12-13 06:38:14
1        1     TariffsAppear  2020-12-13 06:40:13
2        1  CheckoutPayClick  2020-12-13 06:50:12
3        1               End  2020-12-13 06:50:12
4        2        HomeAppear  2020-12-13 11:38:33
5        2     TariffsAppear  2020-12-13 11:39:18
6        2               End  2020-12-13 11:39:18
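If you want to try this end to end, here is a minimal self-contained sketch of the same approach. The helper name append_end_events is made up for illustration (it is not a pandas function), and the sample frame is rebuilt by hand from the table in the question, with timestamp parsed via pd.to_datetime as an assumption about the column type.

```python
import pandas as pd


def append_end_events(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame with an 'End' row after each user's last event.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Order rows by user and time so the last row per user is the latest event.
    df = df.sort_values(["user_id", "timestamp"], ignore_index=True)
    # Take the last row per user and relabel it as the synthetic 'End' event.
    ends = df.drop_duplicates("user_id", keep="last").assign(event_name="End")
    # Each End row shares its index label with the row it copies, so a
    # stable (mergesort) index sort puts it immediately after that row.
    out = pd.concat([df, ends]).sort_index(kind="mergesort")
    return out.reset_index(drop=True)


# Sample data rebuilt from the question's table.
events = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2],
        "event_name": [
            "HomeAppear",
            "TariffsAppear",
            "CheckoutPayClick",
            "HomeAppear",
            "TariffsAppear",
        ],
        "timestamp": pd.to_datetime(
            [
                "2020-12-13 06:38:14",
                "2020-12-13 06:40:13",
                "2020-12-13 06:50:12",
                "2020-12-13 11:38:33",
                "2020-12-13 11:39:18",
            ]
        ),
    }
)

result = append_end_events(events)
print(result)
```

The stable sort matters here: a non-stable sort gives no guarantee about the relative order of two rows with the same index label, so the End row could land before the event it duplicates.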