I have the following sample DataFrame:
    import pandas as pd

    df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B'],
                       'TimeStamp': ['2022-08-02T17:33:44.358Z',
                                     '2022-08-02T17:33:44.600Z',
                                     '2022-08-02T17:33:44.814Z',
                                     '2022-08-02T17:33:45.028Z']})
I want to group by ID and get the timedelta between consecutive timestamps. I managed to get something close to the desired series with the code below, but it takes quite a long time. Is there a more efficient way to do it?
    # Assumes df['TimeStamp'] has already been converted with pd.to_datetime
    df.assign(post_data=df['TimeStamp'].shift(1)).groupby(['ID'])[['TimeStamp', 'post_data']].apply(lambda x: (x.iloc[:, 0] - x.iloc[:, 1]).to_frame('diff'))
Desired series:
    {'diff': {0: NaT,
              1: Timedelta('0 days 00:00:00.242000'),
              2: Timedelta('0 days 00:00:00.214000'),
              3: Timedelta('0 days 00:00:00.214000')}}
Answer
By the way, if you group by ID, the desired result you shared is incorrect: the third row should be empty (NaT) rather than a timedelta, since it is the first row of a different ID and has no previous timestamp in its group. Here is one way to do it:
    # convert the TimeStamp strings to datetimes
    df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

    # create post_data with a vectorized groupby shift instead of a lambda; it will be faster
    df['post_data'] = df.groupby('ID')['TimeStamp'].shift(1)

    # finally, take the difference
    df['diff'] = df['TimeStamp'].sub(df['post_data'])
    df
      ID                        TimeStamp                        post_data                   diff
    0  A 2022-08-02 17:33:44.358000+00:00                              NaT                    NaT
    1  A 2022-08-02 17:33:44.600000+00:00 2022-08-02 17:33:44.358000+00:00 0 days 00:00:00.242000
    2  B 2022-08-02 17:33:44.814000+00:00                              NaT                    NaT
    3  B 2022-08-02 17:33:45.028000+00:00 2022-08-02 17:33:44.814000+00:00 0 days 00:00:00.214000
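If the intermediate post_data column is not needed, the shift and the subtraction can probably be collapsed into a single call. The sketch below is a variation on the approach above rather than the same code: it uses the groupby `diff()` method, and it assumes the rows are already sorted by TimeStamp within each ID, as they are in the sample. The `diff_seconds` column is only an illustrative extra and is not part of the question.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B'],
    'TimeStamp': ['2022-08-02T17:33:44.358Z', '2022-08-02T17:33:44.600Z',
                  '2022-08-02T17:33:44.814Z', '2022-08-02T17:33:45.028Z'],
})

# Parse the ISO 8601 strings into timezone-aware datetimes
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Difference to the previous row within the same ID;
# the first row of each group has no predecessor, so it is NaT
df['diff'] = df.groupby('ID')['TimeStamp'].diff()

# Illustrative extra (assumption, not in the question): the same gaps as float seconds
df['diff_seconds'] = df['diff'].dt.total_seconds()

print(df)
```

If the real data is not sorted, calling `df.sort_values(['ID', 'TimeStamp'])` first should keep the differences meaningful.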