I have the following sample DataFrame:
df = pd.DataFrame({'ID':['A','A','B','B'],'TimeStamp':['2022-08-02T17:33:44.358Z', '2022-08-02T17:33:44.600Z', '2022-08-02T17:33:44.814Z', '2022-08-02T17:33:45.028Z']})
I want to group by ID and get the timedelta between consecutive timestamps. I managed to get something close to the wanted series with the code below, but it takes quite a long time. Is there a more efficient way to do it?
df.assign(post_data = df['TimeStamp'].shift(1)).groupby(['ID'])[['TimeStamp','post_data']].apply(lambda x : (x.iloc[:,0] - x.iloc[:,1]).to_frame('diff'))
Wanted series
{'diff': {0: NaT, 1: Timedelta('0 days 00:00:00.242000'), 2: Timedelta('0 days 00:00:00.214000'), 3: Timedelta('0 days 00:00:00.214000')}}
Answer
Here is one way to do it. By the way, if you group by ID, the desired result you shared is incorrect: the third row should be NaT rather than a timedelta, since it is the first row of a different ID and has no previous timestamp within its group.
# convert the TimeStamp column to datetime
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])
# create post_data with a vectorized groupby shift instead of a lambda; it will be faster
df['post_data'] = df.groupby('ID')['TimeStamp'].shift(1)
# finally, take the difference
df['diff'] = df['TimeStamp'].sub(df['post_data'])
df
  ID                        TimeStamp                        post_data                   diff
0  A 2022-08-02 17:33:44.358000+00:00                              NaT                    NaT
1  A 2022-08-02 17:33:44.600000+00:00 2022-08-02 17:33:44.358000+00:00 0 days 00:00:00.242000
2  B 2022-08-02 17:33:44.814000+00:00                              NaT                    NaT
3  B 2022-08-02 17:33:45.028000+00:00 2022-08-02 17:33:44.814000+00:00 0 days 00:00:00.214000
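If the intermediate post_data column is not needed, the same per-group difference can probably be obtained in a single step with groupby(...).diff(). This is only a sketch of an alternative, not the approach used in the answer above. It rebuilds the sample frame from the question and assumes the rows are already in chronological order within each ID (otherwise sort by ID and TimeStamp first).

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'ID': ['A', 'A', 'B', 'B'],
    'TimeStamp': ['2022-08-02T17:33:44.358Z', '2022-08-02T17:33:44.600Z',
                  '2022-08-02T17:33:44.814Z', '2022-08-02T17:33:45.028Z'],
})

# Parse the ISO 8601 strings into timezone-aware datetimes
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# Difference to the previous row within the same ID;
# the first row of each group has no predecessor and becomes NaT
df['diff'] = df.groupby('ID')['TimeStamp'].diff()

print(df)
```

The first row of each ID comes out as NaT, matching the shift-and-subtract version.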