Creating time delta diff column based on groupby id

I have the following sample df

    import pandas as pd

    df = pd.DataFrame({'ID': ['A', 'A', 'B', 'B'],
                       'TimeStamp': ['2022-08-02T17:33:44.358Z',
                                     '2022-08-02T17:33:44.600Z',
                                     '2022-08-02T17:33:44.814Z',
                                     '2022-08-02T17:33:45.028Z']})

I want to group by ID and get the timedelta between consecutive timestamps. I managed to get something close to the wanted series with the code below, but it takes quite a long time. Is there a more efficient way to do it?

df.assign(post_data = df['TimeStamp'].shift(1)).groupby(['ID'])[['TimeStamp','post_data']].apply(lambda x : (x.iloc[:,0] - x.iloc[:,1]).to_frame('diff'))

Wanted series

{'diff': {0: NaT,
  1: Timedelta('0 days 00:00:00.242000'),
  2: Timedelta('0 days 00:00:00.214000'),
  3: Timedelta('0 days 00:00:00.214000')}}

Answer

Here is one way to do it. By the way, if you group by ID, the desired result you shared is incorrect: the third row should be NaT, since it is the first row of a different ID and has no earlier timestamp to subtract.

# convert the TimeStamp strings to datetime
df['TimeStamp'] = pd.to_datetime(df['TimeStamp'])

# create post_data with a vectorized groupby shift instead of a lambda; it will be faster
df['post_data'] = df.groupby('ID')['TimeStamp'].shift(1)

# finally, take the difference
df['diff'] = df['TimeStamp'].sub(df['post_data'])
df

    ID                        TimeStamp                            post_data                     diff
0   A   2022-08-02 17:33:44.358000+00:00                                 NaT                        NaT
1   A   2022-08-02 17:33:44.600000+00:00    2022-08-02 17:33:44.358000+00:00    0 days 00:00:00.242000
2   B   2022-08-02 17:33:44.814000+00:00                                 NaT                        NaT
3   B   2022-08-02 17:33:45.028000+00:00    2022-08-02 17:33:44.814000+00:00    0 days 00:00:00.214000
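
As a possible shortcut, the shift and the subtraction can likely be folded into a single call to the groupby `diff` method. This is a minimal sketch, not part of the original answer. It rebuilds the sample df from the question and assumes the rows are already in chronological order within each ID.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "TimeStamp": [
        "2022-08-02T17:33:44.358Z",
        "2022-08-02T17:33:44.600Z",
        "2022-08-02T17:33:44.814Z",
        "2022-08-02T17:33:45.028Z",
    ],
})

# Parse the ISO 8601 strings into timezone-aware datetimes.
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

# Difference between each row and the previous row of the same ID.
# The first row of every group has no predecessor, so it becomes NaT.
df["diff"] = df.groupby("ID")["TimeStamp"].diff()

print(df)
```

If the rows might not be sorted, calling `df.sort_values(["ID", "TimeStamp"])` first would be one way to make the differences meaningful.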

User contributions licensed under: CC BY-SA