I have a dataframe with different timestamp for each user, and I want to calculate the duration. I used this code to import my CSV files:
import pandas as pd
import glob
path = r'C:Users...Desktop'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
df = pd.read_csv(filename, index_col=None, header=0,encoding='ISO-8859-1')
li.append(df)
df = pd.concat(li, axis=0, ignore_index=True)
df.head()
ID timestamp 1828765 31-05-2021 22:27:03 1828765 31-05-2021 22:27:12 1828765 31-05-2021 22:27:13 1828765 31-05-2021 22:27:34 2056557 21-07-2021 10:27:12 2056557 21-07-2021 10:27:20 2056557 21-07-2021 10:27:22
And I want to get something like that
ID timestamp duration(s) 1828765 31-05-2021 22:27:03 NAN 1828765 31-05-2021 22:27:12 9 1828765 31-05-2021 22:27:13 1 1828765 31-05-2021 22:27:34 21 2056557 21-07-2021 10:27:12 NAN 2056557 21-07-2021 10:27:20 8 2056557 21-07-2021 10:27:22 2
I’ve used this code, but doesn’t work for me
import datetime
df['timestamp'] = pd.to_datetime(df['timestamp'], format = "%d-%m-%Y %H:%M:%S")
df['time_diff'] = 0
for i in range(df.shape[0] - 1):
df['time_diff'][i+1] = (datetime.datetime.min + (df['timestamp'][i+1] - df['timestamp'][i])).time()
Advertisement
Answer
Operations which occur over groups of values are GroupBy operations in pandas.
pandas supports mathematical operations over timestamps natively. For this reason, subtraction will give the correct duration between any two timestamps.
We’ve already successfully converted out timestamp column to datetime64[ns]
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
Now we can take the difference between rows within groups with Groupby.diff
df['duration'] = df.groupby('ID')['timestamp'].diff()
df
ID timestamp duration 0 1828765 2021-05-31 22:27:03 NaT 1 1828765 2021-05-31 22:27:12 0 days 00:00:09 2 1828765 2021-05-31 22:27:13 0 days 00:00:01 3 1828765 2021-05-31 22:27:34 0 days 00:00:21 4 2056557 2021-07-21 10:27:12 NaT 5 2056557 2021-07-21 10:27:20 0 days 00:00:08 6 2056557 2021-07-21 10:27:22 0 days 00:00:02
If we want to get the duration in seconds we can extract the total number of seconds using Series.dt.total_seconds:
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
df:
ID timestamp duration (s) 0 1828765 2021-05-31 22:27:03 NaN 1 1828765 2021-05-31 22:27:12 9.0 2 1828765 2021-05-31 22:27:13 1.0 3 1828765 2021-05-31 22:27:34 21.0 4 2056557 2021-07-21 10:27:12 NaN 5 2056557 2021-07-21 10:27:20 8.0 6 2056557 2021-07-21 10:27:22 2.0
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'ID': [1828765, 1828765, 1828765, 1828765, 2056557, 2056557, 2056557],
'timestamp': ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
'31-05-2021 22:27:13', '31-05-2021 22:27:34',
'21-07-2021 10:27:12', '21-07-2021 10:27:20',
'21-07-2021 10:27:22']
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
print(df)