I have a dataframe with different timestamps for each user, and I want to calculate the duration between consecutive timestamps for each user. I used this code to import my CSV files:
    import pandas as pd
    import glob

    path = r'C:Users...Desktop'
    all_files = glob.glob(path + "/*.csv")

    li = []
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0, encoding='ISO-8859-1')
        li.append(df)

    df = pd.concat(li, axis=0, ignore_index=True)
df.head()
    ID       timestamp
    1828765  31-05-2021 22:27:03
    1828765  31-05-2021 22:27:12
    1828765  31-05-2021 22:27:13
    1828765  31-05-2021 22:27:34
    2056557  21-07-2021 10:27:12
    2056557  21-07-2021 10:27:20
    2056557  21-07-2021 10:27:22
I want to get something like this:
    ID       timestamp            duration(s)
    1828765  31-05-2021 22:27:03  NAN
    1828765  31-05-2021 22:27:12  9
    1828765  31-05-2021 22:27:13  1
    1828765  31-05-2021 22:27:34  21
    2056557  21-07-2021 10:27:12  NAN
    2056557  21-07-2021 10:27:20  8
    2056557  21-07-2021 10:27:22  2
I tried this code, but it doesn't work for me:
    import datetime

    df['timestamp'] = pd.to_datetime(df['timestamp'], format = "%d-%m-%Y %H:%M:%S")
    df['time_diff'] = 0
    for i in range(df.shape[0] - 1):
        df['time_diff'][i+1] = (datetime.datetime.min + (df['timestamp'][i+1] - df['timestamp'][i])).time()
Answer
Operations that occur over groups of values are GroupBy operations in pandas. pandas supports mathematical operations over timestamps natively, so subtraction gives the correct duration between any two timestamps.
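As a minimal sketch of that native timestamp arithmetic, here are two of the timestamps from the question subtracted directly. The variable names `start`, `end` and `elapsed` are only illustrative.

```python
import pandas as pd

# Two timestamps taken from the sample data in the question
start = pd.Timestamp("2021-05-31 22:27:03")
end = pd.Timestamp("2021-05-31 22:27:12")

# Subtracting one Timestamp from another yields a Timedelta
elapsed = end - start

print(elapsed)                  # 0 days 00:00:09
print(elapsed.total_seconds())  # 9.0
```

The same subtraction works element-wise on whole datetime64[ns] columns, which is what the grouped diff relies on.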
We've already successfully converted our timestamp column to datetime64[ns]:
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
Now we can take the difference between consecutive rows within each group with GroupBy.diff:
df['duration'] = df.groupby('ID')['timestamp'].diff()
df:
            ID           timestamp        duration
    0  1828765 2021-05-31 22:27:03             NaT
    1  1828765 2021-05-31 22:27:12 0 days 00:00:09
    2  1828765 2021-05-31 22:27:13 0 days 00:00:01
    3  1828765 2021-05-31 22:27:34 0 days 00:00:21
    4  2056557 2021-07-21 10:27:12             NaT
    5  2056557 2021-07-21 10:27:20 0 days 00:00:08
    6  2056557 2021-07-21 10:27:22 0 days 00:00:02
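Note that diff compares each row with the previous row of the same group in the order the rows currently appear. The sample data is already in chronological order, but rows concatenated from several CSV files may not be. The sketch below assumes that situation and uses small, deliberately shuffled data made up for illustration (not the asker's files), sorting before taking the difference.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample (IDs 1 and 2 are made up)
df = pd.DataFrame({
    'ID': [1, 2, 1, 2, 1],
    'timestamp': ['31-05-2021 22:27:12', '21-07-2021 10:27:20',
                  '31-05-2021 22:27:03', '21-07-2021 10:27:12',
                  '31-05-2021 22:27:13'],
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")

# Sort by user, then by time, so each row follows its chronological predecessor
df = df.sort_values(['ID', 'timestamp']).reset_index(drop=True)

# The first row of each group has no predecessor and gets NaT
df['duration'] = df.groupby('ID')['timestamp'].diff()
print(df)
```

Without the sort, some of the differences in this sample would come out negative, because a later timestamp would precede an earlier one within the same ID.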
If we want to get the duration in seconds, we can extract the total number of seconds using Series.dt.total_seconds:
df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()
df:
            ID           timestamp  duration (s)
    0  1828765 2021-05-31 22:27:03           NaN
    1  1828765 2021-05-31 22:27:12           9.0
    2  1828765 2021-05-31 22:27:13           1.0
    3  1828765 2021-05-31 22:27:34          21.0
    4  2056557 2021-07-21 10:27:12           NaN
    5  2056557 2021-07-21 10:27:20           8.0
    6  2056557 2021-07-21 10:27:22           2.0
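The seconds come back as floats because NaN cannot be stored in a regular integer column. If whole numbers like those in the desired output are preferred, one option is the pandas nullable integer dtype "Int64". This is a sketch that assumes the timestamps have whole-second resolution, as in the sample data; the rounding step is there only to make the cast safe and would discard any fractional seconds.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID': [1828765, 1828765, 1828765, 1828765, 2056557, 2056557, 2056557],
    'timestamp': ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
                  '31-05-2021 22:27:13', '31-05-2021 22:27:34',
                  '21-07-2021 10:27:12', '21-07-2021 10:27:20',
                  '21-07-2021 10:27:22'],
})
df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")

# Float seconds, with NaN for the first row of each group
seconds = df.groupby('ID')['timestamp'].diff().dt.total_seconds()

# Round, then cast to the nullable integer dtype; missing values become <NA>
df['duration (s)'] = seconds.round().astype('Int64')
print(df)
```

The first row of each ID then shows <NA> instead of NaN, and the remaining values are the integers 9, 1, 21, 8 and 2.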
Complete Working Example:
    import pandas as pd

    df = pd.DataFrame({
        'ID': [1828765, 1828765, 1828765, 1828765, 2056557, 2056557, 2056557],
        'timestamp': ['31-05-2021 22:27:03', '31-05-2021 22:27:12',
                      '31-05-2021 22:27:13', '31-05-2021 22:27:34',
                      '21-07-2021 10:27:12', '21-07-2021 10:27:20',
                      '21-07-2021 10:27:22']
    })

    df['timestamp'] = pd.to_datetime(df['timestamp'], format="%d-%m-%Y %H:%M:%S")
    df['duration (s)'] = df.groupby('ID')['timestamp'].diff().dt.total_seconds()

    print(df)