I’m simply looking for a more intuitive and faster way to get start and end times of uninterrupted time sequences. Here’s a reproducible example as well as my way of doing it for the time being:
import pandas as pd import numpy as np import datetime data = ['1999-01-01 00:00:00', '1999-01-01 01:00:00', '1999-01-01 02:00:00', '1999-01-10 10:00:00', '1999-01-10 11:00:00', '1999-01-10 12:00:00', '1999-01-10 13:00:00', '1999-01-20 17:00:00', '1999-01-20 18:00:00', '1999-01-20 19:00:00'] df = pd.DataFrame(data, columns = ['time']) df['time'] = pd.to_datetime(df['time']) # Conversion: new_df = pd.DataFrame(columns=['Start Date', 'End Date']) new_df2 = pd.DataFrame(columns=['End Date']) df['diff'] = df['time'].diff(1) df['diff2'] = df['diff'].shift(-1) new_df['Start Date'] = df['time'].loc[df['diff'] != pd.to_timedelta(1, unit ='h')].reset_index(drop = True) new_df2['End Date'] = df['time'].loc[df['diff2'] != pd.to_timedelta(1, unit ='h')].reset_index(drop = True) new_df['End Date'] = new_df2['End Date'] new_df['Duration [Hours]'] = (new_df['End Date'] - new_df['Start Date']) / np.timedelta64(1, 'h') print(new_df)
Resulting dataframe:
Start Date End Date Duration [Hours] 0 1999-01-01 00:00:00 1999-01-01 02:00:00 2.0 1 1999-01-10 10:00:00 1999-01-10 13:00:00 3.0 2 1999-01-20 17:00:00 1999-01-20 19:00:00 2.0
Any kind of help will be valuable.
Advertisement
Answer
Another approach is to create a group
column indicating which group each row belongs to. This can be done by using shift
and cumsum
. After that, we can simply use groupby
to create the wanted columns.
df['group'] = (df['time'].shift(1) != df['time'] - pd.Timedelta(hours=1)).cumsum() df = df.groupby('group')['time'].agg(['first', 'last', 'count']) df['count'] = df['count'] - 1 df.columns = ['Start Date', 'End Date', 'Duration [Hours]'] df = df.reset_index(drop=True)
We need to remove 1 from the count value since it’s the time we want, not simply the number of rows in each group. The last two lines are simply for cleanup to match the wanted output dataframe.
Result:
Start Date End Date Duration [Hours] 0 1999-01-01 00:00:00 1999-01-01 02:00:00 2 1 1999-01-10 10:00:00 1999-01-10 13:00:00 3 2 1999-01-20 17:00:00 1999-01-20 19:00:00 2