Skip to content
Advertisement

Converting a time series into start & end dates using Pandas

I’m simply looking for a more intuitive and faster way to get start and end times of uninterrupted time sequences. Here’s a reproducible example as well as my way of doing it for the time being:

import pandas as pd
import numpy as np
import datetime

data = ['1999-01-01 00:00:00', '1999-01-01 01:00:00', '1999-01-01 02:00:00',
        '1999-01-10 10:00:00', '1999-01-10 11:00:00', '1999-01-10 12:00:00', '1999-01-10 13:00:00',
        '1999-01-20 17:00:00', '1999-01-20 18:00:00', '1999-01-20 19:00:00']
df = pd.DataFrame(data, columns = ['time'])
df['time'] = pd.to_datetime(df['time'])

# Conversion:
new_df = pd.DataFrame(columns=['Start Date', 'End Date'])
new_df2 = pd.DataFrame(columns=['End Date'])

df['diff'] = df['time'].diff(1)
df['diff2'] = df['diff'].shift(-1)

new_df['Start Date'] = df['time'].loc[df['diff'] != pd.to_timedelta(1, unit ='h')].reset_index(drop = True)
new_df2['End Date'] = df['time'].loc[df['diff2'] != pd.to_timedelta(1, unit ='h')].reset_index(drop = True)

new_df['End Date'] = new_df2['End Date']
new_df['Duration [Hours]'] = (new_df['End Date'] - new_df['Start Date']) / np.timedelta64(1, 'h')

print(new_df)

Resulting dataframe:

           Start Date            End Date  Duration [Hours]
0 1999-01-01 00:00:00 1999-01-01 02:00:00               2.0
1 1999-01-10 10:00:00 1999-01-10 13:00:00               3.0
2 1999-01-20 17:00:00 1999-01-20 19:00:00               2.0

Any kind of help will be valuable.

Advertisement

Answer

Another approach is to create a group column indicating which group each row belongs to. This can be done by using shift and cumsum. After that, we can simply use groupby to create the wanted columns.

df['group'] = (df['time'].shift(1) != df['time'] - pd.Timedelta(hours=1)).cumsum()
df = df.groupby('group')['time'].agg(['first', 'last', 'count'])
df['count'] = df['count'] - 1

df.columns = ['Start Date', 'End Date', 'Duration [Hours]']
df = df.reset_index(drop=True)

We need to remove 1 from the count value since it’s the time we want, not simply the number of rows in each group. The last two lines are simply for cleanup to match the wanted output dataframe.

Result:

             Start Date            End Date Duration [Hours]
0   1999-01-01 00:00:00 1999-01-01 02:00:00                2
1   1999-01-10 10:00:00 1999-01-10 13:00:00                3
2   1999-01-20 17:00:00 1999-01-20 19:00:00                2
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement