Skip to content
Advertisement

Merging segments from the same trips into a single trip for analysis

In the MWE below, I show my attempt to line-plot trips (from my df aggregated per month):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

d = {'user': [62, 62, 154, 154, 154, 73, 73, 73, 73],
     'start': ['2008-06-20 04:21:40','2008-06-20 05:40:31','2007-05-01 10:10:00',
         '2007-05-01 10:36:00','2007-05-01 15:30:00','2008-05-19 16:25:54',
          '2008-05-21 02:21:37','2008-05-22 01:30:09','2008-05-29 01:55:59'],
    'end': ['2008-06-20 05:33:46','2008-06-20 05:53:11', '2007-05-01 10:36:00',
         '2007-05-01 11:00:00','2007-05-01 15:55:00','2008-05-19 16:29:22',
         '2008-05-21 02:25:04','2008-05-22 01:33:51','2008-05-29 01:59:25'],
 'mode': ['bus','walk','bus','walk','taxi','walk','walk','walk','walk'],}

df = pd.DataFrame(d)
df[['start', 'end']] = df[['start', 'end']].apply(pd.to_datetime)

df['Date'] = df['start'].dt.strftime('%Y-%m')

df.sort_values(['Date' ], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

# aggregate monthly trips
df = df.groupby(['Date']).size().reset_index().rename(
    columns={0:'Trips'})
df
    Date   Trips
0   2007-05 3
1   2008-05 4
2   2008-06 2

# the I can plot trips like so:
sns.lineplot(data=df, x = df['Date'], y='Trips')

I realised in my df, some trips contains jump (maybe due to data log), so they should be merged into single trip before aggregation. In the given df example above (before grouping).

df
   user        start                 end       mode     Date
0   154 2007-05-01 10:10:00 2007-05-01 10:36:00 bus     2007-05
1   154 2007-05-01 10:36:00 2007-05-01 11:00:00 walk    2007-05
2   154 2007-05-01 15:30:00 2007-05-01 15:55:00 taxi    2007-05
3   73  2008-05-19 16:25:54 2008-05-19 16:29:22 walk    2008-05
4   73  2008-05-21 02:21:37 2008-05-21 02:25:04 walk    2008-05
5   73  2008-05-22 01:30:09 2008-05-22 01:33:51 walk    2008-05
6   73  2008-05-29 01:55:59 2008-05-29 01:59:25 walk    2008-05
7   62  2008-06-20 04:21:40 2008-06-20 05:33:46 bus     2008-06
8   62  2008-06-20 05:40:31 2008-06-20 05:53:11 walk    2008-06

User 154 does undertake 2-trips, not 3. First started at 10:10:00 on bus then changes travel mode to at 10:36:00 to walk. Also user 62 does only a trip, started at 04:21:40 - 05:33:46 on a bus, then stopped, waited for about 6-minutes then continued to at about 05:40:31.

Question

I want to merge such trips into a single trip so that if the time interval between the end of a user’s trip to beginning of the next is less than 30 minutes, that should be considered a single trip. Then I can plot,

In the above MWE, the aggregated trips are “over-estimated”, the expected final df should be:

    Date     Trips
0   2007-05     2
1   2008-05     4
2   2008-06     1

Advertisement

Answer

You can use a custom function:

offset = pd.DateOffset(minutes=30)
merge_trip = lambda x: x['start'].ge(x['end'].shift() + offset).cumsum().add(1)

df['Trips'] = df.groupby('user').apply(merge_trip).droplevel('user')

out = df.groupby('Date', as_index=False)['Trips'].max()

Output:

>>> out
      Date  Trips
0  2007-05      2
1  2008-05      4
2  2008-06      1

>>> df
   user               start                 end  mode     Date  Trips
0   154 2007-05-01 10:10:00 2007-05-01 10:36:00   bus  2007-05      1
1   154 2007-05-01 10:36:00 2007-05-01 11:00:00  walk  2007-05      1
2   154 2007-05-01 15:30:00 2007-05-01 15:55:00  taxi  2007-05      2
3    73 2008-05-19 16:25:54 2008-05-19 16:29:22  walk  2008-05      1
4    73 2008-05-21 02:21:37 2008-05-21 02:25:04  walk  2008-05      2
5    73 2008-05-22 01:30:09 2008-05-22 01:33:51  walk  2008-05      3
6    73 2008-05-29 01:55:59 2008-05-29 01:59:25  walk  2008-05      4
7    62 2008-06-20 04:21:40 2008-06-20 05:33:46   bus  2008-06      1
8    62 2008-06-20 05:40:31 2008-06-20 05:53:11  walk  2008-06      1
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement