Pandas: Subtract timestamps

I grouped a dataframe test_df2 by frequency 'B' (by business day, so each name of the group is the date of that day at 00:00) and am now looping over the groups to calculate timestamp differences and save them in the dict grouped_bins. The data in the original dataframe and the groups looks like this:

	timestamp	status	externalId
0	2020-05-11 13:06:05.922	1	1
7	2020-05-11 13:14:29.759	10	1
8	2020-05-11 13:16:09.147	1	2
16	2020-05-11 13:19:08.641	10	2

What I want is to calculate the difference between each row’s timestamp, for example of rows 7 and 0, since they have the same externalId.

What I did for that purpose is the following.

# Group function. Dataframes are saved in a dict.
def groupDataWithFrequency(self, dataFrameLabel: str, groupKey: str, frequency: str):
    '''Group time series by frequency. Starts at the beginning of the data frame.'''
    print(f"Binning {dataFrameLabel} data with frequency {frequency}")
    if (isinstance(groupKey, str)):
        return self.dataDict[dataFrameLabel].groupby(pd.Grouper(key=groupKey, freq=frequency, origin="start"))

grouped_jobstates = groupDataWithFrequency("jobStatus", "timestamp", frequency)

JavaScript
​x
 
# Group function. Dataframes are saved in a dict.
def groupDataWithFrequency(self, dataFrameLabel: str, groupKey: str, frequency: str):
    '''Group time series by frequency. Starts at the beginning of the data frame.'''
    print(f"Binning {dataFrameLabel} data with frequency {frequency}")
    if (isinstance(groupKey, str)):
        return self.dataDict[dataFrameLabel].groupby(pd.Grouper(key=groupKey, freq=frequency, origin="start"))
​
grouped_jobstates = groupDataWithFrequency("jobStatus", "timestamp", frequency)
​

After grouping, I loop over each group (to maintain the day) and try to calculate the difference between the timestamos, which is where it goes wrong.

grouped_bins = {}

def jobStatusPRAggregator(data, name):
    if (data["status"] == 1):

        # Find corresponding element in original dataframe
        correspondingStatus = test_df2.loc[(test_df2["externalId"] == data["externalId"]) & (test_df2["timestamp"] != data["timestamp"])]

        # Calculate time difference
        time = correspondingStatus["timestamp"] - data["timestamp"]

        # some prints:
        print(type(time))
        # <class 'pandas.core.series.Series'> --> Why is this a series?

        print(time.array[0])
        # 0 days 00:08:23.837000 --> This looks correct, I think?

        print(time)
        # 7   0 days 00:08:23.837000
        # Name: timestamp, dtype: timedelta64[ns]
    
        # Check if element exists in dict
        elem = next((x for x in grouped_bins if ((x["startDate"] == name) ("productiveTime" in x))), None)
      
        # If does not exist yet, add to dict
        if elem is None:
            grouped_bins.append( {"startDate": name, "productiveTime": time })
        else:
            elem["productiveTime"] = elem["productiveTime"]
            # See below for problem

# Loop over groups
for name, group in grouped_jobstates:
    group.apply(jobStatusPRAggregator, args=(name,), axis=1)

JavaScript
 
grouped_bins = {}
​
def jobStatusPRAggregator(data, name):
    if (data["status"] == 1):
​
        # Find corresponding element in original dataframe
        correspondingStatus = test_df2.loc[(test_df2["externalId"] == data["externalId"]) & (test_df2["timestamp"] != data["timestamp"])]
​
        # Calculate time difference
        time = correspondingStatus["timestamp"] - data["timestamp"]
​
        # some prints:
        print(type(time))
        # <class 'pandas.core.series.Series'> --> Why is this a series?
​
        print(time.array[0])
        # 0 days 00:08:23.837000 --> This looks correct, I think?
​
        print(time)
        # 7   0 days 00:08:23.837000
        # Name: timestamp, dtype: timedelta64[ns]
    
        # Check if element exists in dict
        elem = next((x for x in grouped_bins if ((x["startDate"] == name) ("productiveTime" in x))), None)
      
        # If does not exist yet, add to dict
        if elem is None:
            grouped_bins.append( {"startDate": name, "productiveTime": time })
        else:
            elem["productiveTime"] = elem["productiveTime"]
            # See below for problem
​
# Loop over groups
for name, group in grouped_jobstates:
    group.apply(jobStatusPRAggregator, args=(name,), axis=1)
​

The problem I face is the following. The element in the dict (elem) looks like this in the end:

{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 0      NaT
7      NaT
8      NaT
16     NaT
17     NaT
        ..
1090   NaT
1091   NaT
1099   NaT
1100   NaT
1107   NaT
Name: timestamp, Length: 254, dtype: timedelta64[ns]}

JavaScript
 
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 0      NaT
7      NaT
8      NaT
16     NaT
17     NaT
        ..
1090   NaT
1091   NaT
1099   NaT
1100   NaT
1107   NaT
Name: timestamp, Length: 254, dtype: timedelta64[ns]}
​

What I want is something like this:

{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 2 Days 12 hours 39 minutes 29 seconds
Name: timestamp, Length: 254, dtype: timedelta64[ns]}

JavaScript
 
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 2 Days 12 hours 39 minutes 29 seconds
Name: timestamp, Length: 254, dtype: timedelta64[ns]}
​

Though I am open to suggestions on how to store time durations in Python/Pandas.

I am also open to suggestions regarding the loop itself.

Answer

To obtain timestamp differences between consecutive rows of the same externalId, you should be able to simply write, for example:

df2 = df.assign(delta=df.groupby('externalId')['timestamp'].diff())

JavaScript
 
df2 = df.assign(delta=df.groupby('externalId')['timestamp'].diff())
​

On the example you give:

>>> df2
                 timestamp  status  externalId                  delta
0  2020-05-11 13:06:05.922       1           1                    NaT
7  2020-05-11 13:14:29.759      10           1 0 days 00:08:23.837000
8  2020-05-11 13:16:09.147       1           2                    NaT
16 2020-05-11 13:19:08.641      10           2 0 days 00:02:59.494000

JavaScript
 
>>> df2
                 timestamp  status  externalId                  delta
0  2020-05-11 13:06:05.922       1           1                    NaT
7  2020-05-11 13:14:29.759      10           1 0 days 00:08:23.837000
8  2020-05-11 13:16:09.147       1           2                    NaT
16 2020-05-11 13:19:08.641      10           2 0 days 00:02:59.494000
​

If your timestamps are not actually of type Timestamp yet, then you can convert them into it first:

df['timestamp'] = pd.to_datetime(df['timestamp'])

JavaScript
 
df['timestamp'] = pd.to_datetime(df['timestamp'])
​

Advertisement

Answer