I grouped a dataframe test_df2
by frequency 'B'
(by business day, so each name of the group is the date of that day at 00:00) and am now looping over the groups to calculate timestamp differences and save them in the dict grouped_bins
. The data in the original dataframe and the groups looks like this:
timestamp | status | externalId | |
---|---|---|---|
0 | 2020-05-11 13:06:05.922 | 1 | 1 |
7 | 2020-05-11 13:14:29.759 | 10 | 1 |
8 | 2020-05-11 13:16:09.147 | 1 | 2 |
16 | 2020-05-11 13:19:08.641 | 10 | 2 |
What I want is to calculate the difference between each row’s timestamp
, for example of rows 7
and 0
, since they have the same externalId
.
What I did for that purpose is the following.
# Group function. Dataframes are saved in a dict. def groupDataWithFrequency(self, dataFrameLabel: str, groupKey: str, frequency: str): '''Group time series by frequency. Starts at the beginning of the data frame.''' print(f"Binning {dataFrameLabel} data with frequency {frequency}") if (isinstance(groupKey, str)): return self.dataDict[dataFrameLabel].groupby(pd.Grouper(key=groupKey, freq=frequency, origin="start")) grouped_jobstates = groupDataWithFrequency("jobStatus", "timestamp", frequency)
After grouping, I loop over each group (to maintain the day) and try to calculate the difference between the timestamos, which is where it goes wrong.
grouped_bins = {} def jobStatusPRAggregator(data, name): if (data["status"] == 1): # Find corresponding element in original dataframe correspondingStatus = test_df2.loc[(test_df2["externalId"] == data["externalId"]) & (test_df2["timestamp"] != data["timestamp"])] # Calculate time difference time = correspondingStatus["timestamp"] - data["timestamp"] # some prints: print(type(time)) # <class 'pandas.core.series.Series'> --> Why is this a series? print(time.array[0]) # 0 days 00:08:23.837000 --> This looks correct, I think? print(time) # 7 0 days 00:08:23.837000 # Name: timestamp, dtype: timedelta64[ns] # Check if element exists in dict elem = next((x for x in grouped_bins if ((x["startDate"] == name) ("productiveTime" in x))), None) # If does not exist yet, add to dict if elem is None: grouped_bins.append( {"startDate": name, "productiveTime": time }) else: elem["productiveTime"] = elem["productiveTime"] # See below for problem # Loop over groups for name, group in grouped_jobstates: group.apply(jobStatusPRAggregator, args=(name,), axis=1)
The problem I face is the following. The element in the dict (elem
) looks like this in the end:
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 0 NaT 7 NaT 8 NaT 16 NaT 17 NaT .. 1090 NaT 1091 NaT 1099 NaT 1100 NaT 1107 NaT Name: timestamp, Length: 254, dtype: timedelta64[ns]}
What I want is something like this:
{'startDate': Timestamp('2020-05-11 00:00:00', freq='B'), 'productiveTime': 2 Days 12 hours 39 minutes 29 seconds Name: timestamp, Length: 254, dtype: timedelta64[ns]}
Though I am open to suggestions on how to store time durations in Python/Pandas.
I am also open to suggestions regarding the loop itself.
Advertisement
Answer
To obtain timestamp differences between consecutive rows of the same externalId
, you should be able to simply write, for example:
df2 = df.assign(delta=df.groupby('externalId')['timestamp'].diff())
On the example you give:
>>> df2 timestamp status externalId delta 0 2020-05-11 13:06:05.922 1 1 NaT 7 2020-05-11 13:14:29.759 10 1 0 days 00:08:23.837000 8 2020-05-11 13:16:09.147 1 2 NaT 16 2020-05-11 13:19:08.641 10 2 0 days 00:02:59.494000
If your timestamps are not actually of type Timestamp
yet, then you can convert them into it first:
df['timestamp'] = pd.to_datetime(df['timestamp'])