I have a CSV file that stores data from different smartphone sensors. The timestamps are elapsed nanoseconds since the program to record the data was started. Short example:
timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels
25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER
28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER
31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER
33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER
36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER
38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER
41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER
43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER
46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER
The time steps between the timestamps are not equal, but I would like them to be. My question is how to achieve this? I was thinking about simply downsampling the nanoseconds to microseconds using the code below. This is my first attempt that does not return an error during execution, but it returns a CSV file without the timestamps and every row after the first is completely empty.
series = pandas.read_csv("file3.csv", header=0, index_col=0, squeeze=True, nrows=1000)
series.index = pandas.to_datetime(series.index, unit='ns')
downsampled = series.resample("U").mean()
downsampled.to_csv("file4.csv", index=False)
I would be thankful for ways to improve my code as well as other ideas to achieve my goal in general.
Advertisement
Answer
When you resample over milliseconds, there aren’t enough values to fill consecutive buckets, so you end up with NaN’s.
If you want your timesteps to be equal while also having all buckets filled, you can find the maximum difference and use that as the resampling rate:
First, set the index to be Timedelta
‘s, since it’s the time elapsed since the app started.
df.index = df.index.map(lambda t: pd.Timedelta(t, unit='ns'))
df.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028129496',
'0 days 00:00:00.031028666', '0 days 00:00:00.033164897',
'0 days 00:00:00.036064067', '0 days 00:00:00.038200297',
'0 days 00:00:00.041099467', '0 days 00:00:00.043235697',
'0 days 00:00:00.046134867'],
dtype='timedelta64[ns]', name='timestamps', freq=None)
Next, resampling:
import numpy as np
max_diff = np.diff(df.index).max()
# numpy.timedelta64(2899170,'ns')
# convert to pandas.Timedelta to use it with `resample`
dfr = df.resample(pd.Timedelta(max_diff)).mean()
dfr
Output:
acce_x acce_y acce_z grav_x grav_y grav_z
timestamps
0 days 00:00:00.025993266 -2.529037 6.918060 4.340012 -2.888277 7.903406 5.034998
0 days 00:00:00.028892436 -2.537415 6.931229 4.605766 -2.850200 7.807237 5.205172
0 days 00:00:00.031791606 -2.545792 6.944397 4.871521 -2.796879 7.735252 5.337435
0 days 00:00:00.034690776 -2.472771 6.912071 5.180374 -2.743558 7.663267 5.469699
0 days 00:00:00.037589946 -2.399750 6.879746 5.489227 -2.664888 7.592961 5.602406
0 days 00:00:00.040489116 -2.187862 6.843834 5.941738 -2.544471 7.490385 5.794085
0 days 00:00:00.043388286 -1.990341 6.810318 6.321220 -2.419225 7.393572 5.971000
And to verify that your index is evenly spaced, it has freq='2899170N'
:
dfr.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028892436',
'0 days 00:00:00.031791606', '0 days 00:00:00.034690776',
'0 days 00:00:00.037589946', '0 days 00:00:00.040489116',
'0 days 00:00:00.043388286'],
dtype='timedelta64[ns]', name='timestamps', freq='2899170N')
Or check via diff:
np.diff(dfr.index)
# output:
array([2899170, 2899170, 2899170, 2899170, 2899170, 2899170],
dtype='timedelta64[ns]')