I have a CSV file that stores data from different smartphone sensors. The timestamps are elapsed nanoseconds since the program to record the data was started. Short example:
timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels 25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER 28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER 31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER 33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER 36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER 38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER 41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER 43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER 46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER
The time steps between the timestamps are not equal, but I would like them to be. My question is how to achieve this? I was thinking about simply downsampling the nanoseconds to microseconds using the code below. This is my first attempt that does not return an error during execution, but it returns a CSV file without the timestamps and every row after the first is completely empty.
series = pandas.read_csv("file3.csv", header=0, index_col=0, squeeze=True, nrows=1000) series.index = pandas.to_datetime(series.index, unit='ns') downsampled = series.resample("U").mean() downsampled.to_csv("file4.csv", index=False)
I would be thankful for ways to improve my code as well as other ideas to achieve my goal in general.
Advertisement
Answer
When you resample over milliseconds, there aren’t enough values to fill consecutive buckets, so you end up with NaN’s.
If you want your timesteps to be equal while also having all buckets filled, you can find the maximum difference and use that as the resampling rate:
First, set the index to be Timedelta
‘s, since it’s the time elapsed since the app started.
df.index = df.index.map(lambda t: pd.Timedelta(t, unit='ns')) df.index # output: TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028129496', '0 days 00:00:00.031028666', '0 days 00:00:00.033164897', '0 days 00:00:00.036064067', '0 days 00:00:00.038200297', '0 days 00:00:00.041099467', '0 days 00:00:00.043235697', '0 days 00:00:00.046134867'], dtype='timedelta64[ns]', name='timestamps', freq=None)
Next, resampling:
import numpy as np max_diff = np.diff(df.index).max() # numpy.timedelta64(2899170,'ns') # convert to pandas.Timedelta to use it with `resample` dfr = df.resample(pd.Timedelta(max_diff)).mean() dfr
Output:
acce_x acce_y acce_z grav_x grav_y grav_z timestamps 0 days 00:00:00.025993266 -2.529037 6.918060 4.340012 -2.888277 7.903406 5.034998 0 days 00:00:00.028892436 -2.537415 6.931229 4.605766 -2.850200 7.807237 5.205172 0 days 00:00:00.031791606 -2.545792 6.944397 4.871521 -2.796879 7.735252 5.337435 0 days 00:00:00.034690776 -2.472771 6.912071 5.180374 -2.743558 7.663267 5.469699 0 days 00:00:00.037589946 -2.399750 6.879746 5.489227 -2.664888 7.592961 5.602406 0 days 00:00:00.040489116 -2.187862 6.843834 5.941738 -2.544471 7.490385 5.794085 0 days 00:00:00.043388286 -1.990341 6.810318 6.321220 -2.419225 7.393572 5.971000
And to verify that your index is evenly spaced, it has freq='2899170N'
:
dfr.index # output: TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028892436', '0 days 00:00:00.031791606', '0 days 00:00:00.034690776', '0 days 00:00:00.037589946', '0 days 00:00:00.040489116', '0 days 00:00:00.043388286'], dtype='timedelta64[ns]', name='timestamps', freq='2899170N')
Or check via diff:
np.diff(dfr.index) # output: array([2899170, 2899170, 2899170, 2899170, 2899170, 2899170], dtype='timedelta64[ns]')