Skip to content
Advertisement

Resampling timestamps in a CSV

I have a CSV file that stores data from different smartphone sensors. The timestamps are elapsed nanoseconds since the program to record the data was started. Short example:

timestamps,acce_x,acce_y,acce_z,grav_x,grav_y,grav_z,labels
25993266,-2.5290375,6.9180603,4.3400116,-2.9009695,7.935462,4.978274,OTHER
28129496,-2.5290375,6.9180603,4.3400116,-2.87558475,7.87134935,5.091722799999999,OTHER
31028666,-2.53741455,6.9312286499999995,4.605766300000001,-2.8502,7.8072367,5.2051716,OTHER
33164897,-2.5457916,6.944397,4.871521,-2.79687885,7.73525185,5.3374355,OTHER
36064067,-2.4727707,6.91207125,5.1803741500000005,-2.7435577,7.663267,5.4696994,OTHER
38200297,-2.3997498,6.8797455,5.4892273,-2.6648885,7.59296125,5.6024062,OTHER
41099467,-2.25849155,6.85580445,5.79090115,-2.5862193,7.5226555,5.735113,OTHER
43235697,-2.1172333,6.8318634,6.092575,-2.50272225,7.45811375,5.85305635,OTHER
46134867,-1.9903412,6.810318,6.32122035,-2.4192252,7.393572,5.9709997,OTHER

The time steps between the timestamps are not equal, but I would like them to be. My question is how to achieve this? I was thinking about simply downsampling the nanoseconds to microseconds using the code below. This is my first attempt that does not return an error during execution, but it returns a CSV file without the timestamps and every row after the first is completely empty.

series = pandas.read_csv("file3.csv", header=0, index_col=0, squeeze=True, nrows=1000)
series.index = pandas.to_datetime(series.index, unit='ns')
downsampled = series.resample("U").mean()
downsampled.to_csv("file4.csv", index=False)

I would be thankful for ways to improve my code as well as other ideas to achieve my goal in general.

Advertisement

Answer

When you resample over milliseconds, there aren’t enough values to fill consecutive buckets, so you end up with NaN’s.

If you want your timesteps to be equal while also having all buckets filled, you can find the maximum difference and use that as the resampling rate:

First, set the index to be Timedelta‘s, since it’s the time elapsed since the app started.

df.index = df.index.map(lambda t: pd.Timedelta(t, unit='ns'))
df.index

# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028129496',
                '0 days 00:00:00.031028666', '0 days 00:00:00.033164897',
                '0 days 00:00:00.036064067', '0 days 00:00:00.038200297',
                '0 days 00:00:00.041099467', '0 days 00:00:00.043235697',
                '0 days 00:00:00.046134867'],
               dtype='timedelta64[ns]', name='timestamps', freq=None)

Next, resampling:

import numpy as np

max_diff = np.diff(df.index).max()
# numpy.timedelta64(2899170,'ns')

# convert to pandas.Timedelta to use it with `resample`
dfr = df.resample(pd.Timedelta(max_diff)).mean()
dfr

Output:

                             acce_x    acce_y    acce_z    grav_x    grav_y    grav_z
timestamps                                                                           
0 days 00:00:00.025993266 -2.529037  6.918060  4.340012 -2.888277  7.903406  5.034998
0 days 00:00:00.028892436 -2.537415  6.931229  4.605766 -2.850200  7.807237  5.205172
0 days 00:00:00.031791606 -2.545792  6.944397  4.871521 -2.796879  7.735252  5.337435
0 days 00:00:00.034690776 -2.472771  6.912071  5.180374 -2.743558  7.663267  5.469699
0 days 00:00:00.037589946 -2.399750  6.879746  5.489227 -2.664888  7.592961  5.602406
0 days 00:00:00.040489116 -2.187862  6.843834  5.941738 -2.544471  7.490385  5.794085
0 days 00:00:00.043388286 -1.990341  6.810318  6.321220 -2.419225  7.393572  5.971000

And to verify that your index is evenly spaced, it has freq='2899170N':

dfr.index
# output:
TimedeltaIndex(['0 days 00:00:00.025993266', '0 days 00:00:00.028892436',
                '0 days 00:00:00.031791606', '0 days 00:00:00.034690776',
                '0 days 00:00:00.037589946', '0 days 00:00:00.040489116',
                '0 days 00:00:00.043388286'],
               dtype='timedelta64[ns]', name='timestamps', freq='2899170N')

Or check via diff:

np.diff(dfr.index)
# output:
array([2899170, 2899170, 2899170, 2899170, 2899170, 2899170],
      dtype='timedelta64[ns]')
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement