I have data where 13000 high-frequency time-series data points are mapped as one column per point, associated with the timestamp at which the data was measured. I read this data from an InfluxDB and get a pandas.DataFrame structured as follows:
acquisition time stamp (DateTimeIndex), datapoint1 (int), ..., datapointN (int)
I know the sampling frequency of the high-frequency data (10 MHz), so each sample covers 100 nanoseconds. Given the acquisition timestamp, I can calculate back to give each of the data points its own timestamp, spaced 100 nanoseconds apart.
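For a single acquisition, that back-calculation might be sketched like this (a minimal sketch; the variable names are mine, and it assumes, as below, that the acquisition stamp marks the last sample):

import pandas as pd

# Sketch of the back-calculation for one acquisition.
# Assumption (from the question): the acquisition stamp is the time
# of the LAST sample in the burst.
acq_stamp = pd.Timestamp('2022-05-15T21:00:02.660160000')
n_samples = 4                       # 13000 in the real data
period_ns = int(1e9 / 1e7)          # 100 ns at 10 MHz

# Offsets count backwards from the acquisition stamp: the first sample
# is (N - 1) periods earlier, the last sample 0 periods earlier.
offsets = pd.to_timedelta((n_samples - 1 - pd.RangeIndex(n_samples)) * period_ns, unit='ns')
sample_times = acq_stamp - offsets  # DatetimeIndex, 100 ns apart
print(sample_times)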
In total I have hundreds of thousands of rows like the above, which I would like to convert into a nanosecond-precision time series that should look as follows:
time stamp 1 (DateTimeIndex), acquisition stamp 1, datapoint1,1
time stamp 2 (DateTimeIndex), acquisition stamp 1, datapoint1,2
...
time stamp N (DateTimeIndex), acquisition stamp 1, datapoint1,N
time stamp N + 1 (DateTimeIndex), acquisition stamp 2, datapoint2,1
time stamp N + 2 (DateTimeIndex), acquisition stamp 2, datapoint2,2
...
time stamp 2*N (DateTimeIndex), acquisition stamp 2, datapoint2,N
...
time stamp M*N (DateTimeIndex), acquisition stamp M, datapointM,N
Where N is the number of data points (= number of columns in the original frame = 13000) and M is the number of rows (= individual measurement data sets, each with 13000 data points).
Currently, my code looks as follows:
out_data = DataFrame()
out_data = data_from_influx.apply(lambda item: processRawdataColumns(item, sampling_frequency, out_data))
with
def processRawdataColumns(raw_data: Series, sampling_frequency: int, result_frame: DataFrame) -> DataFrame:
    """
    Method transforming raw data organized as an indexed array into time-series data,
    assuming that the initial acquisition timestamp equals the time stamp of the last
    sample in the series.

    :param raw_data: initial data; expected is a pandas.Series where Series.name = acquisition timestamp
    :param sampling_frequency: the sampling frequency of the time-resolved raw data
    :param result_frame: the frame the transformed rows are appended to
    :return: pandas.DataFrame with time-stamped raw data
    """
    try:
        acq_timestamp = raw_data.name.value
        processed_data: DataFrame = raw_data.to_frame()
        processed_data = processed_data.reset_index()
        # Recover the sample number from column names like 'rawdata000001'.
        processed_data['index'] = processed_data['index'].apply(
            lambda item: int(item.replace('rawdata', '').lstrip('0')))
        processed_data['acqtimestamp'] = raw_data.name
        # Back-calculate one time stamp per sample from the acquisition stamp.
        processed_data['time'] = processed_data['index'].apply(
            lambda index: acq_timestamp - int((len(raw_data) - index - 1) * 1E9 / sampling_frequency))
        processed_data = processed_data.drop(columns=['index'])
        processed_data['time'] = pd.to_datetime(processed_data['time'])
        processed_data = processed_data.rename(columns={raw_data.name: 'rawdata'})
        processed_data = processed_data.set_index('time')
        result_frame = result_frame.append(processed_data)
        return result_frame
    except Exception as err:
        print(err)
        return DataFrame(index=['time'])
However, the data is not properly structured. I get an output frame of M rows, each containing one DataFrame with a properly transformed high-frequency time series. How can I get to the “flat” structure of time-indexed rows, each with one value and a time stamp?
As sample data it is enough to consider data like this (sampling_frequency = 1E7):
  acq_stamp                      rawdata000001  rawdata000002  rawdata000003  rawdata000004
0 2022-05-15T21:00:02.660160000              1              2              3              4
1 2022-05-15T21:00:04.660160000              5              6              7              8
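For reference, that sample input can be rebuilt as a plain DataFrame like this (a sketch; df is the frame name the answer below operates on):

import pandas as pd

# Rebuild the sample frame shown above.
df = pd.DataFrame({
    'acq_stamp': ['2022-05-15T21:00:02.660160000',
                  '2022-05-15T21:00:04.660160000'],
    'rawdata000001': [1, 5],
    'rawdata000002': [2, 6],
    'rawdata000003': [3, 7],
    'rawdata000004': [4, 8],
})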
This should become
  time stamp                     acq_stamp                      value
0 2022-05-15T21:00:02.660159700  2022-05-15T21:00:02.660160000      1
1 2022-05-15T21:00:02.660159800  2022-05-15T21:00:02.660160000      2
2 2022-05-15T21:00:02.660159900  2022-05-15T21:00:02.660160000      3
3 2022-05-15T21:00:02.660160000  2022-05-15T21:00:02.660160000      4
4 2022-05-15T21:00:04.660159700  2022-05-15T21:00:04.660160000      5
5 2022-05-15T21:00:04.660159800  2022-05-15T21:00:04.660160000      6
6 2022-05-15T21:00:04.660159900  2022-05-15T21:00:04.660160000      7
7 2022-05-15T21:00:04.660160000  2022-05-15T21:00:04.660160000      8
Answer
I’m not sure I’m fully understanding the rebasing logic, but this does lead from your sample input to your sample output.
import pandas as pd

# Convert to pandas datetime.
df.acq_stamp = pd.to_datetime(df.acq_stamp)

# Melt your dataframe, keeping the acq_stamp column.
df = df.melt('acq_stamp', var_name='rawdata')

# Get the numerical value out of the rawdata column name.
df.rawdata = df.rawdata.str.extract(r'(\d+)').astype(int)

# Find the absolute difference between each rawdata point and the max rawdata point,
# make this into units of 100 ns and subtract it from the acq_stamp.
rawdiff_as_ns = df.rawdata.sub(df.rawdata.max()).abs().mul(100)
df['timestamp'] = df.acq_stamp.sub(pd.to_timedelta(rawdiff_as_ns, unit='ns'))

# Sort your data.
df = df.sort_values('timestamp', ignore_index=True)

# Output just the columns you wanted:
print(df[['timestamp', 'acq_stamp', 'value']])
Output:
                      timestamp                  acq_stamp  value
0 2022-05-15 21:00:02.660159700 2022-05-15 21:00:02.660160      1
1 2022-05-15 21:00:02.660159800 2022-05-15 21:00:02.660160      2
2 2022-05-15 21:00:02.660159900 2022-05-15 21:00:02.660160      3
3 2022-05-15 21:00:02.660160000 2022-05-15 21:00:02.660160      4
4 2022-05-15 21:00:04.660159700 2022-05-15 21:00:04.660160      5
5 2022-05-15 21:00:04.660159800 2022-05-15 21:00:04.660160      6
6 2022-05-15 21:00:04.660159900 2022-05-15 21:00:04.660160      7
7 2022-05-15 21:00:04.660160000 2022-05-15 21:00:04.660160      8
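If you want the computed time stamps as an actual DatetimeIndex, as sketched in the question, one more line on the resulting df does it:

# Promote the computed timestamps to the index (a DatetimeIndex).
df = df.set_index('timestamp')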