I have a dataframe in this format:
time column          ID column  Value
2022-01-01 00:00:00          1     10
2022-01-01 00:15:00          1      0
2022-01-01 00:30:00          1      9
2022-01-01 00:45:00          1      0
2022-01-02 00:00:00          1      0
2022-01-02 00:15:00          1      0
2022-01-02 00:30:00          1      5
2022-01-02 00:45:00          1     15
2022-01-01 00:00:00          2      6
2022-01-01 00:15:00          2      2
2022-01-01 00:30:00          2      0
2022-01-01 00:45:00          2      0
2022-01-02 00:00:00          2      0
2022-01-02 00:15:00          2      0
2022-01-02 00:30:00          2      0
2022-01-02 00:45:00          2      7
My actual dataframe is much larger, with more than 500 unique IDs.
I want to convert this 2D dataframe into a 3D array of shape (num_time_samples, value, ID). Essentially, I would like one 2D array for every unique ID.
I plan on using the value column to build lag-based feature vectors, but I’m stuck on how to convert the dataframe. I’ve searched and tried df.values, reshaping, etc., and nothing has worked.
Answer
import pandas as pd

df = pd.DataFrame(
    {
        'time column': ['00:00:00', '00:15:00', '00:00:00', '00:15:00'],
        'ID column': [1, 1, 2, 2],
        'Value': [10, 0, 6, 2],
    }
)
where df is effectively a subset of your dataframe, keeping everything data-type-naive.
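If you would rather not stay type-naive, one optional step (not needed for anything below) is to parse the column up front, assuming the real data holds full 'YYYY-MM-DD HH:MM:SS' strings as in the question:

# Optional: on the real data, parse the timestamp strings into datetimes.
# Everything below works identically on the type-naive toy frame.
df['time column'] = pd.to_datetime(df['time column'])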
I want to convert this 2D dataframe into a 3D array of shape (num_time_samples, value, ID).
Why not do

a = (
    df
    .set_index(['time column', 'ID column'])
    .unstack(level=-1)  # leaves 'time column' as the first-dimension index
    .to_numpy()         # shape: (num_times, num_IDs)
    .reshape(
        (
            # the axis order must match the (time, ID) layout unstack produces
            df['time column'].unique().size,  # axis 0: time samples
            df['ID column'].unique().size,    # axis 1: IDs
            1,                                # axis 2: the single Value
        )
    )
)
a looks like
>>> a
array([[[10],
        [ 6]],

       [[ 0],
        [ 2]]], dtype=int64)
>>> a.shape
(2, 2, 1)
>>> a.ndim
3
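As a sanity check (not part of the approach above, just a sketch assuming every ID covers the same set of timestamps), the same array can be built by stacking one column vector per ID, which makes the time × ID × Value axis order explicit:

import numpy as np

# One (num_times, 1) column per ID, in sorted-ID order, stacked along a
# new second axis so the result is time × ID × Value.
per_id = [
    g.sort_values('time column')['Value'].to_numpy().reshape(-1, 1)
    for _, g in df.groupby('ID column', sort=True)
]
b = np.stack(per_id, axis=1)  # shape: (num_times, num_IDs, 1)
assert (b == a).all()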
a is structured as time column × ID column × Value (and indexable accordingly). E.g., let’s get all individuals’ 00:15:00 data:
>>> a[1]  # <=> a[1, ...] <=> a[1, :, :]
array([[0],
       [2]], dtype=int64)
Let’s get the first and second individuals’ time series, respectively:
>>> a[:, 0]  # <=> a[:, 0, :] <=> a[..., 0, :]
array([[10],
       [ 0]], dtype=int64)
>>> a[:, 1]
array([[6],
       [2]], dtype=int64)
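Since the question mentions lag-based feature vectors, here is one possible follow-up sketch (not part of the answer above; n_lags is a made-up parameter, and sliding_window_view needs NumPy 1.20+):

from numpy.lib.stride_tricks import sliding_window_view

n_lags = 2  # hypothetical window length
series = a[..., 0]  # drop the trailing Value axis -> shape (num_times, num_IDs)

# Overlapping windows of n_lags consecutive values per ID; each window is
# one lag-based feature vector.
lags = sliding_window_view(series, n_lags, axis=0)
# lags.shape == (num_times - n_lags + 1, num_IDs, n_lags)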