Skip to content
Advertisement

Convert 2D dataframe to 3D numpy array based on unique ID

I have a dataframe in this format:

time column           ID column   Value
2022-01-01 00:00:00   1           10
2022-01-01 00:15:00   1           0
2022-01-01 00:30:00   1           9
2022-01-01 00:45:00   1           0
2022-01-02 00:00:00   1           0
2022-01-02 00:15:00   1           0
2022-01-02 00:30:00   1           5
2022-01-02 00:45:00   1           15
2022-01-01 00:00:00   2           6
2022-01-01 00:15:00   2           2
2022-01-01 00:30:00   2           0
2022-01-01 00:45:00   2           0
2022-01-02 00:00:00   2           0
2022-01-02 00:15:00   2           0
2022-01-02 00:30:00   2           0
2022-01-02 00:45:00   2           7

… though my dataframe is much larger, with more than 500 hundred IDs.

I want to convert this 2D – dataframe into a 3D array in this format (num_time_samples, value, ID). Essentially I would like to have one 2D-array for every unique ID.

I plan on using the value column to build lag based feature vectors, but I’m stuck on how to convert the dataframe. I’ve searched and tried df.value, reshaping, etc and nothing has worked.

Advertisement

Answer

Say you have

df = pd.DataFrame(
    {
        'time column': [
            '00:00:00', '00:15:00', '00:00:00', '00:15:00',
        ],
        'ID column': [
            1,          1,          2,          2,
        ],
        'Value': [
            10,         0,          6,          2,
        ],
    }
)

where df actually is a subset of your dataframe keeping eveything data-type-naive.

I want to convert this 2D – dataframe into a 3D array in this format (num_time_samples, value, ID).

Why not do

a = (
    df
    .set_index(['time column', 'ID column'])
    .unstack(level=-1)    # which leaves 'time column' as first dimension index
    .to_numpy()
    .reshape(
        (
            df['ID column'].unique().size,
            df['time column'].unique().size,
            1,
        )
    )
)

a looks like

>>> a
array([[[10],
        [ 6]],

       [[ 0],
        [ 2]]], dtype=int64)
>>> a.shape
(2, 2, 1)
>>> a.ndim
3

a is structured as time column × ID column × Value (and indexable accordingly). E.g. let’s get individuals’ 00:15:00-data

>>> a[1]  # <=> a[1, ...] <=> a[1, :, :]
array([[0],
       [2]], dtype=int64)

Let’s get the first and second individual’s time series, respectively,

>>> a[:, 0]  # <=> a[:, 0, :] <=> a[..., 0, :]
array([[10],
       [ 0]], dtype=int64)
>>> a[:, 1]
array([[6],
       [2]], dtype=int64)
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement