Most efficient way to place a Pandas data frame into a list of dictionaries with a certain format

Tags: , , ,



I have a Pandas data frame that contains one column and an index of timestamps. The code for the data frame looks something like this:

import pandas as pd
indx = pd.date_range(start = '12-12-2020 06:00:00',end = '12-12-2020 06:02:00',freq = 'T')
df = pd.DataFrame(data = [0.2,0.4,0.6],index = indx,columns = ['colname'])

I want to create a list of dictionaries from the rows of df in a certain way. For each row of the data frame, I want to create a dictionary with the keys “Timestamp” and “Value”. The value of the “Timestamp” key will be the index of that row. The value of the “Value” key will be the value of the row in the data frame columns. Each of these dictionaries will be appended to a list.

I know I can do this by looping over all of the rows of the data frame like this:

dict_list = []
for i in range(df.shape[0]):
    new_dict = {'Timestamp': df.index[i],'Value': df.iloc[i,0]}
    dict_list.append(new_dict)

However, the data frames I’m actually working with may be very large. Is there a faster, more efficient way of doing this other than using a for loop?

Answer

You need to rename your column and give your Index a name and turn it into a column. Then you want DataFrame.to_dict using the 'records' ('r') orientation.

df = df.rename(columns={'colname': 'Value'}).rename_axis(index='Timestamp').reset_index()

dict_list = df.to_dict('records')
#[{'Timestamp': Timestamp('2020-12-12 06:00:00'), 'Value': 0.2},
# {'Timestamp': Timestamp('2020-12-12 06:01:00'), 'Value': 0.4},
# {'Timestamp': Timestamp('2020-12-12 06:02:00'), 'Value': 0.6}]

For larger DataFrames it gets a bit faster than simple looping, but it still gets slow as things are large

import perfplot
import pandas as pd
import numpy as np

def loop(df):
    dict_list = []
    for i in range(df.shape[0]):
        new_dict = {'Timestamp': df.index[i],'Value': df.iloc[i,0]}
        dict_list.append(new_dict)
    return dict_list
        
        
def df_to_dict(df):
    df = df.rename(columns={'colname': 'Value'}).rename_axis(index='Timestamp').reset_index()
    return df.to_dict('records')

perfplot.show(
    setup=lambda n: pd.DataFrame({'colname': np.random.normal(0,1,n)},
                                 index=pd.date_range('12-12-2020', freq = 'T', periods=n)),
    kernels=[
        lambda df: loop(df),
        lambda df: df_to_dict(df),
    ],
    labels=['Loop', 'df.to_dict'],
    n_range=[2 ** k for k in range(20)],
    equality_check=None,
    xlabel='len(df)'
)

enter image description here



Source: stackoverflow