I have a Pandas data frame that contains one column and an index of timestamps. The code for the data frame looks something like this:
import pandas as pd indx = pd.date_range(start = '12-12-2020 06:00:00',end = '12-12-2020 06:02:00',freq = 'T') df = pd.DataFrame(data = [0.2,0.4,0.6],index = indx,columns = ['colname'])
I want to create a list of dictionaries from the rows of df in a certain way. For each row of the data frame, I want to create a dictionary with the keys “Timestamp” and “Value”. The value of the “Timestamp” key will be the index of that row. The value of the “Value” key will be the value of the row in the data frame columns. Each of these dictionaries will be appended to a list.
I know I can do this by looping over all of the rows of the data frame like this:
dict_list = [] for i in range(df.shape[0]): new_dict = {'Timestamp': df.index[i],'Value': df.iloc[i,0]} dict_list.append(new_dict)
However, the data frames I’m actually working with may be very large. Is there a faster, more efficient way of doing this other than using a for loop?
Advertisement
Answer
You need to rename your column and give your Index a name and turn it into a column. Then you want DataFrame.to_dict
using the 'records'
('r'
) orientation.
df = df.rename(columns={'colname': 'Value'}).rename_axis(index='Timestamp').reset_index() dict_list = df.to_dict('records') #[{'Timestamp': Timestamp('2020-12-12 06:00:00'), 'Value': 0.2}, # {'Timestamp': Timestamp('2020-12-12 06:01:00'), 'Value': 0.4}, # {'Timestamp': Timestamp('2020-12-12 06:02:00'), 'Value': 0.6}]
For larger DataFrames it gets a bit faster than simple looping, but it still gets slow as things are large
import perfplot import pandas as pd import numpy as np def loop(df): dict_list = [] for i in range(df.shape[0]): new_dict = {'Timestamp': df.index[i],'Value': df.iloc[i,0]} dict_list.append(new_dict) return dict_list def df_to_dict(df): df = df.rename(columns={'colname': 'Value'}).rename_axis(index='Timestamp').reset_index() return df.to_dict('records') perfplot.show( setup=lambda n: pd.DataFrame({'colname': np.random.normal(0,1,n)}, index=pd.date_range('12-12-2020', freq = 'T', periods=n)), kernels=[ lambda df: loop(df), lambda df: df_to_dict(df), ], labels=['Loop', 'df.to_dict'], n_range=[2 ** k for k in range(20)], equality_check=None, xlabel='len(df)' )