I want to do something similar to what pd.combine_first() does, but as a row-wise operation performed on a shared index, and I also want to replace a shared column with a new value while keeping its original values as extra columns. In this case the 'ts' column is the one I want to replace with time_now.
time_now = "2022-08-05" row1 = {'unique_id':5,'ts': '2022-08-02','id2':2,'val':300, 'ffo1':55, 'debt':200} row2 = {'unique_id':5,'ts': '2022-08-03' ,'id2':2, 'esg':True,'gov_rt':90} row3 = {'unique_id':5,'ts': '2022-08-04','id2':2, 'rank':5,'buy_or_sell':'sell'} df = pd.DataFrame([row1,row2,row3]) unique_id ts id2 val ffo1 debt esg gov_rt rank 0 5 2022-08-02 2 300.0 55.0 200.0 NaN NaN NaN 1 5 2022-08-03 2 NaN NaN NaN True 90.0 NaN 2 5 2022-08-04 2 NaN NaN NaN NaN NaN 5.0 buy_or_sell 0 NaN 1 NaN 2 sell
My desired output is below: the new timestamp goes into ts, while the old timestamps are kept as numbered columns based on their position within the group.
```python
rows = [{'unique_id': 5, 'ts': time_now, 'id2': 2, 'val': 300, 'ffo1': 55, 'debt': 200,
         'esg': True, 'gov_rt': 90, 'rank': 5, 'buy_or_sell': 'sell',
         'ts_1': '2022-08-02', 'ts_2': '2022-08-03', 'ts_3': '2022-08-04'}]
output = pd.DataFrame(rows)
```

```
   unique_id          ts  id2  val  ffo1  debt   esg  gov_rt  rank buy_or_sell        ts_1        ts_2        ts_3
0          5  2022-08-05    2  300    55   200  True      90     5        sell  2022-08-02  2022-08-03  2022-08-04
```
The part below seems to work when run by itself, but I cannot get it to work inside a function because of mismatched index lengths.
```python
df2 = df.set_index('ts').stack().reset_index()   # long format: one row per (ts, column, value)
rows = dict(zip(df2['level_1'], df2[0]))         # column name -> value, last non-null wins
ts = df2['ts'].unique().tolist()
for cnt, value in enumerate(ts):
    rows[f'ts_{cnt}'] = value                    # keep each original timestamp as ts_0, ts_1, ...
# drop all rows and rebuild a single record-oriented row
df2 = pd.DataFrame([rows])
df2['time'] = time_now
df2
```
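For context on what the stack step produces, here is a sketch against the sample df above (`long_df` is just an illustrative name, not from my code):

```python
long_df = df.set_index('ts').stack().reset_index()
# Columns are ['ts', 'level_1', 0]:
#   ts      - the timestamp of the row each value came from
#   level_1 - the original column name (unique_id, id2, val, ...)
#   0       - the value itself
# stack() drops NaN cells, so zip(long_df['level_1'], long_df[0]) keeps the
# last value seen for column names that repeat across rows (unique_id, id2).
print(long_df)
```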
Answer
The problem was that I forgot to put the dictionary into a list to create a records-oriented DataFrame. Additionally, when using a similar function, the index might need to be dropped when it is reset, since duplicated columns can otherwise be created.
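As a minimal illustration of both points (the names `record` and `single_row` are made up, not from the question):

```python
import pandas as pd

record = {'unique_id': 5, 'val': 300, 'ts_1': '2022-08-02'}

# A bare dict of scalars is read as column -> values and raises
# "ValueError: If using all scalar values, you must pass an index";
# wrapping it in a list builds one record-oriented row instead.
single_row = pd.DataFrame([record])

# If an index level shares a name with a column (common after a groupby),
# reset_index() would try to insert a duplicate column; drop=True discards
# the old index instead.
single_row = single_row.reset_index(drop=True)
```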
I still wonder if there’s a better way to do what I want, since it’s kind of slow.
```python
def func(df):
    # long format: one row per (ts, original column, value); NaN cells are dropped
    df2 = df.set_index('ts').stack().reset_index()
    # collapse into a single mapping of column name -> value
    rows = dict(zip(df2['level_1'], df2[0]))
    ts = df2['ts'].unique().tolist()
    for cnt, value in enumerate(ts):
        rows[f'ts_{cnt}'] = value   # enumerate(ts, start=1) would give ts_1.. as in the desired output
    # drop all rows
    df2 = pd.DataFrame([rows])
    df2['time'] = time_now
    return df2

# run this
df.groupby('unique_id').apply(func)
```
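On the "is there a better way" point: most of the cost is the per-group stack and dict rebuild inside apply. One possible alternative, sketched below under the assumption that each non-ts column has at most one non-null value per group (`collapsed`, `ts_wide` and `result` are made-up names), is to let groupby.first() do the collapsing and pivot the timestamps into numbered columns:

```python
import pandas as pd

# first() skips NaN, so it picks the single non-null value each column holds
# within a group - the same collapse the stack/zip trick performs.
collapsed = df.drop(columns='ts').groupby('unique_id', as_index=False).first()

# Spread each group's timestamps into numbered columns ts_1, ts_2, ...
ts_wide = (
    df.assign(n=df.groupby('unique_id').cumcount() + 1)
      .pivot(index='unique_id', columns='n', values='ts')
      .add_prefix('ts_')
      .reset_index()
)

# Combine the two and add the new timestamp.
result = collapsed.merge(ts_wide, on='unique_id')
result['ts'] = time_now
```

Whether this is actually faster will depend on the number of groups, so it is worth timing against the apply version on real data.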