
Pandas combine rows in groups to get rid of Nans

I want to do something similar to what pd.combine_first() does, but as a row-wise operation performed on a shared index, collapsing each group into a single row. I also want to add a new column in place of the old one, while keeping the original values of the shared column under new names.

In this case the ‘ts’ column is one that I want to replace with time_now.

time_now = "2022-08-05"

row1 = {'unique_id':5,'ts': '2022-08-02','id2':2,'val':300, 'ffo1':55, 'debt':200}
row2 = {'unique_id':5,'ts': '2022-08-03' ,'id2':2, 'esg':True,'gov_rt':90}
row3 = {'unique_id':5,'ts': '2022-08-04','id2':2, 'rank':5,'buy_or_sell':'sell'}
df = pd.DataFrame([row1,row2,row3])

   unique_id          ts  id2    val  ffo1   debt   esg  gov_rt  rank  
0          5  2022-08-02    2  300.0  55.0  200.0   NaN     NaN   NaN   
1          5  2022-08-03    2    NaN   NaN    NaN  True    90.0   NaN   
2          5  2022-08-04    2    NaN   NaN    NaN   NaN     NaN   5.0   

  buy_or_sell  
0         NaN  
1         NaN  
2        sell  

My desired output is below, using the new timestamp, but keeping the old ones based on their group index.

rows = [{'unique_id':5, 'ts':time_now ,'id2':2,'val':300, 'ffo1':55, 'debt':200,'esg':True,'gov_rt':90,'rank':5,'buy_or_sell':'sell', 'ts_1':'2022-08-02','ts_2':'2022-08-03', 'ts_3':'2022-08-04'}]
output = pd.DataFrame(rows)

   unique_id          ts  id2  val  ffo1  debt   esg  gov_rt  rank  
0          5  2022-08-05    2  300    55   200  True      90     5   

  buy_or_sell        ts_1        ts_2        ts_3  
0        sell  2022-08-02  2022-08-03  2022-08-04  

The part below seems to work when run by itself, but I cannot get it to work inside of a function because of mismatched index lengths.


df2 = df.set_index('ts').stack().reset_index()
rows = dict(zip(df2['level_1'], df2[0]))
ts = df2['ts'].unique().tolist()
for cnt, value in enumerate(ts):
    rows[f'ts_{cnt}'] = value
# collapse everything into a single row
df2 = pd.DataFrame([rows])
df2['time'] = time_now
df2


Answer

The problem was that I forgot to wrap the dictionary in a list to create a records-oriented dataframe. Additionally, when using a similar function, the index may need to be dropped when resetting, since duplicated columns can otherwise be created.
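To see why the list matters, here is a minimal standalone sketch: a bare dict of scalars has no row index, so pandas rejects it, while wrapping the same dict in a list builds a one-row, records-oriented dataframe.

```python
import pandas as pd

row = {'a': 1, 'b': 2}

# A bare dict of scalar values has no row index, so pandas raises ValueError
try:
    pd.DataFrame(row)
except ValueError as err:
    print(err)  # "If using all scalar values, you must pass an index"

# Wrapping the dict in a list treats it as one record, giving a single row
good = pd.DataFrame([row])
print(good.shape)  # (1, 2)
```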

I still wonder if there’s a better way to do what I want, since it’s kind of slow.

def func(df):
    # long format: one row per (ts, original column) pair
    df2 = df.set_index('ts').stack().reset_index()
    # the last non-null value wins for each column name
    rows = dict(zip(df2['level_1'], df2[0]))
    # keep the original timestamps as numbered columns
    ts = df2['ts'].unique().tolist()
    for cnt, value in enumerate(ts):
        rows[f'ts_{cnt}'] = value
    # wrap the dict in a list to build a single records-oriented row
    df2 = pd.DataFrame([rows])
    df2['time'] = time_now
    return df2

# run this
df.groupby('unique_id').apply(func)
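One possible faster alternative is a sketch along these lines: GroupBy.first() already takes the first non-null value per column, which collapses each group without building a Python dict per group; the original timestamps can then be spread into numbered columns separately. This is an assumption-laden rewrite of the approach above, not the author's method.

```python
import pandas as pd

time_now = "2022-08-05"
df = pd.DataFrame([
    {'unique_id': 5, 'ts': '2022-08-02', 'id2': 2, 'val': 300, 'ffo1': 55, 'debt': 200},
    {'unique_id': 5, 'ts': '2022-08-03', 'id2': 2, 'esg': True, 'gov_rt': 90},
    {'unique_id': 5, 'ts': '2022-08-04', 'id2': 2, 'rank': 5, 'buy_or_sell': 'sell'},
])

# first() keeps the first non-null value per column, collapsing
# each unique_id group into a single row without a per-group dict
collapsed = df.groupby('unique_id', as_index=False).first()

# spread each group's timestamps into ts_1, ts_2, ... columns
ts_wide = (
    df.groupby('unique_id')['ts']
      .apply(lambda s: pd.Series(s.values,
                                 index=[f'ts_{i + 1}' for i in range(len(s))]))
      .unstack()
)

# replace the collapsed ts with the new timestamp, keeping the old ones alongside
out = collapsed.drop(columns='ts').join(ts_wide, on='unique_id')
out.insert(1, 'ts', time_now)
print(out)
```

This trades the per-group stack/dict work for two vectorized group operations, which should scale better when there are many `unique_id` groups.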
