Suppose I have two DataFrames of different sizes:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)),
                        volume=np.random.randint(0, 10, size=20)))
```
which gives:

```
df1:
    xlow   xup
0    0.0   1.0
1    1.0   2.0
2    2.0   3.0
3    3.0   4.0
4    4.0   5.0
5    5.0   6.0
6    6.0   7.0
7    7.0   8.0
8    8.0   9.0
9    9.0  10.0
10  10.0  11.0
```
and:
```
df2:
           x  volume
0   1.632789       8
1   8.346898       7
2   1.372285       2
3   1.946896       9
4   7.047305       0
5   3.851938       4
6   2.439664       7
7   8.823509       1
8   1.136700       1
9   8.766352       8
10  2.135441       8
11  8.092385       4
12  6.532898       3
13  7.199914       2
14  1.036684       0
15  9.714326       1
16  5.964111       0
17  9.625200       2
18  9.999818       6
19  9.891857       1
```
Now I want to add a third column to df1, say total_volume, which holds the sum of the volume values in df2 whose x lies between xlow and xup of each row of df1. I can do this using:

```python
df1['total_volume'] = df1.apply(
    lambda row: df2[(df2.x <= row['xup']) & (df2.x > row['xlow'])].volume.sum(),
    axis=1)
```

which results in:

```
    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0
```
We can check the value of, say, the second row as:

```python
df2[(df2.x <= 2) & (df2.x > 1)].volume.sum()  # 20
```
In reality, my df1 can have hundreds of thousands of rows, so this can take tens of minutes to complete. Is there a more vectorized/Pythonic way of doing this? I tried pandas merge and join but was not successful, most likely because I am still a novice.
Thank you for all your help!
Answer
If the bins are not overlapping, it is possible to use `cut` with an aggregate `sum`, and then add the result to `df1` by `DataFrame.join`:

```python
df2['g'] = pd.cut(df2['x'], bins=[0] + df1['xup'].tolist(), labels=df1['xup'])
df2 = df1.join(df2.groupby('g')['volume'].sum(), on='xup')
print(df2)
```

```
    xlow  xup  volume
0    0.0    1       0
1    1.0    2      20
2    2.0    3      15
3    3.0    4       4
4    4.0    5       0
5    5.0    6       0
6    6.0    7       3
7    7.0    8       2
8    8.0    9      20
9    9.0   10      10
10  10.0   11       0
```
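As a further sketch (not from the answer above), the same binning can also be done with `np.searchsorted` plus `np.bincount`, assuming the bins are contiguous and sorted as in df1 (each row's xlow equals the previous row's xup). The fixed seed and `default_rng` here are my own additions so the example is reproducible:

```python
import numpy as np
import pandas as pd

# Rebuild the example frames with a fixed seed for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=rng.uniform(low=1, high=10, size=20),
                        volume=rng.integers(0, 10, size=20)))

# For contiguous sorted bins, each x falls in exactly one bin; searchsorted
# with side='left' finds the first xup >= x, reproducing the
# (x > xlow) & (x <= xup) rule.
idx = np.searchsorted(df1['xup'].to_numpy(), df2['x'].to_numpy(), side='left')

# Sum the volumes landing in each bin; minlength pads empty trailing bins.
df1['total_volume'] = np.bincount(idx, weights=df2['volume'],
                                  minlength=len(df1)).astype(int)
```

This avoids the per-row Python lambda entirely, which is where the `apply` version loses its time.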