Skip to content
Advertisement

Pandas compare and sum values between two DataFrame with different size

Suppose I have two Dataframes with different sizes:

df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11), 
                           xup=np.linspace(1.0, 11, 11))) 
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)), 
                           volume=np.random.randint(0, 10, size=20)))

to which I have:

df1: 
    xlow   xup
0    0.0   1.0
1    1.0   2.0
2    2.0   3.0
3    3.0   4.0
4    4.0   5.0
5    5.0   6.0
6    6.0   7.0
7    7.0   8.0
8    8.0   9.0
9    9.0  10.0
10  10.0  11.0

and:

df2:
         x  volume
0   1.632789       8
1   8.346898       7
2   1.372285       2
3   1.946896       9
4   7.047305       0
5   3.851938       4
6   2.439664       7
7   8.823509       1
8   1.136700       1
9   8.766352       8
10  2.135441       8
11  8.092385       4
12  6.532898       3
13  7.199914       2
14  1.036684       0
15  9.714326       1
16  5.964111       0
17  9.625200       2
18  9.999818       6
19  9.891857       1

Now I want to add a third column to df1 say total_volume, where it is the summation of the volume that lie between individual row of xlow and xup of df1. I can do this using:

df1['total_volume']=df1.apply(lambda row: df2[(df2.x<=row['xup']) & (df2.x>row['xlow'])].volume.sum(),axis=1)

which results in 

    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0

we can check the value of say the second row as:

df2[(df2.x<=2) & (df2.x>1) ].volume.sum()=20

In reality, my df1 can have up to hundreds of thousands of rows, so it can take up to tens of minutes to complete. Is there a more vectorize/pythonic way of doing this. I tried pandas merge and join but were not successful, most likely because I am still a novice.

Thank you for all your help!

Advertisement

Answer

If bins are not overlapping is possible use cut with aggregate sum and then add to df1 by DataFrame.join:

df2['g'] = pd.cut(df2['x'], bins=[0] + df1['xup'].tolist(), labels=df1['xup'])

df2 = df1.join(df2.groupby('g')['volume'].sum(), on='xup')
print (df2)

    xlow xup  volume
0    0.0   1       0
1    1.0   2      20
2    2.0   3      15
3    3.0   4       4
4    4.0   5       0
5    5.0   6       0
6    6.0   7       3
7    7.0   8       2
8    8.0   9      20
9    9.0  10      10
10  10.0  11       0
User contributions licensed under: CC BY-SA
8 People found this is helpful
Advertisement