Pandas compare and sum values between two DataFrame with different size

Suppose I have two Dataframes with different sizes:

df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11), 
                           xup=np.linspace(1.0, 11, 11))) 
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)), 
                           volume=np.random.randint(0, 10, size=20)))

JavaScript
​x
 
df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11), 
                           xup=np.linspace(1.0, 11, 11))) 
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)), 
                           volume=np.random.randint(0, 10, size=20)))
​

to which I have:

df1: 
    xlow   xup
0    0.0   1.0
1    1.0   2.0
2    2.0   3.0
3    3.0   4.0
4    4.0   5.0
5    5.0   6.0
6    6.0   7.0
7    7.0   8.0
8    8.0   9.0
9    9.0  10.0
10  10.0  11.0

JavaScript
 
df1: 
    xlow   xup
  0.0   1.0
  1.0   2.0
  2.0   3.0
  3.0   4.0
  4.0   5.0
  5.0   6.0
  6.0   7.0
  7.0   8.0
  8.0   9.0
  9.0  10.0
10.0  11.0
​

and:

df2:
         x  volume
0   1.632789       8
1   8.346898       7
2   1.372285       2
3   1.946896       9
4   7.047305       0
5   3.851938       4
6   2.439664       7
7   8.823509       1
8   1.136700       1
9   8.766352       8
10  2.135441       8
11  8.092385       4
12  6.532898       3
13  7.199914       2
14  1.036684       0
15  9.714326       1
16  5.964111       0
17  9.625200       2
18  9.999818       6
19  9.891857       1

JavaScript
 
df2:
         x  volume
 1.632789       8
 8.346898       7
 1.372285       2
 1.946896       9
 7.047305       0
 3.851938       4
 2.439664       7
 8.823509       1
 1.136700       1
 8.766352       8
2.135441       8
8.092385       4
6.532898       3
7.199914       2
1.036684       0
9.714326       1
5.964111       0
9.625200       2
9.999818       6
9.891857       1
​

Now I want to add a third column to df1 say total_volume, where it is the summation of the volume that lie between individual row of xlow and xup of df1. I can do this using:

df1['total_volume']=df1.apply(lambda row: df2[(df2.x<=row['xup']) & (df2.x>row['xlow'])].volume.sum(),axis=1)

which results in 

    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0

JavaScript
 
df1['total_volume']=df1.apply(lambda row: df2[(df2.x<=row['xup']) & (df2.x>row['xlow'])].volume.sum(),axis=1)
​
which results in 
​
    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0
​

we can check the value of say the second row as:

df2[(df2.x<=2) & (df2.x>1) ].volume.sum()=20

JavaScript
 
df2[(df2.x<=2) & (df2.x>1) ].volume.sum()=20
​

In reality, my df1 can have up to hundreds of thousands of rows, so it can take up to tens of minutes to complete. Is there a more vectorize/pythonic way of doing this. I tried pandas merge and join but were not successful, most likely because I am still a novice.

Thank you for all your help!

Answer

If bins are not overlapping is possible use cut with aggregate sum and then add to df1 by DataFrame.join:

df2['g'] = pd.cut(df2['x'], bins=[0] + df1['xup'].tolist(), labels=df1['xup'])

df2 = df1.join(df2.groupby('g')['volume'].sum(), on='xup')
print (df2)

    xlow xup  volume
0    0.0   1       0
1    1.0   2      20
2    2.0   3      15
3    3.0   4       4
4    4.0   5       0
5    5.0   6       0
6    6.0   7       3
7    7.0   8       2
8    8.0   9      20
9    9.0  10      10
10  10.0  11       0

JavaScript
 
df2['g'] = pd.cut(df2['x'], bins=[0] + df1['xup'].tolist(), labels=df1['xup'])
​
df2 = df1.join(df2.groupby('g')['volume'].sum(), on='xup')
print (df2)
​
    xlow xup  volume
0    0.0   1       0
1    1.0   2      20
2    2.0   3      15
3    3.0   4       4
4    4.0   5       0
5    5.0   6       0
6    6.0   7       3
7    7.0   8       2
8    8.0   9      20
9    9.0  10      10
10  10.0  11       0
​

Advertisement

Answer