Suppose I have two DataFrames with different sizes:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=np.random.uniform(low=1, high=10, size=(20,)),
                        volume=np.random.randint(0, 10, size=20)))
```

which gives:

```
df1:
    xlow   xup
0    0.0   1.0
1    1.0   2.0
2    2.0   3.0
3    3.0   4.0
4    4.0   5.0
5    5.0   6.0
6    6.0   7.0
7    7.0   8.0
8    8.0   9.0
9    9.0  10.0
10  10.0  11.0
```

and:

```
df2:
           x  volume
0   1.632789       8
1   8.346898       7
2   1.372285       2
3   1.946896       9
4   7.047305       0
5   3.851938       4
6   2.439664       7
7   8.823509       1
8   1.136700       1
9   8.766352       8
10  2.135441       8
11  8.092385       4
12  6.532898       3
13  7.199914       2
14  1.036684       0
15  9.714326       1
16  5.964111       0
17  9.625200       2
18  9.999818       6
19  9.891857       1
```

Now I want to add a third column to df1, say total_volume, holding for each row the sum of the df2 volume values whose x lies between that row's xlow (exclusive) and xup (inclusive). I can do this using:

```python
df1['total_volume'] = df1.apply(
    lambda row: df2[(df2.x <= row['xup']) & (df2.x > row['xlow'])].volume.sum(),
    axis=1)
```

which results in

```
    xlow   xup  total_volume
0    0.0   1.0             0
1    1.0   2.0            20
2    2.0   3.0            15
3    3.0   4.0             4
4    4.0   5.0             0
5    5.0   6.0             0
6    6.0   7.0             3
7    7.0   8.0             2
8    8.0   9.0            20
9    9.0  10.0            10
10  10.0  11.0             0
```

We can check the value of, say, the second row:

```python
df2[(df2.x <= 2) & (df2.x > 1)].volume.sum()  # 20
```

In reality, my df1 can have up to hundreds of thousands of rows, so this can take tens of minutes to complete. Is there a more vectorized/pythonic way of doing this? I tried pandas merge and join but was not successful, most likely because I am still a novice.
Thank you for all your help!
Answer
If the bins do not overlap, you can use cut to assign each x to a bin, aggregate volume with sum, and then add the result to df1 with DataFrame.join:

```python
df2['g'] = pd.cut(df2['x'], bins=[0] + df1['xup'].tolist(), labels=df1['xup'])

df2 = df1.join(df2.groupby('g')['volume'].sum(), on='xup')
print(df2)
```

```
    xlow  xup  volume
0    0.0    1       0
1    1.0    2      20
2    2.0    3      15
3    3.0    4       4
4    4.0    5       0
5    5.0    6       0
6    6.0    7       3
7    7.0    8       2
8    8.0    9      20
9    9.0   10      10
10  10.0   11       0
```
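
A possible variant, sketched here under stated assumptions rather than taken from the answer above: if the bins are non-overlapping but not necessarily contiguous, both edges can be passed to cut through pd.IntervalIndex.from_arrays instead of only the upper edges. The helper name total_volume_by_bin and the fixed random seed are hypothetical choices for illustration. Passing observed=False to groupby is intended to keep empty bins in the result so that they sum to 0. The last lines cross-check the result against the row-wise apply from the question.

```python
import numpy as np
import pandas as pd


def total_volume_by_bin(bins_df, points_df):
    """Sum points_df['volume'] per (xlow, xup] interval of bins_df.

    Assumes the intervals do not overlap. Each interval is closed on the
    right, matching the condition xlow < x <= xup used in the question.
    """
    intervals = pd.IntervalIndex.from_arrays(
        bins_df["xlow"], bins_df["xup"], closed="right")
    # Assign each x to its interval; x outside every interval becomes NaN
    # and is dropped by the groupby below.
    binned = pd.cut(points_df["x"], bins=intervals)
    # observed=False keeps intervals that received no points (sum is 0).
    sums = points_df.groupby(binned, observed=False)["volume"].sum()
    # Categories follow the order of the intervals, i.e. the rows of bins_df.
    return sums.to_numpy()


# Same shapes as in the question; the seed is only for reproducibility.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(dict(xlow=np.linspace(0, 10, 11),
                        xup=np.linspace(1.0, 11, 11)))
df2 = pd.DataFrame(dict(x=rng.uniform(1, 10, size=20),
                        volume=rng.integers(0, 10, size=20)))

df1["total_volume"] = total_volume_by_bin(df1, df2)

# Cross-check against the row-wise approach from the question.
expected = df1.apply(
    lambda row: df2[(df2.x <= row["xup"]) & (df2.x > row["xlow"])].volume.sum(),
    axis=1)
assert (df1["total_volume"] == expected).all()
```

Because the grouping happens once over df2 instead of once per row of df1, this avoids filtering df2 for every bin, which is where the apply version spends its time.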