I am attempting to bin data in one dataframe according to bins defined in a second dataframe. I am thinking that some combination of pd.bin and pd.merge might get me there?
This is basically the form each dataframe is currently in:
df = pd.DataFrame({'id':['a', 'b', 'c', 'd','e'], 'bin':[1, 2, 3, 3, 2], 'perc':[0.1,0.9,0.3,0.7,0.5]}) df2 = pd.DataFrame({'bin':[1, 1, 1, 2, 2, 2, 3, 3, 3], 'result':['low', 'medium','high','low', 'medium','high','low', 'medium','high'], 'cut_min':[0,0.2,0.6,0,0.3,0.7,0,0.4,0.8], 'cut_max':[0.2,0.6,1,0.3,0.7,1,0.4,0.8,1]})
df:
bin id perc 1 a 0.1 2 b 0.9 3 c 0.3 3 d 0.7 2 e 0.5
And this is the table with the bins, df2:
bin cut_max cut_min result 1 0.2 0.0 low 1 0.6 0.2 medium 1 1.0 0.6 high 2 0.3 0.0 low 2 0.7 0.3 medium 2 1.0 0.7 high 3 0.4 0.0 low 3 0.8 0.4 medium 3 1.0 0.8 high
I would like to match the bin, and find the appropriate result in df2 using the cut_min and cut_max that encompasses the perc value in df. So, I would like the resulting table to look like this:
bin id perc result 1 a 0.1 low 2 b 0.9 high 3 c 0.3 low 3 d 0.7 medium 2 e 0.5 medium
I originally wrote this in a SQL query which accomplished the task quite simply with a join:
select df.id , df.bin , df.perc , df2.result from df inner join df2 on df.bin = df2.bin and df.perc >= df2.cut_min and df.perc < df2.cut_max
If anyone knows a good way to do this using Pandas, it would be greatly appreciated! (And this is actually the first time I haven’t been able to find a solution just searching on stackoverflow, so my apologies if any of the above wasn’t explained well enough!)
Advertisement
Answer
First merge df
and df2
on the bin
column, and then select the rows where cut_min <= perc < cut_max
:
In [95]: result = pd.merge(df, df2, on='bin').query('cut_min <= perc < cut_max'); result Out[95]: bin id perc cut_max cut_min result 0 1 a 0.1 0.2 0.0 low 5 2 b 0.9 1.0 0.7 high 7 2 e 0.5 0.7 0.3 medium 9 3 c 0.3 0.4 0.0 low 13 3 d 0.7 0.8 0.4 medium In [97]: result = result[['bin', 'id', 'perc', 'result']] In [98]: result.sort('id') Out[98]: bin id perc result 0 1 a 0.1 low 5 2 b 0.9 high 9 3 c 0.3 low 13 3 d 0.7 medium 7 2 e 0.5 medium