Using Python Pandas to bin data in one df according to bins defined in a second df

I am attempting to bin data in one dataframe according to bins defined in a second dataframe. I am thinking that some combination of pd.bin and pd.merge might get me there?

This is basically the form each dataframe is currently in:

df = pd.DataFrame({'id':['a', 'b', 'c', 'd','e'],
                   'bin':[1, 2, 3, 3, 2],
                   'perc':[0.1,0.9,0.3,0.7,0.5]})

df2 = pd.DataFrame({'bin':[1, 1, 1, 2, 2, 2, 3, 3, 3], 
                    'result':['low', 'medium','high','low', 'medium','high','low', 'medium','high'],
                    'cut_min':[0,0.2,0.6,0,0.3,0.7,0,0.4,0.8],
                    'cut_max':[0.2,0.6,1,0.3,0.7,1,0.4,0.8,1]})

JavaScript
​x
 
df = pd.DataFrame({'id':['a', 'b', 'c', 'd','e'],
                   'bin':[1, 2, 3, 3, 2],
                   'perc':[0.1,0.9,0.3,0.7,0.5]})
​
df2 = pd.DataFrame({'bin':[1, 1, 1, 2, 2, 2, 3, 3, 3], 
                    'result':['low', 'medium','high','low', 'medium','high','low', 'medium','high'],
                    'cut_min':[0,0.2,0.6,0,0.3,0.7,0,0.4,0.8],
                    'cut_max':[0.2,0.6,1,0.3,0.7,1,0.4,0.8,1]})
​

df:

bin id  perc
1   a   0.1
2   b   0.9
3   c   0.3
3   d   0.7
2   e   0.5

JavaScript
 
bin id  perc
1   a   0.1
2   b   0.9
3   c   0.3
3   d   0.7
2   e   0.5
​

And this is the table with the bins, df2:

bin cut_max cut_min result
1   0.2     0.0     low
1   0.6     0.2     medium
1   1.0     0.6     high
2   0.3     0.0     low
2   0.7     0.3     medium
2   1.0     0.7     high
3   0.4     0.0     low
3   0.8     0.4     medium
3   1.0     0.8     high

JavaScript
 
bin cut_max cut_min result
 0.2     0.0     low
 0.6     0.2     medium
 1.0     0.6     high
 0.3     0.0     low
 0.7     0.3     medium
 1.0     0.7     high
 0.4     0.0     low
 0.8     0.4     medium
 1.0     0.8     high
​

I would like to match the bin, and find the appropriate result in df2 using the cut_min and cut_max that encompasses the perc value in df. So, I would like the resulting table to look like this:

bin id  perc    result
1   a   0.1     low
2   b   0.9     high
3   c   0.3     low
3   d   0.7     medium
2   e   0.5     medium

JavaScript
 
bin id  perc    result
1   a   0.1     low
2   b   0.9     high
3   c   0.3     low
3   d   0.7     medium
2   e   0.5     medium
​

I originally wrote this in a SQL query which accomplished the task quite simply with a join:

select
  df.id
  , df.bin
  , df.perc
  , df2.result
from df
inner join df2
  on df.bin = df2.bin
  and df.perc >= df2.cut_min 
  and df.perc < df2.cut_max

JavaScript
 
select
  df.id
  , df.bin
  , df.perc
  , df2.result
from df
inner join df2
  on df.bin = df2.bin
  and df.perc >= df2.cut_min 
  and df.perc < df2.cut_max
​

If anyone knows a good way to do this using Pandas, it would be greatly appreciated! (And this is actually the first time I haven’t been able to find a solution just searching on stackoverflow, so my apologies if any of the above wasn’t explained well enough!)

Answer

First merge df and df2 on the bin column, and then select the rows where cut_min <= perc < cut_max:

In [95]: result = pd.merge(df, df2, on='bin').query('cut_min <= perc < cut_max'); result
Out[95]: 
    bin id  perc  cut_max  cut_min  result
0     1  a   0.1      0.2      0.0     low
5     2  b   0.9      1.0      0.7    high
7     2  e   0.5      0.7      0.3  medium
9     3  c   0.3      0.4      0.0     low
13    3  d   0.7      0.8      0.4  medium

In [97]: result = result[['bin', 'id', 'perc', 'result']]

In [98]: result.sort('id')
Out[98]: 
    bin id  perc  result
0     1  a   0.1     low
5     2  b   0.9    high
9     3  c   0.3     low
13    3  d   0.7  medium
7     2  e   0.5  medium

JavaScript
 
In [95]: result = pd.merge(df, df2, on='bin').query('cut_min <= perc < cut_max'); result
Out[95]: 
    bin id  perc  cut_max  cut_min  result
0     1  a   0.1      0.2      0.0     low
5     2  b   0.9      1.0      0.7    high
7     2  e   0.5      0.7      0.3  medium
9     3  c   0.3      0.4      0.0     low
13    3  d   0.7      0.8      0.4  medium
​
In [97]: result = result[['bin', 'id', 'perc', 'result']]
​
In [98]: result.sort('id')
Out[98]: 
    bin id  perc  result
0     1  a   0.1     low
5     2  b   0.9    high
9     3  c   0.3     low
13    3  d   0.7  medium
7     2  e   0.5  medium
​

Advertisement

Answer