Skip to content
Advertisement

Using Python Pandas to bin data in one df according to bins defined in a second df

I am attempting to bin data in one dataframe according to bins defined in a second dataframe. I am thinking that some combination of pd.bin and pd.merge might get me there?

This is basically the form each dataframe is currently in:

df = pd.DataFrame({'id':['a', 'b', 'c', 'd','e'],
                   'bin':[1, 2, 3, 3, 2],
                   'perc':[0.1,0.9,0.3,0.7,0.5]})

df2 = pd.DataFrame({'bin':[1, 1, 1, 2, 2, 2, 3, 3, 3], 
                    'result':['low', 'medium','high','low', 'medium','high','low', 'medium','high'],
                    'cut_min':[0,0.2,0.6,0,0.3,0.7,0,0.4,0.8],
                    'cut_max':[0.2,0.6,1,0.3,0.7,1,0.4,0.8,1]})

df:

bin id  perc
1   a   0.1
2   b   0.9
3   c   0.3
3   d   0.7
2   e   0.5

And this is the table with the bins, df2:

bin cut_max cut_min result
1   0.2     0.0     low
1   0.6     0.2     medium
1   1.0     0.6     high
2   0.3     0.0     low
2   0.7     0.3     medium
2   1.0     0.7     high
3   0.4     0.0     low
3   0.8     0.4     medium
3   1.0     0.8     high

I would like to match the bin, and find the appropriate result in df2 using the cut_min and cut_max that encompasses the perc value in df. So, I would like the resulting table to look like this:

bin id  perc    result
1   a   0.1     low
2   b   0.9     high
3   c   0.3     low
3   d   0.7     medium
2   e   0.5     medium

I originally wrote this in a SQL query which accomplished the task quite simply with a join:

select
  df.id
  , df.bin
  , df.perc
  , df2.result
from df
inner join df2
  on df.bin = df2.bin
  and df.perc >= df2.cut_min 
  and df.perc < df2.cut_max

If anyone knows a good way to do this using Pandas, it would be greatly appreciated! (And this is actually the first time I haven’t been able to find a solution just searching on stackoverflow, so my apologies if any of the above wasn’t explained well enough!)

Advertisement

Answer

First merge df and df2 on the bin column, and then select the rows where cut_min <= perc < cut_max:

In [95]: result = pd.merge(df, df2, on='bin').query('cut_min <= perc < cut_max'); result
Out[95]: 
    bin id  perc  cut_max  cut_min  result
0     1  a   0.1      0.2      0.0     low
5     2  b   0.9      1.0      0.7    high
7     2  e   0.5      0.7      0.3  medium
9     3  c   0.3      0.4      0.0     low
13    3  d   0.7      0.8      0.4  medium

In [97]: result = result[['bin', 'id', 'perc', 'result']]

In [98]: result.sort('id')
Out[98]: 
    bin id  perc  result
0     1  a   0.1     low
5     2  b   0.9    high
9     3  c   0.3     low
13    3  d   0.7  medium
7     2  e   0.5  medium
User contributions licensed under: CC BY-SA
2 People found this is helpful
Advertisement