I have a dataframe with over 2,000 records that has multiple columns with various balances. Based on the balance amount I want to assign it to a bucket.
Trying to split each balance column into a quantile and have the following buckets 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
Concretely, translating the balances into the following buckets: top 10%, top 20%, top 30%, etc…
If I’m understanding correctly, so long as there are more than 10 records it should bucket each record in a percentile based on linear interpolation.
So I run the following:
score_quantiles = df.quantile(q=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]) score_quantiles.to_dict() # Arguments (x = value, p = field (i.e bal 1, bal2, bal3) , d = score_quantiles) def dlpScore(x,p,d): if pd.isnull(x) == True: return 0 elif int(x) == 0: return 0 elif x <= d[p][0.1]: return 1 elif x <= d[p][0.2]: return 2 elif x <= d[p][0.3]: return 3 elif x <= d[p][0.4]: return 4 elif x <= d[p][0.5]: return 5 elif x <= d[p][0.6]: return 6 elif x <= d[p][0.7]: return 7 elif x <= d[p][0.8]: return 8 elif x <= d[p][0.9]: return 9 else: return 10 df['SCORE_BAL1'] = df['bal1'].apply(dlpScore, args=('bal1',score_quantiles,))
Problem is, on some columns it gives me all the buckets, on others it only gives me a few:
Is there a way to ensure it creates all the buckets? I’m probably missing something here.
Advertisement
Answer
If you want to make sure that you are getting similar distributions among ‘buckets’ you might want to try the pandas qcut
function. The full documentation is here.
To use it in your code and get deciles for instance you could just do
n_buckets=10 df['quantile'] = pd.qcut(df['target_column'], q=n_buckets)
And if you want to apply a specific label you can just do something like this
n_buckets=10 df['quantile'] = pd.qcut(df['target_column'], q=n_buckets, labels=range(1,n_buckets+1))
PS: Just beware that for this latter case if qcut
is not able to generate the desired number of quantiles (e.g., because there are not enough values to create a quantile), you will get an exception for passing more labels than there are quantiles.