Skip to content
Advertisement

Pandas quantile function not returning the correct number of given quantiles

I have a dataframe with over 2,000 records that has multiple columns with various balances. Based on the balance amount I want to assign it to a bucket.

Trying to split each balance column into a quantile and have the following buckets 0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9 Concretely, translating the balances into the following buckets: top 10%, top 20%, top 30%, etc…

If I’m understanding correctly, so long as there are more than 10 records it should bucket each record in a percentile based on linear interpolation.

So I run the following:

score_quantiles =  df.quantile(q=[0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9])
score_quantiles.to_dict()



# Arguments (x = value, p = field (i.e bal 1, bal2, bal3) , d = score_quantiles)

def dlpScore(x,p,d):
    if pd.isnull(x) == True:
        return 0 
    elif int(x) == 0:
        return 0
    elif x <= d[p][0.1]:
        return 1
    elif x <= d[p][0.2]:
        return 2
    elif x <= d[p][0.3]: 
        return 3
    elif x <= d[p][0.4]: 
        return 4
    elif x <= d[p][0.5]: 
        return 5
    elif x <= d[p][0.6]: 
        return 6
    elif x <= d[p][0.7]: 
        return 7
    elif x <= d[p][0.8]: 
        return 8
    elif x <= d[p][0.9]: 
        return 9
    else:
        return 10



df['SCORE_BAL1'] = df['bal1'].apply(dlpScore, args=('bal1',score_quantiles,))




Problem is, on some columns it gives me all the buckets, on others it only gives me a few:

enter image description here enter image description here

Is there a way to ensure it creates all the buckets? I’m probably missing something here.

Advertisement

Answer

If you want to make sure that you are getting similar distributions among ‘buckets’ you might want to try the pandas qcut function. The full documentation is here.

To use it in your code and get deciles for instance you could just do

n_buckets=10
df['quantile'] = pd.qcut(df['target_column'], q=n_buckets)

And if you want to apply a specific label you can just do something like this

n_buckets=10
df['quantile'] = pd.qcut(df['target_column'], q=n_buckets, labels=range(1,n_buckets+1))

PS: Just beware that for this latter case if qcut is not able to generate the desired number of quantiles (e.g., because there are not enough values to create a quantile), you will get an exception for passing more labels than there are quantiles.

User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement