How to generate a dataset based on mean, median, 1st and 9th decile values?

I have the following values that describe a dataset:

Number of Samples: 5388
Mean: 4173
Median: 4072
1st Decile: 2720
9th Decile: 5676

I need to generate any dataset that fits these values. All the examples I found require the standard deviation, which I don't have. How can this be done? Thanks!


Answer

Interesting question! Based on Scott’s suggestions I gave it a quick try.

Inputs:

import random
import pandas as pd
import numpy as np

# fix the random seed so the results are reproducible
random.seed(a=1, version=2)
# format floats with one decimal place
pd.options.display.float_format = '{:.1f}'.format

# given inputs
count = 5388
mean = 4173
median = 4072

lower_percentile = 10
lower_percentile_value = 2720

upper_percentile = 90
upper_percentile_value = 5676

# assumed bounds: the question gives no minimum or maximum
max_value = 6325
min_value = 2101

The Function:

def generate_dataset(count, mean, median, lower_percentile, upper_percentile,
    lower_percentile_value, upper_percentile_value,
    min_value, max_value
    ):
        
    # Calculate the number of values that fall within each percentile
    p_1_size = int(float(lower_percentile) * float(count) / 100)
    p_4_size = int(count - (float(upper_percentile) * float(count) / 100))
    p_2_size = int((count / 2) - p_1_size)
    p_3_size = int((count / 2) - p_4_size)
    
    # can be used to adjust the mean
    mean_adjuster = 5790

    # randomly pick values of right size from a range 
    p_1 = random.choices(range(min_value, lower_percentile_value), k=p_1_size)
    p_2 = random.choices(range(lower_percentile_value, median), k=p_2_size)
    p_3 = random.choices(range(median, mean_adjuster), k=p_3_size)
    p_4 = random.choices(range(upper_percentile_value, max_value), k=p_4_size)
    
    return p_1 + p_2 + p_3 + p_4
    
dataset = generate_dataset(
    count, mean, median, lower_percentile, upper_percentile,
    lower_percentile_value, upper_percentile_value, min_value, max_value
    )

Comparison:

# converting into DataFrame
df = pd.DataFrame({"x": dataset})

new_count = len(df)
new_mean = np.mean(df.x)
new_median = np.quantile(df.x, 0.5)
new_lower_percentile = np.quantile(df.x, lower_percentile/100)
new_upper_percentile = np.quantile(df.x, upper_percentile/100)

compare = pd.DataFrame(
    {
        "value": ["count", "mean", "median", "low_p", "high_p"],
        "original": [count, mean, median, lower_percentile_value, upper_percentile_value],
        "new":[new_count, new_mean, new_median, new_lower_percentile, new_upper_percentile]
    }
)

print(compare)

Output:

   value  original    new
0   count      5388 5388.0
1    mean      4173 4173.4
2  median      4072 4072.5
3   low_p      2720 2720.4
4  high_p      5676 5743.0

Getting the values to match exactly is a bit tricky when all your values are integers rather than floats.
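If floats are acceptable, a deterministic construction can get closer than random integers. The sketch below is not the code above but an alternative under stated assumptions: it uses only the standard library, keeps the same assumed bounds (2101 and 6325), fills each segment with increasing floats, and skews only the median-to-9th-decile segment with a hypothetical `shape` exponent. A bisection on `shape` then matches the mean without moving the segment boundaries. It assumes the target mean is reachable within the chosen `shape` bracket.

```python
import statistics

def segment(low, high, size, shape=1.0):
    """Return `size` increasing floats in [low, high).

    shape=1 spaces them evenly; shape<1 crowds them toward `high`,
    shape>1 crowds them toward `low`.
    """
    return [low + (high - low) * (i / size) ** shape for i in range(size)]

def build_dataset(count, q10, median, q90, min_value, max_value, shape=1.0):
    """Assemble four segments split at the 10th, 50th and 90th percentiles."""
    tail = count // 10                         # below q10 and above q90
    lower_mid = count // 2 - tail              # between q10 and the median
    upper_mid = count - 2 * tail - lower_mid   # between the median and q90
    return (
        segment(min_value, q10, tail)
        + segment(q10, median, lower_mid)
        + segment(median, q90, upper_mid, shape)  # only this one is skewed
        + segment(q90, max_value, tail)
    )

def fit_shape(target_mean, *bounds, low=0.1, high=10.0, steps=60):
    """Bisect on `shape`; the dataset mean decreases as shape grows."""
    for _ in range(steps):
        mid = (low + high) / 2
        if statistics.fmean(build_dataset(*bounds, shape=mid)) > target_mean:
            low = mid    # mean too high: a larger shape is needed
        else:
            high = mid
    return (low + high) / 2

inputs = (5388, 2720, 4072, 5676, 2101, 6325)  # min/max are assumed bounds
shape = fit_shape(4173, *inputs)
data = build_dataset(*inputs, shape=shape)
deciles = statistics.quantiles(data, n=10, method="inclusive")
print(len(data), statistics.fmean(data), statistics.median(data),
      deciles[0], deciles[-1])
```

The mean can be matched to within floating-point tolerance this way, while the median and deciles land close to their targets rather than exactly on them, because the quantiles are interpolated between neighbouring values.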

You can add a second variable so that the mean is controlled by two numbers, or change the random seed and see whether you get closer values. Alternatively, you can write a function that keeps changing the seed until the values are equal (this might take a couple of minutes or a couple of centuries :)
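A seed search can be bounded so that it always terminates: instead of waiting for an exact match, try a fixed number of seeds and keep the one with the smallest total error. The sketch below is only an illustration. The helper names (`sample`, `error`, `best_seed`) and the error measure (sum of absolute differences) are assumptions, and it uses the standard library `statistics` module rather than NumPy, with the same segment layout and bounds as the answer. The segment sizes assume an even `count`, as in the question.

```python
import random
import statistics

TARGETS = {"mean": 4173, "median": 4072, "q10": 2720, "q90": 5676}

def sample(seed, count=5388, min_value=2101, max_value=6325,
           mean_adjuster=5790):
    """Draw one integer dataset using the same four segments as the answer."""
    rng = random.Random(seed)      # local generator, global state untouched
    tail = count // 10             # values below q10 / above q90
    inner = count // 2 - tail      # values between a decile and the median
    return (
        rng.choices(range(min_value, TARGETS["q10"]), k=tail)
        + rng.choices(range(TARGETS["q10"], TARGETS["median"]), k=inner)
        + rng.choices(range(TARGETS["median"], mean_adjuster), k=inner)
        + rng.choices(range(TARGETS["q90"], max_value), k=tail)
    )

def error(data):
    """Sum of absolute differences between the found and target statistics."""
    deciles = statistics.quantiles(data, n=10, method="inclusive")
    found = {
        "mean": statistics.fmean(data),
        "median": statistics.median(data),
        "q10": deciles[0],
        "q90": deciles[-1],
    }
    return sum(abs(found[key] - TARGETS[key]) for key in TARGETS)

def best_seed(max_tries=50):
    """Return the seed in range(max_tries) with the smallest total error."""
    return min(range(max_tries), key=lambda seed: error(sample(seed)))

seed = best_seed(max_tries=20)
print(seed, error(sample(seed)))
```

Raising `max_tries` can only improve the best error found, but an exact match is not guaranteed for any finite number of tries.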

Cheers!

User contributions licensed under: CC BY-SA