I have the following values that describe a dataset:
Number of Samples: 5388
Mean: 4173
Median: 4072
1st Decile: 2720
9th Decile: 5676
I need to generate any dataset that fits these values. All the examples I found require the standard deviation, which I don't have. How can this be done? Thanks!
Answer
Interesting question! Based on Scott’s suggestions I gave it a quick try.
Inputs:
import random

import pandas as pd
import numpy as np

# fixing the random seed
random.seed(a=1, version=2)

# formatting floats
pd.options.display.float_format = '{:.1f}'.format

# given inputs
count = 5388
mean = 4173
median = 4072
lower_percentile = 10
lower_percentile_value = 2720
upper_percentile = 90
upper_percentile_value = 5676

# bounds of the generated data (not given in the question)
max_value = 6325
min_value = 2101
The Function:
def generate_dataset(count, mean, median, lower_percentile, upper_percentile,
                     lower_percentile_value, upper_percentile_value,
                     min_value, max_value):
    # Calculate the number of values that fall within each percentile band
    p_1_size = int(float(lower_percentile) * float(count) / 100)
    p_4_size = int(count - (float(upper_percentile) * float(count) / 100))
    p_2_size = int((count / 2) - p_1_size)
    p_3_size = int((count / 2) - p_4_size)

    # can be used to adjust the mean; it is the upper bound of the third band,
    # so values above upper_percentile_value also shift the upper percentile
    mean_adjuster = 5790

    # randomly pick the right number of values from each range
    p_1 = random.choices(range(min_value, lower_percentile_value), k=p_1_size)
    p_2 = random.choices(range(lower_percentile_value, median), k=p_2_size)
    p_3 = random.choices(range(median, mean_adjuster), k=p_3_size)
    p_4 = random.choices(range(upper_percentile_value, max_value), k=p_4_size)

    return p_1 + p_2 + p_3 + p_4


dataset = generate_dataset(
    count, mean, median, lower_percentile, upper_percentile,
    lower_percentile_value, upper_percentile_value, min_value, max_value
)
Comparison:
# converting into DataFrame
df = pd.DataFrame({"x": dataset})

new_count = len(df)
new_mean = np.mean(df.x)
new_median = np.quantile(df.x, 0.5)
new_lower_percentile = np.quantile(df.x, lower_percentile / 100)
new_upper_percentile = np.quantile(df.x, upper_percentile / 100)

compare = pd.DataFrame(
    {
        "value": ["count", "mean", "median", "low_p", "high_p"],
        "original": [count, mean, median, lower_percentile_value, upper_percentile_value],
        "new": [new_count, new_mean, new_median, new_lower_percentile, new_upper_percentile],
    }
)
print(compare)
Output:
    value  original    new
0   count      5388 5388.0
1    mean      4173 4173.4
2  median      4072 4072.5
3   low_p      2720 2720.4
4  high_p      5676 5743.0
Getting the values to be exactly equal is a bit tricky when all your values are integers rather than floats.
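If floats are acceptable, one possible way around this is a different construction from the function above: pin the sorted positions that `np.quantile` interpolates between, fill the rest with uniform draws, then nudge the free values until the mean matches. This is only a sketch. The name `pinned_dataset` is made up for illustration, the bounds `lo` and `hi` reuse the assumed `min_value` and `max_value` from above, and it relies on `np.quantile`'s default linear interpolation.

```python
import numpy as np


def pinned_dataset(count, mean, median, p10, p90, lo, hi, seed=1):
    """Float samples matching a target mean, median and 10th/90th percentiles.

    Hypothetical helper; lo/hi are assumed bounds, not part of the question.
    """
    rng = np.random.default_rng(seed)
    knots = [lo, p10, median, p90, hi]
    # Sorted positions that np.quantile interpolates between for q = 0.1, 0.5, 0.9
    pins = [int(np.floor((count - 1) * q)) for q in (0.1, 0.5, 0.9)]
    starts = [0] + [p + 2 for p in pins]
    stops = pins + [count]

    data = np.empty(count)
    lower = np.empty(count)
    upper = np.empty(count)
    # Fill each band between two knots with uniform draws inside that band
    for a, b, s, e in zip(knots[:-1], knots[1:], starts, stops):
        data[s:e] = rng.uniform(a, b, e - s)
        lower[s:e], upper[s:e] = a, b
    # Pin both neighbours of each quantile position to the target value
    for p, v in zip(pins, knots[1:-1]):
        data[p:p + 2] = lower[p:p + 2] = upper[p:p + 2] = v

    # Move every free value a common fraction of the way towards its band
    # edge so that the total changes by exactly the missing amount
    delta = mean * count - data.sum()
    room = (upper - data) if delta > 0 else (data - lower)
    total = room.sum()
    if abs(delta) > total:
        raise ValueError("target mean is not reachable within the given bounds")
    if total > 0:
        data += (delta / total) * room
    return data


data = pinned_dataset(5388, 4173, 4072, 2720, 5676, lo=2101, hi=6325)
print(len(data), data.mean(), np.median(data), np.quantile(data, [0.1, 0.9]))
```

Because the adjustment never moves a value out of its band and leaves the pinned values alone, the median and both deciles stay where they were pinned; the mean matches up to floating-point rounding.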
You can add another variable to control the mean with two numbers, or change the random seed and see whether you get closer values. Alternatively, you can write a function that changes the seed until the values are equal (this might take a couple of minutes or a couple of centuries :).
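A bounded version of that seed search might look like the sketch below. Instead of looping until everything is equal, it tries a fixed number of seeds and keeps the one with the smallest worst-case error. The helpers `generate`, `errors` and `best_seed` are illustrative names, and `generate` is a condensed copy of the function above with the band sizes and limits hard-coded. The 90th percentile is left out of the score by default because the third band runs up to `mean_adjuster = 5790`, past 5676, so changing the seed alone is unlikely to fix it.

```python
import random

import numpy as np

TARGETS = {"mean": 4173, "median": 4072, "low_p": 2720, "high_p": 5676}


def generate(seed, count=5388, min_value=2101, max_value=6325, mean_adjuster=5790):
    """Condensed copy of the generator above, using a private RNG per seed."""
    rng = random.Random(seed)
    tail = count // 10        # values below the 1st / above the 9th decile
    body = count // 2 - tail  # values on each side of the median
    return (rng.choices(range(min_value, 2720), k=tail)
            + rng.choices(range(2720, 4072), k=body)
            + rng.choices(range(4072, mean_adjuster), k=body)
            + rng.choices(range(5676, max_value), k=tail))


def errors(data):
    """Absolute difference between each statistic of `data` and its target."""
    stats = {
        "mean": np.mean(data),
        "median": np.quantile(data, 0.5),
        "low_p": np.quantile(data, 0.1),
        "high_p": np.quantile(data, 0.9),
    }
    return {k: abs(stats[k] - TARGETS[k]) for k in TARGETS}


def worst_error(seed, keys):
    errs = errors(generate(seed))
    return max(errs[k] for k in keys)


def best_seed(max_seeds=50, keys=("mean", "median", "low_p")):
    """Try seeds 0..max_seeds-1 and return the best one with its errors."""
    seed = min(range(max_seeds), key=lambda s: worst_error(s, keys))
    return seed, errors(generate(seed))


seed, errs = best_seed()
print(seed, errs)
```

Raising `max_seeds` trades run time for a closer match; the search always terminates, but it does not promise exact equality.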
Cheers!