Is there a way to do this? I cannot seem an easy way to interface pandas series with plotting a CDF.
Advertisement
Answer
In case you are also interested in the values, not just the plot.
import pandas as pd # If you are in jupyter %matplotlib inline
This will always work (discrete and continuous distributions)
# Define your series s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value') df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series # Frequency stats_df = df .groupby('value') ['value'] .agg('count') .pipe(pd.DataFrame) .rename(columns = {'value': 'frequency'}) # PDF stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency']) # CDF stats_df['cdf'] = stats_df['pdf'].cumsum() stats_df = stats_df.reset_index() stats_df
# Plot the discrete Probability Mass Function and CDF. # Technically, the 'pdf label in the legend and the table the should be 'pmf' # (Probability Mass Function) since the distribution is discrete. # If you don't have too many values / usually discrete case stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
Alternative example with a sample drawn from a continuous distribution or you have a lot of individual values:
# Define your series s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
For continuous distributions only
Please note if it is very reasonable to make the assumption that there is only one occurence of each value in the sample (typically encountered in the case of continuous distributions) then the groupby()
+ agg('count')
is not necessary (since the count is always 1).
In this case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)
# Define your series s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value') df = pd.DataFrame(s)
# Get to the CDF directly df['cdf'] = df.rank(method = 'average', pct = True)
# Sort and plot df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)