Plotting CDF of a pandas series in python

Question

Is there a way to do this? I cannot seem an easy way to interface pandas series with plotting a CDF. Answer In case you are also interested in the values, not just the plot. This will always work (discrete and continuous distributions) Alternative example with a sample drawn from a continuous distribution or you have a lot of individual

Accepted Answer

In case you are also interested in the values, not just the plot.import pandas as pd# If you are in jupyter%matplotlib inlineThis will always work (discrete and continuous distributions)# Define your seriess = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')df = pd.DataFrame(s)# Get the frequency, PDF and CDF for each value in the series# Frequencystats_df = df .groupby('value') ['value'] .agg('count') .pipe(pd.DataFrame) .rename(columns = {'value': 'frequency'})# PDFstats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])# CDFstats_df['cdf'] = stats_df['pdf'].cumsum()stats_df = stats_df.reset_index()stats_df# Plot the discrete Probability Mass Function and CDF.# Technically, the 'pdf label in the legend and the table the should be 'pmf'# (Probability Mass Function) since the distribution is discrete.# If you don't have too many values / usually discrete casestats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)Alternative example with a sample drawn from a continuous distribution or you have a lot of individual values:# Define your seriess = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')# ... all the same calculation stuff to get the frequency, PDF, CDF# Plotstats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)For continuous distributions onlyPlease note if it is very reasonable to make the assumption that there is only one occurence of each value in the sample (typically encountered in the case of continuous distributions) then the groupby() + agg('count') is not necessary (since the count is always 1).In this case, a percent rank can be used to get to the cdf directly.Use your best judgment when taking this kind of shortcut! :)# Define your seriess = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')df = pd.DataFrame(s)# Get to the CDF directlydf['cdf'] = df.rank(method = 'average', pct = True)# Sort and plotdf.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)

Advertisement

Answer

This will always work (discrete and continuous distributions)

For continuous distributions only