
scipy.stats.cumfreq() isn’t the cumulative frequency I’m looking for

While reading a statistics book, I’m also practicing with Python.

My book asks me to calculate the cumulative workforce (the cumulative count) and the cumulative relative frequency for a simple table of jobs per sector.

Sector          Number of jobs
Agriculture           21143585
Construction          35197834
Industry              69941779
Manufacturing         64298386
Services             368931820

I wrote this Python program:

import numpy as np
import scipy.stats

if __name__ == '__main__':
    emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]
    print("effectif cumulé : ", np.cumsum(emploi_par_activité))
    print("fréquences cumulées", scipy.stats.cumfreq(emploi_par_activité))

which prints:

effectif cumulé :  [ 21143585  56341419 126283198 190581584 559513404]
fréquences cumulées CumfreqResult(cumcount=array([2., 4., 4., 4., 4., 4., 4., 4., 4., 5.]), lowerlimit=1822016.388888888, binsize=38643137.222222224, extrapoints=0)

My book agrees with the cumulative workforce, but not with the cumulative frequency, which should be:

0.03778924
0.10069717
0.22570183
0.34062023
1

So I was misled by the name of scipy.stats.cumfreq: it sounds like the function that does what I want, but it isn’t.

What is the proper method I should use instead?


Answer

cumfreq is for “raw” data; that is, data that has not already been counted or aggregated by category. If you had a big database with 559513404 records, where each record corresponds to a distinct person and one field holds a number that encodes their job sector (0=Agriculture, 1=Construction, etc.), then you could apply cumfreq to the data in that field. (For data like that, consisting of very small integers, the function numpy.bincount is more appropriate and much more efficient.)
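To make the distinction concrete, here is a minimal sketch with a tiny, made-up raw dataset of twelve people (the array `raw` and its values are purely illustrative and are not the real data). It counts the categories with numpy.bincount and, for comparison, calls cumfreq with one bin per category:

```python
import numpy as np
from scipy import stats

# Hypothetical raw data: one sector code per person
# (0=Agriculture, 1=Construction, 2=Industry, 3=Manufacturing, 4=Services).
raw = np.array([0, 1, 1, 2, 2, 2, 3, 4, 4, 4, 4, 4])

# Count the people in each sector, then accumulate.
counts = np.bincount(raw)           # [1 2 3 1 5]
cum_counts = np.cumsum(counts)      # [ 1  3  6  7 12]
cum_freq = cum_counts / raw.size    # cumulative relative frequencies

# cumfreq works on the same raw data; here the bins are chosen so that
# each integer code falls into its own bin.
res = stats.cumfreq(raw, numbins=5, defaultreallimits=(-0.5, 4.5))

print(cum_counts)
print(res.cumcount)   # same cumulative counts, as floats
print(cum_freq)
```

Note that cumfreq returns cumulative counts, not proportions, so a division by the number of observations is still needed to get relative frequencies.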

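Your data is already aggregated by job type. When you pass it to cumfreq, the five totals are treated as five separate observations and sorted into 10 equal-width bins (the default), which is why cumcount ends at 5. To get the result that you expected, compute the cumulative sum, and then divide each element in the cumulative sum by the total (which happens to be the last element of the cumulative sum):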

In [215]: emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]

In [216]: csum = np.cumsum(emploi_par_activité)

In [217]: csum
Out[217]: array([ 21143585,  56341419, 126283198, 190581584, 559513404])

In [218]: csum/csum[-1]  # fréquences cumulées
Out[218]: array([0.03778924, 0.10069717, 0.22570183, 0.34062023, 1.        ])
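
If you need this more than once, the two steps can be wrapped in a small helper. This is only a sketch: the name `cumulative_frequencies` is made up for illustration and is not part of NumPy or SciPy, and `jobs_per_sector` is just the list from the question under an ASCII name.

```python
import numpy as np

def cumulative_frequencies(counts):
    """Return (cumulative counts, cumulative relative frequencies)
    for data that is already aggregated by category."""
    csum = np.cumsum(np.asarray(counts))
    # The last cumulative count is the grand total.
    return csum, csum / csum[-1]

# Same numbers as in the question.
jobs_per_sector = [21143585, 35197834, 69941779, 64298386, 368931820]

csum, cfreq = cumulative_frequencies(jobs_per_sector)
print("cumulative counts:     ", csum)
print("cumulative frequencies:", cfreq)
```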