While working through a statistics book, I'm also practicing with Python.
My book asks me to calculate the cumulative workforce and the cumulative frequency of a simple list of jobs.
Sector | Number of jobs |
---|---|
Agriculture | 21143585 |
Construction | 35197834 |
Industry | 69941779 |
Manufacturing | 64298386 |
Services | 368931820 |
I wrote this Python program:
    import numpy as np
    import scipy.stats

    if __name__ == '__main__':
        emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]

        print("effectif cumulé : ", np.cumsum(emploi_par_activité))
        print("fréquences cumulées", scipy.stats.cumfreq(emploi_par_activité))
which prints:

    effectif cumulé :  [ 21143585  56341419 126283198 190581584 559513404]
    fréquences cumulées CumfreqResult(cumcount=array([2., 4., 4., 4., 4., 4., 4., 4., 4., 5.]), lowerlimit=1822016.388888888, binsize=38643137.222222224, extrapoints=0)
My book agrees with the cumulative workforce, but not with the cumulative frequency, which should be:
0.03778924 0.10069717 0.22570183 0.34062023 1
So I've been misled by the name of `scipy.stats.cumfreq`: it sounds like the function that does what I want, but it isn't.
What is the proper method I should use instead?
Answer
`cumfreq` is for "raw" data; that is, data that has not already been counted or aggregated by some category. If you had a big database with 559513404 records, where each record corresponds to a distinct person and one field of that record is a number that categorizes their job, with 0=Agriculture, 1=Construction, etc., then you could apply `cumfreq` to the data in that field. (But for data like that, consisting of very small integers, the function `numpy.bincount` is more appropriate, and much more efficient.)
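To illustrate the raw-data case, here is a minimal sketch. The ten records in `codes` are made up for the example (they are not the book's data), and the variable names are only illustrative. It counts each job code with `numpy.bincount` and then derives the cumulative frequencies from those counts:

```python
import numpy as np

# Hypothetical raw data: one small integer job code per person
# (0=Agriculture, 1=Construction, 2=Industrie, 3=Fabrication, 4=Services).
# These ten records are invented for illustration only.
codes = np.array([0, 1, 1, 2, 4, 4, 4, 3, 4, 2])

# Count how many records carry each code; minlength keeps all five
# categories even if one of them never occurs in the data.
counts = np.bincount(codes, minlength=5)

# Running total of the counts, then divided by the grand total
# (the last element of the cumulative sum).
cum_counts = np.cumsum(counts)
cum_freq = cum_counts / cum_counts[-1]

print(counts)
print(cum_counts)
print(cum_freq)
```

Once the raw records have been counted, the data is in the same aggregated form as the table in the question.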
Your data is already aggregated by job type. To get the result that you expected, compute the cumulative sum, and then divide each element in the cumulative sum by the total (which happens to be the last element of the cumulative sum):
    In [215]: emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]

    In [216]: csum = np.cumsum(emploi_par_activité)

    In [217]: csum
    Out[217]: array([ 21143585,  56341419, 126283198, 190581584, 559513404])

    In [218]: csum/csum[-1]  # fréquences cumulées
    Out[218]: array([0.03778924, 0.10069717, 0.22570183, 0.34062023, 1.        ])
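The same computation can be wrapped in a small standalone script. This is only a sketch: the helper name `cumulative_frequencies` is hypothetical (it is not a NumPy or SciPy function), and the comparison values are the ones quoted from the book in the question.

```python
import numpy as np


def cumulative_frequencies(counts):
    """Cumulative relative frequencies of already-aggregated counts.

    Hypothetical helper, not part of NumPy or SciPy.
    """
    csum = np.cumsum(counts)
    # The last element of the cumulative sum is the grand total.
    return csum / csum[-1]


emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]
freq_cum = cumulative_frequencies(emploi_par_activité)

# Values given by the book, as quoted in the question.
expected = [0.03778924, 0.10069717, 0.22570183, 0.34062023, 1.0]

print(freq_cum)
print(np.allclose(freq_cum, expected))
```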