scipy.stats.cumfreq() isn’t the cumulative frequency I’m looking for

Question

While reading a statistics book, I am also practicing with Python. My book asks me to calculate the cumulative workforce (the running total of jobs) and the cumulative frequency of a simple list of jobs by sector:

Sector	Number of jobs
Agriculture	21143585
Construction	35197834
Industry	69941779
Manufacturing	64298386
Services	368931820

I wro…

Accepted Answer

cumfreq is for “raw” data; that is, data that has not already been counted or aggregated by category. If you had a big database with 559513404 records, where each record corresponds to a distinct person and one field in that record is a number that categorizes their job (0=Agriculture, 1=Construction, etc.), then you could apply cumfreq to the data in that field. (But for data like that, i.e. very small integers, the function numpy.bincount is more appropriate and much more efficient.)

Your data is already aggregated by job type. To get the result that you expected, compute the cumulative sum, and then divide each element of the cumulative sum by the total (which happens to be the last element of the cumulative sum). With numpy imported as np:

    In [215]: emploi_par_activité = [21143585, 35197834, 69941779, 64298386, 368931820]

    In [216]: csum = np.cumsum(emploi_par_activité)

    In [217]: csum
    Out[217]: array([ 21143585,  56341419, 126283198, 190581584, 559513404])

    In [218]: csum/csum[-1]  # cumulative frequencies
    Out[218]: array([0.03778924, 0.10069717, 0.22570183, 0.34062023, 1.        ])
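To make the raw-versus-aggregated distinction concrete, here is a minimal sketch of the raw-data case using numpy.bincount. The array `raw_codes` is a small made-up sample, not data from the question: it assumes one integer job code per person, with only three categories for brevity.

```python
import numpy as np

# Hypothetical raw data: one job-category code per person
# (0=Agriculture, 1=Construction, 2=Industry). Made-up sample of 8 people.
raw_codes = np.array([0, 2, 1, 2, 2, 0, 1, 2])

# Step 1: aggregate the raw records into one count per category.
counts = np.bincount(raw_codes)

# Step 2: the same recipe as for already-aggregated data:
# cumulative counts, then divide by the total (the last element).
cum_counts = np.cumsum(counts)
cum_freq = cum_counts / cum_counts[-1]

print(counts)      # per-category counts
print(cum_counts)  # cumulative counts
print(cum_freq)    # cumulative relative frequencies
```

Once the raw codes have been counted, the remaining steps are the same cumulative sum and division shown for the aggregated job table, which is why that table can skip the counting step entirely.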



Answer