I compared the performance of the `mean` function of the `statistics` module with the simple `sum(l)/len(l)` method and found the `mean` function to be very slow for some reason. I used `timeit` with the two code snippets below to compare them. Does anyone know what causes the massive difference in execution speed? I’m using Python 3.5.

    from timeit import repeat

    setup = '''from random import randint; from statistics import mean; l=[randint(0, 10000) for i in range(10000)]'''
    print(min(repeat('mean(l)', setup, repeat=20, number=10)))

The code above executes in about 0.043 seconds on my machine.

    from timeit import repeat

    setup = '''from random import randint; from statistics import mean; l=[randint(0, 10000) for i in range(10000)]'''
    print(min(repeat('sum(l)/len(l)', setup, repeat=20, number=10)))

The code above executes in about 0.000565 seconds on my machine.

Python’s `statistics` module is not built for speed, but for precision.

The specification for this module states:

> The built-in sum can lose accuracy when dealing with floats of wildly differing magnitude. Consequently, the above naive mean fails this “torture test”
>
> `assert mean([1e30, 1, 3, -1e30]) == 1`
>
> returning 0 instead of 1, a purely computational error of 100%.
>
> Using math.fsum inside mean will make it more accurate with float data, but it also has the side-effect of converting any arguments to float even when unnecessary. E.g. we should expect the mean of a list of Fractions to be a Fraction, not a float.
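The behaviour described in that passage is easy to check yourself. The snippet below is a small sketch using only the standard library; the variable names are mine, and the `Fraction` values are arbitrary examples rather than anything taken from the specification.

```python
import math
from fractions import Fraction
from statistics import mean

data = [1e30, 1, 3, -1e30]

# Naive approach: the small terms are absorbed by 1e30 and lost.
naive = sum(data) / len(data)

# math.fsum tracks the lost low-order bits, so the float result is correct.
fsum_based = math.fsum(data) / len(data)

# statistics.mean also gets the torture test right.
stats = mean(data)

print(naive, fsum_based, stats)

# The type issue: fsum always returns a float, mean keeps Fractions.
fracs = [Fraction(1, 2), Fraction(1, 3)]
print(type(math.fsum(fracs) / len(fracs)).__name__)
print(type(mean(fracs)).__name__)
```

So `math.fsum` fixes the accuracy problem for floats but discards the input type, which is the trade-off the quoted text points out.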

Indeed, if we take a look at the implementation of `_sum()` in this module, the first lines of the function’s docstring seem to confirm that:

    def _sum(data, start=0):
        """_sum(data [, start]) -> (type, sum, count)

        Return a high-precision sum of the given numeric data as a fraction,
        together with the type to be converted to and the count of items.

        [...]
        """

So yes, the `statistics` implementation of `sum`, instead of being a simple one-liner call to Python’s built-in `sum()` function, takes about 20 lines by itself, with a nested `for` loop in its body.

This happens because `statistics._sum` chooses to guarantee the maximum precision for all types of numbers it could encounter (even if they differ widely from one another), instead of simply emphasizing speed.

Hence, it appears normal that the built-in `sum` proves about a hundred times faster. The cost is much lower precision if you happen to call it with exotic numbers.
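To get a feel for where the time goes, here is a rough sketch of the general idea of summing exactly with rational numbers. `exact_mean` is a hypothetical helper of my own, not the actual code of `statistics._sum`, and unlike the real module it always returns a float instead of converting back to the input type.

```python
from fractions import Fraction


def exact_mean(data):
    """Hypothetical helper: mean computed with exact rational arithmetic."""
    total = Fraction(0)
    count = 0
    for x in data:
        # Fraction(x) converts an int or a finite float without rounding,
        # so no information is lost while accumulating.
        total += Fraction(x)
        count += 1
    # Rounding happens only once, at the very end.
    return float(total / count)


print(exact_mean([1e30, 1, 3, -1e30]))
```

Every addition here is an arbitrary-precision operation done in pure Python, whereas the built-in `sum` on floats is a tight C loop. That difference alone is enough to make a large speed gap plausible.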

**Other options**

If you need to prioritize speed in your algorithms, you should have a look at NumPy instead, whose algorithms are implemented in C.

NumPy’s `mean` is not as precise as `statistics` by a long shot, but it implements (since 2013) a routine based on pairwise summation, which is better than a naive `sum/len` (more info in the link).
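For readers who have not met pairwise summation before, this is a minimal pure-Python sketch of the technique. It is not NumPy's implementation: the function names and the block size of 8 are my own assumptions, chosen only to keep the example short.

```python
def pairwise_sum(values, block=8):
    """Sum a list by recursively splitting it in half (illustrative sketch)."""
    n = len(values)
    if n <= block:
        # Small chunks are summed naively.
        total = 0.0
        for v in values:
            total += v
        return total
    mid = n // 2
    # Adding two partial sums of similar size keeps rounding error growth low.
    return pairwise_sum(values[:mid], block) + pairwise_sum(values[mid:], block)


def pairwise_mean(values):
    return pairwise_sum(values) / len(values)


print(pairwise_mean([0.1] * 1000))
```

The rounding error of this scheme grows roughly with the logarithm of the number of elements rather than linearly, which is why it beats a naive running total while staying fast.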

However…

    import numpy as np
    import statistics

    np_mean = np.mean([1e30, 1, 3, -1e30])
    statistics_mean = statistics.mean([1e30, 1, 3, -1e30])

    print('NumPy mean: {}'.format(np_mean))
    print('Statistics mean: {}'.format(statistics_mean))

    > NumPy mean: 0.0
    > Statistics mean: 1.0
