Optimizing a standard deviation function (pandas, NumPy, Python)

Question

The pandas std function below calculates the standard deviation of every nth value defined by number. So it would take the values of PC_list with indexes [0,1,2,3,4,5] and calculate the standard deviation, then the indexes [1,2,3,4,5] and calculate the standard deviation, and so on until the end of PC_list. I am trying to optimize the code by trying to make it

Accepted Answer

numpy is a pandas dependency, which is why pandas vectorized functions are so fast, but for a little more speed, use @numba.njit as a function decorator.

Use numba to call .std():

Numba Performance Tips
Numba translates Python functions to optimized machine code at runtime.
njit
pandas User Guide: Enhancing Performance
As shown in the guide, numba requires numpy arrays from pandas.

In this benchmark, over the entire sample size range, using .std() with @numba.njit is 2.2x faster than .std() alone.

```python
# Run in IPython or Jupyter: %timeit is an IPython magic command.
import numba
import numpy as np
import pandas as pd
from collections import defaultdict


@numba.njit
def test(d):
    return d.std()


data = defaultdict(list)

for x in range(100, 596061):  # number of unique elements

    # create array
    np.random.seed(365)
    a = np.random.rand(x, 1) * 1000

    # timeit for std with numba
    res1 = %timeit -r2 -n1 -q -o test(a)

    # timeit for std without numba
    res2 = %timeit -r2 -n1 -q -o a.std()

    data['std_numba'].append(res1.average)
    data['std'].append(res2.average)
    data['idx'].append(x)

# create a dataframe from data
df = pd.DataFrame(data).iloc[1:, :]

# set the index
df.set_index('idx', inplace=True)

# calculate the rolling mean to smooth out the plot
df = df.rolling(1000).mean()

# calculate the difference
df['diff'] = df['std'] - df['std_numba']

# plot
ax = df.plot(xlabel='number of rows', ylabel='time (s)', figsize=(8, 6))
ax.grid()
```
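To tie the @numba.njit idea back to the windowed calculation in the question, here is a minimal sketch of a jitted rolling standard deviation. It is not the asker's original function: rolling_std, pc_list, and the window size of 6 are hypothetical names and values chosen for illustration, and the sketch assumes the window is no longer than the input. If numba is not installed, it falls back to plain Python so the code still runs (without the speedup).

```python
import numpy as np

try:
    import numba
    njit = numba.njit
except ImportError:
    # Assumption for this sketch: run un-jitted when numba is unavailable.
    def njit(func):
        return func


@njit
def rolling_std(values, window):
    # Standard deviation of each full window of `window` consecutive values.
    # ndarray.std() uses ddof=0 (population standard deviation).
    n = values.shape[0] - window + 1
    out = np.empty(n)
    for i in range(n):
        out[i] = values[i:i + window].std()
    return out


# Hypothetical stand-in for the question's PC_list
pc_list = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
result = rolling_std(pc_list, 6)  # one value per window: indexes 0-5, 1-6, 2-7
```

For data held in a pandas Series, pass series.to_numpy() rather than the Series itself, since numba works on numpy arrays. Note that numpy's .std() defaults to ddof=0 while pandas' .std() defaults to ddof=1, so the two will not match exactly unless the same ddof is used.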

Answer