The pandas `std` call below calculates a rolling standard deviation over windows of `number` consecutive values. With `number = 5`, it takes the values of `PC_list` at indexes `[0,1,2,3,4]` and calculates the standard deviation, then the indexes `[1,2,3,4,5]`, and so on until the end of `PC_list`. The code is already very fast, but I want to see how much more performance I can get out of it. Is there a way to speed it up, for example by using `np.split` to divide the data into chunks, or some other method that would decrease the runtime? The original `PC_list` has over 2.6 million elements, and the `std` call takes about 150 ms to run. The `PC_list` array below is a portion of it.
```python
import pandas as pd
import numpy as np

PC_list = np.array([
    417.88, 417.88, 418.24, 417.88, 418.6, 418.6, 418.6, 418.6, 418.6, 418.75,
    418.75, 418.75, 418.75, 418.56, 418.56, 419.19, 418.95, 419.19, 419.38, 419.38,
    419.43, 418.75, 418.57, 419.31, 419.51, 416.08, 416., 416.74, 416.74, 416.74,
    416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74,
    416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74, 416.74,
])
number = 5

# Rolling standard deviation over windows of `number` values
std = pd.Series(PC_list).rolling(number).std().dropna().to_numpy()
```
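To make the window behaviour concrete, here is a minimal pure-Python sketch of what the rolling call computes. It assumes the pandas default of a sample standard deviation (`ddof=1`), which is also what `statistics.stdev` returns. The names `rolling_std` and `sample` are illustrative only and not part of the original code.

```python
import statistics


def rolling_std(values, window):
    """Sample standard deviation (ddof=1) of each full window of `window` values."""
    return [
        statistics.stdev(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


# First few values of PC_list, used here only as a small illustration
sample = [417.88, 417.88, 418.24, 417.88, 418.6, 418.6, 418.6]
result = rolling_std(sample, 5)

# One value per full window: indexes [0..4], [1..5], [2..6]
print(len(result))
```

This loop is only meant as a reference for checking faster versions, not as an optimization in itself.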
Answer
`numpy` is a `pandas` dependency, which is why `pandas` vectorized functions are so fast, but for a little more speed, use `@numba.njit` as a function decorator.
- Use numba to call the `.std()`
- Numba Performance Tips
- Numba translates Python functions to optimized machine code at runtime
- `.njit`
- pandas User Guide: Enhancing Performance
- As shown in the guide, numba requires numpy arrays from pandas.
- Over the entire sample size range, using `.std()` with `@numba.njit` is 2.2x faster than `.std()` alone.
```python
import numba
import numpy as np
import pandas as pd
from collections import defaultdict

# Note: %timeit is an IPython magic, so run this in IPython or Jupyter.


@numba.njit
def test(d):
    return d.std()


data = defaultdict(list)

for x in range(100, 596061):  # number of unique elements
    # create array
    np.random.seed(365)
    a = np.random.rand(x, 1) * 1000

    # timeit for std with numba
    res1 = %timeit -r2 -n1 -q -o test(a)
    # timeit for std without numba
    res2 = %timeit -r2 -n1 -q -o a.std()

    data['std_numba'].append(res1.average)
    data['std'].append(res2.average)
    data['idx'].append(x)

# create a dataframe from data
df = pd.DataFrame(data).iloc[1:, :]

# set the index
df.set_index('idx', inplace=True)

# calculate the rolling mean to smooth out the plot
df = df.rolling(1000).mean()

# calculate the difference
df['diff'] = df['std'] - df['std_numba']

# plot
ax = df.plot(xlabel='number of rows', ylabel='time (s)', figsize=(8, 6))
ax.grid()
```
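The benchmark above times `.std()` over a whole array, while the question asks about a rolling window. As one possible NumPy-only variant of the rolling case, the sketch below uses `numpy.lib.stride_tricks.sliding_window_view` (available in NumPy 1.20 and later) with `ddof=1` to match the pandas default. The names `rolling_std_numpy` and `PC_sample` are hypothetical, and whether this is actually faster than `pd.Series.rolling(...).std()` depends on the window size and data length, so it should be timed on the real data.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_std_numpy(values, window):
    """Rolling sample standard deviation (ddof=1) over each full window."""
    arr = np.asarray(values, dtype=np.float64)
    # View of shape (len(arr) - window + 1, window); no data is copied here
    windows = sliding_window_view(arr, window)
    return windows.std(axis=1, ddof=1)


# First few values of PC_list, used here only as a small illustration
PC_sample = np.array([417.88, 417.88, 418.24, 417.88, 418.6, 418.6, 418.6, 418.6])
std_np = rolling_std_numpy(PC_sample, 5)
```

The result has one entry per full window, which corresponds to the pandas output after `.dropna()`.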