
Optimizing a standard deviation function (pandas, NumPy, Python)

The pandas std function below calculates the standard deviation over consecutive values of PC_list, in groups whose size is defined by number. It takes the values of PC_list at indexes [0,1,2,3,4,5] and calculates their standard deviation, then the values at indexes [1,2,3,4,5], and so on until the end of PC_list. The code is already very fast, but I want to see how much more performance I can get out of it. Is there a way to reduce the runtime, for example by using np.split to divide the data into chunks, or by some other method? The original PC_list has over 2.6 million arrays, and the std function takes about 150 ms to run on it; the current PC_list array is only a portion of it.
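To make the computation concrete, here is a minimal sketch of a windowed standard deviation in pandas. It is an assumption about the general shape of the problem, not the asker's code: it uses `pandas.Series.rolling`, the values in `PC_list` are made up, and `number = 6` is chosen only to match the six indexes listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins; the real PC_list is said to hold over 2.6 million entries.
number = 6
PC_list = np.array([457.0, 458.0, 457.5, 459.0, 460.0, 459.5, 461.0, 460.5])

def rolling_std(values, window):
    """Sample standard deviation (pandas default, ddof=1) of each full window."""
    # rolling() yields NaN until a full window is available; dropna() removes those.
    return pd.Series(values).rolling(window).std().dropna().to_numpy()

result = rolling_std(PC_list, number)
print(result.shape)  # one value per full window: len(PC_list) - number + 1
```

Each output element corresponds to one window of `number` consecutive values, shifted by one position from the previous window.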


Answer

  • NumPy is a pandas dependency, which is why pandas' vectorized functions are already so fast; for a little more speed, use @numba.njit as a function decorator.
  • Wrap the .std() call in a function decorated with @numba.njit.
  • See the pandas User Guide: Enhancing Performance.
    • As shown in the guide, numba works on NumPy arrays, so extract them from the pandas objects (e.g. with .to_numpy()) before passing them in.
  • Over the entire sample size range tested, using .std() with @numba.njit is 2.2x faster than .std() alone.
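
The approach in the list above could look roughly like the following. This is a hedged sketch rather than the benchmark code behind the 2.2x figure: the function name `rolling_std_numba` is hypothetical, the windowed loop is an assumption about how the data is processed, and the standard deviation is computed by hand with ddof=1 so that it matches the pandas default. If numba is not installed, the sketch falls back to plain Python and gives no speedup.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs without numba (no speedup in that case).
    def njit(func):
        return func

@njit
def rolling_std_numba(values, window):
    """Sample standard deviation (ddof=1) over each full window of `values`.

    Assumes `values` is a 1-D float64 NumPy array and `window` >= 2.
    """
    n = max(values.shape[0] - window + 1, 0)  # number of full windows
    out = np.empty(n, dtype=np.float64)
    for i in range(n):
        chunk = values[i:i + window]
        mean = chunk.mean()
        out[i] = np.sqrt(((chunk - mean) ** 2).sum() / (window - 1))
    return out

# Made-up data; with a pandas Series, pass series.to_numpy() instead.
sample = np.random.default_rng(0).normal(size=1_000)
stds = rolling_std_numba(sample, 5)
print(stds.shape)
```

The first call includes numba's compilation time, so any timing comparison should be made on later calls.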