Calculating a for loop with different indexes simultaneously

Question

I have the following for loop:

This for loop takes a long time to calculate the values for the data frame, as it has to loop 50 times for each row (it takes approximately 62 seconds). I tried to use a multiprocessing pool from this question. My code looks like this now:

I run the function asynchronously with different values for

Accepted Answer

Solution Using NumPy vectorization

Issue

- The line if(index-i > 0): should be if(index-i >= 0): otherwise the case index - i == 0 (the comparison against row 0) is skipped.
- Use 'Close' rather than 'Trade Close' (this doesn't matter for performance, but it avoids renaming the column after pulling data from the web).

Code

    import numpy as np
    import pandas as pd

    def compute_using_np(df, start_index, end_index):
        '''
            Using numpy to vectorize computation.
            Assumes df has a 0, 1, 2, ... index and start_index >= 1.
        '''
        nrows = len(df)
        ncols = end_index - start_index

        # Container for pairwise differences, pre-filled with NaN
        pair_wise_diff = np.empty((nrows, ncols))
        pair_wise_diff.fill(np.nan)

        # Get values of the closing price column as a numpy 1D array
        values = df['Close'].values

        # Compute differences for different offsets
        for offset in range(start_index, end_index):
            # Using numpy to compute vectorized difference (i.e. faster computation)
            diff = np.abs(values[offset:] - values[:-offset]) / 2.0

            # Update result; the first `offset` rows stay NaN
            pair_wise_diff[offset:, offset - start_index] = diff

        # Place into DataFrame
        columns = ["EMA%d" % i for i in range(start_index, end_index)]
        df_result = pd.DataFrame(data=pair_wise_diff, index=np.arange(nrows), columns=columns)

        # Add result to df, merging on index
        return df.join(df_result)

Usage

    df_result = compute_using_np(df, 1, 51)

Performance

Summary

- Posted Code: 37.9 s ± 143 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
- Numpy Code: 1.56 ms ± 27.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Result: more than 20,000 times faster (37.9 s / 1.56 ms ≈ 24,000)

Test Code

    import pandas_datareader as dr
    import pandas as pd
    import numpy as np

    def calculateEMAs(df, start_index, end_index):
        '''
           Posted code changed 1) use Python PEP 8 naming convention,
                               2) corrected conditional
        '''
        for index, row in df.iterrows():
            for i in range(start_index, end_index):
                if index - i >= 0:
                    # replace this with EMA formula
                    df.loc[index, "EMA%d" % i] = abs(df.iloc[index - i]["Close"] - df.iloc[index]["Close"]) / 2
        return df

    def compute_using_np(df, start_index, end_index):
        '''
            Using numpy to vectorize computation.
            Assumes df has a 0, 1, 2, ... index and start_index >= 1.
        '''
        nrows = len(df)
        ncols = end_index - start_index

        # Container for pairwise differences, pre-filled with NaN
        pair_wise_diff = np.empty((nrows, ncols))
        pair_wise_diff.fill(np.nan)

        # Get values of the closing price column as a numpy 1D array
        values = df['Close'].values

        # Compute differences for different offsets
        for offset in range(start_index, end_index):
            # Using numpy to compute vectorized difference (i.e. faster computation)
            diff = np.abs(values[offset:] - values[:-offset]) / 2.0

            # Update result; the first `offset` rows stay NaN
            pair_wise_diff[offset:, offset - start_index] = diff

        # Place into DataFrame
        columns = ["EMA%d" % i for i in range(start_index, end_index)]
        df_result = pd.DataFrame(data=pair_wise_diff, index=np.arange(nrows), columns=columns)

        # Add result to df, merging on index
        return df.join(df_result)

    # Get IBM closing stock prices (777 DataFrame rows)
    df = dr.data.get_data_yahoo('ibm', start='2017-09-01', end='2020-10-02')
    df.reset_index(level=0, inplace=True)   # create index which is 0, 1, 2, ...

    # Time original post
    df1 = df.copy()                    # Copy data since operation is in place
    %timeit calculateEMAs(df1, 1, 51)  # Jupyter Notebook magic method

    # Time Numpy version
    %timeit compute_using_np(df, 1, 51)  # Jupyter Notebook magic method
                                         # No need to copy since operation is not in place
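
To see why the two slices in the answer line up, here is a minimal plain-Python sketch of the same pairing on a tiny list. It is only an illustration, not the answer's code: the helper names offset_half_diffs and loop_half_diffs and the closes values are made up for this example, and it assumes 1 <= offset < len(values). List slicing behaves like the NumPy slicing used above, so values[offset:] and values[:-offset] pair each row with the row offset positions before it.

```python
import math

def offset_half_diffs(values, offset):
    """Return |values[k] - values[k - offset]| / 2 per row, NaN where k < offset.

    Hypothetical helper; assumes 1 <= offset < len(values).
    """
    # values[offset:] holds rows offset..n-1 and values[:-offset] holds rows
    # 0..n-1-offset, so zipping them pairs each row with the row `offset` earlier.
    diffs = [abs(cur - prev) / 2.0
             for cur, prev in zip(values[offset:], values[:-offset])]
    # The first `offset` rows have no earlier partner and stay NaN.
    return [math.nan] * offset + diffs

def loop_half_diffs(values, offset):
    """Row-by-row reference using the corrected `index - offset >= 0` test."""
    out = []
    for index in range(len(values)):
        if index - offset >= 0:
            out.append(abs(values[index - offset] - values[index]) / 2.0)
        else:
            out.append(math.nan)
    return out

closes = [10.0, 12.0, 11.0, 15.0, 14.0]  # made-up closing prices
sliced = offset_half_diffs(closes, 2)
looped = loop_half_diffs(closes, 2)
```

Row 2 (where index equals the offset) is compared against row 0 in both versions. With the original `> 0` test, the loop would leave that row empty, which is the mismatch the corrected conditional removes.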

