Computing the correlation coefficient between two multi-dimensional arrays

Question

I have two arrays that have the shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively). What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.)

Accepted Answer

Correlation (default 'valid' case) between two 2D arrays:

You can simply use matrix multiplication with np.dot, like so:

    out = np.dot(arr_one, arr_two.T)

The correlation with the default "valid" case between each pairwise row combination (row1, row2) of the two input arrays corresponds to the multiplication result at position (row1, row2).

Row-wise correlation coefficient calculation for two 2D arrays:

    import numpy as np

    def corr2_coeff(A, B):
        # Row-wise mean of the input arrays, subtracted from the input arrays themselves
        A_mA = A - A.mean(1)[:, None]
        B_mB = B - B.mean(1)[:, None]

        # Sum of squares across rows
        ssA = (A_mA**2).sum(1)
        ssB = (B_mB**2).sum(1)

        # Finally get corr coeff
        return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

This is based upon this solution to "How to apply corr2 functions in Multidimentional arrays in MATLAB".

Benchmarking

This section compares the runtime performance of the proposed approach against the generate_correlation_map and loopy pearsonr-based approaches listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Please note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as also done in that other answer.
The runtimes are listed next.

Case #1:

    In [106]: A = np.random.rand(1000, 100)

    In [107]: B = np.random.rand(1000, 100)

    In [108]: %timeit corr2_coeff(A, B)
    100 loops, best of 3: 15 ms per loop

    In [109]: %timeit generate_correlation_map(A, B)
    100 loops, best of 3: 19.6 ms per loop

Case #2:

    In [110]: A = np.random.rand(5000, 100)

    In [111]: B = np.random.rand(5000, 100)

    In [112]: %timeit corr2_coeff(A, B)
    1 loops, best of 3: 368 ms per loop

    In [113]: %timeit generate_correlation_map(A, B)
    1 loops, best of 3: 493 ms per loop

Case #3:

    In [114]: A = np.random.rand(10000, 10)

    In [115]: B = np.random.rand(10000, 10)

    In [116]: %timeit corr2_coeff(A, B)
    1 loops, best of 3: 1.29 s per loop

    In [117]: %timeit generate_correlation_map(A, B)
    1 loops, best of 3: 1.83 s per loop

The other loopy pearsonr-based approach seemed too slow, but here are the runtimes for one small data size:

    In [118]: A = np.random.rand(1000, 100)

    In [119]: B = np.random.rand(1000, 100)

    In [120]: %timeit corr2_coeff(A, B)
    100 loops, best of 3: 15.3 ms per loop

    In [121]: %timeit generate_correlation_map(A, B)
    100 loops, best of 3: 19.7 ms per loop

    In [122]: %timeit pearsonr_based(A, B)
    1 loops, best of 3: 33 s per loop
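The benchmarks above leave out the value-correctness check, so here is a minimal sketch of one way to sanity-check the approach on small inputs. It is not the code that was timed: the name corr2_coeff_checked, the exact form of the column-count check, and the use of np.corrcoef and np.correlate as references are my own assumptions for illustration.

```python
import numpy as np


def corr2_coeff_checked(A, B):
    """Row-wise Pearson correlation between A (N x T) and B (M x T).

    Hypothetical variant of corr2_coeff with an explicit shape check;
    the check used in the timed runs may have looked different.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.ndim != 2 or B.ndim != 2 or A.shape[1] != B.shape[1]:
        raise ValueError("A and B must be 2D with the same number of columns")

    # Subtract each row's mean from that row
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean(axis=1, keepdims=True)

    # Sum of squared deviations per row
    ssA = (A_mA ** 2).sum(axis=1)
    ssB = (B_mB ** 2).sum(axis=1)

    # (N x M) matrix of correlation coefficients
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.outer(ssA, ssB))


rng = np.random.default_rng(0)
A = rng.random((4, 50))  # N = 4, T = 50
B = rng.random((3, 50))  # M = 3, T = 50

out = corr2_coeff_checked(A, B)

# np.corrcoef(A, B) stacks the rows of A and B and returns an
# (N + M) x (N + M) matrix; the top-right N x M block holds the
# A-row versus B-row coefficients.
n = A.shape[0]
reference = np.corrcoef(A, B)[:n, n:]
matches_corrcoef = np.allclose(out, reference)

# The plain np.dot result should equal np.correlate in "valid" mode
# for every pair of equal-length rows (a single value per pair).
dot_valid = np.dot(A, B.T)
correlate_valid = np.array(
    [[np.correlate(a, b, mode="valid")[0] for b in B] for a in A]
)
matches_correlate = np.allclose(dot_valid, correlate_valid)

print(out.shape, matches_corrcoef, matches_correlate)
```

np.corrcoef computes the full (N + M) x (N + M) matrix, so it does more work than needed when only the cross block is wanted; it serves here purely as a reference for small inputs.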

Answer