I have two arrays with shapes N X T and M X T. I'd like to compute the correlation coefficient across T between every possible pair of rows n and m (from N and M, respectively).
What's the fastest, most pythonic way to do this? (Looping over N and M would seem to me to be neither fast nor pythonic.) I'm expecting the answer to involve numpy and/or scipy. Right now my arrays are numpy arrays, but I'm open to converting them to a different type.
I'm expecting my output to be an array with the shape N X M.
N.B. When I say “correlation coefficient,” I mean the Pearson product-moment correlation coefficient.
Here are some things to note:
- The numpy function correlate requires input arrays to be one-dimensional.
- The numpy function corrcoef accepts two-dimensional arrays, but they must have the same shape.
- The scipy.stats function pearsonr requires input arrays to be one-dimensional.
Answer
Correlation (default ‘valid’ case) between two 2D arrays:
You can simply use matrix multiplication with np.dot, like so:
out = np.dot(arr_one,arr_two.T)
For each pair of rows (row1, row2) taken from the two input arrays, the correlation in the default "valid" mode is the value at position (row1, row2) of this matrix product.
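As a quick sanity check of that claim, the sketch below compares the matrix product against np.correlate called once per pair of rows. The array sizes and the seed are arbitrary choices made for illustration, not values from the question.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
arr_one = rng.random((3, 5))        # N x T
arr_two = rng.random((4, 5))        # M x T

# All pairwise row products in one call: shape (N, M)
out = np.dot(arr_one, arr_two.T)

# The same values, one pair at a time, via np.correlate (default mode is "valid").
# For two equal-length 1D inputs, "valid" mode returns a single-element array.
looped = np.array([[np.correlate(r1, r2)[0] for r2 in arr_two]
                   for r1 in arr_one])

print(out.shape)                    # (3, 4)
print(np.allclose(out, looped))     # True
```

Note that this is the raw (unnormalized) correlation, not the Pearson coefficient asked about in the question.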
Row-wise Correlation Coefficient calculation for two 2D arrays:
import numpy as np

def corr2_coeff(A, B):
    # Row-wise mean of input arrays, subtracted from the input arrays themselves
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]

    # Sum of squares across rows
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)

    # Finally, get the correlation coefficients as an N x M array
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))
This is based upon this solution to How to apply corr2 functions in Multidimentional arrays in MATLAB
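To see that this function does return Pearson coefficients with the requested N X M shape, here is a minimal check against np.corrcoef. It assumes that stacking the rows of both arrays and taking the off-diagonal block of the resulting (N+M) X (N+M) matrix is an acceptable reference; the array sizes and seed are made up for illustration.

```python
import numpy as np

def corr2_coeff(A, B):
    # Same function as in the answer, repeated so this snippet runs on its own
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

rng = np.random.default_rng(42)     # arbitrary seed
A = rng.random((6, 50))             # N = 6 rows, T = 50 samples
B = rng.random((4, 50))             # M = 4 rows, T = 50 samples

result = corr2_coeff(A, B)

# Reference: np.corrcoef treats each row as a variable, so the stacked input
# yields an (N+M) x (N+M) matrix; its top-right block holds the A-vs-B pairs.
N = A.shape[0]
reference = np.corrcoef(np.vstack([A, B]))[:N, N:]

print(result.shape)                     # (6, 4)
print(np.allclose(result, reference))   # True
```

The np.corrcoef route also computes the A-vs-A and B-vs-B blocks, which are thrown away here, so it does more work than needed when only the cross block is wanted.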
Benchmarking
This section compares the runtime of the proposed approach against generate_correlation_map and the loopy pearsonr based approach listed in the other answer (taken from the function test_generate_correlation_map(), without the value-correctness verification code at the end of it). Please note that the timings for the proposed approach also include a check at the start for an equal number of columns in the two input arrays, as is also done in that other answer. The runtimes are listed next.
Case #1:
In [106]: A = np.random.rand(1000, 100)
In [107]: B = np.random.rand(1000, 100)
In [108]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15 ms per loop
In [109]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.6 ms per loop
Case #2:
In [110]: A = np.random.rand(5000, 100)
In [111]: B = np.random.rand(5000, 100)
In [112]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 368 ms per loop
In [113]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 493 ms per loop
Case #3:
In [114]: A = np.random.rand(10000, 10)
In [115]: B = np.random.rand(10000, 10)
In [116]: %timeit corr2_coeff(A, B)
1 loops, best of 3: 1.29 s per loop
In [117]: %timeit generate_correlation_map(A, B)
1 loops, best of 3: 1.83 s per loop
The other loopy pearsonr based approach seemed too slow, but here are the runtimes for one small data size:
In [118]: A = np.random.rand(1000, 100)
In [119]: B = np.random.rand(1000, 100)
In [120]: %timeit corr2_coeff(A, B)
100 loops, best of 3: 15.3 ms per loop
In [121]: %timeit generate_correlation_map(A, B)
100 loops, best of 3: 19.7 ms per loop
In [122]: %timeit pearsonr_based(A, B)
1 loops, best of 3: 33 s per loop
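For anyone wanting to reproduce a comparison like this outside IPython, the following is a rough sketch using the standard timeit module. Two names here are hypothetical and not from either answer: corr2_coeff_checked is one possible form of the column-count check mentioned above, and loop_baseline is a stand-in for a loopy approach that calls np.corrcoef once per pair (it is not the pearsonr_based function that was timed). The array sizes are kept small so the loop finishes quickly, and actual timings will vary by machine.

```python
import timeit
import numpy as np

def corr2_coeff(A, B):
    # Same function as in the answer, repeated so this snippet runs on its own
    A_mA = A - A.mean(1)[:, None]
    B_mB = B - B.mean(1)[:, None]
    ssA = (A_mA**2).sum(1)
    ssB = (B_mB**2).sum(1)
    return np.dot(A_mA, B_mB.T) / np.sqrt(np.dot(ssA[:, None], ssB[None]))

def corr2_coeff_checked(A, B):
    # Hypothetical wrapper: check for an equal number of columns, then compute
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of columns")
    return corr2_coeff(A, B)

def loop_baseline(A, B):
    # Hypothetical loopy baseline: one np.corrcoef call per pair of rows
    out = np.empty((A.shape[0], B.shape[0]))
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            out[i, j] = np.corrcoef(a, b)[0, 1]
    return out

rng = np.random.default_rng(0)      # arbitrary seed
A = rng.random((50, 100))
B = rng.random((50, 100))

# Both approaches should agree before their timings are compared
agree = np.allclose(corr2_coeff_checked(A, B), loop_baseline(A, B))
print("results agree:", agree)

# Best of 3 repeats, reported per call
t_vec = min(timeit.repeat(lambda: corr2_coeff_checked(A, B), number=10, repeat=3)) / 10
t_loop = min(timeit.repeat(lambda: loop_baseline(A, B), number=1, repeat=3))
print(f"vectorized: {t_vec * 1e3:.3f} ms per call")
print(f"loop:       {t_loop * 1e3:.3f} ms per call")
```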