NumPy array row differences

Question

I have a NumPy array vectors = np.random.randn(rows, cols). I want to find differences between its rows according to some other array diffs which is sparse and "2-hot": containing a 1 in its column corresponding to the first row of vectors and a -1 corresponding to the second row. Perhaps an example shall make it clearer: then I can compute

Accepted Answer

I can compute the row differences by simply diffs @ vectors.This is very inefficient. A matrix multiplication runs in O(n*m*k) for a (n,m) multiplied by a (m,k) one. In your case, there is only two values per line and you do not actually need a multiplication by 1 or -1. Your problem can be computed in O(n*k) time (ie. m times faster).Unfortunately this is slow for diffs of 10_000x1000 and vectors 1000x15_000. I can get a speedup by using scipy.sparse.The thing is the input data representation is inefficient. When diff is an array of size (10_000,1000), this is not reasonable to use a dense matrix that would be ~1000 times bigger than needed nor a sparse matrix that is not optimized for having only two non-zero values (especially 1 and -1). You need to store the position of the non-zeros values in a 2D array called sel_rows of shape (2,n) where the first row contains the location of the 1 and the second one contains the location of the -1 in the diff 2D array. Then, you can extract the rows of vectors for example with vectors[sel_rows[0]]. You can perform the final operation with vectors[sel_rows[0,:]] - vectors[sel_rows[1,:]]. This approach should be drastically faster than a dense matrix product and it may be a bit faster than a sparse one regarding the target machine.While the above solution is simple, it create multiple temporary arrays that are not cache-friendly since your output array should take 10_000 * 15_000 * 8 = 1.1 GiB (which is quite huge). You can use Numba so to remove temporary array and so improve the performance. Multiple threads can be used to improve performance even further. Here is an untested code:import numba as nb@nb.njit('(int_[:,::1], float64[:,::1])', parallel=True)def compute(diffs, vectors):    n, k = diffs.shape[0], vectors.shape[1]    assert diffs.shape[1] == 2    res = np.empty((n, k))    for i in nb.prange(n):        a, b = diffs[i]        for j in range(k):            # Compute nb.abs here if needed so to avoid             # creating new temporary arrays            res[i, j] = vectors[a, j] - vectors[b, j]    return resThis above code should be nearly optimal. It should be memory bound and able to saturate the memory bandwidth. Note that writing such huge arrays in memory take some time as well as reading (twice) the input array. On x86-64 platforms, a basic implementation should move 4.4 GiB of data from/to the RAM. Thus, on a mainstream PC with a 20 GiB/s RAM, this takes 220 ms. In fact, the sparse matrix computation result was not so bad in practice for a sequential implementation.If this is not enough to you, then you can use simple-precision floating-point numbers instead of double-precision (twice faster). You could also use a low-level C/C++ implementation so to reduce the memory bandwidth used (thanks to non-temporal instructions &#8212; ~30% faster). There is no much more to do.

Advertisement

Answer