Dask Distributed: Reducing Multiple Dimensions into a Distance Matrix

Question

I want to calculate a large distance matrix, based on a higher dimensional vector. For instance, I have 1000 instances each represented by 20 vectors of length 10. The distance between each two instances is given by the mean distance between each of the 20 vectors associated to each vector. So I want to go fr…

Accepted Answer

I&#8217;m a big fan of dask, but this problem is way too small to need it. The runtime issue you&#8217;re seeing is because you are looping through each element in python rather than using vectorized operations in numpy.As with many packages in python, numpy relies on highly efficient compiled code written in other, faster languages such as C to carry out array operations. When you do something like an array operation A + B numpy calls these fast routines once, and the array operations are carried out within a highly optimized C routine. There is overhead involved with making calls to other languages, but this is overwhelmed by the performance gain due to the single call to a very fast routine. If instead you loop over every element, adding cell-wise, you have a (slow) python process, and on each element, this calls the C code, which adds overhead for each element of the array. Because of this, you actually would be better off not using numpy if you&#8217;re going to do this once for each element.To implement this in a vectorized manner, you can exploit numpy&#8217;s broadcasting rules to ensure the first dimensions of your two arrays expand to a new dimension. I don&#8217;t totally understand what&#8217;s going on in your distance function, but you could extend this simple version to do whatever you want:In [1]: import numpy as npIn [2]: A = np.random.random((1000, 20))   ...: B = np.random.random((1000, 20))In [3]: distance = np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)In [4]: distanceOut[4]:array([[7.22985776, 7.76185666, 5.61824886, ..., 7.62092039, 6.35189562,        7.06365986],       [5.73359499, 5.8422105 , 7.2644021 , ..., 5.72230353, 6.79390303,        5.03074007],       [7.27871151, 8.6856818 , 5.97489449, ..., 8.86620029, 7.49875638,        6.57389575],       ...,       [7.67783107, 7.24419076, 4.17941596, ..., 8.68674754, 6.65078093,        5.67279811],       [7.1550136 , 6.10590227, 5.75417987, ..., 7.05953998, 5.8306628 ,        6.55112672],       [5.81748615, 6.79246838, 6.95053088, ..., 7.63994705, 6.77720511,        7.5663236 ]])In [5]: distance.shapeOut[5]: (1000, 1000)The performance difference can be seen clearly against a looped implementation:In [6]: %%timeit   ...: np.abs(A[:, np.newaxis, :] - B[np.newaxis, :, :]).sum(axis=-1)   ...:   ...:45 ms ± 326 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)In [7]: %%timeit   ...: distances = np.empty((1000, 1000))   ...: for i in range(1000):   ...:     for j in range(1000):   ...:         distances[i, j] = np.abs(A[i, :] - B[j, :]).sum()   ...:2.42 s ± 7.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)The looped version takes more than 50x as long!

Advertisement

Answer