FAST: 1D overlaps with rows in 2D?

Question

Let's say I have a 2D array, e.g.: I want to calculate its overlap with a 1D vector, FAST. I can almost do it with (8 ms on a big array): The problem is that it only matches if both position and value match. E.g. the 5 in the 2nd column of the 1D vector did not match the 5 in the 3rd column on the 2nd
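To make the distinction concrete, here is a minimal sketch using made-up data (the names `ary` and `vec` and their values are illustrative assumptions, not the original arrays). It contrasts a position-wise comparison, which only counts a value when it sits in the same column, with `np.isin`, which counts a value whenever it appears anywhere in the vector.

```python
import numpy as np

# Illustrative data only; these are not the asker's original arrays.
ary = np.array([[1, 5, 7],
                [4, 2, 5],
                [9, 9, 9]])
vec = np.array([1, 5, 3])

# Position-dependent: element j of each row is compared with vec[j] only.
positional = (ary == vec).sum(axis=1)

# Position-independent: each element is tested for membership in vec.
anywhere = np.isin(ary, vec).sum(axis=1)

print(positional)  # [2 0 0]
print(anywhere)    # [2 1 0]
```

The second row shows the problem described above: its 5 is in the 3rd column while the vector's 5 is in the 2nd, so only the membership test counts it.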

Accepted Answer

The main problem with all the approaches so far is that they create huge temporary arrays, while only 5 items matter in the end. Numba can be used to compute the arrays on the fly (with efficient JIT-compiled loops), avoiding some of the temporary arrays. Moreover, a full sort is not required since only the top 5 items need to be retrieved. A partition can be used instead. It is even possible to use a faster approach since only the 5 selected items matter and not the others.

Here is the resulting code:

```python
import numba as nb
import numpy as np

@nb.njit('int32[::1](int32[::1], int32[:,::1])')
def computeScore(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    tmp = np.empty(n, dtype=np.int32)
    for i in range(n):
        s = 0
        # Count the number of matching items (with repetition)
        for j in range(m):
            # Find a match
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best4(match, ary):
    n, m = ary.shape
    score = computeScore(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:] # sadly not supported by Numba yet
    order = np.argsort(-score[bestItems]) # bestItems is not sorted and likely needs to be
    return bestItems[order]
```

Note that best4 can give results different from best2 when several items have the same matching score (stored in tmp). This is because the sorting algorithm is not stable by default in NumPy (the kind parameter can be used to change this behavior). The same is true for the partition algorithm, although NumPy does not seem to provide a stable partition algorithm yet.

This code should be faster than the other implementations, but not by a large margin. One of the issues is that Numba (like most C/C++ compilers, including the one used to compile NumPy) cannot generate fast code because it does not know the value of m at compile time. As a result, the most aggressive optimizations (e.g. loop unrolling and the use of SIMD instructions) can hardly be applied. You can help Numba with assertions or conditionals that exit early.

Moreover, the code can be parallelized using multiple threads to make it much faster on mainstream platforms. Note that the parallelized version may not be faster on small data, nor on all platforms, since creating threads introduces an overhead that can be bigger than the actual computation.

Here is the resulting implementation:

```python
@nb.njit('int32[::1](int32[::1], int32[:,::1])', parallel=True)
def computeScoreOpt(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    assert m == 10
    tmp = np.empty(n, dtype=np.int32)
    for i in nb.prange(n):
        # This enables Numba to assume m=10 in the following code
        # and generate very efficient code for this specific case.
        # The assert should be enough, but the internals of Numba
        # prevent the information from being propagated to this portion
        # of the code when it is parallelized.
        if m != 10: continue
        s = 0
        for j in range(m):
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best5(match, ary):
    n, m = ary.shape
    score = computeScoreOpt(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:]
    order = np.argsort(-score[bestItems])
    return bestItems[order]
```

Here are the timings on my machine with the example dataset:

```
best2:                            18.2 ms
best3:                            17.8 ms
best4 (sequential -- default):    12.0 ms
best4 (parallel):                  3.1 ms
best5 (sequential):                3.2 ms
best5 (parallel -- default):       1.2 ms
```

The fastest implementation is 15 times faster than the original reference implementation.

Note that if m is greater than about 30, it should be better to use a more advanced set-based algorithm. In that case, an alternative solution is to sort match first and then use np.isin in the i-based loop.
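As a rough sketch of the sorted-lookup idea for larger m, the following uses plain NumPy rather than Numba and swaps in a vectorized `np.searchsorted` in place of calling `np.isin` inside the row loop. The function names `score_sorted` and `top_k` are hypothetical, the data is made up, and the sketch assumes `match` is non-empty and `k` is no larger than the number of rows. Note that it allocates temporaries the size of `ary`, so it trades memory for a binary search per item instead of a linear scan over `match`.

```python
import numpy as np

def score_sorted(match, ary):
    """Count, per row of ary, how many items occur anywhere in match."""
    sorted_match = np.unique(match)                # sorted, duplicates removed
    pos = np.searchsorted(sorted_match, ary)       # insertion index for every item
    pos = np.minimum(pos, sorted_match.size - 1)   # clamp items above the maximum
    # An item is present exactly when the value at its insertion index equals it.
    return (sorted_match[pos] == ary).sum(axis=1)

def top_k(score, k=5):
    """Indices of the k highest scores, best first (tie order is unspecified)."""
    n = score.shape[0]
    best = np.argpartition(score, n - k)[n - k:]   # k largest, unordered
    return best[np.argsort(-score[best])]          # order them by descending score

# Illustrative data only.
match = np.array([1, 5, 9], dtype=np.int32)
ary = np.array([[5, 1, 2],
                [9, 9, 9],
                [0, 3, 10],
                [1, 0, 0]], dtype=np.int32)

score = score_sorted(match, ary)
print(score)             # [2 3 0 1]
print(top_k(score, 2))   # [1 0]
```

Like the loops above, this counts repeated items in a row each time they occur, so the row of three 9s scores 3.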

Answer