
FAST: 1D overlaps with rows in 2D?

Let's say I have a 2D array, e.g.:

In [136]: ary          
array([[6, 7, 9],
       [0, 2, 5],
       [3, 3, 4],
       [2, 2, 8],
       [3, 4, 9],
       [0, 5, 7],
       [2, 4, 9],
       [3, 5, 7],
       [7, 8, 8],
       [0, 2, 3]])

I want to calculate the overlap with a 1D vector, fast.

I can almost do it with this (~8 ms on a big array):

 (ary == match)     # .sum(axis=1).argsort()[::-1]

The problem with it is that it only counts a match when both the position and the value match.

With match = [6, 5, 4], ary == match gives:
 
array([[ True, False, False],
       [False, False, False],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False,  True, False],
       [False, False, False],
       [False,  True, False],
       [False, False, False],
       [False, False, False]])

E.g. the 5 in the 2nd position of the 1D vector did not match the 5 in the 3rd column of the 2nd row.

It works with np.isin():

  np.isin(ary,match,assume_unique=True).sum(axis=1).argsort()[::-1][:5]

but it is slow on big arrays: ~20 ms for shape (200000, 10).

Help me extend the first case so that it matches a value at any position of the 1D vector against the row.
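
In other words, the semantics I want look like this broadcasting sketch (for illustration only, since it also builds a big temporary; it computes the same thing as np.isin here):

# A value ary[i, j] counts as a hit if it equals ANY element of match,
# regardless of position.
match = np.asarray([6, 5, 4])
hits = (ary[:, :, None] == match).any(axis=2)  # (rows, cols) boolean hits
overlap = hits.sum(axis=1)                     # per-row overlap count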


The expected result is the row indexes ordered by overlap count. Let's use [2, 5, 4] because it has more matches:

In [147]: np.isin(ary,[2,5,4],assume_unique=True)                                                                                                                             
Out[147]: 
array([[False, False, False],
       [False,  True,  True],
       [False, False,  True],
       [ True,  True, False],
       [False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False,  True, False],
       [False, False, False],
       [False,  True, False]])

Overlap:

In [149]: np.isin(ary,[2,5,4],assume_unique=True).sum(axis=1)                                                                                                                 
Out[149]: array([0, 2, 1, 2, 1, 1, 2, 1, 0, 1])

Order by overlap:

In [148]: np.isin(ary,[2,5,4],assume_unique=True).sum(axis=1).argsort()[::-1]                                                                                                 
Out[148]: array([6, 3, 1, 9, 7, 5, 4, 2, 8, 0])

See: rows 6, 3, 1 have overlap 2, which is why they come first.


Variants:

import numpy as np

# sizes could range from (1000, 10, 10) to (2000, 100, 20) and beyond
def data(cells=2000, seg=100, items=10):
    ary = np.random.randint(0, cells, (cells * seg, items))
    rnd = np.random.randint(0, cells * seg)
    return ary, ary[rnd]

def best2(match, ary):  # ~20 ms on (200000, 10)
    return np.isin(ary, match, assume_unique=True).sum(axis=1).argsort()[::-1][:5]

def best3(match, ary):  # Corralien's variant, ~20 ms
    return np.logical_or.reduce(np.ravel(ary) == match[:, None], axis=0).reshape(ary.shape).sum(axis=1).argsort()[::-1][:5]
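
For reference, the variants can be exercised like this (a sketch; the timings quoted above come from IPython's %timeit):

ary, match = data()   # (200000, 10) array plus one of its own rows
print(best2(match, ary))
print(best3(match, ary))
# In IPython: %timeit best2(match, ary)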

Can this be sped up using Numba + CUDA, or CuPy on a GPU?


Answer

The main problem with all the approaches so far is that they create huge temporary arrays while only 5 items ultimately matter. Numba can be used to compute the scores on the fly (with efficient JIT-compiled loops), avoiding some temporary arrays. Moreover, a full sort is not required, as only the top 5 items need to be retrieved; a partition can be used instead. It is even possible to use a faster approach, since only the 5 selected items matter and not the order of the rest. Here is the resulting code:

import numba as nb
import numpy as np

@nb.njit('int32[::1](int32[::1], int32[:,::1])')
def computeScore(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    tmp = np.empty(n, dtype=np.int32)
    for i in range(n):
        s = 0
        # Count the number of matching items (with repetition)
        for j in range(m):
            # Find a match
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best4(match, ary):
    n, m = ary.shape
    score = computeScore(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:] # sadly not supported by Numba yet
    order = np.argsort(-score[bestItems]) # bestItems is not sorted and likely needs to be
    return bestItems[order]

Note that best4 can return results different from best2 when the matching score (stored in tmp) is tied between multiple rows. This is due to the sorting algorithm, which is not stable by default in NumPy (the kind parameter can be used to change this behavior). The same applies to the partition algorithm, although NumPy does not seem to provide a stable partition algorithm yet.
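
For example, a stable tie-break on the final sort can be requested like this (a sketch; note that the argpartition step can still pick different tied rows):

order = np.argsort(-score[bestItems], kind='stable')  # stable sort among the 5 kept rows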

This code should be faster than the other implementations, but not by a large margin. One issue is that Numba (and most C/C++ compilers, like the ones used to build NumPy) does not manage to generate fast code here, since it does not know the value of m at compile time. As a result, the most aggressive optimizations (e.g. unrolling loops and using SIMD instructions) can hardly be applied. You can help Numba with assertions or by escaping conditionals.

Moreover, the code can be parallelized with multiple threads to make it much faster on mainstream platforms. Note that the parallelized version may not be faster on small data or on all platforms, since creating threads introduces an overhead that can be bigger than the actual computation.

Here is the resulting implementation:

@nb.njit('int32[::1](int32[::1], int32[:,::1])', parallel=True)
def computeScoreOpt(match, ary):
    n, m = ary.shape
    assert m == match.shape[0]
    assert m == 10
    tmp = np.empty(n, dtype=np.int32)
    for i in nb.prange(n):
        # This enables Numba to assume m == 10 in the following code
        # and to generate very efficient code for this specific case.
        # The assert should be enough, but the internals of Numba
        # prevent the information from being propagated to this portion
        # of the code when it is parallelized.
        if m != 10: continue

        s = 0
        for j in range(m):
            item = ary[i, j]
            found = False
            for k in range(m):
                found |= item == match[k]
            s += found
        tmp[i] = s
    return tmp

def best5(match, ary):
    n, m = ary.shape
    score = computeScoreOpt(match, ary)
    bestItems = np.argpartition(score, n-5)[n-5:]
    order = np.argsort(-score[bestItems])
    return bestItems[order]
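
For completeness, here is a sketch of how the functions can be called (the int32 casts are my assumption: the eager Numba signatures above require int32, while np.random.randint may produce int64 on some platforms):

ary, match = data()
ary = ary.astype(np.int32)       # match the int32[:,::1] signature
match = match.astype(np.int32)   # match the int32[::1] signature
print(best5(match, ary))
# In IPython: %timeit best5(match, ary)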

Here are the timings on my machine with the example dataset:

best2:                            18.2 ms
best3:                            17.8 ms
best4 (sequential -- default):    12.0 ms
best4 (parallel):                  3.1 ms
best5 (sequential):                3.2 ms
best5 (parallel -- default):       1.2 ms

The fastest implementation is 15 times faster than the original reference implementation.

Note that if m is greater than about 30, a more advanced set-based algorithm should be better. An alternative solution in this case is to sort match first and then use np.isin (or a binary search) inside the loop over i.
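
As an illustration of that idea, here is a sketch (a hypothetical computeScoreLargeM, not benchmarked) that sorts match once and then uses np.searchsorted as the binary search:

import numba as nb
import numpy as np

@nb.njit('int32[::1](int32[::1], int32[:,::1])')
def computeScoreLargeM(match, ary):
    n, m = ary.shape
    sortedMatch = np.sort(match)  # sort once: O(m log m)
    tmp = np.empty(n, dtype=np.int32)
    for i in range(n):
        s = 0
        for j in range(m):
            item = ary[i, j]
            # Binary search instead of a linear scan: O(log m) per item
            k = np.searchsorted(sortedMatch, item)
            if k < m and sortedMatch[k] == item:
                s += 1
        tmp[i] = s
    return tmp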
