
2D Vectorization of unique values per row with condition

Consider the array and function definition shown:

import numpy as np

a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])

def grpCountSize(arr, grpCount, grpSize):
    # Per row: the unique values and how often each occurs.
    count = [np.unique(row, return_counts=True) for row in arr]
    # Per row: True if exactly grpCount values occur exactly grpSize times.
    valid = [np.any(np.count_nonzero(row[1] == grpSize) == grpCount) for row in count]
    return valid

The function is meant to select the rows of array a that contain exactly grpCount groups of identical elements, where each group holds exactly grpSize elements.

For example:

# which rows have exactly 1 group that holds exactly 2 identical elements?
out = a[grpCountSize(a, 1, 2)]

As expected, the code outputs out = [[2, 2, 5, 6, 2, 5], [3, 3, 7, 7, 3, 3]]. The 1st output row has exactly 1 group of 2 (i.e. 5,5), and the 2nd output row also has exactly 1 group of 2 (i.e. 7,7).

Similarly:

# which rows have exactly 2 groups that each hold exactly 3 identical elements?
out = a[grpCountSize(a, 2, 3)]

This produces out = [[3, 6, 3, 6, 3, 6]], because only this row has exactly 2 groups each holding exactly 3 identical elements (i.e. 3,3,3 and 6,6,6).

PROBLEM: My actual arrays have just 6 columns, but they can have many millions of rows. The code works perfectly as intended, but it is VERY SLOW for long arrays. Is there a way to speed this up?


Answer

np.unique sorts each row, which makes it less efficient for this purpose. Use np.bincount instead (note that np.bincount requires non-negative integers); depending on the array's shape and values, it will most likely save some time. You also no longer need np.any:

def grpCountSize(arr, grpCount, grpSize):
    # Per row: occurrence count of every value from 0 to row.max().
    count = [np.bincount(row) for row in arr]
    # Per row: True if exactly grpCount values occur exactly grpSize times.
    valid = [np.count_nonzero(row == grpSize) == grpCount for row in count]
    return valid
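
For intuition, here is what np.bincount produces for the first row of a (bin i holds how many times value i occurs):

row = np.array([2, 2, 5, 6, 2, 5])
print(np.bincount(row))                         # [0 0 3 0 0 2 1]: three 2s, two 5s, one 6
print(np.count_nonzero(np.bincount(row) == 2))  # 1 -> exactly one group of size 2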

Another way that might save even more time is to use the same number of bins for all rows and build a single 2D count array:

def grpCountSize(arr, grpCount, grpSize):
    m = arr.max()
    # One row of counts per input row, all with the same number of bins.
    count = np.stack([np.bincount(row, minlength=m + 1) for row in arr])
    return (count == grpSize).sum(1) == grpCount
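
As a quick sanity check with the array a from the question, this version returns a boolean mask that indexes the matching rows directly:

mask = grpCountSize(a, 1, 2)
print(mask)     # [ True False False False False False False  True]
print(a[mask])  # the two rows with exactly one group of two identical elements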

Yet another upgrade is to use a vectorized 2D bin count from this post. (Note that the Numba solutions tested in the post above are faster; I just provided the NumPy solution as an example. You can replace the function with any of the ones suggested in the linked post.)

def grpCountSize(arr, grpCount, grpSize):
    count = bincount2D_vectorized(arr)
    return (count == grpSize).sum(1) == grpCount

# From the post linked above.
def bincount2D_vectorized(a):
    N = a.max() + 1
    # Shift each row's values into its own block of N bins...
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    # ...so one flat bincount covers all rows, then reshape to (rows, N).
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
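
To check whether the vectorized version pays off at your scale, a rough timing sketch along these lines can help; the random array below is a hypothetical stand-in for your data, and actual timings depend on hardware and on the range of values (which sets the number of bins):

import timeit

# Hypothetical input mimicking the question's shape: many rows, 6 columns, small values.
big = np.random.randint(0, 10, size=(1_000_000, 6))

t = timeit.timeit(lambda: grpCountSize(big, 1, 2), number=3)
print(f"vectorized 2D bincount: {t / 3:.3f} s per call")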

Output of all the solutions above:

a[grpCountSize(a, 1, 2)]
#array([[2, 2, 5, 6, 2, 5],
#       [3, 3, 7, 7, 3, 3]])