Consider the array and function definition shown:
```python
import numpy as np

a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])

def grpCountSize(arr, grpCount, grpSize):
    count = [np.unique(row, return_counts=True) for row in arr]
    valid = [np.any(np.count_nonzero(row[1] == grpSize) == grpCount) for row in count]
    return valid
```
The point of the function is to return the rows of array `a` that have exactly `grpCount` groups of elements that each hold exactly `grpSize` identical elements.
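To see where those group counts come from, it may help to look at what `np.unique` returns for a single row (a small illustrative check, not part of the original code):

```python
vals, counts = np.unique(a[0], return_counts=True)  # a[0] is [2, 2, 5, 6, 2, 5]
# vals   -> array([2, 5, 6])
# counts -> array([3, 2, 1])  # one group of 3 (the 2s), one group of 2 (the 5s)
```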
For example:
```python
# Which rows have exactly 1 group that holds exactly 2 identical elements?
out = a[grpCountSize(a, 1, 2)]
```
As expected, the code outputs:

```python
out = [[2, 2, 5, 6, 2, 5],
       [3, 3, 7, 7, 3, 3]]
```
The 1st output row has exactly 1 group of 2 (i.e., 5, 5), while the 2nd output row also has exactly 1 group of 2 (i.e., 7, 7).
Similarly:
```python
# Which rows have exactly 2 groups that each hold exactly 3 identical elements?
out = a[grpCountSize(a, 2, 3)]
```
This produces:

```python
out = [[3, 6, 3, 6, 3, 6]]
```

because only this row has exactly 2 groups, each holding exactly 3 identical elements (i.e., 3, 3, 3 and 6, 6, 6).
PROBLEM: My actual arrays have just 6 columns, but they can have many millions of rows. The code works perfectly as intended, but it is VERY SLOW for long arrays. Is there a way to speed this up?
Answer
`np.unique` sorts the array, which makes it less efficient for your purpose. Use `np.bincount` instead; that way you will most likely save some time (depending on your array's shape and the values in it). You also will not need `np.any` anymore:
```python
def grpCountSize(arr, grpCount, grpSize):
    count = [np.bincount(row) for row in arr]
    valid = [np.count_nonzero(row == grpSize) == grpCount for row in count]
    return valid
```
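A quick sanity check on what `np.bincount` produces for one row (illustrative only): the value at index `i` is how many times `i` occurs in the row, so comparing the counts against `grpSize` directly gives the group test.

```python
np.bincount(a[0])  # a[0] is [2, 2, 5, 6, 2, 5]
# -> array([0, 0, 3, 0, 0, 2, 1])
#    index:  0  1  2  3  4  5  6   (the 2 occurs 3 times, the 5 twice, the 6 once)
```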
Another way that might save even more time is to use the same number of bins for every row and build a single 2D count array:
```python
def grpCountSize(arr, grpCount, grpSize):
    m = arr.max()
    count = np.stack([np.bincount(row, minlength=m + 1) for row in arr])
    return (count == grpSize).sum(1) == grpCount
```
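The `minlength=m + 1` argument is what makes the rows stackable: without it, each row's bincount would only be as long as that row's own maximum value plus one. A small sketch of the difference (the values here are illustrative):

```python
np.bincount([1, 1, 3])                # -> array([0, 2, 0, 1])                    length 4
np.bincount([1, 1, 3], minlength=10)  # -> array([0, 2, 0, 1, 0, 0, 0, 0, 0, 0])  length 10
```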
Yet another upgrade is to use a vectorized 2D bincount from this post. For example (note that the Numba solutions tested in the linked post are faster; I just provide the NumPy solution as an example, and you can replace the function with any of the ones suggested there):
```python
def grpCountSize(arr, grpCount, grpSize):
    count = bincount2D_vectorized(arr)
    return (count == grpSize).sum(1) == grpCount

# From the post linked above
def bincount2D_vectorized(a):
    N = a.max() + 1
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
```
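If it helps to see the offset trick in isolation, here is a tiny hypothetical example: each row is shifted into its own block of `N` bins so that a single flat `np.bincount` counts all rows at once.

```python
small = np.array([[0, 1, 1],
                  [2, 0, 2]])
N = small.max() + 1                                    # N = 3
offs = small + np.arange(small.shape[0])[:, None] * N
# offs -> [[0, 1, 1],
#          [5, 3, 5]]   (row 1 now lives in bins 3..5)
np.bincount(offs.ravel(), minlength=small.shape[0] * N).reshape(-1, N)
# -> array([[1, 2, 0],
#           [1, 0, 2]])  per-row counts of the values 0, 1, 2
```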
Output of all the solutions above:

```python
a[grpCountSize(a, 1, 2)]
# array([[2, 2, 5, 6, 2, 5],
#        [3, 3, 7, 7, 3, 3]])
```
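To check the speedup at realistic scale, a minimal timing sketch like the one below could be used (the row count and value range are assumptions; substitute your own data):

```python
import time
import numpy as np

rng = np.random.default_rng(0)
big = rng.integers(0, 10, size=(1_000_000, 6))  # millions of short rows, as in the question

start = time.perf_counter()
mask = grpCountSize(big, 1, 2)                  # any of the variants defined above
print(f"{time.perf_counter() - start:.3f} s, {np.count_nonzero(mask)} matching rows")
```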