I’m trying to get the indices of all repeated elements in a numpy array, but the solution I have found so far is really inefficient for a large (>20000 elements) input array: it takes roughly 9 seconds. The idea is simple:
- `records_array` is a numpy array of timestamps (`datetime`) from which we want to extract the indices of repeated timestamps.
- `time_array` is a numpy array containing all the timestamps that are repeated in `records_array`.
- `records` is a Django QuerySet (which can easily be converted to a list) containing some Record objects. We want to create a list of couples formed by all possible combinations of the `tagId` attributes of the Record objects corresponding to the repeated timestamps found in `records_array`.
Here is the working (but inefficient) code I have for the moment:
```python
import itertools
import numpy as np

tag_couples = []
for t in time_array:
    # Get the indices of all timestamps in records_array equal to time t
    users_inter = np.nonzero(records_array == t)[0]
    # Create a temporary list containing all tagIds recorded at time t
    l = [str(records[i].tagId) for i in users_inter]
    # Skip groups formed by a single tag repeated with itself
    if l.count(l[0]) != len(l):
        # Remove duplicates with list(set(l)) and append all possible couple combinations
        tag_couples += [x for x in itertools.combinations(list(set(l)), 2)]
```
I’m quite sure this can be optimized with numpy, but I can’t find a way to compare `records_array` with each element of `time_array` without using a for loop (they can’t be compared with a plain `==`, since they are both arrays).
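As it turns out, broadcasting makes exactly this comparison possible; here is a minimal sketch of the idea the answer below develops, with plain integers standing in for the timestamps:

```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
time_array = np.array([1, 3])

# one row per element of time_array, one column per element of records_array
matches = records_array == time_array[:, np.newaxis]
rows, cols = np.nonzero(matches)  # cols are indices into records_array
```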
Answer
A vectorized solution with numpy, built on the magic of `unique()`.
```python
import numpy as np

# create a test array
records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# creates an array of indices, sorted by unique element
idx_sort = np.argsort(records_array)

# sorts records_array so all identical elements are together
sorted_records_array = records_array[idx_sort]

# returns the unique values, the index of the first occurrence of a value,
# and the count for each unique value
vals, idx_start, count = np.unique(sorted_records_array,
                                   return_index=True,
                                   return_counts=True)

# splits the sorted indices into separate arrays, one per unique value
res = np.split(idx_sort, idx_start[1:])

# filter them with respect to their size, keeping only items occurring more than once
# (list() makes the filtered result reusable under Python 3)
vals = vals[count > 1]
res = list(filter(lambda x: x.size > 1, res))
```
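To make the result concrete, here is a short usage sketch for the test array above (indices within each group are sorted for a deterministic printout; the value 4 is dropped since it occurs only once):

```python
# iterate over the repeated values and their groups of indices
for val, idx in zip(vals, res):
    print(val, np.sort(idx))
# 1 [0 3 4]
# 2 [1 8]
# 3 [2 5 7]
```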
The following code was the original answer; it requires a bit more memory, using numpy broadcasting and calling `unique` twice:
```python
import numpy as np

records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

# unique values, the mapping of each element to its unique value,
# and the count of each unique value
vals, inverse, count = np.unique(records_array,
                                 return_inverse=True,
                                 return_counts=True)

# indices (into vals) of the values occurring more than once
idx_vals_repeated = np.where(count > 1)[0]
vals_repeated = vals[idx_vals_repeated]

# broadcasting: one row per repeated value, marking its positions in records_array
rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])
# the index of the first column of each row delimits the groups
_, inverse_rows = np.unique(rows, return_index=True)
res = np.split(cols, inverse_rows[1:])
```
with, as expected, res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])].
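For completeness, here is a hedged sketch of how a result like res could replace the slow loop from the question (assuming, as in the question, that records is indexable and its objects carry a tagId attribute):

```python
import itertools

tag_couples = []
for idx in res:
    # deduplicated tagIds recorded at the same repeated timestamp
    tags = set(str(records[i].tagId) for i in idx)
    # skip groups where a single tag repeats with itself
    if len(tags) > 1:
        tag_couples += itertools.combinations(tags, 2)
```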