
Deduplicate numpy array by another array

I have two numpy arrays:

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])

a holds the item indices and b holds the score of each corresponding item. I want to sort the items in descending order of their scores in b while keeping only the largest score for each item. The result should be the deduplicated item indices a_new and the scores of those items b_new.

In the example above, I need:

a_new = np.array([3, 0, 2, 1])
b_new = np.array([1.0, 0.9, 0.8, 0.6])

I know I can do this with scatter_max, but it’s a little slow. Is there an easier and faster solution?

Note that I don’t want to transform the array to a dictionary, which is a trivial solution. I need a batched solution because I have millions of such arrays.


Answer

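After sorting both arrays by score in descending order (using the index array ordering), the repeated items can be removed with np.unique. Its return_index option gives the first occurrence of each item, which after the sort is the one with the highest score: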

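import numpy as np

ordering = np.argsort(b)[::-1]  # indices that sort the scores from highest to lowest
a = a[ordering]
b = b[ordering]
# first occurrence of each item, sorted to keep the descending-score order
undup_ind = np.sort(np.unique(a, return_index=True)[1])
a_new = a[undup_ind]
b_new = b[undup_ind]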

This is likely one of the fastest ways to reach the goal; in my test on 1,000,000 elements it ran in 0.5 seconds.
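For reference, here is a minimal self-contained sketch of the same approach, checked against the example from the question. The function name dedup_by_score is only an illustrative choice, and the sketch assumes a and b are 1-D arrays of equal length; it handles one pair of arrays at a time rather than a batch.

```python
import numpy as np


def dedup_by_score(a, b):
    """Keep each item of `a` once, with its highest score from `b`,
    ordered by score from highest to lowest.

    Hypothetical helper name; assumes 1-D arrays of equal length.
    """
    order = np.argsort(b)[::-1]           # positions sorted by score, highest first
    a_sorted, b_sorted = a[order], b[order]
    # return_index yields the first occurrence of each unique item,
    # which after the descending sort is its highest-scoring entry.
    _, first = np.unique(a_sorted, return_index=True)
    keep = np.sort(first)                 # restore the descending-score order
    return a_sorted[keep], b_sorted[keep]


a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
a_new, b_new = dedup_by_score(a, b)
print(a_new)
print(b_new)
```

On the question's input this yields the items 3, 0, 2, 1 with scores 1.0, 0.9, 0.8, 0.6, matching the expected a_new and b_new.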

User contributions licensed under: CC BY-SA