I have two numpy arrays:
a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
Here a holds the item indices and b holds the scores of the corresponding items. I want to sort the items by their scores in b in descending order while keeping only the largest score for each item. The result should be the de-duplicated item indices a_new and the scores of those items b_new.
In the example above, I need:
a_new = np.array([3, 0, 2, 1])
b_new = np.array([1.0, 0.9, 0.8, 0.6])
I know I can do this with scatter_max, but it is a little slow. Is there an easier and faster solution?
Note that I don’t want to transform the array to a dictionary, which is a trivial solution. I need a batched solution because I have millions of such arrays.
Answer
After sorting both arrays by score in descending order using ordering, repeated items can be removed with np.unique, whose return_index option gives the position of the first occurrence of each item:
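import numpy as np
ordering = np.argsort(b)[::-1]  # positions sorted by score, highest first
a = a[ordering]
b = b[ordering]
undup_ind = np.sort(np.unique(a, return_index=True)[1])  # first (best-scoring) occurrence of each item, kept in score order
a_new = a[undup_ind]
b_new = b[undup_ind]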
This should be the fastest, or one of the fastest, ways to reach the goal; in my test case with 1,000,000 elements it ran in 0.5 seconds.
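As a minimal sketch, the same steps can be wrapped in a function and checked against the example from the question. The function name top_score_per_item is only an illustrative choice, not something from the original answer, and the sketch handles a single pair of arrays rather than a batch.

```python
import numpy as np

def top_score_per_item(a, b):
    """Return the unique items of `a`, ordered by their best score in `b` (descending)."""
    # Positions sorted by score, highest first.
    ordering = np.argsort(b)[::-1]
    a_sorted = a[ordering]
    b_sorted = b[ordering]
    # np.unique reports the index of the first occurrence of each item;
    # in the score-sorted arrays that occurrence carries the largest score.
    _, first_idx = np.unique(a_sorted, return_index=True)
    # Sorting those indices restores the descending-score order.
    keep = np.sort(first_idx)
    return a_sorted[keep], b_sorted[keep]

a = np.array([0, 1, 2, 2, 3])
b = np.array([0.9, 0.6, 0.5, 0.8, 1.0])
a_new, b_new = top_score_per_item(a, b)
# a_new holds the items 3, 0, 2, 1
# b_new holds the scores 1.0, 0.9, 0.8, 0.6
```

Note that np.argsort with the default settings does not promise a particular order for equal scores, so when two different items tie, either may come first in the output.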