
Extending numpy.digitize to multi-dimensional data

I have a set of large arrays (about 6 million elements each) over which I basically want to perform np.digitize, but over multiple axes. I am looking for suggestions both on how to do this efficiently and on how to store the results.

I need all the indices (or all the values, or a mask) of array A where the values of array B are in one range, the values of array C are in another range, and the values of D are in yet another. I want either the values, indices, or a mask so that I can do some as-yet-undecided statistics on the values of the A array in each bin. I will also need the number of elements in each bin, but len() can handle that.

Here is one example I worked up that seems reasonable:

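In outline: digitize B, C, and D separately, then loop over every combination of bin indices and pull out the matching values of A. A minimal sketch of that shape, with small random arrays standing in for the real data (note the three-argument np.bitwise_and, presumably among the problems the answer below fixes, since a ufunc's third positional argument is its out parameter):

```python
import numpy as np

# small random stand-ins for the real ~6-million-element arrays
n = 10000
A = np.random.random_sample(n)
B = (np.random.random_sample(n) + 10) * 20
C = (np.random.random_sample(n) + 20) * 40
D = (np.random.random_sample(n) + 80) * 80

# bin edges along each axis
Bbins = np.linspace(B.min(), B.max(), 10)
Cbins = np.linspace(C.min(), C.max(), 12)
Dbins = np.linspace(D.min(), D.max(), 24)

# per-axis bin index for every element
B_idx = np.digitize(B, Bbins)
C_idx = np.digitize(C, Cbins)
D_idx = np.digitize(D, Dbins)

# one entry per bin combination: (bin tuple, the values of A in that bin)
a_bins = []
for bb in np.unique(B_idx):
    for cc in np.unique(C_idx):
        for dd in np.unique(D_idx):
            a_bins.append([(bb, cc, dd),
                           A[np.bitwise_and(B_idx == bb,
                                            C_idx == cc,
                                            D_idx == dd)]])
```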

This, however, makes me nervous that I will run out of memory on large arrays.

I could also do it this way:

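A sketch of this second idea under the same assumed setup as above: precompute one boolean mask per bin along each axis, so everything lives in a few fixed-shape arrays no matter how the data fall:

```python
# one boolean column per possible bin index along each axis
# (np.digitize can return 0 .. len(edges), hence the +1)
B_masks = np.zeros((n, len(Bbins) + 1), dtype=bool)
C_masks = np.zeros((n, len(Cbins) + 1), dtype=bool)
D_masks = np.zeros((n, len(Dbins) + 1), dtype=bool)

for i in range(B_masks.shape[1]):
    B_masks[:, i] = B_idx == i
for i in range(C_masks.shape[1]):
    C_masks[:, i] = C_idx == i
for i in range(D_masks.shape[1]):
    D_masks[:, i] = D_idx == i

# the values of A falling in bin (bb, cc, dd) are then
# A[B_masks[:, bb] & C_masks[:, cc] & D_masks[:, dd]]
```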

At least here the output is of known and constant size.

Does anyone have any thoughts on how to do this more cleverly? Or is there anything that needs clarification?


Based on the answer by HYRY, this is the path I decided to take.

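A sketch of that path, assuming the same arrays as above (the helper name group_by_bins is made up for illustration): digitize each axis, put the bin indices alongside A in a DataFrame, and let groupby handle the multi-axis binning:

```python
import numpy as np
import pandas as pd

def group_by_bins(A, B, C, D, nb=10, nc=12, nd=24):
    """Group the values of A by the (B, C, D) bin each element falls in."""
    df = pd.DataFrame({
        "A": A,
        "B_bin": np.digitize(B, np.linspace(B.min(), B.max(), nb)),
        "C_bin": np.digitize(C, np.linspace(C.min(), C.max(), nc)),
        "D_bin": np.digitize(D, np.linspace(D.min(), D.max(), nd)),
    })
    return df.groupby(["B_bin", "C_bin", "D_bin"])

g = group_by_bins(A, B, C, D)
counts = g.size()        # number of elements per bin
means = g["A"].mean()    # a statistic of A per bin
```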

This method seems lightning fast even for huge arrays.


Answer

How about using groupby in Pandas? First, fix some problems in your code:

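A sketch of the corrected loop, reusing A and the digitized indices from the question's sketch above, and assuming the main problem was the three-argument np.bitwise_and call:

```python
# reuses A, B_idx, C_idx, D_idx from the question's sketch above
a_bins = []
for bb in np.unique(B_idx):
    for cc in np.unique(C_idx):
        for dd in np.unique(D_idx):
            # chain & so that all three conditions actually apply;
            # np.bitwise_and(m1, m2, m3) would treat m3 as an out array
            # and silently drop the D condition
            mask = (B_idx == bb) & (C_idx == cc) & (D_idx == dd)
            a_bins.append([(bb, cc, dd), A[mask]])

# each entry pairs a bin tuple with the A values that landed in it,
# e.g. a_bins[k] == [(bb, cc, dd), array([...])]
```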


Here is the code that returns the same result with Pandas:

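A sketch of the Pandas version, again starting from A and the digitized bin indices (the column names are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"A": A,
                   "B_idx": B_idx,
                   "C_idx": C_idx,
                   "D_idx": D_idx})
g = df.groupby(["B_idx", "C_idx", "D_idx"])

# the A values of any single bin, addressed by its (bb, cc, dd) tuple;
# this matches the corresponding a_bins entry from the loop above
first_key = next(iter(g.groups))
g.get_group(first_key)["A"]
```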


If you want to calculate some statistics for every group, you can call methods of g, for example:

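For instance (these particular statistics are just examples):

```python
g["A"].mean()    # mean of A in every (B_idx, C_idx, D_idx) bin,
                 # returned as a Series indexed by the bin tuples
g["A"].std()     # any other reduction works the same way
g.size()         # number of elements per bin
```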
