
When to use xarray over numpy for medium rank multidimensional data?

I have some multidimensional data and was wondering if I should use xarray when speed is one of my concerns, albeit not the highest.

I have a 4D array, so it’s not so big as to preclude me from using numpy. The coordinates/indices are vital for one dimension but not for the others. I’ll have to do some slight book-keeping, but as the primary developer, this is OK for me. For the developers who continue to iterate on the code after me, integer indexing might be more confusing than a label-based (xarray/pandas) approach. Regardless, I can still use numpy if I document the process well, but I would like to use xarray for readability.

After implementing a solution, I noticed that the label-based operations/indexing below complete in about 5 seconds on my machine:

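The original code block wasn’t preserved here, but a sketch of the kind of label-based assignment described, with hypothetical dimension names and sizes, looks something like this:

```python
import numpy as np
import xarray as xr

# Hypothetical 4D DataArray; dimension names and sizes are illustrative only.
da = xr.DataArray(
    np.zeros((10, 20, 30, 40)),
    dims=("run", "x", "y", "time"),
    coords={
        "run": [f"run_{i}" for i in range(10)],
        "x": np.arange(20),
        "y": np.arange(30),
        "time": np.arange(40),
    },
)

# Label-based assignment inside nested loops.
for run in da.coords["run"].values:
    for x in da.coords["x"].values:
        for y in da.coords["y"].values:
            # Label slices are inclusive, so slice(0, 3) covers four elements.
            da.loc[{"run": run, "x": x, "y": y, "time": slice(0, 3)}] = 1.0
```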

If I do the same thing with an integer-based approach in xarray, it takes about 2 seconds:

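Again, the exact snippet is missing; positional (integer) indexing on the same hypothetical DataArray from the sketch above would be along these lines:

```python
# Integer-based assignment on the xarray DataArray (da from the sketch above).
for i in range(da.sizes["run"]):
    for j in range(da.sizes["x"]):
        for k in range(da.sizes["y"]):
            da[i, j, k, :4] = 1.0
```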

Lastly, I noticed that if I do the same integer-based indexing in numpy, it takes less than half a second:

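The numpy version of the same loop (again a sketch with an illustrative shape) would be:

```python
import numpy as np

# The same triple loop on a plain numpy ndarray of the same hypothetical shape.
arr = np.zeros((10, 20, 30, 40))
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        for k in range(arr.shape[2]):
            arr[i, j, k, :4] = 1.0
```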

Speed is not my biggest concern here, but if the label-based approach in xarray is more than 8 times slower and the integer-based approach in xarray is 4 times slower than the standard numpy integer-based approach, it’s dissuading me from digging deeper into xarray for medium-rank multidimensional data.

Any thoughts, advice, etc.?


Answer

We can’t really tell you which package to use, and certainly not without knowing much more about your data and use case.

For what it’s worth, while the performance of xarray will always lag compared to numpy, the difference is most acute when performing small operations like this. You are assigning a tiny amount of data, using indexing, within a triple for loop, which is kryptonite for xarray. If you make all assignments at the same time, you’ll see the penalty decrease meaningfully as the indexing overhead becomes less significant relative to the underlying numpy operations. Performance in xarray is all about understanding how to minimize the overhead and leverage the performance of the backend as much as possible while still giving the convenience of labels-based indexing.

See this simple example. I’ve created a 3-D DataArray with 1 million float64s, indexed by (x, y, z):

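The original setup code was lost in formatting; a reconstruction matching that description (a 100 x 100 x 100 array of random float64s, with a plain numpy copy kept for comparison) is:

```python
import numpy as np
import xarray as xr

# 100 x 100 x 100 = 1,000,000 float64s; the exact construction is a guess.
arr = np.random.random((100, 100, 100))
da = xr.DataArray(
    arr.copy(),
    dims=("x", "y", "z"),
    coords={"x": np.arange(100), "y": np.arange(100), "z": np.arange(100)},
)
```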

Looping through x and y and then assigning along the first four elements of z incurs a huge penalty, with xarray coming in at just over 100x numpy’s runtime for the same operation:

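The original timing code isn’t preserved; a sketch of the two loops being compared (continuing with da and arr from above) is:

```python
# xarray: label-based assignment inside a double loop -- the slow pattern.
for x in da.x.values:
    for y in da.y.values:
        # Label slices are inclusive, so slice(0, 3) covers the first four z values.
        da.loc[{"x": x, "y": y, "z": slice(0, 3)}] = 0.0

# numpy: the equivalent positional assignment.
for i in range(arr.shape[0]):
    for j in range(arr.shape[1]):
        arr[i, j, :4] = 0.0
```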

If the same operation is restructured to assign across all of x and y at once, xarray’s performance penalty decreases to about 6x the numpy runtime:

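Something along these lines, replacing the double loop with a single vectorized assignment:

```python
# xarray: one label-based assignment across all of x and y at once.
da.loc[{"z": slice(0, 3)}] = 0.0

# numpy equivalent.
arr[:, :, :4] = 0.0
```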

Assigning the whole array reduces xarray’s overhead to about 2x:

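A sketch of that whole-array assignment:

```python
# xarray: assign the entire array in one shot.
da[:] = 0.0

# numpy equivalent.
arr[:] = 0.0
```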

It’s up to you whether this is worth the cost. But whichever you choose, don’t use nested for loops for assignment :)
