I have some multidimensional data and was wondering if I should use xarray, given that speed is one of my concerns, albeit not the highest.
I have a 4D array, so it’s not so big as to preclude me from using numpy. The coordinates/indices are vital for one dimension but not for the others. I’ll have to do some slight bookkeeping, but as the primary developer, that’s fine for me. For the developers who iterate on the code after me, though, integer indexing might be more confusing than a label-based (xarray/pandas) approach. I could still use numpy if I document the process well, but I would like to use xarray for readability.
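For context, here is roughly how things are set up. The coordinate values below are placeholders, but the shapes match my real data (44 assemblies, 10 power labels, 22 heights, 13 isotopes):

import numpy as np
import xarray as xr

assemblies = [f'assm{i}' for i in range(44)]                     # 44 assemblies
powers = [f'p{i}' for i in range(6)] + ['NW', 'NE', 'SW', 'SE']  # 10 power labels
heights = list(range(22))                                        # 22 axial heights
isotopes = [f'iso{i}' for i in range(13)]                        # 13 isotopes

da = xr.DataArray(
    np.zeros((44, 10, 22, 13)),
    dims=['assembly', 'power', 'height', 'isotope'],
    coords=dict(assembly=assemblies, power=powers,
                height=heights, isotope=isotopes),
)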
After implementing a solution, I noticed that the operations/indexing below complete in about 5 seconds on my machine:
for isotope in isotopes:
    for height in heights:
        for assm in assemblies:
            da.loc[dict(power=['NW', 'NE', 'SW', 'SE'], assembly=assm,
                        height=height, isotope=isotope)] = [3, 5, 1, 20]
If I do the same thing with an integer-based approach in xarray, it takes about 2 seconds:
for k, isotope in enumerate(isotopes):
    for j, height in enumerate(heights):
        for i, assm in enumerate(assemblies):
            da[i, [-4, -3, -2, -1], j, k] = [3, 5, 1, 20]
Lastly, I noticed that the same integer-based indexing in numpy takes less than half a second:
arr = np.zeros((44, 10, 22, 13))
for k, isotope in enumerate(isotopes):
    for j, height in enumerate(heights):
        for i, assm in enumerate(assemblies):
            arr[i, [-4, -3, -2, -1], j, k] = [3, 5, 1, 20]
Speed is not my biggest concern here, but if the label-based approach in xarray is more than 8 times slower than the standard numpy integer-based approach, and the integer-based approach in xarray is 4 times slower, that dissuades me from digging deeper into xarray for medium-rank multidimensional data.
Any thoughts, advice, etc?
Answer
We can’t really tell you which package to use, and we certainly can’t without knowing much more about your data and use case.
For what it’s worth, while xarray’s performance will always lag numpy’s, the difference is most acute when performing many small operations like this one. You are assigning a tiny amount of data, via indexing, inside a triple for loop, which is kryptonite for xarray. If you make all assignments at once, the penalty decreases meaningfully, because the indexing overhead becomes less significant relative to the underlying numpy operations. Performance in xarray is all about minimizing that overhead and leveraging the backend as much as possible, while still getting the convenience of label-based indexing.
See this simple example. I’ve created a 3-D DataArray with 1 million float64s, indexed by (x, y, z):
In [11]: da = xr.DataArray(
    ...:     np.random.random(size=(100, 100, 100)),
    ...:     dims=list('xyz'),
    ...:     coords=[pd.Index([f'{d}{i}' for i in range(100)], name=d) for d in 'xyz'],
    ...: )
Looping through x and y and then assigning along the first four elements of z incurs a huge penalty, with xarray coming in at just over 100x numpy’s runtime for the same operation:
In [12]: %%time
    ...: for xi, x in enumerate(da.x.values):
    ...:     for yi, y in enumerate(da.y.values):
    ...:         da.loc[{'x': x, 'y': y, 'z': ['z0', 'z1', 'z2', 'z3']}] = [1, 2, 3, 4]
    ...:
CPU times: user 2.96 s, sys: 38.3 ms, total: 3 s
Wall time: 2.97 s

In [13]: %%time
    ...: for xi, x in enumerate(da.x.values):
    ...:     for yi, y in enumerate(da.y.values):
    ...:         da.values[xi, yi, :4] = [1, 2, 3, 4]
    ...:
CPU times: user 25.7 ms, sys: 508 µs, total: 26.3 ms
Wall time: 25.8 ms
If the same operation is restructured to assign all elements at once, xarray’s performance penalty decreases to about 6x the numpy runtime:
In [15]: %%time
    ...: da.loc[{'z': ['z0', 'z1', 'z2', 'z3']}] = np.tile([1, 2, 3, 4], (100, 100, 1))
    ...:
CPU times: user 1.4 ms, sys: 675 µs, total: 2.07 ms
Wall time: 2.99 ms

In [16]: %%time
    ...: da.values[:, :, :4] = np.tile([1, 2, 3, 4], (100, 100, 1))
    ...:
CPU times: user 488 µs, sys: 222 µs, total: 710 µs
Wall time: 428 µs
Assigning the whole array reduces xarray’s overhead to about 2x:
In [19]: %%time
    ...: da.loc[{'z': da.z}] = np.tile(np.random.random(100), (100, 100, 1))
    ...:
CPU times: user 11.2 ms, sys: 9.43 ms, total: 20.7 ms
Wall time: 20.9 ms

In [20]: %%time
    ...: da.values[:, :, :] = np.tile(np.random.random(100), (100, 100, 1))
    ...:
CPU times: user 3.08 ms, sys: 4.61 ms, total: 7.7 ms
Wall time: 6.72 ms
It’s up to you whether this is worth the cost. But whichever you choose, don’t use nested for loops for assignment :)
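For instance, assuming your dims are ordered (assembly, power, height, isotope) as your integer indexing suggests, the whole triple loop should collapse to a single broadcast assignment, along these lines:

values = np.array([3, 5, 1, 20])

# Label-based: select the four power labels once; the (4, 1, 1) value
# broadcasts across every (assembly, height, isotope) combination.
da.loc[dict(power=['NW', 'NE', 'SW', 'SE'])] = values[:, None, None]

# Pure-numpy equivalent on the underlying buffer:
da.values[:, -4:, :, :] = values[None, :, None, None]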