1D netcdf to 2D lat lon using xarray and Dask

Question

I have a large netcdf dataset which has two dimensions - 'time' and a single spatial dimension 'x'. There is also a 'lat' and 'lon' coord for each 'x'. This needs to be mapped onto a global half degree 2D grid, such that the dimensions are 'time', 'lat and 'lon'. Not all the points on the global half degree grid

Accepted Answer

Even though your data is gridded, because your data does not currently have the full set of lat/lon combinations you’d like to see in your final grid, you can’t simply reshape your data and need to explicitly reindex the data first. This can be done pretty easily by assigning a MultiIndex in place of your stacked index, and then unstacking it. A major caveat is that this will explode the size of your data, and also you need to make sure your data is chunked along the (not-stacked) time dimension.In this answer, I’m assuming you need your data to be a normal output array. If you could work with sparse arrays, this would save significant memory/storage, but would require a different approach.WalkthroughSet up MRE# set up grid specx = np.arange(-179.75, 180, 0.5)y = np.arange(-89.75, 90, 0.5)xx, yy = np.meshgrid(x, y)# filter points to simulate your sparse data structuredummy_land_mask = (np.random.random(size=(xx.shape)) > 0.9)x_in_data = xx.ravel()[dummy_land_mask.flat]y_in_data = yy.ravel()[dummy_land_mask.flat]# construct random dask array with chunks along time onlyarr = dda.random.random( (60630, len(x_in_data)), chunks=(1000, len(x_in_data)),).astype('float32')# build xr.Dataset with dims (time, x) and additional (lat, lon) coordsds = xr.Dataset({ 'snow_blow_gb': xr.DataArray( arr, dims=['time', 'x'], coords={ 'time': pd.date_range('1850-01-02', freq='D', periods=60630), 'latitude': (('x',), y_in_data), 'longitude': (('x',), x_in_data), }, ),})So this should look similar to your data. Note that I chunked the data along time only – it’s important that the x dimension not be chunked when you read in the data so that dask does not have to reshape across chunks:In [28]: dsOut[28]:Dimensions: (time: 60630, x: 25928)Coordinates: * time (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01 latitude (x) float64 -89.75 -89.75 -89.75 -89.75 ... 89.75 89.75 89.75 longitude (x) float64 -172.8 -172.2 -163.2 -160.2 ... 167.8 169.2 179.2Dimensions without coordinates: xData variables: snow_blow_gb (time, x) float32 dask.arrayYou can re-assign the x coordinate to include the x and y dimensions with pd.MultiIndex.from_arrays giving the current elements of lat, lon along the x dimension:In [29]: lats = ds.latitude.values ...: lons = ds.longitude.values ...: ds = ds.drop(['latitude', 'longitude']) ...: ds.coords['x'] = pd.MultiIndex.from_arrays([lats, lons], names=['latitude', 'longitude'])In [30]: dsOut[30]:Dimensions: (time: 60630, x: 25928)Coordinates: * time (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01 * x (x) MultiIndex - latitude (x) float64 -89.75 -89.75 -89.75 -89.75 ... 89.75 89.75 89.75 - longitude (x) float64 -172.8 -172.2 -163.2 -160.2 ... 167.8 169.2 179.2Data variables: snow_blow_gb (time, x) float32 dask.arrayNow, you can unstack x to get the full rank array. Note that this will produce a bunch of performance warnings. These are to be expected because you are, in fact, reshaping to create a large chunk. You can suppress these (or just ignore them) or use smaller chunks – up to you. The current chunking scheme will produce roughly 1GB chunks in the output dataset:In [31]: reshaped = ds.unstack('x')/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the largechunk and silence this warning, set the option >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}): ... array[indexer]To avoid creating the large chunks, set the option >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}): ... array[indexer] return self.array[key]/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/xarray/core/dataset.py:4212: PerformanceWarning: Reshaping is producing a large chunk. To accept the largechunk and silence this warning, set the option >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}): ... array.reshape(shape)To avoid creating the large chunks, set the option >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}): ... array.reshape(shape)Explictly passing ``limit`` to ``reshape`` will also silence this warning >>> array.reshape(shape, limit='128 MiB') result = result._unstack_full_reindex(dim, fill_value, sparse)In [32]: reshapedOut[32]:Dimensions: (time: 60630, latitude: 360, longitude: 720)Coordinates: * time (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01 * latitude (latitude) float64 -89.75 -89.25 -88.75 ... 88.75 89.25 89.75 * longitude (longitude) float64 -179.8 -179.2 -178.8 ... 178.8 179.2 179.8Data variables: snow_blow_gb (time, latitude, longitude) float32 dask.arrayAt this point you could do clean up such as filling NaNs, extending the dimensions to the full set of x, y coordinates, etc. But importantly, the data is still chunked with a chunksize of 1000 along time. Note that this MRE would produce an unstacked dataset upwards of 58 GB (if computed), of which only the original 6 GB is actual data (the rest will be NaNs which you could fill with zeros). If your model accepts sparse arrays as an input, this would certainly be a more efficient approach.Also, I’m using float32 – your data looks like it’s float64 – if you really need this precision than you should double all my size estimates (so you’ll have a 116GB final product). If not, I’d recommend converting with ds['snow_blow_gb'] = ds['snow_blow_gb'].astype('float32') prior to unstacking.

Advertisement

Answer

Walkthrough