I have a large netCDF dataset with two dimensions: ‘time’ and a single spatial dimension ‘x’. There is also a ‘latitude’ and ‘longitude’ coordinate for each ‘x’.
This needs to be mapped onto a global half-degree 2D grid, so that the dimensions are ‘time’, ‘lat’ and ‘lon’. Not all of the points on the global half-degree grid exist in the original dataset, since it only contains land points, so any points not in the original dataset should have a value of 0.0.
The original dataset looks like this:
import xarray as xa

original_dataset = xa.load_dataset('original_dataset.nc')
print(original_dataset['snow_blow_gb'])

<xarray.DataArray 'snow_blow_gb' (time: 60630, x: 25993)>
[1575955590 values with dtype=float64]
Coordinates:
  * time       (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
    latitude   (x) float32 83.75 83.75 83.75 83.75 ... 50.25 50.25 50.25 50.25
    longitude  (x) float32 -36.75 -36.25 -35.75 -35.25 ... 155.2 155.8 156.2
Dimensions without coordinates: x
Attributes:
    long_name:     Fraction of gridbox snowfall redistributed
    units:         1
    cell_methods:  time : mean
The output file should look like this (but is currently empty):
new_dataset = xa.open_dataset('new_dataset.nc')
print(new_dataset)

<xarray.Dataset>
Dimensions:       (time: 60630, lon: 720, lat: 360)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
  * lon           (lon) float32 -179.8 -179.2 -178.8 ... 178.8 179.2 179.8
  * lat           (lat) float32 89.75 89.25 88.75 88.25 ... -88.75 -89.25 -89.75
Data variables:
    snow_blow_gb  (time, lon, lat) float64 ...
I first tried:
new_dataset['snow_blow_gb'].loc[
    {'lat': original_dataset['snow_blow_gb'].latitude,
     'lon': original_dataset['snow_blow_gb'].longitude}
] = original_dataset['snow_blow_gb']
However, there wasn’t enough memory, so I tried using dask and loading the new dataset in chunks first:
new_dataset = xa.open_dataset('new_dataset.nc', chunks={'lat': 36, 'lon': 72})
new_dataset['snow_blow_gb'].loc[
    {'lat': original_dataset['snow_blow_gb'].latitude,
     'lon': original_dataset['snow_blow_gb'].longitude}
] = original_dataset['snow_blow_gb']
I then discovered that you can’t use assignment with multi-dimensional indexing in dask.
Using a for-loop to assign each coordinate point in turn will take a long time, isn’t very elegant, and may eventually run out of memory.
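For reference, the kind of per-point loop I mean would look something like this (illustrative only, following the same pattern as my first attempt above):

# Illustrative only: assign one land point at a time.
# Assumes the two datasets opened above.
for i in range(original_dataset.sizes['x']):
    lat = float(original_dataset['latitude'][i])
    lon = float(original_dataset['longitude'][i])
    new_dataset['snow_blow_gb'].loc[{'lat': lat, 'lon': lon}] = \
        original_dataset['snow_blow_gb'][:, i]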
What should I do?
Answer
Even though your data is gridded, it does not currently contain the full set of lat/lon combinations you’d like to see in your final grid, so you can’t simply reshape it; you need to explicitly reindex the data first. This can be done pretty easily by assigning a MultiIndex in place of your stacked x index and then unstacking it. A major caveat is that this will explode the size of your data, and you also need to make sure your data is chunked along the (not-stacked) time dimension.
In this answer, I’m assuming you need your data to be a normal output array. If you could work with sparse arrays, this would save significant memory/storage, but would require a different approach.
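For reference, here is a minimal sketch of that sparse alternative, assuming the optional sparse package is installed and that ds already has the MultiIndex on x assigned as in the walkthrough below (I haven’t benchmarked this against a dask-backed array, so you may need to load or rechunk first):

# Sketch: unstack straight into a sparse-backed array; missing (ocean)
# points become the fill_value instead of materialised dense zeros.
reshaped_sparse = ds.unstack('x', sparse=True, fill_value=0.0)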
Walkthrough
Set up MRE
import numpy as np
import pandas as pd
import xarray as xr
import dask.array as dda

# set up grid spec
x = np.arange(-179.75, 180, 0.5)
y = np.arange(-89.75, 90, 0.5)
xx, yy = np.meshgrid(x, y)

# filter points to simulate your sparse data structure
dummy_land_mask = (np.random.random(size=(xx.shape)) > 0.9)
x_in_data = xx.ravel()[dummy_land_mask.flat]
y_in_data = yy.ravel()[dummy_land_mask.flat]

# construct random dask array with chunks along time only
arr = dda.random.random(
    (60630, len(x_in_data)),
    chunks=(1000, len(x_in_data)),
).astype('float32')

# build xr.Dataset with dims (time, x) and additional (lat, lon) coords
ds = xr.Dataset({
    'snow_blow_gb': xr.DataArray(
        arr,
        dims=['time', 'x'],
        coords={
            'time': pd.date_range('1850-01-02', freq='D', periods=60630),
            'latitude': (('x',), y_in_data),
            'longitude': (('x',), x_in_data),
        },
    ),
})
So this should look similar to your data. Note that I chunked the data along time only – it’s important that the x dimension not be chunked when you read in the data so that dask does not have to reshape across chunks:
In [28]: ds
Out[28]:
<xarray.Dataset>
Dimensions:       (time: 60630, x: 25928)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
    latitude      (x) float64 -89.75 -89.75 -89.75 -89.75 ... 89.75 89.75 89.75
    longitude     (x) float64 -172.8 -172.2 -163.2 -160.2 ... 167.8 169.2 179.2
Dimensions without coordinates: x
Data variables:
    snow_blow_gb  (time, x) float32 dask.array<chunksize=(1000, 25928), meta=np.ndarray>
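For your real file, you can request the same layout when you open it (a sketch; the chunk size of 1000 along time is just an example, and 'original_dataset.nc' is the file from your question):

# Sketch: chunk along time only; dimensions not listed in `chunks` are left
# as a single chunk, so x stays unchunked.
original_dataset = xr.open_dataset('original_dataset.nc', chunks={'time': 1000})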
You can re-assign the x coordinate to include the x and y dimensions with pd.MultiIndex.from_arrays, giving the current (lat, lon) elements along the x dimension:
In [29]: lats = ds.latitude.values
    ...: lons = ds.longitude.values
    ...: ds = ds.drop(['latitude', 'longitude'])
    ...: ds.coords['x'] = pd.MultiIndex.from_arrays([lats, lons], names=['latitude', 'longitude'])

In [30]: ds
Out[30]:
<xarray.Dataset>
Dimensions:       (time: 60630, x: 25928)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
  * x             (x) MultiIndex
  - latitude      (x) float64 -89.75 -89.75 -89.75 -89.75 ... 89.75 89.75 89.75
  - longitude     (x) float64 -172.8 -172.2 -163.2 -160.2 ... 167.8 169.2 179.2
Data variables:
    snow_blow_gb  (time, x) float32 dask.array<chunksize=(1000, 25928), meta=np.ndarray>
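As an aside, the same MultiIndex can be built directly from the existing coordinates with set_index, which skips the manual drop and re-assign and also works on newer xarray releases, where assigning a pd.MultiIndex to a coordinate is deprecated:

# Sketch: equivalent to the cell above, using the existing latitude/longitude
# coordinates to build the MultiIndex along x.
ds = ds.set_index(x=['latitude', 'longitude'])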
Now you can unstack x to get the full-rank array. Note that this will produce a bunch of performance warnings. These are to be expected because you are, in fact, reshaping to create a large chunk. You can suppress these (or just ignore them) or use smaller chunks; it’s up to you. The current chunking scheme will produce roughly 1 GB chunks in the output dataset:
In [31]: reshaped = ds.unstack('x')
/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/xarray/core/indexing.py:1228: PerformanceWarning: Slicing is producing a large chunk. To accept the large chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array[indexer]
To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array[indexer]
  return self.array[key]
/opt/miniconda3/envs/myenv/lib/python3.10/site-packages/xarray/core/dataset.py:4212: PerformanceWarning: Reshaping is producing a large chunk. To accept the large chunk and silence this warning, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    ...     array.reshape(shape)
To avoid creating the large chunks, set the option
    >>> with dask.config.set(**{'array.slicing.split_large_chunks': True}):
    ...     array.reshape(shape)
Explictly passing ``limit`` to ``reshape`` will also silence this warning
    >>> array.reshape(shape, limit='128 MiB')
  result = result._unstack_full_reindex(dim, fill_value, sparse)

In [32]: reshaped
Out[32]:
<xarray.Dataset>
Dimensions:       (time: 60630, latitude: 360, longitude: 720)
Coordinates:
  * time          (time) datetime64[ns] 1850-01-02 1850-01-03 ... 2016-01-01
  * latitude      (latitude) float64 -89.75 -89.25 -88.75 ... 88.75 89.25 89.75
  * longitude     (longitude) float64 -179.8 -179.2 -178.8 ... 178.8 179.2 179.8
Data variables:
    snow_blow_gb  (time, latitude, longitude) float32 dask.array<chunksize=(1000, 360, 720), meta=np.ndarray>
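If you’d rather not see those warnings, you can follow the suggestion in the warning text itself and accept the large chunks explicitly:

import dask

# Accept the large chunks and silence the PerformanceWarnings, as suggested
# in the warning message.
with dask.config.set(**{'array.slicing.split_large_chunks': False}):
    reshaped = ds.unstack('x')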
At this point you could do clean-up such as filling NaNs, extending the dimensions to the full set of x, y coordinates, etc. But importantly, the data is still chunked with a chunksize of 1000 along time. Note that this MRE would produce an unstacked dataset upwards of 58 GB (if computed), of which only the original 6 GB is actual data (the rest will be NaNs, which you could fill with zeros). If your model accepts sparse arrays as an input, that would certainly be a more efficient approach.
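A sketch of that clean-up step, assuming you want the full half-degree grid, missing (ocean) points filled with 0.0, and the lat/lon dimension names from your target file:

# Sketch: extend to the full global half-degree grid, fill missing points
# with 0.0, and rename the dimensions to match the target file.
full = (
    reshaped
    .reindex(latitude=np.arange(89.75, -90, -0.5),      # descending, as in new_dataset.nc
             longitude=np.arange(-179.75, 180, 0.5))
    .fillna(0.0)
    .rename({'latitude': 'lat', 'longitude': 'lon'})
)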
Also, I’m using float32, while your data looks like it’s float64. If you really need that precision, then you should double all of my size estimates (so you’ll have a ~116 GB final product). If not, I’d recommend converting with ds['snow_blow_gb'] = ds['snow_blow_gb'].astype('float32') prior to unstacking.
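Finally, a sketch of writing the result to disk, using the hypothetical full dataset from the clean-up sketch above; to_netcdf will compute and stream the dask chunks, and the zlib/complevel entries are standard netCDF4 encoding options you can adjust or drop:

# Sketch: stream the chunked result to a compressed netCDF file.
full.to_netcdf(
    'new_dataset.nc',
    encoding={'snow_blow_gb': {'zlib': True, 'complevel': 4}},
)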