I have a rather large dataset and need to find its extreme values, including their coordinates.
The real dataset is much larger, but let’s take this one for testing:
import xarray as xr
import numpy as np
import pandas as pd

values = np.array(
    [[[3, 1, 1], [1, 1, 1], [1, 1, 1]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 4]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 5]]]
)
da = xr.DataArray(
    values,
    dims=('time', 'lat', 'lon'),
    coords={'time': list(range(3)), 'lat': list(range(3)), 'lon': list(range(3))},
)
I want to find all values larger than 2 in this DataArray. I found this solution on here:
da.where(da>2, drop=True)
but even in this small example, this produces far more NaNs than values, since drop=True can only drop a coordinate label when every value along it is NaN, so the remaining rectangular array stays padded with NaNs:

array([[[ 3., nan],
        [nan, nan]],

       [[nan, nan],
        [nan,  4.]],

       [[nan, nan],
        [nan,  5.]]])

and it’s worse in my actual dataset.
I’ve tried writing a helper function that converts it to a pandas DataFrame, like this:
def find_val(da):
    res = pd.DataFrame(columns=['Time', 'Latitude', 'Longitude', 'Value'])
    for time_idx, time in enumerate(da['time']):
        for lat_idx, lat in enumerate(da['lat']):
            for lon_idx, lon in enumerate(da['lon']):
                value = da.isel(time=time_idx, lat=lat_idx, lon=lon_idx).item()
                if not np.isnan(value):
                    res.loc[len(res.index)] = [time.item(), lat.item(), lon.item(), value]
    return res

find_val(da.where(da>2, drop=True))
This produces the output I want:

   Time  Latitude  Longitude  Value
0   0.0       0.0        0.0    3.0
1   1.0       1.0        1.0    4.0
2   2.0       1.0        1.0    5.0

but three nested for loops seem excessive.
Any good suggestions on how to improve this?
Answer
There is already a built-in implementation for converting to pandas:
DataArray.to_dataframe(name=None, dim_order=None)
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_dataframe.html
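For example, here is a minimal sketch of how it could replace the nested loops, using the da from the question (the column name 'Value' and the reset_index() step are illustrative choices, not requirements of the API):

# Mask values <= 2 (no drop=True needed, since the NaN rows are
# removed on the pandas side), convert to a DataFrame indexed by
# a (time, lat, lon) MultiIndex, and drop the NaN rows.
df = (
    da.where(da > 2)
      .to_dataframe(name='Value')
      .dropna()
      .reset_index()
)

reset_index() turns the time/lat/lon MultiIndex back into ordinary columns, giving one row per non-NaN value together with its coordinates, with no explicit loops.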
On a side note, if you are looking to remove extreme values without a specific threshold in mind, you might want to check out outlier detection: https://scikit-learn.org/stable/modules/outlier_detection.html
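For instance, a rough sketch with scikit-learn’s IsolationForest (choosing this particular estimator, and treating each grid cell as a one-feature sample, are assumptions about your use case, not part of the question):

from sklearn.ensemble import IsolationForest

# Flatten the 3-D array into an (n_samples, 1) feature matrix;
# each grid cell becomes one sample with a single feature.
X = da.values.reshape(-1, 1)

# fit_predict returns -1 for outliers and 1 for inliers.
labels = IsolationForest(random_state=0).fit_predict(X)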