I have a rather large dataset, and I need to find the extreme values in it, including their coordinates.
The real dataset is much larger, but let’s take this one for testing:
import xarray as xr
import numpy as np
import pandas as pd

values = np.array(
    [[[3, 1, 1],
      [1, 1, 1],
      [1, 1, 1]],
     [[1, 1, 1],
      [1, 1, 1],
      [1, 1, 4]],
     [[1, 1, 1],
      [1, 1, 1],
      [1, 1, 5]]]
)
da = xr.DataArray(values, dims=('time', 'lat', 'lon'),
                  coords={'time': list(range(3)), 'lat': list(range(3)), 'lon': list(range(3))})
I want to find all values larger than 2 in this DataArray. I found this solution here:
da.where(da>2, drop=True)
but even in this small example, this produces a lot more nans than values:
array([[[ 3., nan],
        [nan, nan]],

       [[nan, nan],
        [nan,  4.]],

       [[nan, nan],
        [nan,  5.]]])
and it’s worse in my actual dataset.
I’ve tried to write a helper function to convert it to a pandas dataframe, like this:
def find_val(da):
    res = pd.DataFrame(columns=['Time', 'Latitude', 'Longitude', 'Value'])
    for time_idx, time in enumerate(da['time']):
        for lat_idx, lat in enumerate(da['lat']):
            for lon_idx, lon in enumerate(da['lon']):
                value = da.isel(time=time_idx, lat=lat_idx, lon=lon_idx).item()
                if not np.isnan(value):
                    res.loc[len(res.index)] = [time.item(), lat.item(), lon.item(), value]
    return res
find_val(da.where(da>2, drop=True))
This produces the output I want, but three nested for loops seem excessive:
   Time  Latitude  Longitude  Value
0   0.0       0.0        0.0    3.0
1   1.0       1.0        1.0    4.0
2   2.0       1.0        1.0    5.0
Any good suggestions on how to improve this?
Answer
xarray already ships a conversion to pandas, so you don’t need the hand-written loops:
DataArray.to_dataframe(name=None, dim_order=None)
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_dataframe.html
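For example, a minimal sketch on the test array from the question: flatten the DataArray to a DataFrame with to_dataframe, then filter the rows in pandas. The column name 'Value' is only chosen here to match the question’s helper function; no NaN-padded intermediate array is needed at all.

```python
import numpy as np
import pandas as pd
import xarray as xr

values = np.array(
    [[[3, 1, 1], [1, 1, 1], [1, 1, 1]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 4]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 5]]]
)
da = xr.DataArray(values, dims=('time', 'lat', 'lon'),
                  coords={'time': range(3), 'lat': range(3), 'lon': range(3)})

# One row per grid cell; the coordinates become a MultiIndex (time, lat, lon).
df = (da.to_dataframe(name='Value')
        .reset_index())               # turn the coordinates into ordinary columns
# Keep only the extreme values.
df = df[df['Value'] > 2].reset_index(drop=True)
print(df)
```

This replaces both the where(..., drop=True) call and the three nested loops with vectorized operations, and the resulting frame carries the actual coordinate values, not positional indices.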
On a side note, if you are looking to remove extreme values without a specific threshold in mind, you might want to look into outlier detection: https://scikit-learn.org/stable/modules/outlier_detection.html
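Coming back to the fixed threshold: if all you need are the positions and values above it, a plain-NumPy sketch (on the same test array) also avoids the NaN-filled intermediate entirely:

```python
import numpy as np

values = np.array(
    [[[3, 1, 1], [1, 1, 1], [1, 1, 1]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 4]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 5]]]
)

mask = values > 2
idx = np.argwhere(mask)   # one (time, lat, lon) index triple per match
vals = values[mask]       # the matching values, in the same (C) order as idx
```

Because the test coordinates are plain 0..2 ranges, these integer indices equal the coordinate values; with real coordinate arrays you would map them back, e.g. da['lat'].values[idx[:, 1]].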