I have a rather large dataset, and I need to find the extreme values in it, including their coordinates.
The real dataset is much larger, but let’s take this one for testing:
import xarray as xr
import numpy as np
import pandas as pd

values = np.array(
    [[[3, 1, 1],
      [1, 1, 1],
      [1, 1, 1]],
     [[1, 1, 1],
      [1, 1, 1],
      [1, 1, 4]],
     [[1, 1, 1],
      [1, 1, 1],
      [1, 1, 5]]]
)
da = xr.DataArray(values, dims=('time', 'lat', 'lon'),
                  coords={'time': list(range(3)), 'lat': list(range(3)), 'lon': list(range(3))})
I want to find all values larger than 2 in this DataArray. I found this solution here:
da.where(da>2, drop=True)
but even in this small example, this produces a lot more nans than values:
array([[[ 3., nan],
        [nan, nan]],

       [[nan, nan],
        [nan,  4.]],

       [[nan, nan],
        [nan,  5.]]])
and it’s worse in my actual dataset.
I’ve tried to write a helper function to convert it to a pandas dataframe, like this:
def find_val(da):
    res = pd.DataFrame(columns=['Time', 'Latitude', 'Longitude', 'Value'])
    for time_idx, time in enumerate(da['time']):
        for lat_idx, lat in enumerate(da['lat']):
            for lon_idx, lon in enumerate(da['lon']):
                value = da.isel(time=time_idx, lat=lat_idx, lon=lon_idx).item()
                if not np.isnan(value):
                    res.loc[len(res.index)] = [time.item(), lat.item(), lon.item(), value]
    return res
find_val(da.where(da>2, drop=True))
This produces the output I want, but three nested for loops seem excessive:
   Time  Latitude  Longitude  Value
0   0.0       0.0        0.0    3.0
1   1.0       1.0        1.0    4.0
2   2.0       1.0        1.0    5.0
Any good suggestions on how to improve this?
Answer
xarray already ships a conversion to pandas, so you don’t need the hand-written loops:
DataArray.to_dataframe(name=None, dim_order=None)
https://docs.xarray.dev/en/stable/generated/xarray.DataArray.to_dataframe.html
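For example, a minimal sketch on the test array from the question: flatten the DataArray to a DataFrame with to_dataframe, then filter the rows in pandas. The column name 'Value' is only chosen here to match the question’s helper function; no NaN-padded intermediate array is needed at all.

```python
import numpy as np
import pandas as pd
import xarray as xr

values = np.array(
    [[[3, 1, 1], [1, 1, 1], [1, 1, 1]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 4]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 5]]]
)
da = xr.DataArray(values, dims=('time', 'lat', 'lon'),
                  coords={'time': range(3), 'lat': range(3), 'lon': range(3)})

# One row per grid cell; the coordinates become a MultiIndex (time, lat, lon).
df = (da.to_dataframe(name='Value')
        .reset_index())               # turn the coordinates into ordinary columns
# Keep only the extreme values.
df = df[df['Value'] > 2].reset_index(drop=True)
print(df)
```

This replaces both the where(..., drop=True) call and the three nested loops with vectorized operations, and the resulting frame carries the actual coordinate values, not positional indices.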
On a side note, if you are looking to remove extreme values without a specific threshold in mind, you might want to look into outlier detection: https://scikit-learn.org/stable/modules/outlier_detection.html
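Coming back to the fixed threshold: if all you need are the positions and values above it, a plain-NumPy sketch (on the same test array) also avoids the NaN-filled intermediate entirely:

```python
import numpy as np

values = np.array(
    [[[3, 1, 1], [1, 1, 1], [1, 1, 1]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 4]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 5]]]
)

mask = values > 2
idx = np.argwhere(mask)   # one (time, lat, lon) index triple per match
vals = values[mask]       # the matching values, in the same (C) order as idx
```

Because the test coordinates are plain 0..2 ranges, these integer indices equal the coordinate values; with real coordinate arrays you would map them back, e.g. da['lat'].values[idx[:, 1]].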