
How to use np.unique on big arrays?

I work with geospatial images in TIFF format. Thanks to the rasterio library, I can load these images as NumPy arrays of shape (nb_bands, x, y). Here I am working with an image that contains patches, each with a unique value, that I would like to count (they were generated with the scipy.ndimage.label function).

My idea was to use NumPy's unique function to retrieve the information about these patches as follows:

import numpy as np
import pandas as pd
import rasterio as rio

# identify the clumps
with rio.open(mask) as f:
    mask_raster = f.read(1)

class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
del mask_raster

# identify the value
with rio.open(src) as f:
    src_raster = f.read(1)

src_flat = src_raster.flatten()
del src_raster

# value of the source raster at the first pixel of each patch
values = src_flat[indices]

df = pd.DataFrame({'patchId': class_, 'nb_pixel': count, 'value': values})

My problem is this: for an image of shape (69940, 70936) (84.7 MB on disk), np.unique tries to allocate an array of the same shape with a 64-bit integer dtype, and I get the following error:

Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64

  • Is it normal that unique casts my array to a 64-bit integer type?
  • Is it possible to force it to use a more compact dtype? (Even if all my patches were 1 pixel, np.int32 would be sufficient.)
  • Is there another solution using a function I don’t know?
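
As a quick check of the numbers quoted above, the arithmetic behind the error message can be reproduced from the shape alone. This is only a sketch: the variable names are illustrative, and the one-patch-per-pixel case is a worst-case assumption rather than a property of the actual image.

```python
import numpy as np

shape = (69940, 70936)  # raster shape quoted in the error message
n_pixels = shape[0] * shape[1]

# Size of a 64-bit copy of the raster, in GiB
size_gib = n_pixels * np.dtype(np.uint64).itemsize / 2**30
print(f"{n_pixels} pixels -> {size_gib:.1f} GiB as uint64")
# prints "4961263840 pixels -> 37.0 GiB as uint64"

# Worst case of one patch per pixel: would the labels fit in 32 bits?
fits_int32 = n_pixels <= np.iinfo(np.int32).max
print("fits in int32:", fits_int32)  # prints "fits in int32: False"
```

The 37.0 GiB figure matches a full 64-bit copy of the raster. Note that the pixel count itself exceeds the int32 range, so int32 labels would only be enough as long as the number of patches stays well below the number of pixels.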


Answer

I dug into the scipy.ndimage library and found a solution that avoids the memory explosion. Because it slices the initial raster, it is faster than I thought:

from scipy import ndimage
import numpy as np
import pandas as pd
import rasterio as rio

# open the files
with rio.open(mask) as f_mask, rio.open(src) as f_src:
    mask_raster = f_mask.read(1)
    src_raster = f_src.read(1)

# use the bounding box of each patch as slicing material
indices = []
counts = []
values = []
for label, loc in enumerate(ndimage.find_objects(mask_raster), start=1):
    if loc is None:  # label not present in the mask
        continue

    # pixels of the bounding box that belong to this patch
    in_patch = mask_raster[loc] == label
    loc_values, loc_counts = np.unique(src_raster[loc][in_patch], return_counts=True)

    # the value of the patch is the value with the highest count
    idx = np.argmax(loc_counts)
    indices.append(label)
    counts.append(int(in_patch.sum()))
    values.append(loc_values[idx])

df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
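
If only the pixel counts are needed, another option might be to scan the label raster in blocks of rows and accumulate np.bincount results, so that only one block at a time is converted to a wider integer type. This is a different technique from the find_objects approach above, shown as a minimal sketch: count_labels_chunked, n_labels and rows_per_chunk are hypothetical names introduced for illustration, and the code assumes the labels are consecutive non-negative integers such as those produced by scipy.ndimage.label.

```python
import numpy as np


def count_labels_chunked(labels, n_labels, rows_per_chunk=1024):
    """Count the pixels of each label by scanning the raster in row blocks.

    Hypothetical helper: `n_labels` is assumed to be the highest label
    value present in `labels` (0 being the background).
    """
    counts = np.zeros(n_labels + 1, dtype=np.int64)
    for start in range(0, labels.shape[0], rows_per_chunk):
        block = labels[start:start + rows_per_chunk]
        # Only this block is converted to the integer type np.bincount expects
        block = block.astype(np.intp, copy=False).ravel()
        counts += np.bincount(block, minlength=n_labels + 1)
    return counts


# Tiny synthetic label raster: background 0 and two patches
demo_labels = np.array([[0, 1, 1],
                        [2, 2, 2],
                        [0, 0, 1]], dtype=np.uint64)
demo_counts = count_labels_chunked(demo_labels, n_labels=2, rows_per_chunk=2)
print(demo_counts)  # prints [3 3 3]
```

The peak extra memory is roughly one block of rows in 64-bit integers plus the counts array, so rows_per_chunk can be tuned to the available RAM. The sketch does not recover the source value of each patch; that part would still need something like the slicing loop above.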