I work with geospatial images in TIFF format. Thanks to the rasterio library, I can load these images as numpy arrays of dimension (nb_bands, x, y). Here I am working with an image that contains patches of unique values (generated with the scipy.ndimage.label function), and I would like to count the pixels in each patch.
My idea was to use numpy's unique function to retrieve the information from these patches as follows:
    import numpy as np
    import pandas as pd
    import rasterio as rio

    # identify the clumps
    with rio.open(mask) as f:
        mask_raster = f.read(1)

    class_, indices, count = np.unique(mask_raster, return_index=True, return_counts=True)
    del mask_raster

    # identify the value
    with rio.open(src) as f:
        src_raster = f.read(1)

    src_flat = src_raster.flatten()
    del src_raster

    values = [src_flat[index] for index in indices]

    df = pd.DataFrame({'patchId': indices, 'nb_pixel': count, 'value': values})
My problem is this: for an image of shape (69940, 70936) (84.7 MB on disk), np.unique tries to allocate an array of the same shape in a 64-bit integer type, and I get the following error:
Unable to allocate 37.0 GiB for an array with shape (69940, 70936) and data type uint64
- Is it normal that unique converts my array to int64?
- Is it possible to force it to use a more compact type? (Even if every patch were a single pixel, np.int32 would be sufficient.)
- Is there another solution using a function I don't know?
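To put the error message in numbers, here is a minimal sketch of the arithmetic behind it. The shape comes from the error above; the helper name `gib` is only illustrative. It shows that one full-raster array at 8 bytes per pixel accounts for the 37.0 GiB, and that the total pixel count is larger than the biggest value an int32 can hold.

```python
import numpy as np

shape = (69940, 70936)              # raster shape taken from the error message
nb_pixels = shape[0] * shape[1]     # total number of pixels in one band


def gib(dtype):
    """Size in GiB of one full-raster array of the given dtype (illustrative helper)."""
    return nb_pixels * np.dtype(dtype).itemsize / 2**30


# memory needed for one full copy of the raster in a few candidate types
for dtype in (np.uint8, np.int32, np.uint64):
    print(f"{np.dtype(dtype).name:>6}: {gib(dtype):5.1f} GiB")

# can a flat pixel index (or one label per pixel) fit in an int32?
fits_int32 = nb_pixels <= np.iinfo(np.int32).max
print("fits in int32:", fits_int32)
```

Any approach that needs a second full-size array with 8-byte items runs into the same limit, which is why working on small slices of the raster is attractive here.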
Answer
I dug into the scipy.ndimage library and did find a solution that avoids the memory explosion. Because it only slices the initial raster, it is faster than I thought:
    import numpy as np
    import pandas as pd
    import rasterio as rio
    from scipy import ndimage

    # open the files
    with rio.open(mask) as f_mask, rio.open(src) as f_src:
        mask_raster = f_mask.read(1)
        src_raster = f_src.read(1)

    # use patches as slicing material
    # find_objects returns one bounding box per label, from 1 to max included
    indices = list(range(1, np.max(mask_raster) + 1))
    counts = []
    values = []
    for i, loc in enumerate(ndimage.find_objects(mask_raster)):
        # count the source values inside the bounding box of the patch
        loc_values, loc_counts = np.unique(src_raster[loc], return_counts=True)

        # the value of the patch is the value with the highest count
        idx = np.argmax(loc_counts)
        counts.append(loc_counts[idx])
        values.append(loc_values[idx])

    df = pd.DataFrame({'patchId': indices, 'nb_pixel': counts, 'value': values})
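To see how the bounding-box slicing behaves, here is a small self-contained sketch on a toy array instead of a real raster. The array `src`, the `rows` list and the exact-footprint test `mask[loc] == label` are assumptions made for illustration, not part of the code above; the variant counts only the pixels that really carry the label, rather than the most frequent value in the bounding box.

```python
import numpy as np
from scipy import ndimage

# toy source raster: three separate patches on a 0 background (illustrative data)
src = np.array([
    [5, 5, 0, 0, 7],
    [5, 0, 0, 7, 7],
    [0, 0, 0, 0, 0],
    [5, 5, 5, 0, 0],
], dtype=np.uint8)

# label connected non-zero pixels; this stands in for the mask raster
mask, nb_patches = ndimage.label(src)

rows = []
for label, loc in enumerate(ndimage.find_objects(mask), start=1):
    if loc is None:                      # label absent from the mask
        continue
    in_patch = mask[loc] == label        # exact footprint inside the bounding box
    nb_pixel = int(in_patch.sum())       # number of pixels in this patch
    value = int(src[loc][in_patch][0])   # source value under the patch
    rows.append((label, nb_pixel, value))

print(nb_patches, rows)
```

Each iteration only touches the small window returned by find_objects, so no temporary array of the full raster size is created inside the loop.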