I have a large .tiff file (4.4gB, 79530 x 54980 values) with 1 band. Since only 16% of the values are valid, I was thinking it’s better to import the file as sparse matrix, to save RAM. When I first open it as `np.array`

and then transform it into a sparse matrix using `csr_matrix()`

, my kernel already crashes. See code below.

from osgeo import gdal import numpy as np from scipy.sparse import csr_matrix ds = gdal.Open("file.tif") band = ds.GetRasterBand(1) array = np.array(band.ReadAsArray()) csr_matrix(array)

Is there a better way to work with this file? In the end I have to make calculations based on the values in the raster. (Unfortunately, due to confidentiality, I cannot attach the relevant file.)

## Advertisement

## Answer

Can you tell where the crash occurs?

band = ds.GetRasterBand(1) temp = band.ReadAsArray() array = np.array(temp) # if temp is already an array, you don't need this csr_matrix(array)

If `array`

is 4.4gB, (79530, 54980)

In [62]: (79530 * 54980) / 1e9 Out[62]: 4.3725594 # 4.4gB makes sense for 1 byte/element In [63]: (79530 * 54980) * 0.16 # 16% density Out[63]: 699609504.0 # number of nonzero values

creating `csr`

requires doing `np.nonzero(array)`

to get the indices. That will produce 2 arrays of this 0.7 * 8 Gb size (indices are 8 byte ints). `coo`

format actually requires those 2 arrays plus 0.7 for the nonzero values – about 12 Gb . Converted to `csr`

, the `row`

attribute is reduced to 79530 elements – so about 7 Gb . (corrected for 8 bytes/element)

So at 16% density, the sparse format is, at it’s best, is still larger than the dense version.

Memory error when converting matrix to sparse matrix, specified dtype is invalid

is a recent case of a memory error – which occurred in `nonzero`

step.