According to my previous question here (Python – Multiply sparse matrix row with non-sparse vector by index), direct indexing of sparse matrices is not possible (at least not if you don't want to work with the three arrays by which the sparse csr matrix is defined: data, indices, indptr).
But I just found out that, given a csr sparse matrix A, this operation works fine and produces correct results: A[i, j].
What I also noticed: It is horribly slow, even slower than working with dense matrices.
I couldn't find any information about this indexing method, so I am wondering: what exactly is A[i, j] doing?
If you'd like me to take a guess, I would say it is producing a dense matrix and then indexing it like you normally would.
Answer
In [211]: M = sparse.csr_matrix(np.eye(3))
In [212]: M
Out[212]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>
Indexing with [0] produces a new sparse matrix, with (1,3) shape:
In [213]: M[0]
Out[213]:
<1x3 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>
Trying to index that again gives another sparse matrix, or an error. That's because it is still a 2d object, with (1,3) shape.
In [214]: M[0][1]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-214-0661a1f27e52> in <module>
----> 1 M[0][1]

/usr/local/lib/python3.6/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
    290             # [i, 1:2]
    291             elif isinstance(col, slice):
--> 292                 return self._get_row_slice(row, col)
    293             # [i, [1, 2]]
    294             elif issequence(col):

/usr/local/lib/python3.6/dist-packages/scipy/sparse/csr.py in _get_row_slice(self, i, cslice)
    397
    398         if i < 0 or i >= M:
--> 399             raise IndexError('index (%d) out of range' % i)
    400
    401         start, stop, stride = cslice.indices(N)

IndexError: index (1) out of range
Indexing with the [0,1] syntax does work, with the two numbers applying to the two different dimensions:
In [215]: M[0,1]
Out[215]: 0.0
A[0][1] does work with a np.ndarray, but that's because the first [0] produces an array with one less dimension. np.matrix and sparse, however, return a 2d matrix from that first index, not a 1d one. That's one reason we don't recommend np.matrix. With sparse the matrix nature goes deeper, so we can't simply deprecate it.
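A small sketch of that dimensionality difference, comparing the same chained indexing on a dense array and on its csr counterpart:

```python
# Contrast chained indexing on a dense 2d array vs a csr matrix.
import numpy as np
from scipy import sparse

A = np.eye(3)                # dense 2d array
M = sparse.csr_matrix(A)     # same values, csr format

# Dense: A[0] drops a dimension, so a second [1] indexes within the row.
print(A[0].shape)            # (3,)  -- 1d
print(A[0][1])               # 0.0

# Sparse: M[0] is still 2d, so a second [1] tries to pick row 1 of a
# 1-row matrix (an IndexError); the [i, j] form is the one that works.
print(M[0].shape)            # (1, 3) -- still 2d
print(M[0, 1])               # 0.0
```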
We can get an idea of the code involved in selecting an element from a sparse matrix by triggering an error:
In [216]: M[0,4]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-216-4919ae565782> in <module>
----> 1 M[0,4]

/usr/local/lib/python3.6/dist-packages/scipy/sparse/csr.py in __getitem__(self, key)
    287             # [i, j]
    288             if isintlike(col):
--> 289                 return self._get_single_element(row, col)
    290             # [i, 1:2]
    291             elif isinstance(col, slice):

/usr/local/lib/python3.6/dist-packages/scipy/sparse/compressed.py in _get_single_element(self, row, col)
    868         if not (0 <= row < M) or not (0 <= col < N):
    869             raise IndexError("index out of bounds: 0<=%d<%d, 0<=%d<%d" %
--> 870                              (row, M, col, N))
    871
    872         major_index, minor_index = self._swap((row, col))

IndexError: index out of bounds: 0<=0<3, 0<=4<3
===
Yes, indexing an item in a sparse matrix is slower than indexing in a dense array. It's not because it first converts to dense. With a dense array, indexing an item just requires converting the n-d index to a flat one and selecting the required bytes in the 1d flat data buffer, and most of that is done in fast compiled code. But as you can see from the traceback, selecting an item from a sparse matrix is more involved, and a lot of it runs in Python.
The sparse lil format is designed to be faster for indexing (and especially for setting). But even that is quite a bit slower than indexing a dense array. Don't use sparse matrices if you need to iterate, or otherwise repeatedly access individual elements.
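A rough timing sketch makes the gap concrete. The matrix size and repetition count here are arbitrary choices; absolute numbers vary by machine and scipy version, so only the relative ordering is meaningful:

```python
# Rough timing of single-element access: dense vs csr vs lil.
# Absolute times vary by machine; the ordering is what matters.
import timeit
import numpy as np
from scipy import sparse

A = np.eye(100)
Mcsr = sparse.csr_matrix(A)
Mlil = sparse.lil_matrix(A)

times = {}
for name, obj in [('dense', A), ('csr', Mcsr), ('lil', Mlil)]:
    times[name] = timeit.timeit(lambda: obj[50, 50], number=1000)
    print(f'{name}: {times[name]:.4f}s for 1000 lookups')
```

Typically the dense lookup is orders of magnitude faster than either sparse format, with lil faster than csr.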
===
To give an idea of what's involved with indexing M, look at its key attributes:
In [224]: M.data, M.indices, M.indptr
Out[224]:
(array([1., 1., 1.]),
 array([0, 1, 2], dtype=int32),
 array([0, 1, 2, 3], dtype=int32))
To pick row 0, we have to use indptr to select a slice from the other two arrays:
In [225]: slc = slice(M.indptr[0], M.indptr[1])
In [226]: M.data[slc], M.indices[slc]
Out[226]: (array([1.]), array([0], dtype=int32))
Then, to pick column 1, we have to check whether that value is in indices[slc]. If it is, return the corresponding element in data[slc]; if not, return 0.
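The steps above can be sketched as a small helper. This is an illustrative reimplementation using only the public data, indices and indptr attributes, not the actual scipy code path:

```python
# Minimal sketch of the [i, j] lookup on a csr matrix, built from
# the three defining arrays: data, indices, indptr.
import numpy as np
from scipy import sparse

def csr_get(M, i, j):
    """Return M[i, j] by scanning row i's slice of indices/data."""
    slc = slice(M.indptr[i], M.indptr[i + 1])   # row i's stored entries
    row_cols = M.indices[slc]                   # stored column numbers
    hits = np.flatnonzero(row_cols == j)
    if hits.size:
        return M.data[slc][hits[0]]             # j is stored in row i
    return 0.0                                  # implicit zero

M = sparse.csr_matrix(np.eye(3))
print(csr_get(M, 0, 0))   # 1.0
print(csr_get(M, 0, 1))   # 0.0
```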
For more complex indexing, sparse actually uses matrix multiplication, having created an extractor matrix from the indices. It also uses multiplication to perform row or column sums.
Matrix multiplication is a sparse matrix strength, provided the matrix is sparse enough. The mathematical roots of sparse formats, especially csr, are in sparse linear equation problems, such as finite difference and finite element PDEs.
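To illustrate the extractor idea, here is a sketch of row selection done as a multiplication: a one-hot sparse matrix E built from the requested row indices, so that E @ M pulls out those rows. (The construction below is my own illustration of the concept, not scipy's internal code.)

```python
# Row extraction via multiplication with a one-hot "extractor" matrix.
import numpy as np
from scipy import sparse

M = sparse.csr_matrix(np.arange(9).reshape(3, 3))

rows = [2, 0]    # rows to extract, in this order
E = sparse.csr_matrix(
    (np.ones(len(rows)),                    # one 1 per selected row
     (np.arange(len(rows)), rows)),         # at (output row, source row)
    shape=(len(rows), M.shape[0]))

print((E @ M).toarray())        # rows 2 and 0 of M
print(M[rows].toarray())        # same result via fancy indexing
```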
===
Here are the underlying attributes for a lil matrix:
In [227]: ml = M.tolil()
In [228]: ml.data
Out[228]: array([list([1.0]), list([1.0]), list([1.0])], dtype=object)
In [229]: ml.rows
Out[229]: array([list([0]), list([1]), list([2])], dtype=object)
In [230]: ml.data[0], ml.rows[0]
Out[230]: ([1.0], [0])     # cf Out[226]
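The per-row lists make the same element lookup simpler, which is why lil indexing is faster. A sketch of that lookup, again built only on the public rows and data attributes:

```python
# Element lookup on a lil matrix: each row keeps a list of stored
# column indices (rows) and a parallel list of values (data).
import numpy as np
from scipy import sparse

ml = sparse.lil_matrix(np.eye(3))

def lil_get(M, i, j):
    """Return M[i, j] from the per-row Python lists."""
    row_cols = M.rows[i]                    # stored columns of row i
    if j in row_cols:
        return M.data[i][row_cols.index(j)]
    return 0.0                              # implicit zero

print(lil_get(ml, 0, 0))   # 1.0
print(lil_get(ml, 2, 0))   # 0.0
```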