
What’s the fastest way in Python to calculate cosine similarity given sparse matrix data?

Given a sparse matrix listing, what’s the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times.

Say the input matrix is:


The sparse representation is:

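To illustrate the two formats, here is a minimal sketch that assumes the sparse listing is a set of (row, column, value) triplets. The matrix values are made up for the example and are not the ones from the question; `rows`, `cols`, and `vals` are hypothetical names.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x5 dense matrix, used only for illustration.
A_dense = np.array([[0, 1, 0, 0, 1],
                    [0, 0, 1, 1, 1],
                    [1, 1, 0, 1, 0]], dtype=float)

# The same data as (row, column, value) triplets, i.e. a sparse listing.
rows = np.array([0, 0, 1, 1, 1, 2, 2, 2])
cols = np.array([1, 4, 2, 3, 4, 0, 1, 3])
vals = np.ones(len(rows))

# Build a COO matrix from the triplets, then convert to CSR for arithmetic.
A_sparse = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 5)).tocsr()
```

Only the non-zero entries are stored, so for a large and mostly empty matrix this takes far less memory than the dense array.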

In Python, it’s straightforward to work with the matrix-input format:


Gives:

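A dense computation of this kind could look like the sketch below. It is not necessarily the code from the question: it assumes a small made-up matrix and uses plain NumPy to normalize each row to unit length, after which the matrix product of the normalized rows gives every pairwise cosine similarity in one step.

```python
import numpy as np

# Hypothetical dense input; each row is one vector.
A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]], dtype=float)

# Scale each row to unit Euclidean length (assumes no all-zero rows).
row_norms = np.linalg.norm(A, axis=1, keepdims=True)
A_unit = A / row_norms

# Entry (i, j) is the cosine similarity between row i and row j.
dense_similarities = A_unit @ A_unit.T
print(dense_similarities)
```

This avoids an explicit n-choose-two loop, but it still needs the full dense matrix in memory.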

That’s fine for a full-matrix input, but I really want to start with the sparse representation (due to the size and sparsity of my matrix). Any ideas about how this could best be accomplished?


Answer

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:


Results:

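A minimal sketch of this approach, assuming a small made-up matrix rather than the original data, uses `cosine_similarity` from `sklearn.metrics.pairwise`. Its `dense_output=False` argument asks for a sparse result when the input is sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical input, stored as a CSR sparse matrix.
A = np.array([[0, 1, 0, 0, 1],
              [0, 0, 1, 1, 1],
              [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# Pairwise cosine similarity between rows; the default returns a dense array.
similarities = cosine_similarity(A_sparse)

# With dense_output=False the result stays sparse for sparse input.
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)

print(similarities)
print(similarities_sparse)
```

Keeping the output sparse mainly pays off when most pairs of rows share no non-zero entries, since those similarities are exactly zero and need not be stored.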

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:

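As a sketch of the column-wise case, again with a made-up matrix, transposing the sparse input makes each original column a row, so the same call returns column-to-column similarities.

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 3x5 input: 3 rows, 5 columns.
A_sparse = sparse.csr_matrix(np.array([[0, 1, 0, 0, 1],
                                       [0, 0, 1, 1, 1],
                                       [1, 1, 0, 1, 0]]))

# Transpose so columns become rows, then compare them pairwise.
column_similarities = cosine_similarity(A_sparse.T, dense_output=False)

# One row and one column per original column, so the shape is (5, 5).
print(column_similarities.shape)
```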
User contributions licensed under: CC BY-SA