What’s the fastest way in Python to calculate cosine similarity given sparse matrix data?

Question

Given a sparse matrix listing, what's the best way to calculate the cosine similarity between each of the columns (or rows) in the matrix? I would rather not iterate n-choose-two times. Say the input matrix is: The sparse representation is: In Python, it's straightforward to work with the matrix-input format: Gives: That's fine for a full-matrix input, but I really

Accepted Answer

You can compute pairwise cosine similarity on the rows of a sparse matrix directly using sklearn. As of version 0.17 it also supports sparse output:

    import numpy as np
    from scipy import sparse
    from sklearn.metrics.pairwise import cosine_similarity

    A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
    A_sparse = sparse.csr_matrix(A)

    # Dense (numpy array) output is the default
    similarities = cosine_similarity(A_sparse)
    print('pairwise dense output:\n {}\n'.format(similarities))

    # It can also output sparse matrices
    similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
    print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Results:

    pairwise dense output:
    [[ 1.          0.40824829  0.40824829]
     [ 0.40824829  1.          0.33333333]
     [ 0.40824829  0.33333333  1.        ]]

    pairwise sparse output:
      (0, 1)  0.408248290464
      (0, 2)  0.408248290464
      (0, 0)  1.0
      (1, 0)  0.408248290464
      (1, 2)  0.333333333333
      (1, 1)  1.0
      (2, 1)  0.333333333333
      (2, 0)  0.408248290464
      (2, 2)  1.0

If you want column-wise cosine similarities, simply transpose your input matrix beforehand: A_sparse.transpose()
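
For readers who want to see what the sklearn call amounts to, here is a minimal sketch that computes the same row-wise similarities with SciPy alone: scale each row to unit length, then multiply the matrix by its own transpose. The helper name sparse_cosine_similarity is made up for illustration and is not part of sklearn or SciPy, and treating all-zero rows as having similarity 0 is an assumption of this sketch.

```python
import numpy as np
from scipy import sparse


def sparse_cosine_similarity(m):
    """Row-wise cosine similarity of a matrix, returned as a sparse matrix.

    Hypothetical helper for illustration; not an sklearn or SciPy function.
    """
    m = sparse.csr_matrix(m, dtype=np.float64)
    # Euclidean norm of each row, computed without densifying the matrix
    row_norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1)).ravel())
    # Assumption: an all-zero row keeps similarity 0 instead of dividing by zero
    row_norms[row_norms == 0] = 1.0
    # Scale every row to unit length via a sparse diagonal matrix
    normalized = sparse.diags(1.0 / row_norms) @ m
    # Dot products of unit-length rows are the cosine similarities
    return normalized @ normalized.T


A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
sims = sparse_cosine_similarity(A)
print(sims.toarray())
```

For column-wise similarities, pass the transposed matrix (for example A.T) instead. The product stays sparse only as long as most pairs of rows share no non-zero columns.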


Answer