I have a pandas DataFrame containing two columns ['A', 'B']. Each column is made up of integers.

I want to construct a sparse matrix with the following properties:

row index is all integers from 0 to the max value in the dataframe
column index is the same as row index
entry i,j = 1 if [i,j] or [j,i] is a row of my dataframe (1 should be the max value of the matrix).

Most importantly, I want to do this using

coo_matrix((data, (i, j)))

from scipy.sparse as I’m trying to understand this constructor and this particular way of using it. I have never worked with sparse matrices before. I’ve tried a few things but none of them is working.
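For reference, here is a minimal sketch of how the (data, (i, j)) form of the constructor behaves. The arrays below are made-up toy inputs, not taken from the dataframe in this question: data[k] is placed at row i[k], column j[k], the shape is inferred from the largest indices when none is given, and repeated coordinates are added together on conversion.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical toy inputs: three entries, the last two share the same (row, col).
data = np.array([1, 1, 1])
i = np.array([0, 2, 2])   # row indices
j = np.array([1, 3, 3])   # column indices

m = coo_matrix((data, (i, j)))

# With no shape argument, the shape is inferred from the largest indices.
print(m.shape)      # (3, 4)

# Duplicate (row, col) pairs are summed when converting to a dense array,
# so position (2, 3) holds 2 rather than 1.
print(m.toarray())
```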

EDIT

Sample code

Defining the dataframe

In [96]: df = pd.DataFrame(np.random.randint(5, size=(10,2)))

In [97]: df.columns = ['a', 'b']

In [98]: df
Out[98]: 
   a  b
0  0  3
1  1  4
2  3  3
3  2  0
4  0  2
5  1  0
6  1  1
7  2  3
8  3  4
9  3  2

The closest I’ve come to a solution

In [100]: scipy.sparse.coo_matrix((np.ones_like(df['a']), (df['a'].array, df['b'
     ...: ].array))).toarray()
Out[100]: 
array([[0, 0, 1, 1, 0],
       [1, 1, 0, 0, 1],
       [1, 0, 0, 1, 0],
       [0, 0, 1, 1, 1]])

The problem is that this isn't a symmetric matrix (it doesn't set both i,j and j,i for a given row), and I think it would give values greater than 1 if there were duplicate rows.

Answer

import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

df = pd.DataFrame(np.random.default_rng(seed=100).integers(5, size=(10,2)))
df.columns = ['a', 'b']

arr = coo_matrix((np.ones_like(df.a), (df.a.values, df.b.values)))

This is what you’ve got. It gives you i,j >= 1 if [i,j] is in df.
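One caveat worth checking: without a shape argument the constructor infers the dimensions from the largest index in each column, so the result is only square when both columns happen to share the same maximum. The sketch below uses a small made-up frame to show the difference; n is an illustrative name, not something from the question.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix

# Hypothetical frame where column 'a' never reaches the overall maximum.
df = pd.DataFrame({'a': [0, 1, 3], 'b': [3, 4, 3]})

rows = df['a'].to_numpy()
cols = df['b'].to_numpy()
ones = np.ones(len(df), dtype=int)

# n is an illustrative name: one more than the largest value in either column.
n = int(max(rows.max(), cols.max())) + 1

inferred = coo_matrix((ones, (rows, cols)))
square = coo_matrix((ones, (rows, cols)), shape=(n, n))

print(inferred.shape)  # (4, 5)
print(square.shape)    # (5, 5)
```

Only the square version can safely be added to its own transpose, since the two operands must have matching shapes.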

arr = arr + arr.T

array([[0, 1, 2, 2, 0],
       [1, 0, 0, 0, 0],
       [2, 0, 0, 1, 2],
       [2, 0, 1, 0, 1],
       [0, 0, 2, 1, 2]])

Now i,j >= 1 if [i,j] or [j,i] is in df.

arr.data = np.ones_like(arr.data)

Now i,j = 1 if [i,j] or [j,i] is in df.

array([[0, 1, 1, 1, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 1, 1],
       [1, 0, 1, 0, 1],
       [0, 0, 1, 1, 1]])
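The same steps can also be folded into a single constructor call. The following is a sketch rather than a definitive implementation: symmetric_indicator is a hypothetical helper name, the column names 'a' and 'b' are assumed from the question's sample, and the shape is passed explicitly so the matrix is square. Each pair is listed in both orientations, duplicates are merged, and the stored values are reset to 1.

```python
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix


def symmetric_indicator(df, col_i='a', col_j='b'):
    """Return an n x n sparse 0/1 matrix with ones at (i, j) and (j, i)."""
    i = df[col_i].to_numpy()
    j = df[col_j].to_numpy()
    n = int(max(i.max(), j.max())) + 1

    # Listing every pair in both orientations makes the result symmetric.
    rows = np.concatenate([i, j])
    cols = np.concatenate([j, i])
    data = np.ones(rows.size, dtype=int)

    m = coo_matrix((data, (rows, cols)), shape=(n, n))

    # Repeated coordinates would be summed; merge them in place,
    # then reset every stored value to 1.
    m.sum_duplicates()
    m.data[:] = 1
    return m


# The sample dataframe printed in the question.
df = pd.DataFrame({
    'a': [0, 1, 3, 2, 0, 1, 1, 2, 3, 3],
    'b': [3, 4, 3, 0, 2, 0, 1, 3, 4, 2],
})

dense = symmetric_indicator(df).toarray()
print(dense)
```

Concatenating the index arrays avoids the intermediate arr + arr.T step, at the cost of storing each diagonal pair twice before sum_duplicates() merges them.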

How do I construct an incidence matrix from two dataframe columns using scipy.sparse.coo_matrix((data, (i, j)))?
