I have ecommerce data with about 6000 SKUs and 250,000 obs. Simple version below but a lot more sparse. There is only one SKU per line as each line is a transaction.
What I have:
|Index| ID | SKU1| SKU2 | SKU3| |:----|:----|:----|:-----|:----| | 1 | 55 | 1 | 0 | 0 | | 2 | 55 | 0 | 1 | 0 | | 3 | 55 | 0 | 0 | 1 | | 4 | 66 | 0 | 1 | 0 | | 5 | 66 | 1 | 0 | 0 | | 6 | 77 | 0 | 1 | 0 |
I want to create a weighted undirected adjacency matrix so that I can do some graph analysis on the market baskets. It would look like the below, where SKU2 and SKU1 were bought together in baskets 55 and 66 and therefore have a total weight of 2.
What I want:
|Index| SKU1| SKU2| SKU3 | SKU4| |:----|:----|:----|:-----|:----| | SKU1| 0 | 2 | 1 | 0 | | SKU2| 2 | 0 | 0 | 0 | | SKU3| 1 | 0 | 0 | 0 | | SKU4| 0 | 0 | 0 | 0 |
I have tried a for loop iterating through the original DF but it crashes immediately.
Ideally I would collapse the first dataframe by the ID column but without aggregating, as there are no duplicate transactions for the same item and same ID. However, when I try to collapse using df.groupby(['ID']).count() I get the following. When I remove .count() there is no output. I’m sure there is another way to do this but can’t seem to find it in the documentation.
What I tried: df.groupby(['ID']).count()
| ID | SKU1| SKU2 | SKU3| |:----|:----|:---- |:----| | 55 | 3 | 3 | 3 | | 66 | 2 | 2 | 2 | | 77 | 1 | 1 | 1 |
Anyone know how I can generate the sparse matrix without immediately crashing my computer?
Advertisement
Answer
Count also counts zeros. Aggregate by sum instead and then convert to 0s and 1s.
agg = df.groupby('ID').agg('sum')
agg = (agg > 0).astype(int)
    SKU1    SKU2    SKU3
ID          
55  1       1       1
66  1       1       0
77  0       1       0
Turn it into a occurrence table and fill the diagonal with 0s for whatever reason.
occurrence = np.dot(agg.T, agg) np.fill_diagonal(occurrence, 0)
Turn it back into a dataframe
pd.DataFrame(occurrence, columns=df.columns[1:], index=df.columns[1:])
        SKU1    SKU2    SKU3
SKU1    0       2       1
SKU2    2       0       1
SKU3    1       1       0