I have ecommerce data with about 6,000 SKUs and 250,000 observations. A simplified version is below, but the real data is much sparser. There is only one SKU per row, as each row is a transaction.
What I have:
| Index | ID | SKU1 | SKU2 | SKU3 |
|:------|:---|:-----|:-----|:-----|
| 1     | 55 | 1    | 0    | 0    |
| 2     | 55 | 0    | 1    | 0    |
| 3     | 55 | 0    | 0    | 1    |
| 4     | 66 | 0    | 1    | 0    |
| 5     | 66 | 1    | 0    | 0    |
| 6     | 77 | 0    | 1    | 0    |
I want to create a weighted undirected adjacency matrix so that I can do some graph analysis on the market baskets. It would look like the table below: SKU1 and SKU2 were bought together in baskets 55 and 66, so that pair has a total weight of 2.
What I want:
| Index | SKU1 | SKU2 | SKU3 | SKU4 |
|:------|:-----|:-----|:-----|:-----|
| SKU1  | 0    | 2    | 1    | 0    |
| SKU2  | 2    | 0    | 1    | 0    |
| SKU3  | 1    | 1    | 0    | 0    |
| SKU4  | 0    | 0    | 0    | 0    |
I have tried a for loop iterating through the original DataFrame, but it crashes immediately.
Ideally I would collapse the first dataframe by the ID column without aggregating, since there are no duplicate transactions for the same item and the same ID. However, when I collapse with `.count()` I get the table below, and when I remove `.count()` there is no output. I'm sure there is another way to do this but I can't seem to find it in the documentation.

What I tried:

```python
df.groupby(['ID']).count()
```

| ID | SKU1 | SKU2 | SKU3 |
|:---|:-----|:-----|:-----|
| 55 | 3    | 3    | 3    |
| 66 | 2    | 2    | 2    |
| 77 | 1    | 1    | 1    |
Anyone know how I can generate the sparse matrix without immediately crashing my computer?
Answer
`count` also counts zeros. Aggregate with `sum` instead and then convert the result to 0s and 1s.
```python
agg = df.groupby('ID').agg('sum')
agg = (agg > 0).astype(int)
```

```
    SKU1  SKU2  SKU3
ID
55     1     1     1
66     1     1     0
77     0     1     0
```
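For reference, this aggregation step can be reproduced end to end on the sample data (a minimal sketch; the DataFrame construction below is assumed from the question's table):

```python
import pandas as pd

# One row per transaction, one-hot encoded SKUs (sample from the question)
df = pd.DataFrame({
    'ID':   [55, 55, 55, 66, 66, 77],
    'SKU1': [1, 0, 0, 0, 1, 0],
    'SKU2': [0, 1, 0, 1, 0, 1],
    'SKU3': [0, 0, 1, 0, 0, 0],
})

# Sum within each basket, then clip to a 0/1 membership indicator
agg = (df.groupby('ID').sum() > 0).astype(int)
print(agg)
```

Each row of `agg` is now one basket, and each column says whether that SKU appeared in the basket at least once.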
Turn it into a co-occurrence table. The diagonal counts each SKU paired with itself, so fill it with 0s to drop those self-loops.
```python
import numpy as np

occurrence = np.dot(agg.T, agg)
np.fill_diagonal(occurrence, 0)
```
Finally, turn it back into a dataframe:
```python
pd.DataFrame(occurrence, columns=df.columns[1:], index=df.columns[1:])
```

```
      SKU1  SKU2  SKU3
SKU1     0     2     1
SKU2     2     0     1
SKU3     1     1     0
```
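With ~6,000 SKUs the dense dot product allocates a full 6,000 × 6,000 array, which may still be heavy. A sparse variant is sketched below (this uses `scipy.sparse` and is not part of the original answer; the sample DataFrame is assumed from the question):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Sample data from the question
df = pd.DataFrame({
    'ID':   [55, 55, 55, 66, 66, 77],
    'SKU1': [1, 0, 0, 0, 1, 0],
    'SKU2': [0, 1, 0, 1, 0, 1],
    'SKU3': [0, 0, 1, 0, 0, 0],
})

# Baskets x SKUs membership indicator, stored sparsely
agg = (df.groupby('ID').sum() > 0).astype(int)
basket = sparse.csr_matrix(agg.values)

# SKUs x SKUs co-occurrence counts; densify only the final small matrix
cooc = (basket.T @ basket).toarray()
np.fill_diagonal(cooc, 0)  # remove self-loops

adj = pd.DataFrame(cooc, index=agg.columns, columns=agg.columns)
print(adj)
```

The intermediate basket matrix stays sparse, so memory scales with the number of nonzero entries rather than baskets × SKUs.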