I need to create a Pandas DataFrame that contains two columns:
- Combination – contains tuples that describe a combination of products in the binary table (e.g., (“bread”, “eggs”))
- Count – contains the number of times that this combination appeared in the binary table
The binary table that I keep referring to looks like this. 1 just means that the product is in that specific product (or in this case, it is present in the combination), and 0 if otherwise.
bread cheese eggs flour jam 1000 0 0 1 0 0 1001 1 0 0 0 0 1002 1 0 1 1 0 1003 1 0 1 0 1 1004 0 0 1 0 0 ... ... ... ... ... ... 1495 1 0 1 1 0 1496 1 1 1 0 0 1497 0 0 0 0 1 1498 1 0 0 0 0 1499 1 0 1 0 0 500 rows × 5 columns
I already figured out how to create the combinations column, I just don’t know how to create the count column using the data from the binary table. Here’s my code so far:
import pandas as pd import itertools combinations_list = [] products = ["bread","cheese","eggs","flour","jam"] for p in range(2, len(products)+1): for subset in itertools.combinations(products, p): combinations_list.append(str(subset)) #code for count column here report = pd.DataFrame(combinations_list,columns=['Combinations']) report
This is what the code looks like but I still need to add the count column.
>> Combinations 0 ('bread', 'cheese') 1 ('bread', 'eggs') 2 ('bread', 'flour') 3 ('bread', 'jam') 4 ('cheese', 'eggs') 5 ('cheese', 'flour') 6 ('cheese', 'jam') 7 ('eggs', 'flour') 8 ('eggs', 'jam') 9 ('flour', 'jam') 10 ('bread', 'cheese', 'eggs') 11 ('bread', 'cheese', 'flour') 12 ('bread', 'cheese', 'jam') 13 ('bread', 'eggs', 'flour') 14 ('bread', 'eggs', 'jam') 15 ('bread', 'flour', 'jam') 16 ('cheese', 'eggs', 'flour') 17 ('cheese', 'eggs', 'jam') 18 ('cheese', 'flour', 'jam') 19 ('eggs', 'flour', 'jam') 20 ('bread', 'cheese', 'eggs', 'flour') 21 ('bread', 'cheese', 'eggs', 'jam') 22 ('bread', 'cheese', 'flour', 'jam') 23 ('bread', 'eggs', 'flour', 'jam') 24 ('cheese', 'eggs', 'flour', 'jam') 25 ('bread', 'cheese', 'eggs', 'flour', 'jam')
Can anyone please help me? Thank you!
Advertisement
Answer
Here is one solution:
d={} for x in range(2,len(df.columns)+1): for y in itertools.combinations(df.columns,x): d[y]=0 for x in range(2,len(df.columns)+1): for i in df.index: s=[k for k in df.columns if df.loc[i, k]==1] p=[j for j in itertools.combinations(s,x)] for w in p: d[w]+=1 res=pd.DataFrame({'comb':d.keys(), 'count':d.values()})
For the visible part of your df (the rows you provided in the question) this code returns:
>>>print(res) comb count 0 (bread, cheese) 1 1 (bread, eggs) 5 2 (bread, flour) 2 3 (bread, jam) 1 4 (cheese, eggs) 1 5 (cheese, flour) 0 6 (cheese, jam) 0 7 (eggs, flour) 2 8 (eggs, jam) 1 9 (flour, jam) 0 10 (bread, cheese, eggs) 1 11 (bread, cheese, flour) 0 12 (bread, cheese, jam) 0 13 (bread, eggs, flour) 2 14 (bread, eggs, jam) 1 15 (bread, flour, jam) 0 16 (cheese, eggs, flour) 0 17 (cheese, eggs, jam) 0 18 (cheese, flour, jam) 0 19 (eggs, flour, jam) 0 20 (bread, cheese, eggs, flour) 0 21 (bread, cheese, eggs, jam) 0 22 (bread, cheese, flour, jam) 0 23 (bread, eggs, flour, jam) 0 24 (cheese, eggs, flour, jam) 0 25 (bread, cheese, eggs, flour, jam) 0