I have a mock dataframe, df1 with 6 columns and 5 rows, i.e., with shape (5 x 6).
Each column represents the price of an area, and rows are time.
DK1 DK2 NO1 NO2 NO3 NO4 0 10 10 12 15 15 10 1 15 10 15 10 10 10 2 15 15 15 15 15 15 3 10 10 12 15 15 10 4 10 10 10 10 15 15
Now, I want to identify – in each row – the areas with the same price as the first column “DK1”, and then being able to sum up how often any combination of columns are equal, meaning that my desired output for this would be:
Combo Occurrence 0 DK1-DK2-NO4 2 1 DK1-DK2-NO1-NO2-NO3-NO4 1 2 DK1-NO1 1 3 DK1-DK2-NO1-NO2 1
Preferably, the solution should be applicable to any sized DataFrame. I started trying with the .apply() method, but couldn’t quite get started. Hope you can help.
Advertisement
Answer
Compare DataFrame by first column by DataFrame.eq
, then use matrix multiplication by DataFrame.dot
with columns names with separator and last count with Series.value_counts
and convert to DataFrame
:
df = (df.eq(df['DK1'], axis=0) .dot(df.columns + ',') .str[:-1] .value_counts() .rename_axis('Combo') .reset_index(name='Occurrence')) print (df) Combo Occurrence 0 DK1,DK2,NO4 2 1 DK1,NO1 1 2 DK1,DK2,NO1,NO2 1 3 DK1,DK2,NO1,NO2,NO3,NO4 1
EDIT: For groups is possible create dictionary by all values and then call replace
:
s = df.columns.to_series() s.index = s.index.str.replace('d+','', regex=True) d = s.groupby(level=0).agg(','.join).to_dict() d = {v:k for k, v in d.items()} print (d) {'DK1,DK2': 'DK', 'NO1,NO2,NO3,NO4': 'NO'} df = (df.eq(df['DK1'], axis=0) .dot(df.columns + ',') .str[:-1] .value_counts() .rename_axis('Combo') .reset_index(name='Occurrence')) df['Combo'] = df['Combo'].replace(d, regex=True) print (df) Combo Occurrence 0 DK,NO4 2 1 DK1,NO1 1 2 DK,NO1,NO2 1 3 DK,NO 1