Skip to content
Advertisement

Identify number of occurrence of any combination of equal elements in DataFrame

I have a mock dataframe, df1 with 6 columns and 5 rows, i.e., with shape (5 x 6).

Each column represents the price of an area, and rows are time.

   DK1   DK2   NO1   NO2   NO3   NO4
0  10    10    12    15    15    10
1  15    10    15    10    10    10
2  15    15    15    15    15    15
3  10    10    12    15    15    10
4  10    10    10    10    15    15

Now, I want to identify – in each row – the areas with the same price as the first column “DK1”, and then being able to sum up how often any combination of columns are equal, meaning that my desired output for this would be:

       Combo                  Occurrence
0   DK1-DK2-NO4                   2
1   DK1-DK2-NO1-NO2-NO3-NO4       1
2   DK1-NO1                       1
3   DK1-DK2-NO1-NO2               1

Preferably, the solution should be applicable to any sized DataFrame. I started trying with the .apply() method, but couldn’t quite get started. Hope you can help.

Advertisement

Answer

Compare DataFrame by first column by DataFrame.eq, then use matrix multiplication by DataFrame.dot with columns names with separator and last count with Series.value_counts and convert to DataFrame:

df = (df.eq(df['DK1'], axis=0)
        .dot(df.columns + ',')
        .str[:-1]
        .value_counts()
        .rename_axis('Combo')
        .reset_index(name='Occurrence'))
print (df)
                     Combo  Occurrence
0              DK1,DK2,NO4           2
1                  DK1,NO1           1
2          DK1,DK2,NO1,NO2           1
3  DK1,DK2,NO1,NO2,NO3,NO4           1

EDIT: For groups is possible create dictionary by all values and then call replace:

s = df.columns.to_series()
s.index = s.index.str.replace('d+','', regex=True)

d = s.groupby(level=0).agg(','.join).to_dict()
d = {v:k for k, v in d.items()}
print (d)
{'DK1,DK2': 'DK', 'NO1,NO2,NO3,NO4': 'NO'}

df = (df.eq(df['DK1'], axis=0)
        .dot(df.columns + ',')
        .str[:-1]
        .value_counts()
        .rename_axis('Combo')
        .reset_index(name='Occurrence'))

df['Combo'] = df['Combo'].replace(d, regex=True)
print (df)
        Combo  Occurrence
0      DK,NO4           2
1     DK1,NO1           1
2  DK,NO1,NO2           1
3       DK,NO           1
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement