Skip to content
Advertisement

Dataframe count set of conditions passed by several columns on a per row basis

I have a dataframe which looks something like this:

date   A_1  A_2  A_3  B_1  B_2  B_3  C_1  C_2  C_3  D_1  D_2  D_3
xxx    4    5    6    2    3    1    5    7    2    4    3    1
xxx    3    3    2    4    5    2    6    2    3    2    4    2
xxx    5    7    5    1    3    3    4    5    4    8    2    2
xxx    6    1    8    6    1    4    1    2    7    4    3    5

I am trying to compute a value based on a condition for every row which will apply across the column groupings of A, B, C, D, etc. and count how many of those groups passed the condition, for example, some pseudo-code:

count = 0
for each (A, B, C, D) as col:
    if col_1 > 3 and (col_2 > col_3):
        count++

dataframe['count'] = count   

Expected output:

date   A_1  A_2  A_3  B_1  B_2  B_3  C_1  C_2  C_3  D_1  D_2  D_3  count
xxx    4    5    6    2    3    1    5    7    2    4    3    1    2
xxx    3    3    2    4    5    2    6    2    3    2    4    2    1
xxx    5    7    5    1    3    3    4    5    4    8    2    2    2
xxx    6    1    8    6    1    4    1    2    7    4    3    5    0

This would mean the example dataframe would end up with a new column with a value of 2 because in this example column C and D pass the set of conditions.

Later I plan to do something for each row with a count value above some number.

The best way I was able to come up with looked something like this:

for col in cols:
    conditions.append(
        (
            (dataframe[f'{col}_1'] > min_corr_coef) & 
            (dataframe[f'{col}_2'] < (dataframe[f'{col_3}])
        ) | 
        (dataframe[f'{col}_1'] <= min_corr_coef)
    )

conditions.append(
    (dataframe[dataframe.loc[:,cols] > min_corr_coef].count(axis=1) >= min_corr_pair)
)

if conditions:
    dataframe.loc[
        reduce(lambda x, y: x & y, conditions),
        'pass'] = 1

Not only is this incorrect, but it is also extremely slow. What I like about it, is that it is readable and adding additional conditions if there are more columns/different data is relatively straightforward.

I am not sure if this type of operation is too complex to perform on a dataframe the way I am trying to and perhaps I need to change my data structure or not, but I figured I’d see if there is some way of going about this before I start to re-write things.

Ideally the logic being applied to the column groups (A, B, C, D in this case) can be modular and whatever method I use to do this should be easy to add/remove conditions because I want to be able to add/remove data columns and logic conditions over time.

What I am trying to do is test an arbitrary set of conditions when one of the columns for a group (which is a correlation coefficient) is above a certain value. If enough of the column groups pass the conditions, do something.

Thanks in advance, I am a python and pandas novice and this has been giving me a headache for days.

Advertisement

Answer

Let’s ignore date and handle the other columns first.

Remove date, create a MultiIndex using str.split, and stack to long form:

new_df = df.drop(columns='date')
new_df.columns = new_df.columns.str.split('_', expand=True)
new_df = new_df.stack(level=0)
     1  2  3
0 A  4  5  6
  B  2  3  1
  C  5  7  2
  D  4  3  1
1 A  3  3  2
  B  4  5  2
  C  6  2  3
  D  2  4  2
2 A  5  7  5
  B  1  3  3
  C  4  5  4
  D  8  2  2
3 A  6  1  8
  B  6  1  4
  C  1  2  7
  D  4  3  5

Apply the condition row-wise:

new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])

Then sum level 0 and assign back to the original df:

df['count'] = new_df['condition'].sum(level=0)

(Alternatively sum the conditions directly rather than assigning to both new_df and df)

df['count'] = (new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])).sum(level=0)

df:

  date  A_1  A_2  A_3  B_1  B_2  B_3  C_1  C_2  C_3  D_1  D_2  D_3  count
0  xxx    4    5    6    2    3    1    5    7    2    4    3    1      2
1  xxx    3    3    2    4    5    2    6    2    3    2    4    2      1
2  xxx    5    7    5    1    3    3    4    5    4    8    2    2      2
3  xxx    6    1    8    6    1    4    1    2    7    4    3    5      0

Complete Working Example:

import pandas as pd

df = pd.DataFrame({
    'date': ['xxx', 'xxx', 'xxx', 'xxx'], 'A_1': [4, 3, 5, 6],
    'A_2': [5, 3, 7, 1], 'A_3': [6, 2, 5, 8],
    'B_1': [2, 4, 1, 6], 'B_2': [3, 5, 3, 1],
    'B_3': [1, 2, 3, 4], 'C_1': [5, 6, 4, 1],
    'C_2': [7, 2, 5, 2], 'C_3': [2, 3, 4, 7],
    'D_1': [4, 2, 8, 4], 'D_2': [3, 4, 2, 3],
    'D_3': [1, 2, 2, 5]
})

new_df = df.drop(columns='date')
new_df.columns = new_df.columns.str.split('_', expand=True)
new_df = new_df.stack(level=0)
new_df['condition'] = new_df['1'].gt(3) & new_df['2'].gt(new_df['3'])
df['count'] = new_df['condition'].sum(level=0)
print(df)
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement