I want to select all the rows of a dataset where exactly M columns satisfy a condition based on N columns (where N >= M). Consider the following dataset The code below selects conditions where at least one (or more) of the columns (y0, y1, y2, y3) are True. However, I want to select rows where exactly 2 (analogously 1

Select all rows of a dataframe where exactly M columns in any order satisfy a condition based on N columns

I want to select all the rows of a dataset where exactly M columns satisfy a condition based on N columns (where N >= M).

Consider the following dataset

import numpy as np
import pandas as pd
import random

years = [2000, 2001, 2002]
products = ["A", "B", "C"]

num_combos = 10

years = [random.choice(years) for i in range(num_combos)]
products = [random.choice(products) for i in range(num_combos)]

sum_values = list(range(0, num_combos))
sum_values1 = list(range(0, num_combos))

bools = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool1 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool2 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool3 = [bool(random.getrandbits(1)) for i in range(num_combos)]



random.shuffle(sum_values)
av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)]

cols = {"years": years,
        "products": products,
        "y0": bools,
        "y1": bool1,
        "y2": bool2, 
        "y3": bool3,
        "value": av_values}

df = pd.DataFrame(cols)

The code below selects conditions where at least one (or more) of the columns (y0, y1, y2, y3) are True. However, I want to select rows where exactly 2 (analogously 1 or 3) of the columns (y0, y1, y2, y3) are True. Of course one could find all the combinations and use that as a mask, but is there a smarter solution?

cols = ["y0", "y1", "y2", "y3"]

# At least one
df1 = df[(df[cols] == True).any(axis=1)]

# All of the columns
df2 = df[(df[cols] == True).all(axis=1)]

Answer

You can use sum to count the number of True (they are implicitly converted to 1s), and eq to compare to the desired number:

M = 2
df1 = df[df[cols].sum(axis=1).eq(M)]

NB. If you already have booleans, no need to compare to True.

Output for M=2:

   years products     y0     y1     y2    y3  value
2   2000        A  False  False   True  True      3
7   2002        C   True  False  False  True      0

Advertisement

Answer