I want to select all the rows of a dataset where exactly M columns satisfy a condition based on N columns (where N >= M).
Consider the following dataset
import numpy as np
import pandas as pd
import random
years = [2000, 2001, 2002]
products = ["A", "B", "C"]
num_combos = 10
years = [random.choice(years) for i in range(num_combos)]
products = [random.choice(products) for i in range(num_combos)]
sum_values = list(range(0, num_combos))
sum_values1 = list(range(0, num_combos))
bools = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool1 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool2 = [bool(random.getrandbits(1)) for i in range(num_combos)]
bool3 = [bool(random.getrandbits(1)) for i in range(num_combos)]
random.shuffle(sum_values)
av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)]
cols = {"years": years,
        "products": products,
        "y0": bools,
        "y1": bool1,
        "y2": bool2, 
        "y3": bool3,
        "value": av_values}
df = pd.DataFrame(cols)
The code below selects conditions where at least one (or more) of the columns (y0, y1, y2, y3) are True. However, I want to select rows where exactly 2 (analogously 1 or 3) of the columns (y0, y1, y2, y3) are True. Of course one could find all the combinations and use that as a mask, but is there a smarter solution?
cols = ["y0", "y1", "y2", "y3"] # At least one df1 = df[(df[cols] == True).any(axis=1)] # All of the columns df2 = df[(df[cols] == True).all(axis=1)]
Advertisement
Answer
You can use sum to count the number of True (they are implicitly converted to 1s), and eq to compare to the desired number:
M = 2 df1 = df[df[cols].sum(axis=1).eq(M)]
NB. If you already have booleans, no need to compare to True.
Output for M=2:
years products y0 y1 y2 y3 value 2 2000 A False False True True 3 7 2002 C True False False True 0