I want to select all the rows of a dataset where exactly M columns satisfy a condition based on N columns (where N >= M).
Consider the following dataset
import numpy as np import pandas as pd import random years = [2000, 2001, 2002] products = ["A", "B", "C"] num_combos = 10 years = [random.choice(years) for i in range(num_combos)] products = [random.choice(products) for i in range(num_combos)] sum_values = list(range(0, num_combos)) sum_values1 = list(range(0, num_combos)) bools = [bool(random.getrandbits(1)) for i in range(num_combos)] bool1 = [bool(random.getrandbits(1)) for i in range(num_combos)] bool2 = [bool(random.getrandbits(1)) for i in range(num_combos)] bool3 = [bool(random.getrandbits(1)) for i in range(num_combos)] random.shuffle(sum_values) av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)] cols = {"years": years, "products": products, "y0": bools, "y1": bool1, "y2": bool2, "y3": bool3, "value": av_values} df = pd.DataFrame(cols)
The code below selects conditions where at least one (or more) of the columns (y0, y1, y2, y3) are True. However, I want to select rows where exactly 2 (analogously 1 or 3) of the columns (y0, y1, y2, y3) are True. Of course one could find all the combinations and use that as a mask, but is there a smarter solution?
cols = ["y0", "y1", "y2", "y3"] # At least one df1 = df[(df[cols] == True).any(axis=1)] # All of the columns df2 = df[(df[cols] == True).all(axis=1)]
Advertisement
Answer
You can use sum
to count the number of True (they are implicitly converted to 1s), and eq
to compare to the desired number:
M = 2 df1 = df[df[cols].sum(axis=1).eq(M)]
NB. If you already have booleans, no need to compare to True
.
Output for M=2
:
years products y0 y1 y2 y3 value 2 2000 A False False True True 3 7 2002 C True False False True 0