Consider the dataframe below, where there are several variables, each with the same number of values (in this case, 4).
var1 var2 var3 var4 ... 1 10 -5 35 ... 2 11 -6 36 ... 3 12 -7 45 ... 4 13 -8 -6 ...
I would like to create a function that returns the proportion of values that are greater/less than the specified threshold values for several variables. The main goal is to create a function with the ability to enter however many variables, their corresponding operator (‘<‘ or ‘>’), and their corresponding threshold values as desired.
For example, I was thinking of a function with the format (it doesn’t have to be this format):
def proportion(variables,operators,thresholds):
where
proportion([var1,var2],['>','>'],[2, 10])
would yield 2/4, or 0.5, since the last two row values in the dataframe both exceed 2 for var1 and 10 for var2.
I’ve been stuck on this as I have no idea how to create such a function without hardcoding the function to take in a specific number of variables. I’ve attempted to use a for loop, which clearly isn’t correct and not heading towards the right direction.
def proportion(variables,operators,thresholds): x=np.arange(len(variables)) for i in x: if operators[i] == '>': return sum(x2 > thresholds[i] for x2 in df[variables[i]])/4 else: return sum(x2 < thresholds[i] for x2 in df[variables[i]])/4
Advertisement
Answer
We can create a function using the signature you specify, however instead of using loops we will use vectorized operations on each column and then use boolean conditions to filter down to the results we want. Explanations of what is happening is captured as comments in the answer.
import pandas as pd # We create an operations map in order to map the function input to # the attribute of a Series. Both DataFrames and Series have gt, lt, and eq # operations that we can utilize which vectorize the operation for us # instead of looping over every row. op_map = {">": "gt", "<": "lt", "=": "eq"} def proportions(variables, ops, thresholds, df): # zip the inputs in to a list of tuples where the format of # each item looks like `("var1", "<", 1)` constraints = zip(variables, ops, thresholds) results = [] # iterating over our zipped tuples that we created above for constraint in constraints: v, op, t = constraint # map the operation input to the attribute name attr = op_map[op] # grab the relevant column from the DataFrame series = df[v] # grab the function from the Series we just created fn = getattr(series, attr) # find which rows meet the criteria and mark them True # otherwise they get marked as False result = fn(t) # append the final results to a list results.append(result) # concat all the columns together and sum up where every row in # the DataFrame is True, i.e all the thresholds were met total = pd.concat(results, axis=1).all(axis="columns").sum() # obtain the final answer by dividing by the length of the DataFrame return total / len(df) proportions(["var1", "var2"], [">", ">"], [2, 10], df) > 0.5 proportions(["var1", "var2", "var3"], [">", ">", "="], [2, 10, -7], df) > 0.25