Function that returns proportion of values exceeding threshold with several variables

Question

Consider the dataframe below, where there are several variables, each with the same number of values (in this case, 4). I would like to create a function that returns the proportion of values that are greater/less than the specified threshold values for several variables. The main goal is to create a function…

Accepted Answer

We can create a function using the signature you specify, however instead of using loops we will use vectorized operations on each column and then use boolean conditions to filter down to the results we want. Explanations of what is happening is captured as comments in the answer.import pandas as pd# We create an operations map in order to map the function input to# the attribute of a Series. Both DataFrames and Series have gt, lt, and eq# operations that we can utilize which vectorize the operation for us# instead of looping over every row.op_map = {">": "gt", "<": "lt", "=": "eq"}def proportions(variables, ops, thresholds, df): # zip the inputs in to a list of tuples where the format of # each item looks like `("var1", "<", 1)` constraints = zip(variables, ops, thresholds) results = [] # iterating over our zipped tuples that we created above for constraint in constraints: v, op, t = constraint # map the operation input to the attribute name attr = op_map[op] # grab the relevant column from the DataFrame series = df[v] # grab the function from the Series we just created fn = getattr(series, attr) # find which rows meet the criteria and mark them True # otherwise they get marked as False result = fn(t) # append the final results to a list results.append(result) # concat all the columns together and sum up where every row in # the DataFrame is True, i.e all the thresholds were met total = pd.concat(results, axis=1).all(axis="columns").sum() # obtain the final answer by dividing by the length of the DataFrame return total / len(df)proportions(["var1", "var2"], [">", ">"], [2, 10], df)> 0.5proportions(["var1", "var2", "var3"], [">", ">", "="], [2, 10, -7], df)> 0.25

Advertisement

Answer