I want a function that takes column names and grouping conditions as input and, for each of those columns, returns the count of non-zero values per group.
Something like this, but with the non-zero condition included:
    def count_non_zero(df, features, grouping):
        exp_count = {x: 'count' for x in features}
        df = df.groupBy(*grouping).agg(exp_count)
        # rename columns to strip the brackets and the name of the applied aggregation
        for item in df.columns:
            df = df.withColumnRenamed(
                item,
                item[item.find('(') + 1: None if item.find(')') == -1 else item.find(')')]
            )
        return df
Answer
You can use a list comprehension to generate the list of aggregation expressions:
    import pyspark.sql.functions as F

    def count_non_zero(df, features, grouping):
        # F.when(cond, 1) without an otherwise() yields NULL when the condition
        # is false, and F.count() skips NULLs, so each expression counts only
        # the rows where the column is non-zero.
        return df.groupBy(*grouping).agg(
            *[F.count(F.when(F.col(c) != 0, 1)).alias(c) for c in features]
        )
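As a quick sanity check, here is a minimal usage sketch; the DataFrame, the column names (region, sales, returns), and the values are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = [("east", 10, 0), ("east", 0, 5), ("west", 3, 0)]
    df = spark.createDataFrame(data, ["region", "sales", "returns"])

    # count non-zero sales and returns per region
    count_non_zero(df, features=["sales", "returns"], grouping=["region"]).show()
    # Expected output (row order may vary):
    # +------+-----+-------+
    # |region|sales|returns|
    # +------+-----+-------+
    # |  east|    1|      1|
    # |  west|    1|      0|
    # +------+-----+-------+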