PySpark: write a function to count non-zero values of given columns

I want a function that takes column names and grouping columns as input and, for each given column, returns the count of non-zero values per group.

Something like this, but including the non-zero condition as well:

def count_non_zero(df, features, grouping):
    # map each feature column to the 'count' aggregation
    exp_count = {x: 'count' for x in features}
    df = df.groupBy(*grouping).agg(exp_count)
    # rename column names to exclude brackets and name of applied aggregation,
    # i.e. turn 'count(colname)' back into 'colname'
    for item in df.columns:
        df = df.withColumnRenamed(item, item[item.find('(')+1: None if item.find(')') == -1 else item.find(')')])
    return df


Answer

You can use a list comprehension to generate the list of aggregation expressions:

import pyspark.sql.functions as F

def count_non_zero(df, features, grouping):
    # F.when(..., 1) returns NULL when the value is 0 (or NULL), and F.count ignores NULLs,
    # so each expression counts only the non-zero values of its column
    return df.groupBy(*grouping).agg(
        *[F.count(F.when(F.col(c) != 0, 1)).alias(c) for c in features]
    )
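
A quick usage sketch; the sample DataFrame, the group column grp, and the feature columns f1/f2 are made up for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# hypothetical sample data: group column 'grp', feature columns 'f1' and 'f2'
df = spark.createDataFrame(
    [("a", 0, 3), ("a", 5, 4), ("a", 2, 0), ("b", 0, 0), ("b", 7, 1)],
    ["grp", "f1", "f2"],
)

count_non_zero(df, ["f1", "f2"], ["grp"]).show()
# expected counts: grp=a -> f1=2, f2=2; grp=b -> f1=1, f2=1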