I have a data frame with the following column:
```
raw_col
['a','b','c']
['b']
['a','b']
['c']
```
I want to return a column with a single value based on a conditional statement, so I wrote the following function:
```python
def filter_func(elements):
    if "a" in elements:
        return "a"
    else:
        return "Other"
```
When I run the function on the column:

```python
df.withColumn("col", filter_func("raw_col"))
```

I get the following error:

```
col should be Column
```
What’s wrong here? What should I do?
Answer
Your `filter_func` is a plain Python function, so `filter_func("raw_col")` is evaluated eagerly on the column name string and returns a plain Python string, while `withColumn` expects a `Column` expression. You can use the `array_contains` function instead:
```python
import pyspark.sql.functions as f

df = df.withColumn(
    "col",
    f.when(f.array_contains("raw_col", f.lit("a")), f.lit("a")).otherwise(f.lit("Other")),
)
```
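For reference, here is a minimal end-to-end sketch using the sample data from your question (the `SparkSession` setup and `createDataFrame` call are my assumptions, not part of the original code):

```python
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the raw_col column shown in the question
df = spark.createDataFrame(
    [(["a", "b", "c"],), (["b"],), (["a", "b"],), (["c"],)],
    ["raw_col"],
)

df = df.withColumn(
    "col",
    f.when(f.array_contains("raw_col", f.lit("a")), f.lit("a")).otherwise(f.lit("Other")),
)
df.show()
# Expected output:
# +---------+-----+
# |  raw_col|  col|
# +---------+-----+
# |[a, b, c]|    a|
# |      [b]|Other|
# |   [a, b]|    a|
# |      [c]|Other|
# +---------+-----+
```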
But if you have more complex logic and really need to use `filter_func`, you have to create a UDF:
```python
@f.udf()
def filter_func(elements):
    if "a" in elements:
        return "a"
    else:
        return "Other"

df = df.withColumn("col", filter_func("raw_col"))
```
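Note that `f.udf()` defaults to a `StringType` return type, which happens to fit here. If you need a different return type, you can pass it explicitly; a sketch of the equivalent non-decorator form (the name `filter_func_plain` is hypothetical, for illustration only):

```python
from pyspark.sql import functions as f
from pyspark.sql import types as T

# Same logic as filter_func, written as a plain function
def filter_func_plain(elements):
    return "a" if "a" in elements else "Other"

# udf() takes the function and an explicit return type
filter_udf = f.udf(filter_func_plain, T.StringType())
df = df.withColumn("col", filter_udf("raw_col"))
```

Keep in mind that Python UDFs are generally slower than built-in functions like `array_contains`, because each row has to be serialized and sent to a Python worker, so prefer the built-in version when the logic allows it.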