I have a data frame with the following column:
    raw_col
    ['a','b','c']
    ['b']
    ['a','b']
    ['c']
I want to return a column with a single value based on a conditional statement. I wrote the following function:
    def filter_func(elements):
        if "a" in elements:
            return "a"
        else:
            return "Other"
When I apply the function to the column with df.withColumn("col", filter_func("raw_col")), I get the error "col should be Column".
What's wrong here? What should I do?
Answer
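The error most likely comes from the fact that filter_func in the question is an ordinary Python function. The call filter_func("raw_col") therefore runs immediately on the driver against the literal string "raw_col", not against each row's array, and withColumn receives a plain str where it expects a Column. A minimal pure-Python sketch of what that call evaluates to (no Spark involved; the variable name value is only for illustration):

```python
def filter_func(elements):
    if "a" in elements:
        return "a"
    else:
        return "Other"

# The argument is the literal string "raw_col", not the column's contents,
# so the membership test is a substring check on that text.
value = filter_func("raw_col")

print(type(value).__name__)  # str
print(value)                 # a  (the text "raw_col" contains the letter "a")
```

Since the result is a str and not a Column expression, it has to be replaced by either built-in column functions or a UDF.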
You can use the array_contains function:
    import pyspark.sql.functions as f

    df = df.withColumn(
        "col",
        f.when(f.array_contains("raw_col", f.lit("a")), f.lit("a")).otherwise(f.lit("Other")),
    )
If your logic is more complex and you really need to use filter_func, you have to turn it into a UDF:
    @f.udf()
    def filter_func(elements):
        if "a" in elements:
            return "a"
        else:
            return "Other"

    df = df.withColumn("col", filter_func("raw_col"))