PySpark udf returns null when function works in Pandas dataframe

I’m trying to create a user-defined function that takes the cumulative sum of an array and compares the values to another column. Here is a reproducible example:

import numpy as np

from pyspark.sql.functions import collect_list, desc, udf
from pyspark.sql.session import SparkSession
from pyspark.sql.types import ArrayType, LongType
from pyspark.sql.window import Window

# instantiate Spark
spark = SparkSession.builder.getOrCreate()

# make some test data
columns = ['loc', 'id', 'date', 'x', 'y']
vals = [
    ('a', 'b', '2016-07-01', 1, 5),
    ('a', 'b', '2016-07-02', 0, 5),
    ('a', 'b', '2016-07-03', 5, 15),
    ('a', 'b', '2016-07-04', 7, 5),
    ('a', 'b', '2016-07-05', 8, 20),
    ('a', 'b', '2016-07-06', 1, 5)
]

# create DataFrame
temp_sdf = (spark
      .createDataFrame(vals, columns)
      .withColumn('x_ary', collect_list('x').over(Window.partitionBy(['loc','id']).orderBy(desc('date')))))

temp_df = temp_sdf.toPandas()

def test_function(x_ary, y):
  cumsum_array = np.cumsum(x_ary) 
  result = len([x for x in cumsum_array if x <= y])
  return result

test_function_udf = udf(test_function, ArrayType(LongType()))

temp_df['len'] = temp_df.apply(lambda x: test_function(x['x_ary'], x['y']), axis = 1)
display(temp_df)

In Pandas, this is the output:

loc id  date        x   y   x_ary           len
a   b   2016-07-06  1   5   [1]             1
a   b   2016-07-05  8   20  [1,8]           2
a   b   2016-07-04  7   5   [1,8,7]         1
a   b   2016-07-03  5   15  [1,8,7,5]       2
a   b   2016-07-02  0   5   [1,8,7,5,0]     1
a   b   2016-07-01  1   5   [1,8,7,5,0,1]   1

In Spark, using temp_sdf.withColumn('len', test_function_udf('x_ary', 'y')), every value in len ends up null.

Would anyone know why this is the case?

Also, replacing the cumsum line with cumsum_array = np.cumsum(np.flip(x_ary)) fails in PySpark with AttributeError: module 'numpy' has no attribute 'flip', but I know the function exists because it runs fine against the Pandas dataframe.
Can this issue be resolved, or is there a better way to flip arrays in PySpark?

Thanks in advance for your help.


Answer

test_function returns an integer, not a list/array, so declaring the UDF with ArrayType(LongType()) gives you null values because the return type is wrong. Either remove the ArrayType from the udf call or change the return type to LongType(), and it will work as shown below:

Note: setting the return type of a UDF is optional; the default return type is StringType.

Option 1:

test_function_udf = udf(test_function)  # returns StringType

Option 2:

test_function_udf = udf(test_function, LongType())  # returns long/integer type

temp_sdf = temp_sdf.withColumn('len', 
           test_function_udf('x_ary', 'y'))
temp_sdf.show()
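
With the LongType return type, the len column should come back with the same values as in the Pandas output above.

As for the np.flip part of the question: np.flip was only added in NumPy 1.12, so the error suggests the NumPy in your Spark environment is older than the one your Pandas code uses. A minimal sketch, assuming that is the case, is to reverse the array with a plain [::-1] slice instead (the test_function_flipped name and len_flipped column below are just for illustration):

import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

def test_function_flipped(x_ary, y):
    # reverse the array with a slice (works on any NumPy version),
    # then take the cumulative sum as before
    cumsum_array = np.cumsum(np.asarray(x_ary)[::-1])
    return len([x for x in cumsum_array if x <= y])

test_function_flipped_udf = udf(test_function_flipped, LongType())

temp_sdf = temp_sdf.withColumn('len_flipped',
                               test_function_flipped_udf('x_ary', 'y'))
temp_sdf.show()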