I have a PySpark dataframe (df) built from the below sample table (table1):

id, col1, col2, col3
1, abc, null, def
2, null, def, abc
3, def, abc, null
I am trying to build a new column (final) by appending all the columns while ignoring null values. I tried f.array(col1, col2, col3) in PySpark; the values get appended, but the nulls are not ignored. I also tried a UDF that appends only the non-null columns, but it is not working.
import pyspark.sql.functions as f

df = spark.table('table1')
df = df.withColumn('final', f.array('col1', 'col2', 'col3'))
Actual result:

id, col1, col2, col3, final
1, abc, null, def, [abc,, def]
2, null, def, abc, [, def, abc]
3, def, abc, null, [def, abc,]

Expected result:

id, col1, col2, col3, final
1, abc, null, def, [abc, def]
2, null, def, abc, [def, abc]
3, def, abc, null, [def, abc]
The schema of col1, col2, and col3 is as below, where col1 is named applications:

applications: struct (nullable = false)
    applicationid: string (nullable = true)
    createdat: string (nullable = true)
    updatedat: string (nullable = true)
    source_name: string (nullable = true)
    status: string (nullable = true)
    creditbureautypeid: string (nullable = true)
    score: integer (nullable = true)
    applicationcreditreportid: string (nullable = true)
    firstname: string (nullable = false)
    lastname: string (nullable = false)
    dateofbirth: string (nullable = false)
    accounts: array (nullable = true)
        element: struct (containsNull = true)
            applicationcreditreportaccountid: string (nullable = true)
            currentbalance: integer (nullable = true)
            institutionid: string (nullable = true)
            accounttypeid: string (nullable = true)
            dayspastdue: integer (nullable = true)
            institution_name: string (nullable = true)
            account_type_name: string (nullable = true)
Please let me know if the question is not clear or if more info is required. Any help would be appreciated. :)
Answer
Since Spark 2.4 you can use higher-order functions for this (no UDF needed). In PySpark the query can look like this:
result = (
    df
    .withColumn("temp", f.array("col1", "col2", "col3"))
    .withColumn("final", f.expr("FILTER(temp, x -> x is not null)"))
    .drop("temp")
)
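
For context, here is a minimal, self-contained sketch of this approach run against the sample data from the question. The columns are plain strings here for brevity (the real schema uses structs, but the "x is not null" test works the same way on struct elements), and the commented f.filter variant assumes PySpark 3.1 or later, where the Python higher-order-function API was added.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching table1 from the question
df = spark.createDataFrame(
    [(1, "abc", None, "def"),
     (2, None, "def", "abc"),
     (3, "def", "abc", None)],
    ["id", "col1", "col2", "col3"],
)

# Build the array, drop nulls with the FILTER higher-order function,
# then remove the temporary column
result = (
    df
    .withColumn("temp", f.array("col1", "col2", "col3"))
    .withColumn("final", f.expr("FILTER(temp, x -> x is not null)"))
    .drop("temp")
)

# On PySpark 3.1+ the same thing can be written without a SQL expression:
# result = df.withColumn(
#     "final",
#     f.filter(f.array("col1", "col2", "col3"), lambda x: x.isNotNull()),
# )

result.show()
# +---+----+----+----+----------+
# | id|col1|col2|col3|     final|
# +---+----+----+----+----------+
# |  1| abc|null| def|[abc, def]|
# |  2|null| def| abc|[def, abc]|
# |  3| def| abc|null|[def, abc]|
# +---+----+----+----+----------+

FILTER keeps the surviving elements in their original left-to-right order, which matches the expected result in the question.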