I have a PySpark dataframe (source_df) in which there is a column with comma-separated values. I am trying to replace those values with a lookup based on another dataframe (lookup_df).
source_df

A    B      T    (followed by N unrelated columns)
foo  a,b,c  sam
bar  k,a,c  bob
faz  b,a,f  sam
lookup_df

C  D
a  h1
b  h2
c  h3
output dataframe:

A    T    B      new_col      (followed by N unrelated columns)
foo  sam  a,b,c  h1,h2,h3
bar  bob  k,a,c  EMPTY,h1,h3
faz  sam  b,a,f  h2,h1,EMPTY
Column A is a primary key and is always unique. Column T is unique for a given value of A.
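The expected transformation above can be sketched row by row in plain Python (this is only an illustration of the desired semantics, not PySpark code; the `lookup` dict and `map_b` helper are hypothetical names, with the lookup contents taken from lookup_df):

```python
# lookup_df represented as a C -> D mapping
lookup = {"a": "h1", "b": "h2", "c": "h3"}

def map_b(b_value, mapping, missing="EMPTY"):
    """Translate a comma-separated string token by token via `mapping`.

    Tokens absent from the mapping become the literal `missing` value,
    matching the EMPTY placeholders in the expected output.
    """
    return ",".join(mapping.get(tok, missing) for tok in b_value.split(","))

print(map_b("a,b,c", lookup))  # h1,h2,h3
print(map_b("k,a,c", lookup))  # EMPTY,h1,h3
```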
Answer
You can split and explode column B, then do a left join against lookup_df. After the join, collect the D values per original row and concatenate them with commas. (Note that collect_list does not strictly guarantee element order after a shuffle, although it produces the expected order here.)
import pyspark.sql.functions as F

result = source_df.withColumn(
    'B_split',
    F.explode(F.split('B', ','))
).alias('s').join(
    lookup_df.alias('l'),
    F.expr('s.B_split = l.C'),
    'left'
).drop('C').na.fill(
    'EMPTY', ['D']
).groupBy(
    source_df.columns
).agg(
    F.concat_ws(',', F.collect_list('D')).alias('new_col')
)

result.show()
+---+-----+---+-----------+
|  A|    B|  T|    new_col|
+---+-----+---+-----------+
|foo|a,b,c|sam|   h1,h2,h3|
|faz|b,a,f|sam|h2,h1,EMPTY|
|bar|k,a,c|bob|EMPTY,h1,h3|
+---+-----+---+-----------+