I have a PySpark dataframe (source_df) in which there is a column with comma-separated values.
I am trying to replace those values with a lookup based on another dataframe (lookup_df).
source_df

A    B      T    ... followed by N unrelated columns ...
foo  a,b,c  sam
bar  k,a,c  bob
faz  b,a,f  sam
lookup_df

C  D
a  h1
b  h2
c  h3
output dataframe:

A    T    B      new_col      ... followed by N unrelated columns ...
foo  sam  a,b,c  h1,h2,h3
bar  bob  k,a,c  EMPTY,h1,h3
faz  sam  b,a,f  h2,h1,EMPTY
Column A is a primary key and is always unique. Column T is unique for a given value of A.
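For reference, a minimal way to build these sample dataframes (column names and values taken from the tables above; `spark` is assumed to be an existing SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

source_df = spark.createDataFrame(
    [('foo', 'a,b,c', 'sam'), ('bar', 'k,a,c', 'bob'), ('faz', 'b,a,f', 'sam')],
    ['A', 'B', 'T']
)
lookup_df = spark.createDataFrame(
    [('a', 'h1'), ('b', 'h2'), ('c', 'h3')],
    ['C', 'D']
)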
Answer
You can split and explode column B and do a left join against lookup_df. Then group back on the original columns, collect the matched D values and concatenate them with commas.
import pyspark.sql.functions as F

result = source_df.withColumn(
    'B_split',
    F.explode(F.split('B', ','))   # one row per comma-separated value in B
).alias('s').join(
    lookup_df.alias('l'),
    F.expr('s.B_split = l.C'),     # match each split value against the lookup key
    'left'
).drop('C').na.fill(
    'EMPTY', ['D']                 # unmatched values get the placeholder EMPTY
).groupBy(
    source_df.columns              # collapse back to one row per original row
).agg(
    F.concat_ws(',', F.collect_list('D')).alias('new_col')
)
result.show()
+---+-----+---+-----------+
|  A|    B|  T|    new_col|
+---+-----+---+-----------+
|foo|a,b,c|sam|   h1,h2,h3|
|faz|b,a,f|sam|h2,h1,EMPTY|
|bar|k,a,c|bob|EMPTY,h1,h3|
+---+-----+---+-----------+
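Note that collect_list does not guarantee element order after a shuffle, so the output above preserving the order of B is not strictly guaranteed on larger data. If that order matters, one possible variant (a sketch, assuming Spark 2.4+ for array_sort and transform) keeps the position of each split value and sorts by it before concatenating:

import pyspark.sql.functions as F

result_ordered = source_df.select(
    '*',
    # posexplode keeps the position of each value so the original order can be restored
    F.posexplode(F.split('B', ',')).alias('pos', 'B_split')
).alias('s').join(
    lookup_df.alias('l'),
    F.expr('s.B_split = l.C'),
    'left'
).drop('C').na.fill(
    'EMPTY', ['D']
).groupBy(
    source_df.columns
).agg(
    # sort the collected (pos, D) structs by position, then keep only D
    F.expr(
        "concat_ws(',', transform(array_sort(collect_list(struct(pos, D))), x -> x.D))"
    ).alias('new_col')
)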
