
PySpark: find existing rows in a dataframe and replace them with values from another dataframe

I have a PySpark dataframe, dataframe_old (dfo), as below:

Id   neighbor_sid  neighbor      division
a1   1100          Naalehu       Hawaii
a2   1101          key-west-fl   Miami
a3   1102          lubbock       Texas
a10  1202          bay-terraces  California

I have another PySpark dataframe, dataframe_new (dfn), as below:

Id   neighbor_sid  neighbor         division
a1   1100          Naalehu          Hawaii
a2   1111          key-largo-fl     Miami
a3   1103          grapevine        Texas
a4   1115          meriden-ct       Connecticut
a12  2002          east-louisville  Kentucky

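My objective is to find the Ids from dataframe_new in dataframe_old and replace those rows with the values from dataframe_new. Ids that exist only in dataframe_new should be added, and Ids that exist only in dataframe_old should be kept.

Expected result, the updated dataframe_old: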

Id   neighbor_sid  neighbor         division
a1   1100          Naalehu          Hawaii
a2   1111          key-largo-fl     Miami
a3   1103          grapevine        Texas
a4   1115          meriden-ct       Connecticut
a10  1202          bay-terraces     California
a12  2002          east-louisville  Kentucky

My attempt at solving it was wrong because it compares the dataframes column-wise instead of row-wise.


Please help – would really appreciate any guidance!


Answer

We can do an outer join on the id fields and then use coalesce() to prioritize the fields from dfn.

User contributions licensed under: CC BY-SA