PySpark: find an existing set of rows in a DataFrame and replace them with values from another DataFrame

Question

I have a PySpark dataframe_Old (dfo) as below:

Id	neighbor_sid	neighbor	division
a1	1100	Naalehu	Hawaii
a2	1101	key-west-fl	Miami
a3	1102	lubbock	Texas
a10	1202	bay-terraces	California

I have a PySpark dataframe_new (dfn) as below:

Id	neighbor_sid	neighbor	division
a1	1100	Naalehu	Hawaii
a2	1111	key-largo-fl	Miami
a3	1103	grapevine	Texas
a4	1115	meriden-ct	Connecticut
a12	2002	east-louisville	Kentucky

Accepted Answer

We can do an outer join on the id field and then use coalesce() to prioritize the fields from dfn, falling back to dfo wherever dfn has no value for that id.

    from pyspark.sql import functions as func

    columns = ['id', 'neighbor_sid', 'neighbor', 'division']

    # The outer join keeps ids from both DataFrames;
    # coalesce() returns dfn's value unless it is null, in which case it uses dfo's.
    (dfo
        .join(dfn, 'id', 'outer')
        .select(*['id'] + [func.coalesce(dfn[k], dfo[k]).alias(k) for k in columns if k != 'id'])
        .orderBy('id')
        .show())
    # +---+------------+------------+-----------+
    # | id|neighbor_sid|    neighbor|   division|
    # +---+------------+------------+-----------+
    # | a1|        1100|     Naalehu|     Hawaii|
    # |a10|        1202|bay-terraces| California|
    # | a2|        1111|key-largo-fl|      Miami|
    # | a3|        1103|   grapevine|      Texas|
    # | a4|        1115|  meriden-ct|Connecticut|
    # +---+------------+------------+-----------+
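To make the join-then-coalesce rule easier to follow, here is a minimal plain-Python sketch of the same logic applied to the sample rows from the question. It does not use Spark. The names `dfo_rows`, `dfn_rows`, `coalesce` and `outer_join_coalesce` are hypothetical helpers written only for this illustration and are not part of the PySpark API. A missing row on either side is treated like a row of nulls, which is how an outer join behaves.

```python
# Plain-Python illustration only (no Spark): each "DataFrame" is a dict keyed by id.
COLUMNS = ["neighbor_sid", "neighbor", "division"]

dfo_rows = {
    "a1": {"neighbor_sid": 1100, "neighbor": "Naalehu", "division": "Hawaii"},
    "a2": {"neighbor_sid": 1101, "neighbor": "key-west-fl", "division": "Miami"},
    "a3": {"neighbor_sid": 1102, "neighbor": "lubbock", "division": "Texas"},
    "a10": {"neighbor_sid": 1202, "neighbor": "bay-terraces", "division": "California"},
}

dfn_rows = {
    "a1": {"neighbor_sid": 1100, "neighbor": "Naalehu", "division": "Hawaii"},
    "a2": {"neighbor_sid": 1111, "neighbor": "key-largo-fl", "division": "Miami"},
    "a3": {"neighbor_sid": 1103, "neighbor": "grapevine", "division": "Texas"},
    "a4": {"neighbor_sid": 1115, "neighbor": "meriden-ct", "division": "Connecticut"},
    "a12": {"neighbor_sid": 2002, "neighbor": "east-louisville", "division": "Kentucky"},
}


def coalesce(*values):
    """Return the first argument that is not None (None if all are), like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def outer_join_coalesce(old, new, columns):
    """Emulate a full outer join on id, preferring values from `new` column by column."""
    merged = {}
    # The union of keys plays the role of the outer join: every id from either side.
    for row_id in sorted(old.keys() | new.keys()):
        old_row = old.get(row_id, {})  # absent row behaves like all-null columns
        new_row = new.get(row_id, {})
        merged[row_id] = {
            col: coalesce(new_row.get(col), old_row.get(col)) for col in columns
        }
    return merged


merged = outer_join_coalesce(dfo_rows, dfn_rows, COLUMNS)
for row_id, row in merged.items():
    print(row_id, row)
```

Ids that exist in both inputs (a2, a3) take the dfn values, while ids that exist in only one input are carried through unchanged. Because the fallback is applied per column, a null in a single dfn column would be filled from dfo rather than overwriting the old value.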


