Join two partitioned dataframes in PySpark

Question

I have two dataframes with partition level 2. The dataframes are small, probably around 100 rows each. df1: df2: My final df will be the join of df1 and df2 on columnindex. But when I join the two dataframes as shown below, it looks like it is shuffling and giving me incorrect results. Is there any way I can

Accepted Answer

This depends on what you mean by shuffling.

    join1 = spark.createDataFrame([(None, 1), (None, 2), (None, 3), (100, 5), (101, 6), (105, 10)], ['col1', 'columnindex'])
    join2 = spark.createDataFrame([(100, 1), (200, 2), (None, 3), (100, 5), (101, 6), (None, 10)], ['col2', 'columnindex'])
    joined = join1.join(join2, ['columnindex'], 'inner').select(['columnindex', 'col1', 'col2'])
    joined.show()

results in:

    +-----------+----+----+
    |columnindex|col1|col2|
    +-----------+----+----+
    |          2|null| 200|
    |          5| 100| 100|
    |          3|null|null|
    |          6| 101| 101|
    |          1|null| 100|
    |         10| 105|null|
    +-----------+----+----+

This is a correct result: each columnindex corresponds to the proper values from both dataframes, and if you do any further computations, this shouldn't be a problem. Only the order of the rows differs from the input.

However, if you want the rows to be ordered by columnindex, you can do it with orderBy:

    joined.orderBy('columnindex').show()

    +-----------+----+----+
    |columnindex|col1|col2|
    +-----------+----+----+
    |          1|null| 100|
    |          2|null| 200|
    |          3|null|null|
    |          5| 100| 100|
    |          6| 101| 101|
    |         10| 105|null|
    +-----------+----+----+

A quick note on join: if you use df1.columnindex == df2.columnindex as the join condition, the result will contain a duplicated columnindex column, which you will have to resolve before sorting with orderBy. That's why it's easier to pass the column name as a list argument to join, as above.
