
Join two partitioned dataframes in PySpark

I have two dataframes, each split into 2 partitions. The dataframes are small, probably around 100 rows each.

df1 :

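For illustration, df1 might be built like this – a columnindex key plus one value column, repartitioned into 2 partitions (the actual column names and values are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical df1: a columnindex key plus one value column,
# repartitioned into 2 partitions as described above.
df1 = spark.createDataFrame(
    [(1, "a"), (2, "b"), (3, "c"), (4, "d")],
    ["columnindex", "col1"],
).repartition(2)
df1.show()
```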

df2:

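And a hypothetical df2 with the same columnindex keys but a different value column:

```python
# Hypothetical df2: the same columnindex keys with a different value column,
# also in 2 partitions.
df2 = spark.createDataFrame(
    [(1, "x"), (2, "y"), (3, "z"), (4, "w")],
    ["columnindex", "col2"],
).repartition(2)
df2.show()
```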

My final df will be the join of df1 and df2 based on columnindex.

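With the hypothetical df1 and df2 above, the desired joined result would be:

```
+-----------+----+----+
|columnindex|col1|col2|
+-----------+----+----+
|          1|   a|   x|
|          2|   b|   y|
|          3|   c|   z|
|          4|   d|   w|
+-----------+----+----+
```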

But when I join the two dataframes as shown below, it looks like the data is being shuffled and I get incorrect results. Is there any way to avoid the shuffling?

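One plausible form of the join (assumed; written as an equality expression on columnindex):

```python
# An assumed reconstruction of the join in the question,
# written as an equality expression on columnindex
final_df = df1.join(df2, df1.columnindex == df2.columnindex)
final_df.show()
```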


Answer

This depends on what you mean by shuffling.

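A sketch of the join being discussed, passing the column name as a list (using the hypothetical dataframes from above):

```python
df1.join(df2, ["columnindex"]).show()
```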

results in:

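With the hypothetical data above, the rows can come back in an arbitrary order, for example:

```
+-----------+----+----+
|columnindex|col1|col2|
+-----------+----+----+
|          3|   c|   z|
|          1|   a|   x|
|          4|   d|   w|
|          2|   b|   y|
+-----------+----+----+
```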

This is a correct result – each columnindex corresponds to the proper values from both dataframes, and if you do any further computations, this shouldn’t be a problem. However, if you want the rows ordered by columnindex, you can do that with orderBy:

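For example, sorting the joined result explicitly (again using the hypothetical dataframes from above):

```python
df1.join(df2, ["columnindex"]).orderBy("columnindex").show()
```

which gives:

```
+-----------+----+----+
|columnindex|col1|col2|
+-----------+----+----+
|          1|   a|   x|
|          2|   b|   y|
|          3|   c|   z|
|          4|   d|   w|
+-----------+----+----+
```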

A quick note on join – if you use df1.columnindex == df2.columnindex, it will result in a duplicated columnindex column, which you will have to resolve before sorting with orderBy; that’s why it’s easier to pass the column name as a list argument to join, as above.
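For illustration, here is how the duplicated column shows up with the expression form, and one way to resolve it (a sketch using the hypothetical dataframes above):

```python
# Joining on an expression keeps columnindex from both sides
dup = df1.join(df2, df1.columnindex == df2.columnindex)
dup.printSchema()  # columnindex appears twice in the schema

# Drop one of the duplicates, otherwise "columnindex" is ambiguous in orderBy
dup.drop(df2.columnindex).orderBy("columnindex").show()
```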

User contributions licensed under: CC BY-SA