Suppose we have two dataframes df1
and df2
where df1
has columns [a, b, c, p, q, r]
and df2
has columns [d, e, f, a, b, c]
. Suppose the common columns are stored in a list common_cols = ['a', 'b', 'c']
.
How do you join the two dataframes using the common_cols
list within a sql command? The code below attempts to do this.
JavaScript
x
6
1
common_cols = ['a', 'b', 'c']
2
filter_df = spark.sql("""
3
select * from df1 inner join df2
4
on df1.common_cols = df2.common_cols
5
""")
6
Advertisement
Answer
Demo setup
JavaScript
1
21
21
1
df1 = spark.createDataFrame([(1,2,3,4,5,6)],['a','b','c','p','q','r'])
2
df2 = spark.createDataFrame([(7,8,9,1,2,3)],['d','e','f','a','b','c'])
3
common_cols = ['a','b','c']
4
5
df1.show()
6
7
+---+---+---+---+---+---+
8
| a| b| c| p| q| r|
9
+---+---+---+---+---+---+
10
| 1| 2| 3| 4| 5| 6|
11
+---+---+---+---+---+---+
12
13
14
df2.show()
15
16
+---+---+---+---+---+---+
17
| d| e| f| a| b| c|
18
+---+---+---+---+---+---+
19
| 7| 8| 9| 1| 2| 3|
20
+---+---+---+---+---+---+
21
Solution, based on using (SQL syntax for join)
JavaScript
1
11
11
1
df1.createOrReplaceTempView('df1')
2
df2.createOrReplaceTempView('df2')
3
common_cols_csv = ','.join(common_cols)
4
5
query = f'''
6
select *
7
from df1 inner join df2 using ({common_cols_csv})
8
'''
9
10
11
JavaScript
1
5
1
print(query)
2
3
select *
4
from df1 inner join df2 using (a,b,c)
5
JavaScript
1
10
10
1
filter_df = spark.sql(query)
2
3
filter_df.show()
4
5
+---+---+---+---+---+---+---+---+---+
6
| a| b| c| p| q| r| d| e| f|
7
+---+---+---+---+---+---+---+---+---+
8
| 1| 2| 3| 4| 5| 6| 7| 8| 9|
9
+---+---+---+---+---+---+---+---+---+
10