How to join two dataframes on a list of common columns in PySpark

Suppose we have two dataframes, df1 and df2, where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c], and the common columns are stored in a list, common_cols = ['a', 'b', 'c'].

How do you join the two dataframes on the common_cols list from within a SQL command? The code below attempts this, but it cannot work: common_cols is a Python list, so Spark SQL parses df1.common_cols as a reference to a (nonexistent) column literally named common_cols rather than expanding the list.

common_cols = ['a', 'b', 'c']
filter_df = spark.sql("""
    select * from df1 inner join df2
    on df1.common_cols = df2.common_cols
""")

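As an aside (my addition, not from the original question or answer): outside of SQL, the DataFrame API accepts the list directly, so a minimal sketch of the same join, assuming df1, df2 and common_cols as above, is:

# DataFrame-API equivalent: joining on a list of column names keeps
# a single copy of each join column, like SQL's USING clause.
filter_df = df1.join(df2, on=common_cols, how='inner')

The answer below shows how to do it within a SQL command, as asked.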
Answer

Demo setup

df1 = spark.createDataFrame([(1,2,3,4,5,6)],['a','b','c','p','q','r'])
df2 = spark.createDataFrame([(7,8,9,1,2,3)],['d','e','f','a','b','c'])
common_cols = ['a','b','c']

df1.show()

+---+---+---+---+---+---+
|  a|  b|  c|  p|  q|  r|
+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|
+---+---+---+---+---+---+


df2.show()

+---+---+---+---+---+---+
|  d|  e|  f|  a|  b|  c|
+---+---+---+---+---+---+
|  7|  8|  9|  1|  2|  3|
+---+---+---+---+---+---+

Solution, based on the SQL USING clause for joins. USING (a, b, c) joins on equality of each listed column and, unlike an explicit ON clause, keeps a single copy of each join column in the result.

# Register the dataframes as temp views so spark.sql can refer to them by name
df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

# Expand the Python list into a comma-separated string for the USING (...) clause
common_cols_csv = ','.join(common_cols)

query = f'''
select  * 
from    df1 inner join df2 using ({common_cols_csv})
'''
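A small caveat (my note, not part of the original answer): the f-string assumes every column name is a plain identifier. If a name contains spaces or collides with a reserved word, backtick-quote each name when building the list:

# Hypothetical variant: backtick-quote the names so unusual column
# names still parse inside USING (...).
common_cols_csv = ','.join(f'`{c}`' for c in common_cols)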

print(query)

select  * 
from    df1 inner join df2 using (a,b,c)

filter_df = spark.sql(query)

filter_df.show()

+---+---+---+---+---+---+---+---+---+
|  a|  b|  c|  p|  q|  r|  d|  e|  f|
+---+---+---+---+---+---+---+---+---+
|  1|  2|  3|  4|  5|  6|  7|  8|  9|
+---+---+---+---+---+---+---+---+---+
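For completeness, a sketch (not from the original answer) of the same join with an explicit ON clause built from the list. Unlike USING, this keeps both copies of each join column, so a plain select * returns df1's and df2's a, b, c side by side:

# Build an explicit ON clause from the list of common columns.
on_clause = ' and '.join(f'df1.{c} = df2.{c}' for c in common_cols)
query = f'select * from df1 inner join df2 on {on_clause}'
spark.sql(query).show()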