
PySpark – Combine a list of filtering conditions

For starters, let me define a sample DataFrame and import the SQL functions:

from pyspark.sql import SparkSession
import pyspark.sql.functions as func

# Create (or reuse) a SparkSession; in spark-shell and most notebooks `spark` already exists
spark = SparkSession.builder.getOrCreate()

row_data = [(1, 1, 1), (1, 1, 2), (1, 1, 3),
            (1, 2, 1), (1, 2, 2), (1, 2, 3),
            (2, 1, 1), (2, 1, 2), (2, 1, 3),
            (2, 2, 1), (2, 2, 2), (2, 2, 3),
            (2, 2, 4), (2, 2, 5), (2, 2, 6)]

test_df = spark.createDataFrame(row_data, ["A", "B", "C"])

test_df.show()

This returns the following dataframe:

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1|  1|
|  1|  1|  2|
|  1|  1|  3|
|  1|  2|  1|
|  1|  2|  2|
|  1|  2|  3|
|  2|  1|  1|
|  2|  1|  2|
|  2|  1|  3|
|  2|  2|  1|
|  2|  2|  2|
|  2|  2|  3|
|  2|  2|  4|
|  2|  2|  5|
|  2|  2|  6|
+---+---+---+

Now let's say I have a list of filtering conditions, for example one stating that columns A and B should both be equal to 1:

l = [func.col("A") == 1, func.col("B") == 1]

I can combine these two conditions and then filter the DataFrame as follows:

t = l[0] & l[1]
test_df.filter(t).show()

Result:

+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1|  1|
|  1|  1|  2|
|  1|  1|  3|
+---+---+---+

MY QUESTION

If l is a list of unknown length n (that is, a list of n filtering conditions) instead of just two, what is the most Pythonic (ideally one-liner) way to combine them logically with & or |?

all() and any() will not work, because they expect plain boolean values rather than Column expressions (see the sketch below).
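To see why, here is a minimal sketch (not part of the original question): Python's all() and any() call bool() on each element, and a PySpark Column cannot be coerced to a plain boolean, so the call raises a ValueError.

try:
    all(l)  # l is the list of Column conditions defined above
except ValueError as e:
    # PySpark raises something like:
    # "Cannot convert column into bool: please use '&' for 'and',
    #  '|' for 'or', '~' for 'not' when building DataFrame boolean expressions."
    print(e)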

As an example, let us say that l = [func.col("A") == 1, func.col("B") == 1, func.col("C") == 2].

Help would be much appreciated.


Answer

You could use reduce or a loop. The execution plan in Spark will be the same for both, so I believe it's just a matter of preference.

# Chain one where() call per condition; Spark folds them into a single AND filter
for c in l:
    test_df = test_df.where(c)

test_df.explain()

Produces

== Physical Plan ==
*(1) Filter ((isnotnull(A#11487L) AND isnotnull(B#11488L)) AND ((A#11487L = 1) AND (B#11488L = 1)))
+- *(1) Scan ExistingRDD[A#11487L,B#11488L,C#11489L]

and

from functools import reduce  # required in Python 3, where reduce is no longer a builtin

test_df = test_df.where(reduce(lambda x, y: x & y, l))
test_df.explain()

Produces

== Physical Plan ==
*(1) Filter ((isnotnull(A#11487L) AND isnotnull(B#11488L)) AND ((A#11487L = 1) AND (B#11488L = 1)))
+- *(1) Scan ExistingRDD[A#11487L,B#11488L,C#11489L]
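As a variation on the above (a sketch, not part of the original answer), reduce pairs nicely with operator.and_ and operator.or_ instead of a lambda, and the | case is where reduce is genuinely needed, since chained where() calls can only express AND:

from functools import reduce
import operator

# Assuming test_df is the original, unfiltered DataFrame and l is the list of conditions
and_filter = reduce(operator.and_, l)  # A == 1 AND B == 1 AND ...
or_filter = reduce(operator.or_, l)    # A == 1 OR B == 1 OR ...

test_df.filter(and_filter).show()
test_df.filter(or_filter).show()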