
PySpark: Subquery in a CASE statement

I am trying to run a subquery inside a CASE statement in PySpark, and it throws an exception. The goal is to create a new flag indicating whether an id in one table is present in a different table.

Is this even possible in PySpark?

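The original snippet was not preserved; here is a minimal sketch of the kind of query described, assuming hypothetical tables table1 and table2 that each expose an id column:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction: flag each row of table1 depending on
# whether its id appears in table2, via an IN subquery inside CASE.
flagged = spark.sql("""
    SELECT id,
           CASE WHEN id IN (SELECT id FROM table2) THEN 1 ELSE 0 END AS flag
    FROM table1
""")
```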

The query fails with an AnalysisException. The exact message was lost, but in Spark 2.2 an IN subquery outside a WHERE clause is rejected with an error along the lines of "IN/EXISTS predicate sub-queries can only be used in a Filter".

I am using Spark 2.2.1.


Answer

This appears to be the latest detailed documentation regarding subqueries – it relates to Spark 2.0, but I haven’t seen a major update in this area since then.

The notebook linked in that reference makes it clear that predicate subqueries are currently supported only within WHERE clauses. That is, the following would run (though, of course, it would not yield the desired result):

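A sketch of that shape, under the same table1/table2 assumption; the IN subquery is accepted here because it sits in a WHERE clause:

```python
# Runs without error, but only filters rows; it produces no flag column.
spark.sql("""
    SELECT id
    FROM table1
    WHERE id IN (SELECT id FROM table2)
""")
```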

You can get the same result by using a LEFT JOIN, which is what IN subqueries are generally translated into anyway (for more details, refer to the aforementioned notebook).

For example:

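A sketch of the LEFT JOIN rewrite, keeping the assumed table names; the flag is derived from whether the join found a match:

```python
# DISTINCT on the right side prevents row duplication when table2
# contains repeated ids.
result = spark.sql("""
    SELECT t1.id,
           CASE WHEN t2.id IS NOT NULL THEN 1 ELSE 0 END AS flag
    FROM table1 t1
    LEFT JOIN (SELECT DISTINCT id FROM table2) t2
      ON t1.id = t2.id
""")
```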

Or, using PySpark SQL functions rather than SQL syntax:

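The same join expressed with the DataFrame API (again, a sketch over the assumed tables):

```python
from pyspark.sql import functions as F

df1 = spark.table("table1")
# Alias to id2 to avoid an ambiguous 'id' column after the join.
df2 = spark.table("table2").select(F.col("id").alias("id2")).distinct()

result = (
    df1.join(df2, df1["id"] == df2["id2"], "left")
       .withColumn("flag", F.when(F.col("id2").isNotNull(), 1).otherwise(0))
       .drop("id2")
)
```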