
Using pyspark.sql.functions without a SparkContext: import problem

I have a situation that can be reduced to an example with two files.

filters.py

from pyspark.sql import functions as F
condition = F.col('a') == 1

main.py

from filters import condition
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.getOrCreate()
    table = spark.table('foo').filter(condition)

It appears that an F.col object cannot be created without an active SparkSession/SparkContext, so the import fails.

Is there any way to keep the filters separated from the other files, and how can I import them?

My situation is a bit more complicated: these filters are used in many different functions across the project, so I can't import them inside every function. I need a way to import them safely into the global namespace.


Answer

You could create conditions as strings:

filters.py

condition = "F.col('a') == 123"

And then use eval to run it as code:

main.py

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from filters import condition


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"id": 1, "a": 123},
        {"id": 2, "a": 23},
    ]
    df = spark.createDataFrame(data=data)
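    # eval() turns the stored string back into a Column expression (F must be in scope here)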
    df = df.filter(eval(condition))
    df.show()

The result in this example is, as expected:

+---+---+
|  a| id|
+---+---+
|123|  1|
+---+---+
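
As a side note (not part of the original answer; the names below are only illustrative): DataFrame.filter also accepts a plain SQL expression string, so the condition can be kept as a string that needs neither a pyspark import nor eval. A minimal sketch:

filters.py

condition = "a = 123"

main.py

from pyspark.sql import SparkSession
from filters import condition

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([{"id": 1, "a": 123}, {"id": 2, "a": 23}])
    # filter() interprets the string as a SQL expression
    df.filter(condition).show()

Whether the string holds Python code for eval or a SQL expression for filter, the trade-off is the same: the condition is only checked when it is evaluated against a DataFrame at runtime.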