I have a situation that can be reduced to an example with two files.
filters.py
from pyspark.sql import functions as F

condition = F.col('a') == 1
main.py
from filters import condition
from pyspark.sql import SparkSession

def main():
    spark = SparkSession.builder.getOrCreate()
    table = spark.table('foo').filter(condition)
It appears that an F.col object cannot be created without an active SparkSession/SparkContext, so the import fails.
Is there any way to keep the filters separated in their own file, and how can I import them?
My situation is a bit more complicated: these filters are used in many different functions across the project, so I can't import them inside every function. I need a way to import them safely into the global namespace.
Answer
You could define the conditions as strings:
filters.py
condition = "F.col('a') == 123"
And then use eval to run it as code:
main.py
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

from filters import condition

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()

    data = [
        {"id": 1, "a": 123},
        {"id": 2, "a": 23},
    ]
    df = spark.createDataFrame(data=data)
    df = df.filter(eval(condition))
The result in this example is, as expected:
+---+---+
|  a| id|
+---+---+
|123|  1|
+---+---+
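Note that eval executes arbitrary code, so only use it on condition strings you control. If you would rather avoid eval entirely, a related workaround (a minimal sketch, not part of the answer above) is to defer building the Column by wrapping each condition in a function, so that nothing touches Spark at import time:

filters.py

from pyspark.sql import functions as F

# Building the Column is deferred until the function is called,
# so importing this module does not require an active SparkSession.
def condition():
    return F.col('a') == 123

main.py

from pyspark.sql import SparkSession
from filters import condition

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([{"id": 1, "a": 123}, {"id": 2, "a": 23}])
    # condition() constructs the Column only now, inside an active session
    df = df.filter(condition())

This keeps the filters importable into any module's global namespace, since the module-level names are plain functions rather than Column objects.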