I have a situation that can be reduced to an example with two files.
filters.py

from pyspark.sql import functions as F

condition = F.col('a') == 1
main.py

from filters import condition
from pyspark.sql import SparkSession


def main():
    spark = SparkSession.builder.getOrCreate()
    table = spark.table('foo').filter(condition)
It appears that an F.col object cannot be created without an active SparkSession/SparkContext, so the import fails. Is there any way to keep the filters separated from the other files, and how can I import them?

My situation is a little more complicated: these filters are used in many different functions across the project, so I can't import them inside every function. I need a way to import them safely into the global namespace.
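To make the failure mode concrete: the condition is evaluated at import time, before any SparkSession exists. A minimal sketch of the ordering problem (assuming the classic, non-Connect PySpark runtime, where building a Column goes through the JVM of an active SparkContext):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Fine: a session (and thus an active SparkContext) exists before F.col runs.
spark = SparkSession.builder.getOrCreate()
condition = F.col('a') == 1

# By contrast, filters.py runs the same expression at module import time,
# before getOrCreate() has ever been called, so the import itself raises.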
Answer
You could create conditions as strings:
filters.py

condition = "F.col('a') == 123"
And then use eval to run it as code:
main.py

from pyspark.sql import SparkSession
# F must be in scope here so that eval(condition) can resolve F.col
import pyspark.sql.functions as F
from filters import condition


if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"id": 1, "a": 123},
        {"id": 2, "a": 23},
    ]
    df = spark.createDataFrame(data=data)
    df = df.filter(eval(condition))
    df.show()
The result in this example is, as expected:
+---+---+
|  a| id|
+---+---+
|123|  1|
+---+---+
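One caveat worth noting: eval executes arbitrary Python, so this is only safe while you fully control what goes into filters.py. As an alternative sketch (not part of the original answer, and assuming your conditions can be written in Spark SQL syntax), DataFrame.filter also accepts a SQL expression string directly, which keeps the conditions as plain strings in a separate file without eval:

# filters.py - hypothetical variant: conditions as SQL expression strings
condition = "a == 123"

# main.py
from pyspark.sql import SparkSession
from filters import condition

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([{"id": 1, "a": 123}, {"id": 2, "a": 23}])
    # filter() accepts a SQL expression string as well as a Column
    df.filter(condition).show()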