PySpark 2.x: Programmatically adding Maven JAR Coordinates to Spark

Question

The following is my PySpark startup snippet, which has been reliable (I've used it for a long time). Today I added the two Maven coordinates shown in the spark.jars.packages option (effectively "plugging in" Kafka support). That normally triggers dependency downloads, which Spark performs automatically. However, the plugins aren't downloading and/or loading when I run the snippet (e.g. ./python -i

Accepted Answer

This is the kind of post where the QUESTION will be worth more than the ANSWER, because the code above works but isn't anywhere to be found in the Spark 2.x documentation or examples.

The above is how I've programmatically added functionality to Spark 2.x by way of Maven coordinates. I had this working, but then it stopped working. Why?

When I ran the above code in a Jupyter notebook, the notebook had already run, behind the scenes, essentially the same code snippet by way of my PYTHONSTARTUP script. That PYTHONSTARTUP script has the same code as the above but intentionally omits the Maven coordinates.

Here, then, is how this subtle problem emerges:

    spark_sesn = SparkSession.builder.config(conf = spark_conf).getOrCreate()

Because a Spark session already existed, the above statement simply reused that existing session (.getOrCreate()), which did not have the jars/libraries loaded (again, because my PYTHONSTARTUP script intentionally omits them). This is why it is a good idea to put print statements in PYTHONSTARTUP scripts (which are otherwise silent).

In the end, I simply forgot to do this before starting the JupyterLab / Notebook daemon:

    $ unset PYTHONSTARTUP

I hope the Question helps others, because that's how to programmatically add functionality to Spark 2.x (in this case Kafka). Note that you'll need an internet connection for the one-time download of the specified jars and their recursive dependencies from Maven Central.
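For anyone who wants a guard against the same mistake, here is a minimal sketch that refuses to build the session while PYTHONSTARTUP is set. The function names `startup_script_in_effect` and `build_session` are hypothetical, and the Maven coordinates are passed in by the caller rather than hard-coded. It only checks the environment variable, on the assumption that the notebook kernel inherits it, so it will not detect a session created some other way.

```python
import os


def startup_script_in_effect(environ=None):
    """Return the PYTHONSTARTUP path if one is set and non-empty, else None."""
    environ = os.environ if environ is None else environ
    return environ.get("PYTHONSTARTUP") or None


def build_session(packages, environ=None):
    """Create a session with the given Maven coordinates.

    Raises RuntimeError when a startup script is configured, because that
    script may already have created a session that getOrCreate() would reuse.
    """
    startup = startup_script_in_effect(environ)
    if startup:
        raise RuntimeError(
            "PYTHONSTARTUP=%s may already have created a SparkSession; "
            "getOrCreate() would reuse it without loading the requested "
            "packages" % startup
        )

    # Imported lazily so the guard above can run even where PySpark is absent.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # spark.jars.packages takes a comma-separated list of Maven coordinates.
    spark_conf = SparkConf().set("spark.jars.packages", ",".join(packages))
    return SparkSession.builder.config(conf=spark_conf).getOrCreate()
```

Calling `build_session([...])` with your own coordinates either returns a session or fails loudly, which is easier to notice than a silently reused session.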

Answer