I have a dataframe:
from pyspark.sql.functions import col, count, rand

df = (spark
      .range(0, 10 * 1000 * 1000)
      .withColumn('id', (col('id') / 1000).cast('integer'))
      .withColumn('v', rand()))
Output:
+---+-------------------+
| id|                  v|
+---+-------------------+
|  0|0.05011803459635367|
|  0| 0.6749337782428327|
|  0| 0.9449105904567048|
|  0| 0.9183605955607251|
|  0|  0.648596393346793|
+---+-------------------+
Now, a simple operation, adding 1 to 'v', can be done via SQL functions or via a UDF.
If we set aside the SQL expression (the best performer), we can create a UDF as:
@udf("double") def plus_one(v): return v + 1
and call it:
df.withColumn('v', plus_one(df.v)).agg(count(col('v'))).show()
Time: 16.5 sec
But here is my question:
if I do NOT use a UDF and instead write directly:
def plus_one(v):
    return v + 1

df.withColumn('v', plus_one(df.v)).agg(count(col('v'))).show()
Time: 352 ms
In a nutshell, the UDF query took ~16 sec whereas the plain Python function took ~350 ms.
For comparison, the built-in SQL expression:
df.selectExpr("id", "v+1 as v").agg(count(col('v'))).show()
Time: 347 ms
Here is my dilemma:
If I can handle the same scenario with a normal Python function that performs comparably to the built-in functions…
Q. Why don't we just use a Python function directly?
Q. Does registering a UDF only matter if we plan to use it inside a SQL command?
There must be some optimization reason why we don't do this… or maybe it's something related to how the Spark cluster works?
[ There are two similar questions already answered, but both of those end with "SQL built-in functions are preferred…". Here I'm comparing a plain Python function with a UDF and asking about its feasibility in a PySpark application. ]
Edit: I have done this with pandas_udf too:
from pyspark.sql.functions import pandas_udf

@pandas_udf('double')
def vectorized_plus_one(v):
    return v + 1

df.withColumn('v', vectorized_plus_one(df.v)).agg(count(col('v'))).show()
Time: 5.26 sec
I've attached a screenshot:
[Screenshot: the output for adding 1 to a value: Python function (standalone), UDF, SQL]
Answer
Your scenario works because you don't actually add 1 in Python; the addition happens in the JVM, in a way very similar to what happens when you do it with SQL.
Let's take the case apart:

- You call plus_one(df.v), which is equivalent to just passing df.v + 1.
- Type df.v + 1 in your favorite REPL and you'll see that it returns an object of type Column (see the quick check after this list).
- How can that be? The Column class overrides the __add__ magic method (along with __radd__ and some others), so v + 1 returns a new Column instance carrying the instruction to add 1 to the specified column.
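You can verify this quickly in a REPL (using the df from the question):

expr = df.v + 1      # no data is touched; this only builds an expression
print(type(expr))    # <class 'pyspark.sql.column.Column'>
print(expr)          # prints something like Column<'(v + 1)'>; exact repr varies by Spark version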
In summary: withColumn always expects an object of type Column as its second argument, and the trick of adding 1 to your column is plain Python operator-overloading magic.
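To make that concrete, here is a minimal illustration (using the question's df and the plain, non-decorated plus_one): all three calls below hand withColumn an equivalent Column expression, and none of them runs any per-row Python code:

from pyspark.sql.functions import col

# Equivalent ways to pass a Column expression to withColumn; the
# arithmetic is executed in the JVM, not in Python.
df.withColumn('v', df.v + 1)
df.withColumn('v', col('v') + 1)
df.withColumn('v', plus_one(df.v))   # plus_one as a plain function, not the @udf version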
That's why it works faster than the udf and the vectorized udf: both need to launch a Python worker process and serialize/deserialize the data (vectorized UDFs use Arrow to make that serialization cheaper), and the computation happens in the slower Python process.
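If it helps, here is a toy sketch (emphatically not Spark's real implementation) of how a Column-like class can use magic methods to record an operation instead of performing it:

# Toy sketch, NOT Spark's actual code: a Column-like class that records
# the requested operation as an expression string instead of computing it.
class FakeColumn:
    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        return FakeColumn(f"({self.expr} + {other})")

    __radd__ = __add__  # addition commutes, so 1 + c builds the same expression

def plus_one(v):
    return v + 1  # with a FakeColumn argument this calls __add__, building an expression

print(plus_one(FakeColumn("v")).expr)  # prints: (v + 1)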