I came across the line of PySpark lambda code below while browsing a long Python Jupyter notebook, and I am trying to understand it. Can you explain what it does in the best possible way?
parse = udf (lambda x: (datetime.datetime.utcnow() - timedelta(hours= x)).isoformat()[:-3] + 'Z', StringType())
Answer
udf( lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z', StringType() )
udf in PySpark wraps a Python function so that it runs once for every row of a Spark DataFrame. From its documentation:
Creates a user defined function (UDF).
New in version 1.3.0.
Parameters:
- f : function
  python function if used as a standalone function
- returnType : pyspark.sql.types.DataType or str
  the return type of the user-defined function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
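As a side illustration (not from the notebook), the "run for every row" behavior can be mimicked without a Spark session by applying the same lambda to a plain Python list standing in for a DataFrame column (the values here are hypothetical):

```python
from datetime import datetime, timedelta

# The same lambda that is passed to udf(); Spark would call it once per row.
parse_one = lambda x: (datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'

# Hypothetical column of "hours ago" values.
hours_col = [0, 3, 24]
for ts in map(parse_one, hours_col):
    print(ts)  # one ISO-8601 UTC timestamp per "row"
```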
Here the returnType is StringType(), i.e. the UDF returns a string. Removing it, we get the function body we’re interested in:
lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
In order to find out what the given lambda function does, you can rewrite it as a regular function. You may need to add the imports too.
import datetime
from datetime import timedelta

def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
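Calling the function shows the shape of the result. A quick self-contained check (the exact digits differ on each run, so only the format is asserted):

```python
import re
import datetime
from datetime import timedelta

def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'

s = func(3)
print(s)
# isoformat() yields microsecond precision; the [:-3] slice trims it to
# milliseconds, and the trailing 'Z' marks the timestamp as UTC ("Zulu time").
assert re.fullmatch(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z', s)
```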
To really see what’s going on, you can create a variable for every step of the expression and print them:
import datetime
from datetime import timedelta

def my_func(x):
    v1 = datetime.datetime.utcnow()
    v2 = timedelta(hours=x)
    v3 = v1 - v2
    v4 = v3.isoformat()
    v5 = v4[:-3]
    v6 = v5 + 'Z'
    for e in (v1, v2, v3, v4, v5):
        print(e)
    return v6

print(my_func(3))

# 2022-06-17 07:16:36.212566
# 3:00:00
# 2022-06-17 04:16:36.212566
# 2022-06-17T04:16:36.212566
# 2022-06-17T04:16:36.212
# 2022-06-17T04:16:36.212Z
This way you can see how the result changes after every step. You can print whatever you want at any step you need, e.g. print(type(v4)).
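One caveat worth adding: datetime.utcnow() is deprecated as of Python 3.12 in favor of timezone-aware datetimes. A sketch of an equivalent function using datetime.now(timezone.utc), producing the same millisecond precision and 'Z' suffix:

```python
from datetime import datetime, timedelta, timezone

def my_func_aware(x):
    # Timezone-aware "now" in UTC, shifted back x hours.
    t = datetime.now(timezone.utc) - timedelta(hours=x)
    # timespec='milliseconds' avoids the manual [:-3] slice; an aware datetime
    # renders as '...+00:00', which we swap for the 'Z' suffix.
    return t.isoformat(timespec='milliseconds').replace('+00:00', 'Z')

print(my_func_aware(3))
```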