I came across the line of PySpark lambda code below while browsing a long Python Jupyter notebook, and I am trying to understand it. Can you explain what it does in the best possible way?
parse = udf (lambda x: (datetime.datetime.utcnow() - timedelta(hours= x)).isoformat()[:-3] + 'Z', StringType())
Answer
udf( lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z', StringType() )
udf in PySpark wraps a Python function so that it runs once for every row of a Spark DataFrame. From its documentation:
Creates a user defined function (UDF).
New in version 1.3.0.
Parameters:
- f : function
  python function if used as a standalone function
- returnType : pyspark.sql.types.DataType or str
  the return type of the user-defined function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
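As a side illustration (not from the notebook), the "run for every row" behavior can be mimicked without a Spark session by applying the same lambda to a plain Python list standing in for a DataFrame column (the values here are hypothetical):

```python
from datetime import datetime, timedelta

# The same lambda that is passed to udf(); Spark would call it once per row.
parse_one = lambda x: (datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'

# Hypothetical column of "hours ago" values.
hours_col = [0, 3, 24]
for ts in map(parse_one, hours_col):
    print(ts)  # one ISO-8601 UTC timestamp per "row"
```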
Here the returnType is StringType(), i.e. the UDF returns a string. Removing it, we get the function body we’re interested in:
lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
In order to find out what the given lambda function does, you can rewrite it as a regular function. You may need to add the imports too.
import datetime
from datetime import timedelta

def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
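Calling the function shows the shape of the result. A quick self-contained check (the exact digits differ on each run, so only the format is asserted):

```python
import re
import datetime
from datetime import timedelta

def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'

s = func(3)
print(s)
# isoformat() yields microsecond precision; the [:-3] slice trims it to
# milliseconds, and the trailing 'Z' marks the timestamp as UTC ("Zulu time").
assert re.fullmatch(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z', s)
```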
To really see what’s going on, you can create a variable for every step of the expression and print them:
import datetime
from datetime import timedelta

def my_func(x):
    v1 = datetime.datetime.utcnow()
    v2 = timedelta(hours=x)
    v3 = v1 - v2
    v4 = v3.isoformat()
    v5 = v4[:-3]
    v6 = v5 + 'Z'
    for e in (v1, v2, v3, v4, v5):
        print(e)
    return v6

print(my_func(3))

# 2022-06-17 07:16:36.212566
# 3:00:00
# 2022-06-17 04:16:36.212566
# 2022-06-17T04:16:36.212566
# 2022-06-17T04:16:36.212
# 2022-06-17T04:16:36.212Z
This way you can see how the result changes after every step. You can print whatever you want at any step you need, e.g. print(type(v4)).
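One caveat worth adding: datetime.utcnow() is deprecated as of Python 3.12 in favor of timezone-aware datetimes. A sketch of an equivalent function using datetime.now(timezone.utc), producing the same millisecond precision and 'Z' suffix:

```python
from datetime import datetime, timedelta, timezone

def my_func_aware(x):
    # Timezone-aware "now" in UTC, shifted back x hours.
    t = datetime.now(timezone.utc) - timedelta(hours=x)
    # timespec='milliseconds' avoids the manual [:-3] slice; an aware datetime
    # renders as '...+00:00', which we swap for the 'Z' suffix.
    return t.isoformat(timespec='milliseconds').replace('+00:00', 'Z')

print(my_func_aware(3))
```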