Debugging PySpark udf (lambda function using datetime)

I came across the lambda code line below in PySpark while browsing a long Python Jupyter notebook, and I am trying to understand it. Can you explain what it does as clearly as possible?

parse = udf(lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z', StringType())


Answer

udf(
    lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z',
    StringType()
)

udf in PySpark wraps a Python function so that it is run for every row of a Spark DataFrame.
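For context, here is a minimal sketch of how such a UDF is typically applied to a DataFrame; the SparkSession setup and the hours_ago column are assumptions for illustration, not from the notebook:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
import datetime
from datetime import timedelta

spark = SparkSession.builder.getOrCreate()

# the UDF from the question, returning one string per row
parse = udf(
    lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z',
    StringType()
)

df = spark.createDataFrame([(1,), (24,)], ["hours_ago"])  # hypothetical input
df.withColumn("timestamp", parse(col("hours_ago"))).show(truncate=False)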

Creates a user defined function (UDF).

New in version 1.3.0.

Its second parameter, returnType, is set to StringType(), so the UDF returns a string. Removing it, we get the lambda function we're interested in:

lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'

To find out what the given lambda function does, you can turn it into a regular function. You may need to add the required imports as well.

import datetime
from datetime import timedelta

def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
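Now the function can be called directly, outside of Spark:

print(func(3))
# e.g. 2022-06-17T04:16:36.212Z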

To really see what's going on, you can assign every intermediate result to its own variable and print them.

import datetime
from datetime import timedelta

def my_func(x):
    v1 = datetime.datetime.utcnow()   # current UTC time
    v2 = timedelta(hours=x)           # duration of x hours
    v3 = v1 - v2                      # UTC time x hours ago
    v4 = v3.isoformat()               # ISO 8601 string, microsecond precision
    v5 = v4[:-3]                      # drop the last 3 digits -> millisecond precision
    v6 = v5 + 'Z'                     # append 'Z' to mark UTC ("Zulu" time)

    for v in (v1, v2, v3, v4, v5):
        print(v)

    return v6

print(my_func(3))

# 2022-06-17 07:16:36.212566
# 3:00:00
# 2022-06-17 04:16:36.212566
# 2022-06-17T04:16:36.212566
# 2022-06-17T04:16:36.212
# 2022-06-17T04:16:36.212Z

This way you see how the result changes after every step: the [:-3] slice strips the last three digits of the microseconds, leaving millisecond precision, and + 'Z' appends the UTC ("Zulu") designator. You can print whatever you want at any step you need, e.g. print(type(v4)).
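For instance, a short sketch checking the type at each step (hypothetical, reusing the same variable names):

import datetime
from datetime import timedelta

v1 = datetime.datetime.utcnow()
v2 = timedelta(hours=3)
v3 = v1 - v2
v4 = v3.isoformat()

print(type(v1))  # <class 'datetime.datetime'>
print(type(v2))  # <class 'datetime.timedelta'>
print(type(v3))  # <class 'datetime.datetime'> (datetime minus timedelta is a datetime)
print(type(v4))  # <class 'str'> (isoformat() produces a string)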
