
How can I generate the same UUID for multiple dataframes in spark?

I have a DataFrame that I read from a file:

import uuid

df = spark.read.csv(path, sep="|", header=True)

Then I give it a UUID column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
df = df.withColumn("UUID", uuidUdf())

Now I create a view:

df.createOrReplaceTempView("view")

Now I create two new DataFrames that take data from the view; both should use the original UUID column:

df2 = spark.sql("select UUID from view")
df3 = spark.sql("select UUID from view")

All three DataFrames end up with different UUIDs. Is there a way to keep them the same across each DataFrame?


Answer

Spark uses lazy evaluation: nothing is computed until you call show() or another action, and each action re-executes the plan from the source. Since your UUID column is produced by a non-deterministic UDF, every re-execution generates fresh UUIDs. To avoid this, cache the DataFrame before calling createOrReplaceTempView, so later actions read the materialized rows instead of recomputing them. (Note that cache() is best-effort; evicted partitions may be recomputed, so for a hard guarantee you would write the DataFrame to storage and read it back.) Here is what you should do:
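The recomputation can be illustrated with a plain-Python analogy (hypothetical names, not Spark API): a lazily evaluated column is just a function that runs again on every action, so a uuid4-based column yields new values each time it is materialized.

```python
import uuid

# A lazily evaluated "column": nothing happens until it is called,
# and every call re-runs the computation -- just like an action
# re-executing an un-cached Spark plan.
make_uuid_column = lambda: str(uuid.uuid4())

first_action = make_uuid_column()   # e.g. df2 being materialized
second_action = make_uuid_column()  # e.g. df3 being materialized

# The two "actions" see different values because nothing was cached.
print(first_action == second_action)  # False (with overwhelming probability)
```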

import uuid

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = spark.read.csv(path, sep="|", header=True)
uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
df = df.withColumn("UUID", uuidUdf())

df.cache()

df.createOrReplaceTempView("view")

df2 = spark.sql("select UUID from view")
df3 = spark.sql("select UUID from view")
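Continuing the plain-Python analogy (hypothetical names, not Spark API), caching amounts to evaluating the value once and reusing the stored result on every subsequent read:

```python
import uuid

def make_uuid_column():
    # Re-runs on every call, like an un-cached Spark plan.
    return str(uuid.uuid4())

# "cache": evaluate once and keep the materialized result.
cached_uuid = make_uuid_column()

# Every later "action" reads the stored value instead of recomputing it.
df2_uuid = cached_uuid
df3_uuid = cached_uuid

print(df2_uuid == df3_uuid)  # True
```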

User contributions licensed under: CC BY-SA