How can I generate the same UUID for multiple DataFrames in Spark?

Question

I have a df that I read from a file. I then give it a UUID column and create a view. Next I create two new DataFrames that take data from the view; both use the original UUID column. All 3 DataFrames end up with different UUIDs. Is there a way to keep them the same across each?

Accepted Answer

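Spark uses lazy evaluation: the computation only runs when you call show or another action. This means that every time you call an action, the UUID is recalculated. To avoid this, you need to cache the df before you call createOrReplaceTempView. Here is what you should do:

    import uuid
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    df = spark.read.csv(path, sep="|", header=True)
    # UDF that generates a random UUID string for each row
    uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
    df = df.withColumn("UUID", uuidUdf())
    # Cache so later actions reuse the computed UUIDs instead of regenerating them
    df.cache()
    # createOrReplaceTempView returns None, so there is nothing to assign
    df.createOrReplaceTempView("view")
    df2 = spark.sql("select UUID from view")
    df3 = spark.sql("select UUID from view")

Keep in mind that cache() is itself lazy and cached partitions can be evicted and recomputed, so if the UUIDs must never change it is safer to write the DataFrame out and read it back.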
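
To see why an uncached UUID column changes between actions, here is a rough pure-Python analogy. It does not use Spark at all: `LazyColumn` and its methods are hypothetical names invented for this illustration. One deliberate simplification: this toy `cache()` materializes immediately, whereas Spark's `cache()` only materializes on the first action.

```python
import uuid


class LazyColumn:
    """Toy stand-in for a lazily evaluated column (hypothetical, not a Spark API)."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self._cached = None

    def _compute(self):
        # Re-runs the "plan": a fresh random UUID per row on every call.
        return [str(uuid.uuid4()) for _ in range(self.n_rows)]

    def cache(self):
        # Materialize once and remember the result (eager, unlike Spark).
        self._cached = self._compute()
        return self

    def collect(self):
        # An "action": reuse cached rows if present, otherwise recompute.
        return self._cached if self._cached is not None else self._compute()


# Without caching, each action regenerates the UUIDs.
uncached = LazyColumn(3)
first_uncached = uncached.collect()
second_uncached = uncached.collect()

# With caching, every action sees the same UUIDs.
cached = LazyColumn(3).cache()
first_cached = cached.collect()
second_cached = cached.collect()

print(first_uncached == second_uncached)  # False
print(first_cached == second_cached)      # True
```

The two queries against the view in the answer play the role of the two `collect()` calls here: without a materialized result, each one re-executes the UDF.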

Answer