I have a df that I read from a file
import uuid
df = spark.read.csv(path, sep="|", header=True)
Then I give it a UUID column
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
df = df.withColumn("UUID", uuidUdf())
Now I create a view
df.createOrReplaceTempView("view")
Now I create two new dataframes that take data from the view, both dataframes will use the original UUID column.
df2 = spark.sql("select UUID from view")
df3 = spark.sql("select UUID from view")
All three dataframes end up with different UUIDs. Is there a way to keep them the same across each dataframe?
Answer
Spark uses lazy evaluation: the computation is only invoked when you call show() or another action. This means that every time an action runs, the UUID column is recalculated from scratch, so each dataframe sees freshly generated values. To avoid this, cache the dataframe before calling createOrReplaceTempView. Note that cache() itself is lazy; the data is actually materialized on the first action, and after that the cached UUID values are reused. Here is what you should do:
import uuid
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

df = spark.read.csv(path, sep="|", header=True)
uuidUdf = udf(lambda: str(uuid.uuid4()), StringType())
df = df.withColumn("UUID", uuidUdf())
df.cache()
df.createOrReplaceTempView("view")
df2 = spark.sql("select UUID from view")
df3 = spark.sql("select UUID from view")
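The same effect can be seen without Spark at all. This is a minimal pure-Python sketch (an analogy, not PySpark): a "lazy" pipeline is just a function that re-runs on every action, while "caching" means materializing the result once and reusing it.

```python
import uuid

def lazy_uuid_column(n):
    # Analogous to an uncached Spark plan: the UDF runs again
    # every time an action re-executes the pipeline.
    return [str(uuid.uuid4()) for _ in range(n)]

# Two "actions" on the uncached pipeline regenerate the values.
run1 = lazy_uuid_column(3)
run2 = lazy_uuid_column(3)
assert run1 != run2  # recomputed, so the UUIDs differ

# "Caching": materialize once, then every later read reuses it.
cached = lazy_uuid_column(3)
run3 = list(cached)
run4 = list(cached)
assert run3 == run4  # stable after materialization
```

In Spark the analogue of the materialization step is triggering an action (e.g. df.count()) after df.cache(), which pins the generated UUIDs so that df2 and df3 read identical values.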