Suppose we have a PySpark DataFrame df with ~10M rows, and let the columns be [col_a, col_b]. Which would be faster:
df_test = df.sample(0.1)
for i in range(10):
    df_sample = df_test.select(df.col_a).distinct().take(10)
or
df_test = df.sample(0.1)
df_test = df_test.cache()
for i in range(10):
    df_sample = df_test.select(df.col_a).distinct().take(10)
Would caching df_test make sense here?
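One way to check empirically would be to time both variants on the actual data; a minimal sketch, assuming a live SparkSession, the df from above, and that driver-side wall-clock timing is good enough:

import time

def time_variant(build_df, n_iters=10):
    # Illustrative helper: run n_iters take() calls and return elapsed wall-clock seconds
    df_test = build_df()
    start = time.time()
    for _ in range(n_iters):
        df_test.select("col_a").distinct().take(10)
    return time.time() - start

# Variant 1: plain sample, no cache
t_plain = time_variant(lambda: df.sample(0.1))

# Variant 2: sample followed by cache() (the first take() pays the materialisation cost)
t_cached = time_variant(lambda: df.sample(0.1).cache())

print(f"no cache: {t_plain:.1f}s  cached: {t_cached:.1f}s")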
Answer
It won’t make much difference. It’s a single loop, so you can skip cache() entirely and sample inside the loop, like below:
>>> for i in range(10):
...     df_sample = df.sample(0.1).select(df.col_a).distinct().take(10)
Here Spark is loading the data into memory once.
If you want to reuse df_sample repeatedly in further operations, then you can use cache().
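For instance, a minimal sketch of that situation, assuming the same 10% sample feeds several different actions:

# Reuse the same 10% sample across several different actions
df_sample = df.sample(0.1).cache()

n_a = df_sample.select("col_a").distinct().count()  # first action materialises the cache
n_b = df_sample.select("col_b").distinct().count()  # served from the cached partitions
rows = df_sample.take(10)                           # also reads from the cache

df_sample.unpersist()  # free the cached partitions when finished

Since cache() is lazy, the cached data only exists after the first action runs; every action after that reuses it instead of re-reading and re-sampling the source.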