
Caching a PySpark Dataframe

Suppose we have a PySpark dataframe df with ~10M rows and columns [col_a, col_b]. Which would be faster:

df_test = df.sample(0.1)
for i in range(10):
  df_sample = df_test.select("col_a").distinct().take(10)

or

df_test = df.sample(0.1)
df_test = df_test.cache()
for i in range(10):
  df_sample = df_test.select("col_a").distinct().take(10)

Would caching df_test make sense here?


Answer

It won’t make much difference. It is just one loop, so you can skip the cache altogether, like below:

>>> for i in range(10):
...   df_sample = df.sample(0.1).select("col_a").distinct().take(10)

Here Spark loads the data into memory once.

If you want to reuse the sampled dataframe (df_test) repeatedly in other operations, then cache() makes sense; see the sketch below.
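
A minimal sketch of that scenario, assuming a running SparkSession named spark and the [col_a, col_b] columns from the question (the dataframe construction here is only illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Illustrative dataframe with the question's columns.
df = spark.createDataFrame([(i, i % 7) for i in range(100000)], ["col_a", "col_b"])

# Sample once and cache; the cache is lazy, so the sample is materialized
# by the first action and later actions read it back from memory
# instead of recomputing it from df.
df_test = df.sample(0.1).cache()

distinct_a = df_test.select("col_a").distinct().take(10)  # first action populates the cache
row_count = df_test.count()                               # served from the cached sample
avg_b = df_test.agg({"col_b": "avg"}).collect()           # also served from the cache

df_test.unpersist()  # release the cached blocks when finished

Without the cache(), each of those three actions would re-read the source data and recompute the sample, which is exactly the repeated work caching avoids.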
