Suppose we have a PySpark DataFrame df with ~10M rows, and let the columns be [col_a, col_b]. Which would be faster:
df_test = df.sample(0.1)
for i in range(10):
    df_sample = df_test.select(df.col_a).distinct().take(10)
or
df_test = df.sample(0.1)
df_test = df_test.cache()
for i in range(10):
    df_sample = df_test.select(df.col_a).distinct().take(10)
Would caching df_test make sense here?
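One way to check empirically would be to time both variants on the actual data; a minimal sketch, assuming a live SparkSession, the df from above, and that driver-side wall-clock timing is good enough:

import time

def time_variant(build_df, n_iters=10):
    # Illustrative helper: run n_iters take() calls and return elapsed wall-clock seconds
    df_test = build_df()
    start = time.time()
    for _ in range(n_iters):
        df_test.select("col_a").distinct().take(10)
    return time.time() - start

# Variant 1: plain sample, no cache
t_plain = time_variant(lambda: df.sample(0.1))

# Variant 2: sample followed by cache() (the first take() pays the materialisation cost)
t_cached = time_variant(lambda: df.sample(0.1).cache())

print(f"no cache: {t_plain:.1f}s  cached: {t_cached:.1f}s")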
Answer
It won’t make much difference. It’s a single loop, so you can skip cache() entirely and sample inside the loop, like below:
>>> for i in range(10):
...     df_sample = df.sample(0.1).select(df.col_a).distinct().take(10)
Here Spark is loading the data into memory once.
If you want to reuse df_sample repeatedly in further operations, then you can use cache().
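For instance, a minimal sketch of that situation, assuming the same 10% sample feeds several different actions:

# Reuse the same 10% sample across several different actions
df_sample = df.sample(0.1).cache()

n_a = df_sample.select("col_a").distinct().count()  # first action materialises the cache
n_b = df_sample.select("col_b").distinct().count()  # served from the cached partitions
rows = df_sample.take(10)                           # also reads from the cache

df_sample.unpersist()  # free the cached partitions when finished

Since cache() is lazy, the cached data only exists after the first action runs; every action after that reuses it instead of re-reading and re-sampling the source.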