In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my approach is very bad).
Is there any good way to improve this?
Dataframe example:
id  age
1   18
2   32
3   54
4   63
5   42
6   23
What I have done so far:
#Summary stats
df.describe('age').show()

#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)
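For reference, a minimal setup that reproduces the small example above might look like this (the SparkSession variable name spark is an assumption; the real 77-billion-row dataframe would of course be loaded from storage instead):

from pyspark.sql import SparkSession

# Assumed setup for a reproducible example only
spark = SparkSession.builder.appName("quantiles").getOrCreate()
df = spark.createDataFrame(
    [(1, 18), (2, 32), (3, 54), (4, 63), (5, 42), (6, 23)],
    ["id", "age"],
)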
Answer
The first improvement would be to do all of the quantile calculations in a single call:
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
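approxQuantile returns the quantiles as a list of floats, in the same order as the requested probabilities, so the three values can be unpacked directly:

# One pass over the data instead of three
q25, q50, q75 = quantiles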
Also, note that you are requesting the exact calculation of the quantiles by passing 0 as the relative error. From the documentation we can see that (emphasis added by me):
relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
Since you have a very large dataframe, I expect that some error is acceptable in these calculations. It is a trade-off between speed and precision, but any relative error greater than 0 should give a significant speed improvement.
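As a sketch, passing a small non-zero relative error looks like the following (the value 0.001 is only an illustrative choice, not a recommendation; tune it to your precision needs):

# Allow a 0.1% relative error instead of exact quantiles; this lets Spark
# use its approximate algorithm and avoid the very expensive exact computation
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.001)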