
Median and quantile values in PySpark

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (perhaps my approach is inefficient).

Is there any good way to improve this?

Dataframe example:

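Something like this (illustrative values only; the real dataframe has the ~77 billion rows mentioned above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real dataframe (made-up ages)
df = spark.createDataFrame([(25,), (31,), (47,), (52,), (68,)], ["age"])
df.show()
```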

What I have done so far:

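Something along these lines – one exact approxQuantile call per quantile, with relativeError set to 0 (the probabilities and variable names here are illustrative):

```python
# One exact quantile per call – each call runs a separate, expensive job
quantiles = {
    "q1": df.approxQuantile("age", [0.25], 0)[0],
    "median": df.approxQuantile("age", [0.5], 0)[0],
    "q3": df.approxQuantile("age", [0.75], 0)[0],
}
print(quantiles)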


Answer

The first improvement would be to do all the quantile calculations at the same time:

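For example, a sketch using the same quartiles as in the question:

```python
# A single call computes all requested quantiles together,
# instead of launching one job per quantile
q1, median, q3 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
```

Passing the full list of probabilities to one approxQuantile call means the work is done in one aggregation rather than three.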

Also, note that you are using the exact calculation of the quantiles. From the documentation we can see that (emphasis added by me):

relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, *which could be very expensive*. Note that values greater than 1 are accepted but give the same result as 1.

Since you have a very large dataframe, I expect that some error is acceptable in these calculations. It is a trade-off between speed and precision, but any value greater than 0 should give a significant speed improvement.
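For instance, a relative error of 0.01 (an arbitrary but reasonable starting point) would look like this:

```python
# A small relative error (here 1%) makes the computation
# approximate but much cheaper than the exact version (0)
q1, median, q3 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
```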
