Median and quantile values in Pyspark

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (perhaps my approach is inefficient).

Is there any good way to improve this?

Dataframe example:

id       age
1         18
2         32
3         54
4         63
5         42
6         23
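For reference, a small DataFrame matching the example above can be built as follows (a minimal sketch, assuming a running SparkSession; the app name is just illustrative):

from pyspark.sql import SparkSession

# Assumed SparkSession; on a real cluster this would usually already exist
spark = SparkSession.builder.appName("quantile-example").getOrCreate()

# Recreate the small example data for experimentation
df = spark.createDataFrame(
    [(1, 18), (2, 32), (3, 54), (4, 63), (5, 42), (6, 23)],
    ["id", "age"],
)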

What I have done so far:

#Summary stats
df.describe('age').show()

#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)


Answer

The first improvement to make is to compute all of the quantiles in a single call:

quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)
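The call returns one value per requested probability, so the results can be unpacked in one line (the variable names below are only illustrative):

# quantiles is a list with one entry per probability, in order
q25, median, q75 = df.approxQuantile("age", [0.25, 0.5, 0.75], 0)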

Also, note that you are computing the exact quantiles. From the documentation (emphasis added by me):

relativeError – The relative target precision to achieve (>= 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.

Since you have a very large dataframe, some error is probably acceptable in these calculations. It is a trade-off between speed and precision, but any relativeError greater than 0 can yield a significant speed improvement.
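As a rough sketch of that trade-off in code (the 0.001 value is only an illustrative choice, not a recommendation from the answer):

# Allow a small relative error instead of demanding exact quantiles;
# on ~77 billion rows this is typically far cheaper to compute.
quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.001)
q25, median, q75 = quantiles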
