
What are alternative methods for pandas quantile and cut in pyspark 1.6

I'm new to pyspark. I have pandas code that uses quantile and cut, along the lines of the sketch below.
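
(The column name my_col, the quantile list and the sample values here are placeholders; the point is just the quantile-then-cut pattern, not my exact code.)

    import pandas as pd

    df = pd.DataFrame({"my_col": [1.0, 2.5, 3.7, 4.2, 5.9, 7.1]})

    # quantile edges of the column, then bucket each value into those edges
    quantile_edges = df["my_col"].quantile([0.0, 0.25, 0.5, 0.75, 1.0]).values
    df["my_col_bucket"] = pd.cut(df["my_col"], bins=quantile_edges, include_lowest=True)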

I have found approxQuantile in pyspark 2.x, but I couldn't find anything like it in pyspark 1.6.0.

My sample input:

df.show()

df.collect()

I have to loop the above logic over all of the input columns.

Could anyone please suggest how to rewrite the above code for a pyspark 1.6 DataFrame?

Thanks in advance


Answer

If you're using pyspark 2.x, you can use QuantileDiscretizer from the ML library, which uses approxQuantile() and Bucketizer under the hood.
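
A minimal sketch of the 2.x approach, assuming a DataFrame df with a numeric column named my_col:

    from pyspark.ml.feature import QuantileDiscretizer

    # pyspark 2.x: split my_col into 10 approximately equal-frequency buckets
    discretizer = QuantileDiscretizer(numBuckets=10,
                                      inputCol="my_col",
                                      outputCol="my_col_bucket")
    bucketed_df = discretizer.fit(df).transform(df)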

However, since you're using pyspark 1.6.x, you need to:

1. Find the quantile values of a column

You can find the quantile values in two ways:

  1. Compute the percentile of each row in the column with percent_rank() and extract the column values whose percentile is closest to the quantile you want

  2. Follow the methods in this answer, which explains how to perform quantile approximations with pyspark < 2.0.0

Here’s my example implementation of approximating quantile values:

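The listing below is a sketch of one way to implement this, following the step-by-step explanation further down; the helper names compute_quantile and extract_quantiles match the usage shown later, and the UDF that snaps each percentile to its nearest quantile is just one possible way to handle that step.

    from pyspark.sql import Window
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    def compute_quantile(df, col, quantiles):
        # 1. Percentile of every row in the column (percent_rank is a Window function).
        w = Window.orderBy(col)
        df = df.withColumn("percentile", F.percent_rank().over(w))

        # 2. Snap each percentile to the desired quantile with the smallest
        #    squared error, and keep that error around.
        nearest = F.udf(
            lambda p: float(min(quantiles, key=lambda q: (p - q) ** 2)),
            DoubleType())
        df = df.withColumn("quantile_category", nearest(F.col("percentile")))
        diff = F.col("percentile") - F.col("quantile_category")
        df = df.withColumn("error", diff * diff)

        # 3. Within each quantile category, the approximate quantile value is
        #    the column value with the smallest error.
        w_cat = Window.partitionBy("quantile_category").orderBy("error")
        return df.withColumn("approx_quantile_value", F.first(col).over(w_cat))

    def extract_quantiles(df):
        # Unique (quantile category, approximate quantile value) pairs.
        return (df.select("quantile_category", "approx_quantile_value")
                  .distinct()
                  .orderBy("quantile_category"))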

What I want to achieve here is to compute the percentile of each row in the column and assign it to the nearest quantile. Assigning a percentile to the nearest quantile is done by choosing the quantile category with the lowest squared error relative to the percentile.

1. Computing Percentile

First, I compute the percentile of a column using percent_rank(), a Window function in pyspark. You can think of a Window as a partition specification for your data. Since percent_rank() is a Window function, you need to pass in a Window.
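
For instance, on five distinct values in a column named my_col (name assumed for illustration), percent_rank() yields 0.0, 0.25, 0.5, 0.75 and 1.0:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    w = Window.orderBy("my_col")
    df.select("my_col", F.percent_rank().over(w).alias("percentile")).show()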

2. Categorize percentile to quantile boundaries and compute errors

The nearest quantile category to a percentile can be below, equal to, or above it. Hence, I need to compute the errors twice: first to compare the percentile with the lower quantile bounds, and second to compare it with the upper quantile bounds. Note that the ≤ operator is used to check whether the percentile is less than or equal to the boundaries. Once the direct upper and lower quantile boundaries of a percentile are known, the percentile can be assigned to the nearest quantile category by choosing whichever of the two (the below-or-equal or the above-or-equal category) has the lower error.
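
For example, with deciles as the desired quantiles, a row whose percentile is 0.37 falls between the boundaries 0.3 and 0.4; the squared errors are (0.37 - 0.3)^2 = 0.0049 against the lower bound and (0.4 - 0.37)^2 = 0.0009 against the upper bound, so the row is assigned to the 0.4 category.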

3. Approximate quantile values

Once we know the closest quantile category for each percentile, we can approximate the quantile values: each one is the value with the lowest error within its quantile category. These approximate quantile values can be computed using the first() function over a Window partitioned by quantile category. Then, to extract the quantile values, we simply select the unique percentileCategory-approxQuantileValue pairs from the dataframe.


After testing on my data (~10,000 rows) with desired_quantiles = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], the quantile values returned by extract_quantiles(compute_quantile(df, col, quantiles)) came out quite close to the approxQuantile results, and the two get even closer as the relative error passed to approxQuantile is decreased.
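
For reference, the two can be compared roughly like this (my_col is a placeholder column name; DataFrame.approxQuantile only exists from pyspark 2.0, and its last argument is the relative error):

    desired_quantiles = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

    # approximation built in this answer (works on pyspark 1.6)
    extract_quantiles(compute_quantile(df, "my_col", desired_quantiles)).show()

    # built-in approximation, pyspark 2.x only
    print(df.approxQuantile("my_col", desired_quantiles, 0.01))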

2. Use Bucketizer

After finding the quantile values, you can use pyspark's Bucketizer to bucketize values based on those quantiles. Bucketizer is available in both pyspark 1.6.x [1][2] and 2.x [3][4].

Here is an example of how you can perform bucketization:

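A minimal sketch, assuming quantile_values is the sorted list of quantile values found in step 1 and the input column is named my_col:

    from pyspark.ml.feature import Bucketizer

    # splits must be strictly increasing and cover the whole value range;
    # the +/- infinity guards catch values outside the known boundaries
    value_boundaries = [-float("inf")] + quantile_values + [float("inf")]

    bucketizer = Bucketizer(splits=value_boundaries,
                            inputCol="my_col",
                            outputCol="my_col_bucket")
    bucketed_df = bucketizer.transform(df)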

You can replace value_boundaries with the quantile values you found in step 1, or with any bucket split range you want. When using Bucketizer, the whole value range of the column must be covered by the splits; otherwise, values outside the specified splits are treated as errors. Infinite values such as -float("inf") and float("inf") must be explicitly provided to cover all floating-point values if you're unsure about the value boundaries of your data.

User contributions licensed under: CC BY-SA