
get p value and r value from HuberRegressor in Sklearn

I have datasets with some outliers. From the simple linear regression, using

stat_lin = stats.linregress(X, Y)

I can get coefficient, intercept, r_value, p_value, std_err

But I want to apply a robust regression method, because I don't want the outliers to influence the fit.

So I applied the Huber regressor from sklearn:

huber = linear_model.HuberRegressor(alpha=0.0, epsilon=1.35)
# X must be 2-D (n_samples, 1); y should be 1-D to avoid a DataConversionWarning
huber.fit(mn_all_df['X'].to_numpy().reshape(-1, 1), mn_all_df['Y'].to_numpy())

From that I can get the coefficient, intercept, scale, and outliers.

I am happy with the result, as the coefficient value is higher and the regression line fits the majority of the data points.

However, I need values such as the r value and p value to show that the results from the Huber regressor are significant.

How can I get the r value and p value from a robust regression (in my case, using the Huber regressor)?


Answer

You can also use robust linear models in statsmodels. For example:

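import statsmodels.api as sm
from sklearn import datasets

# load the iris data; columns 0 and 2 are sepal length and petal length
iris = datasets.load_iris()
x = iris.data[:, 0]
y = iris.data[:, 2]

# robust linear model with Huber's T norm; add_constant adds the intercept column
rlm_model = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT())
rlm_results = rlm_model.fit()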

The p-value you get from scipy.stats.linregress tests the null hypothesis that the slope is zero. You can get the equivalent for the robust fit from the summary:

rlm_results.summary()
                     
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -7.1311      0.539    -13.241      0.000      -8.187      -6.076
x1             1.8648      0.091     20.434      0.000       1.686       2.044
==============================================================================

Now, the r_value from linregress is the correlation coefficient between X and Y, and it stays the same whichever way you fit the line. A robust linear model weighs the observations differently to make the fit less sensitive to outliers, so the usual r-squared calculation does not make sense here. You might even get a lower r-squared, since the line is no longer pulled towards the outlying data points.

See the comments by @Josef (who maintains statsmodels) on the question below, and its answer. You can try the calculation there if you would like a meaningful r-squared:

How to get R-squared for robust regression (RLM) in Statsmodels?

User contributions licensed under: CC BY-SA