Why do coefficient of determination, R², implementations produce different results?

When attempting to implement a Python function for calculating the coefficient of determination, R², I noticed I got wildly different results depending on whose calculation sequence I used.

The Wikipedia page on R² gives a seemingly very clear explanation of how R² should be calculated. My NumPy interpretation of what the wiki page says is the following:

import numpy as np

def calcR2_wikipedia(y, yhat):
    # Mean value of the observed data y.
    y_mean = np.mean(y)
    # Total sum of squares.
    SS_tot = np.sum((y - y_mean)**2)
    # Residual sum of squares.
    SS_res = np.sum((y - yhat)**2)
    # Coefficient of determination.
    R2 = 1.0 - (SS_res / SS_tot)
    return R2

When I try this method with a target vector y and a vector of modeled estimates yhat, this function produces an R² value of -0.00301.

However, the accepted answer to this Stack Overflow post discussing how to calculate R² gives the following definition:

def calcR2_stackOverflow(y, yhat):
    # Total sum of squares.
    SST = np.sum((y - np.mean(y))**2)
    # "Explained" sum of squares, measured against the mean of y.
    SSReg = np.sum((yhat - np.mean(y))**2)
    R2 = SSReg/SST
    return R2

Using that method with the same y and yhat vectors as before, I now get an R² of 0.319.

Moreover, in the same Stack Overflow post, a lot of people seem to be in favor of calculating R² with the SciPy module, like this:

import scipy.stats

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(yhat, y)
R2 = r_value**2

In my case, this produces 0.261.

So my question is: why are R² values produced from seemingly well-accepted sources radically different from each other? And what is the correct way of calculating R² between two vectors?


Answer

Definitions

This is an abuse of notation that often leads to misunderstanding. You are comparing two different coefficients:

  • the coefficient of determination (usually written R²), which applies to any OLS regression, not only linear regression (OLS is linear with respect to the fit parameters, not the function itself);
  • the Pearson correlation coefficient (usually written r, or r² when squared), which applies to linear regression only.

If you read the introduction of the Wikipedia page on the coefficient of determination carefully, you will see that this is discussed there; it starts as follows:

"There are several definitions of R² that are only sometimes equivalent."
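
To see how different the two can be, here is a minimal sketch (the vectors y and yhat below are made up for illustration, not your data): a prediction that is a perfect linear function of the observations has a Pearson r of exactly 1, yet its coefficient of determination can be strongly negative.

import numpy as np
import scipy.stats
from sklearn import metrics

y = np.arange(10, dtype=float)  # observed values
yhat = y + 5.0                  # perfectly correlated, but biased, predictions

# Pearson r only measures the strength of the linear relationship,
# so the constant offset is invisible to it.
r, _ = scipy.stats.pearsonr(y, yhat)  # r = 1.0

# The coefficient of determination compares residuals against the
# variance of y, so the offset drives it far below zero.
metrics.r2_score(y, yhat)             # about -2.03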

MCVE

You can confirm that classical implementations of these scores return the expected results:

import numpy as np
import scipy.stats
from sklearn import metrics

np.random.seed(12345)
x = np.linspace(-3, 3, 1001)
# Noise-free linear model: yh = 1 + 2*x.
yh = np.polynomial.polynomial.polyval(x, [1, 2])
# Add Gaussian noise to get the "observed" data yn.
e = np.random.randn(x.size)
yn = yh + e

Then your function calcR2_wikipedia returns the coefficient of determination (0.9265536406736125), which you can confirm because it returns the same value as sklearn.metrics.r2_score:

metrics.r2_score(yn, yh) # 0.9265536406736125
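
Calling the function from the question on the same data returns the identical value:

calcR2_wikipedia(yn, yh) # 0.9265536406736125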

On the other hand, scipy.stats.linregress returns the correlation coefficient (valid for linear regression only):

slope, intercept, r_value, p_value, std_err = scipy.stats.linregress(yh, yn)
r_value # 0.9625821384210018

You can cross-confirm this from its definition:

# Pearson r is the covariance normalized by both standard deviations.
C = np.cov(yh, yn)
C[1,0]/np.sqrt(C[0,0]*C[1,1]) # 0.9625821384210017
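
The same value also drops out of np.corrcoef, which normalizes the covariance matrix for you:

np.corrcoef(yh, yn)[0, 1] # ~0.96258, identical up to floating-point rounding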