I am trying to create a multiple linear regression model from scratch in Python. Dataset used: the Boston Housing Dataset from sklearn. Since my focus was on the model building, I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a Linear Regression model to find out the weights for each feature.
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

X = load_boston()
data = pd.DataFrame(X.data, columns=X.feature_names)
y = X.target
data.head()

# dropping three features
data = data.drop(['INDUS', 'NOX', 'AGE'], axis=1)
# new shape of the data: (506, 10), not including the target variable

# passed the whole dataset to the Linear Regression model
model_lr = LinearRegression()
model_lr.fit(data, y)

model_lr.score(data, y)
0.7278959820021539

model_lr.intercept_
22.60536462807957    # ----- intercept value

model_lr.coef_
array([-0.09649731,  0.05281081,  2.3802989 ,  3.94059598, -1.05476566,
        0.28259531, -0.01572265, -0.75651996,  0.01023922, -0.57069861])    # --- coefficients
Now I wanted to calculate the coefficients manually in Excel before creating the model in Python. To calculate the weight of each feature I used this formula:
Calculating the weights of the features:

b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
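(In Python terms, that slope formula would read something like the snippet below; simple_slope is just an illustrative name, and x and y are assumed to be 1-D arrays.)

import numpy as np

# b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
def simple_slope(x, y):
    x, y = np.asarray(x), np.asarray(y)
    return ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()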
To calculate the intercept I used the formula:

b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn)
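(Again as a sketch, with intercept as an illustrative name, and coefs assumed to hold b1..bn in the same order as the feature columns:)

import numpy as np

# b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn)
def intercept(coefs, data, y):
    return np.mean(y) - np.dot(coefs, np.asarray(data).mean(axis=0))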
The intercept value from my calculations was 22.63551387 (almost the same as the model's).
The problem is that the weights of the features from my calculation are far off from those of the sklearn linear model.
-0.002528644    # -- CRIM
-0.001028914    # -- ZN
-0.038663314    # -- CHAS
-0.035026972    # -- RM
-0.014275311    # -- DIS
-0.004058291    # -- RAD
-0.000241103    # -- TAX
-0.015035534    # -- PTRATIO
-0.000318376    # -- B
-0.006411897    # -- LSTAT
Using the first row as test data to check my calculations, I get 22.73167044199992, while the Linear Regression model predicts 30.42657776. The original value is 24.
But as soon as I check other rows, the sklearn model shows more variation, while the predictions made with the weights from my calculation all stay close to 22.
I think I am making a mistake in calculating the weights, but I am not sure where the problem is. Is there a mistake in my calculation? Why are all my coefficients from the calculations so close to 0?
Here is my code for calculating the coefficients (beginner here):
import numpy as np

x_1 = []
x_2 = []
for i, j in zip(data['CRIM'], y):
    mean_x = data['CRIM'].mean()
    mean_y = np.mean(y)
    c = i - mean_x * (j - mean_y)
    d = (i - mean_x) ** 2
    x_1.append(c)
    x_2.append(d)
print(sum(x_1) / sum(x_2))
Thank you for reading this long post, I appreciate it.
Answer
It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is in scalar form, used for the simplest case of linear regression, namely with only one feature x.
EDIT
Now after seeing your code for the coefficient calculation, the problem is clearer. You cannot use this equation to calculate the coefficient of each feature independently of the others, as each coefficient depends on all the features; see the illustration below. I suggest you take a look at the derivation of the solution to this least squares optimization problem in the simple case here and in the general case here. And as a general tip, stick with the matrix implementation whenever you can, as it is radically more efficient.
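To make that dependence concrete, here is a small illustration with made-up, deliberately correlated features (all names and numbers are purely for demonstration): fitting each feature on its own with the scalar slope formula gives quite different answers than the joint least squares solution.

import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)    # x2 is correlated with x1
y_demo = 3.0 * x1 - 2.0 * x2 + rng.normal(scale=0.1, size=n)

# per-feature scalar slopes -- these do NOT recover 3 and -2,
# because each one ignores the other feature
for x in (x1, x2):
    print(((x - x.mean()) * (y_demo - y_demo.mean())).sum() / ((x - x.mean()) ** 2).sum())

# joint least squares solution on the centered data -- recovers roughly [3, -2]
X_demo = np.column_stack([x1, x2])
print(np.linalg.lstsq(X_demo - X_demo.mean(axis=0), y_demo - y_demo.mean(), rcond=None)[0])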
However, in this case we have a 10-dimensional feature vector, and so in matrix notation it becomes

B = (X^T X)^(-1) X^T y

where X is the design matrix (with a leading column of ones for the intercept) and y is the target vector.
See derivation here
I suspect you made some computational error here, as implementing this in Python using the scalar formula is more tedious and untidy than the matrix equivalent. But since you haven't shared this piece of your code, it's hard to know.
Here’s an example of how you would implement it:
import numpy as np

def calc_coefficients(X, Y):
    X = np.mat(X)
    Y = np.mat(Y)
    return np.dot((np.dot(np.transpose(X), X)) ** (-1), np.transpose(np.dot(Y, X)))

def score_r2(y_pred, y_true):
    ss_tot = np.power(y_true - y_true.mean(), 2).sum()
    ss_res = np.power(y_true - y_pred, 2).sum()
    return 1 - ss_res / ss_tot

# design matrix: leading column of ones for the intercept, then the 10 features
X = np.ones(shape=(506, 11))
X[:, 1:] = data.values

B = calc_coefficients(X, y)

##### Coefficients
B[:]
matrix([[ 2.26053646e+01],
        [-9.64973063e-02],
        [ 5.28108077e-02],
        [ 2.38029890e+00],
        [ 3.94059598e+00],
        [-1.05476566e+00],
        [ 2.82595310e-01],
        [-1.57226536e-02],
        [-7.56519964e-01],
        [ 1.02392192e-02],
        [-5.70698610e-01]])

#### Intercept
B[0]
matrix([[22.60536463]])

y_pred = np.dot(np.transpose(B), np.transpose(X))

##### First 5 rows predicted
np.array(y_pred)[0][:5]
array([30.42657776, 24.80818347, 30.69339701, 29.35761397, 28.6004966 ])

##### First 5 rows ground truth
y[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])

### R^2 score
score_r2(y_pred, y)
0.7278959820021539
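As a side note, in practice you would avoid forming the matrix inverse explicitly; np.linalg.lstsq solves the same least squares problem more stably. A quick sketch, reusing data and y from above:

import numpy as np

# same design matrix: column of ones for the intercept, then the features
X = np.ones(shape=(506, 11))
X[:, 1:] = data.values

# least-squares solve instead of (X^T X)^(-1) X^T y with an explicit inverse
B, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(B[0])     # intercept -- should match the 22.60536463 above
print(B[1:])    # feature coefficients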