
Python: Develop a Multiple Linear Regression Model From Scratch

I am trying to create a multiple linear regression model from scratch in Python. Dataset used: the Boston Housing dataset from sklearn. Since my focus was on model building, I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a Linear Regression model to find the weights for each feature.
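For reference, that p-value step can be sketched roughly as follows (statsmodels' OLS and the exact cutoff are assumptions; this step is not part of the code below):

import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston

# assumption: statsmodels OLS; the p-value step is not shown in the code further down
boston = load_boston()
X_all = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

# ordinary least squares fit with an explicit intercept column
ols_results = sm.OLS(y, sm.add_constant(X_all)).fit()
print(ols_results.pvalues)   # p-value for each feature
# features judged insignificant are then dropped (INDUS, NOX and AGE above)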

import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

X = load_boston()
data = pd.DataFrame(X.data, columns=X.feature_names)
y = X.target
data.head()

# dropping three features
data = data.drop(['INDUS', 'NOX', 'AGE'], axis=1)

# new shape of the data: (506, 10), not including the target variable

# passed the whole dataset to the Linear Regression model
model_lr = LinearRegression()
model_lr.fit(data, y)

model_lr.score(data, y)
0.7278959820021539

model_lr.intercept_
22.60536462807957   #----- intercept value

model_lr.coef_
array([-0.09649731,  0.05281081,  2.3802989 ,  3.94059598, -1.05476566,
        0.28259531, -0.01572265, -0.75651996,  0.01023922, -0.57069861]) #--- coefficients

Now I wanted to calculate the coefficients manually in Excel before creating the model in Python. To calculate the weight of each feature I used this formula:

[Image: Calculating the Weights of the Features — the simple-regression slope formula, b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)]

To calculate the intercept I used the formula b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn)
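In code, that intercept formula looks roughly like this (a sketch that plugs in the sklearn coefficients from above rather than my hand-calculated ones):

import numpy as np

# sketch: b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - ... - bn*mean(xn),
# evaluated here with the sklearn coefficients, not the hand-calculated ones
b0 = np.mean(y) - np.dot(model_lr.coef_, data.mean().values)
print(b0)   # should reproduce model_lr.intercept_ (~22.605)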

The intercept value from my calculations was 22.63551387 (almost the same as the model's).

The problem is that the weights of the features from my calculation are far off from those of the sklearn linear model:

-0.002528644 #-- CRIM
-0.001028914 #-- ZN
-0.038663314 #-- CHAS
-0.035026972 #-- RM
-0.014275311 #-- DIS
-0.004058291 #-- RAD
-0.000241103 #-- TAX
-0.015035534 #-- PTRATIO
-0.000318376 #-- B
-0.006411897 #-- LSTAT

Using the first row as test data to check my calculations, I get 22.73167044199992, while the Linear Regression model predicts 30.42657776. The actual value is 24.
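For reference, that comparison can be sketched like this (manual_weights and manual_intercept hold the hand-calculated values listed above):

import numpy as np

# sketch reconstructing the first-row check described above
manual_weights = np.array([-0.002528644, -0.001028914, -0.038663314, -0.035026972,
                           -0.014275311, -0.004058291, -0.000241103, -0.015035534,
                           -0.000318376, -0.006411897])
manual_intercept = 22.63551387

row0 = data.iloc[0].values
print(manual_intercept + np.dot(manual_weights, row0))   # prediction from the hand-calculated weights
print(model_lr.predict(data.iloc[[0]]))                  # sklearn's prediction (30.42657776)
print(y[0])                                              # actual target value (24.0)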

But as soon as I check other rows, the sklearn model shows much more variation, while the predictions made with the weights from my calculation all stay close to 22.

I think I am making a mistake in calculating the weights, but I am not sure where the problem is. Is there a mistake in my calculation? Why are all the coefficients from my calculation so close to 0?

Here is my code for calculating the coefficients (beginner here):

import numpy as np

x_1 = []
x_2 = []
for i, j in zip(data['CRIM'], y):
    mean_x = data['CRIM'].mean()
    mean_y = np.mean(y)
    c = i - mean_x * (j - mean_y)
    d = (i - mean_x) ** 2
    x_1.append(c)
    x_2.append(d)
print(sum(x_1) / sum(x_2))

Thank you for reading this long post, I appreciate it.


Answer

It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is in scalar form, used for the simplest case of linear regression, namely with only one feature x.

[Image: the scalar least-squares formulas for a single feature: b1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2), b0 = mean(y) - b1*mean(x)]

EDIT

Now after seeing your code for the coefficient calculation, the problem is clearer. You cannot use this equation to calculate the coefficients of each feature independently of the others, as each coefficient depends on all the features. I suggest you take a look at the derivation of the solution to this least-squares optimization problem in the simple case here and in the general case here. And as a general tip, stick with the matrix implementation whenever you can, as it is radically more efficient.
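To see this concretely, here is a quick check (a sketch reusing data, y and model_lr from the question): the scalar formula applied to CRIM alone gives its simple-regression slope, which is not the same number as CRIM's coefficient in the multiple regression, because the latter adjusts for all the other features.

import numpy as np

# illustration (not from the original answer): per-feature scalar slope vs. multiple-regression coefficient
crim = data['CRIM'].values
slope_alone = ((crim - crim.mean()) * (y - y.mean())).sum() / ((crim - crim.mean()) ** 2).sum()

print(slope_alone)        # slope from regressing y on CRIM by itself
print(model_lr.coef_[0])  # CRIM's coefficient in the full multiple regression (-0.0965)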

However, in this case we have a 10-dimensional feature vector, so in matrix notation it becomes B = (X^T X)^(-1) X^T Y, where X is the design matrix with a leading column of ones for the intercept.

See derivation here
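In short (paraphrasing the standard least-squares derivation), the coefficients minimize the squared error, and setting the gradient to zero gives the normal equations:

\min_{B}\ \lVert Y - XB \rVert^2
\;\Rightarrow\;
\nabla_B \lVert Y - XB \rVert^2 = -2X^{\top}(Y - XB) = 0
\;\Rightarrow\;
X^{\top}XB = X^{\top}Y
\;\Rightarrow\;
B = (X^{\top}X)^{-1}X^{\top}Y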

I suspect you made some computational error here, as implementing this in Python using the scalar formula is more tedious and untidy than the matrix equivalent. But since you haven't shared this piece of your code, it's hard to know.

Here’s an example of how you would implement it:

import numpy as np

def calc_coefficients(X, Y):
    # normal equation: B = (X^T X)^(-1) X^T Y
    X = np.mat(X)
    Y = np.mat(Y)
    return np.dot((np.dot(np.transpose(X), X)) ** (-1), np.transpose(np.dot(Y, X)))

def score_r2(y_pred, y_true):
    # coefficient of determination: 1 - SS_res / SS_tot
    ss_tot = np.power(y_true - y_true.mean(), 2).sum()
    ss_res = np.power(y_true - y_pred, 2).sum()
    return 1 - ss_res / ss_tot


# design matrix: first column of ones for the intercept, remaining columns are the features
X = np.ones(shape=(506, 11))
X[:, 1:] = data.values

B = calc_coefficients(X, y)
##### Coefficients
B[:]
matrix([[ 2.26053646e+01],
        [-9.64973063e-02],
        [ 5.28108077e-02],
        [ 2.38029890e+00],
        [ 3.94059598e+00],
        [-1.05476566e+00],
        [ 2.82595310e-01],
        [-1.57226536e-02],
        [-7.56519964e-01],
        [ 1.02392192e-02],
        [-5.70698610e-01]])

#### Intercept 
B[0]
matrix([[22.60536463]])

y_pred = np.dot(np.transpose(B), np.transpose(X))
##### First 5 rows predicted
np.array(y_pred)[0][:5]

array([30.42657776, 24.80818347, 30.69339701, 29.35761397, 28.6004966 ])

##### First 5 rows Ground Truth
y[:5]

array([24. , 21.6, 34.7, 33.4, 36.2])

### R^2 score
score_r2(y_pred,y)

0.7278959820021539
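As a side note, np.mat / np.matrix is considered legacy in newer NumPy releases; an equivalent sketch with plain ndarrays (not part of the original answer, using np.linalg.solve on the normal equations instead of an explicit inverse) would be:

import numpy as np

def calc_coefficients_ndarray(X, Y):
    # alternative sketch: solve the normal equations (X^T X) B = X^T Y without forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ Y)

B = calc_coefficients_ndarray(X, y)   # X is the (506, 11) design matrix built above
print(B[0])    # intercept (~22.605)
print(B[1:])   # coefficients for the remaining 10 features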
User contributions licensed under: CC BY-SA