
Python: Develop a Multiple Linear Regression Model From Scratch

I am trying to create a multiple linear regression model from scratch in Python, using the Boston Housing dataset from sklearn. Since my focus was on the model building, I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a Linear Regression model to find the weights for each feature.

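For context, a minimal sketch of the workflow described above (the exact code isn't shown in the question, so the dropped feature names here are illustrative; this assumes statsmodels for the p-values and a scikit-learn version older than 1.2, where `load_boston` is still available):

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston          # removed in scikit-learn 1.2
from sklearn.linear_model import LinearRegression

boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = boston.target

# Fit an OLS model (with a constant) to inspect the p-value of each feature
ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.pvalues)

# Drop the 3 features with high p-values (illustrative choice)
X_reduced = X.drop(columns=["INDUS", "AGE", "CHAS"])

# Fit sklearn's LinearRegression on the remaining 10 features
lr = LinearRegression().fit(X_reduced, y)
print(lr.intercept_, lr.coef_)
```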

Now I wanted to calculate the coefficients manually in Excel before creating the model in Python. To calculate the weight of each feature I used this formula:

Calculating the weights of the features: b_i = sum((x_i - mean(x_i)) * (y - mean(y))) / sum((x_i - mean(x_i))^2), applied to each feature x_i separately.

To calculate the intercept I used the formula b0 = mean(y) - b1*mean(x1) - b2*mean(x2) - … - bn*mean(xn).
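In Python terms, this per-feature calculation amounts to something like the following sketch (continuing with `X_reduced` and `y` from the snippet above; these names are illustrative, not the question's actual code):

```python
import numpy as np

Xv = X_reduced.to_numpy()
y_bar = y.mean()

# Apply the simple-regression slope formula to each feature column separately
b = np.array([
    ((Xv[:, j] - Xv[:, j].mean()) * (y - y_bar)).sum()
    / ((Xv[:, j] - Xv[:, j].mean()) ** 2).sum()
    for j in range(Xv.shape[1])
])

# Intercept: b0 = mean(y) - b1*mean(x1) - ... - bn*mean(xn)
b0 = y_bar - (b * Xv.mean(axis=0)).sum()
```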

The intercept value from my calculations was 22.63551387 (almost the same as the model's).

The problem is that the weights of the features from my calculation are far off from those of the sklearn linear model.


Using the first row as test data to check my calculations, I get 22.73167044199992, while the Linear Regression model predicts 30.42657776. The original value is 24.
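That check can be reproduced roughly like this (reusing `lr`, `X_reduced`, `b0` and `b` from the sketches above):

```python
# Manual prediction for the first row: intercept + sum(weight * feature value)
x0 = X_reduced.iloc[0].to_numpy()
manual_pred = b0 + np.dot(b, x0)

# sklearn's prediction for the same row, and the actual target value
sklearn_pred = lr.predict(X_reduced.iloc[[0]])[0]
print(manual_pred, sklearn_pred, y[0])
```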

But as soon as I check other rows, the sklearn model's predictions show more variation, while the predictions made with the weights from my calculations all stay close to 22.

I think I am making a mistake in calculating the weights, but I am not sure where the problem is. Is there a mistake in my calculation? Why are all the coefficients from my calculations so close to 0?

Here is my code for calculating the coefficients (beginner here):


Thank you for reading this long post, I appreciate it.


Answer

It seems like the trouble lies in the coefficient calculation. The formula you have given for calculating the coefficients is the scalar form, used for the simplest case of linear regression, namely with only one feature x:

b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

EDIT

Now, after seeing your code for the coefficient calculation, the problem is clearer. You cannot use this equation to calculate the coefficients of each feature independently of the others, as each coefficient depends on all the features. I suggest you take a look at the derivation of the solution to this least squares optimization problem in the simple case here and in the general case here. As a general tip, stick with the matrix implementation whenever you can, as it is far more efficient.

However, in this case we have a 10-dimensional feature vector, so in matrix notation the solution becomes b = (X^T X)^(-1) X^T y.

See derivation here

I suspect you made some computational error here, as implementing this in Python using the scalar formula is more tedious and untidy than the matrix equivalent. But since you haven't shared this piece of your code, it's hard to know.

Here’s an example of how you would implement it:

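A minimal sketch of that matrix implementation, assuming a NumPy design matrix built from the reduced feature set used above (a column of ones is prepended so the first entry of beta is the intercept):

```python
import numpy as np

# Design matrix: a leading column of ones for the intercept, then the 10 features
X = np.c_[np.ones(len(X_reduced)), X_reduced.to_numpy()]

# Normal-equation solution: beta = (X^T X)^(-1) X^T y
beta = np.linalg.inv(X.T @ X) @ X.T @ y

intercept, weights = beta[0], beta[1:]
print(intercept)   # should be close to lr.intercept_
print(weights)     # should be close to lr.coef_
```

In practice, `np.linalg.solve(X.T @ X, X.T @ y)` or `np.linalg.lstsq(X, y, rcond=None)` is preferable to forming the inverse explicitly, as it is more numerically stable.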
User contributions licensed under: CC BY-SA