I’m trying to understand how multiple linear regression works in code for machine learning. The issue I’m having is that I don’t know whether I am setting up my regression line properly or whether my coefficients are correct.
So I guess I can divide my thoughts into three questions.
- Is my method of finding the coefficients for the regression line correct?
- Is my method of setting up the regression line correct?
- Is my method of plotting correct?
My code in Python 3.8.5:
```python
from scipy import stats as stats
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("cars.csv")
df = dataset.fillna(dataset.mean().round(1))
x_cars = df[['Weight', 'Volume']]
y_cars = df['CO2']
x_cars_weight = x_cars.Weight
x_cars_volume = x_cars.Volume

# Best fitted line multiple variables
X = [x_cars_weight, x_cars_volume]
A = np.column_stack([np.ones(len(x_cars_volume))] + X)
Y = y_cars
coeffs_multi_reversed, _, _, _ = np.linalg.lstsq(A, Y, rcond=None)
coeffs_multi = coeffs_multi_reversed[::-1]

# Plot
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
z = y_cars
x = x_cars_weight
y = x_cars_volume
c = x + y
ax.scatter(x, y, z, c=c)
ax.set_title('$CO_2$ emission')

x1 = coeffs_multi[2]*np.linspace(0,120)
y1 = coeffs_multi[1]*np.linspace(0,120)
z1 = x1 + y1 + coeffs_multi[0]
ax.plot3D(x1, y1, z1, 'gray')

ax.set_xlabel('x - Weight')
ax.set_ylabel('y - Volume')
ax.set_zlabel('z - $CO_2$')
```
My list of data (cars.csv):
```
Car,Model,Volume,Weight,CO2
Toyoty,Aygo,1000,790,99
Mitsubishi,Space Star,1200,1160,95
Skoda,Citigo,1000,929,95
Fiat,500,900,865,90
Mini,Cooper,1500,1140,105
VW,Up!,1000,929,105
Skoda,Fabia,1400,1109,90
Mercedes,A-Class,1500,1365,92
Ford,Fiesta,1500,1112,98
Audi,A1,1600,1150,99
Hyundai,I20,1100,980,99
Suzuki,Swift,1300,990,101
Ford,Fiesta,1000,1112,99
Honda,Civic,1600,1252,94
Hundai,I30,1600,1326,97
Opel,Astra,1600,1330,97
BMW,1,1600,1365,99
Mazda,3,2200,1280,104
Skoda,Rapid,1600,1119,104
Ford,Focus,2000,1328,105
Ford,Mondeo,1600,1584,94
Opel,Insignia,2000,1428,99
Mercedes,C-Class,2100,1365,99
Skoda,Octavia,1600,1415,99
Volvo,S60,2000,1415,99
Mercedes,CLA,1500,1465,102
Audi,A4,2000,1490,104
Audi,A6,2000,1725,114
Volvo,V70,1600,1523,109
BMW,5,2000,1705,114
Mercedes,E-Class,2100,1605,115
Volvo,XC70,2000,1746,117
Ford,B-Max,1600,1235,104
BMW,216,1600,1390,108
Opel,Zafira,1600,1405,109
Mercedes,SLK,2500,1395,120
```
Answer
In order,
- The method appears to be correct but rather long-winded. See below for a more compact alternative.
- Not sure what you mean, but I think this:
```python
x1 = coeffs_multi[2]*np.linspace(0,120)
y1 = coeffs_multi[1]*np.linspace(0,120)
z1 = x1 + y1 + coeffs_multi[0]
```
is not quite correct. The coefficients in `coeffs_multi_reversed` are in the order dictated by `X`, namely ‘constant’, ‘Weight’, ‘Volume’. In `coeffs_multi` they are then ‘Volume’, ‘Weight’, ‘constant’, so the indices above are in the wrong order.
- For the plot I would not use `x1`, `y1` etc. but simply plot actual vs predicted by the model, like so:
```python
predicted = np.array(A) @ coeffs_multi_reversed
ax.scatter(x, y, z, label = 'actual')
ax.scatter(x, y, predicted, label = 'predicted')
```
The graph then shows the actual and the predicted points in the same 3D axes.
- A much more standard way to do regression is as follows:
```python
from sklearn.linear_model import LinearRegression

lin_regr = LinearRegression()
lin_res = lin_regr.fit(x_cars, y_cars)
predicted = lin_regr.predict(x_cars)
print(lin_res.coef_, lin_res.intercept_)
plt.plot(predicted, y_cars, '.', label = 'actual vs predicted')
plt.plot(predicted, predicted, '.', label = 'predicted vs predicted')
plt.legend(loc = 'best')
plt.show()
```
prints
```
[0.00755095 0.00780526] 79.69471929115937
```
and plots the actual values against the predicted ones.
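To make the point about coefficient order concrete, here is a minimal sketch using NumPy only. The data are synthetic and noise-free (made up for illustration, not taken from cars.csv), and `fit_plane` is a hypothetical helper name. It shows that `np.linalg.lstsq` returns the coefficients in the same order as the columns of the design matrix, so no reversal is needed.

```python
import numpy as np

def fit_plane(weight, volume, co2):
    """Least-squares fit of co2 = b0 + b1*weight + b2*volume."""
    # Column order is constant, weight, volume; the coefficients
    # come back in exactly this order.
    A = np.column_stack([np.ones(len(weight)), weight, volume])
    coeffs, *_ = np.linalg.lstsq(A, co2, rcond=None)
    return A, coeffs

# Synthetic data with known coefficients (assumed values, for illustration).
rng = np.random.default_rng(0)
weight = rng.uniform(800, 1800, size=30)
volume = rng.uniform(900, 2500, size=30)
co2 = 80.0 + 0.01 * weight + 0.005 * volume

A, coeffs = fit_plane(weight, volume, co2)
intercept, b_weight, b_volume = coeffs  # unpack in design-matrix order
predicted = A @ coeffs                  # model predictions at the data points
print(intercept, b_weight, b_volume)
```

Because the data here contain no noise, the fitted values should match the constants used to generate them up to floating-point error, which makes the ordering easy to check.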
Edit: plotting 3D grid
To plot predicted output on a grid, you can do something like:
```python
npts = 20

from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')
x = x_cars['Weight']
y = x_cars['Volume']
z = y_cars  # actual CO2 values, as in the question's code
ax.scatter(x, y, z, label = 'actual')

# Regular grid spanning the observed range of both features
x1 = np.linspace(x.min(), x.max(), npts)
y1 = np.linspace(y.min(), y.max(), npts)
x1m,y1m = np.meshgrid(x1,y1)
# Predict on every grid point; columns are in the order Weight, Volume
z1 = lin_regr.predict(np.hstack([x1m.reshape(-1,1),y1m.reshape(-1,1)]))
# marker is passed by keyword: the 4th positional argument of the 3D scatter is zdir
ax.scatter(x1m.reshape(-1,1), y1m.reshape(-1,1), z1, marker='.', s=1, label = 'predicted')

ax.set_xlabel('x - Weight')
ax.set_ylabel('y - Volume')
ax.set_zlabel('z - $CO_2$')
ax.set_title('$CO_2$ emission')

plt.legend(loc = 'best')
plt.show()
```
This gives a 3D scatter of the actual points together with the grid of predicted values.
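The same grid of predictions can also be computed without scikit-learn, directly from the plane equation z = b0 + b1*x + b2*y. The sketch below is one way to do it under a few assumptions: `plane_on_grid` is a hypothetical helper, the coefficients are rounded from the output printed earlier in this answer, and the two short arrays hold only the smallest and largest Weight and Volume values from cars.csv.

```python
import numpy as np

def plane_on_grid(coeffs, x, y, npts=20):
    """Evaluate z = b0 + b1*x + b2*y on an npts-by-npts grid spanning x and y."""
    b0, b1, b2 = coeffs  # order: intercept, x coefficient, y coefficient
    x1 = np.linspace(np.min(x), np.max(x), npts)
    y1 = np.linspace(np.min(y), np.max(y), npts)
    x1m, y1m = np.meshgrid(x1, y1)
    z1m = b0 + b1 * x1m + b2 * y1m  # same 2D shape as x1m and y1m
    return x1m, y1m, z1m

# Intercept, Weight and Volume coefficients, rounded from the printed output.
coeffs = (79.69, 0.00755, 0.00781)
weight = np.array([790.0, 1746.0])   # min and max Weight in cars.csv
volume = np.array([900.0, 2500.0])   # min and max Volume in cars.csv

x1m, y1m, z1m = plane_on_grid(coeffs, weight, volume)
print(z1m.shape)
```

Since the three arrays keep their 2D grid shape, they can be passed to `ax.plot_surface(x1m, y1m, z1m, alpha=0.3)` on a 3D axes to draw the fitted plane as a surface instead of a cloud of points.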