In my attempt to dig deeper in the math behind machine learning models, I’m implementing a Ordinary Least Square algorithm in Python, using vectorization. My references are:
- https://github.com/paulaceccon/courses/blob/main/machine_learning_specialization/supervisioned_regression/2_multiple_regression.pdf
- https://www.geeksforgeeks.org/linear-regression-implementation-from-scratch-using-python/
This is what I have now:
import matplotlib.pyplot as plt import numpy as np import seaborn as sns from sklearn import datasets from sklearn.preprocessing import StandardScaler %matplotlib inline X, y = datasets.load_diabetes(return_X_y=True) # We only take the first feature (for visualization purposes). X = X[:, np.newaxis, 2] # Split the data into training/testing sets X_train = X[:-20] X_test = X[-20:] y_train = y[:-20] y_test = y[-20:] # Input data sns.scatterplot( x=X_train[:, 0], y=y_train, label="train", edgecolor=None, color="blue" ) # To predict sns.scatterplot( x=X_test[:, 0], y=y_test, label="test", edgecolor=None, marker="*", color="red", ); class LinearRegression: """ Ordinary least squares Linear Regression. Args: """ def __init__(self, learning_rate: float = 0.01, tolerance: float = 1e4, standardize: bool = True): # TODO: standardize if required self._learning_rate: float = learning_rate self._tolerance: float = tolerance self._standardize: bool = standardize self._fitted: bool = False def fit(self, X: np.ndarray, y: np.ndarray) -> None: """Fit linear model.""" self._X: np.ndarray = X self._y: np.ndarray = y[:, np.newaxis] self._m, self._n = self._X.shape # rows, features self._weights: np.ndarray = np.zeros((self._n, 1)) self._train() def predict(self, X: np.ndarray, add_bias: bool = True) -> np.ndarray: """Predict using the linear model.""" assert self._fitted, "Model not fitted." if add_bias: X = np.c_[np.ones((X.shape[0], 1)), X] predictions = np.dot(X, self._weights) return predictions def _train(self) -> None: """ Generate the clusters from the traning data. Algorithm: 1. Initiliaze weights. 2. Compute the cost. 3. Calculate the gradient. 4. Update weights. 4. Repeat from 2 until convergence. """ # Add bias term self._X = np.c_[np.ones((self._m, 1)), self._X] self._weights = np.r_[np.ones((1, 1)), self._weights] self._fitted = True converged = False iterations = 0 while not converged: iterations += 1 y_hat = self.predict(self._X, add_bias=False) residuals = self._residuals(self._y, y_hat) gradients = self._gradients(self._X, residuals) self._weights -= self._learning_rate * gradients gradient_magnitude = np.linalg.norm(gradients) print(gradient_magnitude) if gradient_magnitude < self._tolerance: converged = True print(self._weights) print(iterations) def _residuals(self, y: np.ndarray, y_hat: np.ndarray) -> np.ndarray: residuals = y - y_hat return residuals def _gradients(self, X: np.ndarray, residuals: np.ndarray) -> np.ndarray: gradients = -2 * np.dot(X.T, residuals) return gradients scaler = StandardScaler() X_train = scaler.fit_transform(X_train) clf = LinearRegression() clf.fit(X_train, y_train)
The problem I’m facing is that my weights keep increasing until I end up getting a bunch of nans. I’ve been trying to find out what I’m missing, but so far no luck. Also tried to tweak the tolerance threshold, but I don’t think that’s the issue, but something wrong with my math.
Advertisement
Answer
Your code seems actually to work fine; except for learning rate, really! Just reduce it from 0.01
to e.g. 0.0001
and everything works fine (well, I would also reduce tolerance to something much much smaller, like 1e-5
, to make sure it actually converges to the right solution).
Small image showing that it works:
clf = LinearRegression(learning_rate=0.0001) clf.fit(X_train, y_train) b, m = clf._weights[:, 0] plt.scatter(X_train[:, 0], y_train) plt.plot([-2, 4], [-2 * m + b, 4 * m + b])
gives
Linear regression is a convex optimization problem, so you can imagine it like putting a ball on a parabola and then moving it towards the bottom by a fixed amount of space multiplied by the slope of the position you’re at. If that “fixed amount” is small enough, you get closer and closer to the bottom, until you find the optimum position. But if you get the value too large, you jump from one side of the parabola to the other, and if it’s large enough you land in a place which is actually higher than where you started from. Iterate this a few times and you get indeed in the exact situation you had…