The fit of my SGDRegressor won't improve or worsen its performance on the validation set (test) after around 20,000 training records. Even if I switch early_stopping (True/False) or set eta0 to extremely high or low values, there is no change in the behaviour of the “stuck” validation score.
I standardized the features with StandardScaler and shuffled the data before splitting it into training and test sets.
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=85, shuffle=True)
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
# (336144, 10) (144063, 10)
# (336144,) (144063,)
```
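For reference, a minimal sketch of that preprocessing step (synthetic data stands in for the real X and y; fitting the StandardScaler on the training split only and then transforming both splits is the usual pattern to avoid leakage):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# synthetic stand-ins for the real features and labels
# (assumption: the real X has 10 feature columns, as in the shapes above)
rng = np.random.default_rng(85)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))
y = rng.normal(loc=100.0, scale=10.0, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=85, shuffle=True)

# fit the scaler on the training split only, then transform both splits
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (700, 10) (300, 10)
```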
Is anything wrong with my validation code, or is the behaviour explainable by some limitation of SGDRegressor in dealing with this amount of training data?
```python
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
import pandas
import matplotlib.pyplot as plt

scores_test = []
scores_train = []
my_rng = range(10, len(X_train), 30000)
for m in my_rng:
    print(m)
    modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
    modelSGD.fit(X_train[:m], y_train[:m])
    ypred_train = modelSGD.predict(X_train[:m])
    ypred_test = modelSGD.predict(X_test)
    mse_train = mean_squared_error(y_train[:m], ypred_train)
    mse_test = mean_squared_error(y_test, ypred_test)
    scores_train.append(mse_train)
    scores_test.append(mse_test)
```
How can I “force” SGDRegressor to take a larger amount of training data into account and change its performance on the test set?
I am trying to visualize that the model does not change its score on the test set whether it was trained on 30,000 or 300,000 records. That's the reason why I initialize the SGDRegressor inside the loop, so it is trained completely from scratch in every iteration.
As asked by @Nikaido, these are the models' coef_ and intercept_ values after fitting:
```
trainsize: 10,     coef: [ 0.81815135  2.2966633   1.61231584 -0.00339933 -3.03094922  0.12757874 -2.60874563  1.52383531  0.3250487  -0.61251297], intercept: [50.77553038]
trainsize: 30010,  coef: [ 0.19097587 -0.35854903 -0.16142221  0.11281925 -0.66771756  0.55912533  0.90462141 -1.417289    0.50487032 -1.42423654], intercept: [83.28458307]
trainsize: 60010,  coef: [ 0.09848169 -0.1362008  -0.15825232 -0.4401373   0.31664536  0.04960247 -0.37299047  0.6641436   0.02782047 -1.15355052], intercept: [80.87163096]
trainsize: 90010,  coef: [-0.00923631  0.5845441   0.28485334 -0.29528061 -0.30643056  1.20320208  1.9723999  -0.47707621  1.25355186 -2.04990825], intercept: [85.17812028]
trainsize: 120010, coef: [-0.04959943 -0.15744169 -0.17071373 -0.20829149 -1.38683906  2.18572481  1.43380752 -1.48133799  2.18962484 -3.41135224], intercept: [86.40188522]
trainsize: 150010, coef: [ 0.56190926  0.05052168  0.22624504  0.55751301 -0.50829818  1.27571154  1.49847285 -0.15134682  1.30017967 -0.88259823], intercept: [83.69264344]
trainsize: 180010, coef: [ 0.17765624  0.1137466   0.15081498 -0.51520765 -1.00811419 -0.13203398  1.28565565 -0.03594421 -0.08053252 -2.31793746], intercept: [85.21824705]
trainsize: 210010, coef: [-0.53937513 -0.33872786 -0.44854466  0.70039384 -0.77073389  0.4361326   0.88175392 -0.32460908  0.5141777  -1.5123801 ], intercept: [82.75353293]
trainsize: 240010, coef: [ 0.70748011 -0.08992019  0.25365326  0.61999278 -0.29374005  0.25833863 -0.00485613 -0.21211637  0.19286126 -1.09503691], intercept: [85.76414815]
trainsize: 270010, coef: [ 0.73787648  0.30155102  0.44013832 -0.2355825   0.26255699  1.55410066  0.4733571   0.85352683  1.4399516  -1.73360843], intercept: [84.19473044]
trainsize: 300010, coef: [ 0.04861321 -0.35446415 -0.17774692 -0.1060901  -0.5864299   1.03429399  0.57160049 -0.13900199  1.09189946 -1.26298814], intercept: [83.14797646]
trainsize: 330010, coef: [ 0.20214825  0.22605839  0.17022397  0.28191112 -1.05982574  0.74025932  0.04981973 -0.27232538  0.72094765 -0.94875017], intercept: [81.97656309]
```
@Nikaido asked for it: this is the distribution of the data. The very similar distribution of the train/test features comes from the fact that the original values are categories (in a range of 1-9) or deconstructed timestamps such as NumberOfMonth, DayOfWeek, Hours and Minutes.
The labels plot shows a lack of normal distribution around 100. The reason for this: missing values have been replaced by the global average of each category, which was between 80 and 95.
Furthermore, I have created a plot that zooms into the validation curve, generated by the code snippet above after changing:

```python
my_rng = range(1000, len(X_train) - 200000, 2000)
```
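The plot itself can be produced from the collected scores with something like the following sketch (stand-in values are used here so the snippet runs on its own; in the real code my_rng, scores_train and scores_test come from the training loop above):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# stand-in values, for illustration only
my_rng = range(1000, 11000, 2000)
scores_train = [140.0, 120.0, 115.0, 113.0, 112.0]
scores_test = [160.0, 130.0, 122.0, 119.0, 118.0]

plt.plot(list(my_rng), scores_train, label="train MSE")
plt.plot(list(my_rng), scores_test, label="test MSE")
plt.xlabel("number of training records")
plt.ylabel("MSE")
plt.legend()
plt.savefig("learning_curve.png")
```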
EDIT: Regarding your output, my guess is that your results are so close on the validation set because a linear model like SGDRegressor tends to underfit complex data.
To see this, you can check the weights output by the model at every iteration. You'll see that they are the same, or really close.
To enhance variability in the output you need to introduce non-linearity and complexity.
You are obtaining what is referred to as “bias” in machine learning (as opposed to “variance”).
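One standard way to add that non-linearity is to expand the features before the linear SGD step, e.g. with PolynomialFeatures. A sketch under assumed synthetic data (the target is deliberately non-linear, so a plain linear model underfits it):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import SGDRegressor

# synthetic data with a non-linear target (illustrative, not the asker's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = X[:, 0] * X[:, 1] + X[:, 2] ** 2

# degree-2 expansion adds squares and pairwise products before the linear step
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(alpha=0.00001, penalty='l1', random_state=0),
)
model.fit(X, y)

# the same SGDRegressor without the expansion stays stuck near R^2 = 0
linear = SGDRegressor(alpha=0.00001, penalty='l1', random_state=0).fit(X, y)
print(model.score(X, y), linear.score(X, y))
```

Here the expanded model can represent the interaction and squared terms exactly, so its R^2 is far above the plain linear baseline.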
I think I got it now.
@SamAmani: In the end I think the problem is underfitting, combined with the fact that you are using incremental sizes of the dataset. The model underfits quite fast (which means that the model gets stuck at the beginning on a more or less fixed model).
Only the first training outputs a different result for the test set, because it hasn't reached the final model yet, more or less.
The underlying variability is in the incremental training sets. Simply put, the test results are a more accurate estimate of the performance of the underfitted model, and adding training samples will in the end bring the test and training results close together without improving either by much.
You can check whether the incremental subsets of the training data differ from the test set. What you did wrong was to check the stats on the whole training set only.
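One quick way to run that check, sketched with synthetic stand-ins for the real X_train and X_test: compare per-feature means of each incremental slice against the test set.

```python
import numpy as np

# synthetic stand-ins for the real splits (same shapes as in the question)
rng = np.random.default_rng(85)
X_train = rng.normal(size=(330000, 10))
X_test = rng.normal(size=(140000, 10))

test_mean = X_test.mean(axis=0)
for m in range(30000, 330001, 30000):
    # largest per-feature gap between this incremental slice and the test set
    drift = float(np.abs(X_train[:m].mean(axis=0) - test_mean).max())
    print(m, round(drift, 4))
```

If a slice's statistics drift noticeably from the test set's, the incremental subsets are not representative samples of the same distribution.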
First of all, why are you training on incremental training set sizes? The strange results are due to the fact that you are training your model on incrementally larger subsets of your dataset.
When you do this:
```python
for m in my_rng:
    modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
    modelSGD.fit(X_train[:m], y_train[:m])
    [...]
```
you are basically training your model in an incremental fashion, with these incremental sizes:
```python
for m in range(10, 180001, 30000):
    print(m)

# 10
# 30010
# 60010
# 90010
# 120010
# 150010
```
If you are trying to do mini-batch gradient descent, you should split your dataset into independent batches instead of making incremental batches. Something like this:
```python
previous = 0
for m in range(30000, 180001, 30000):
    modelSGD.partial_fit(X_train[previous:m], y_train[previous:m])
    previous = m

# training set ranges
# 0      30000
# 30000  60000
# 60000  90000
# 90000  120000
# 120000 150000
# 150000 180000
```
Also note that I am using the partial_fit method instead of fit (because I am not retraining the model from scratch; I am making only one step, one iteration, of the gradient descent), and that I am not initializing a new model every time (my SGD initialization is outside the for loop). The full code should be something like this:
```python
previous = 0
modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
# start at 30000 so the first batch X_train[0:30000] is not empty
for m in range(30000, len(X_train) + 1, 30000):
    modelSGD.partial_fit(X_train[previous:m], y_train[previous:m])
    ypred_train = modelSGD.predict(X_train[previous:m])
    ypred_test = modelSGD.predict(X_test)
    mse_train = mean_squared_error(y_train[previous:m], ypred_train)
    mse_test = mean_squared_error(y_test, ypred_test)
    scores_train.append(mse_train)
    scores_test.append(mse_test)
    previous = m  # advance to the next batch
```
In this way you are simulating one epoch of mini-batch stochastic gradient descent. To run more epochs, an outer loop is needed.
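That outer loop could look like the following sketch (synthetic data with a simple linear target; reshuffling with a NumPy permutation between epochs is my assumption, not part of the original code):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# synthetic stand-in data with a noise-free linear target
rng = np.random.default_rng(0)
X_train = rng.normal(size=(90000, 10))
y_train = X_train @ rng.normal(size=10) + 100.0

modelSGD = SGDRegressor(alpha=0.00001, penalty='l1', random_state=0)
batch_size = 30000
for epoch in range(3):                        # outer loop: one pass per epoch
    order = rng.permutation(len(X_train))     # reshuffle between epochs
    for start in range(0, len(X_train), batch_size):
        idx = order[start:start + batch_size]
        modelSGD.partial_fit(X_train[idx], y_train[idx])

print(modelSGD.score(X_train, y_train))
```

Each epoch visits every batch once; shuffling between epochs keeps the gradient steps from repeating the exact same batch sequence.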
SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.