SGDRegressor() constantly not increasing validation performance

Tags: , , , ,



The model fit of my SGDRegressor wont increase or decrease its performance on the validation set (test) after around 20’000 training records. Even if I try to switch penalty, early_stopping (True/False) or alpha,eta0 to extremely high or low levels, there is no change in the behaviour of the “stuck” validation score test.

enter image description here

I used StandardScaler and shuffled the data for trainig- and testset before.

train_test_split(X,y, test_size = 0.3, random_state=85, shuffle=True)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
>>>(336144, 10) (144063, 10)
>>>(336144,) (144063,)

Is anything wrong with my validation code or is the behaviour explainable because of a limitation dealing on training-data that SGDRegressor has?

from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
import pandas
import matplotlib.pyplot as plt


scores_test = []
scores_train = []
my_rng = range(10,len(X_train),30000)
for m in my_rng:
    print(m)
    modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
    modelSGD.fit(X_train[:m], y_train[:m])
    
    ypred_train = modelSGD.predict(X_train[:m])
    ypred_test = modelSGD.predict(X_test)
    mse_train = mean_squared_error(y_train[:m], ypred_train)
    mse_test = mean_squared_error(y_test, ypred_test)
    scores_train.append(mse_train)
    scores_test.append(mse_test)

How can I “force” SGDRegressor to respect a larger amount of training data and change its performance on the test data?

Edit: I am trying to visualize that the model does not change its score on test after being trained by 30’000 or 300’000 records. That’s the reason why I’m initializing the SGDRegressor within the loop, so it is completely newly trained in every iteration.

As asked by @Nikaido, these are the models coef_, intercept_ after fitting:

trainsize: 10, coef:  [ 0.81815135  2.2966633   1.61231584 -0.00339933 -3.03094922  0.12757874  -2.60874563  1.52383531  0.3250487  -0.61251297], intercept:  [50.77553038]
trainsize: 30010, coef:  [ 0.19097587 -0.35854903 -0.16142221  0.11281925 -0.66771756  0.55912533   0.90462141 -1.417289    0.50487032 -1.42423654], intercept:  [83.28458307]
trainsize: 60010, coef:  [ 0.09848169 -0.1362008  -0.15825232 -0.4401373   0.31664536  0.04960247  -0.37299047  0.6641436   0.02782047 -1.15355052], intercept:  [80.87163096]
trainsize: 90010, coef:  [-0.00923631  0.5845441   0.28485334 -0.29528061 -0.30643056  1.20320208   1.9723999  -0.47707621  1.25355186 -2.04990825], intercept:  [85.17812028]
trainsize: 120010, coef:  [-0.04959943 -0.15744169 -0.17071373 -0.20829149 -1.38683906  2.18572481   1.43380752 -1.48133799  2.18962484 -3.41135224], intercept:  [86.40188522]
trainsize: 150010, coef:  [ 0.56190926  0.05052168  0.22624504  0.55751301 -0.50829818  1.27571154   1.49847285 -0.15134682  1.30017967 -0.88259823], intercept:  [83.69264344]
trainsize: 180010, coef:  [ 0.17765624  0.1137466   0.15081498 -0.51520765 -1.00811419 -0.13203398   1.28565565 -0.03594421 -0.08053252 -2.31793746], intercept:  [85.21824705]
trainsize: 210010, coef:  [-0.53937513 -0.33872786 -0.44854466  0.70039384 -0.77073389  0.4361326   0.88175392 -0.32460908  0.5141777  -1.5123801 ], intercept:  [82.75353293]
trainsize: 240010, coef:  [ 0.70748011 -0.08992019  0.25365326  0.61999278 -0.29374005  0.25833863  -0.00485613 -0.21211637  0.19286126 -1.09503691], intercept:  [85.76414815]
trainsize: 270010, coef:  [ 0.73787648  0.30155102  0.44013832 -0.2355825   0.26255699  1.55410066   0.4733571   0.85352683  1.4399516  -1.73360843], intercept:  [84.19473044]
trainsize: 300010, coef:  [ 0.04861321 -0.35446415 -0.17774692 -0.1060901  -0.5864299   1.03429399   0.57160049 -0.13900199  1.09189946 -1.26298814], intercept:  [83.14797646]
trainsize: 330010, coef:  [ 0.20214825  0.22605839  0.17022397  0.28191112 -1.05982574  0.74025932   0.04981973 -0.27232538  0.72094765 -0.94875017], intercept:  [81.97656309]

Edit2: @Nikaido asked for it: This is the distribution of the Data. Very similar distributed train- / testdata features comes due to the origin values that are categories (of a range 1-9) or deconstructed timestamps as NumberOfMonth, DayOfWeek, Hours, Minutes. The labels plot shows a lack of normal distribution arround 100. The reason for this: missing values have been replaced by the global average of each category, wich was between 80 an 95.

enter image description here

Further on I have created a plot that shows the validation zoom generated by the code snippet above by changing:

my_rng = range(1000,len(X_train)-200000,2000)

SGD typical jumping around an optimum is visible. But anyways the tendence of the test-score does not change in any significant way as records of the training-set increase. enter image description here

Answer

EDIT: Regarding your output, my guess is that your results are so close for the validation set because a linear model like SGDregressor tends to underfit on complex data

To see this you can check the weights outputted by the model at every iteration. You’ll see that they are the same or really close

To enhance variability in the output you need to introduce non linearity and complexity

You are obtaining what is referred as “Bias” in machine learning (in contraposition to the “variance”)

I think I got it now.

SamAmani In the end I think that the problem is underfitting. And the fact that you are using incremental sizes of the dataset. The model underfit quite fast (which means that the model is stuck at the beginning to a more or less fixed model)

Only the first training output a different result for the test set because it hasn’t reached the final model, more or less

The underlying variability is in the incremental training sets. Simply speaking the test results are a more accurate estimate of the performance of the underfitted model. And adding training sample will lead in the end to near results between test and training without improving too much.

You can check the fact that are the incremental datasets of the training to be different from the test set. What you did wrong was to check the stats on all the training set


First of all, why are you training on incremental training set size? The strange results are due to the fact that you are training in an incremental fashion your dataset.

When you do this:

for m in my_rng:
    modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
    modelSGD.fit(X_train[:m], y_train[:m])
    [...]

you are basically training your model in incremental fashion, with this incremental sizes:

for m in range(10, 180001, 30000):
    print(m)


10
30010
60010
90010
120010
150010

If you are trying to make mini-batch gradient descent, you should split your dataset in independent batches instead of making incremental batches. Something like this:

previous = 0
for m in range(30000, 180001, 30000):
    modelSGD.partial_fit(X_train[previous:m], y_train[previous:m])
    previous = m

# training set ranges
0 30000
30000 60000
60000 90000
90000 120000
120000 150000
150000 180000

Also note that I am using partial_fit method, instead of fit (because I am not retraining the model from zero and I am making only a step, iteration of the gradient descent), and I am not going to initialize a new model every time (my sgd initialization is out of the for loop). The full code should be something like this:

my_rng = range(0 ,len(X_train), 30000)
previous = 0
modelSGD = SGDRegressor(alpha=0.00001, penalty='l1')
for m in my_rng:
    modelSGD.partial_fit(X_train[previous:m], y_train[previous:m])
    ypred_train = modelSGD.predict(X_train[previous:m])
    ypred_test = modelSGD.predict(X_test)
    mse_train = mean_squared_error(y_train[previous:m], ypred_train)
    mse_test = mean_squared_error(y_test, ypred_test)
    scores_train.append(mse_train)
    scores_test.append(mse_test)

In this way you are simulating one epoch mini-batch stochastic gradient. To make more epochs an outer loop is needed

From sklearn:

SGD allows minibatch (online/out-of-core) learning via the partial_fit method. For best results using the default learning rate schedule, the data should have zero mean and unit variance.

Details here



Source: stackoverflow