Training on multiple datasets with scikit-learn's MLPRegressor


I’m currently training my first neural network on a larger dataset. I have split my training data into several .npy binary files, each containing a batch of 20k training samples. I load the data from the .npy files, apply some simple pre-processing operations, and then train my network by calling the partial_fit method several times in a loop:

for i in range(50):
    regr.partial_fit(X_batch, y_batch)

I have already read that the regular .fit() method cannot train on multiple batches, but that partial_fit, in contrast, should be able to. My first training run always goes well: the loss decreases and I get nice fitting results, so I save my model using joblib.dump. For the next run I use exactly the same script, which loads my data from the .npy files (it doesn’t matter whether I feed the same batch or another one), pre-processes it, this time loads my pre-trained model with joblib.load, and starts the partial_fit loop again. What I always get in the second run is a constant loss over all iterations; the error no longer decreases, no matter which dataset I use:
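For reference, the save/reload workflow described above can be sketched roughly like this (file name and synthetic data are placeholders; in the question the batches come from .npy files):

```python
import numpy as np
from joblib import dump, load
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for one 20k-sample .npy batch
rng = np.random.default_rng(0)
X_batch = rng.normal(size=(200, 10))
y_batch = X_batch @ rng.normal(size=10)

# First run: train incrementally, then save the model
regr = MLPRegressor(hidden_layer_sizes=(50,), random_state=0)
for i in range(50):
    regr.partial_fit(X_batch, y_batch)
dump(regr, "model.joblib")

# Second run: reload and continue -- partial_fit resumes from the
# saved weights instead of reinitializing them
regr2 = load("model.joblib")
loss_before = regr2.loss_
for i in range(50):
    regr2.partial_fit(X_batch, y_batch)
```

If everything is wired correctly, the reloaded model's iteration counter keeps counting up from where the first run stopped, and its loss keeps moving.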

Iteration 51, loss = 3.93268978
Iteration 52, loss = 3.93268978
Iteration 53, loss = 3.93268978
Iteration 54, loss = 3.93268978 ...

What am I doing wrong here? Thanks already!


There are several possibilities.

  1. The model may have converged
  2. There may not be enough passes over the batches (in the example below the model doesn’t converge until ~500 iterations)
  3. (Need more info) joblib.dump and joblib.load may be saving or loading the model in an unexpected way

Instead of calling a script multiple times and dumping the results between iterations, it might be easier to debug if initializing/preprocessing/training/visualizing all happens in one script. Here is a minimal example:

import matplotlib.pyplot as plt
from sklearn.neural_network import MLPRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y)

regr = MLPRegressor()

losses = []
test_performance = []

for _ in range(100):
    # Make 100 passes over the batches

    for batch in range(500, 7501, 500):
        # Perform partial fits on batches of 500 examples

        # Simulate batches, these could also be loaded from `.npy`
        X_train_batch = X_train[batch-500:batch]
        y_train_batch = y_train[batch-500:batch]

        regr.partial_fit(X_train_batch, y_train_batch)

        losses.append(regr.loss_)
        test_performance.append(regr.score(X_test, y_test))

# Plotting results:
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.title.set_text("Training Loss")
ax2.title.set_text("Score on test set")
ax1.plot(range(len(losses)), losses)
ax2.plot(range(len(test_performance)), test_performance)
plt.show()


Graph showing the training loss in the left plot and test score on the right plot. Loss goes to zero, score goes to 1.0.

Source: stackoverflow