Can evaluation metrics for test set be better than training set?

I am doing some Logistic Regression homework.

I am wondering whether the evaluation metrics for the test set can ever be a bit better than those for the training set (as in my results below). If so, how large a gap is acceptable?

Below are my evaluation results for the test set and the training set; both sets were extracted from the same dataset.

EVALUATION METRICS FOR Test Dataset:
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative                  82                 20
Actual Positive                  10                 93

Accuracy = 0.8536585365853658
Precision = 0.8230088495575221
Recall = 0.9029126213592233
F1 score = 0.8380535530381049

EVALUATION METRICS FOR Training Dataset:
Confusion Matrix:
                 Predicted Negative  Predicted Positive
Actual Negative                 279                 70
Actual Positive                  44                324

Accuracy = 0.8410041841004184
Precision = 0.8223350253807107
Recall = 0.8804347826086957
F1 score = 0.8315648343229267

Answer

Usually this is not the case, but it is not impossible. If you randomly split your data into a test set and a training set, it can happen that the test data fit your model better than the training data. Imagine the extreme case below, where your data consist of values with different levels of noise added to them. If the test data points happen to be the ones with less noise, they will fit the model better. However, such a split into test and training data is very unlikely.

If the test score is just slightly above the training score, this is absolutely normal. If, by chance, the noise not captured by the model is slightly lower in the test data, the test samples will fit the model better. Actually, this is a good sign, because it means that you are not overfitting. You may even be able to increase overall performance by increasing the degrees of freedom of your model.
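
As a rough illustration (a minimal sketch using scikit-learn on a synthetic dataset, not your actual data), you can compare the training and test accuracy of a logistic regression directly; depending on the random split, the test score can come out slightly higher than the training score:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# synthetic binary classification data as a stand-in for the real dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print('train accuracy:', accuracy_score(y_train, model.predict(X_train)))
print('test accuracy: ', accuracy_score(y_test, model.predict(X_test)))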

If the test score is much higher than the training score, you should check whether the split between test and training data has been done in a reasonable way.
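
One way to check this (a sketch, assuming scikit-learn; X and y stand in for your full feature matrix and labels) is to repeat the train/test evaluation over several stratified splits. If the test score is consistently far above the training score, something about the split or the data is likely off:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# replace this synthetic data with your real X and y
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=cv, return_train_score=True)
print('train accuracy per fold:', scores['train_score'])
print('test accuracy per fold: ', scores['test_score'])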

Plot of data and models (generated by the code below):

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    # true underlying relationship
    return 2*x + 3

# extreme case: the training points get much more noise than the test points
x_test = np.random.rand(30)
x_training = np.random.rand(70)
y_test = f(x_test) + 1e-3 * np.random.randn(30)
y_training = f(x_training) + 0.2 * np.random.randn(70)

# fit a straight line to the (noisy) training data only
k, d = np.polyfit(x_training, y_training, 1)

x = np.linspace(0, 1)

plt.plot(x_training, y_training, 'o', label='training data')
plt.plot(x_test, y_test, 'o', label='test data')
plt.plot(x, k*x + d, 'k--', label='fitted model')
plt.plot(x, f(x), 'k:', label='real model')
plt.legend()
plt.show()
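
Continuing from the snippet above, evaluating the fitted line on both sets (mean squared error is just one possible metric here) shows the effect numerically: the nearly noise-free test points lie much closer to the model than the noisy training points.

# mean squared error of the fitted line on both sets
mse_training = np.mean((k*x_training + d - y_training)**2)
mse_test = np.mean((k*x_test + d - y_test)**2)
print('training MSE:', mse_training)
print('test MSE:    ', mse_test)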