SKLearn & ElasticNet: Cross validation fails when using Accuracy as a metric

I have a binary classification problem and have been using cross-validation to optimize the ElasticNet parameters. However, ElasticNet only seems to work when I supply roc_auc as the scoring method for CV. I also want to try a wide range of scoring methods, in particular accuracy. When I use accuracy, ElasticNet returns this error:

ValueError: Classification metrics can't handle a mix of binary and continuous targets

However my y targets are indeed binary. Below is a replication of my problem using the dataset from here:

import numpy as np
import pandas as pd

from sklearn.preprocessing import LabelBinarizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, recall_score, accuracy_score, precision_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import ElasticNet

data = pd.read_csv('data 2.csv')
# by default majority class (benign) will be negative
lb = LabelBinarizer()
data['diagnosis'] = lb.fit_transform(data['diagnosis'].values)
targets = data['diagnosis']
data.drop(['id', 'diagnosis', 'Unnamed: 32'], axis=1, inplace=True)
X_train, X_test, y_train, y_test = train_test_split(data, targets, stratify=targets)

#elastic net logistic regression
lr = ElasticNet(max_iter=2000)
scorer = 'accuracy'
param_grid = {
    'alpha': [1e-4, 1e-3, 1e-2, 0.1, 1, 5, 10],
    'l1_ratio': np.arange(0.2, 0.9, 0.1)
}
skf = StratifiedKFold(n_splits=10)
clf = GridSearchCV(lr, param_grid, scoring=scorer, cv=skf, return_train_score=True,
                    n_jobs=-1)
clf.fit(X_train.values, y_train.values)

I figured that ElasticNet might be trying to solve a linear regression problem, so I tried lr = LogisticRegression(penalty='elasticnet', l1_ratios=[0.1, 0.5, 0.9], solver='saga') as the classifier, but the same problem persists.

If I use scorer = 'roc_auc' as the scoring metric, the model is built as expected.

Also, as a sanity check to see whether something is wrong with the data, I tried the same thing with a random forest classifier, and here the problem disappears:

# random forest
clf = RandomForestClassifier(n_jobs=-1)
param_grid = {
    'min_samples_split': [3, 5, 10],
    'n_estimators' : [100, 300],
    'max_depth': [3, 5, 15, 25],
    'max_features': [3, 5, 10, 20]
}
skf = StratifiedKFold(n_splits=10)
scorer = 'accuracy'
grid_search = GridSearchCV(clf, param_grid, scoring=scorer,
                        cv=skf, return_train_score=True, n_jobs=-1)
grid_search.fit(X_train.values, y_train.values)

Has anyone got any ideas on what’s happening here?

Answer

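ElasticNet is a regression model: its predict method returns continuous values rather than class labels. Accuracy compares predicted labels with the true labels, so it raises the "mix of binary and continuous targets" error, whereas roc_auc only needs a ranking of continuous scores, which is why it ran without complaint.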
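The mismatch can be reproduced in a few lines. This is a minimal sketch on synthetic data from make_classification (an assumption, not the asker's dataset); the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import ElasticNet
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.utils.multiclass import type_of_target

# Synthetic binary problem (stand-in for the real data)
X, y = make_classification(n_samples=200, random_state=0)

# ElasticNet is a regressor, so predict() returns real-valued outputs
reg = ElasticNet(alpha=0.01).fit(X, y)
y_pred = reg.predict(X)

print(type_of_target(y))       # the true labels are a binary target
print(type_of_target(y_pred))  # the predictions are a continuous target

# accuracy_score expects class labels on both sides and rejects the mix
error_message = None
try:
    accuracy_score(y, y_pred)
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# roc_auc_score accepts continuous scores, so it does not raise
auc = roc_auc_score(y, y_pred)
print(auc)
```

So the targets really are binary; it is the predictions that are continuous.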

If you want an ElasticNet penalty in classification, use LogisticRegression:

# l1_ratio=0.5 is only a placeholder; tune it, e.g. with GridSearchCV
lr = LogisticRegression(solver="saga", penalty="elasticnet", l1_ratio=0.5)
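Note that LogisticRegression takes a single l1_ratio, while the plural l1_ratios used in the question is a parameter of LogisticRegressionCV, which does its own cross-validated search. A rough sketch of that alternative, assuming synthetic data and an arbitrary choice of Cs and fold count:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem (stand-in for the real data)
X, y = make_classification(n_samples=300, random_state=0)

# LogisticRegressionCV searches over Cs and l1_ratios internally
clf = LogisticRegressionCV(
    penalty="elasticnet",
    solver="saga",                  # saga is the solver that supports elasticnet
    l1_ratios=[0.1, 0.5, 0.9],      # plural: candidate mixing ratios
    Cs=[0.01, 0.1, 1, 10],          # candidate inverse regularization strengths
    cv=StratifiedKFold(n_splits=5),
    scoring="accuracy",
    max_iter=5000,
)
clf.fit(X, y)

# Selected hyperparameters and the kind of output predict() gives
print(clf.C_, clf.l1_ratio_)
predicted_labels = set(clf.predict(X))
print(predicted_labels)
```

Because this is a classifier, predict returns class labels and accuracy works as a scoring metric.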

Minimal Reproducible Example:

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

lr = LogisticRegression(solver="saga", penalty="elasticnet", max_iter=2000)

param_grid = {
    'l1_ratio': np.arange(0.2, 0.9, 0.1)
}

clf = GridSearchCV(lr, param_grid, scoring='accuracy', cv=StratifiedKFold(n_splits=10), return_train_score=True, n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
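
The question also tunes the regularization strength and wants to compare several metrics. In LogisticRegression the strength is controlled by C, the inverse of the regularization strength, rather than alpha. The sketch below is one possible setup, not the only one: the grid values, the choice of metrics, and the StandardScaler step (added because saga tends to converge faster on scaled features) are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem (stand-in for the real data)
X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Scale inside the pipeline so each CV fold is scaled on its own training part
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        solver="saga", penalty="elasticnet", l1_ratio=0.5, max_iter=5000
    ),
)

# make_pipeline names the step after the lowercased class name
param_grid = {
    "logisticregression__C": [0.01, 0.1, 1, 10],
    "logisticregression__l1_ratio": [0.2, 0.5, 0.8],
}

# Several metrics are recorded; refit picks the one used to choose the best model
search = GridSearchCV(
    pipe,
    param_grid,
    scoring=["accuracy", "roc_auc", "recall"],
    refit="accuracy",
    cv=StratifiedKFold(n_splits=5),
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.cv_results_["mean_test_roc_auc"])

# With multi-metric scoring, score() uses the metric named in refit
test_accuracy = search.score(X_test, y_test)
print(test_accuracy)
```

The per-metric results are stored in cv_results_ under keys such as mean_test_accuracy and mean_test_recall, so all metrics can be compared from a single search.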