
BERT DataLoader: Difference between shuffle=True vs Sampler?

I trained a DistilBERT model with DistilBertForTokenClassification on CoNLL data for NER prediction. Training seems to have completed with no problems, but I have two problems during the evaluation phase.

  1. I’m getting a negative loss value.

  2. During training I used shuffle=True for the DataLoader. But during evaluation, if I also use shuffle=True for the DataLoader, I get very poor metric results (F1, accuracy, recall, etc.). If I use shuffle=False or a Sampler instead of shuffling, I get pretty good metric results. I’m wondering if there is anything wrong with my code.

Here is the evaluation code:


print('Prediction started on test data')
model.eval()

eval_loss = 0
predictions , true_labels = [], []

for batch in val_loader:
  b_input_ids = batch['input_ids'].to(device)
  b_input_mask = batch['attention_mask'].to(device)
  b_labels = batch['labels'].to(device)

  with torch.no_grad():
      outputs = model(b_input_ids, 
                      attention_mask=b_input_mask)

  logits = outputs[0]
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.detach().cpu().numpy()
  
  predictions.append(logits)
  true_labels.append(label_ids)

  eval_loss += outputs[0].mean().item()


print('Prediction completed')
eval_loss = eval_loss / len(val_loader)
print("Validation loss: {}".format(eval_loss))

out:

Prediction started on test data
Prediction completed
Validation loss: -0.2584906197858579

I believe I’m calculating the loss wrong here. Is it possible to get negative loss values with BERT?

For DataLoader, if I use the code snippet below, I have no problems with the metric results.

from torch.utils.data import DataLoader, SequentialSampler

val_sampler = SequentialSampler(val_dataset)
val_loader = DataLoader(val_dataset, sampler=val_sampler, batch_size=128)

But if I use this one, I get very poor metric results:

val_loader = DataLoader(val_dataset, batch_size=128, shuffle=True)

Is it normal that I’m getting vastly different results with shuffle=True vs shuffle=False ?

Code for the metric calculation (a sketch of how true_predictions can be built follows the outputs below):

from datasets import load_metric

metric = load_metric("seqeval")
results = metric.compute(predictions=true_predictions, references=true_labels)
results

out:

{'LOCATION': {'f1': 0.9588207767898924,
  'number': 2134,
  'precision': 0.9574766355140187,
  'recall': 0.9601686972820993},
 'MISC': {'f1': 0.8658965344048217,
  'number': 995,
  'precision': 0.8654618473895582,
  'recall': 0.8663316582914573},
 'ORGANIZATION': {'f1': 0.9066332916145182,
  'number': 1971,
  'precision': 0.8947628458498024,
  'recall': 0.9188229325215627},
 'PERSON': {'f1': 0.9632426988922457,
  'number': 2015,
  'precision': 0.9775166070516096,
  'recall': 0.9493796526054591},
 'overall_accuracy': 0.988255561629313,
 'overall_f1': 0.9324058459808882,
 'overall_precision': 0.9322748349023465,
 'overall_recall': 0.932536893886156}

The above metrics are printed when I use Sampler or shuffle=False. If I use shuffle=True, I get:

{'LOCATION': {'f1': 0.03902284263959391,
  'number': 2134,
  'precision': 0.029496402877697843,
  'recall': 0.057638238050609185},
 'MISC': {'f1': 0.010318142734307824,
  'number': 995,
  'precision': 0.009015777610818933,
  'recall': 0.012060301507537688},
 'ORGANIZATION': {'f1': 0.027420984269014285,
  'number': 1971,
  'precision': 0.019160951996772892,
  'recall': 0.04819888381532217},
 'PERSON': {'f1': 0.02119907254057635,
  'number': 2015,
  'precision': 0.01590852597564007,
  'recall': 0.03176178660049628},
 'overall_accuracy': 0.5651741788003777,
 'overall_f1': 0.02722600361161272,
 'overall_precision': 0.020301063389034663,
 'overall_recall': 0.041321152494729445}
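(The code that builds true_predictions from the collected logits is not shown above. Below is a minimal sketch of the usual approach for seqeval, assuming label_list maps label ids to tag strings, -100 marks ignored sub-word/padding positions, and every batch is padded to the same sequence length; none of these names come from the original code.)

import numpy as np

# Stack the per-batch arrays collected in the evaluation loop
pred_ids = np.concatenate(predictions, axis=0).argmax(axis=2)  # (num_examples, seq_len)
label_ids = np.concatenate(true_labels, axis=0)                # (num_examples, seq_len)

# Drop ignored positions (label -100) and map ids to tag strings.
# These converted lists are what metric.compute() receives above.
true_predictions = [
    [label_list[p] for p, l in zip(pred_row, lab_row) if l != -100]
    for pred_row, lab_row in zip(pred_ids, label_ids)
]
true_labels = [
    [label_list[l] for p, l in zip(pred_row, lab_row) if l != -100]
    for pred_row, lab_row in zip(pred_ids, label_ids)
]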

UPDATE: I modified the loss calculation for evaluation. There seems to be no problem with this code now. You can see the new code below:

print('Prediction started on test data')
model.eval()

eval_loss = 0
predictions , true_labels = [], []

for batch in val_loader:

  b_labels = batch['labels'].to(device)

  batch = {k:v.type(torch.long).to(device) for k,v in batch.items()}
  
  with torch.no_grad():
      outputs = model(**batch)

      loss, logits = outputs[0:2]
      logits = logits.detach().cpu().numpy()
      label_ids = b_labels.detach().cpu().numpy()
  
      predictions.append(logits)
      true_labels.append(label_ids)

      eval_loss += loss.item()  # accumulate a Python float instead of a tensor


print('Prediction completed')
eval_loss = eval_loss / len(val_loader)
print("Validation loss: {}".format(eval_loss))

Though I still haven’t got an answer to the DataLoader question. Also, I just realised that when I do print(model.eval()), the printed model still lists dropout layers even in evaluation mode.
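(A small check on that last point, assuming model is the DistilBertForTokenClassification instance from above: printing a module always lists every layer it contains, including Dropout, regardless of mode; what model.eval() actually does is set each submodule’s training flag to False, which disables dropout at runtime.)

import torch

model.eval()
print(model.training)  # False -> the model is in evaluation mode

# Every dropout layer should report training == False, i.e. it is a no-op now
dropout_active = [m.training for m in model.modules() if isinstance(m, torch.nn.Dropout)]
print(any(dropout_active))  # False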


Answer

As far as I understand, the answer is pretty simple:

“I saw my father do it this way, and his father was also doing it this way, so I’m also doing it this way”.

I’ve looked around a lot of notebooks to see how people load the data for validation, and in every notebook I saw people using a SequentialSampler for validation. Nobody uses shuffling or random sampling during validation. I don’t know exactly why, but this is the case. So if anyone visiting this post is wondering the same thing, the answer is basically what I quoted above.

Also, I edited the original post for the loss problem I was having; I was calculating it wrong. Apparently BERT returns the loss at index 0 of the output (outputs[0]) only if you also feed the model the original labels. In the first code snippet, when I was getting the outputs from the model, I was not feeding it the original labels, so it was not returning a loss value at index 0 but only the logits.

Basically what you need to do is:

outputs = model(input_ids, attention_mask=mask, labels=labels)
loss = outputs[0]
logits = outputs[1]
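As a usage note, with recent versions of transformers (v4+, where return_dict defaults to True) the model returns an output object, so the same values can be read by name instead of by index; this is just an equivalent sketch using the variables above:

outputs = model(input_ids, attention_mask=mask, labels=labels)
loss = outputs.loss      # only present because labels were passed
logits = outputs.logits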