
How do I interpret my BERT output from Huggingface Transformers for Sequence Classification and tensorflow?

TL;DR: I am using BERT for a sequence classification task and don't understand the output I get.

This is my first post, so please bear with me: I am using BERT for a sequence classification task with 3 labels. To do this, I am using Huggingface Transformers with TensorFlow, more specifically the TFBertForSequenceClassification class with the bert-base-german-cased model (yes, using German sentences).

I am by no means an expert in NLP, which is why I pretty much followed the approach here: https://towardsdatascience.com/fine-tuning-hugging-face-model-with-custom-dataset-82b8092f5333 (with some tweaks, of course).

Everything seems to be working fine, but the output I receive from my model is what throws me off. Here’s just some of the output along the way for context.

The main difference from the article's example is the number of labels: I have 3, while the article only featured 2.

I use a LabelEncoder from sklearn.preprocessing to process my labels. Y here is a list of labels as strings, which the encoder turns into a list of integers, one per class.
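As a minimal sketch of that encoding step (the label strings below are hypothetical stand-ins, not the actual German class names from my dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels -- stand-ins for the real class names
y = ["positive", "negative", "neutral", "positive"]

encoder = LabelEncoder()
# Classes are assigned integers in sorted order:
# negative=0, neutral=1, positive=2
y_encoded = encoder.fit_transform(y)

print(list(y_encoded))  # [2, 0, 1, 2]
```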

I then use the BertTokenizer to process my text and create the input datasets (training and testing).

I then train the model as per the Huggingface docs.

Then I run model.predict on an example sentence (yes, I tokenized the sentence accordingly, just like the article does) and get back an array of three raw logits, one per label.

And lastly, I apply a softmax function to those logits to turn them into probabilities.
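A minimal sketch of that softmax step, assuming the logits come back as a NumPy array (the logit values here are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability;
    # the result is a probability distribution summing to 1
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

logits = np.array([1.2, 0.3, 0.1])  # hypothetical output of model.predict for one sentence
probs = softmax(logits)
print(probs, probs.sum())  # three class probabilities that sum to 1.0
```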

So here's my question: I don't quite understand that output. With a validation accuracy of ~70%, my model should be okay at predicting the labels. Yet the raw logits from the direct output don't mean much to me, and the output after the softmax function seems to be on a linear scale, as if it came from a sigmoid function. How do I interpret this and translate it to the label I am trying to predict?

And also: shouldn't I feed one-hot encoded labels into my BERT model for it to work? I always thought BERT needs that, but it seems like it doesn't.

Thanks a lot in advance!


Answer

Your output means that the probability of the first class is 65.9%. To get the predicted label, take the argmax over the softmax output and map the resulting index back to the original string label, e.g. with your LabelEncoder's inverse_transform.
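A sketch of that decoding step, assuming a LabelEncoder fitted on hypothetical class names and made-up probabilities:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["negative", "neutral", "positive"])  # hypothetical class names

probs = np.array([0.659, 0.212, 0.129])  # softmax output for one sentence
pred_index = int(np.argmax(probs))       # index of the most probable class
pred_label = encoder.inverse_transform([pred_index])[0]

print(pred_index, pred_label)  # 0 negative
```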

You can feed your labels either as integers or as one-hot vectors. You have to use an appropriate loss function (categorical_crossentropy with one-hot or sparse_categorical_crossentropy with integers).
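To illustrate why the two setups are interchangeable, here is a small NumPy check (not the Keras implementations themselves) showing that categorical cross-entropy with a one-hot target reduces to minus the log-probability of the true class, which is exactly what the sparse variant computes from the integer label:

```python
import numpy as np

probs = np.array([0.659, 0.212, 0.129])  # hypothetical softmax output
label = 0                                # integer label
one_hot = np.eye(3)[label]               # [1., 0., 0.]

categorical = -np.sum(one_hot * np.log(probs))  # loss with one-hot target
sparse = -np.log(probs[label])                  # loss with integer target

print(np.isclose(categorical, sparse))  # True -- the two losses agree
```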

User contributions licensed under: CC BY-SA