
Removing SEP token in Bert for text classification

Given a sentiment classification dataset, I want to fine-tune Bert.

As you know, BERT was pre-trained (in part) to predict whether the second sentence follows the first. Thus, to make the network aware of this, they insert a [CLS] token at the beginning of the first sentence, then add a [SEP] token to separate the first sentence from the second, and finally another [SEP] at the end of the second sentence (it's not clear to me why they append another token at the end).

Anyway, for text classification, what I noticed in some of the online examples (see BERT in Keras with TensorFlow Hub) is that they add the [CLS] token, then the sentence, and then another [SEP] token at the end.
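
For illustration, here is a minimal sketch of that format, assuming the Hugging Face transformers tokenizer rather than the TF Hub code from the linked example (the checkpoint name and example sentence are just placeholders):

```python
# Minimal sketch: standard single-sentence input format [CLS] <tokens> [SEP],
# using the Hugging Face tokenizer (an assumption, not the code from the example).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer("the movie was great", add_special_tokens=True)
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]']
```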

Whereas in other research works (e.g. Enriching Pre-trained Language Model with Entity Information for Relation Classification) they remove the last [SEP] token.
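
A corresponding sketch of the variant without the trailing [SEP], under the same assumptions (Hugging Face tokenizer, placeholder checkpoint and sentence), would be:

```python
# Minimal sketch: tokenize without special tokens and prepend [CLS] manually,
# so no [SEP] is appended at the end of the input.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer("the movie was great", add_special_tokens=False)["input_ids"]
ids = [tokenizer.cls_token_id] + ids  # [CLS] at the front, no closing [SEP]
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', 'the', 'movie', 'was', 'great']
```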

Why is it (or isn't it) beneficial to add the [SEP] token at the end of the input text when my task uses only a single sentence?


Answer

I'm not quite sure why BERT needs the separator token [SEP] at the end for single-sentence tasks, but my guess is that BERT is an autoencoding model that, as mentioned, was originally designed for language modelling and next-sentence prediction. BERT was therefore trained to always expect the [SEP] token, which means that the token is involved in the underlying knowledge that BERT built up during pre-training.

Downstream tasks that came later, such as single-sentence use cases (e.g. text classification), turned out to work with BERT as well; however, the [SEP] token was kept as a relic from pre-training so that BERT behaves as it did during training, and it is therefore still used even for these tasks.

BERT might learn faster if [SEP] is appended at the end of a single sentence, because it has encoded some knowledge in that token indicating that it marks the end of the input. Without it, BERT would still know where the sentence ends (due to the padding tokens), which explains why the aforementioned research omits the token. Dropping it might slow down fine-tuning slightly, though, since BERT may learn faster with the appended [SEP] token, especially if a truncated input contains no padding tokens at all.
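
As a rough illustration of that point, the sketch below (again assuming the Hugging Face tokenizer, which the question does not use) shows that the padding tokens and the attention mask already mark where the real input ends, even without a trailing [SEP]:

```python
# Minimal sketch: with padding, the attention mask marks the end of the real
# tokens, so the model can locate the sentence boundary without a [SEP].
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    "the movie was great",
    add_special_tokens=False,
    padding="max_length",
    max_length=8,
)
print(tokenizer.convert_ids_to_tokens(batch["input_ids"]))
# ['the', 'movie', 'was', 'great', '[PAD]', '[PAD]', '[PAD]', '[PAD]']
print(batch["attention_mask"])
# [1, 1, 1, 1, 0, 0, 0, 0]
```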
