
How does Tokenizer in tensorflow deal with out of vocabulary tokens if I don’t provide oov_token?

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
encoded_docs = tokenizer.texts_to_sequences(X_train)
padded_sequence = pad_sequences(encoded_docs, maxlen=60)
test_tweets = tokenizer.texts_to_sequences(X_test)
test_padded_sequence = pad_sequences(test_tweets, maxlen=60)

I didn’t get any error from that code even though I didn’t provide the oov_token argument. I expected test_tweets = tokenizer.texts_to_sequences(X_test) to raise an error for words that weren’t seen during fitting.

How does TensorFlow deal with out-of-vocabulary words at test time when you don’t provide oov_token?


Answer

By default, when oov_token is None, out-of-vocabulary words are silently ignored/discarded:

import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)

sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)
Output:

{'hello': 1, 'world': 2}
[[1]]

Note that 'friends' was never seen during fitting, so it is simply dropped from the sequence rather than raising an error.
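For contrast, here is a minimal sketch of the same example with oov_token set (the placeholder string '<OOV>' is an arbitrary choice, not required by the API). Unseen words are then kept and mapped to the OOV index instead of being dropped:

```python
import tensorflow as tf

# Same toy data as above, but with an explicit oov_token.
# Keras reserves index 1 for the OOV token.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)   # {'<OOV>': 1, 'hello': 2, 'world': 3}

sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)              # [[2, 1]] - 'friends' maps to the OOV index
```

This is usually what you want for test data, since dropping OOV words silently changes sequence lengths before padding.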