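from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Build the vocabulary from the training texts only
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)
encoded_docs = tokenizer.texts_to_sequences(X_train)
padded_sequence = pad_sequences(encoded_docs, maxlen=60)
# Test texts may contain words that were never seen during fitting
test_tweets = tokenizer.texts_to_sequences(X_test)
test_padded_sequence = pad_sequences(test_tweets, maxlen=60)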
I didn't get any error with that code even though I didn't provide the oov_token argument. I expected an error at test_tweets = tokenizer.texts_to_sequences(X_test). How does TensorFlow deal with out-of-vocabulary (OOV) words at test time when you don't provide oov_token?
Answer
By default, when oov_token is None, OOV words are silently dropped from the output sequences:
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['hello world'])
print(tokenizer.word_index)

sequences = tokenizer.texts_to_sequences(['hello friends'])
print(sequences)
Output:
{'hello': 1, 'world': 2}
[[1]]
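To make the difference between the default and an explicit oov_token concrete, the lookup can be modelled in plain Python. This is only a simplified sketch of the behavior described above, not the Keras source: to_sequence and its parameters are hypothetical names, and it ignores details such as punctuation filtering and num_words.

```python
from typing import Dict, List, Optional


def to_sequence(text: str,
                word_index: Dict[str, int],
                oov_token: Optional[str] = None) -> List[int]:
    """Map whitespace-separated words to indices (hypothetical, simplified helper)."""
    # The OOV token, if any, is assumed to have its own entry in word_index.
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for word in text.lower().split():
        index = word_index.get(word)
        if index is not None:
            sequence.append(index)       # known word
        elif oov_index is not None:
            sequence.append(oov_index)   # unknown word, replaced by the OOV index
        # otherwise the unknown word is silently skipped
    return sequence


# Without an OOV token, "friends" disappears from the result.
no_oov = to_sequence("hello friends", {"hello": 1, "world": 2})
# With an OOV token, "friends" is kept as the OOV index.
with_oov = to_sequence("hello friends",
                       {"<OOV>": 1, "hello": 2, "world": 3},
                       oov_token="<OOV>")
print(no_oov)    # [1]
print(with_oov)  # [2, 1]
```

With the real tokenizer, the equivalent is to construct it as Tokenizer(oov_token="<OOV>") before calling fit_on_texts. The OOV token then takes index 1 in word_index, and unseen words map to that index instead of being dropped.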