I have a TensorFlow TextVectorization layer named "eng_vectorization":
    vocab_size = 15000
    sequence_length = 20

    eng_vectorization = TextVectorization(max_tokens=vocab_size,
                                          output_mode='int',
                                          output_sequence_length=sequence_length)

    # text_pairs is my English-Spanish text data.
    train_eng_texts = [pair[0] for pair in text_pairs]
    eng_vectorization.adapt(train_eng_texts)
and I saved it in a pickle file, using this code:
    pickle.dump({'config': eng_vectorization.get_config(),
                 'weights': eng_vectorization.get_weights()},
                open("english_vocab.pkl", "wb"))
Then I load that pickle file properly as new_eng_vectorization:
    from_disk = pickle.load(open("english_vocab.pkl", "rb"))
    new_eng_vectorization = TextVectorization.from_config(from_disk['config'])
    new_eng_vectorization.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    new_eng_vectorization.set_weights(from_disk['weights'])
I expect the original vectorization, eng_vectorization, and the newly loaded one, new_eng_vectorization, to behave the same, but they do not.
The output of the original vectorization, eng_vectorization(['Hello people']), is a Tensor:
<tf.Tensor: shape=(1, 20), dtype=int64, numpy= array([[1800, 110, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>
And the output of the pickled vectorization, new_eng_vectorization(['Hello people']), is a RaggedTensor:
<tf.RaggedTensor [[1800, 110]]>
Both eng_vectorization and new_eng_vectorization have the same config:
    {'batch_input_shape': (None,),
     'dtype': 'string',
     'idf_weights': None,
     'max_tokens': 15000,
     'name': 'text_vectorization',
     'ngrams': None,
     'output_mode': 'int',
     'output_sequence_length': 20,
     'pad_to_max_tokens': False,
     'ragged': False,
     'sparse': False,
     'split': 'whitespace',
     'standardize': 'lower_and_strip_punctuation',
     'trainable': True,
     'vocabulary': None}
I think there is a problem with the way I saved the vectorization. How do I fix this? I am using it for deployment, which is why I want the pickled vectorization to behave like the original one.
Here is a Google Colab link to reproducible code: [CLICK HERE]
Answer
The problem is related to a very recent bug, where the output_mode is not set correctly when it comes from a saved configuration.
This works:
    pickle.dump({'config': eng_vectorization.get_config(),
                 'weights': eng_vectorization.get_weights()},
                open("english_vocab.pkl", "wb"))

    from_disk = pickle.load(open("english_vocab.pkl", "rb"))
    new_eng_vectorization = TextVectorization(max_tokens=from_disk['config']['max_tokens'],
                                              output_mode='int',
                                              output_sequence_length=from_disk['config']['output_sequence_length'])
    new_eng_vectorization.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    new_eng_vectorization.set_weights(from_disk['weights'])
    new_eng_vectorization(['Hello people'])
<tf.Tensor: shape=(1, 20), dtype=int64, numpy= array([[1800, 110, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])>
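For reference, the dense result above is the same two token ids as the ragged output in the question, right-padded with zeros up to output_sequence_length. The following is a plain-Python sketch of that padding, not TensorFlow's implementation; pad_to_length is a hypothetical helper name introduced only for illustration.

```python
from typing import List


def pad_to_length(rows: List[List[int]], length: int, pad_value: int = 0) -> List[List[int]]:
    """Truncate or right-pad every row so that it has exactly `length` items."""
    padded = []
    for row in rows:
        trimmed = row[:length]  # drop any tokens beyond the limit
        padded.append(trimmed + [pad_value] * (length - len(trimmed)))
    return padded


ragged = [[1800, 110]]             # token ids taken from the ragged output
dense = pad_to_length(ragged, 20)  # 20 is output_sequence_length in the question
print(dense)
```

If the loaded layer returns rows of varying length instead of rows of exactly 20 ids, the output_sequence_length setting has not taken effect.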
This is currently not working correctly:
    pickle.dump({'config': eng_vectorization.get_config(),
                 'weights': eng_vectorization.get_weights()},
                open("english_vocab.pkl", "wb"))

    from_disk = pickle.load(open("english_vocab.pkl", "rb"))
    new_eng_vectorization = TextVectorization(max_tokens=from_disk['config']['max_tokens'],
                                              output_mode=from_disk['config']['output_mode'],
                                              output_sequence_length=from_disk['config']['output_sequence_length'])
    new_eng_vectorization.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    new_eng_vectorization.set_weights(from_disk['weights'])
    new_eng_vectorization(['Hello people'])
<tf.RaggedTensor [[1800, 110]]>
This happens even though 'int' and from_disk['config']['output_mode'] are equal and of the same data type. For now, you can use the workaround above and pass output_mode='int' explicitly.
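The equality claim can be checked without TensorFlow. Here is a minimal sketch that writes a trimmed copy of the config from the question to a pickle file and reads it back; the temporary directory and the reduced set of keys are assumptions made only to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Trimmed copy of the config printed in the question (only a few keys kept).
config = {
    "max_tokens": 15000,
    "output_mode": "int",
    "output_sequence_length": 20,
    "ragged": False,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "english_vocab.pkl")
    # Context managers make sure the file is closed after each step.
    with open(path, "wb") as f:
        pickle.dump({"config": config}, f)
    with open(path, "rb") as f:
        from_disk = pickle.load(f)

restored_mode = from_disk["config"]["output_mode"]
same_value = restored_mode == "int"
same_type = type(restored_mode) is str
print(same_value, same_type)
```

Since the value survives the round trip unchanged, the pickled data itself looks fine, which fits the explanation that the layer mishandles the value.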