I wrote a Python program that saves a tf.keras.layers.TextVectorization layer to disk and loads it back, following the answer to "How to save TextVectorization to disk in tensorflow?".
The TextVectorization layer rebuilt from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'.
For example, with output_sequence_length=10 and output_mode='int', TextVectorization should output a vector of length 10 for any input text; see vectorizer and new_v2 in the code below.
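To make the expected behaviour concrete, here is a minimal pure-Python sketch of what a fixed output_sequence_length amounts to: the token ids are truncated, or padded on the right with zeros, to exactly that length. The helper name pad_or_truncate is hypothetical and this is not how Keras implements the layer; it only illustrates the expected result. The nine ids are the ones that appear in the printed output of this question.

```python
def pad_or_truncate(token_ids, length, pad_value=0):
    """Cut or right-pad token_ids with pad_value to exactly `length` items."""
    trimmed = list(token_ids[:length])
    return trimmed + [pad_value] * (length - len(trimmed))


# Nine token ids, as produced for the nine-word test sentence.
ids = [1, 1, 3, 1, 3, 11, 12, 1, 10]

print(pad_or_truncate(ids, 10))  # padded with one trailing 0
print(pad_or_truncate(ids, 4))   # truncated to the first four ids
```

With length 10, the nine ids gain one trailing zero, which is the shape vectorizer and new_v2 produce.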
However, if output_mode is read from the saved config, the layer does not output a vector of length 10. It outputs 9 values, the actual number of tokens in the sentence, as if output_sequence_length had not been applied. See new_v1 in the code below.
Interestingly, comparing from_disk['config']['output_mode'] with 'int' shows that they are equal.
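Equality alone does not show that the two strings are interchangeable in every check. A string restored by pickle compares equal to the literal 'int' but is generally a different object, so any code that compares with `is` instead of `==` could treat the two differently. Whether that is what happens inside TextVectorization is an assumption here, not something this post confirms. The sketch below only demonstrates the Python behaviour; INT, uses_identity, and uses_equality are hypothetical names.

```python
import pickle

INT = "int"  # hypothetical module-level constant, as a library might define it


def uses_identity(mode):
    # Fragile: true only when `mode` is the very same object as INT.
    return mode is INT


def uses_equality(mode):
    # Robust: true for any string with the same characters.
    return mode == INT


# Round-trip a config-like dict through pickle, as the question does.
restored = pickle.loads(pickle.dumps({"output_mode": "int"}))["output_mode"]

print("equal to 'int':", restored == "int")
print("equality check:", uses_equality(restored))
print("identity check:", uses_identity(restored))
```

On CPython the identity check usually reports False for the restored string even though the equality checks report True, which would be consistent with new_v1 and new_v2 behaving differently.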
    import tensorflow as tf
    from tensorflow.keras.models import load_model
    import pickle

    # In[]
    max_len = 10  # Sequence length to pad the outputs to.
    text_dataset = tf.data.Dataset.from_tensor_slices([
        "I like natural language processing",
        "You like computer vision",
        "I like computer games and computer science"])

    # Fit a TextVectorization layer
    VOCAB_SIZE = 10  # Maximum vocab size.
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=max_len
    )
    vectorizer.adapt(text_dataset.batch(64))

    # In[]
    #print(vectorizer.get_vocabulary())
    #print(vectorizer.get_config())
    #print(vectorizer.get_weights())

    # In[]
    # Pickle the config and weights
    with open("./models/tv_layer.pkl", "wb") as f:
        pickle.dump({'config': vectorizer.get_config(),
                     'weights': vectorizer.get_weights()}, f)

    # Later you can unpickle and use
    # `config` to create object and
    # `weights` to load the trained weights.
    with open("./models/tv_layer.pkl", "rb") as f:
        from_disk = pickle.load(f)

    # new_v1: output_mode is taken from the unpickled config
    new_v1 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode=from_disk['config']['output_mode'],
        output_sequence_length=from_disk['config']['output_sequence_length'],
    )
    # You have to call `adapt` with some dummy data (BUG in Keras)
    new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    new_v1.set_weights(from_disk['weights'])

    # new_v2: output_mode is the literal 'int'
    new_v2 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=from_disk['config']['output_sequence_length'],
    )
    # You have to call `adapt` with some dummy data (BUG in Keras)
    new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
    new_v2.set_weights(from_disk['weights'])
    print("*" * 10)

    # In[]
    test_sentence = "Jack likes computer scinece, computer games, and foreign language"

    print(vectorizer(test_sentence))
    print(new_v1(test_sentence))
    print(new_v2(test_sentence))
    print(from_disk['config']['output_mode'] == 'int')
Here are the print() outputs:
    **********
    tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
    tf.Tensor([ 1  1  3  1  3 11 12  1 10], shape=(9,), dtype=int64)
    tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
    True
Does anyone know why?
Answer
The bug is fixed by the PR at https://github.com/keras-team/keras/pull/15422