
A bug in tf.keras.layers.TextVectorization when built from saved configs and weights

I wrote a Python program that saves a tf.keras.layers.TextVectorization layer to disk and loads it back, following the answer to How to save TextVectorization to disk in tensorflow?. A TextVectorization layer rebuilt from the saved config outputs a vector of the wrong length when output_sequence_length is not None and output_mode='int'. For example, with output_sequence_length=10 and output_mode='int', the layer should output a vector of length 10 for any input text; see vectorizer and new_v2 in the code below. However, when output_mode is taken from the saved config (the object new_v1 below), the output vector does not have length 10 — it has length 9, the actual number of tokens in the sentence, as if output_sequence_length were never applied. The interesting thing is that comparing from_disk['config']['output_mode'] with 'int' returns True.

import tensorflow as tf
from tensorflow.keras.models import load_model
import pickle
# In[]
max_len = 10  # Sequence length to pad the outputs to.
text_dataset = tf.data.Dataset.from_tensor_slices([
                                                   "I like natural language processing",
                                                   "You like computer vision",
                                                   "I like computer games and computer science"])
# Fit a TextVectorization layer
VOCAB_SIZE = 10  # Maximum vocab size.
vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=max_len
        )
vectorizer.adapt(text_dataset.batch(64))
# In[]
#print(vectorizer.get_vocabulary())
#print(vectorizer.get_config())
#print(vectorizer.get_weights())
# In[]


# Pickle the config and weights
pickle.dump({'config': vectorizer.get_config(),
             'weights': vectorizer.get_weights()}
            , open("./models/tv_layer.pkl", "wb"))


# Later you can unpickle and use
# `config` to create object and
# `weights` to load the trained weights.

from_disk = pickle.load(open("./models/tv_layer.pkl", "rb"))

new_v1 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode=from_disk['config']['output_mode'],
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )
# You have to call `adapt` with some dummy data (BUG in Keras)
new_v1.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v1.set_weights(from_disk['weights'])
new_v2 = tf.keras.layers.TextVectorization(
        max_tokens=None,
        standardize="lower_and_strip_punctuation",
        split="whitespace",
        output_mode='int',
        output_sequence_length=from_disk['config']['output_sequence_length'],
        )

# You have to call `adapt` with some dummy data (BUG in Keras)
new_v2.adapt(tf.data.Dataset.from_tensor_slices(["xyz"]))
new_v2.set_weights(from_disk['weights'])
print("*" * 10)
# In[]
test_sentence="Jack likes computer scinece, computer games, and foreign language"

print(vectorizer(test_sentence))
print(new_v1(test_sentence))
print(new_v2(test_sentence))
print(from_disk['config']['output_mode']=='int')

Here are the print() outputs:

**********
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10], shape=(9,), dtype=int64)
tf.Tensor([ 1  1  3  1  3 11 12  1 10  0], shape=(10,), dtype=int64)
True

Does anyone know why?


Answer

The bug is fixed by the PR at https://github.com/keras-team/keras/pull/15422.
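On a Keras version that includes the fix, a more robust way to restore the layer is to rebuild it with `from_config()` rather than hand-copying selected keys into the constructor, so every saved setting (including `output_sequence_length`) is applied. The sketch below is a minimal save/restore round trip along those lines; the dummy `adapt` call before `set_weights` mirrors the workaround from the question, and the sample sentences are illustrative only.

```python
import pickle
import tensorflow as tf

# Fit a layer that pads its output to a fixed length of 10 tokens.
vectorizer = tf.keras.layers.TextVectorization(
    output_mode="int", output_sequence_length=10)
vectorizer.adapt(["I like natural language processing",
                  "You like computer vision"])

# Serialize the full config plus the learned vocabulary weights.
blob = pickle.dumps({"config": vectorizer.get_config(),
                     "weights": vectorizer.get_weights()})

# Restore: from_config() re-applies *all* saved settings at once.
restored = pickle.loads(blob)
new_layer = tf.keras.layers.TextVectorization.from_config(restored["config"])
# adapt on dummy data so the internal lookup table is built,
# then overwrite it with the saved vocabulary.
new_layer.adapt(["xyz"])
new_layer.set_weights(restored["weights"])

out = new_layer("I like language processing")
print(out.shape)  # padded to output_sequence_length
```

With the fix in place, `out` has shape `(10,)` and matches what the original `vectorizer` produces for the same input.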

User contributions licensed under: CC BY-SA