I have a model that I trained with
common_embed = Embedding(
    name="synopsis_embedd",
    input_dim=len(t.word_index) + 1,
    output_dim=len(embeddings_index['no']),
    weights=[embedding_matrix],
    input_length=len(X_train['asset_text_seq_pad'].tolist()[0]),
    trainable=True
)
lstm_1 = common_embed(input_1)
common_lstm = LSTM(64, input_shape=(100, 2))
...
For the embedding I use GloVe as a pre-trained embedding dictionary. I first build the tokenizer and the text sequences with:

t = Tokenizer()
t.fit_on_texts(all_text)
text_seq = pad_sequences(t.texts_to_sequences(data['example_texts'].astype(str).values))
and then I’m calculating the embedding matrix with:
embeddings_index = {}
for line in new_byte_string.decode('utf-8').split('\n'):
    if line:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embedding_vector = None
not_present_list = []
vocab_size = len(t.word_index) + 1
print('Loaded %s word vectors.' % len(embeddings_index))

embedding_matrix = np.zeros((vocab_size, len(embeddings_index['no'])))
for word, i in t.word_index.items():
    if word in embeddings_index.keys():
        embedding_vector = embeddings_index.get(word)
    else:
        not_present_list.append(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
    else:
        embedding_matrix[i] = np.zeros(300)
Now I'm using a new dataset for prediction, which leads to this error:
Node: 'model/synopsis_embedd/embedding_lookup' indices[38666,63] = 136482 is not in [0, 129872)
[[{{node model/synopsis_embedd/embedding_lookup}}]] [Op:__inference_predict_function_12452]
For prediction I repeat all of these preprocessing steps on the new data. Is that wrong, and do I have to reuse the tokenizer from training? Or why do the indices at prediction time fall outside the trained vocabulary?
Answer
You are probably getting this error because you are not using the same tokenizer and embedding_matrix during inference. Here is an example:
import tensorflow as tf

vocab_size = 50
embedding_layer = tf.keras.layers.Embedding(vocab_size, 64, input_length=10)

sequence1 = tf.constant([[1, 2, 5, 10, 32]])
embedding_layer(sequence1)  # This works

sequence2 = tf.constant([[51, 2, 5, 10, 32]])
embedding_layer(sequence2)  # This throws an error because 51 is larger than the vocab_size=50
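So the fix is to keep the tokenizer that was fitted on the training texts and reuse it when preparing the prediction inputs, instead of fitting a new one on the new dataset. A minimal sketch of one way to do that; here tokenizer.pkl, new_data, maxlen_train and model are placeholder names, not taken from your code:

import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences

# After training: persist the tokenizer that was fitted on the training texts
with open('tokenizer.pkl', 'wb') as f:
    pickle.dump(t, f)

# At prediction time: load that same tokenizer instead of fitting a new one
with open('tokenizer.pkl', 'rb') as f:
    t = pickle.load(f)

# Convert the new texts with the training tokenizer and pad to the training length,
# so every index stays inside the embedding's input_dim = len(t.word_index) + 1
new_seq = t.texts_to_sequences(new_data['example_texts'].astype(str).values)
new_seq_pad = pad_sequences(new_seq, maxlen=maxlen_train)

predictions = model.predict(new_seq_pad)

Because texts_to_sequences on the saved tokenizer only emits indices from the training vocabulary (unknown words are dropped, or mapped to the oov_token if one was set when the tokenizer was created), no index can exceed the embedding layer's input_dim and the embedding_lookup error goes away.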