
Why does Keras.preprocessing.sequence pad_sequences process characters instead of words?

I’m working on transcribing speech to text and ran into an issue (I think) when using pad_sequences in Keras. I pretrained a model that used pad_sequences on a dataframe, and it fit the data into an array with the same number of rows and columns for every value. However, when I use pad_sequences on the transcribed text, the number of rows in the returned NumPy array equals the number of characters in the spoken string.

Say I have a string with 4 characters: it returns a 4 x 500 NumPy array. A string with 6 characters returns a 6 x 500 array, and so on.

My code for clarification: the main script calls a helper function script that tokenizes the transcribed text and pads it with pad_sequences.
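The original code blocks did not survive the page extraction. Below is a minimal sketch that reproduces the behaviour described above; the function name preprocess_text and the padding length of 500 are assumptions based on the shapes mentioned in the question, not the asker's actual code:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 500  # assumed padding length, matching the 500 columns mentioned above


def preprocess_text(text):
    # `text` is a single string here, not a list of strings --
    # this is the root cause the answer explains
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)            # iterates over the characters of the string
    sequences = tokenizer.texts_to_sequences(text)
    return pad_sequences(sequences, maxlen=MAX_LEN)


padded = preprocess_text("abcd")   # a 4-character "transcription"
print(padded.shape)                # (4, 500): one row per character
```

Running this with a 6-character string yields a (6, 500) array, exactly as described.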

The returned array will be fed into a pre-trained model, and I’m pretty sure that inputs of different lengths will cause an issue.


Answer

The Tokenizer's methods, such as fit_on_texts and texts_to_sequences, expect a list of texts/strings as input (as their names suggest, i.e. texts). However, you are passing them a single string, so they iterate over its characters instead, treating each character as a separate text!

One way of resolving this is to add a check at the beginning of each function that wraps the input in a list when it is a single string, so the data type passed to the Tokenizer is always a list.
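The answer's code block was lost in extraction. A minimal sketch of such a check (the function and parameter names are assumptions; adapt them to your own helpers):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


def preprocess_text(data, tokenizer, max_len=500):
    # Wrap a single string in a list so the Tokenizer treats it as one
    # text rather than as a sequence of single-character texts
    if isinstance(data, str):
        data = [data]
    sequences = tokenizer.texts_to_sequences(data)
    return pad_sequences(sequences, maxlen=max_len)


tokenizer = Tokenizer()
tokenizer.fit_on_texts(["hello world", "keras pads sequences"])

print(preprocess_text("hello world", tokenizer).shape)       # (1, 500)
print(preprocess_text(["hello", "world"], tokenizer).shape)  # (2, 500)
```

With the check in place, a single string produces one padded row instead of one row per character.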

You should also do this for the tokenize_text function. After this change, your custom functions will work on both a single string and a list of strings.


As an important side note, if the code you have put in your question belongs to the prediction phase, there is a fundamental error in it: you should use the same Tokenizer instance you used when training the model, to ensure the mapping and tokenization are done the same way as in the training phase. It does not make sense to create a new Tokenizer instance for each test sample, or even for all of them (unless it has the same mapping and configuration as the one used in the training phase).
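One common way to reuse the training-phase Tokenizer at prediction time is to serialize it. A sketch using pickle (the file name is an assumption, and this is not the asker's code):

```python
import pickle

from tensorflow.keras.preprocessing.text import Tokenizer

# --- training phase ---
tokenizer = Tokenizer()
tokenizer.fit_on_texts(["some training text", "more training text"])
with open("tokenizer.pkl", "wb") as f:
    pickle.dump(tokenizer, f)

# --- prediction phase (possibly a different script/process) ---
with open("tokenizer.pkl", "rb") as f:
    tokenizer = pickle.load(f)  # same word-index mapping as in training
```

Keras also provides a JSON round trip via tokenizer.to_json() and tensorflow.keras.preprocessing.text.tokenizer_from_json, which avoids pickle's portability caveats.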

User contributions licensed under: CC BY-SA