Why does Keras.preprocessing.sequence pad_sequences process characters instead of words?

Question

I&#8217;m working on transcribing speech to text and ran into an issue (I think) when using pad_sequences in Keras. I pretrained a model which used pad_sequences on a dataframe and it fit the data into an array with the same number of columns & rows for each value. However when I used pad_sequences on tra…

Accepted Answer

The Tokenizer&#8216;s methods such as fit_on_texts or texts_to_sequences expect a list of texts/strings as input (as their name suggests, i.e. texts). However, you are passing a single text/string to them and therefore it would iterate over its characters instead while assuming it&#8217;s actually a list!One way of resolving this is to add a check at the beginning of each function to make sure that the input data type is actually a list. For example:def padding(text, tokenizer):    if isinstanceof(text, str):        text = [text]    # the rest would not change...You should also do this for the tokenize_text function. After this change, your custom functions would work on both a single string as well as a list of strings.As an important side note, if the code you have put in your question belongs to the prediction phase there is a fundamental error in it: you should use the same Tokenizer instance you have used when training the model to ensure the mapping and tokenization is done the same way as in training phase. Actually, it does not make sense to create a new Tokenizer instance for each or all test samples (unless it has the same mapping and configuration as the one used in training phase).

Advertisement

Answer