I’m working on transcribing speech to text and I think I’ve run into an issue when using pad_sequences in Keras. I pretrained a model that used pad_sequences on a dataframe, and it fit the data into an array with the same number of columns and rows for each value. However, when I use pad_sequences on the transcribed text, the number of rows returned in the NumPy array equals the number of characters in the spoken string. Say I have a string with 4 characters: it returns a 4 x 500 NumPy array. For a string with 6 characters it returns a 6 x 500 NumPy array, and so on.
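A minimal sketch of what I’m seeing (the Tokenizer here is fitted on the single string itself, the same way my helpers do it):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts("abcd")  # a single 4-character string

# the bare string is also passed straight to texts_to_sequences
padded = pad_sequences(tokenizer.texts_to_sequences("abcd"), maxlen=500)
print(padded.shape)  # (4, 500)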
My code for clarification:
import speech_recognition as sr
import pyaudio
import pandas as pd
from helperFunctions import *

jurors = ['Zack', 'Ben']
storage = []
storage_df = pd.DataFrame()

while len(storage) < len(jurors):
    print('Juror' + ' ' + jurors[len(storage)] + ' ' + 'is speaking:')
    init_rec = sr.Recognizer()
    with sr.Microphone() as source:
        audio_data = init_rec.adjust_for_ambient_noise(source)
        audio_data = init_rec.listen(source)  # each juror speaks for 10 seconds
        audio_text = init_rec.recognize_google(audio_data)
        print('End of juror' + ' ' + jurors[len(storage)] + ' ' + 'speech')
        storage.append(audio_text)
        cleaned = clean_text(audio_text)
        tokenized = tokenize_text(cleaned)
        padded_text = padding(cleaned, tokenized)  # fix padded text elongating rows
I use a helper function script:
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def clean_text(text, stem=False):
    text_clean = '@S+|https?:S|[^A-Za-z0-9]+'
    text = re.sub(text_clean, ' ', str(text).lower()).strip()
    #text = tf.strings.substr(text, 0, 300)  # restrict text size to 300 chars
    return text

def tokenize_text(text):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)
    return tokenizer

def padding(text, tokenizer):
    text = pad_sequences(tokenizer.texts_to_sequences(text), maxlen=500)
    return text
The returned text will be fed into a pre-trained model, and I’m pretty sure that rows of different lengths will cause an issue.
Answer
The Tokenizer’s methods, such as fit_on_texts and texts_to_sequences, expect a list of texts/strings as input (as the name suggests, i.e. texts). However, you are passing them a single string, so they iterate over its characters instead, assuming it is actually a list!
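A quick sketch to illustrate the difference (the sample string here is arbitrary):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["hello world"])               # a list containing one text

print(tokenizer.texts_to_sequences(["hello world"]))  # [[1, 2]] -- one sequence for the whole text
print(tokenizer.texts_to_sequences("hello world"))    # 11 sequences -- one per character of the string
                                                      # (all empty, since no single character is in the vocabulary)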
One way of resolving this is to add a check at the beginning of each function to make sure that the input data type is actually a list. For example:
def padding(text, tokenizer):
    if isinstance(text, str):
        text = [text]
    # the rest would not change...
You should also do this for the tokenize_text function. After this change, your custom functions will work on both a single string and a list of strings.
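For instance, the same guard applied to your helper would look like this (just a sketch):

def tokenize_text(text):
    if isinstance(text, str):
        text = [text]                # wrap a single string in a list
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)     # now always receives a list of texts
    return tokenizer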
As an important side note, if the code you have put in your question belongs to the prediction phase, there is a fundamental error in it: you should use the same Tokenizer instance you used when training the model, to ensure the mapping and tokenization are done the same way as in the training phase. It does not make sense to create a new Tokenizer instance for each test sample, or for all of them (unless it has the same mapping and configuration as the one used in the training phase).
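One common way to do this is to persist the fitted Tokenizer and load it at prediction time, for example with pickle (a sketch; the filename is illustrative, and Keras also offers to_json/tokenizer_from_json if you prefer JSON):

import pickle

# At training time: save the fitted Tokenizer alongside the model
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# At prediction time: load that same Tokenizer instead of creating a new one
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

padded_text = padding(cleaned, tokenizer)  # reuse the training-time mapping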