I’m working on transcribing speech to text and I think I’ve run into an issue when using pad_sequences in Keras. I pretrained a model that used pad_sequences on a dataframe, and it fit the data into an array with the same number of columns and rows for each value. However, when I use pad_sequences on the transcribed text, the number of rows returned in the NumPy array equals the number of characters in the spoken string. Say I have a string with 4 characters: it returns a 4 x 500 NumPy array. For a string with 6 characters it returns a 6 x 500 NumPy array, and so on.
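A minimal sketch of what I’m seeing (the Tokenizer here is fitted on the single string itself, the same way my helpers do it):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts("abcd")  # a single 4-character string

# the bare string is also passed straight to texts_to_sequences
padded = pad_sequences(tokenizer.texts_to_sequences("abcd"), maxlen=500)
print(padded.shape)  # (4, 500)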
My code for clarification:
import speech_recognition as sr
import pyaudio
import pandas as pd
from helperFunctions import *

jurors = ['Zack', 'Ben']
storage = []
storage_df = pd.DataFrame()

while len(storage) < len(jurors):
    print('Juror' + ' ' + jurors[len(storage)] + ' ' + 'is speaking:')
    init_rec = sr.Recognizer()
    with sr.Microphone() as source:
        audio_data = init_rec.adjust_for_ambient_noise(source)
        audio_data = init_rec.listen(source)  # each juror speaks for 10 seconds
        audio_text = init_rec.recognize_google(audio_data)
        print('End of juror' + ' ' + jurors[len(storage)] + ' ' + 'speech')
        storage.append(audio_text)
        cleaned = clean_text(audio_text)
        tokenized = tokenize_text(cleaned)
        padded_text = padding(cleaned, tokenized)  # fix padded text elongating rows
I use a helper function script:
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def clean_text(text, stem=False):
    text_clean = '@S+|https?:S|[^A-Za-z0-9]+'
    text = re.sub(text_clean, ' ', str(text).lower()).strip()
    #text = tf.strings.substr(text, 0, 300)  # restrict text size to 300 chars
    return text

def tokenize_text(text):
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)
    return tokenizer

def padding(text, tokenizer):
    text = pad_sequences(tokenizer.texts_to_sequences(text), maxlen=500)
    return text
The returned text will be fed into a pre-trained model, and I’m pretty sure that rows of different lengths will cause an issue.
Answer
The Tokenizer’s methods, such as fit_on_texts and texts_to_sequences, expect a list of texts/strings as input (as the name suggests, i.e. texts). However, you are passing them a single string, so they iterate over its characters instead, assuming it is actually a list!
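A quick sketch to illustrate the difference (the sample string here is arbitrary):

from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(["hello world"])               # a list containing one text

print(tokenizer.texts_to_sequences(["hello world"]))  # [[1, 2]] -- one sequence for the whole text
print(tokenizer.texts_to_sequences("hello world"))    # 11 sequences -- one per character of the string
                                                      # (all empty, since no single character is in the vocabulary)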
One way of resolving this is to add a check at the beginning of each function to make sure that the input data type is actually a list. For example:
def padding(text, tokenizer):
    if isinstance(text, str):
        text = [text]
    # the rest would not change...
You should also do this for the tokenize_text function. After this change, your custom functions will work on both a single string and a list of strings.
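For instance, the same guard applied to your helper would look like this (just a sketch):

def tokenize_text(text):
    if isinstance(text, str):
        text = [text]                # wrap a single string in a list
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(text)     # now always receives a list of texts
    return tokenizer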
As an important side note, if the code you have put in your question belongs to the prediction phase, there is a fundamental error in it: you should use the same Tokenizer instance you used when training the model, to ensure the mapping and tokenization are done the same way as in the training phase. It does not make sense to create a new Tokenizer instance for each test sample, or for all of them (unless it has the same mapping and configuration as the one used in the training phase).
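One common way to do this is to persist the fitted Tokenizer and load it at prediction time, for example with pickle (a sketch; the filename is illustrative, and Keras also offers to_json/tokenizer_from_json if you prefer JSON):

import pickle

# At training time: save the fitted Tokenizer alongside the model
with open('tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)

# At prediction time: load that same Tokenizer instead of creating a new one
with open('tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

padded_text = padding(cleaned, tokenizer)  # reuse the training-time mapping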