
Getting the number of words from tf.Tokenizer after fitting

I initially tried making an RNN that can predict Shakespeare text, and I did it successfully using character-level encoding. But when I switched to word-level encoding, I ran into a multitude of issues. Specifically, I am having a hard time getting the total number of tokens in the text (I was told it was just dataset_size = tokenizer.document_count, but this just returns 1) so that I can set steps_per_epoch = dataset_size // batch_size when fitting my model (now both char- and word-level encoding return 1). I tried setting dataset_size = sum(tokenizer.word_counts.values()), but when I fit the model, I get this error right before the first epoch ends:

WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least steps_per_epoch * epochs batches (in this case, 32 batches). You may need to use the repeat() function when building your dataset.

So I assume that my code believes that I have slightly more training batches available than I actually do. Or could it be that I am programming on the new M1 chip, which doesn’t have a production version of TF? So really, I’m just not sure how to get the exact number of words in this text.

Here’s the code:


Thanks:)


Answer

The number of occurrences of each word found in the input text is stored in an OrderedDict, tokenizer.word_counts. It looks like

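For illustration, here is a small sketch of the shape of that mapping. It does not call TensorFlow: the count_words helper and the sample sentence are hypothetical stand-ins that only mimic how word_counts maps each word to its number of occurrences, in order of first appearance.

```python
from collections import OrderedDict


def count_words(text):
    """Hypothetical stand-in: count occurrences of each whitespace-separated word.

    This only imitates the *shape* of tokenizer.word_counts (word -> count,
    ordered by first appearance); it is not the Keras implementation.
    """
    counts = OrderedDict()
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


sample = "to be or not to be"  # made-up sample text
word_counts = count_words(sample)

# Show the (word, count) pairs in first-seen order.
print(list(word_counts.items()))
```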

So, to get the word count number, you need

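As a hedged sketch of the two numbers one might mean by "word count": summing the values gives the total number of word occurrences, while the length of the mapping gives the number of distinct words. The word_counts literal, batch_size, and window_length below are made-up illustration values. The windowing arithmetic is an assumption about the training pipeline (sliding windows with a shift of 1 that drop incomplete windows); if your pipeline differs, the step count will too.

```python
from collections import OrderedDict

# Made-up mapping with the same shape as tokenizer.word_counts.
word_counts = OrderedDict([("to", 2), ("be", 2), ("or", 1), ("not", 1)])

# Total number of word occurrences in the text (6 here).
total_words = sum(word_counts.values())

# Number of distinct words, i.e. the vocabulary size (4 here).
vocab_size = len(word_counts)

# Assumption: samples are sliding windows of `window_length` tokens, shift 1,
# with incomplete windows dropped. Both values are illustrative only.
window_length = 3
batch_size = 2

# A text of N tokens yields N - window_length + 1 full windows.
n_windows = max(total_words - window_length + 1, 0)

# Steps based on the samples the dataset can actually produce.
steps_per_epoch = n_windows // batch_size

# Steps based on the raw token total, which overestimates the batch count.
naive_steps = total_words // batch_size

print(total_words, vocab_size, n_windows, steps_per_epoch, naive_steps)
```

If steps_per_epoch is computed from the raw token total rather than from the number of samples the dataset actually yields, Keras asks for more batches than exist, which would be consistent with the warning in the question. Calling repeat() on the dataset, as the warning itself suggests, is another way around it.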