
What is Keras’ Tokenizer fit_on_sequences used for?

I’m familiar with the ‘fit_on_texts’ method of Keras’ Tokenizer. What does ‘fit_on_sequences’ do, and when is it useful? According to the documentation, it “Updates internal vocabulary based on a list of sequences.”, and it takes as input: ‘A list of sequence. A “sequence” is a list of integer word indices.’.

When fitting on texts, I understand that the text is parsed into tokens and each token is assigned an integer index. The tokenizer object therefore contains, among other things, a dictionary mapping tokens (strings) to indices (integers). However, if I give it only sequences of numbers and call fit_on_sequences, how would it know which tokens those numbers represent?

As an experiment, try the following:
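
A minimal sketch of such an experiment, assuming a single test sequence of the integers 1 through 6 (the exact values, the variable names, and the `tensorflow.keras` import path are assumptions, and the import needs a TensorFlow version that still ships the legacy Tokenizer):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Assumed test data: one "sequence" of integer word indices.
test_seq = [[1, 2, 3, 4, 5, 6]]

tok = Tokenizer()
tok.fit_on_sequences(test_seq)

# No text was ever seen, so there are no token strings to map.
print(tok.word_index)
print(tok.index_word)
```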


Then, the properties word_index and index_word, which would otherwise contain the token dictionaries, are of course empty. The documentation also says of fit_on_sequences: “Required before using sequences_to_matrix (if fit_on_texts was never called).” However, calling sequences_to_matrix after calling only fit_on_sequences (not fit_on_texts) does not work. So, what is fit_on_sequences used for?



sequences_to_matrix does work after calling fit_on_sequences; you just need to pass the num_words argument when instantiating Tokenizer().
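
A minimal sketch of this, assuming a test sequence of the integers 1 through 6 and `num_words=10` (the test data, the variable names, and the `tensorflow.keras` import path are assumptions):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Assumed test data: one sequence of integer word indices.
test_seq = [[1, 2, 3, 4, 5, 6]]

# num_words fixes the width of the output matrix.
tok = Tokenizer(num_words=10)
tok.fit_on_sequences(test_seq)

# Default mode is "binary": 1 where an index occurs, 0 elsewhere.
matrix = tok.sequences_to_matrix(test_seq)
print(matrix)
# [[0. 1. 1. 1. 1. 1. 1. 0. 0. 0.]]
```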


The zero at the beginning is there because there is no 0 in your sequence, and the zeroes at the end are there because I specified num_words as 10 but the highest value in your test sequence is 6.

Its purpose is simply to skip the step of mapping between strings and integers. It works with the integer indices only.
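
To make that concrete, here is a simplified pure-Python stand-in for the bookkeeping involved. It is a sketch modelled on my understanding of the legacy Keras implementation, not the library's code: the class name `TinyTokenizer` is hypothetical, and the tf-idf weighting shown is an assumed smoothed form. The point is only that fitting on sequences records how many sequences contain each index, which a mode such as tf-idf can then use without any token strings.

```python
import math
from collections import defaultdict


class TinyTokenizer:
    """Hypothetical stand-in that only ever sees integer indices."""

    def __init__(self, num_words):
        self.num_words = num_words
        self.document_count = 0
        # index -> number of sequences that contain it
        self.index_docs = defaultdict(int)

    def fit_on_sequences(self, sequences):
        self.document_count += len(sequences)
        for seq in sequences:
            for idx in set(seq):  # count each index once per sequence
                self.index_docs[idx] += 1

    def sequences_to_matrix(self, sequences, mode="binary"):
        rows = []
        for seq in sequences:
            row = [0.0] * self.num_words
            counts = defaultdict(int)
            for idx in seq:
                if idx < self.num_words:  # larger indices are dropped
                    counts[idx] += 1
            for idx, c in counts.items():
                if mode == "binary":
                    row[idx] = 1.0
                elif mode == "count":
                    row[idx] = float(c)
                elif mode == "tfidf":
                    tf = 1 + math.log(c)
                    docs_with_idx = self.index_docs.get(idx, 0)
                    idf = math.log(1 + self.document_count / (1 + docs_with_idx))
                    row[idx] = tf * idf
            rows.append(row)
        return rows


tok = TinyTokenizer(num_words=10)
tok.fit_on_sequences([[1, 2, 3], [3, 4, 5, 6]])

binary = tok.sequences_to_matrix([[1, 2, 3, 4, 5, 6]])
tfidf = tok.sequences_to_matrix([[1, 3]], mode="tfidf")
```

In this sketch, index 3 appears in both fitted sequences and index 1 in only one, so index 3 receives the lower tf-idf weight; no string vocabulary is consulted at any point.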
