Training a FastText model

I want to train a FastText model in Python using the “gensim” library. First, I tokenize each sentence into its words, converting each sentence into a list of words. Each of these lists is then appended to a final list, so at the end I have a nested list containing all the tokenized sentences:

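A minimal sketch of this tokenization step, using a plain whitespace split in place of a real tokenizer (the example sentences and variable names are illustrative):

```python
# Toy corpus standing in for the asker's real text data.
sentences = [
    "the quick brown fox",
    "jumps over the lazy dog",
]

# Tokenize each sentence into a list of words, then collect all
# of these lists into one nested list.
word_tokenized_corpus = []
for sentence in sentences:
    tokens = sentence.split()  # one sentence -> list of words
    word_tokenized_corpus.append(tokens)

print(word_tokenized_corpus)
```

A real pipeline would typically use a proper tokenizer (e.g. NLTK's `word_tokenize`) instead of `str.split`, but the resulting structure is the same.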

Then, the model is built as follows:


However, the number of sentences in “word_tokenized_corpus” is very large and the program cannot handle it. Is it possible to train the model by giving it each tokenized sentence one by one, like the following?

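One way the per-sentence idea could look (a sketch assuming gensim 4.x, not a recommendation): build the vocabulary once, then call `train()` separately for each sentence. Note the vocabulary step still needs to see the whole corpus:

```python
from gensim.models import FastText

def sentence_stream():
    # Hypothetical generator yielding one tokenized sentence at a time,
    # standing in for reading the real corpus incrementally.
    yield ["the", "quick", "brown", "fox"]
    yield ["jumps", "over", "the", "lazy", "dog"]

model = FastText(vector_size=50, window=3, min_count=1)

# The vocabulary must be built over the full corpus before training.
model.build_vocab(corpus_iterable=list(sentence_stream()))

# Feed sentences to train() one at a time.
for tokens in sentence_stream():
    model.train(corpus_iterable=[tokens], total_examples=1, epochs=1)

print(model.wv["fox"].shape)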

Does this make any difference to the final results? Is it possible to train the model without having to build such a large list and keep it in memory?


Answer

Since the volume of the data is very high, it is better to convert the text file into a .cor file: plain text with one sentence per line and tokens separated by spaces, which gensim can stream from disk. Then, read it in the following way:


As for the next step:

User contributions licensed under: CC BY-SA