I want to train a FastText model in Python using the gensim library. First, I should tokenize each sentence into its words, converting each sentence into a list of words. Each of these lists is then appended to a final list, so that at the end I have a nested list containing all the tokenized sentences:
import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = []

for line in open('sentences.txt'):
    # Tokenize each non-empty line into a list of words
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        word_tokenized_corpus.append(new)
Then, the model should be built as follows:
from gensim.models import FastText

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

ft_model = FastText(word_tokenized_corpus, size=embedding_size, window=window_size,
                    min_count=min_word, sample=down_sampling, sg=1, iter=100)
However, the number of sentences in word_tokenized_corpus is very large and the program cannot handle it. Is it possible to train the model by giving it each tokenized sentence one by one, like the following?
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        ft_model = FastText(new, size=embedding_size, window=window_size,
                            min_count=min_word, sample=down_sampling, sg=1, iter=100)
Does this make any difference to the final result? Is it possible to train the model without having to build such a large list and keep it in memory?
Answer
Since the volume of data is very high, it is better to let gensim stream the corpus from a file on disk instead of building the nested list in memory. FastText accepts a corpus_file argument, which expects a plain-text file in LineSentence format: one sentence per line, with tokens separated by whitespace (the .cor extension below is just the convention used in gensim's test data). Then, read it in the following way:
from gensim.test.utils import datapath

# datapath() points into gensim's bundled test-data directory; for your own
# corpus, pass the file's absolute path directly instead.
corpus_file = datapath('sentences.cor')
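If sentences.txt is not already in that one-sentence-per-line, space-separated form, a minimal conversion sketch, reusing the tokenizer from the question (the output name sentences.cor is only an assumption here), could look like this:

import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()

# Write one tokenized sentence per line, tokens separated by single spaces,
# which is the LineSentence format that corpus_file expects.
with open('sentences.txt') as src, open('sentences.cor', 'w') as dst:
    for line in src:
        tokens = word_punctuation_tokenizer.tokenize(line.strip())
        if tokens:
            dst.write(' '.join(tokens) + '\n')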
As for the next step:
from gensim.models import FastText

model = FastText(size=embedding_size, window=window_size, min_count=min_word,
                 sample=down_sampling, sg=1, iter=100)

# Build the vocabulary and train straight from the file on disk,
# so the full corpus never has to be held in memory at once.
model.build_vocab(corpus_file=corpus_file)
total_words = model.corpus_total_words
model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)
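Training via corpus_file streams the sentences from disk, so the large nested list never has to be built; apart from normal training randomness, the resulting vectors should be equivalent to training from the in-memory list. Once training finishes, the model can be queried like any other gensim FastText model; the word used below is only a placeholder:

# Vector for a single word (FastText can also handle out-of-vocabulary
# words through its character n-grams)
vector = model.wv['artificial']

# Nearest neighbours by cosine similarity
print(model.wv.most_similar('artificial', topn=5))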