I want to train a FastText model in Python using the gensim library. First, I should tokenize each sentence into its words, converting each sentence into a list of words. Each of these lists is then appended to a final list, so that at the end I have a nested list containing all the tokenized sentences:
import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = []

for line in open('sentences.txt'):
    # Tokenize each non-empty line into a list of words
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        word_tokenized_corpus.append(new)
Then, the model should be built as follows:
from gensim.models import FastText

embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

ft_model = FastText(word_tokenized_corpus, size=embedding_size, window=window_size,
                    min_count=min_word, sample=down_sampling, sg=1, iter=100)
However, the number of sentences in word_tokenized_corpus is very large and the program cannot handle it. Is it possible to train the model by giving it each tokenized sentence one by one, like the following?
for line in open('sentences.txt'):
    new = line.strip()
    new = word_punctuation_tokenizer.tokenize(new)
    if len(new) != 0:
        ft_model = FastText(new, size=embedding_size, window=window_size,
                            min_count=min_word, sample=down_sampling, sg=1, iter=100)
Does this make any difference to the final result? Is it possible to train the model without having to build such a large list and keep it in memory?
Answer
Since the volume of data is very high, it is better to let gensim stream the corpus from a file on disk instead of building the nested list in memory. FastText accepts a corpus_file argument, which expects a plain-text file in LineSentence format: one sentence per line, with tokens separated by whitespace (the .cor extension below is just the convention used in gensim's test data). Then, read it in the following way:
from gensim.test.utils import datapath

# datapath() points into gensim's bundled test-data directory; for your own
# corpus, pass the file's absolute path directly instead.
corpus_file = datapath('sentences.cor')
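If sentences.txt is not already in that one-sentence-per-line, space-separated form, a minimal conversion sketch, reusing the tokenizer from the question (the output name sentences.cor is only an assumption here), could look like this:

import nltk

word_punctuation_tokenizer = nltk.WordPunctTokenizer()

# Write one tokenized sentence per line, tokens separated by single spaces,
# which is the LineSentence format that corpus_file expects.
with open('sentences.txt') as src, open('sentences.cor', 'w') as dst:
    for line in src:
        tokens = word_punctuation_tokenizer.tokenize(line.strip())
        if tokens:
            dst.write(' '.join(tokens) + '\n')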
As for the next step:
from gensim.models import FastText

model = FastText(size=embedding_size, window=window_size, min_count=min_word,
                 sample=down_sampling, sg=1, iter=100)

# Build the vocabulary and train straight from the file on disk,
# so the full corpus never has to be held in memory at once.
model.build_vocab(corpus_file=corpus_file)
total_words = model.corpus_total_words
model.train(corpus_file=corpus_file, total_words=total_words, epochs=5)
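Training via corpus_file streams the sentences from disk, so the large nested list never has to be built; apart from normal training randomness, the resulting vectors should be equivalent to training from the in-memory list. Once training finishes, the model can be queried like any other gensim FastText model; the word used below is only a placeholder:

# Vector for a single word (FastText can also handle out-of-vocabulary
# words through its character n-grams)
vector = model.wv['artificial']

# Nearest neighbours by cosine similarity
print(model.wv.most_similar('artificial', topn=5))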