Gensim Word2Vec exhausting iterable

I’m getting the following log output when calling model.train() on a Gensim Word2Vec model:

INFO : EPOCH 0: training on 0 raw words (0 effective words) took 0.0s, 0 effective words/s

The only solutions I found while searching for an answer point to the iterable vs. iterator difference. At this point I’ve tried everything I could to solve this on my own; currently, my code looks like this:

import re
from gensim import utils
from gensim.models import Word2Vec

class MyCorpus:
    def __init__(self, corpus):
        self.corpus = corpus.copy()

    def __iter__(self):
        # Strip <br> tags and stray punctuation, then tokenize each string.
        for line in self.corpus:
            x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
            yield utils.simple_preprocess(x)

sentences = MyCorpus(corpus)
w2v_model = Word2Vec(
    sentences = sentences,
    vector_size = w2v_size, 
    window = w2v_window, 
    min_count = w2v_min_freq, 
    workers = -1
    )

The corpus variable is a list containing sentences, and each sentence is a string.

I tried numerous “tests” to check that my class is indeed iterable, like:

    # Each call should print the same non-zero count if the corpus can be re-iterated.
    print(sum(1 for _ in sentences))
    print(sum(1 for _ in sentences))
    print(sum(1 for _ in sentences))

All of them suggest that my class is indeed iterable, so at this point I think the problem must be something else.


Answer

workers=-1 is not a supported value for Gensim’s Word2Vec model; it essentially means you’re using no threads.

Instead, you must specify the actual number of worker threads you’d like to use.

When using an iterable corpus, the optimal number of workers is usually some number up to your CPU core count, but not higher than 8-12 if you’ve got 16+ cores, because of hard-to-remove inefficiencies in both Python’s Global Interpreter Lock (“GIL”) and Gensim’s master-reader-thread approach.
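For example, a corrected call might look like the sketch below, reusing the sentences iterable and the placeholder hyperparameters (w2v_size, w2v_window, w2v_min_freq) from the question; the cap of 8 workers is just one way to follow the guidance above, not a hard rule:

import os
from gensim.models import Word2Vec

# Use real worker threads: up to the CPU core count, illustratively
# capped at 8 per the GIL/reader-thread caveat above.
n_workers = min(os.cpu_count() or 1, 8)

w2v_model = Word2Vec(
    sentences = sentences,
    vector_size = w2v_size,
    window = w2v_window,
    min_count = w2v_min_freq,
    workers = n_workers
    )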

Generally, you’ll also get better throughput if your iterable isn’t doing anything expensive or repetitive in its preprocessing, such as regex-based tokenization that’s repeated on every epoch. It’s best to do such preprocessing once, writing the resulting space-delimited tokens to a new file, and then read that file back with a very simple, no-regex, space-splitting-only tokenization, as in the sketch below.
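A rough sketch of that one-pass approach, reusing the question’s regex and assuming a hypothetical output file named corpus_tokens.txt; Gensim’s LineSentence helper then re-reads the file each epoch with plain space-splitting:

import re
from gensim import utils
from gensim.models.word2vec import LineSentence

# One-time preprocessing pass: clean each raw string and write its
# space-delimited tokens as a single line of a plain-text file.
with open('corpus_tokens.txt', 'w', encoding='utf-8') as fout:
    for line in corpus:
        x = re.sub("(<br ?/?>)|([,.'])|([^ A-Za-z']+)", '', line.lower())
        fout.write(' '.join(utils.simple_preprocess(x)) + '\n')

# No regex work is repeated at training time: LineSentence just splits on spaces.
sentences = LineSentence('corpus_tokens.txt')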

(If performance becomes a major concern on a large dataset, you can also look into the alternate corpus_file method of specifying your corpus. It expects a single file, where each text is on its own line, and tokens are already just space-delimited. But it then lets every worker thread read its own range of the file, with far less GIL/reader-thread bottlenecking, so using workers equal to the CPU core count is then roughly optimal for throughput.)
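A minimal sketch of that mode, again assuming the preprocessed corpus_tokens.txt file from the previous example and the question’s placeholder hyperparameters:

import os
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    corpus_file = 'corpus_tokens.txt',  # one text per line, tokens already space-delimited
    vector_size = w2v_size,
    window = w2v_window,
    min_count = w2v_min_freq,
    workers = os.cpu_count() or 1  # full core count is roughly optimal in this mode
    )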
