Retrieve n-grams with word2vec

I have a list of texts. I turn each text into a token list. For example, if one of the texts is 'I am studying word2vec', the corresponding token list (assuming I consider n-grams with n = 1, 2, 3) will be ['I', 'am', 'studying', 'word2vec', 'I am', 'am studying', 'studying word2vec', 'I am studying', 'am studying word2vec'].
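A small sketch of that transformation (the helper name ngram_tokens is purely illustrative):

def ngram_tokens(text, max_n=3):
    # Split on whitespace, then emit every run of 1..max_n consecutive words.
    words = text.split()
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(' '.join(words[i:i + n]))
    return tokens

print(ngram_tokens('I am studying word2vec'))
# ['I', 'am', 'studying', 'word2vec', 'I am', 'am studying',
#  'studying word2vec', 'I am studying', 'am studying word2vec']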

  1. Is this the right way to transform any text in order to apply most_similar()?

(I could also delete n-grams that contain at least one stopword, but that’s not the point of my question.)

I call this list of lists of tokens texts. Now I build the model:

model = Word2Vec(texts)

then, if I use

words = model.most_similar('term', topn=5)

  2. Is there a way to determine what kind of results I will get? For example, if term is a 1-gram, will I get a list of five 1-grams? If term is a 2-gram, will I get a list of five 2-grams?


Answer

Generally, the very best way to determine “what kinds of results” you will get if you were to try certain things is to try those things, and observe the results you actually get.

In preparing text for word2vec training, it is not typical to convert an input text to the form you’ve shown, with a bunch of space-delimited word n-grams added. Rather, the string 'I am studying word2vec' would typically just be preprocessed/tokenized to a list of (unigram) tokens like ['I', 'am', 'studying', 'word2vec'].

The model will then learn one vector per single word – with no vectors for multigrams. And since it only knows such 1-word vectors, all the results it reports from .most_similar() will also be single words.
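For example, a minimal sketch with gensim (the tiny corpus and min_count=1 are purely illustrative; real training needs far more text):

from gensim.models import Word2Vec

# Each text is just a list of unigram tokens.
texts = [
    ['I', 'am', 'studying', 'word2vec'],
    ['word2vec', 'learns', 'word', 'vectors'],
    # ... more tokenized texts
]

model = Word2Vec(texts, min_count=1)

# Every result is a single token from the training vocabulary,
# never a space-separated n-gram.
print(model.wv.most_similar('word2vec', topn=5))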

You can preprocess your text to combine some words into multiword entities, based on some sort of statistical or semantic understanding of the text. Very often, this process converts runs of related words into underscore-connected single tokens. For example, 'I visited New York City' might become ['I', 'visited', 'New_York_City'].
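One common option for the statistical approach is gensim's Phrases model, sketched here with deliberately low min_count and threshold so the toy corpus produces merges (tune both on real data):

from gensim.models.phrases import Phrases

# tokenized_texts: lists of unigram tokens, as above.
tokenized_texts = [
    ['I', 'visited', 'new', 'york', 'city'],
    ['new', 'york', 'city', 'is', 'large'],
    # ... many more texts; statistical phrase detection needs volume
]

phrases = Phrases(tokenized_texts, min_count=1, threshold=1)

# Pairs that co-occur often enough come back joined with '_',
# e.g. 'new' + 'york' -> 'new_york'.
print(phrases[['I', 'visited', 'new', 'york', 'city']])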

But any such preprocessing decisions are separate from the word2vec algorithm itself, which just treats whatever ‘words’ you feed it as opaque keys, each mapped 1:1 to a vector learned during training. It only knows tokens, not n-grams.
