Retrieve n-grams with word2vec

I have a list of texts. I turn each text into a token list. For example, if one of the texts is 'I am studying word2vec', the corresponding token list (assuming I consider n-grams with n = 1, 2, 3) will be ['I', 'am', 'studying', 'word2vec', 'I am', 'am studying', 'studying word2vec', 'I am studying', 'am studying word2vec'].
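A small sketch of that transformation (the helper name ngram_tokens is purely illustrative):

def ngram_tokens(text, max_n=3):
    # Split on whitespace, then emit every run of 1..max_n consecutive words.
    words = text.split()
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(' '.join(words[i:i + n]))
    return tokens

print(ngram_tokens('I am studying word2vec'))
# ['I', 'am', 'studying', 'word2vec', 'I am', 'am studying',
#  'studying word2vec', 'I am studying', 'am studying word2vec']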

  1. Is this the right way to transform any text in order to apply most_similar()?

(I could also delete n-grams that contain at least one stopword, but that’s not the point of my question.)

I call this list of lists of tokens texts. Now I build the model:

model = Word2Vec(texts)

then, if I use

words = model.most_similar('term', topn=5)

  2. Is there a way to determine what kind of results I will get? For example, if term is a 1-gram, will I get a list of five 1-grams? If term is a 2-gram, will I get a list of five 2-grams?


Answer

Generally, the very best way to determine “what kinds of results” you will get if you were to try certain things is to try those things, and observe the results you actually get.

In preparing text for word2vec training, it is not typical to convert an input text to the form you’ve shown, with a bunch of space-delimited word n-grams added. Rather, the string 'I am studying word2vec' would typically just be preprocessed/tokenized to a list of (unigram) tokens like ['I', 'am', 'studying', 'word2vec'].

The model will then learn one vector per single word – with no vectors for multigrams. And since it only knows such 1-word vectors, all the results it reports from .most_similar() will also be single words.
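For example, a minimal sketch with gensim (the tiny corpus and min_count=1 are purely illustrative; real training needs far more text):

from gensim.models import Word2Vec

# Each text is just a list of unigram tokens.
texts = [
    ['I', 'am', 'studying', 'word2vec'],
    ['word2vec', 'learns', 'word', 'vectors'],
    # ... more tokenized texts
]

model = Word2Vec(texts, min_count=1)

# Every result is a single token from the training vocabulary,
# never a space-separated n-gram.
print(model.wv.most_similar('word2vec', topn=5))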

You can preprocess your text to combine some words into multiword entities, based on some sort of statistical or semantic understanding of the text. Very often, this process converts runs of related words into underscore-connected single tokens. For example, 'I visited New York City' might become ['I', 'visited', 'New_York_City'].
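One common option for the statistical approach is gensim's Phrases model, sketched here with deliberately low min_count and threshold so the toy corpus produces merges (tune both on real data):

from gensim.models.phrases import Phrases

# tokenized_texts: lists of unigram tokens, as above.
tokenized_texts = [
    ['I', 'visited', 'new', 'york', 'city'],
    ['new', 'york', 'city', 'is', 'large'],
    # ... many more texts; statistical phrase detection needs volume
]

phrases = Phrases(tokenized_texts, min_count=1, threshold=1)

# Pairs that co-occur often enough come back joined with '_',
# e.g. 'new' + 'york' -> 'new_york'.
print(phrases[['I', 'visited', 'new', 'york', 'city']])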

But any such preprocessing decisions are separate from the word2vec algorithm itself, which just treats whatever ‘words’ you feed it as opaque keys, each mapped 1:1 to a vector learned during training. It only knows tokens, not n-grams.
