
Modifying .trainables.syn1neg[i] with previously trained vectors in Gensim word2vec

My issue is the following.

In my code I modify .wv[word] after .build_vocab() but before training, which is fairly straightforward: for every word, I replace the vector that is there with my own.

for elem in setIntersection:
    if len(word_space[elem]) != 300:
        print('here', elem) #cast it to the fire
        sys.exit()
    w2vObjectRI.wv[elem] = np.asarray(word_space[elem], dtype=np.float32)

Here setIntersection is just the set of words common to the gensim word2vec vocabulary and my trained Random Indexing space. Both use vectors of size 300.

Now I also want to modify the hidden-to-output layer weights, which I was told are in .trainables.syn1neg[i]. Here is my issue: this matrix is not addressable by word; it is just a plain matrix without names. How can I know which word's row I am modifying? Also, I see that these weights are initialised with 0s. Are they reset before training? More precisely, if I change them and then call train, will it use the ones I provided? Thanks.

for i, elem in enumerate(setIntersection):  # a set cannot be indexed, so enumerate it
    if len(word_space[elem]) != 300:
        print('here', elem) #cast it to the fire
        sys.exit()
    # i is only the loop counter here; which row belongs to elem is exactly my question
    w2vObjectRI.trainables.syn1neg[i] = np.asarray(word_space[elem], dtype=np.float32)

Cheers,

Pedro.


Answer

In Gensim 4.0+, that “hidden to output layer” is just in w2v_model.syn1neg, instead of a (now-removed) subcomponent .trainables.

Following the original word2vec.c on which Gensim’s implementation is based, those weights start training as all zeros (unlike the input word-vectors, which start out random).

As the output (predicted-word) nodes are exactly the same vocabulary as are considered in the input/projection layer, the correspondence of rows-to-words is exactly the same as in the input layer, aka the word-vectors being trained. (That was previously in an array called .syn0, more recently called just .vectors.)

So the word that’s in slot 0 in w2v_model.wv.vectors is also the word represented by the output-node fed by w2v_model.syn1neg[0].

In Gensim 4.0+, these word-to-slot values can be read from w2v_model.wv.key_to_index[word]. (Pre-4.0, I think it was w2v_model.wv.vocab[word].index.)
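To make the row lookup concrete, here is a minimal sketch for Gensim 4.0+ that writes pretrained vectors into syn1neg using the same key_to_index mapping. The names seed_output_weights and demo_with_gensim, the toy sentences, and the 4-dimensional vectors are all made up for illustration; only wv.key_to_index, syn1neg, build_vocab() and train() come from Gensim itself. It also assumes negative sampling is enabled (negative > 0), since otherwise syn1neg is not allocated.

```python
def seed_output_weights(model, pretrained, dim):
    """Copy pretrained vectors into model.syn1neg, matching rows by word.

    model      -- a Word2Vec-like object exposing wv.key_to_index and syn1neg
    pretrained -- dict mapping word -> sequence of `dim` floats
    dim        -- expected vector length (300 in the question)
    Returns the list of words whose rows were overwritten.
    """
    updated = []
    for word, vec in pretrained.items():
        # Same slot number as the word's row in model.wv.vectors.
        row = model.wv.key_to_index.get(word)
        if row is None:
            continue  # word is not in the model's vocabulary; skip it
        if len(vec) != dim:
            raise ValueError(f"{word!r}: expected {dim} values, got {len(vec)}")
        model.syn1neg[row] = vec  # a NumPy row assignment casts to the array's dtype
        updated.append(word)
    return updated


def demo_with_gensim():
    """Hypothetical end-to-end use; needs gensim >= 4.0 and numpy installed."""
    import numpy as np
    from gensim.models import Word2Vec

    sentences = [["cat", "sat", "mat"], ["dog", "sat", "log"]]  # toy corpus
    model = Word2Vec(vector_size=4, min_count=1, negative=5, hs=0)
    model.build_vocab(sentences)  # allocates wv.vectors and syn1neg

    pretrained = {
        "cat": np.ones(4, dtype=np.float32),
        "unseen": np.ones(4, dtype=np.float32),  # not in the vocab, gets skipped
    }
    print(seed_output_weights(model, pretrained, dim=4))
    model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
```

As far as I know, train() works on the existing arrays rather than re-creating them, so values written after build_vocab() should be what training starts from. That is worth confirming on your own version, for example by checking that a seeded row of syn1neg is still non-zero right before the train() call.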

User contributions licensed under: CC BY-SA