
Doc2Vec find the similar sentence

I am trying to find similar sentences using doc2vec. What I am not able to find is the actual sentence, among the trained sentences, that matches.

Below is the code from this article:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amazingly well"]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,  # this parameter was called `size` in older gensim releases
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

model = Doc2Vec.load("d2v.model")
# to find the vector of a document which is not in the training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)

# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)


# to find vector of doc in training data using tags or in other words, printing the vector of document at index 1 in training data
print(model.docvecs['1'])

But the above code only gives me vectors or numbers. How can I get the actual sentence matched from the training data? For example, in this case I am expecting the result to be "I love building chatbots".


Answer

The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]

Each entry is a (tag, similarity) pair: the tag of a training document and its cosine similarity to the requested document. The list is sorted by similarity in descending order.

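Based on this, the document tagged '2' is the closest to the requested document. Note that most_similar('1') looks up the trained vector for tag '1' ("I love coding in python"); it does not use test_data or the inferred vector v1. Because the tags were created as str(i), a tag converts straight back into an index into data: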

print(data[int(similar_doc[0][0])])
# with the output shown above, this prints: I love building chatbots
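To list every match rather than only the top one, the same tag-to-index lookup can be wrapped in a small helper. This is a minimal sketch: `sentences_for` is a hypothetical name, not part of gensim, and `similar_doc` is hard-coded to the output quoted above so the snippet runs without a trained model.

```python
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amazingly well"]

# (tag, similarity) pairs copied from the most_similar('1') output quoted above.
similar_doc = [('2', 0.991769552230835),
               ('0', 0.989276111125946),
               ('3', 0.9854298830032349)]

def sentences_for(similar, sentences):
    """Map (tag, score) pairs to (sentence, score) pairs.

    Hypothetical helper: it assumes every tag is str(index) into `sentences`,
    which is how tagged_data was built in the question.
    """
    return [(sentences[int(tag)], score) for tag, score in similar]

matches = sentences_for(similar_doc, data)
for sentence, score in matches:
    print(f"{score:.4f}  {sentence}")
```

If the tags were not plain indices (for example file names), a dict from tag to sentence would be the safer lookup.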

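Note: this code gives different results on every run. Doc2Vec training and infer_vector both involve randomness, and four short sentences are very little data, so the similarity ranking is not stable. You will probably need more training data, and possibly a better-tuned model, before the results become meaningful.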

User contributions licensed under: CC BY-SA