I am trying to find similar sentences using doc2vec. What I am not able to find is the actual sentence from the training data that matches.
Below is the code from this article:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]
tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
max_epochs = 100
vec_size = 20
alpha = 0.025
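model = Doc2Vec(size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
# build the vocabulary before training
model.build_vocab(tagged_data)
for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.iter)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha
model.save("d2v.model")
print("Model Saved")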
model= Doc2Vec.load("d2v.model")
#to find the vector of a document which is not in training data
test_data = word_tokenize("I love building chatbots".lower())
v1 = model.infer_vector(test_data)
print("V1_infer", v1)
# to find most similar doc using tags
similar_doc = model.docvecs.most_similar('1')
print(similar_doc)
# to find the vector of a doc in the training data using its tag, i.e. print the vector of the document at index 1 in the training data
print(model.docvecs['1'])
But the above code only gives me vectors or numbers. How can I get the actual sentence that was matched from the training data? For example, in this case I am expecting the result to be “I love building chatbots”.
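The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]
Each entry is a document tag paired with that document's similarity score to the requested document, sorted in descending order. The tags were built with str(i), so each tag is the document's index in data. Based on this, index '2' in data is the closest match. Note that most_similar('1') compares against the training document tagged '1'; to compare against test_data instead, pass the inferred vector: model.docvecs.most_similar([v1]).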
print(data[int(similar_doc[0][0])])
# prints: I love building chatbots
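Note: this code gives different results on every run, because Doc2Vec training and infer_vector are randomized and four short sentences are far too little data for stable vectors. You will probably need a better model or more training data.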
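To show the tag-to-sentence lookup for every result rather than only the top one, here is a minimal sketch. It uses the similar_doc output quoted above as a literal, so it runs without a trained model. The helper name tags_to_sentences is hypothetical (not part of gensim), and the sketch assumes the tags are string indices into data, as in the question's code.

```python
# Training sentences, in the same order used to build the tags ('0', '1', ...).
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# Output of model.docvecs.most_similar('1') as quoted in the answer above.
similar_doc = [('2', 0.991769552230835),
               ('0', 0.989276111125946),
               ('3', 0.9854298830032349)]


def tags_to_sentences(results, sentences):
    """Hypothetical helper: replace each tag with the sentence it indexes.

    Assumes every tag is str(i) for the sentence at position i.
    """
    return [(sentences[int(tag)], score) for tag, score in results]


matches = tags_to_sentences(similar_doc, data)
for sentence, score in matches:
    print("{:.4f}  {}".format(score, sentence))
```

If the tags were something other than list indices (for example file names or database ids), a dict from tag to sentence built alongside tagged_data would play the same role.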