I am trying to find similar sentences using doc2vec. What I cannot work out is how to get the actual sentence, from the trained sentences, that matches.
Below is the code from this article:
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from nltk.tokenize import word_tokenize

    data = ["I love machine learning. Its awesome.",
            "I love coding in python",
            "I love building chatbots",
            "they chat amagingly well"]

    tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]

    max_epochs = 100
    vec_size = 20
    alpha = 0.025

    model = Doc2Vec(size=vec_size,
                    alpha=alpha,
                    min_alpha=0.00025,
                    min_count=1,
                    dm=1)

    model.build_vocab(tagged_data)

    for epoch in range(max_epochs):
        print('iteration {0}'.format(epoch))
        model.train(tagged_data,
                    total_examples=model.corpus_count,
                    epochs=model.iter)
        # decrease the learning rate
        model.alpha -= 0.0002
        # fix the learning rate, no decay
        model.min_alpha = model.alpha

    model.save("d2v.model")
    print("Model Saved")

    model = Doc2Vec.load("d2v.model")
    # to find the vector of a document which is not in training data
    test_data = word_tokenize("I love building chatbots".lower())
    v1 = model.infer_vector(test_data)
    print("V1_infer", v1)

    # to find most similar doc using tags
    similar_doc = model.docvecs.most_similar('1')
    print(similar_doc)

    # to find vector of doc in training data using tags or in other words,
    # printing the vector of document at index 1 in training data
    print(model.docvecs['1'])
But the above code only gives me vectors or numbers. How can I get the actual matching sentence from the training data? For example, in this case I expect the result to be "I love building chatbots".
Answer
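The output of similar_doc is: [('2', 0.991769552230835), ('0', 0.989276111125946), ('3', 0.9854298830032349)]
Each pair is a document tag and its similarity score against the requested document, sorted in descending order of similarity.
Based on this, the document at index '2' in data is the closest match. Note that the requested document here is the training document tagged '1', because the call is most_similar('1'); to match test_data instead, pass the inferred vector, e.g. model.docvecs.most_similar([v1]). Since each tag is the sentence's index in data, converting the tag to an int retrieves the sentence: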
    print(data[int(similar_doc[0][0])])  # prints: I love building chatbots
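To list every match as text rather than only the top one, the same lookup can be applied to the whole result. This is a minimal sketch that needs no trained model: it reuses the data list from the question and the similar_doc output quoted above, and tags_to_sentences is a hypothetical helper name, not part of gensim. It assumes the tags were created as str(i), as in the question.

```python
# Training sentences, copied from the question; each tag is the sentence's index as a string.
data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# Output of most_similar, copied from the answer above.
similar_doc = [('2', 0.991769552230835),
               ('0', 0.989276111125946),
               ('3', 0.9854298830032349)]


def tags_to_sentences(similar, sentences):
    """Hypothetical helper: swap each tag for the sentence it was built from."""
    return [(sentences[int(tag)], score) for tag, score in similar]


matches = tags_to_sentences(similar_doc, data)
for sentence, score in matches:
    print("{:.4f}  {}".format(score, sentence))
```

If the tags were arbitrary strings instead of indexes, a dict from tag to sentence built alongside tagged_data would play the role of data here.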
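Note: this code gives different results on every run. Doc2Vec training starts from randomly initialised weights, and four short sentences are far too little data for the vectors to stabilise, so the ranking can change between runs; more training data should help.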
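For intuition about where the scores come from: most_similar ranks the stored document vectors by cosine similarity to the query. The sketch below does the same ranking by hand in plain Python and returns sentences instead of tags. The 3-dimensional vectors are made up for illustration and are not real model output; in practice they would come from model.docvecs[tag] and model.infer_vector(test_data). The names cosine and rank_sentences are hypothetical.

```python
import math

data = ["I love machine learning. Its awesome.",
        "I love coding in python",
        "I love building chatbots",
        "they chat amagingly well"]

# Made-up vectors standing in for model.docvecs[tag]; NOT real model output.
doc_vectors = {
    '0': [0.9, 0.1, 0.2],
    '1': [0.7, 0.6, 0.1],
    '2': [0.2, 0.9, 0.3],
    '3': [0.1, 0.2, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sentences(query, vectors, sentences):
    """Score every stored vector against the query, most similar first."""
    scored = [(sentences[int(tag)], cosine(query, vec))
              for tag, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up stand-in for model.infer_vector(test_data).
query = [0.25, 0.85, 0.3]

ranked = rank_sentences(query, doc_vectors, data)
print(ranked[0][0])
```

With real vectors, a top score that is only marginally ahead of the rest (as in the output quoted in the answer) is a sign that the ranking may not be stable across runs.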