I have already trained a gensim Doc2Vec model, and it finds the documents most similar to an unknown one.
Now I need to find the similarity value between two unknown documents (they were not in the training data, so they cannot be referenced by doc id).
d2v_model = doc2vec.Doc2Vec.load(model_file)
string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'
vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())
In the code above, vec1 and vec2 are successfully initialized to some values, and each has length 'vector_size'.
Looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.
Can I compare the feature vectors value by value, so that closer values mean more similar texts?
Answer
In case someone is interested: to do this, you just need the cosine distance between the two vectors.
I found that most people use SciPy's 'spatial' module (scipy.spatial) for this purpose.
Here is a small code snippet that should work well if you already have a trained Doc2Vec model:
from gensim.models import doc2vec
from scipy import spatial

# model_file is the path to the already trained model
d2v_model = doc2vec.Doc2Vec.load(model_file)

first_text = '..'
second_text = '..'

# infer a vector for each unseen text from its list of tokens
vec1 = d2v_model.infer_vector(first_text.split())
vec2 = d2v_model.infer_vector(second_text.split())

cos_distance = spatial.distance.cosine(vec1, vec2)
# cos_distance indicates how much the two texts differ from each other:
# higher values mean more distant (i.e. different) texts
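If you would rather have a similarity score than a distance, cosine similarity is just one minus the cosine distance. Below is a minimal sketch of that relationship in plain Python, without the SciPy dependency. The helper names cosine_similarity and cosine_distance are my own, and the three short lists are made-up stand-ins for the vectors that infer_vector would return, not real model output.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Hypothetical stand-ins for inferred document vectors (illustration only).
vec1 = [1.0, 2.0, 3.0]
vec2 = [2.0, 4.0, 6.0]   # same direction as vec1
vec3 = [-3.0, 0.0, 1.0]  # orthogonal to vec1 (dot product is 0)

print(cosine_similarity(vec1, vec2))  # close to 1: same direction
print(cosine_distance(vec1, vec3))    # close to 1: unrelated directions
```

Similarity ranges from -1 to 1 and the distance from 0 to 2, so a distance near 0 means the two texts point in nearly the same direction in the embedding space. The functions only iterate over their arguments, so they should also accept the arrays returned by infer_vector.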