Measure similarity between two documents using Doc2Vec

I have already trained a gensim Doc2Vec model, which finds the documents most similar to an unknown one.

Now I need to find the similarity value between two unknown documents (which were not in the training data, so they cannot be referenced by doc id).

from gensim.models import doc2vec

d2v_model = doc2vec.Doc2Vec.load(model_file)

string1 = 'this is some random paragraph'
string2 = 'this is another random paragraph'

vec1 = d2v_model.infer_vector(string1.split())
vec2 = d2v_model.infer_vector(string2.split())

In the code above, vec1 and vec2 are successfully initialized to vectors of size vector_size.

Looking through the gensim API and examples, I could not find a method that works for me; all of them expect a TaggedDocument.

Can I compare the feature vectors value by value, so that the closer the values are, the more similar the texts?

Answer

In case someone is interested: to do this, you just need the cosine distance between the two vectors.
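To make that concrete, here is a minimal pure-Python sketch of what cosine distance computes. The function name cosine_distance and the toy vectors are only for illustration (they are not part of gensim or scipy), and the sketch assumes neither vector is all zeros. It shows that the measure depends on the angle between the vectors rather than on a value-by-value difference.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Same direction, different length: the distance is (approximately) 0,
# even though the individual values differ.
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Perpendicular vectors: the distance is 1.
perpendicular = cosine_distance([1.0, 0.0], [0.0, 1.0])
```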

I found that most people use scipy's spatial module for this purpose.

Here is a small code snippet that should work well if you already have a trained Doc2Vec model:

from gensim.models import doc2vec
from scipy import spatial

d2v_model = doc2vec.Doc2Vec.load(model_file)

first_text = '..'
second_text = '..'

vec1 = d2v_model.infer_vector(first_text.split())
vec2 = d2v_model.infer_vector(second_text.split())

cos_distance = spatial.distance.cosine(vec1, vec2)
# cos_distance indicates how much the two texts differ from each other:
# higher values mean more distant (i.e. different) texts
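If you would rather have a similarity score (higher means more alike) than a distance, cosine similarity is simply 1 minus the cosine distance. Below is a rough sketch of a helper that wraps both steps; document_similarity is a hypothetical name, and the only thing it assumes about model is that it has an infer_vector(words) method, as a loaded Doc2Vec model does. It splits on whitespace like the snippet above, so adjust the tokenization if your model was trained with different preprocessing.

```python
import math

def document_similarity(model, text_a, text_b):
    """Cosine similarity of two unseen texts.

    `model` is assumed to expose infer_vector(list_of_words),
    e.g. a trained gensim Doc2Vec model. Assumes non-zero vectors.
    """
    vec_a = [float(x) for x in model.infer_vector(text_a.split())]
    vec_b = [float(x) for x in model.infer_vector(text_b.split())]

    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))

    # Equivalent to 1 - cosine distance: 1 means same direction,
    # 0 means unrelated (perpendicular), negative means opposite.
    return dot / (norm_a * norm_b)
```

With the model loaded above, this would be called as document_similarity(d2v_model, first_text, second_text). Keep in mind that inferred vectors can differ slightly between calls, so the score for the same pair of texts may not be identical from run to run.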