Using NearestNeighbors and word2vec to detect sentence similarity

Question

I have trained a word2vec model on my corpus using Python and gensim. Then I calculated the mean word2vec vector for each sentence (averaging the vectors of all the words in the sentence) and stored it in a pandas data frame. The columns of the pandas data frame df are: sentence, Book title (the book w…

Accepted Answer

Your edit throws an error because sklearn expects a 2D input, with each example in its own row. You can use either X.reshape(1, -1) or [X]; the first is better practice. Without the raw data or a proper MWE it's hard to say exactly what is going wrong, but my guess is that the problem lies in putting the data into the dataframe or getting it back out. Check that X.shape makes sense to you.

Below is the example I used to check that everything worked for me:

    from sklearn.neighbors import NearestNeighbors
    from gensim.models import Word2Vec
    import numpy as np

    a = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore
    magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
    Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
    Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."""

    # Treat each line as one sentence and split it into word tokens
    a = [x.split() for x in a.split('\n') if len(x)]
    model = Word2Vec(a, min_count=1)

    # Get the average of all of the words to get data for a sentence
    # (gensim 4 looks vectors up via model.wv[word]; older versions also accepted model[word])
    b = np.array([np.mean([model.wv[xx] for xx in x], axis=0) for x in a])

    # Check it's the correct shape: (number of sentences, vector size)
    print(b.shape)

    nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(b)
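To make the 2D requirement concrete, here is a minimal sketch of getting the vectors out of a dataframe and querying with a single vector. It assumes the mean vectors sit in a column I have called mean_vector (a made-up name, as is the toy data), and it uses random numbers instead of real word2vec output, so it only illustrates the shapes involved.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the real data: four sentences with 5-dimensional
# mean vectors (random numbers here, not real word2vec output).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "sentence": ["s0", "s1", "s2", "s3"],
    "mean_vector": [rng.normal(size=5) for _ in range(4)],  # assumed column name
})

# A column of arrays is a 1D column of objects; stack it into a proper
# 2D array of shape (n_sentences, n_dims) before handing it to sklearn.
X = np.vstack(df["mean_vector"].tolist())
print(X.shape)

nbrs = NearestNeighbors(n_neighbors=2, algorithm="ball_tree").fit(X)

# A single vector has shape (5,); reshape it to (1, 5) so it counts as one row.
query = X[0].reshape(1, -1)
distances, indices = nbrs.kneighbors(query)

# The query is one of the fitted rows, so its closest match is itself.
print(df["sentence"].iloc[indices[0]].tolist())
```

If X.shape comes back as (n,) rather than (n, dims), the vectors were probably stored or read back as objects or strings, which is the first thing I would check.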

Answer