I have trained a word2vec model on my corpus using Python and gensim.
Then I calculated the mean word2vec vector for each sentence (averaging all the vectors for all the words in the sentence) and stored it in a pandas data frame.
The columns of the pandas data frame df are:
- sentence
- Book title (the book where the sentence comes from)
- mean_vector (the mean of the word2vec vectors in the sentence, size 100)
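A data frame with this layout can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy `word_vectors` dictionary stands in for the trained gensim model, 4-dimensional vectors stand in for the size-100 ones, and the helper `mean_vector`, the column name `book_title`, and the book titles are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the trained word2vec model: word -> vector.
# The real vectors have size 100; 4 dimensions keep the example short.
word_vectors = {
    "the": np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
    "cat": np.array([0.5, 0.1, 0.0, 0.2], dtype=np.float32),
    "sat": np.array([0.3, 0.3, 0.3, 0.3], dtype=np.float32),
    "dog": np.array([0.4, 0.0, 0.1, 0.3], dtype=np.float32),
}

def mean_vector(sentence):
    """Average the vectors of all known words in a sentence."""
    vectors = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return np.mean(vectors, axis=0)

sentences = ["the cat sat", "the dog sat"]
df = pd.DataFrame({
    "sentence": sentences,
    "book_title": ["Book A", "Book B"],  # placeholder titles
    "mean_vector": [mean_vector(s) for s in sentences],
})
```

Each cell of `mean_vector` then holds one NumPy array, which matches the cell content shown further down.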
I am trying to use scikit-learn's NearestNeighbors to detect sentence similarity (I could probably use doc2vec instead, but one of the objectives is to compare this method against doc2vec).
This is my code:
from sklearn.neighbors import NearestNeighbors
X = df['mean_vector'].values
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
I get the following error:
ValueError: setting an array element with a sequence.
I think I should somehow iterate over the vectors, so that the nearest neighbours of each row are calculated on a per-row (row == sentence) basis, but this seems to exceed my current (limited) Python skills.
This is the content of the first cell, df['mean_vector'][0]. It is a full vector of size 100, averaged over the vectors of the words in the sentence.
array([ -2.14208905e-02, 2.42093615e-02, -5.78106642e-02, 1.32915592e-02, -2.43393257e-02, -1.41872400e-02, 2.83471867e-02, -2.02910602e-02, -5.49359620e-02, -6.70913085e-02, -5.56188896e-02, -2.95186806e-02, 4.97652516e-02, 7.16793686e-02, 1.81338750e-02, -1.50108105e-02, 1.79438610e-02, -2.41483524e-02, 4.97504435e-02, 2.91026086e-02, -6.87966943e-02, 3.27585079e-02, 5.10644279e-02, 1.97029337e-02, 7.73109496e-02, 3.23865712e-02, -2.81659551e-02, -9.69715789e-03, 5.23059331e-02, 3.81100960e-02, -3.62489261e-02, -3.40068117e-02, -4.90736961e-02, 8.72346922e-04, 2.27111522e-02, 1.06063476e-02, -3.93234752e-02, -1.10617064e-01, 8.05142429e-03, 4.56497036e-02, -1.73281748e-02, 2.35153548e-02, 5.13465842e-03, 1.88336968e-02, 2.40451116e-02, 3.79024050e-03, -4.83284928e-02, 2.10295208e-02, -4.92134318e-03, 1.01532964e-02, 8.02216958e-03, -6.74675079e-03, -1.39653292e-02, -2.07276996e-02, 9.73508134e-03, -7.37899616e-02, -2.58320477e-02, -1.10700730e-05, -4.53227758e-02, 2.31859135e-03, 1.40053956e-02, 1.61973312e-02, 3.01702786e-02, -6.96818605e-02, -3.47468331e-02, 4.79541793e-02, -1.78820305e-02, 5.99209731e-03, -5.92620336e-02, 7.34678581e-02, -5.23381204e-05, -5.07357903e-02, -2.55154949e-02, 5.06089740e-02, -3.70467864e-02, -2.04878468e-02, -7.62404222e-03, -5.38200373e-03, 7.68705690e-03, -3.27000804e-02, -2.18365286e-02, 2.34392099e-03, -3.02998684e-02, 9.42565035e-03, 3.24523374e-02, -1.10793915e-02, 3.06244520e-03, -1.82240941e-02, -5.70741761e-03, 3.13486941e-02, -1.15621388e-02, 1.10221673e-02, -3.55655849e-02, -4.56304513e-02, 5.54837054e-03, 4.38252240e-02, 1.57828294e-02, 2.65670624e-02, 8.08797963e-03, 4.55569401e-02], dtype=float32)
I have also tried to do:
for vec in df['mean_vector']:
    X = vec
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
But I only get the following warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
If there is an example on GitHub using word2vec and NearestNeighbors in a similar scenario, I would love to see it.
Answer
The reason your edit triggers that warning is that sklearn expects a 2D input, with each example in its own row. You can use either X.reshape(1, -1) or [X]; the first is better practice. Without the raw data or a proper MWE it's hard to say exactly what is going wrong, but my guess is that something goes wrong either when putting the data into the dataframe or when taking it out. Check that X.shape makes sense to you.
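One likely culprit, assuming each cell of df['mean_vector'] holds a NumPy array as shown in the question: `.values` then returns a 1-D object array of arrays instead of a 2-D float matrix, and converting that to floats fails with "setting an array element with a sequence". The sketch below uses a made-up three-row data frame with random vectors to show the shape check, and `np.vstack` as one possible way to build the matrix sklearn expects.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame: three rows, each cell holding a size-100 vector.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mean_vector": [rng.normal(size=100).astype(np.float32) for _ in range(3)],
})

# .values gives a 1-D object array whose elements are arrays, not a 2-D matrix.
raw = df["mean_vector"].values
print(raw.shape, raw.dtype)

# Stack the per-row vectors into one (n_sentences, 100) float matrix.
X = np.vstack(raw)
print(X.shape, X.dtype)
```

If the stacked `X` has shape (number of sentences, 100), it should be suitable to pass to `NearestNeighbors(...).fit(X)` in a single call, with no loop over the rows.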
Below is the example I used to check everything worked for me:
from sklearn.neighbors import NearestNeighbors
from gensim.models import Word2Vec
import numpy as np

# Sample text, one sentence per line
a = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."""

# Split into non-empty lines, then split each line into a list of words
a = [x.split(' ') for x in a.split('\n') if len(x)]

model = Word2Vec(a, min_count=1)

# Get the average of all of the words to get data for a sentence
# (model.wv[...] replaces the older model[...] lookup, which gensim 4 removed)
b = np.array([np.mean([model.wv[xx] for xx in x], axis=0) for x in a])

# Check it's the correct shape: (number of sentences, vector size)
print(b.shape)

nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(b)
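Once the model is fitted, the neighbours can be looked up with `kneighbors`. The following is a minimal sketch, not part of what I ran above: it uses a random matrix as a stand-in for the sentence vectors (5 made-up sentences of size 100), so the actual distances are meaningless and only the shapes and the lookup pattern matter.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-in for the matrix of mean sentence vectors:
# 5 sentences, 100 dimensions each.
rng = np.random.default_rng(0)
b = rng.normal(size=(5, 100))

nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(b)

# Query with the training data itself: for each row, the first neighbour
# is the row itself (distance 0) and the second is the closest other row.
distances, indices = nbrs.kneighbors(b)

for i, (dist, idx) in enumerate(zip(distances, indices)):
    print(f"sentence {i}: nearest other sentence {idx[1]} (distance {dist[1]:.3f})")
```

The row numbers in `indices` refer to rows of the matrix passed to `fit`, so with a data frame they can be mapped back to the sentence and book title columns by position.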