Extract sentence embedding features with Pandas and spaCy

Question

I'm currently learning spaCy, and I have an exercise on word and sentence embeddings. Sentences are stored in a pandas DataFrame column, and we're asked to train a classifier based on the vectors of these sentences. I have a dataframe that looks like this: Next, I apply an NLP function to the…

Accepted Answer

Actually, using a single vector that averages all token vectors does yield good results in a classification model. What was needed was indeed a DataFrame of 300 columns per sentence (since 300 is the standard length of spaCy word embeddings). So, to continue @Sergey's code:

    import pandas as pd

    # nlp is the spaCy pipeline already loaded in @Sergey's code
    sents = ["'Whitey on the Moon' is a 1970 spoken word",
             "St Anselm's Church is a Roman Catholic church",
             "Nymphargus grandisonae (common name: giant)"]
    df = pd.DataFrame({"sentence": sents})
    df['tokenized'] = df['sentence'].apply(nlp)
    # Doc.vector is the average of the token vectors in the Doc
    df['sent_vectors'] = df['tokenized'].apply(lambda x: x.vector)
    # Expand each sentence vector into one column per dimension
    vectors = df['sent_vectors'].apply(pd.Series)

With this, vectors contains the features on which a model can be trained. For instance, assuming each sentence has a sentiment attached to it (a 'sentiment' column in df):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X = vectors
    y = df['sentiment']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = LogisticRegression()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

What I couldn't do is remove stopwords from the DataFrame entries (i.e. remove each Token object where is_stop is True from the parent Doc object stored in the dataframe).
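On the stopword point, one possible workaround (a sketch, not something from the original answer) is to leave the Doc objects untouched and skip stopwords only when computing the sentence vector, i.e. average the vectors of the tokens whose is_stop is False. The helper name mean_vector_without_stopwords and the FakeToken class below are hypothetical; FakeToken only stands in for a spaCy Token so the example runs without a language model, and the 3-dimensional vectors are made up for illustration.

```python
from typing import Iterable, NamedTuple

import numpy as np
import pandas as pd


class FakeToken(NamedTuple):
    """Hypothetical stand-in exposing the two spaCy Token attributes used here."""
    is_stop: bool
    vector: np.ndarray


def mean_vector_without_stopwords(tokens: Iterable, dim: int) -> np.ndarray:
    """Average the vectors of all tokens that are not stopwords."""
    kept = [tok.vector for tok in tokens if not tok.is_stop]
    if not kept:
        # Sentence made only of stopwords: fall back to a zero vector.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(kept, axis=0)


# Two toy "sentences"; the second one contains only a stopword.
docs = [
    [FakeToken(True, np.array([1.0, 1.0, 1.0])),
     FakeToken(False, np.array([2.0, 4.0, 6.0])),
     FakeToken(False, np.array([4.0, 0.0, 2.0]))],
    [FakeToken(True, np.array([5.0, 5.0, 5.0]))],
]

# One row per sentence, one column per vector dimension.
rows = [mean_vector_without_stopwords(doc, dim=3) for doc in docs]
features = pd.DataFrame(np.vstack(rows))
print(features)
```

With real spaCy Docs the same helper should apply directly, since iterating a Doc yields Token objects that have is_stop and vector attributes, for example df['tokenized'].apply(lambda doc: mean_vector_without_stopwords(doc, dim=300)), assuming a model with 300-dimensional vectors.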

Answer