Extract sentence embedding features with pandas and spaCy

I’m currently learning spaCy, and I have an exercise on word and sentence embeddings. The sentences are stored in a pandas DataFrame column, and we’re asked to train a classifier on the vectors of these sentences.

I have a dataframe that looks like this:

JavaScript

Next, I apply an NLP function to these sentences:

JavaScript

Now, if I understand correctly, each item in df['tokenized'] has an attribute that returns the vector of the sentence in a 2D array.

JavaScript

yields

JavaScript

How do I add the content of this array (300 rows) as columns to the df dataframe for the corresponding sentence, ignoring stop words?

Thanks!

Answer

Actually, using a single vector that averages all token vectors does yield good results in a classification model. What was needed was indeed a DataFrame with 300 columns per sentence (300 being the usual length of spaCy word embeddings).

So, to continue @Sergey’s code:

JavaScript
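Since the original snippet is not reproduced here, the following is only a minimal sketch of the idea described above: stack each sentence vector into one row and expand it into one column per dimension. It assumes the parsed documents live in df['tokenized'] and expose a .vector attribute, as spaCy Doc objects do. FakeDoc is a hypothetical stand-in with 4-dimensional vectors so the snippet runs without a language model, and the dim_ column prefix is an arbitrary choice.

```python
import numpy as np
import pandas as pd


class FakeDoc:
    """Hypothetical stand-in for a spaCy Doc: it only exposes `.vector`."""

    def __init__(self, vector):
        self.vector = np.asarray(vector, dtype="float32")


df = pd.DataFrame({
    "sentence": ["first sentence", "second sentence"],
    "tokenized": [
        FakeDoc([0.1, 0.2, 0.3, 0.4]),
        FakeDoc([0.5, 0.6, 0.7, 0.8]),
    ],
})

# Stack the per-sentence vectors into a (n_sentences, n_dims) matrix.
matrix = np.vstack([doc.vector for doc in df["tokenized"]])

# One column per embedding dimension, aligned on the original index.
vectors = pd.DataFrame(
    matrix,
    index=df.index,
    columns=[f"dim_{i}" for i in range(matrix.shape[1])],
)

# Optionally attach the features to the original frame.
df_features = df.join(vectors)
```

With real spaCy documents from a model that ships 300-dimensional vectors, the same code would presumably produce 300 feature columns instead of 4.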

With this, vectors contains the features on which a model can be trained. For instance, assuming each sentence has a sentiment attached to it:

JavaScript

What I couldn’t do is remove the stop words from the DataFrame entries (i.e. remove from the parent Doc object stored in the DataFrame each Token object where is_stop is True).
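One possible workaround, sketched here under assumptions rather than taken from the answer: instead of deleting tokens from the Doc, average only the vectors of the tokens that are not stop words. The helper name mean_vector_without_stops is hypothetical. It relies on tokens exposing is_stop, has_vector and vector, which spaCy Token objects do. FakeToken is a stand-in with 2-dimensional vectors so the snippet runs without a language model.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token."""
    text: str
    is_stop: bool
    vector: np.ndarray
    has_vector: bool = True


def mean_vector_without_stops(doc, dim):
    """Average the vectors of non-stop tokens; return zeros if none remain."""
    kept = [tok.vector for tok in doc if not tok.is_stop and tok.has_vector]
    if not kept:
        return np.zeros(dim, dtype="float32")
    return np.mean(kept, axis=0)


# A "document" is just an iterable of tokens here.
doc = [
    FakeToken("the", True, np.array([9.0, 9.0])),
    FakeToken("cat", False, np.array([1.0, 3.0])),
    FakeToken("sleeps", False, np.array([3.0, 5.0])),
]

# Only "cat" and "sleeps" contribute to the average.
result = mean_vector_without_stops(doc, dim=2)
```

Because iterating over a spaCy Doc yields its tokens, the same helper could presumably be applied to the stored documents with df['tokenized'].apply(...), passing dim=300, and the resulting vectors expanded into columns as before.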

User contributions licensed under: CC BY-SA