How to get feature names for GloVe vectors



CountVectorizer exposes feature names, like this:

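from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=10, ngram_range=(1, 4), max_features=15000)
vectorizer.fit(X_train['essay'].values)  # fit only on the training data

X_train_essay_bow = vectorizer.transform(X_train['essay'].values)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vectorizer.get_feature_names_out()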

What would the feature names be for GloVe vectors, and how do I get them?

import pickle

# 'glove_vectors' is a pickled dict mapping each word to its vector
with open('glove_vectors', 'rb') as f:
    model = pickle.load(f)
    glove_words = set(model.keys())

I have a file of 300-dimensional GloVe vectors, loaded as shown above.

What would be the names of the 300 dimensions of the GloVe vectors?

Answer

There are no names for GloVe features. CountVectorizer counts the occurrences of each token (or n-gram) in each document, so its features have easily understandable names: the feature “cat” is the number of times the token “cat” occurs in each document.
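
As a rough illustration of why count features come with names, here is a hand-rolled bag-of-words in plain Python. It is a stand-in for what CountVectorizer does, not its implementation; the two sample sentences and the helper `bow_row` are made up for this sketch.

```python
from collections import Counter

# Two made-up documents, purely for illustration.
sentences = ["the cat sat", "the cat saw the dog"]

# One column per distinct token; the column's name is the token itself.
feature_names = sorted({tok for s in sentences for tok in s.split()})


def bow_row(sentence):
    """Count how often each vocabulary token occurs in one sentence."""
    counts = Counter(sentence.split())
    return [counts[name] for name in feature_names]


matrix = [bow_row(s) for s in sentences]

print(feature_names)
for row in matrix:
    print(row)
```

Each column of `matrix` can be read directly: the entry under "the" is how many times "the" appears in that sentence.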

For GloVe vectors, the strategy is entirely different, and there is no equivalent naming of the features. GloVe vectors are embeddings of words in an abstract N-dimensional space (here, N = 300).

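The GloVe vector for a token is not computed from counts in your data. It is a row of learned parameters: GloVe is trained by fitting word vectors so that their dot products (plus bias terms) approximate the logarithm of word co-occurrence counts in a large corpus, and the pretrained file is simply a lookup table from each word to its learned vector. The individual coordinates are whatever values the optimization settled on.

If you’ve ever trained a deep neural network, imagine choosing some hidden layer within it. What is the feature_name for each node in that hidden layer? It’s a meaningless question, because the nodes aren’t named features; they exist to pass activations to the next layer. GloVe dimensions are similar: they are learned coordinates with no individual meaning, so the most you can do is label them by index.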
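
In practice, the only things you can read off the pickled dictionary are the vocabulary (its keys) and the number of dimensions. A minimal sketch, assuming the pickle holds a plain word-to-vector mapping as in the question; the toy 3-dimensional vectors, the `glove_i` labels, and the averaging helper `sentence_vector` are all hypothetical choices for illustration:

```python
# Toy stand-in for the unpickled dictionary: word -> vector.
# The real vectors in the question have 300 entries; 3 keeps this readable.
model = {
    "cat": [0.1, -0.4, 0.7],
    "dog": [0.2, -0.3, 0.6],
    "sat": [-0.5, 0.9, 0.0],
}

glove_words = set(model.keys())          # the vocabulary, not feature names
dim = len(next(iter(model.values())))    # number of embedding dimensions

# Placeholder labels: the dimensions have no meaning beyond their index.
column_names = [f"glove_{i}" for i in range(dim)]


def sentence_vector(sentence):
    """Average the vectors of the known words (one common, assumed choice)."""
    vectors = [model[tok] for tok in sentence.split() if tok in glove_words]
    if not vectors:
        return [0.0] * dim
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]


vec = sentence_vector("the cat sat")     # "the" is not in the toy vocabulary
print(dict(zip(column_names, vec)))
```

The index-based labels are only there so that downstream tools expecting column names have something to display; they carry no interpretation.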



Source: stackoverflow