Cosine Similarity between two words in a context in Python

I am trying to compute, in Python, the cosine similarity between two words that appear in a dataset of texts (each text is a tweet). I want to evaluate their similarity based on the contexts in which they occur.

My code so far is the following:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# corpus is a list of texts (in this case, a list of tweets)
corpus = dataset

# TF-IDF matrix: one row per tweet, one column per word
vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus)

# Pairwise cosine similarity between tweets (rows), not between words
sims = cosine_similarity(trsfm, trsfm)

# Raw term counts per tweet
count_vect = CountVectorizer()
counts = count_vect.fit_transform(corpus)

pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus)
vectorizer.get_feature_names_out()

The result is the similarity between the texts, but I want the similarity between two words.

So, how can I obtain the similarity between two words rather than between two texts? For instance, I want the similarity for each of these word pairs: ["covid", "vaccine"], ["work", "covid"], ["environment", "pollution"].

In addition, I want to represent these words on a Cartesian plane to display the distances among them graphically, so I need to calculate their Cartesian coordinates.

Is there anyone who can help me?

Answer

Here are some useful links to get started with:

https://www.tensorflow.org/text/guide/word_embeddings  
https://arxiv.org/abs/1810.04805  
What Are Word Embeddings for Text?
An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
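
As a starting point that stays within scikit-learn, one possible approach is to transpose the TF-IDF matrix so that each word, rather than each tweet, becomes a row. Two words then count as similar when they tend to occur in the same tweets. The sketch below is only an illustration under that assumption: the toy corpus, the helper word_similarity, and the use of TruncatedSVD to get 2D coordinates are my own choices, not something from the question. On short tweets these co-occurrence vectors are very sparse, so the word embeddings described in the links above will usually capture context better.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the list of tweets (hypothetical data, not from the question).
corpus = [
    "the covid vaccine is available at work",
    "covid vaccine trials show good results",
    "remote work increased because of covid",
    "air pollution harms the environment",
    "less pollution means a cleaner environment",
]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(corpus)  # shape: (n_documents, n_words)
term_doc = doc_term.T.tocsr()                # shape: (n_words, n_documents)
vocab = vectorizer.vocabulary_               # maps word -> row index in term_doc


def word_similarity(word_a, word_b):
    """Cosine similarity between the per-document TF-IDF profiles of two words.

    Hypothetical helper; raises KeyError if a word is not in the vocabulary.
    """
    rows = term_doc[[vocab[word_a], vocab[word_b]]]
    return float(cosine_similarity(rows)[0, 1])


pairs = [("covid", "vaccine"), ("work", "covid"), ("environment", "pollution")]
for a, b in pairs:
    print(a, b, round(word_similarity(a, b), 3))

# Project every word vector onto two dimensions to get plottable (x, y) values.
# TruncatedSVD is one option among several; 2D distances are only approximate.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(term_doc)         # shape: (n_words, 2)

words = sorted({w for pair in pairs for w in pair})
xy = {w: coords[vocab[w]] for w in words}
for w, (x, y) in xy.items():
    print(w, x, y)
```

The values in xy can be passed to any plotting library (for example, a matplotlib scatter plot with one labelled point per word). With a real corpus, replace the toy list with your own list of tweets; the number of tweets must be at least 2 for the two-component projection to work.
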
User contributions licensed under: CC BY-SA