I am trying to compute, in Python, the cosine similarity between two words that appear in a dataset of texts (each text is a tweet). I want to evaluate the similarity based on the contexts in which the words appear.
I have written code like the following:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    corpus = dataset  # corpus is a list of texts (in this case, a list of tweets)

    vectorizer = TfidfVectorizer()
    trsfm = vectorizer.fit_transform(corpus)
    sims = cosine_similarity(trsfm, trsfm)  # pairwise similarity between texts

    count_vect = CountVectorizer()
    counts = count_vect.fit_transform(corpus)

    pd.DataFrame(trsfm.toarray(), columns=vectorizer.get_feature_names_out(), index=corpus)
    vectorizer.get_feature_names_out()
The result is the similarity between the texts but I want the similarity between two words.
So, how can I obtain the similarity between two words rather than between two texts? For instance, I want the similarity between these pairs of words: {["covid", "vaccine"], ["work", "covid"], ["environment", "pollution"]}.
In addition, I want to represent these words in a Cartesian plane in order to display the distances among them graphically, so I need to calculate their Cartesian coordinates.
Is there anyone who can help me?
Answer
Here are some useful links to get started with:
https://www.tensorflow.org/text/guide/word_embeddings
https://arxiv.org/abs/1810.04805
What Are Word Embeddings for Text?
An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
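
If you want to stay with the TF-IDF setup from the question before moving to learned embeddings, one possible starting point is to transpose the document-term matrix so that each row describes a word by the tweets it occurs in. The sketch below is only a count-based stand-in for the embeddings covered in the links, not Word2Vec or BERT: it captures which tweets two words share, not their finer local context. The toy corpus, the `word_similarity` helper, and the `word_xy` dictionary are hypothetical names used for illustration, and `TruncatedSVD` is just one reasonable choice for reducing the vectors to two dimensions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus standing in for the real list of tweets.
corpus = [
    "covid vaccine rollout starts this week",
    "the new covid vaccine is effective",
    "remote work increased during covid",
    "pollution harms the environment",
    "protect the environment and reduce pollution",
]

vectorizer = TfidfVectorizer()
doc_term = vectorizer.fit_transform(corpus)  # shape: (n_docs, n_words)

# Transpose so that each row is a word described by the tweets it occurs in.
word_vectors = doc_term.T.tocsr()            # shape: (n_words, n_docs)
vocab = vectorizer.vocabulary_               # maps word -> row index


def word_similarity(w1, w2):
    """Cosine similarity between the tweet-occurrence vectors of two words."""
    v1 = word_vectors[[vocab[w1]]]           # list index keeps the row 2-D
    v2 = word_vectors[[vocab[w2]]]
    return float(cosine_similarity(v1, v2)[0, 0])


pairs = [("covid", "vaccine"), ("work", "covid"), ("environment", "pollution")]
for w1, w2 in pairs:
    print(w1, w2, round(word_similarity(w1, w2), 3))

# Reduce every word vector to 2 dimensions to get plottable (x, y) coordinates.
svd = TruncatedSVD(n_components=2, random_state=0)
coords = svd.fit_transform(word_vectors)     # shape: (n_words, 2)
word_xy = {w: coords[vocab[w]] for pair in pairs for w in pair}
```

The entries of `word_xy` can then be passed to any plotting library, for example as a scatter plot with one labelled point per word. Distances in the 2-D projection only approximate the cosine similarities, and with real tweets the vectors from a trained embedding model are likely to reflect context better than these TF-IDF columns.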