
Is there a way to calculate cosine similarity between sets of documents in Python?

I'm trying to calculate cosine similarity between sets of documents. The code below works well, but it sorts the results in descending order of similarity. Is there a way to get the results in the order in which the documents were inserted instead? Or is there another way to do this? Thanks in advance to everyone.

This is the code that I’m using:

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances


# Each document is a single string; a trailing backslash continues the literal onto the next line
documents = ['Machine learning is the study of computer algorithms that improve automatically through experience.\
Machine learning algorithms build a mathematical model based on sample data, known as training data.\
The discipline of machine learning employs various approaches to teach computers to accomplish tasks \
where no fully satisfactory algorithm is available.',
'A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concerned\
about the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.\
Developing a machine learning application is more iterative and explorative process than software engineering.',
             'Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. \
It involves computers learning from data provided so that they carry out certain tasks.',
             'Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"\
or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement',
             'Software engineering is the systematic application of engineering approaches to the development of software.\
Software engineering is a computing discipline.',
'Machine learning is closely related to computational statistics, which focuses on making predictions using computers.\
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.']
documents_df = pd.DataFrame(documents, columns=['documents'])

# removing special characters and stop words from the text
stop_words_l = stopwords.words('english')
documents_df['documents_cleaned'] = documents_df.documents.apply(lambda x: " ".join(
    re.sub(r'[^a-zA-Z]', ' ', w).lower() for w in x.split() if
    re.sub(r'[^a-zA-Z]', ' ', w).lower() not in stop_words_l))

tfidfvectoriser = TfidfVectorizer()
tfidfvectoriser.fit(documents_df.documents_cleaned)
tfidf_vectors = tfidfvectoriser.transform(documents_df.documents_cleaned)

pairwise_similarities = np.dot(tfidf_vectors, tfidf_vectors.T).toarray()
pairwise_differences = euclidean_distances(tfidf_vectors)

def most_similar(doc_id, similarity_matrix, matrix):
    print(similarity_matrix)
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print('\n')
    print('Similar Documents:')
    if matrix == 'Cosine Similarity':
        # higher similarity first: sort ascending, then reverse
        similar_ix = np.argsort(similarity_matrix[doc_id])[::-1]
    elif matrix == 'Euclidean Distance':
        # smaller distance first
        similar_ix = np.argsort(similarity_matrix[doc_id])
    for ix in similar_ix:
        if ix == doc_id:
            continue  # skip the query document itself
        print('\n')
        print(f'Document: {documents_df.iloc[ix]["documents"]}')
        print(f'{matrix} : {similarity_matrix[doc_id][ix]}')

most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')

This is the output:

Document: Machine learning is the study of computer algorithms that improve automatically through experience.Machine learning algorithms build a mathematical model based on sample data, known as training data.The discipline of machine learning employs various approaches to teach computers to accomplish tasks where no fully satisfactory algorithm is available.

Similar Documents:

Document: Machine learning is closely related to computational statistics, which focuses on making predictions using computers.The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.
Cosine Similarity : 0.22860560787391593

Document: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks.
Cosine Similarity : 0.22581304743529423

Document: Machine learning approaches are traditionally divided into three broad categories, depending on the nature of the "signal"or "feedback" available to the learning system: Supervised, Unsupervised and Reinforcement
Cosine Similarity : 0.15314340308039842

Document: A software engineer creates programs based on logic for the computer to execute. A software engineer has to be more concernedabout the correctness of the program in all the cases. Meanwhile, a data scientist is comfortable with uncertainty and variability.Developing a machine learning application is more iterative and explorative process than software engineering.
Cosine Similarity : 0.12407396777398046

Document: Software engineering is the systematic application of engineering approaches to the development of software.Software engineering is a computing discipline.
Cosine Similarity : 0.04978528121489196
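
A side note on the code in the question: `np.dot(tfidf_vectors, tfidf_vectors.T)` gives cosine similarity only because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so every document vector has unit length. The sketch below checks that identity on random data with plain NumPy; the array `X` and the variable names are made up for illustration and are not part of the original code.

```python
import numpy as np

# Hypothetical data: 4 "documents" with 5 non-negative features each
rng = np.random.default_rng(0)
X = rng.random((4, 5))

# L2-normalize each row, as TfidfVectorizer does by default
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot product of unit-length rows
dot_of_unit_rows = X_unit @ X_unit.T

# Cosine similarity computed from its definition on the raw rows
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot_of_unit_rows, cosine))
```

If the vectorizer were created with `norm=None`, the plain dot product would no longer be a cosine similarity, and calling `cosine_similarity(tfidf_vectors)` (already imported in the question) would be the safer choice.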


Answer

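The ordering comes from np.argsort. If you iterate over the row of the matrix directly instead, the results come out in the order the documents were inserted. I think this does what you are looking for: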

def most_similar(doc_id, similarity_matrix, matrix):
    print(similarity_matrix)
    print(f'Document: {documents_df.iloc[doc_id]["documents"]}')
    print('\n')
    print('Similar Documents:')
    # No argsort here: walk the row as-is so results follow the insertion order.
    # ix is the document index, score is the similarity (or distance) value.
    for ix, score in enumerate(similarity_matrix[doc_id]):
        if ix == doc_id:
            continue  # skip the query document itself
        print('\n')
        print(f'Document: {documents_df.iloc[ix]["documents"]}')
        print(f'{matrix} : {score}')

most_similar(0, pairwise_similarities, 'Cosine Similarity')
most_similar(0, pairwise_differences, 'Euclidean Distance')
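
If you would rather have the values than printed text, something along these lines might help. It is only a sketch: the helper name `similarities_in_order` and the toy matrix are made up for illustration, and it assumes the matrix is square with one row per document, like `pairwise_similarities` above.

```python
import numpy as np

def similarities_in_order(doc_id, similarity_matrix):
    """Return (index, score) pairs for every other document, in insertion order."""
    row = np.asarray(similarity_matrix)[doc_id]
    # enumerate keeps the original document order; only the query itself is skipped
    return [(ix, float(score)) for ix, score in enumerate(row) if ix != doc_id]

# Toy 3x3 similarity matrix with made-up values (symmetric, ones on the diagonal)
toy = np.array([[1.0, 0.2, 0.7],
                [0.2, 1.0, 0.1],
                [0.7, 0.1, 1.0]])

print(similarities_in_order(0, toy))  # [(1, 0.2), (2, 0.7)]
```

With the real data you would call it as `similarities_in_order(0, pairwise_similarities)` and, for example, attach the scores to `documents_df` by index.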
User contributions licensed under: CC BY-SA