
How to compare two text documents with TfidfVectorizer?

I have two different texts that I want to compare using TF-IDF vectorization. What I am doing is:

  1. tokenizing each document
  2. vectorizing using TfidfVectorizer.fit_transform(tokens_list)

The vectors I get after step 2 have different shapes. As I understand the concept, both vectors must have the same shape before they can be compared.

What am I doing wrong? Please help.

Thanks in advance.


Answer

As G. Anderson already pointed out, and to help future readers: when we call fit() of TfidfVectorizer on document D1, the bag of words (the vocabulary) is built from D1, along with the inverse document frequency of each word in it.

The transform() function then computes the TF-IDF weight of each word in that bag of words.

Our aim is to compare document D2 with D1, that is, to see how many words of D1 also occur in D2. That's why we call fit_transform() on D1 and then only transform() on D2. The transform() call applies the bag of words of D1 to D2, so words of D2 that are not in D1's vocabulary are ignored, and it weights the term counts of D2 with the inverse document frequencies learned from D1. Both vectors therefore have the same shape, which gives a relative comparison of D2 against D1.
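A minimal sketch of this fit-on-D1, transform-on-D2 approach is below. The two sample sentences and the use of cosine similarity as the comparison measure are assumptions for illustration and are not part of the original question. It also assumes the documents are passed as raw strings, one list entry per document, which is what TfidfVectorizer expects by default; passing a list of tokens would make it treat every token as its own document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents, used only for illustration.
d1 = "the cat sat on the mat"
d2 = "the dog sat on the log"

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and IDF weights from D1
# and returns the TF-IDF vector of D1.
v1 = vectorizer.fit_transform([d1])

# transform reuses D1's vocabulary, so D2 is mapped into the same
# feature space; words that occur only in D2 are dropped.
v2 = vectorizer.transform([d2])

# Both are 1 x (number of distinct words in D1), so the shapes match.
print(v1.shape, v2.shape)

# One possible comparison measure: cosine similarity of the two vectors.
similarity = cosine_similarity(v1, v2)[0, 0]
print(similarity)
```

Because the vocabulary comes from D1 only, the score reflects how much of D1's wording reappears in D2. If both documents should contribute words to the vocabulary, one option is to call fit_transform([d1, d2]) once and compare the two rows of the result.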

User contributions licensed under: CC BY-SA