I have two different texts that I want to compare using tf-idf vectorization. What I am doing is:
- tokenizing each document
- vectorizing using TfidfVectorizer.fit_transform(tokens_list)
The vectors I get after step 2 have different shapes. As I understand the concept, both vectors should have the same shape, because only then can they be compared.
What am I doing wrong? Please help.
Thanks in advance.
Answer
As G. Anderson already pointed out, and for the benefit of future readers: when we call the fit() function of TfidfVectorizer on document D1, it builds the vocabulary (the bag of words) from D1 and learns the IDF weights for it.
The transform() function then computes the tf-idf weight of each word in that vocabulary for the document it is given.
Our aim is to compare document D2 with D1, that is, to see how many of D1's words also appear in D2. That is why we call fit_transform() on D1 and then only transform() on D2: transform() reuses the vocabulary and IDF weights learned from D1 and scores D2's tokens against them, ignoring any word that D1 does not contain. Both vectors then have one column per word in D1's vocabulary, so they have the same shape and D2 can be compared against D1.
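A minimal sketch of that workflow with scikit-learn follows. The two sample sentences and the variable names are made up for illustration, and cosine similarity is just one possible way to compare the resulting vectors. Note that the vectorizer is given a list of document strings here, one string per document, rather than a list of tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sample documents, one string per document.
d1 = "the cat sat on the mat"
d2 = "the dog sat on the log"

vectorizer = TfidfVectorizer()

# fit_transform() learns the vocabulary and IDF weights from D1
# and returns D1's tf-idf vector.
v1 = vectorizer.fit_transform([d1])

# transform() reuses D1's vocabulary, so D2's vector has the same
# number of columns; words of D2 that are not in D1 are ignored.
v2 = vectorizer.transform([d2])

print(v1.shape, v2.shape)  # both have one row and the same width

# One way to compare the two vectors (an assumption, not the only option).
similarity = cosine_similarity(v1, v2)[0, 0]
print(similarity)
```

If both documents should contribute to the vocabulary, calling fit_transform([d1, d2]) once also gives two rows of equal width; which approach fits better depends on whether the comparison should be relative to D1 only.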