I am trying to get the tf-idf vector for a single document using Sklearn’s TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document.
from sklearn.feature_extraction.text import TfidfVectorizer

self.vocabulary = "a list of words I want to look for in the documents".split()
self.vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
self.vect.fit_transform(self.vocabulary)
...
doc = "some string I want to get tf-idf vector for"
tfidf = self.vect.transform(doc)
The problem is that this returns a matrix with n rows, where n is the length of my doc string. I want it to return a single vector representing the tf-idf for the entire string. How can I make it treat the string as a single document, rather than treating each character as a document? Also, I am very new to text mining, so if I am doing something wrong conceptually, I would be glad to know. Any help is appreciated.
Answer
If you want to compute tf-idf only for a given vocabulary, use the vocabulary argument to the TfidfVectorizer constructor:
vocabulary = "a list of words I want to look for in the documents".split()
vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)
Then, to fit the vectorizer on a given corpus (i.e. an iterable of documents) and calculate the counts, use fit:
vect.fit(corpus)
The fit_transform method is shorthand for:
vect.fit(corpus)
corpus_tf_idf = vect.transform(corpus)
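As a quick sanity check with a tiny made-up corpus, both routes produce the same matrix:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# made-up corpus, for illustration only
corpus = ["the cat sat on the mat", "the dog chased the cat"]

# one step: fit_transform
one_step = TfidfVectorizer().fit_transform(corpus)

# two steps: fit, then transform
vect = TfidfVectorizer()
vect.fit(corpus)
two_step = vect.transform(corpus)

# both yield the same tf-idf matrix
assert np.allclose(one_step.toarray(), two_step.toarray())
```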
Finally, the transform method accepts a corpus, so for a single document you should pass it as a one-element list; otherwise the string is treated as an iterable of characters, each character being a document:
doc_tfidf = vect.transform([doc])
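Putting it together, a minimal end-to-end sketch with a made-up vocabulary and corpus (keeping the sublinear_tf and stop_words options from the question):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical vocabulary and training corpus, for illustration only
vocabulary = "cat dog bird".split()
corpus = ["the cat sat on the mat", "the dog chased the cat", "a bird flew by"]

vect = TfidfVectorizer(sublinear_tf=True, analyzer='word',
                       stop_words='english', vocabulary=vocabulary)
vect.fit(corpus)

doc = "the cat and the dog"
# note the list around doc: one document in, one row out
doc_tfidf = vect.transform([doc])
print(doc_tfidf.shape)  # (1, 3): one row, one column per vocabulary word
```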