So I’m computing tf-idf for a very large corpus (100k documents) and it is giving me memory errors. Is there any implementation that works well with such a large number of documents? I also want to use my own stop-word list. It worked on 50k documents; what is the limit on the number of documents I can use in this calculation if there …
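One common way to sidestep the memory problem, sketched here on the assumption that scikit-learn is in use, is to keep the matrix sparse and avoid building an in-memory vocabulary by pairing `HashingVectorizer` with `TfidfTransformer`. The `corpus` and `my_stopwords` names below are placeholders, not from the question:

```python
# Sketch: memory-friendly tf-idf for a large corpus with scikit-learn.
# HashingVectorizer never stores a vocabulary, so memory stays bounded;
# TfidfTransformer then applies the idf weighting on the sparse counts.
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

my_stopwords = ["the", "a", "an"]        # placeholder custom stop-word list
corpus = ["first document ...", "..."]   # placeholder: iterable of 100k raw strings

hasher = HashingVectorizer(
    stop_words=my_stopwords,
    n_features=2**18,       # fixed-size feature space, no vocabulary kept
    alternate_sign=False,   # keep counts non-negative so tf-idf makes sense
)
counts = hasher.transform(corpus)            # sparse (n_docs, 2**18) matrix
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)
```

Because hashing is stateless, this also streams: there is no hard document limit beyond available memory for the sparse result itself.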
Why does TfidfVectorizer.fit_transform() change the number of samples and labels for my text data?
I have a data set that contains 3 columns and 310 rows. The columns are all text. One column is the text a user typed into an inquiry form, and the second column contains the labels (one of six) that say which inquiry category the input falls into. I am doing the following preprocessing to my data before I …
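One frequent cause of this symptom, offered here only as a hedged guess since the question is truncated: iterating over a pandas DataFrame yields its column names, so `fit_transform(df)` sees one "document" per column instead of one per row. A minimal sketch, with placeholder column names `inquiry_text` and `label`:

```python
# Sketch: why the sample count can look wrong after fit_transform.
# Passing the whole DataFrame makes the vectorizer iterate over the
# column headers (2 "documents" here), not the 310 rows.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "inquiry_text": ["please reset my password", "billing question", "app crashes"],
    "label": ["account", "billing", "technical"],
})

X_wrong = TfidfVectorizer().fit_transform(df)                   # one row per column name
X_right = TfidfVectorizer().fit_transform(df["inquiry_text"])   # one row per document

print(X_wrong.shape[0], X_right.shape[0])   # 2 vs 3 in this toy example
```

If this is the cause, passing the text column (a Series) rather than the DataFrame restores the expected number of samples.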
How do I compare two text documents with TfidfVectorizer?
I have two different texts which I want to compare using tf-idf vectorization. What I am doing is:

1. tokenizing each document
2. vectorizing using TfidfVectorizer.fit_transform(tokens_list)

Now the vectors that I get after step 2 have different shapes. But as per the concept, both vectors should have the same shape; only then can they be compared. What …
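The usual fix is to fit a single vectorizer on both documents together, so both vectors live in the same feature space, and then compare them with cosine similarity. A minimal sketch with placeholder documents (note that `TfidfVectorizer` expects raw strings and does its own tokenization, so pre-tokenized lists are not needed):

```python
# Sketch: compare two documents in a shared tf-idf feature space.
# Fitting one vectorizer on both texts guarantees both rows have the
# same number of columns, so they can be compared directly.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc1 = "the cat sat on the mat"
doc2 = "a cat lay on a rug"

vec = TfidfVectorizer()
X = vec.fit_transform([doc1, doc2])   # shape (2, n_shared_features)

similarity = cosine_similarity(X[0], X[1])[0, 0]
print(f"cosine similarity: {similarity:.3f}")
```

Calling `fit_transform` separately on each document builds two unrelated vocabularies, which is exactly why the shapes differ.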