So I’m doing tf-idf for a very large corpus (100k documents) and it is giving me memory errors. Is there any implementation that can work well with such a large number of documents? I also want to make my own stopwords list. It worked on 50k documents, so what is the limit on the number of documents I can use in this calculation if there
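This kind of memory error usually comes from densifying the result rather than from TfidfVectorizer itself, which returns a scipy sparse matrix. Below is a minimal sketch, assuming `documents` is a list of raw strings and `my_stopwords` is a hypothetical custom stop-word list; the float32 dtype is an optional tweak that roughly halves memory versus the default float64.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus and custom stop-word list, for illustration only.
documents = ["first example document", "second example document"]
my_stopwords = ["the", "a", "an"]

# TfidfVectorizer returns a scipy sparse matrix; keeping it sparse (never
# calling .toarray() on the full result) is usually what lets 100k+ documents
# fit in memory.
vectorizer = TfidfVectorizer(stop_words=my_stopwords, dtype=np.float32)
X = vectorizer.fit_transform(documents)  # sparse CSR matrix, shape (n_docs, n_terms)
```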
Pipeline with count and tfidf vectorizer produces TypeError: expected string or bytes-like object
I have a corpus like the following: ‘C C C 0 0 0 X 0 1 0 0 0 0’, ‘C C C 0 0 0 X 0 1 0 0 0 0’, ‘C C C 0 0 0 X 0 1 0 0 0 0’, ‘X X X’, ‘X X X’, ‘X X X’. I would like to use
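One common cause of this TypeError is chaining CountVectorizer and TfidfVectorizer in the same Pipeline, since the second vectorizer then receives a count matrix instead of the raw strings it expects. A minimal sketch, assuming the small corpus shown above and pairing CountVectorizer with TfidfTransformer instead; the `token_pattern` is an assumption added so single-character tokens like ‘C’ and ‘0’ are kept.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Abbreviated version of the corpus from the question above.
corpus = ["C C C 0 0 0 X 0 1 0 0 0 0", "X X X"]

# CountVectorizer consumes raw strings and produces counts; TfidfTransformer
# then reweights those counts, so every step gets the input type it expects.
pipe = Pipeline([
    ("counts", CountVectorizer(token_pattern=r"\S+")),
    ("tfidf", TfidfTransformer()),
])
X = pipe.fit_transform(corpus)  # sparse tf-idf matrix
```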
Creating a new column for predicted cluster: SettingWithCopyWarning
This question will unfortunately be a duplicate, but I could not fix the issue in my code even after looking at other similar questions and their related answers. I need to split my dataset into a train and a test dataset. However, it seems I am making some error when I add a new column for the predicted cluster. The
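The warning typically appears when the new column is written into a slice that is still a view of the original frame. A minimal sketch, assuming a hypothetical two-feature DataFrame and KMeans as the clustering model; the explicit .copy() after the split is what avoids SettingWithCopyWarning.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

# Hypothetical data frame with two numeric feature columns.
df = pd.DataFrame({"f1": [0.1, 0.2, 0.4, 0.8, 0.9, 1.0],
                   "f2": [1.0, 0.9, 0.8, 0.2, 0.1, 0.0]})

train, test = train_test_split(df, test_size=0.5, random_state=0)
train = train.copy()  # explicit copy, so later assignments do not write into a view of df

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train[["f1", "f2"]])

# Assigning through .loc on the explicit copy adds the column without a warning.
train.loc[:, "cluster"] = model.predict(train[["f1", "f2"]])
```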
Using Sklearn’s TfidfVectorizer transform
I am trying to get the tf-idf vector for a single document using Sklearn’s TfidfVectorizer object. I create a vocabulary based on some training documents and use fit_transform to train the TfidfVectorizer. Then, I want to find the tf-idf vectors for any given testing document. The problem is that this returns a matrix with n rows where n is the
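For what it’s worth, transform expects an iterable of documents, so passing a single raw string makes each character look like its own document. A minimal sketch, with hypothetical training and test documents, that wraps the test document in a one-element list so the result is a single row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents used to build the vocabulary.
train_docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs)

# Wrapping the test document in a list yields a 1 x n_features sparse row;
# passing the bare string would instead be iterated character by character.
test_doc = "the cat chased the dog"
vec = vectorizer.transform([test_doc])
print(vec.shape)  # (1, n_features)
```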