
tf-idf for large number of documents (>100k)

So I’m running tf-idf on a very large corpus (~100k documents) and it’s giving me memory errors. Is there any implementation that can handle this many documents? I also want to build my own stopwords list. It worked on 50k documents, so is there a limit on how many documents this calculation can handle, if there is one (using the sklearn implementation)?

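The original code block wasn’t preserved, but judging from the dense shape in the traceback below, the pipeline probably looked something like this (a minimal sketch assuming sklearn’s TfidfVectorizer; the `docs` list and the `.toarray()` call are placeholders, not the asker’s actual code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first example document", "second example document"]  # placeholder corpus

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # returns a scipy.sparse matrix: cheap in memory

# Converting to a dense (n_docs, n_terms) float64 array is what blows up at
# 100k documents -- 96671 * 90622 * 8 bytes is roughly the 65.3 GiB in the error:
dense = tfidf.toarray()
```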

The error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64

Thanks in advance.


Answer

As @NickODell said, the memory error only occurs when you convert the sparse matrix into a dense one. The solution is to do everything you need using the sparse matrix only.

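A sparse-only sketch of that approach (the `docs` corpus and the 80% threshold are illustrative assumptions): compute tf-idf without ever calling `.toarray()`, and build a custom stopword list from document frequencies computed directly on the sparse matrix:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["first example document", "second example document", "a third one"]  # placeholder

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # scipy.sparse CSR matrix; never densify it

terms = vectorizer.get_feature_names_out()  # sklearn >= 1.0

# Document frequency per term, computed on the sparse matrix directly:
doc_freq = np.asarray((tfidf > 0).sum(axis=0)).ravel()

# Example custom stopword list: terms that appear in over 80% of documents.
stopwords = {terms[i] for i in np.where(doc_freq > 0.8 * tfidf.shape[0])[0]}
print(stopwords)
```

The resulting set can then be passed back in on a second pass via `TfidfVectorizer(stop_words=list(stopwords))`, and row slicing and per-term statistics all work on the CSR matrix as-is, so there is rarely a reason to densify.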

And that’s the solution.

User contributions licensed under: CC BY-SA