I'm computing tf-idf over a very large corpus (100k documents) and I'm getting memory errors. Is there an implementation that works well with this many documents? I also want to build my own stopwords list. The same code worked on 50k documents; is there a limit on the number of documents the sklearn implementation can handle?
def tf_idf(self, df):
    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    tf_idf_DF = pd.DataFrame(data=article_vect.toarray(), columns=count_tokens)
    tf_idf_DF = pd.DataFrame(tf_idf_DF.sum(axis=0).sort_values(ascending=False))

    return tf_idf_DF
The error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64
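For reference, that allocation matches a dense float64 array of the reported shape, which is why the conversion fails:

n_docs, n_terms = 96671, 90622
bytes_needed = n_docs * n_terms * 8     # 8 bytes per float64 entry
print(bytes_needed / 1024**3)           # ~65.3 GiB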
Thanks in advance.
Answer
As @NickODell said, the memory error only occurs when you convert the sparse matrix into a dense array with .toarray(). The solution is to do everything you need on the sparse matrix itself: summing a CSR matrix along axis 0 produces only a (1, n_features) result, so the full dense document-term matrix is never materialized.
def tf_idf(self, df):

    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    # The following line is the solution: sum the sparse matrix directly
    tf_idf_DF = pd.DataFrame(data=article_vect.tocsr().sum(axis=0), columns=count_tokens)
    tf_idf_DF = tf_idf_DF.T.sort_values(ascending=False, by=[0])

    tf_idf_DF['word'] = tf_idf_DF.index
    tf_idf_DF['tf-idf'] = tf_idf_DF[0]
    tf_idf_DF = tf_idf_DF.reset_index().drop(['index', 0], axis=1)

    return tf_idf_DF
And that’s the solution.
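As an aside, regarding the custom stopwords list mentioned in the question: TfidfVectorizer accepts a list of words through its stop_words parameter, so no separate implementation is needed. A minimal sketch, where my_stopwords and corpus are placeholders for your own list and documents:

from sklearn.feature_extraction.text import TfidfVectorizer

my_stopwords = ["the", "and", "of", "to", "in"]   # illustrative list, replace with your own
tfidf = TfidfVectorizer(stop_words=my_stopwords)
article_vect = tfidf.fit_transform(corpus)        # corpus: iterable of raw text documents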