
tf-idf for large number of documents (>100k)

So I’m doing tf-idf on a very large corpus (100k documents) and it is giving me memory errors. Is there any implementation that can work well with such a large number of documents? I also want to make my own stopwords list. It worked on 50k documents; is there a limit on the number of documents I can use in this calculation (with the sklearn implementation)?

  def tf_idf(self, df):
    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    # toarray() materialises the full dense document-term matrix
    tf_idf_DF = pd.DataFrame(data=article_vect.toarray(), columns=count_tokens)
    tf_idf_DF = pd.DataFrame(tf_idf_DF.sum(axis=0).sort_values(ascending=False))

    return tf_idf_DF

The error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64

Thanks in advance.


Answer

As @NickODell said, the memory error only occurs when you convert the sparse matrix into a dense one. The solution is to do everything you need using the sparse matrix only.
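As a quick sanity check, the shape reported in the error really does correspond to roughly 65 GiB once each entry is stored as an 8-byte float64:

  # Rough check of the dense allocation from the error message
  n_docs, n_terms = 96671, 90622
  bytes_needed = n_docs * n_terms * 8     # 8 bytes per float64 entry
  print(bytes_needed / 1024**3)           # ~65.3 GiB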

  def tf_idf(self, df):

    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    # The following line is the solution: sum the tf-idf scores per term
    # directly on the sparse matrix, so the dense matrix is never built
    tf_idf_DF = pd.DataFrame(data=article_vect.tocsr().sum(axis=0), columns=count_tokens)
    tf_idf_DF = tf_idf_DF.T.sort_values(ascending=False, by=[0])

    # Reshape into two columns: the term and its summed tf-idf score
    tf_idf_DF['word'] = tf_idf_DF.index
    tf_idf_DF['tf-idf'] = tf_idf_DF[0]
    tf_idf_DF = tf_idf_DF.reset_index().drop(['index', 0], axis=1)

    return tf_idf_DF

And that’s the solution.
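On the custom stopwords part of the question: TfidfVectorizer accepts a stop_words list, and its min_df / max_df / max_features parameters can trim the vocabulary, which also keeps the sparse matrix smaller. A minimal sketch, assuming a plain list of documents called corpus and an illustrative stopword list (the values here are placeholders, not tuned recommendations):

  from sklearn.feature_extraction.text import TfidfVectorizer

  # Illustrative values -- adjust to your own corpus
  my_stopwords = ["the", "and", "of", "to", "in"]

  vectorizer = TfidfVectorizer(
      stop_words=my_stopwords,  # your own stopword list
      min_df=5,                 # drop terms appearing in fewer than 5 documents
      max_df=0.8,               # drop terms appearing in more than 80% of documents
      max_features=50000,       # optional hard cap on vocabulary size
  )
  article_vect = vectorizer.fit_transform(corpus)  # stays sparse; no toarray()

There is no fixed document limit in scikit-learn itself; the practical limit is whatever fits in memory, and keeping the matrix sparse (plus trimming the vocabulary) is what lets 100k+ documents go through.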

User contributions licensed under: CC BY-SA