I'm computing tf-idf over a very large corpus (100k documents) and I'm getting memory errors. Is there an implementation that works well with this many documents? I also want to build my own stopwords list. The same code worked on 50k documents; is there a limit on the number of documents the sklearn implementation can handle?
def tf_idf(self, df):
    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    tf_idf_DF = pd.DataFrame(data=article_vect.toarray(), columns=count_tokens)
    tf_idf_DF = pd.DataFrame(tf_idf_DF.sum(axis=0).sort_values(ascending=False))

    return tf_idf_DF
The error: MemoryError: Unable to allocate 65.3 GiB for an array with shape (96671, 90622) and data type float64
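For reference, that allocation matches a dense float64 array of the reported shape, which is why the conversion fails:

n_docs, n_terms = 96671, 90622
bytes_needed = n_docs * n_terms * 8     # 8 bytes per float64 entry
print(bytes_needed / 1024**3)           # ~65.3 GiB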
Thanks in advance.
Answer
As @NickODell said, the memory error only occurs when you convert the sparse matrix into a dense array with .toarray(). The solution is to do everything you need on the sparse matrix itself: summing a CSR matrix along axis 0 produces only a (1, n_features) result, so the full dense document-term matrix is never materialized.
def tf_idf(self, df):

    df_clean, corpus = self.CleanText(df)
    tfidf = TfidfVectorizer().fit(corpus)
    count_tokens = tfidf.get_feature_names_out()
    article_vect = tfidf.transform(corpus)
    # The following line is the solution: sum the sparse matrix directly
    tf_idf_DF = pd.DataFrame(data=article_vect.tocsr().sum(axis=0), columns=count_tokens)
    tf_idf_DF = tf_idf_DF.T.sort_values(ascending=False, by=[0])

    tf_idf_DF['word'] = tf_idf_DF.index
    tf_idf_DF['tf-idf'] = tf_idf_DF[0]
    tf_idf_DF = tf_idf_DF.reset_index().drop(['index', 0], axis=1)

    return tf_idf_DF
And that’s the solution.
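As an aside, regarding the custom stopwords list mentioned in the question: TfidfVectorizer accepts a list of words through its stop_words parameter, so no separate implementation is needed. A minimal sketch, where my_stopwords and corpus are placeholders for your own list and documents:

from sklearn.feature_extraction.text import TfidfVectorizer

my_stopwords = ["the", "and", "of", "to", "in"]   # illustrative list, replace with your own
tfidf = TfidfVectorizer(stop_words=my_stopwords)
article_vect = tfidf.fit_transform(corpus)        # corpus: iterable of raw text documents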