I have a large (about 12M rows) DataFrame df:
df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()
ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
Answer
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max. Both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you’ll do much better than that.
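As a rough sketch of what that looks like in practice (the sample data and the Occurrences column name below are illustrative, not from the question), value_counts can be reshaped back into the same kind of DataFrame, and groupby(...).size() gives a comparable result without the missing-value handling:

import pandas as pd

# Tiny illustrative frame with the question's column layout; the real df has ~12M rows.
df = pd.DataFrame({
    'word': ['apple', 'apple', 'banana', 'apple', 'banana'],
    'documents': [1, 2, 2, 3, 3],
    'frequency': [4, 2, 7, 1, 3],
})

# value_counts skips the groupby machinery and is optimized for object dtype.
Occurrences_of_Words = (
    df['word']
    .value_counts()
    .rename_axis('word')              # the words live in the index
    .reset_index(name='Occurrences')  # turn them back into a column
)

# Roughly equivalent via groupby; size skips the missing-value checks
# that count and max perform, so it is another fast option to compare.
Occurrences_via_size = df.groupby('word').size().reset_index(name='Occurrences')

Both should give a two-column result of the form ['word', 'Occurrences'], sorted by count in the value_counts case and by word in the groupby case.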