Getting word count of doc/docx files in R

I have a stream of doc/docx documents that I need to get the word count of.

The procedure so far is to manually open the document and write down the word count offered by MS Word itself, and I am trying to automate it using R.

This is what I tried:
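
A minimal sketch of this kind of attempt, not the exact code: it assumes `textreadr::read_docx()` returns a character vector with one element per paragraph, and it counts whitespace-separated tokens. The helper names and the file name are hypothetical.

```r
# Hypothetical helper: count whitespace-separated tokens in a character vector.
count_words_naive <- function(paragraphs) {
  tokens <- unlist(strsplit(paragraphs, "\\s+"))
  # strsplit() yields "" for leading whitespace; drop those empty tokens.
  sum(nzchar(tokens))
}

# Hypothetical wrapper: read a .docx with textreadr and count its words.
# Assumption: textreadr is installed; it is only needed when this is called.
naive_docx_word_count <- function(path) {
  paragraphs <- textreadr::read_docx(path)
  count_words_naive(paragraphs)
}

# Usage ("report.docx" is a placeholder file name):
# wordCount <- naive_docx_word_count("report.docx")
```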


Unfortunately, wordCount does NOT match the count that MS Word reports.

For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr does not even import them.

Is there a workaround? I don’t mind trying something in Python, too, although I’m less experienced there.

Any help would be greatly appreciated.


Answer

I tried reading the docx files with a different library (officer) and, although the result still doesn’t agree 100% with MS Word, it is significantly closer this time.

Another small fix is to copy MS Word’s rules for what counts as a word and what doesn’t. The naive method of counting space-separated tokens can be improved by also ignoring the “En Dash” (U+2013) character.

Here is my improved function:
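
A minimal sketch of such a function, under stated assumptions: the text is read with `officer::read_docx()` and `officer::docx_summary()`, whose result has a `text` column, and a token consisting only of an en dash is not counted. The function names are hypothetical, and this is an illustration of the approach rather than the exact code.

```r
# Hypothetical helper: count whitespace-separated tokens, skipping any token
# that is just an en dash (U+2013).
count_words_word_like <- function(paragraphs) {
  paragraphs <- paragraphs[!is.na(paragraphs)]    # rows without text are NA
  tokens <- unlist(strsplit(paragraphs, "\\s+"))
  tokens <- tokens[nzchar(tokens)]                # drop empty tokens
  sum(tokens != "\u2013")
}

# Hypothetical wrapper: read a .docx with officer and count its words.
# Assumption: officer is installed; it is only needed when this is called.
officer_docx_word_count <- function(path) {
  docxObject <- officer::read_docx(path)
  content <- officer::docx_summary(docxObject)
  count_words_word_like(content$text)
}

# Usage ("report.docx" is a placeholder file name):
# wordCount <- officer_docx_word_count("report.docx")
```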


This still has a weakness that prevents 100% accuracy: the officer library doesn’t read list markers (like bullets or hyphens), but MS Word counts those as words. So for any list, this function currently returns X fewer words, where X is the number of list items. I haven’t experimented much with the attributes of the docxObject, but if it somehow holds the number of list items, then a definite improvement can be made.
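
One possible way to estimate X, as an untested sketch: the data frame returned by `officer::docx_summary()` has a `num_id` column, and the assumption here is that it is non-missing for paragraphs that belong to a list. Whether that holds for every list style is not verified, and the helper name is hypothetical.

```r
# Hypothetical helper: estimate the number of list items from a
# docx_summary()-style data frame.
# Assumption: a non-NA `num_id` marks a paragraph that belongs to a list.
count_list_items <- function(content) {
  if (!"num_id" %in% names(content)) {
    return(0L)  # column missing: nothing to add
  }
  sum(!is.na(content$num_id))
}

# Usage (placeholder file name; `base_count` stands for the word count
# computed from the text alone):
# content <- officer::docx_summary(officer::read_docx("report.docx"))
# adjusted <- base_count + count_list_items(content)
```

Empty list paragraphs would also be counted this way, so the adjusted total should still be checked against MS Word on a few sample documents.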
