Getting word count of doc/docx files in R

Question

I have a stream of doc/docx documents that I need to get the word count of. The procedure so far is to manually open the document and write down the word count offered by MS Word itself, and I am trying to automate it using R. This is what I tried: Unfortunately, wordCount is NOT what MS Word suggests. For

Accepted Answer

I tried reading the docx files with a different library (officer) and, even though it doesn't agree 100% with MS Word, it does significantly better this time.

Another small fix is to copy MS Word's strategy for what counts as a word and what doesn't. The naive method of counting every run of whitespace as a word boundary can be improved by ignoring the "En Dash" (U+2013) character: an en dash surrounded by whitespace is treated as part of the separator rather than as a word of its own.

Here is my improved function:

    library(data.table)
    library(stringr)

    getDocxWordCount = function(docxFile) {
        docxObject = officer::read_docx(docxFile)
        # Keep the trimmed text of every paragraph longer than one character
        myFixedText = as.data.table(officer::docx_summary(docxObject))[nchar(str_trim(text)) > 1, str_trim(text)]
        # Words = separators + 1; a free-standing en dash is folded into the separator
        wordBd = sapply(as.list(myFixedText), function(z) 1 + str_count(z, "\\s+([\u{2013}]\\s+)?"))
        return(sum(wordBd))
    }

This still has a weakness that prevents 100% accuracy: the officer library doesn't read list markers (such as bullets or hyphens), but MS Word counts them as words. So for any list, this function currently returns X words fewer, where X is the number of list items. I haven't experimented much with the attributes of docxObject, but if it somehow holds the number of list items, then a definite improvement can be made.
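To make the counting rule easier to check without a .docx file at hand, here is a minimal base-R sketch of the same idea applied to plain character vectors. The names count_words and count_with_list_items are hypothetical helpers, not part of officer or of the function above. The second helper assumes that the data frame returned by officer::docx_summary has a num_id column that is non-NA for list paragraphs. That is worth verifying against your installed officer version before relying on it. Unlike the function above, this sketch only drops empty paragraphs, not one-character ones.

```r
# Hypothetical helper: count words in a character vector of paragraphs.
# Each paragraph contributes (number of separators + 1) words, where a
# separator is a run of whitespace, optionally followed by an en dash
# (U+2013) and more whitespace.
count_words <- function(paragraphs) {
  paragraphs <- trimws(paragraphs)
  paragraphs <- paragraphs[!is.na(paragraphs) & nzchar(paragraphs)]
  if (length(paragraphs) == 0) return(0L)
  separator <- "\\s+(\u2013\\s+)?"
  matches <- gregexpr(separator, paragraphs)
  # gregexpr returns -1 when there is no match, so count positive positions only
  n_separators <- vapply(matches, function(m) sum(m > 0), integer(1))
  sum(n_separators + 1L)
}

# Hypothetical helper: add one word per list paragraph, following the
# observation that MS Word counts list markers as words.
# ASSUMPTION: summary_df has columns `text` and `num_id`, with num_id
# set (non-NA) only for paragraphs that belong to a list.
count_with_list_items <- function(summary_df) {
  text <- trimws(summary_df$text)
  keep <- !is.na(text) & nzchar(text)
  is_list_item <- keep & !is.na(summary_df$num_id)
  count_words(text[keep]) + sum(is_list_item)
}

# Hand-built stand-in for a docx_summary result (no .docx file needed)
demo <- data.frame(
  text = c("Shopping list", "milk and eggs", "bread", ""),
  num_id = c(NA, 1, 1, NA),
  stringsAsFactors = FALSE
)

count_words("pages 10 \u2013 20")   # 3: the en dash is not counted as a word
count_with_list_items(demo)         # 8: 6 words plus 2 list markers
```

If the num_id assumption holds for your documents, the second helper could be fed the output of officer::docx_summary directly. Otherwise the list adjustment has to come from some other source.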

Answer