Best way to extract text from a Word doc without using COM/automation?

Is there a reasonable way to extract plain text from a Word file that doesn’t depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform; that’s non-negotiable in this case.)

Antiword looks like a reasonable option, but it appears to be abandoned.

A Python solution would be ideal, but doesn’t appear to be available.

Answer

I use catdoc or antiword for this, whichever gives the output that is easier to parse. I have wrapped both in Python functions, so they are easy to call from the parsing system (which is written in Python).
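A wrapper along these lines might look like the sketch below. It is not the original code from this answer: the names `build_command` and `doc_to_text` are made up for illustration, it assumes catdoc or antiword is installed and on the PATH, and it assumes the tool's output can be decoded as UTF-8 (which depends on your locale and the tool's configuration).

```python
import shutil
import subprocess


def build_command(path, tool="catdoc"):
    """Build the argument list for the chosen extractor (hypothetical helper)."""
    if tool == "catdoc":
        # -w turns off line wrapping in catdoc
        return ["catdoc", "-w", path]
    # Any other tool (e.g. antiword) is called with just the file name
    return [tool, path]


def doc_to_text(path, tool="catdoc"):
    """Return the plain text of a Word .doc file by shelling out to `tool`."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} is not installed or not on PATH")
    # Passing a list (no shell) avoids quoting problems with odd file names
    result = subprocess.run(build_command(path, tool),
                            capture_output=True, check=True)
    # Assumption: output is UTF-8; undecodable bytes are replaced, not fatal
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `doc_to_text("report.doc")` uses catdoc, and `doc_to_text("report.doc", tool="antiword")` switches to antiword, so you can try both and keep whichever output is easier to parse.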

The -w switch to catdoc turns off line wrapping, BTW.

User contributions licensed under: CC BY-SA