
python-docx returning empty cells when they should be full

I am trying to iterate through all tables in a document and extract the text from them. As an intermediate step I am just trying to print the text to the console.

I have looked at other code provided by scanny in similar posts, but for some reason it is not giving me the expected output for the document I am parsing.

The document can be found at https://www.ontario.ca/laws/regulation/140300

from docx import Document
from docx.enum.text import WD_COLOR_INDEX
import os, re, sys

document = Document("path/to/doc")

tables = document.tables

for table in tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                print(paragraph.text)

I expect this to print all the text, but instead I get nothing. If I try print(row.cells), it just prints (), which I guess is an empty tuple. My document definitely does have text in the cells, though. Not sure what's wrong here.
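
One way to narrow down a symptom like this is to look at the raw XML inside the .docx package. The sketch below uses only the standard library and compares the number of w:tc (cell) elements that are direct children of a w:tr (row) with the total number of w:tc elements in the document. It rests on an assumption: that python-docx builds row.cells from direct w:tc children only, so cells wrapped in some other element would not show up. The function names count_cells and inspect_docx are made up for this illustration.

```python
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def count_cells(document_xml):
    """Return (rows, direct_cells, total_cells) for a document.xml payload.

    direct_cells counts w:tc elements that are immediate children of a w:tr.
    total_cells counts every w:tc element anywhere in the document.
    """
    root = ET.fromstring(document_xml)
    rows = 0
    direct_cells = 0
    for tr in root.iter(W + "tr"):
        rows += 1
        direct_cells += len(tr.findall(W + "tc"))  # immediate children only
    total_cells = sum(1 for _ in root.iter(W + "tc"))
    return rows, direct_cells, total_cells


def inspect_docx(path):
    """Read word/document.xml out of a .docx file and count its cells."""
    with zipfile.ZipFile(path) as zf:
        return count_cells(zf.read("word/document.xml"))


if __name__ == "__main__" and len(sys.argv) > 1:
    rows, direct_cells, total_cells = inspect_docx(sys.argv[1])
    print(f"rows={rows} direct_cells={direct_cells} total_cells={total_cells}")
```

If total_cells is larger than direct_cells, some cells are nested inside another element, which would be consistent with row.cells coming back empty. If both are zero, the tables themselves may not be stored as w:tbl elements at all.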

Any help is appreciated.


Answer

Found the error. I was using a third-party tool (multiDoc converter) to convert old .doc files into .docx format. It works for the most part, but there must be some metadata that doesn't convert properly, because that was causing the issue. Opening the file and manually saving it as .docx solved the problem. The only catch is that I want to convert 2000+ files to .docx, so I'll need to find another solution for converting the files.
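
For the batch conversion, one option that might be worth trying is LibreOffice in headless mode, driven from Python. This is only a sketch under assumptions: it assumes LibreOffice is installed and its soffice executable is on the PATH, and it has not been verified that LibreOffice's output avoids the empty-cells problem described above, so it would be worth checking a few converted files with python-docx first. The helper names build_convert_command and convert_all are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line that converts one file to .docx."""
    return [
        soffice,
        "--headless",            # run without opening a window
        "--convert-to", "docx",  # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_all(src_dir, out_dir, soffice="soffice"):
    """Convert every .doc file in src_dir; return the paths that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for doc_path in sorted(Path(src_dir).glob("*.doc")):
        subprocess.run(
            build_convert_command(doc_path, out_dir, soffice),
            capture_output=True,
            text=True,
        )
        # Check for the output file rather than relying on the exit code alone.
        if not (out_dir / (doc_path.stem + ".docx")).exists():
            failed.append(doc_path)
    return failed
```

Calling convert_all("old_docs", "converted") would then process a folder in one pass and return the files that need a second look. Files are converted one at a time here to keep failures easy to trace; for 2000+ files that may be slow, but it is simple to resume.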
