How to extract a table from a PDF in Python? [duplicate]

I have thousands of PDF files, composed only of tables, with this structure:

[image: pdf file]

However, although the files are fairly structured, I cannot read the tables without losing that structure.

I tried PyPDF2, but the extracted data comes out completely scrambled.

I also tried Tabula, but it only reads the header (and not the content of the tables).

Any thoughts?

Answer

After struggling a little bit, I found a way.

For each page of the file, I had to pass tabula’s read_pdf function the area of the table and the positions of the column boundaries.

Here is the working code:
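A minimal sketch of that approach, assuming the tabula-py package (which needs a Java runtime). The helper names `build_read_options` and `read_tables` are hypothetical, and any coordinates you pass are placeholders you have to measure in your own files. `area` is (top, left, bottom, right) and `columns` is a list of x positions, both in PDF points.

```python
def build_read_options(page, area, columns):
    """Build keyword arguments for tabula.read_pdf for a single page.

    area    -- (top, left, bottom, right) of the table, in PDF points
    columns -- x coordinates (in points) where one column ends and the next begins
    """
    return {
        "pages": page,
        "area": list(area),
        "columns": list(columns),
        "guess": False,  # do not let tabula guess the table region itself
        "stream": True,  # stream mode: split on the given columns, not ruling lines
    }


def read_tables(path, layout):
    """Read the tables of a PDF, page by page.

    layout maps a page number to an (area, columns) pair, because the
    table may sit at a different position on each page.
    """
    import tabula  # from the tabula-py package; imported here so the helper above works without it

    frames = []
    for page, (area, columns) in sorted(layout.items()):
        options = build_read_options(page, area, columns)
        # read_pdf returns a list of DataFrames by default
        frames.extend(tabula.read_pdf(path, **options))
    return frames
```

You would call it with something like `read_tables("file.pdf", {1: ((80, 30, 750, 570), (120, 250, 380, 480))})`, where the file name and every number are made up for illustration. Since the first page often has a title above the table, giving each page its own area is usually what makes the content (and not only the header) come through.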

User contributions licensed under: CC BY-SA