I have thousands of PDF files, consisting only of tables, with this structure:
However, despite being fairly structured, I cannot read the tables without losing the structure.
I tried PyPDF2, but the extracted text comes out completely scrambled.
import PyPDF2

# Open the PDF in binary mode and grab the first page (PyPDF2 < 3.0 API)
pdfFileObj = open('pdf_file.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)

# Dump the raw text, then the first chunk when splitting on newlines or slashes
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])
I also tried Tabula, but it only reads the header (and not the content of the tables)
from tabula import read_pdf

pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')  # Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)  # Option 2: reads only the first header and a few lines of content
Any thoughts?
Answer
After struggling a little bit, I found a way.
For each page of the file, I had to pass two things to tabula's read_pdf function explicitly: the area of the table and the positions of the column boundaries.
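To make those two parameters concrete: in tabula-py, `area` is given as (top, left, bottom, right) and `columns` as the x-coordinates of the column boundaries, both in PDF points by default. The helper below is only a sketch for sanity-checking a layout before handing it to read_pdf. The name `column_spans` and the checks it performs are assumptions made for illustration, not part of tabula.

```python
def column_spans(area, columns):
    """Return the (left, right) x-ranges between consecutive column boundaries.

    area    -- (top, left, bottom, right) in PDF points
    columns -- x-coordinates of the column boundaries
    """
    top, left, bottom, right = area
    if not (top < bottom and left < right):
        raise ValueError("area must be (top, left, bottom, right)")

    xs = list(columns)
    # Assumption of this sketch: boundaries are listed left to right.
    if xs != sorted(xs):
        raise ValueError("column boundaries must be in ascending order")
    if xs and (xs[0] < left or xs[-1] > right):
        raise ValueError("column boundaries fall outside the area")

    # Pair each boundary with the next one.
    return list(zip(xs, xs[1:]))


# Values taken from the working code in this answer; they fit that PDF only.
spans = column_spans(
    (96, 24, 558, 750),
    (24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
)
for left_x, right_x in spans:
    print(left_x, right_x, right_x - left_x)
```

A very narrow span, or a boundary outside the area, usually means a coordinate was misread when measuring the page.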
Here is the working code:
import pypdf
from tabula import read_pdf

pdf_file = "pdf_file.pdf"  # path to the PDF (placeholder name from the question)

# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)

# For each page the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),  # (top, left, bottom, right) of the table, in points
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),  # x-coordinates of the column boundaries
)
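The snippet above reads a single page. A minimal sketch of repeating it for every page could look like the following. It assumes that all pages share the same layout and that the installed tabula-py version returns a list of DataFrames from read_pdf. The function name `read_all_pages` and its `reader` parameter are hypothetical, added only so the loop can be tried out without tabula or Java installed.

```python
# Layout values copied from the answer; they are specific to that PDF.
AREA = (96, 24, 558, 750)  # (top, left, bottom, right) in PDF points
COLUMNS = (24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748)


def read_all_pages(pdf_file, n_pages, area=AREA, columns=COLUMNS, reader=None):
    """Call the reader once per page and collect every table it returns."""
    if reader is None:
        # Imported lazily so the loop itself can run without tabula installed.
        from tabula import read_pdf as reader

    tables = []
    for page in range(1, n_pages + 1):  # tabula page numbers start at 1
        tables.extend(
            reader(
                pdf_file,
                guess=False,
                pages=page,
                stream=True,
                encoding="utf-8",
                area=area,
                columns=columns,
            )
        )
    return tables
```

Here `n_pages` would come from `len(pdf_reader.pages)` as computed above, and the resulting list can be combined with `pandas.concat(tables, ignore_index=True)` if a single table is wanted. If some pages have a different layout, the area and columns would need to be passed per page instead.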