How to extract Table from PDF in Python?

Question

I have thousands of PDF files, composed only of tables, with this structure: (image: pdf file) However, although the files are fairly structured, I cannot read the tables without losing the structure. I tried PyPDF2, but the data comes out completely scrambled. I

Accepted Answer

After struggling a little, I found a way. For each page of the file, I had to pass tabula's read_pdf function the area of the table and the boundaries of the columns. Here is the working code:

    import pypdf
    from tabula import read_pdf

    pdf_file = "file.pdf"  # placeholder path; point this at your own PDF

    # Get the number of pages in the file
    pdf_reader = pypdf.PdfReader(pdf_file)
    n_pages = len(pdf_reader.pages)

    # For each page, the table can be read with the following code
    # (shown here for page 1)
    table_pdf = read_pdf(
        pdf_file,
        guess=False,
        pages=1,
        stream=True,
        encoding="utf-8",
        area=(96, 24, 558, 750),
        columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
    )
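To apply the same settings to every page rather than a single one, the call can be wrapped in a loop. The following is only a minimal sketch, assuming every page shares the same table area and column boundaries (the answer does not say whether they do). The function name `read_all_pages` and its `n_pages` and `read_fn` parameters are hypothetical, introduced here for illustration. They let the page count and the reader be supplied from outside, so the loop can be tried without Java or a real PDF.

```python
def read_all_pages(pdf_file, area, columns, n_pages=None, read_fn=None):
    """Read the same table layout from every page of pdf_file.

    Hypothetical helper: assumes all pages share one area and one set of
    column boundaries. Returns a flat list of the per-page tables.
    """
    if n_pages is None:
        # Imported lazily so the function can be exercised without pypdf
        import pypdf
        n_pages = len(pypdf.PdfReader(pdf_file).pages)

    if read_fn is None:
        # tabula-py needs a Java runtime available on the machine
        from tabula import read_pdf as read_fn

    tables = []
    # tabula numbers pages starting from 1
    for page in range(1, n_pages + 1):
        result = read_fn(
            pdf_file,
            guess=False,       # do not let tabula guess the table area
            pages=page,
            stream=True,       # whitespace-based column detection
            encoding="utf-8",
            area=area,         # (top, left, bottom, right) in points
            columns=columns,   # x coordinates of the column boundaries
        )
        # read_pdf returns a list of DataFrames by default
        tables.extend(result)
    return tables
```

With the values from the answer, this would be called as `read_all_pages(pdf_file, area=(96, 24, 558, 750), columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748))`. The returned list can then be combined into one table with `pandas.concat(tables, ignore_index=True)`. If the layout differs between pages, `area` and `columns` would have to be looked up per page instead.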

Answer