I am trying to read a particular portion of a document as a table. It is structured as a table, but there are no dividing lines between cells, rows or columns.
I had success using the read_pdf() method with the area and columns arguments. I could specify exactly where the table starts and ends and where the columns divide.
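For reference, the area/columns approach can be sketched roughly as below. The coordinates and the helper names (build_read_options, read_table) are made up for illustration; only read_pdf() and its pages, area, columns, stream and guess arguments come from tabula-py. The area is given as [top, left, bottom, right] in PDF points.

```python
from typing import Any, Dict, List


def build_read_options(top: float, left: float, bottom: float, right: float,
                       columns: List[float], page: int = 1) -> Dict[str, Any]:
    """Collect keyword arguments for tabula.read_pdf() for one borderless table."""
    if not (top < bottom and left < right):
        raise ValueError("area must satisfy top < bottom and left < right")
    return {
        "pages": page,
        "area": [top, left, bottom, right],  # order expected by tabula-py
        "columns": sorted(columns),          # x coordinates of the column boundaries
        "stream": True,                      # no ruling lines, so use stream mode
        "guess": False,                      # do not let tabula guess the table area
    }


def read_table(pdf_path: str, options: Dict[str, Any]):
    """Hypothetical wrapper; needs tabula-py and a Java runtime installed."""
    import tabula  # imported lazily so the helper above works without it
    return tabula.read_pdf(pdf_path, **options)


# Hypothetical coordinates, for illustration only
options = build_read_options(top=100, left=30, bottom=400, right=560,
                             columns=[195, 310, 380])
# tables = read_table("test.pdf", options)
```

Each table on the page would need its own set of options, which is the part that does not scale for my document.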
But my PDF has multiple tables of different sizes on each page, with no clear markers to identify them, and I still have to use these arguments.
I found out about the read_pdf_with_template() method in the GitHub repo issues here, and a bit more about it in the documentation, pull request and the example notebook.
But nowhere is it mentioned how to structure the template.json, which arguments I can use, or what they mean.
I tried putting the area coordinates into x1, y1, x2, y2, passing the columns list as a method argument, and setting the height and width arguments to the size of the table.
But it is picking up a top centre section of the pdf which does not equate to any of the coordinates I inserted when I reverse calculated everything.
Here’s the page I’m trying to read (I’ve deleted some sensitive data)
and here are the code snippets
import tabula

tables = tabula.read_pdf_with_template(input_path="test.pdf", template_path="template.json", columns=[195, 310, 380])
print(tables[0])
[
  {
    "page": 1,
    "extraction_method": "stream",
    "x1": 225,
    "x2": 35,
    "y1": 375,
    "y2": 565,
    "width": 525,
    "height": 400
  }
]
Answer
I was just being a dum-dum.
Templates are not something that you generate manually. They are supposed to be generated by the tabula app as mentioned here.
Just download tabula from the official website. Once you launch the app, it’s fairly simple. Manually click and drag on each table on each page and click on the download template button on top.
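Once the app has exported a template, using it could look roughly like the sketch below. It also shows how, as far as I can tell, tabula-py turns each template entry into read_pdf() options: the area becomes [y1, x1, y2, x2], i.e. top, left, bottom, right. The helper names (template_to_options, read_with_template) and the sample entry values are hypothetical; the field names are the ones from the template in the question.

```python
import json
from typing import Any, Dict


def template_to_options(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Map one Tabula template entry to read_pdf() keyword arguments.

    Assumption: this mirrors the mapping tabula-py applies to a template.
    """
    options: Dict[str, Any] = {
        "pages": entry["page"],
        # top, left, bottom, right in PDF points
        "area": [entry["y1"], entry["x1"], entry["y2"], entry["x2"]],
    }
    method = entry.get("extraction_method")
    if method in ("stream", "lattice", "guess"):
        options[method] = True
    return options


def read_with_template(pdf_path: str, template_path: str):
    """Hypothetical wrapper; needs tabula-py and a Java runtime installed."""
    import tabula  # imported lazily so the helper above works without it
    return tabula.read_pdf_with_template(input_path=pdf_path,
                                         template_path=template_path)


# Hypothetical entry in the shape the Tabula app exports
sample = json.loads("""
{"page": 1, "extraction_method": "stream",
 "x1": 35.0, "y1": 225.0, "x2": 560.0, "y2": 625.0,
 "width": 525.0, "height": 400.0}
""")
options = template_to_options(sample)
# tables = read_with_template("test.pdf", "template.json")
```

Extra keyword arguments such as columns appear to be applied to every table in the template, so tables with different column positions may need separate read_pdf() calls built from options like the ones above.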