
How to trim (crop) bottom whitespace of a PDF document, in memory

I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I’ve failed to do so far) or render it incorrectly and trim it. I’m using Python.

Attempt type 1:

  • wkhtmltopdf render to a very, very long single-page PDF with a lot of extra space using --page-height
  • Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])

The PDF is rendered perfectly with 28 units of margin at the bottom, but I had to use the filesystem to execute the crop command. It seems that the tool expects an input file and output file, and also creates temporary files midway through. So I can’t use it.

Attempt type 2:

  • wkhtmltopdf render to multi-page PDF with default parameters
  • Use PyPDF4 (or PyPDF2) to read the file and combine pages into a long, single page

The PDF renders acceptably in most cases. However, if the last PDF page happens to have very little content, a lot of extra white space appears at the bottom.

Ideal scenario:

The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy with rendering the PDF using wkhtmltopdf, since it returns bytes, and later processing these bytes to remove any extra white space. But I don’t want to involve the file system in this; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height beforehand?

What I am doing now:

Note that pdfkit is a wkhtmltopdf wrapper.


It’s equivalent to Attempt type 2, except I don’t use PyPDF4 here to stitch the pages together, but instead render again with wkhtmltopdf using a precomputed page height.
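The original snippet is not reproduced here, so the following is only a minimal sketch of how a precomputed page height might be turned into wkhtmltopdf options. The function name `single_page_options`, the A4 dimensions and the 10 mm top and bottom margins are all assumptions made for illustration; `page-width`, `page-height`, `margin-top` and `margin-bottom` correspond to wkhtmltopdf's command-line flags of the same names.

```python
def single_page_options(page_count, page_width_mm=210.0, page_height_mm=297.0,
                        margin_top_mm=10.0, margin_bottom_mm=10.0):
    """Build an options dict for one tall page that holds `page_count` pages of content.

    Assumes the first render used pages of `page_height_mm` with the given
    top and bottom margins (hypothetical defaults: A4 with 10 mm margins).
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    # Height actually available for content on each page of the first render.
    printable_mm = page_height_mm - margin_top_mm - margin_bottom_mm
    # One tall page: all printable areas stacked, plus a single pair of margins.
    total_mm = page_count * printable_mm + margin_top_mm + margin_bottom_mm
    return {
        "page-width": f"{page_width_mm:g}mm",
        "page-height": f"{total_mm:g}mm",
        "margin-top": f"{margin_top_mm:g}mm",
        "margin-bottom": f"{margin_bottom_mm:g}mm",
    }


options = single_page_options(page_count=3)
```

The resulting dict could then be passed to something like `pdfkit.from_string(html, False, options=options)`, where passing `False` as the output path asks pdfkit to return the PDF bytes instead of writing a file. Because the height is a whole multiple of the printable area, a nearly empty last page still leaves the extra white space described above.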


Answer

There might be better ways to do this, but this at least works.

I’m assuming that you are able to crop the PDF yourself; all I’m doing here is determining how far down the last page the content extends. If that assumption is wrong, I could probably figure out how to crop the PDF. Alternatively, you could crop the rendered image (easy in Pillow) and then convert that to a PDF.
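To illustrate the idea of finding how far down the last page the content extends, here is a rough sketch that scans a rasterized page from the bottom up. It is not the tested code from this answer. It assumes the page has already been rendered to a grid of grayscale values (0 = black, 255 = white), one list per pixel row. The names `last_content_row` and `content_end_in_points` and the threshold of 250 are hypothetical.

```python
def last_content_row(rows, white_threshold=250):
    """Return the index of the lowest row containing a non-white pixel, or None."""
    for y in range(len(rows) - 1, -1, -1):
        if any(value < white_threshold for value in rows[y]):
            return y
    return None


def content_end_in_points(rows, dpi=72.0, white_threshold=250):
    """Distance from the top of the page to the end of the content, in PDF points.

    `dpi` is the resolution the page was rasterized at; 72 points make one inch.
    """
    y = last_content_row(rows, white_threshold)
    if y is None:
        return 0.0  # blank page: no content at all
    # Row y occupies the pixel span [y, y + 1), so the content ends at y + 1.
    return (y + 1) * 72.0 / dpi


# A tiny 4x4 "page" with one dark pixel in the second row.
page = [
    [255, 255, 255, 255],
    [255, 0, 255, 255],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
end = content_end_in_points(page, dpi=144.0)
```

The threshold allows for near-white antialiasing noise. A real page would need a much larger raster, and the rendering step itself is outside this sketch.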

Also, if you have one big PDF, you might need to figure out how far down the whole PDF the content ends. I’m only finding out how far down the last page the content ends, but converting from one to the other is simple arithmetic.
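The arithmetic could look roughly like this, assuming the big PDF is a stack of equally tall pages and that the content end on the last page is measured from the top of that page. The function name `crop_for_stitched_page` and its parameters are hypothetical. Units are whatever the page height is expressed in, typically PDF points. Since PDF coordinates start at the bottom-left corner, the second value returned is the y coordinate of the new lower edge.

```python
def crop_for_stitched_page(page_count, page_height, last_page_content_end,
                           bottom_margin=0.0):
    """Return (content_end_from_top, new_lower_y) for a stitched single page.

    page_count            -- number of original pages stacked vertically
    page_height           -- height of each original page
    last_page_content_end -- distance from the top of the last page to the
                             end of its content
    bottom_margin         -- white space to keep below the content
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if not 0 <= last_page_content_end <= page_height:
        raise ValueError("content end must lie within the last page")
    total_height = page_count * page_height
    # All earlier pages are full, so only the last page contributes partially.
    content_end = (page_count - 1) * page_height + last_page_content_end
    # Convert "distance from the top" to a bottom-up y coordinate, never below 0.
    new_lower_y = max(0.0, total_height - content_end - bottom_margin)
    return content_end, new_lower_y


content_end, new_lower_y = crop_for_stitched_page(3, 842, 100, bottom_margin=28)
```

Here three 842-unit pages with content ending 100 units into the last page give a content end of 1784 from the top. Keeping a 28-unit margin moves the lower edge of the page up to y = 714.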

Tested code:
