How to trim (crop) bottom whitespace of a PDF document, in memory

Question

I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I've failed to do so far) or render it incorrectly and trim it. I'm using Python. Attempt type 1: wkhtmltopdf render to a very, very long single-page PDF with a lot of

Accepted Answer

There might be better ways to do this, but this at least works.

I'm assuming that you are able to crop the PDF yourself, and all I'm doing here is determining how far down on the last page you still have content. If that assumption is wrong, I could probably figure out how to crop the PDF. Otherwise, you could just crop the image (easy in Pillow) and then convert that to PDF.

Also, if you have one big PDF, you might need to figure out how far down the whole PDF the text ends. I'm only finding out how far down on the last page the content ends, but converting from one to the other is simple arithmetic.

Tested code:

    import pdfkit
    from PyPDF2 import PdfFileReader
    from io import BytesIO

    # This library isn't named fitz on PyPI;
    # obtain it with `pip install PyMuPDF==1.19.4`
    import fitz

    # `pip install Pillow==8.3.1`
    from PIL import Image
    import numpy as np

    # However you arrive at valid HTML, it makes no difference to the solution.
    rendered = "<html><body><h1>Hello World</h1><p>hello</p></body></html>"

    # This first renders PDF from HTML normally (multiple pages),
    # then counts how many pages were created and determines the required single-page height,
    # then renders a single-page PDF from HTML using the page height and width arguments.
    pdf_bytes = pdfkit.from_string(rendered, options={
        "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
        "page-width": "210mm"
    })

    # Convert the PDF into an image.
    pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
    last_page = pdf[pdf.pageCount - 1]
    matrix = fitz.Matrix(1, 1)
    image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")
    image = Image.frombytes("L", (image_pixels.width, image_pixels.height), image_pixels.samples)

    # Uncomment if you want to see.
    # image.show()

    # Now figure out where the end of the text is.
    # First binarize. This might not be the most efficient way to do this,
    # but it's how I do it.
    THRESHOLD = 100
    # I wrote this code ages ago and don't remember the details, but
    # basically, we treat every pixel > 100 as a white pixel,
    # convert the result to a true/false matrix,
    # and then invert that.
    # The upshot is that, at the end, a value of "True"
    # in the matrix represents a black pixel in that location.
    binary_matrix = np.logical_not(image.point(lambda p: 255 if p > THRESHOLD else 0).convert("1"))

    # Now find the last row that contains a black pixel, starting at the bottom.
    row_count, column_count = binary_matrix.shape
    last_row = 0
    for i, row in enumerate(reversed(binary_matrix)):
        if any(row):
            last_row = i
            break

    percentage_from_top = (1 - last_row / row_count) * 100
    print(percentage_from_top)

    # Now you know where the page ends.
    # Go back and crop the PDF accordingly.
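The answer stops at "crop the PDF accordingly", so here is a minimal sketch of how that last step might look, entirely in memory. It assumes PyMuPDF (the same `fitz` import as above) and a page whose MediaBox starts at the origin, as wkhtmltopdf output normally does. The names `content_bottom_pt` and `crop_bottom` and the `margin_pt` parameter are hypothetical, introduced only for illustration; they are not part of the original answer.

```python
def content_bottom_pt(page_height_pt, percentage_from_top, margin_pt=0.0):
    """Convert "content ends at N% from the top" into a height in points.

    margin_pt (an assumed extra) leaves some whitespace below the content.
    The result is clamped so it never exceeds the original page height.
    """
    bottom = page_height_pt * percentage_from_top / 100.0 + margin_pt
    return max(0.0, min(page_height_pt, bottom))


def crop_bottom(pdf_bytes, percentage_from_top, margin_pt=10.0):
    """Return new PDF bytes with the last page cropped below its content."""
    # Imported here so the arithmetic helper above works without PyMuPDF.
    import fitz  # pip install PyMuPDF

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    page = doc[-1]  # last page; the only page in the single-page case
    rect = page.rect
    new_height = content_bottom_pt(rect.height, percentage_from_top, margin_pt)
    # Keep the full width and cut the page off just below the content.
    page.set_cropbox(fitz.Rect(rect.x0, rect.y0, rect.x1, rect.y0 + new_height))
    return doc.tobytes()
```

With the variables from the tested code above, `cropped = crop_bottom(pdf_bytes, percentage_from_top)` should give the trimmed document as bytes. Setting the CropBox only changes the visible area; the page content is left in the file untouched.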
