I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I’ve failed to do so far) or render it incorrectly and trim it. I’m using Python.
Attempt type 1:
- Use wkhtmltopdf to render to a very, very long single-page PDF with a lot of extra space using --page-height
- Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])
The PDF is rendered perfectly with 28 units of margin at the bottom, but I had to use the filesystem to execute the crop command. It seems that the tool expects an input file and output file, and also creates temporary files midway through. So I can’t use it.
Attempt type 2:
- Use wkhtmltopdf to render to a multi-page PDF with default parameters
- Use PyPDF4 (or PyPDF2) to read the file and combine pages into a long, single page
The PDF is rendered fine-ish in most cases. However, sometimes a lot of extra white space can be seen at the bottom if by chance the last PDF page had very little content.
Ideal scenario:
The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy with rendering the PDF using wkhtmltopdf, since it returns bytes, and later processing these bytes to remove any extra white space. But I don’t want to involve the file system in this; I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height beforehand?
What am I doing now:
Note that pdfkit is a wkhtmltopdf wrapper.
    # This is not a valid HTML (includes Django-specific stuff)
    template: Template = get_template("some-django-template.html")

    # This is now valid HTML
    rendered = template.render({
        "foo": "bar",
    })

    # This first renders PDF from HTML normally (multiple pages)
    # Then counts how many pages were created and determines the required single-page height
    # Then renders a single-page PDF from HTML using the page height and width arguments
    return pdfkit.from_string(rendered, options={
        "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
        "page-width": "210mm"
    })
It’s equivalent to Attempt type 2, except I don’t use PyPDF4 here to stitch the pages together, but instead render again with wkhtmltopdf using the precomputed page height.
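The height calculation described above can be pulled out into a small helper, which makes the arithmetic easier to check on its own. This is only a sketch: the name `single_page_options` and the A4 constants are illustrative and not part of pdfkit, and it assumes the first pass was rendered on A4 pages (210 mm by 297 mm).

```python
# Assumed A4 dimensions for the first, multi-page render.
A4_WIDTH_MM = 210
A4_HEIGHT_MM = 297


def single_page_options(page_count):
    """Build wkhtmltopdf options for one page as tall as `page_count` A4 pages.

    `single_page_options` is a hypothetical helper name, not a pdfkit function.
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    return {
        "page-height": f"{A4_HEIGHT_MM * page_count}mm",
        "page-width": f"{A4_WIDTH_MM}mm",
    }
```

The returned dictionary would then be passed as the `options` argument of `pdfkit.from_string`, in place of the inline dictionary used in the question.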
Answer
There might be better ways to do this, but this at least works.
I’m assuming that you are able to crop the PDF yourself, and all I’m doing here is determining how far down on the last page you still have content. If that assumption is wrong, I could probably figure out how to crop the PDF. Or otherwise, just crop the image (easy in Pillow) and then convert that to PDF?
Also, if you have one big PDF, you might need to figure out how far down on the whole PDF the text ends. I’m just finding out how far down on the last page the content ends. But converting from one to the other is just an easy arithmetic problem.
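As a rough sketch of that arithmetic: assuming every page has the same height (297 mm for A4) and that the position on the last page is given as a percentage from the top of that page, the position in the whole stitched document is the height of all full pages plus the used part of the last one. The function name `content_end_mm` is made up for this illustration.

```python
def content_end_mm(page_count, last_page_percentage, page_height_mm=297.0):
    """Distance in mm from the top of the stitched document to the end of content.

    Assumes equally tall pages; `last_page_percentage` is how far down the
    last page the content ends (0 to 100).
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if not 0 <= last_page_percentage <= 100:
        raise ValueError("last_page_percentage must be between 0 and 100")
    full_pages_mm = (page_count - 1) * page_height_mm
    last_page_mm = page_height_mm * last_page_percentage / 100.0
    return full_pages_mm + last_page_mm
```

For example, with three A4 pages and content ending halfway down the last one, the content ends two and a half page heights from the top of the document.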
Tested code:
    import pdfkit
    from PyPDF2 import PdfFileReader
    from io import BytesIO

    # This library isn't named fitz on pypi,
    # obtain this library with `pip install PyMuPDF==1.19.4`
    import fitz

    # `pip install Pillow==8.3.1`
    from PIL import Image

    import numpy as np

    # However you arrive at valid HTML, it makes no difference to the solution.
    rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"

    # This first renders PDF from HTML normally (multiple pages)
    # Then counts how many pages were created and determines the required single-page height
    # Then renders a single-page PDF from HTML using the page height and width arguments
    pdf_bytes = pdfkit.from_string(rendered, options={
        "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
        "page-width": "210mm"
    })

    # convert the pdf into an image.
    pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
    last_page = pdf[pdf.pageCount-1]
    matrix = fitz.Matrix(1, 1)
    image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")

    image = Image.frombytes("L", [image_pixels.width, image_pixels.height], image_pixels.samples)

    #Uncomment if you want to see.
    #image.show()

    # Now figure out where the end of the text is:

    # First binarize. This might not be the most efficient way to do this.
    # But it's how I do it.
    THRESHOLD = 100
    # I wrote this code ages ago and don't remember the details but
    # basically, we treat every pixel > 100 as a white pixel,
    # We convert the result to a true/false matrix
    # And then invert that.
    # The upshot is that, at the end, a value of "True"
    # in the matrix will represent a black pixel in that location.
    binary_matrix = np.logical_not(image.point( lambda p: 255 if p > THRESHOLD else 0 ).convert("1"))

    # Now find the last row that contains a black pixel, starting at the bottom
    row_count, column_count = binary_matrix.shape
    last_row = 0
    for i, row in enumerate(reversed(binary_matrix)):
        if any(row):
            last_row = i
            break
        else:
            continue

    percentage_from_top = (1 - last_row / row_count) * 100
    print(percentage_from_top)

    # Now you know where the page ends.
    # Go back and crop the PDF accordingly.
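For the final cropping step, one possible approach is to turn the percentage into a new page box expressed in PDF points. The sketch below only does the coordinate arithmetic and touches no PDF library. It assumes the page box starts at the origin and uses the PDF convention that y grows upward from the bottom-left corner, so trimming the bottom means raising the lower edge. The function name `crop_box` and the default margin of 28 points are illustrative assumptions.

```python
def crop_box(page_width_pt, page_height_pt, percentage_from_top, bottom_margin_pt=28.0):
    """Return (x0, y0, x1, y1) of a box that keeps the content plus a bottom margin.

    Coordinates are PDF points with the origin at the bottom-left of the page.
    `percentage_from_top` is how far down the page the content ends (0 to 100).
    """
    if not 0 <= percentage_from_top <= 100:
        raise ValueError("percentage_from_top must be between 0 and 100")
    # Height (measured from the page bottom) at which the content ends.
    content_bottom_pt = page_height_pt * (1 - percentage_from_top / 100.0)
    # Leave some margin below the content, but never go below the page itself.
    lower_y = max(0.0, content_bottom_pt - bottom_margin_pt)
    return (0.0, lower_y, float(page_width_pt), float(page_height_pt))
```

The resulting box could then be applied as the page's media or crop box with whichever PDF library is already in use, writing the result to a `BytesIO` so that nothing touches the file system. Note that some libraries measure y from the top of the page instead, in which case the box would need to be flipped first.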