I have a function that gets a page from a PDF file via PyPDF2 and should convert the first page to a PNG (or JPG) with Pillow (the PIL fork):
    from PyPDF2 import PdfFileWriter, PdfFileReader
    import os
    from PIL import Image
    import io

    # Open PDF Source #
    app_path = os.path.dirname(__file__)
    src_pdf = PdfFileReader(open(os.path.join(app_path, "../../../uploads/%s" % filename), "rb"))

    # Get the first page of the PDF #
    dst_pdf = PdfFileWriter()
    dst_pdf.addPage(src_pdf.getPage(0))

    # Create BytesIO #
    pdf_bytes = io.BytesIO()
    dst_pdf.write(pdf_bytes)
    pdf_bytes.seek(0)

    file_name = "../../../uploads/%s_p%s.png" % (name, pagenum)
    img = Image.open(pdf_bytes)
    img.save(file_name, 'PNG')
    pdf_bytes.flush()
That results in an error:
OSError: cannot identify image file <_io.BytesIO object at 0x0000023440F3A8E0>
I found some threads with a similar issue (PIL open() method not working with BytesIO), but I cannot see where I am going wrong here, as I have already added pdf_bytes.seek(0).
Any hints appreciated
Answer
Per the PyPDF2 documentation for PdfFileWriter.write:
write(stream) Writes the collection of pages added to this object out as a PDF file.
Parameters: stream – An object to write the file to. The object must support the write method and the tell method, similar to a file object.
So the object pdf_bytes contains a PDF file, not an image file.
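One way to see this for yourself is to look at the first few bytes of the stream before handing it to Pillow. The sketch below uses only the standard library; `sniff_format` and `SIGNATURES` are hypothetical names made up for illustration, and the sample bytes merely stand in for what `dst_pdf.write(pdf_bytes)` would produce.

```python
import io

# Leading "magic" bytes of a few common formats.
SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_format(stream):
    """Return 'pdf', 'png', 'jpeg', or None based on the stream's first bytes."""
    start = stream.tell()
    header = stream.read(8)
    stream.seek(start)  # leave the stream position where we found it
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


# Stand-in for the question's pdf_bytes: data that begins with a PDF header.
pdf_bytes = io.BytesIO(b"%PDF-1.3\n")
print(sniff_format(pdf_bytes))  # pdf
```

If this reports "pdf", Image.open() has nothing it can identify, no matter where the stream position is, so seek(0) does not help.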
Code like the above sometimes works because some PDF files contain nothing but a JPEG image as their content. If yours is an ordinary PDF file, you can't just read its bytes and parse them as an image.
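To illustrate the "PDF that just wraps a JPEG" case, here is a rough sketch that scans raw bytes for the JPEG start and end markers. `find_embedded_jpeg` is a hypothetical helper, not part of PyPDF2 or Pillow, and the scan is deliberately naive: it assumes the image is stored as an unmodified JPEG stream and that the end marker does not also occur earlier inside the image data (for example in an embedded thumbnail).

```python
def find_embedded_jpeg(data):
    """Return the first JPEG-looking byte range in data, or None.

    Naive approach: find the JPEG start marker (FF D8 FF), then the
    next end marker (FF D9) after it, and return everything in between.
    """
    start = data.find(b"\xff\xd8\xff")
    if start == -1:
        return None
    end = data.find(b"\xff\xd9", start)
    if end == -1:
        return None
    return data[start:end + 2]  # include the two-byte end marker
```

With the question's variables, something like Image.open(io.BytesIO(find_embedded_jpeg(pdf_bytes.getvalue()))) could then work for such a file. For a page made of text or vector graphics there is no embedded image to find, and the page has to be rendered by a real PDF renderer instead.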
For a more robust implementation, see https://stackoverflow.com/a/34116472/334999