Python's pdfreader library for PDF extraction won't iterate through pages

I want to extract text from a PDF file with the Python library pdfreader. I followed the instructions here:

https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-browse-document-pages

This is my code:

The code does not raise any errors, but it does not iterate over the pages. The variable total_page_num holds the correct number of pages (more than 1), yet the for loop only ever processes one page (the first one).

Answer

Solving this issue required a lot of documentation reading for the Python module pdfreader. I was shocked at the level of difficulty in using this module for simple text extraction. It took hours to figure out a working solution.

The code below will enumerate the text on individual pages. You will still need to do some text cleaning to get your desired output.

I noticed that one of your PDFs has a font-encoding problem during parsing, which produces a warning message.
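A minimal sketch of a per-page extraction loop, assuming pdfreader's SimplePDFViewer with navigate(), render() and canvas.strings as described in the tutorial linked in the question. The function names extract_page_strings and extract_from_file are hypothetical, and reusing one file handle for both PDFDocument and SimplePDFViewer follows the tutorial's pattern rather than anything verified against your files. The key point is that the viewer has to be moved to each page and rendered again inside the loop; rendering once only ever gives you the current page.

```python
def extract_page_strings(viewer, total_pages):
    """Return {page_number: [strings]} for pages 1..total_pages.

    `viewer` is expected to behave like pdfreader.SimplePDFViewer:
    navigate(n) selects page n (1-based), render() parses that page,
    and viewer.canvas.strings then holds the page's text strings.
    """
    pages = {}
    for page_number in range(1, total_pages + 1):
        viewer.navigate(page_number)  # move to the page first
        viewer.render()               # then parse it into viewer.canvas
        # Copy the list so later renders cannot affect stored results.
        pages[page_number] = list(viewer.canvas.strings)
    return pages


def extract_from_file(path):
    """Hypothetical wrapper: open a PDF and extract strings page by page."""
    # Third-party dependency (pip install pdfreader), imported lazily so
    # extract_page_strings can be used without it.
    from pdfreader import PDFDocument, SimplePDFViewer

    with open(path, "rb") as fd:
        doc = PDFDocument(fd)
        total_pages = sum(1 for _ in doc.pages())  # count pages
        viewer = SimplePDFViewer(fd)
        return extract_page_strings(viewer, total_pages)
```

Calling extract_from_file("your_file.pdf") should give one entry per page; the strings usually still need joining and cleaning, since PDFs often store text in small fragments.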
