Python's pdfreader library for PDF extraction won't iterate through pages

Question

I want to extract text from a PDF file with Python's pdfreader library. I followed the instructions here: https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-browse-document-pages This is my code: The code does not give me any errors, but the problem is that it does not iterate over the pages. The variable total_page_num gives me the number of pages (more than 1), but when I go into the for loop it

Accepted Answer

Solving this issue required a lot of documentation reading for the Python module pdfreader. I was shocked at how difficult this module is to use for simple text extraction, and it took hours to figure out a working solution.

The code below enumerates the text on individual pages. You will still need to do some text cleaning to get your desired output. I also noticed that one of your PDFs has a font-encoding problem during parsing, which produces a warning message.

    import requests
    from io import BytesIO
    from pdfreader import SimplePDFViewer

    pdf_links = [
        'https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
        'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
        'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
        'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf'
    ]

    for pdf_link in pdf_links:
        response = requests.get(pdf_link, stream=True)
        # extract text page by page
        with BytesIO(response.content) as data:
            viewer = SimplePDFViewer(data)
            # count the pages by walking the document's page tree
            all_pages = [p for p in viewer.doc.pages()]
            number_of_pages = len(all_pages)
            for page_number in range(1, number_of_pages + 1):
                # page numbers start at 1; move to the page, then render it
                viewer.navigate(page_number)
                viewer.render()
                # join the page's text fragments; runs of five spaces become paragraph breaks
                page_strings = " ".join(viewer.canvas.strings).replace('     ', '\n\n').strip()
                print(f'Current Page Number: {page_number}')
                print(f'Page Text: {page_strings}')
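
The answer mentions that the extracted text still needs some cleaning. As a rough sketch of what that step could look like, the helper below takes a list of string fragments (such as the ones found in viewer.canvas.strings) and normalizes the whitespace. The function name clean_page_text and the paragraph_gap threshold are hypothetical, chosen only for illustration; the right threshold depends on how a given PDF lays out its text.

```python
import re


def clean_page_text(strings, paragraph_gap=5):
    """Join text fragments from one page and normalize whitespace.

    `strings` is any list of text fragments. `paragraph_gap` is an
    assumed heuristic: a run of at least this many spaces is treated
    as a paragraph break.
    """
    joined = " ".join(strings)
    # Split wherever a long run of spaces suggests a paragraph break.
    paragraphs = re.split(r" {%d,}" % paragraph_gap, joined)
    # Collapse any remaining whitespace runs inside each paragraph.
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with blank lines.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    fragments = ["Budget", "2021", "     ", "Total:", "  100"]
    print(clean_page_text(fragments))
```

Inside the page loop, this would be called as clean_page_text(viewer.canvas.strings) in place of the join/replace/strip chain, assuming the same five-space convention holds for your documents.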

Answer