
Python's pdfreader library for PDF extraction won't iterate through pages

I want to extract text from a PDF file with the Python library pdfreader. I followed the instructions here:

https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-browse-document-pages

This is my code:

import requests
from io import StringIO, BytesIO
from pdfreader import SimplePDFViewer, PDFDocument

pdf_links = ['https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
             'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:

    response = requests.get(pdf_link)
    my_raw_data = response.content


    #extract text page by page
    with BytesIO(my_raw_data) as data:
        
        viewer = SimplePDFViewer(data)
        full_pdf_text = ''

        total_page_num = len(list(viewer))
        for i, page in enumerate(viewer):
            text = page.strings
            text = "".join(text)
            text = text.strip().replace('     ', '\n\n').strip()
            text = text.replace('  ', '\n\n')
            print('PAGE', i)

The code does not raise any errors, but it does not iterate over the pages. The variable total_page_num holds the number of pages (more than 1), but the for loop only ever processes one page (the first one).


Answer

Solving this issue required a lot of documentation reading for the Python module pdfreader. I was shocked at the level of difficulty in using this module for simple text extraction. It took hours to figure out a working solution.

The code below will enumerate the text on individual pages. You will still need to do some text cleaning to get your desired output.

I noticed that one of your PDFs has a font-encoding problem during parsing, which produces a warning message.

import requests
from io import BytesIO
from pdfreader import SimplePDFViewer

pdf_links = [
    'https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
    'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
    'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
    'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:

    # response.content reads the whole body anyway, so stream=True is not needed
    response = requests.get(pdf_link)

    # extract text page by page
    with BytesIO(response.content) as data:

        viewer = SimplePDFViewer(data)

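        # count the pages once via the underlying document, then visit each page by number
        number_of_pages = len(list(viewer.doc.pages()))
        for page_number in range(1, number_of_pages + 1):
            viewer.navigate(page_number)  # jump to the page explicitly
            viewer.render()  # populate viewer.canvas for the current page
            page_strings = " ".join(viewer.canvas.strings).replace('     ', '\n\n').strip()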
            print(f'Current Page Number: {page_number}')
            print(f'Page Text: {page_strings}')
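
For the text cleaning mentioned above, a small helper might be a reasonable starting point. This is only a sketch: clean_page_text is a hypothetical name, not part of pdfreader, and it assumes that a run of five or more spaces marks a paragraph break (the same heuristic as the replace() call in the code above). Whether that holds depends on how each PDF was produced.

```python
import re


def clean_page_text(raw: str) -> str:
    """Normalize whitespace in the joined strings of a single page.

    Hypothetical helper; the five-space threshold is an assumption
    and will likely need tuning for each document.
    """
    # Treat runs of five or more spaces as a paragraph break.
    text = re.sub(r" {5,}", "\n\n", raw)
    # Collapse any remaining runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that ended up next to a line break.
    text = re.sub(r" *\n *", "\n", text)
    # Never keep more than one blank line in a row.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Inside the loop, page_strings could then be built as clean_page_text(" ".join(viewer.canvas.strings)) instead of chaining replace() and strip().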