I want to extract text from a PDF file with the Python library pdfreader. I followed the instructions here:
https://pdfreader.readthedocs.io/en/latest/tutorial.html#how-to-browse-document-pages
This is my code:
import requests
from io import StringIO, BytesIO
from pdfreader import SimplePDFViewer, PDFDocument

pdf_links = ['https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
             'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
             'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:
    response = requests.get(pdf_link)
    my_raw_data = response.content

    #extract text page by page
    with BytesIO(my_raw_data) as data:
        viewer = SimplePDFViewer(data)
        full_pdf_text = ''
        total_page_num = len(list(viewer))
        for i, page in enumerate(viewer):
            text = page.strings
            text = "".join(text)
            text = text.strip().replace(' ', '\n\n').strip()
            text = text.replace(' ', '\n\n')
            print('PAGE', i)
The code does not raise any errors, but it does not iterate over the pages. The variable total_page_num holds the correct number of pages (more than 1), but the for loop only ever runs once, for the first page.
Answer
Solving this issue required a lot of documentation reading for the Python module pdfreader. I was shocked at the level of difficulty in using this module for simple text extraction. It took hours to figure out a working solution.
The code below will enumerate the text on individual pages. You will still need to do some text cleaning to get your desired output.
I noticed that one of your PDFs has a font-encoding problem during parsing, which produces a warning message.
import requests
from io import BytesIO
from pdfreader import SimplePDFViewer

pdf_links = [
    'https://www.buelach.ch/fileadmin/files/documents/Finanzen/Finanz-_und_Aufgabenplan_2020-2024_2020-09-14.pdf',
    'https://www.buelach.ch/fileadmin/files/documents/Finanzen/201214_budget2021_aenderungen_gr.pdf',
    'http://www.dielsdorf.ch/dl.php/de/5e8c284c3b694/2020.04.06.pdf',
    'http://www.dielsdorf.ch/dl.php/de/5f17e472ca9f1/2020.07.20.pdf']

for pdf_link in pdf_links:
    response = requests.get(pdf_link, stream=True)

    # extract text page by page
    with BytesIO(response.content) as data:
        viewer = SimplePDFViewer(data)

        # count the pages through the underlying document instead of iterating the viewer
        all_pages = [p for p in viewer.doc.pages()]
        number_of_pages = len(all_pages)

        # page numbers passed to navigate() start at 1
        for page_number in range(1, number_of_pages + 1):
            viewer.navigate(int(page_number))
            # render() parses the current page and fills viewer.canvas
            viewer.render()
            page_strings = " ".join(viewer.canvas.strings).replace(' ', '\n\n').strip()
            print(f'Current Page Number: {page_number}')
            print(f'Page Text: {page_strings}')
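For the text cleaning mentioned above, something along these lines may be a reasonable starting point. It is only a sketch: clean_page_text is a hypothetical helper name, not part of pdfreader, and it assumes the input is a plain list of strings such as the one viewer.canvas.strings holds after viewer.render(). The sample fragments are made up for illustration.

```python
import re


def clean_page_text(strings):
    """Join extracted string fragments and normalise whitespace.

    Hypothetical helper, not part of pdfreader. `strings` is assumed to be
    a list of str, e.g. viewer.canvas.strings after viewer.render().
    """
    # Drop fragments that are empty or contain only whitespace.
    fragments = [s.strip() for s in strings if s.strip()]
    # Join with single spaces, then collapse any remaining whitespace runs.
    return re.sub(r"\s+", " ", " ".join(fragments))


# Made-up fragments standing in for one page's extracted strings.
sample = ["  Budget ", "", "2021\n", "  Änderungen   GR "]
print(clean_page_text(sample))  # prints: Budget 2021 Änderungen GR
```

Inside the loop above you would call it as clean_page_text(viewer.canvas.strings) in place of the join/replace/strip chain. Collapsing all whitespace also removes paragraph breaks, so adjust the pattern if you need to keep them.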