PDF reading, returning empty rows

I have a function to read a PDF, as below:

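# PDF files
import PyPDF2

def Readingpdf(pdfname):
    # PdfFileReader, getNumPages, getPage and extractText are the legacy (pre-3.0) PyPDF2 names
    pdfRead = PyPDF2.PdfFileReader(pdfname)
    comp = ""
    for i in range(pdfRead.getNumPages()):
        comp += pdfRead.getPage(i).extractText()
    return comp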

It works fine on a normal PDF file (like a book), where I can extract the text easily, but when I tried it at work on “meeting minutes” I got only empty lines, like below:

' nnnnn nnn nn nn nnnn n nn nn nn n n n n n nnnnn n nnnn n nn nn nn nn nn nn nn nn nn nn nn nn nn nn nnn nn nn nn nn n nn nn nn n nn nn nn n nn nn nn n nn n nn n nnn n nn n nn n n n n nnnnnnnnn n n nn n nn n nn n nn n nn nn nn n n n n n n n n nnn nn nn n n n n n n n n nnn nn n nn n n n n n n n n-n nnn n nn nn n nn n n n nn nn n nn nn n nn n n n n nn nnnn n nn n n n n n n n n-n nnn n nn nn n nn n n n nn nn nn nn n n n n n n n n nnn nnnn n nn n n n n n n n n-n nnnn n nn nn n nn n n n n n nn nn n nn n nnnn n nn nn n nn nnnnnnnn n nn n nnn n nn nn n nn nn nn nn nn n nn n n nnnnnnn nnn2n nn nn n nn n nn n nn n nn nnn nnnn n nnnnn n n n n n n n n-n nnnnn n nn nnnnnnnn n nn n n n n n nn nnnnnn n nn n n n n n nn nnnnnnn n nn n nnnn n nn nn n nn nnnnnnnn n nn n nnnn n nn nn n nn nnnnnnnnnnn n nn n nnnnnnnn n nn nn nnn nn n n n n n n n n nnn nnn n nn n n n n n n n n-n nnnnn n nn nn n nn n n n n n nn nnnnnnnnn n nn n n n n n nn nnnnnnnnnnnnnnnn n nn n n n n n-n nnnnnn n nn nnnnn n nn n n n n n nn nnn n nn n n n n n nn nn n nn n n n n n nn nnnn n nn n nnnn n nn nn nn nn n nn n n n n n n n n-n nnnnnn n nn nn n n n n n n n nn nnn nn nnn n nn n nn n n n n n nn nnnnnnnnnnn n nn n n n n n nn nnnnnnnnnnnnnn nn nnnn n nn n nn n n n nnnn n n n nn nn n nn nnn n nn n n n n n nn nnnn n nn n nnn n nnn n n n nnnnnnn nnn3n nn nn nn n nn n n n n n n n n-n nnnnn n nn nn n nn n n n n n nn nnn n nn n n n n n nn nn nnn nnn n nn n n n n n nn nn n nn n n n n n n nn nnnnnnnn n nn n nn n nn nn n nn nn n nn n nn n nn nnn n nn nnnnnnnnn n nn n nnn n nnn n nn nnnnnnn n nn n nnnnn n nn nnnn nn nn n n n n n n n n nn nn n nn n n n n n n n nn n n n n n n n n n-n nnn n nn nnnnnnnnnnnn n nn n n n n n nn nn n nn n nnn n nn nn n-n nnnn n nn nnnnn n nn n n n n n nn nnnnn nnnnnnnnnnnnnnn nnnnn n nn n n n n n-n nn n nn nnnnn n nn n n n n n nn nnnnnnnnnnnnnnnnnnnnnn n nn n n n n n nn nnnnnnnnn

I am very sorry that I cannot share the original PDF; however, here is a picture to explain its structure:

I am not getting any errors, just empty lines. Is it because the files have a table format? Or am I missing something in this simple function?

It is part of an NLP project where I need to load 100 of these files and look for patterns in them. Any help is really appreciated.


Answer

There are many ways to extract text from a PDF. Please try the following:

  1. PDFMINER

Use pdfminer to extract the text. Example code:

import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path):
    '''Convert PDF content from a file path to text.

    :param path: the file path
    :return: the extracted text as a single string
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()

if __name__ == "__main__":
    print(convert_pdf_to_txt('test.pdf'))

  2. PYMUPDF

Use PyMuPDF to extract the text. Example code:

import fitz  # PyMuPDF

doc = fitz.open("file.pdf")
for page in doc:
    # get_text() is the current name; older releases spelled it getText()
    text = page.get_text()
    print(text)

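  3. OCR

If the PDF is scanned (its pages are images rather than text), you need OCR to extract the text. In my opinion the best OCR engine is Tesseract, originally developed at HP and later sponsored by Google. Install pytesseract, convert each PDF page to an image, and run the OCR on those images. I highly recommend combining it with OpenCV so that the OCR runs only on the text regions.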

https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
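
As a rough illustration of the OCR route, here is a minimal sketch (not the method from the linked article, and without the OpenCV step). It assumes the pdf2image package (which needs Poppler installed) and pytesseract (which needs the Tesseract binary installed). The function name ocr_pdf and the render/recognize parameters are hypothetical, added only so the two stages can be swapped or tried without those tools.

```python
def ocr_pdf(path, dpi=300, render=None, recognize=None):
    """OCR every page of a PDF and return the text as one string.

    render:    callable(path) -> iterable of page images (hypothetical hook)
    recognize: callable(image) -> str                    (hypothetical hook)
    When left as None, pdf2image and pytesseract are used.
    """
    if render is None:
        # Assumption: pdf2image is installed and Poppler is on the PATH.
        from pdf2image import convert_from_path
        render = lambda p: convert_from_path(p, dpi=dpi)
    if recognize is None:
        # Assumption: pytesseract is installed and can find the Tesseract binary.
        import pytesseract
        recognize = pytesseract.image_to_string
    # One OCR pass per page image, pages joined with a newline.
    return "\n".join(recognize(image) for image in render(path))
```

With both tools installed, print(ocr_pdf('test.pdf')) should print whatever Tesseract recognizes on each page. A higher dpi usually helps with small print but makes rendering slower.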

  4. SLATE

Use the slate library (pip install slate3k):

import slate3k as slate

# The file name must be a quoted string
with open('file.pdf', 'rb') as f:
    extracted_text = slate.PDF(f)
    print(extracted_text)
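
Since it is hard to know in advance which of these options will work on a given file, one way to handle a batch is to try them in order and keep the first result that is not just whitespace. This is only a sketch: has_real_text, first_useful_text and the min_chars threshold of 20 are hypothetical names and values, not part of any of the libraries above.

```python
def has_real_text(text, min_chars=20):
    """Return True if text has at least min_chars non-whitespace characters."""
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


def first_useful_text(path, extractors):
    """Try each (name, function) pair in order.

    Returns (name, text) for the first extractor whose output passes
    has_real_text, or (None, "") if none of them does.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A failing extractor should not stop the batch; try the next one.
            continue
        if has_real_text(text):
            return name, text
    return None, ""
```

For example, first_useful_text('test.pdf', [('pdfminer', convert_pdf_to_txt)]) would return the pdfminer output if it contains real text. If every text-based extractor comes back empty, the file is probably a scan and needs the OCR option.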

Good luck!!!
