PDF reading, returning empty rows

Question

I have a function to read a PDF, shown below. It works fine on a normal PDF file (such as a book), where I can extract the text easily, but when I tried it at work on "meeting minutes" I got only empty lines, as shown below. I'm sorry that I cannot share the original PDF, but here is a picture.

Accepted Answer

There are many ways to extract text from a PDF. Please try the following:

PDFMINER

Use pdfminer to extract the text. You can refer to this example code:

    import io

    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        '''Convert pdf content from a file path to text

        :path the file path
        '''
        rsrcmgr = PDFResourceManager()
        codec = 'utf-8'
        laparams = LAParams()

        with io.StringIO() as retstr:
            with TextConverter(rsrcmgr, retstr, codec=codec,
                               laparams=laparams) as device:
                with open(path, 'rb') as fp:
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    password = ""
                    maxpages = 0      # 0 means no page limit
                    caching = True
                    pagenos = set()   # an empty set means all pages

                    for page in PDFPage.get_pages(fp,
                                                  pagenos,
                                                  maxpages=maxpages,
                                                  password=password,
                                                  caching=caching,
                                                  check_extractable=True):
                        interpreter.process_page(page)

                    return retstr.getvalue()


    if __name__ == "__main__":
        print(convert_pdf_to_txt('test.pdf'))

PYMUPDF

Use pymupdf to extract the text. You can refer to this example code:

    import fitz  # PyMuPDF

    doc = fitz.open("file.pdf")

    for page in doc:
        # Older PyMuPDF releases spelled this page.getText()
        text = page.get_text()
        print(text)

OCR

If the PDF is a scan, its pages are images with no text layer, so the extractors above return nothing and you need OCR. In my opinion the best OCR engine is Tesseract OCR, whose development Google sponsored for many years. You can install pytesseract and run it on images of your PDF pages, but I highly recommend combining it with OpenCV so that OCR runs only on the text regions:

https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052

SLATE

Use the slate library (pip install slate3k):

    import slate3k as slate

    with open('file.pdf', 'rb') as f:
        extracted_text = slate.PDF(f)
        print(extracted_text)

Good luck!
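
One more note on the OCR route: empty lines from a text extractor usually suggest that the pages have no text layer, so it can help to check for that and run OCR only on the blank pages. Below is a minimal sketch of that idea, not a definitive implementation. The helper names (is_blank, fill_blank_pages, make_tesseract_ocr) are hypothetical, and the OCR callback assumes that pdf2image and pytesseract are installed along with the Poppler and Tesseract binaries they rely on.

```python
from typing import Callable, List


def is_blank(text: str) -> bool:
    """Return True when a page's extracted text is empty or only whitespace."""
    return not text.strip()


def fill_blank_pages(page_texts: List[str],
                     ocr_page: Callable[[int], str]) -> List[str]:
    """Keep extracted text where it exists; call ocr_page(index) for blank pages."""
    return [ocr_page(i) if is_blank(text) else text
            for i, text in enumerate(page_texts)]


def make_tesseract_ocr(path: str) -> Callable[[int], str]:
    """Hypothetical OCR callback built on pdf2image + pytesseract.

    Assumption: both packages are installed. The imports are done lazily so
    the rest of this sketch runs without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path)  # one PIL image per page

    def ocr_page(index: int) -> str:
        return pytesseract.image_to_string(images[index])

    return ocr_page


if __name__ == "__main__":
    # Stand-in data: page 0 has a text layer, pages 1 and 2 came back empty.
    texts = ["Agenda item 1", "   \n", ""]
    fake_ocr = lambda i: f"<ocr text for page {i}>"
    print(fill_blank_pages(texts, fake_ocr))
    # ['Agenda item 1', '<ocr text for page 1>', '<ocr text for page 2>']
```

For a real file, page_texts could come from any of the extractors above (for example, one string per page from PyMuPDF), and make_tesseract_ocr('file.pdf') would replace the fake callback.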
