I have a function to read a PDF, as below:
    # PDF files
    def Readingpdf(pdfname):
        pdfRead = PyPDF2.PdfFileReader(pdfname)
        comp = ""
        for i in range(pdfRead.getNumPages()):
            comp += pdfRead.getPage(i).extractText()
        return comp
It works fine on a normal PDF file (such as a book), where I can extract the text easily. But when I tried it at work on “meeting minutes”, I got only empty lines, as shown below:
' nnnnn nnn nn nn nnnn n nn nn nn n n n n n nnnnn n nnnn n nn nn nn nn nn nn nn nn nn nn nn nn nn nn nnn nn nn nn nn n nn nn nn n nn nn nn n nn nn nn n nn n nn n nnn n nn n nn n n n n nnnnnnnnn n n nn n nn n nn n nn n nn nn nn n n n n n n n n nnn nn nn n n n n n n n n nnn nn n nn n n n n n n n n-n nnn n nn nn n nn n n n nn nn n nn nn n nn n n n n nn nnnn n nn n n n n n n n n-n nnn n nn nn n nn n n n nn nn nn nn n n n n n n n n nnn nnnn n nn n n n n n n n n-n nnnn n nn nn n nn n n n n n nn nn n nn n nnnn n nn nn n nn nnnnnnnn n nn n nnn n nn nn n nn nn nn nn nn n nn n n nnnnnnn nnn2n nn nn n nn n nn n nn n nn nnn nnnn n nnnnn n n n n n n n n-n nnnnn n nn nnnnnnnn n nn n n n n n nn nnnnnn n nn n n n n n nn nnnnnnn n nn n nnnn n nn nn n nn nnnnnnnn n nn n nnnn n nn nn n nn nnnnnnnnnnn n nn n nnnnnnnn n nn nn nnn nn n n n n n n n n nnn nnn n nn n n n n n n n n-n nnnnn n nn nn n nn n n n n n nn nnnnnnnnn n nn n n n n n nn nnnnnnnnnnnnnnnn n nn n n n n n-n nnnnnn n nn nnnnn n nn n n n n n nn nnn n nn n n n n n nn nn n nn n n n n n nn nnnn n nn n nnnn n nn nn nn nn n nn n n n n n n n n-n nnnnnn n nn nn n n n n n n n nn nnn nn nnn n nn n nn n n n n n nn nnnnnnnnnnn n nn n n n n n nn nnnnnnnnnnnnnn nn nnnn n nn n nn n n n nnnn n n n nn nn n nn nnn n nn n n n n n nn nnnn n nn n nnn n nnn n n n nnnnnnn nnn3n nn nn nn n nn n n n n n n n n-n nnnnn n nn nn n nn n n n n n nn nnn n nn n n n n n nn nn nnn nnn n nn n n n n n nn nn n nn n n n n n n nn nnnnnnnn n nn n nn n nn nn n nn nn n nn n nn n nn nnn n nn nnnnnnnnn n nn n nnn n nnn n nn nnnnnnn n nn n nnnnn n nn nnnn nn nn n n n n n n n n nn nn n nn n n n n n n n nn n n n n n n n n n-n nnn n nn nnnnnnnnnnnn n nn n n n n n nn nn n nn n nnn n nn nn n-n nnnn n nn nnnnn n nn n n n n n nn nnnnn nnnnnnnnnnnnnnn nnnnn n nn n n n n n-n nn n nn nnnnn n nn n n n n n nn nnnnnnnnnnnnnnnnnnnnnn n nn n n n n n nn nnnnnnnnn
I'm very sorry that I cannot share the original PDF; however, here is a picture to explain its structure:
I am not getting any errors, just empty lines. Is it because the files use a table format? Or am I missing something in this simple function?
It is part of an NLP project where I need to load 100 of these files and look for patterns in them. Any help is really appreciated.
Answer
There are many ways to extract text from a PDF. Please try the following:
- PDFMINER
Use pdfminer to extract the text (the maintained fork is installed with pip install pdfminer.six). You can refer to this example code:
    import io

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage


    def convert_pdf_to_txt(path):
        '''Convert pdf content from a file path to text

        :path the file path
        '''
        rsrcmgr = PDFResourceManager()
        codec = 'utf-8'
        laparams = LAParams()

        with io.StringIO() as retstr:
            with TextConverter(rsrcmgr, retstr, codec=codec,
                               laparams=laparams) as device:
                with open(path, 'rb') as fp:
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    password = ""
                    maxpages = 0
                    caching = True
                    pagenos = set()

                    for page in PDFPage.get_pages(fp,
                                                  pagenos,
                                                  maxpages=maxpages,
                                                  password=password,
                                                  caching=caching,
                                                  check_extractable=True):
                        interpreter.process_page(page)

            return retstr.getvalue()


    if __name__ == "__main__":
        print(convert_pdf_to_txt('test.pdf'))
- PYMUPDF
Use PyMuPDF to extract the text. You can refer to this example code:
    import fitz  # PyMuPDF

    doc = fitz.open("file.pdf")
    for page in doc:
        text = page.get_text()  # named getText() in older PyMuPDF releases
        print(text)
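- OCR
If the PDF pages are scanned images rather than real text, there is no text layer to extract and you need OCR. In my opinion the best OCR engine is Tesseract, an open-source engine whose development was sponsored by Google. You can install pytesseract and run it on images of your PDF pages (it works on images, not on the PDF file directly), but I highly recommend combining it with OpenCV so that OCR runs only on the text regions.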
- SLATE
Use the slate library (pip install slate3k):
    import slate3k as slate

    with open('file.pdf', 'rb') as f:
        extracted_text = slate.PDF(f)
    print(extracted_text)
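Since you have about 100 files and do not know in advance which library will cope with them, it may help to try the options above in order and keep the first result that is not just whitespace (which is what your output looks like). Below is a minimal sketch of that idea, not a definitive implementation. The names looks_blank and extract_with_fallback, and the min_chars threshold of 20, are my own assumptions for illustration; an "extractor" here is simply any function that takes a path and returns a string.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor is any function that takes a file path and returns its text.
Extractor = Callable[[str], str]


def looks_blank(text: str, min_chars: int = 20) -> bool:
    """Return True if text has fewer than min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; tune it for your documents.
    """
    visible = sum(1 for ch in text if not ch.isspace())
    return visible < min_chars


def extract_with_fallback(
    path: str,
    extractors: Iterable[Tuple[str, Extractor]],
) -> Tuple[Optional[str], str]:
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first extractor whose output is not blank,
    or (None, "") if every extractor fails or returns only whitespace.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Deliberately broad for a sketch: any failure means "try the next one".
            continue
        if not looks_blank(text):
            return name, text
    return None, ""


# Example wiring (hypothetical file name; assumes convert_pdf_to_txt from the
# pdfminer example above is defined in the same module):
#
#   name, text = extract_with_fallback(
#       "minutes.pdf",
#       [("pdfminer", convert_pdf_to_txt)],
#   )
#   if name is None:
#       print("No text layer found; this file probably needs OCR.")
```

If every text-based extractor comes back blank for a file, that is a reasonable hint that the pages are scanned images and that the OCR route is the one to try.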
Good luck!!!