PDF reading, returning empty rows

I have a function to read a PDF, as below:

JavaScript

It works fine on a normal PDF file (such as a book), and I can extract the text easily. But when I tried it at work on "meeting minutes", I got only empty lines, like below:

JavaScript

I am very sorry that I cannot share the original PDF; however, here is a picture that shows its structure:

I am not getting any errors, just empty lines. Is it because the files have a table format? Or am I missing something in this simple function?

This is part of an NLP project where I need to upload 100 of these files and look for patterns in them. Any help is really appreciated.

Answer

There are several ways to extract text from a PDF. Please try the following:

  1. PDFMINER

Use pdfminer to extract the text from the PDF. You can refer to the example code.

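A minimal sketch of the pdfminer approach, assuming the `pdfminer.six` package (`pip install pdfminer.six`) and its `pdfminer.high_level.extract_text` function. The helper names `extract_text_pdfminer` and `has_text_layer` are made up for illustration; the second one makes the "only empty lines" case explicit.

```python
def extract_text_pdfminer(pdf_path):
    """Return all text pdfminer.six can find in the file at pdf_path."""
    # Imported inside the function so the helper below can be used
    # even on a machine where pdfminer.six is not installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def has_text_layer(text):
    """Return True if the extracted string holds anything besides whitespace.

    pdfminer separates pages with form feeds ("\x0c"); str.strip() treats
    those as whitespace, so a result made only of them counts as empty.
    """
    return bool(text.strip())
```

Calling `has_text_layer(extract_text_pdfminer("minutes.pdf"))` (with a hypothetical file name) and getting `False` would suggest the PDF has no embedded text at all, which is typical of scanned documents.
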
  2. PYMUPDF

Use pymupdf to extract the text from the PDF. You can refer to the example code.

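A minimal sketch with PyMuPDF (`pip install pymupdf`, imported as `fitz`), assuming a reasonably recent version where `page.get_text()` and `page.get_images()` are available. Besides extracting text, it labels each page so you can see whether the empty output comes from pages that contain only images. `classify_page` and `survey_pdf` are hypothetical names used for illustration.

```python
def classify_page(text, image_count):
    """Label a page by what was found on it."""
    if text.strip():
        return "text"      # an embedded text layer exists
    if image_count > 0:
        return "scanned"   # no text, but images: probably a scan
    return "empty"         # neither text nor images


def survey_pdf(pdf_path):
    """Return one label per page of the PDF at pdf_path."""
    # Imported inside the function so classify_page can be used
    # even where PyMuPDF is not installed.
    import fitz  # PyMuPDF

    labels = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()               # plain text of the page
            images = page.get_images(full=True)  # images referenced by the page
            labels.append(classify_page(text, len(images)))
    return labels
```

If most pages come back as `"scanned"`, plain text extraction will not help and OCR is the more likely route.
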
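  3. OCR

If the pages are scanned images rather than embedded text, you need OCR to extract the text. In my opinion the best OCR engine is Tesseract, which was originally developed at HP and later sponsored by Google. You can install pytesseract and run it on the pages of your PDF, but I highly recommend combining it with OpenCV so that OCR runs only on the text regions.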

https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052
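As a rough sketch of the OCR route (not the linked article's code): render each page to an image, clean it up with OpenCV, and pass it to Tesseract. This assumes `pymupdf`, `opencv-python`, `numpy`, and `pytesseract` are installed, and that the Tesseract binary itself is available on the system. Using PyMuPDF for rendering is a choice made here for convenience; `zoom_for_dpi` and `ocr_pdf` are hypothetical names.

```python
def zoom_for_dpi(dpi):
    """PDF pages are measured in points (72 per inch), so this is the
    scale factor needed to render a page at the requested DPI."""
    return dpi / 72.0


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Return a list with the OCR text of each page."""
    # Imported inside the function so zoom_for_dpi can be used
    # even where these third-party packages are not installed.
    import cv2
    import fitz  # PyMuPDF
    import numpy as np
    import pytesseract

    zoom = zoom_for_dpi(dpi)
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page as an RGB bitmap without an alpha channel.
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
            img = np.frombuffer(pix.samples, dtype=np.uint8)
            img = img.reshape(pix.height, pix.width, pix.n)
            # Grayscale + Otsu threshold gives Tesseract a clean binary image.
            gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
            _, binary = cv2.threshold(
                gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
            )
            pages.append(pytesseract.image_to_string(binary, lang=lang))
    return pages
```

For table-heavy pages such as meeting minutes, detecting the cell regions with OpenCV first and running OCR on each cell may give cleaner results than processing the whole page at once.
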

  4. SLATE

Use the slate library (pip install slate3k).

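A minimal sketch with slate3k, assuming its documented usage where `slate.PDF(file_object)` yields a list-like object holding one string per page. `read_with_slate` and `non_empty_pages` are hypothetical helper names.

```python
def read_with_slate(pdf_path):
    """Return a list of page strings extracted by slate3k."""
    # Imported inside the function so non_empty_pages can be used
    # even where slate3k is not installed.
    import slate3k as slate

    with open(pdf_path, "rb") as f:
        doc = slate.PDF(f)  # list-like: one string per page
    return list(doc)


def non_empty_pages(pages):
    """Return (page_number, text) pairs for pages that contain text."""
    return [
        (number, text.strip())
        for number, text in enumerate(pages, start=1)
        if text.strip()
    ]
```

slate3k is built on top of pdfminer, so if pdfminer returns only empty lines for a file, this will most likely do the same.
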

Good luck!!!
