How to improve Hindi text extraction?

I am trying to extract Hindi text from a PDF. I tried every method I could find to extract the text directly from the PDF, but none of them worked. There are explanations of why it doesn’t work, but no actual solutions. So, I decided to convert the PDF to an image and then use pytesseract to extract the text. I have downloaded the Hindi trained data, but that also gives highly inaccurate text.

That’s the actual Hindi text from the PDF (download link):

Actual Hindi

That’s my code so far:

JavaScript

That’s some output sample:

JavaScript

There is an answer to the question “I want to scrape a Hindi(Indian Langage) pdf file with python” that seems to show how to do this, but it provides no explanation whatsoever.

Is there any way to do this other than to train the language model myself?


Answer

I will give some ideas on how to process your image, but I will limit them to page 3 of the given document, i.e. the page shown in the question.

For converting the PDF page to an image, I use pdf2image.

For the OCR, I use pytesseract, but with lang='Devanagari' instead of lang='hin'; cf. the Tesseract GitHub. In general, make sure to work through “Improving the quality of the output” from the Tesseract documentation, especially the part on page segmentation methods.
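A minimal sketch of the conversion and OCR call described above. This is not the original code from this answer. The helper names `tesseract_config` and `ocr_pdf_page`, the path `document.pdf`, the 300 dpi resolution, and the choice of `--psm 6` are assumptions for illustration. It also assumes that poppler, the Tesseract binary, and the `Devanagari` traineddata file are installed.

```python
def tesseract_config(psm: int = 6) -> str:
    """Build the Tesseract option string for a page segmentation mode (0-13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be in 0..13, got {psm}")
    return f"--psm {psm}"


def ocr_pdf_page(pdf_path: str, page: int = 3, dpi: int = 300,
                 lang: str = "Devanagari", psm: int = 6) -> str:
    """Render one PDF page to an image and run Tesseract on it."""
    # Imported here so the config helper above works without these packages.
    from pdf2image import convert_from_path  # needs poppler installed
    import pytesseract                        # needs the tesseract binary

    # first_page/last_page are 1-based; the result is a list of PIL images.
    images = convert_from_path(pdf_path, dpi=dpi,
                               first_page=page, last_page=page)
    return pytesseract.image_to_string(images[0], lang=lang,
                                       config=tesseract_config(psm))


if __name__ == "__main__":
    # "document.pdf" is a placeholder path, not the file from the question.
    print(ocr_pdf_page("document.pdf"))
```

Trying a few different `psm` values on the same page is usually the quickest way to see how much the page segmentation mode affects the result.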

Here’s a (lengthy) description of the whole procedure:

  1. Inverse binarize the image for contour finding: white text, shapes, etc. on a black background.
  2. Find all contours, and pick out the two very large ones; these are the two tables.
  3. Extract texts outside of the two tables:
    1. Mask out tables in the binarized image.
    2. Do morphological closing to connect remaining lines of text.
    3. Find contours, and bounding rectangles of these lines of text.
    4. Run pytesseract to extract the texts.
  4. Extract texts inside of the two tables:
    1. Extract the cells, better: their bounding rectangles, from the current table.
    2. For the first table:
      1. Run pytesseract to extract the texts as-is.
    3. For the second table:
      1. Floodfill the rectangle around the number to prevent faulty OCR output.
      2. Mask the left (Hindi) and right (English) part.
      3. Run pytesseract using lang='Devanagari' on the left, and using lang='eng' on the right part to improve OCR quality for both.
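
Steps 1 to 3 of the procedure above can be sketched roughly as follows. This is an illustrative reconstruction, not the original code. The function names, the kernel size of 51x5, picking tables by bounding-box area, and `--psm 6` are all assumptions. It expects an 8-bit grayscale image as a NumPy array and the OpenCV 4.x return signature of `cv2.findContours`.

```python
def split_tables(rects, n_tables=2):
    """Split (x, y, w, h) rectangles into the n_tables largest and the rest."""
    order = sorted(range(len(rects)),
                   key=lambda i: rects[i][2] * rects[i][3], reverse=True)
    table_idx = set(order[:n_tables])
    tables = [rects[i] for i in order[:n_tables]]
    others = [r for i, r in enumerate(rects) if i not in table_idx]
    return tables, others


def find_text_lines(gray, n_tables=2, kernel_size=(51, 5)):
    """Return bounding rectangles of the text lines outside the tables."""
    import cv2  # imported here so split_tables works without OpenCV

    # Step 1: inverse binarization, white content on black background.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

    # Step 2: outer contours; the largest bounding boxes are taken as tables.
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    tables, _ = split_tables([cv2.boundingRect(c) for c in contours], n_tables)

    # Step 3.1: black out the table regions.
    masked = binary.copy()
    for x, y, w, h in tables:
        masked[y:y + h, x:x + w] = 0

    # Step 3.2: closing with a wide kernel merges the characters of a line.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, kernel_size)
    closed = cv2.morphologyEx(masked, cv2.MORPH_CLOSE, kernel)

    # Step 3.3: one contour per text line, sorted top to bottom.
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return sorted((cv2.boundingRect(c) for c in contours), key=lambda r: r[1])


def ocr_regions(gray, rects, lang="Devanagari"):
    """Step 3.4: run Tesseract on each rectangle of the grayscale image."""
    import pytesseract
    return [pytesseract.image_to_string(gray[y:y + h, x:x + w],
                                        lang=lang, config="--psm 6").strip()
            for x, y, w, h in rects]
```

The kernel is much wider than it is tall so that neighbouring characters in a line merge while separate lines stay apart. The right size depends on the rendering resolution and has to be tuned per document.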

That’d be the whole code:

JavaScript

And, here are the first few lines of the output:

JavaScript

I checked a few texts using manual character-by-character comparison, and thought they looked quite good, but since I can neither understand Hindi nor read Devanagari script, I can’t comment on the overall quality of the OCR. Please let me know!

Annoyingly, the number 9 from the corresponding “card” is falsely extracted as 2. I assume that happens because its font differs from the rest of the text, in combination with lang='Devanagari'. I couldn’t find a solution for that, short of extracting the rectangle separately from the “card”.
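
For reference, extracting the number rectangle separately could look roughly like the sketch below. This was not part of the tested solution, and whether it fixes the 9 versus 2 confusion on this document is untested. The helper names, the 3 pixel margin, `--psm 7`, and the digit whitelist are assumptions. The `tessedit_char_whitelist` setting is also not honoured by the LSTM engine in Tesseract 4.0.

```python
def shrink_rect(rect, margin):
    """Shrink (x, y, w, h) by margin pixels per side to drop the box border."""
    x, y, w, h = rect
    if w <= 2 * margin or h <= 2 * margin:
        raise ValueError("margin too large for rectangle")
    return x + margin, y + margin, w - 2 * margin, h - 2 * margin


def ocr_number(image, rect, margin=3):
    """OCR only the number rectangle, restricted to digits."""
    import pytesseract  # imported here so shrink_rect works without it

    x, y, w, h = shrink_rect(rect, margin)
    crop = image[y:y + h, x:x + w]  # image is a NumPy array (rows, cols)
    # --psm 7 treats the crop as a single text line; the whitelist limits
    # the output to digits.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    return pytesseract.image_to_string(crop, lang="eng", config=config).strip()
```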
