I am trying to extract Hindi text from a PDF. I tried all the methods to exract from the PDF, but none of them worked. There are explanations why it doesn’t work, but no answers as such. So, I decided to convert the PDF to an image, and then use pytesseract
to extract texts. I have downloaded the Hindi trained data, however that also gives highly inaccurate text.
That’s the actual Hindi text from the PDF (download link):
That’s my code so far:
import fitz filepath = "D:\BADI KA BANS-Ward No-002.pdf" doc = fitz.open(filepath) page = doc.loadPage(3) # number of page pix = page.getPixmap() output = "outfile.png" pix.writePNG(output) from PIL import Image import pytesseract # Include tesseract executable in your path pytesseract.pytesseract.tesseract_cmd = r"C:Program FilesTesseract-OCRtesseract.exe" # Create an image object of PIL library image = Image.open('outfile.png') # pass image into pytesseract module # pytesseract is trained in many languages image_to_text = pytesseract.image_to_string(image, lang='hin') # Print the text print(image_to_text)
That’s some output sample:
कार बिता देवी व ०... नाम बाइुनान िक०क नाक तो पति का नाव: रवजी लात. “50९... पिला का सामशामाव.... “पति का नाम: बादुलल कान सब: 43 लसमनंध्या: 93९. मकान ंब्या: 3९ आप: 29 _ लिंग सी. | आइ 57 लिंग पुरुष आप: 62 लिंग सी एजगल्णब्णस्य (बन्द जगाख्मिणण्य नमः बायगी बसों ०४... नि बयावर्णो ०५०... निफर सनक नी चिता का नामजबूजल वर्ष.“ ००० | पिला का नामब्राइलाल वर्षो... 0 2... | पिता कामामशुल चब्द .... “20० |सकानसंब्या: 43९ बसवकंब्या: 43९. कान संब्या: 44 जाए: 27 लिंग सो कई: 27 नि खी मा लिंग पुरुष
There is an answer to this question I want to scrape a Hindi(Indian Langage) pdf file with python, which seems to tell how to do this, but provides no explanation whatsoever.
Is there any way to do this other than to train the language model myself?
Advertisement
Answer
I will give some ideas how to process your image, but I will limit that to page 3 of the given document, i.e. the page shown in the question.
For converting the PDF page to some image, I used pdf2image
.
For the OCR, I use pytesseract
, but instead of lang='hin'
, I use lang='Devanagari'
, cf. the Tesseract GitHub. In general, make sure to work through Improving the quality of the output from the Tesseract documentation, especially the page segmentation method.
Here’s a (lengthy) description of the whole procedure:
- Inverse binarize the image for contour finding: white texts, shapes, etc. on black background.
- Find all contours, and filter out the two very large contours, i.e. these are the two tables.
- Extract texts outside of the two tables:
- Mask out tables in the binarized image.
- Do morphological closing to connect remaining lines of text.
- Find contours, and bounding rectangles of these lines of text.
- Run
pytesseract
to extract the texts.
- Extract texts inside of the two tables:
- Extract the cells, better: their bounding rectangles, from the current table.
- For the first table:
- Run
pytesseract
to extract the texts as-is.
- Run
- For the second table:
- Floodfill the rectangle around the number to prevent faulty OCR output.
- Mask the left (Hindi) and right (English) part.
- Run
pytesseract
usinglang='Devaganari'
on the left, and usinglang='eng'
on the right part to improve OCR quality for both.
That’d be the whole code:
import cv2 import numpy as np import pdf2image import pytesseract # Extract page 3 from PDF in proper quality page_3 = np.array(pdf2image.convert_from_path('BADI KA BANS-Ward No-002.pdf', first_page=3, last_page=3, dpi=300, grayscale=True)[0]) # Inverse binarize for contour finding thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1] # Find contours w.r.t. the OpenCV version cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) cnts = cnts[0] if len(cnts) == 2 else cnts[1] # STEP 1: Extract texts outside of the two tables # Mask out the two tables cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000] no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED) # Find bounding rectangles of texts outside of the two tables no_tables = cv2.morphologyEx(no_tables, cv2.MORPH_CLOSE, np.full((21, 51), 255)) cnts = cv2.findContours(no_tables, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) cnts = cnts[0] if len(cnts) == 2 else cnts[1] rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: r[1]) # Extract texts from each bounding rectangle print('nExtract texts outside of the two tablesn') for (x, y, w, h) in rects: text = pytesseract.image_to_string(page_3[y:y+h, x:x+w], config='--psm 6', lang='Devanagari') text = text.replace('n', '').replace('f', '') print('x: {}, y: {}, text: {}'.format(x, y, text)) # STEP 2: Extract texts from inside of the two tables rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables], key=lambda r: r[1]) # Iterate each table for i_r, (x, y, w, h) in enumerate(rects, start=1): # Find bounding rectangles of cells inside of the current table cnts = cv2.findContours(page_3[y+2:y+h-2, x+2:x+w-2], cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) cnts = cnts[0] if len(cnts) == 2 else cnts[1] inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: (r[1], r[0])) # Extract texts from each cell of the current table print('nExtract texts inside table {}n'.format(i_r)) for (xx, yy, ww, hh) in inner_rects: # Set current coordinates w.r.t. full image xx += x yy += y # Get current cell cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2] # For table 1, simply extract texts as-is if i_r == 1: text = pytesseract.image_to_string(cell, config='--psm 6', lang='Devanagari') text = text.replace('n', '').replace('f', '') print('x: {}, y: {}, text: {}'.format(xx, yy, text)) # For table 2, extract single elements if i_r == 2: # Floodfill rectangles around numbers ys, xs = np.min(np.argwhere(cell == 0), axis=0) temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1] mask = cv2.floodFill(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(), None, (xs, ys), 0)[1] # Extract left (Hindi) and right (English) parts mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.full((2 * hh, 5), 255)) cnts = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE) cnts = cnts[0] if len(cnts) == 2 else cnts[1] boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda b: b[0]) # Extract texts from each part of the current cell for i_b, (x_b, y_b, w_b, h_b) in enumerate(boxes, start=1): # For the left (Hindi) part, extract Hindi texts if i_b == 1: text = pytesseract.image_to_string( temp[y_b:y_b+h_b, x_b:x_b+w_b], config='--psm 6', lang='Devanagari') text = text.replace('f', '') # For the left (English) part, extract English texts if i_b == 2: text = pytesseract.image_to_string( temp[y_b:y_b+h_b, x_b:x_b+w_b], config='--psm 6', lang='eng') text = text.replace('f', '') print('x: {}, y: {}, text:n{}'.format(xx, yy, text))
And, here are the first few lines of the output:
Extract texts outside of the two tables x: 972, y: 93, text: राज्य निर्वाचन आयोग, राजस्थान x: 971, y: 181, text: पंचायत चुनाव निर्वाचक नामावली, 2021 x: 166, y: 610, text: मिश्र का बाढ़ ,श्रीराम की नॉगल x: 151, y: 3417, text: आयु 1 जनवरी 2021 के अनुसार x: 778, y: 3419, text: पृष्ठ संख्या : 3 / 10 Extract texts inside table 1 x: 146, y: 240, text: जिलापरिषद का नाम : जयपुर x: 1223, y: 240, text: जि° प° सदस्य निर्वाचन क्षेत्र : 21 x: 146, y: 327, text: पंचायत समिति का नाम : सांगानेर x: 1223, y: 327, text: पं° स° सदस्य निर्वाचन क्षेत्र : 6 x: 146, y: 415, text: ग्रामपंचायत : बडी का बांस x: 1223, y: 415, text: वार्ड क्रमांक : 2 x: 146, y: 502, text: विधानसभा क्षेत्र की संख्या एवं नाम:- 56-बगरु Extract texts inside table 2 x: 142, y: 665, text: 1 RBP2469583 नाम: आरती चावला पिता का नामःलाला राम चावला मकान संख्याः १९ आयुः 21 लिंगः स्त्री x: 142, y: 665, text: Photo is Available x: 867, y: 665, text: 2 MRQ3101367 नामः सूरज देवी पिता का नामःरामावतार मकान संख्याः डी /18 आयुः 44 लिंगः स्त्री x: 867, y: 665, text: Photo is Available
I checked a few texts using manual character-by-character comparison, and thought it looked quite good, but unable to understand Hindi or reading Devanagari script, I can’t comment on the overall quality of the OCR. Please let me know!
Annoyingly, the number 9
from the corresponding “card” is falsely extracted as 2
. I assume, that happens due to the different font compared to the rest of the text, and in combination with lang='Devanagari'
. Couldn’t find a solution for that – without extracting the rectangle separately from the “card”.
---------------------------------------- System information ---------------------------------------- Platform: Windows-10-10.0.19041-SP0 Python: 3.9.1 PyCharm: 2021.1.1 NumPy: 1.19.5 OpenCV: 4.5.2 pdf2image 1.14.0 pytesseract: 5.0.0-alpha.20201127 ----------------------------------------