How to improve Hindi text extraction?

Question

I am trying to extract Hindi text from a PDF. I tried all the methods to exract from the PDF, but none of them worked. There are explanations why it doesn&#8217;t work, but no answers as such. So, I decided to convert the PDF to an image, and then use pytesseract to extract texts. I have downloaded the Hindi …

Accepted Answer

I will give some ideas how to process your image, but I will limit that to page 3 of the given document, i.e. the page shown in the question.For converting the PDF page to some image, I used pdf2image.For the OCR, I use pytesseract, but instead of lang='hin', I use lang='Devanagari', cf. the Tesseract GitHub. In general, make sure to work through Improving the quality of the output from the Tesseract documentation, especially the page segmentation method.Here&#8217;s a (lengthy) description of the whole procedure:Inverse binarize the image for contour finding: white texts, shapes, etc. on black background.Find all contours, and filter out the two very large contours, i.e. these are the two tables.Extract texts outside of the two tables:Mask out tables in the binarized image.Do morphological closing to connect remaining lines of text.Find contours, and bounding rectangles of these lines of text.Run pytesseract to extract the texts.Extract texts inside of the two tables:Extract the cells, better: their bounding rectangles, from the current table.For the first table:Run pytesseract to extract the texts as-is.For the second table:Floodfill the rectangle around the number to prevent faulty OCR output.Mask the left (Hindi) and right (English) part.Run pytesseract using lang='Devaganari' on the left, and using lang='eng' on the right part to improve OCR quality for both.That&#8217;d be the whole code:import cv2import numpy as npimport pdf2imageimport pytesseract# Extract page 3 from PDF in proper qualitypage_3 = np.array(pdf2image.convert_from_path('BADI KA BANS-Ward No-002.pdf',                                              first_page=3, last_page=3,                                              dpi=300, grayscale=True)[0])# Inverse binarize for contour findingthr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]# Find contours w.r.t. the OpenCV versioncnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)cnts = cnts[0] if len(cnts) == 2 else cnts[1]# STEP 1: Extract texts outside of the two tables# Mask out the two tablescnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)# Find bounding rectangles of texts outside of the two tablesno_tables = cv2.morphologyEx(no_tables, cv2.MORPH_CLOSE, np.full((21, 51), 255))cnts = cv2.findContours(no_tables, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)cnts = cnts[0] if len(cnts) == 2 else cnts[1]rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: r[1])# Extract texts from each bounding rectangleprint('nExtract texts outside of the two tablesn')for (x, y, w, h) in rects:    text = pytesseract.image_to_string(page_3[y:y+h, x:x+w],                                       config='--psm 6', lang='Devanagari')    text = text.replace('n', '').replace('f', '')    print('x: {}, y: {}, text: {}'.format(x, y, text))# STEP 2: Extract texts from inside of the two tablesrects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables],               key=lambda r: r[1])# Iterate each tablefor i_r, (x, y, w, h) in enumerate(rects, start=1):    # Find bounding rectangles of cells inside of the current table    cnts = cv2.findContours(page_3[y+2:y+h-2, x+2:x+w-2],                            cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)    cnts = cnts[0] if len(cnts) == 2 else cnts[1]    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts],                         key=lambda r: (r[1], r[0]))    # Extract texts from each cell of the current table    print('nExtract texts inside table {}n'.format(i_r))    for (xx, yy, ww, hh) in inner_rects:        # Set current coordinates w.r.t. full image        xx += x        yy += y        # Get current cell        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]        # For table 1, simply extract texts as-is        if i_r == 1:            text = pytesseract.image_to_string(cell, config='--psm 6',                                               lang='Devanagari')            text = text.replace('n', '').replace('f', '')            print('x: {}, y: {}, text: {}'.format(xx, yy, text))        # For table 2, extract single elements        if i_r == 2:            # Floodfill rectangles around numbers            ys, xs = np.min(np.argwhere(cell == 0), axis=0)            temp = cv2.floodFill(cell.copy(), None, (xs, ys), 255)[1]            mask = cv2.floodFill(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(),                                 None, (xs, ys), 0)[1]            # Extract left (Hindi) and right (English) parts            mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,                                    np.full((2 * hh, 5), 255))            cnts = cv2.findContours(mask,                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)            cnts = cnts[0] if len(cnts) == 2 else cnts[1]            boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts],                           key=lambda b: b[0])            # Extract texts from each part of the current cell            for i_b, (x_b, y_b, w_b, h_b) in enumerate(boxes, start=1):                # For the left (Hindi) part, extract Hindi texts                if i_b == 1:                    text = pytesseract.image_to_string(                        temp[y_b:y_b+h_b, x_b:x_b+w_b],                        config='--psm 6',                        lang='Devanagari')                    text = text.replace('f', '')                # For the left (English) part, extract English texts                if i_b == 2:                    text = pytesseract.image_to_string(                        temp[y_b:y_b+h_b, x_b:x_b+w_b],                        config='--psm 6',                        lang='eng')                    text = text.replace('f', '')                print('x: {}, y: {}, text:n{}'.format(xx, yy, text))And, here are the first few lines of the output:Extract texts outside of the two tablesx: 972, y: 93, text: राज्य निर्वाचन आयोग, राजस्थानx: 971, y: 181, text: पंचायत चुनाव निर्वाचक नामावली, 2021x: 166, y: 610, text: मिश्र का बाढ़ ,श्रीराम की नॉगलx: 151, y: 3417, text: आयु 1 जनवरी 2021 के अनुसारx: 778, y: 3419, text: पृष्ठ संख्या : 3 / 10Extract texts inside table 1x: 146, y: 240, text: जिलापरिषद का नाम : जयपुरx: 1223, y: 240, text: जि° प° सदस्य निर्वाचन क्षेत्र : 21x: 146, y: 327, text: पंचायत समिति का नाम : सांगानेरx: 1223, y: 327, text: पं° स° सदस्य निर्वाचन क्षेत्र : 6x: 146, y: 415, text: ग्रामपंचायत : बडी का बांसx: 1223, y: 415, text: वार्ड क्रमांक : 2x: 146, y: 502, text: विधानसभा क्षेत्र की संख्या एवं नाम:- 56-बगरुExtract texts inside table 2x: 142, y: 665, text:1 RBP2469583नाम: आरती चावलापिता का नामःलाला राम चावलामकान संख्याः १९आयुः 21 लिंगः स्त्रीx: 142, y: 665, text:Photo isAvailablex: 867, y: 665, text:2 MRQ3101367नामः सूरज देवीपिता का नामःरामावतारमकान संख्याः डी /18आयुः 44 लिंगः स्त्रीx: 867, y: 665, text:Photo isAvailableI checked a few texts using manual character-by-character comparison, and thought it looked quite good, but unable to understand Hindi or reading Devanagari script, I can&#8217;t comment on the overall quality of the OCR. Please let me know!Annoyingly, the number 9 from the corresponding &#8220;card&#8221; is falsely extracted as 2. I assume, that happens due to the different font compared to the rest of the text, and in combination with lang='Devanagari'. Couldn&#8217;t find a solution for that – without extracting the rectangle separately from the &#8220;card&#8221;.----------------------------------------System information----------------------------------------Platform:      Windows-10-10.0.19041-SP0Python:        3.9.1PyCharm:       2021.1.1NumPy:         1.19.5OpenCV:        4.5.2pdf2image      1.14.0pytesseract:   5.0.0-alpha.20201127----------------------------------------

Advertisement

Answer