Pytesseract not detecting a digit which might be a picture within a picture

Question

I'm trying to extract the number from the image given below. I have no problem extracting digits from normal text, but the digit in the above strip seems to be a picture within a picture. This is the code I'm using to extract the digit. I've tried all possible psm values from 1 to 13, and …

Accepted Answer

First, apply adaptive thresholding followed by a bitwise-not operation to the image.

[Image: after adaptive thresholding]

[Image: after bitwise-not]

To learn more about these operations, see Morphological Transformations, Arithmetic Operations and Image Thresholding.

Next, we need to read the image column by column. For column-by-column reading we need page segmentation mode 4:

“4: Assume a single column of text of variable sizes.” (source)

Now when we read:

    txt = pytesseract.image_to_string(bnt, config="--psm 4")

Result:

    WEEK © +4 hours te complete
    5 Software
    in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the...

That is a lot of text, and we only want the values 5 and 6. The logic is: if the string WEEK appears in the current line, take the next non-empty line and print it:

    txt = txt.strip().split("\n")
    get_nxt_ln = False
    for t in txt:
        if t and get_nxt_ln:
            print(t)
            get_nxt_ln = False
        if "WEEK" in t:
            get_nxt_ln = True

Result:

    5 Software
    : 6 Troubleshooting

Now, to keep only the integers, we can use a regular expression:

    t = re.sub("[^0-9]", "", t)
    print(t)

Result:

    5
    6

Code:

    import re
    import cv2
    import pytesseract

    # Load the image and convert it to grayscale
    img = cv2.imread("BWSFU.jpg")
    gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Adaptive thresholding, then invert the result with bitwise-not
    thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                cv2.THRESH_BINARY_INV, 11, 2)
    bnt = cv2.bitwise_not(thr)

    # psm 4: assume a single column of text of variable sizes
    txt = pytesseract.image_to_string(bnt, config="--psm 4")
    txt = txt.strip().split("\n")

    # Print the digits from the first non-empty line after each "WEEK" line
    get_nxt_ln = False
    for t in txt:
        if t and get_nxt_ln:
            t = re.sub("[^0-9]", "", t)
            print(t)
            get_nxt_ln = False
        if "WEEK" in t:
            get_nxt_ln = True
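The parsing step can be tried on its own, without OpenCV or Tesseract installed. The following is a minimal sketch, not part of the original answer: the function name `digits_after_marker` and the `sample` string are hypothetical, and the sample only imitates the kind of text the OCR step returned above.

```python
import re


def digits_after_marker(ocr_text, marker="WEEK"):
    """Collect the digits on the first non-empty line after each marker line.

    Hypothetical helper for illustration; it mirrors the loop in the answer
    but returns the values instead of printing them.
    """
    results = []
    take_next = False
    for line in ocr_text.splitlines():
        if line.strip() and take_next:
            # Drop every character that is not a digit
            digits = re.sub(r"[^0-9]", "", line)
            if digits:
                results.append(digits)
            take_next = False
        if marker in line:
            take_next = True
    return results


# Made-up sample shaped like the OCR output shown in the answer
sample = (
    "WEEK © +4 hours te complete\n"
    "5 Software\n"
    "\n"
    "in the fifth week af this course...\n"
    "WEEK\n"
    ": 6 Troubleshooting\n"
)
print(digits_after_marker(sample))
```

Returning a list rather than printing makes it easier to check the result when OCR noise changes between runs. Note that the match on `WEEK` is case-sensitive, so the lowercase "week" in the body text does not trigger it.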


Answer