Pytesseract not detecting a digit which might be a picture within a picture

I’m trying to extract the number from the image strip given below.

I have no problem in extracting digits from normal text, but the digit in the above strip seems to be a picture within a picture. This is the code I’m using to extract the digit.

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open(r"C:\Users\UserName\PycharmProjects\COLLEGE PROJ\65.png")
text = pytesseract.image_to_string(img, config='--psm 6')
with open("c.txt", 'w') as file:
    file.write(text)
print(text)

I’ve tried every psm value from 1 to 13, and they all output just the word "week". The code works if I crop out just the digit, but my project requires me to extract it from a strip like this one. Could someone please help me? I’ve been stuck on this part of my project for some time now.

I’ve attached the complete image in case it would help anyone understand the problem better.

I can extract the digits in the text to the right, but I am not able to extract the one in the leftmost week strip!

Answer

First, apply adaptive thresholding to the image, followed by a bitwise-not operation.

After adaptive thresholding: [image: thresholded strip]

After bitwise-not: [image: inverted strip]

To know more about those operations, you can look at Morphological Transformations, Arithmetic Operations and Image Thresholding.

Next, the text has to be read column by column. Page segmentation mode 4 does exactly that:

“4: Assume a single column of text of variable sizes.” (from Tesseract's list of page segmentation modes)

Now when we read:

txt = pytesseract.image_to_string(bnt, config="--psm 4")

Result:

WEEK © +4 hours te complete

5 Software

in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the
.
.
.

That is a lot of information, but we only want the values 5 and 6.

The logic is: if the string WEEK appears in the current line, take the next non-empty line and print it:

txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True

Result:

5 Software
: 6 Troubleshooting

Now, to keep only the digits, we can use a regular expression that removes every non-digit character:

t = re.sub("[^0-9]", "", t)
print(t)

Result:

5
6
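
The two steps above (find the line after WEEK, then strip the non-digits) can also be wrapped in one small function. This is only a sketch: the name extract_week_numbers is hypothetical, and it uses re.search to take the first run of digits rather than re.sub, which is a small variation so that a line containing several numbers does not get them glued together. The sample text is modeled on the OCR output shown earlier, with the second WEEK header line assumed; it is not a real Tesseract run.

```python
import re


def extract_week_numbers(ocr_text):
    """Return the first integer on the non-empty line following each WEEK line."""
    numbers = []
    expect_number = False
    for line in ocr_text.splitlines():
        line = line.strip()
        if expect_number and line:
            # Take the first run of digits on the line after a WEEK header.
            match = re.search(r"\d+", line)
            if match:
                numbers.append(int(match.group()))
            expect_number = False
        if "WEEK" in line:
            # Case-sensitive on purpose: "week" inside a sentence is ignored.
            expect_number = True
    return numbers


# Assumed sample text, shaped like the OCR result in this answer.
sample = (
    "WEEK © +4 hours te complete\n\n"
    "5 Software\n\n"
    "in the fifth week af this course\n\n"
    "WEEK\n\n"
    ": 6 Troubleshooting\n"
)
print(extract_week_numbers(sample))  # [5, 6]
```

Returning integers instead of printing makes the result easier to reuse, for example to write the week numbers to a file.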

Code:


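import re
import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("BWSFU.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive thresholding, then invert the result with bitwise-not
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 11, 2)
bnt = cv2.bitwise_not(thr)

# Read the page as a single column of text (psm 4)
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")

# Print the digits found on the first non-empty line after each WEEK line
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        t = re.sub("[^0-9]", "", t)
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True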


Source: stackoverflow