I’m trying to extract the number from the image strip given below.
I have no problem extracting digits from normal text, but the digit in the strip above seems to be a picture within a picture. This is the code I’m using to extract the digit:
import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open(r"C:\Users\UserName\PycharmProjects\COLLEGE PROJ\65.png")
text = pytesseract.image_to_string(img, config='--psm 6')
file = open("c.txt", 'w')
file.write(text)
file.close()
print(text)
I’ve tried every psm value from 1 to 13, and they all return just the word "week". The code works if I crop out just the digit, but my project requires me to extract it from a strip like this one. Could someone please help me? I’ve been stuck on this aspect of my project for some time now.
I’ve attached the complete image in case it would help anyone understand the problem better.
I can extract the digits in the text to the right, but I am not able to extract the digit from the leftmost WEEK strip!
Answer
First you need to apply adaptive-thresholding with a bitwise-not operation to the image.
After adaptive-thresholding: (image)
After bitwise-not: (image)
To know more about those operations, you can look at Morphological Transformations, Arithmetic Operations and Image Thresholding.
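To make the two operations concrete, here is a minimal NumPy sketch of what mean adaptive-thresholding followed by bitwise-not does to a tiny synthetic image. It is a simplified illustration, not a reimplementation of OpenCV's exact arithmetic: the helper names (local_mean, adaptive_threshold_inv) and the 15x15 test image are assumptions made for this example, and sliding_window_view needs NumPy 1.20 or newer.

```python
import numpy as np

def local_mean(gray, block_size):
    """Mean of the block_size x block_size neighbourhood of each pixel (edges replicated)."""
    pad = block_size // 2
    padded = np.pad(gray.astype(np.float64), pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (block_size, block_size))
    return windows.mean(axis=(2, 3))

def adaptive_threshold_inv(gray, block_size=11, c=2, max_value=255):
    """Inverted binary threshold: a pixel becomes max_value if it is <= (local mean - c)."""
    threshold = local_mean(gray, block_size) - c
    return np.where(gray <= threshold, max_value, 0).astype(np.uint8)

# Hypothetical input: light background (200) with a small dark "glyph" (50).
gray = np.full((15, 15), 200, dtype=np.uint8)
gray[6:9, 6:9] = 50

thr = adaptive_threshold_inv(gray)   # glyph -> 255, background -> 0
bnt = np.bitwise_not(thr)            # glyph -> 0,   background -> 255

print(thr[7, 7], thr[0, 0])
print(bnt[7, 7], bnt[0, 0])
```

After both steps the glyph is black on a white background. Note that an inverted threshold followed by bitwise-not yields the same image as a non-inverted binary threshold, so the two steps could likely be collapsed into one.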
Now we need to read column by column.
Therefore, to set column-by-column reading we need page-segmentation-mode 4:
“4: Assume a single column of text of variable sizes.” source
Now when we read:
txt = pytesseract.image_to_string(bnt, config="--psm 4")
Result:
WEEK © +4 hours te complete 5 Software in the fifth week af this course, we'll learn about tcomputer software. We'll learn about what software actually is and the . . .
We have a lot of information, but we want only the values 5 and 6.
The logic is: if the WEEK string is present in the current line, get the next non-empty line and print it:
txt = txt.strip().split("\n")
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True
Result:
5 Software : 6 Troubleshooting
Now to get only the integers, we can use a regular expression:
t = re.sub("[^0-9]", "", t)
print(t)
Result:
5 6
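The line-scanning and digit-stripping steps can also be checked without running OCR at all. The sketch below wraps the same logic in a function and feeds it a hand-written string; the function name week_numbers and the sample text are assumptions made for illustration, not actual Tesseract output.

```python
import re

def week_numbers(ocr_text):
    """Return the integers found on the first non-empty line after each 'WEEK' line."""
    numbers = []
    take_next = False
    for line in ocr_text.strip().split("\n"):
        if line and take_next:
            digits = re.sub("[^0-9]", "", line)  # keep digits only
            if digits:
                numbers.append(int(digits))
            take_next = False
        if "WEEK" in line:
            take_next = True
    return numbers

# Hypothetical OCR-like text, including a blank line after the first marker.
sample = (
    "WEEK\n"
    "\n"
    "5 Software\n"
    "In the fifth week of this course ...\n"
    "WEEK\n"
    "6 Troubleshooting\n"
)

print(week_numbers(sample))  # [5, 6]
```

Returning a list instead of printing makes it easier to reuse the values later in a project. The match on "WEEK" is case-sensitive, as in the answer, so the lowercase "week" in the description text is ignored.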
Code:
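import re
import cv2
import pytesseract

# Load the image and convert it to grayscale
img = cv2.imread("BWSFU.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive threshold (inverted), then invert back with bitwise-not
thr = cv2.adaptiveThreshold(gry, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                            cv2.THRESH_BINARY_INV, 11, 2)
bnt = cv2.bitwise_not(thr)

# Read the text as a single column (psm 4) and split it into lines
txt = pytesseract.image_to_string(bnt, config="--psm 4")
txt = txt.strip().split("\n")

# Print the digits from the first non-empty line after each "WEEK" line
get_nxt_ln = False
for t in txt:
    if t and get_nxt_ln:
        t = re.sub("[^0-9]", "", t)
        print(t)
        get_nxt_ln = False
    if "WEEK" in t:
        get_nxt_ln = True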