
How can I train my Python based OCR with Tesseract to train with different National Identity Cards?

I am working with Python to build an OCR system that reads ID cards and returns the exact details from the image, but it is not giving me accurate answers: Tesseract misreads many characters. How can I train Tesseract so that it reads the ID card correctly and returns the right details? Furthermore, how can I produce the .tiff file needed to make Tesseract work for my project?


Answer

Steps to improve Pytesseract recognition:

  1. Clean your image arrays so they contain only text (font-generated, not handwritten). The edges of the letters should be free of distortion. Apply a threshold (try different values) and some smoothing filters. I also recommend morphological opening/closing, but that's only a bonus. This is an exaggerated example of what should enter pytesseract recognition in the form of an array: https://i.ytimg.com/vi/1ns8tGgdpLY/maxresdefault.jpg

  2. Resize the image containing the text you want to recognize to a higher resolution.

  3. Pytesseract should generally recognize letters of any kind, but installing the font in which the text is written greatly increases accuracy.

How to install new fonts into pytesseract:

  1. Get your desired font in TIFF format

  2. Upload it to http://trainyourtesseract.com/ and receive the trained data by email (EDIT: this site doesn't exist anymore. At this moment you have to find an alternative or train the font yourself)

  3. add the trained data file (*.traineddata) to this folder: C:\Program Files (x86)\Tesseract-OCR\tessdata

  4. add this string argument to the pytesseract recognition function:

  • let's say you have 2 trained fonts: font1.traineddata and font2.traineddata

  • To use both, use this command:

    txt = pytesseract.image_to_string(img, lang='font1+font2')
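Since the `lang` parameter is just a `+`-joined string of traineddata names, you can also build it from a list (the font names here are the hypothetical ones from the example above):

```python
# Hypothetical trained fonts whose .traineddata files sit in the tessdata folder
fonts = ["font1", "font2"]
lang = "+".join(fonts)
print(lang)  # font1+font2

# Would then be used as (requires a tesseract install and an image `img`):
# txt = pytesseract.image_to_string(img, lang=lang)
```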

Here is code to test your recognition on web images:

import os
import urllib.request

import cv2
import numpy as np
import pytesseract

# Point pytesseract at the tesseract executable and its language data
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
os.environ['TESSDATA_PREFIX'] = 'C:/Program Files (x86)/Tesseract-OCR'

def url_to_image(url):
    resp = urllib.request.urlopen(url)
    image = np.asarray(bytearray(resp.read()), dtype="uint8")
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

url='http://jeroen.github.io/images/testocr.png'


img = url_to_image(url)


#img = cv2.GaussianBlur(img, (5, 5), 0)
img = cv2.medianBlur(img, 5)  # smooth out noise before thresholding
retval, img = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)  # binarize
txt = pytesseract.image_to_string(img, lang='eng')
print('recognition:', txt)
>>> txt
'This ts a lot of 12 point text to test the\nocr code and see if it works on all types\nof file format\n\nThe quick brown dog jumped over the\nlazy fox The quick brown dog jumped\nover the lazy fox The quick brown dog\njumped over the lazy fox The quick\nbrown dog jumped over the lazy fox'
User contributions licensed under: CC BY-SA