How to solve Tesseract "Failed loading language 'eng'" problem in a Docker image

Question

I recently received an error such as: I have a Python file where I specify the pytesseract location: pytesseract.pytesseract.tesseract_cmd = r"to/my/path". I also included tesseract and pytesseract in the requirements and install tesseract-ocr in the Dockerfile. I do not understand why it happens …

Accepted Answer

You have two problems here.

The primary problem is a strange one. The apt-get package tesseract-ocr-eng is installed as a transitive dependency of one of the other packages you install with apt-get:

    # apt-get install tesseract-ocr-eng
    ...
    tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1).

and that package installs an English trained data file in the right place:

    # ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
    -rw-r--r-- 1 root root 4113088 Sep 15  2017 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata

and yet you get this error message suggesting that Tesseract can't find this file.

I did some Googling, and after trying a number of different things that allowed Tesseract to work, I came to this most concise solution to your problem. Just add this line near the end of your Dockerfile, just before the last CMD line that sets the Docker command to be executed:

    RUN wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata -O /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata 2> /dev/null

This command replaces the previously installed eng.traineddata file with the one from the tesseract-ocr/tessdata repository on GitHub. It is much larger than the previously installed file:

    # ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
    -rw-r--r-- 1 root root 23466654 Feb 14 20:17 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata

Once the previously installed eng.traineddata file is replaced with this new version, your code starts to run fine. I didn't have your image data, obviously, so I had to change your code a bit to use my own image for testing. When I supplied an image with some text in it, I got back the text as the result of calling pytesseract.image_to_string. So this one fix should be all you need.

There is a second problem here. Your pytesseract.image_to_string call is being garbled somehow by the fact that you're breaking it across multiple lines.
To fix just this one issue, you can edit the call so that the string constant is all on one line:

    infor = pytesseract.image_to_string(im,
                                        lang="eng",
                                        config='--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=][|~_}{=!#%&«§><:;—?¢°*@,')

When I made just this change, the part of the error message you're getting about "Can't open …" goes away. If you fix just that, you're left with the error message:

    pytesseract.pytesseract.TesseractError: (1, "Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")

It's interesting that if you apply just the first fix, both problems go away, as you don't get an error message at all. I don't know what's up with that.

I believe that I've given you all that you need to know. If you have additional problems, let us know. If you want me to share my versions of your Dockerfile and main.py files, I can do that.

Happy Tesseracting!

PS: I would recommend that you move the installation calls in your Dockerfile, the calls to apt-get and pip, to the top of the file. This way, you can modify the parts of your Dockerfile specific to your application later on in the file, and your image build will happen quickly, because Docker reuses the cached installation layers rather than running all of the long installation steps again. This is an important practice to understand when building Docker images. It will save you a ton of time watching long Docker image builds over and over again. I did this right away when working on your problem, and I could rebuild and run the next version of the Docker image in just a few seconds rather than it taking more than a minute to rebuild and run each new image.
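
To see which eng.traineddata file a container actually has, a small check like the following can be run inside the image. This is only a sketch, not part of the original answer: the function name find_traineddata is hypothetical, and the default directory is simply the path quoted in the answer, which may differ on other Tesseract versions or base images.

```python
from pathlib import Path

# Assumption: this is the tessdata directory quoted in the answer above.
# Other Tesseract versions or base images may use a different location.
DEFAULT_TESSDATA_DIRS = ("/usr/share/tesseract-ocr/4.00/tessdata",)


def find_traineddata(lang="eng", tessdata_dirs=DEFAULT_TESSDATA_DIRS):
    """Return a list of (path, size_in_bytes) for each <lang>.traineddata found.

    Hypothetical helper for illustration; it only inspects the file system
    and does not call Tesseract itself.
    """
    found = []
    for directory in tessdata_dirs:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            found.append((str(candidate), candidate.stat().st_size))
    return found


if __name__ == "__main__":
    matches = find_traineddata("eng")
    if not matches:
        print("no eng.traineddata found in the checked directories")
    for path, size in matches:
        print(f"{path}: {size} bytes")
```

Comparing the reported size with the two ls listings above (4113088 bytes for the packaged file, 23466654 bytes for the downloaded one) shows whether the wget step in the Dockerfile took effect.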

Answer