How to solve the Tesseract “Failed loading language ‘eng’” problem in a Docker image

I recently received the following error:

I have a Python file where I specify the pytesseract location: pytesseract.pytesseract.tesseract_cmd = r"to/my/path"

I also included tesseract and pytesseract in my requirements, and I install tesseract-ocr in the Dockerfile.

I do not understand why this happens. Can anyone help?

I also copied my tesseract-ocr folder into the image in the Dockerfile:

Edited:

Below are my requirements:

Below is my Dockerfile:

Answer

You have two problems here…

The primary problem is a strange one. The apt-get package tesseract-ocr-eng is installed as a transitive dependency of one of the other packages you install with apt-get:

and that package installs an English trained data file in the right place:

and yet you get this error message suggesting that Tesseract can’t find this file.

I did some searching, and after trying several approaches that got Tesseract working, I arrived at this as the most concise solution to your problem. Add this line near the end of your Dockerfile, just before the final CMD line that sets the command the container runs:

This command replaces the previously installed eng.traineddata file with another one that I found on the internet. It is much larger than the previously installed file:

By replacing the previously installed eng.traineddata file with this new version, your code starts to run fine. I didn’t have your image data, obviously, so I had to change your code a bit to use my own image for testing. When I supplied an image with some text in it, I got back the text as the result of calling pytesseract.image_to_string. So this one fix should be all you need.
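If you want to check which eng.traineddata files actually exist in the image and how large each one is, a small script along these lines can help. This is a minimal sketch, not something from the original setup: the function names are made up for illustration, and the directories it searches are assumptions (the TESSDATA_PREFIX environment variable, which Tesseract consults, plus a few locations commonly used on Debian-based images).

```python
import os
from pathlib import Path


def find_traineddata(lang, search_dirs):
    """Return (path, size_in_bytes) for every <lang>.traineddata found under search_dirs."""
    matches = []
    for base in search_dirs:
        base = Path(base)
        if not base.is_dir():
            # Skip directories that do not exist in this image.
            continue
        for path in sorted(base.rglob(f"{lang}.traineddata")):
            matches.append((str(path), path.stat().st_size))
    return matches


def default_search_dirs():
    """Assumed search locations; adjust them for your own base image."""
    dirs = []
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        dirs.append(prefix)
    dirs += [
        "/usr/share/tesseract-ocr",
        "/usr/share/tessdata",
        "/usr/local/share/tessdata",
    ]
    return dirs


if __name__ == "__main__":
    for path, size in find_traineddata("eng", default_search_dirs()):
        print(f"{size:>12} bytes  {path}")
```

Running it inside the container before and after replacing the file should show whether the file is where you expect it and whether its size changed.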

There is a second problem here. Your pytesseract.image_to_string call is being garbled somehow by the fact that you’re breaking it across multiple lines. To fix just this one issue, you can edit the call so that the string constant is all on one line:

When I made just this change, the “Can’t open …” part of the error message went away. With only that fix applied, you’re left with this error message:

It’s interesting that if you apply just the first fix, both problems go away, as you don’t get an error message at all. I don’t know what’s up with that.
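As a general illustration of how splitting a string across lines can change its value in Python, here is a small sketch. The config value is hypothetical (the path is a placeholder, and this is not the string from the original call); it only shows that a triple-quoted string keeps the line break and indentation, while adjacent literals inside parentheses do not. Whether this is what garbled the call in your case is an assumption.

```python
# Hypothetical tessdata location, used only for illustration.
TESSDATA_DIR = "/usr/share/tesseract-ocr/4.00/tessdata"

# A triple-quoted string broken across lines: the newline and the
# indentation of the second line become part of the value.
broken_config = f"""--tessdata-dir
    {TESSDATA_DIR}"""

# Adjacent string literals inside parentheses are joined into one
# string with no extra whitespace, so the code can still wrap.
clean_config = (
    "--tessdata-dir "
    f"{TESSDATA_DIR}"
)

print(repr(broken_config))
print(repr(clean_config))
```

Printing the repr of a string before passing it to pytesseract.image_to_string is a quick way to spot stray newlines or spaces.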

I believe that I’ve given you all that you need to know. If you have additional problems, let us know. If you want me to share my versions of your Dockerfile and main.py files, I can do that.

Happy Tesseracting!

PS: I would recommend that you move the installation steps in your Dockerfile, the calls to apt-get and pip, to the top of the file. Docker caches each layer and reuses it as long as that instruction and everything before it is unchanged, so if the slow installation steps come first, editing the application-specific parts later in the file does not force them to run again. This is an important practice to understand when building Docker images, and it will save you a lot of time otherwise spent watching long builds. I did this right away when working on your problem, and I could rebuild and run each new version of the image in a few seconds rather than more than a minute.

User contributions licensed under: CC BY-SA