How to solve Tesseract “Failed loading language ‘eng’” problem in a Docker image



I recently got the following error:

File "/usr/local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 287, in run_and_get_output
    run_tesseract(**kwargs)
File "/usr/local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 263, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open ][|~_}{=!#%&«§><:;—?¢°*@, Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")

I have a Python file where I specify the Tesseract location: pytesseract.pytesseract.tesseract_cmd = r"to/my/path"

I also included tesseract and pytesseract in my requirements, and I install tesseract-ocr in the Dockerfile.

I do not understand why this happens. Can anyone help?

I also copied my tesseract-ocr folder into the image in the Dockerfile:

COPY tesseract-ocr .

Edited:

Below are my requirements:

opencv-python==4.5.1.48
openpyxl==3.0.6
packaging==20.8
pandas==1.2.1
pathlib==1.0.1
patsy==0.5.1
pdfminer.six==20200517
pdfplumber==0.5.25
Pillow==8.1.0
prov==2.0.0
pycryptodome==3.9.9
pydot==1.4.1
PyMuPDF==1.16.14
pyparsing==2.4.7
PyPDF2==1.26.0
pytesseract==0.3.7
tesseract

Below is my Dockerfile:

FROM python:3.8.7-slim
WORKDIR /usr/src/app
ARG src_folder= "folder/"
ARG src_ocr= "Tesseract-OCR/"
COPY ${src_folder} .
COPY ${src_ocr} .
COPY requirements.txt .

# Install all the required dependencies
RUN apt-get update \
    && apt-get install -y \
        build-essential \
        cmake \
        git \
        wget \
        unzip \
        yasm \
        pkg-config \
        libswscale-dev \
        libtbb2 \
        libtbb-dev \
        libjpeg-dev \
        libpng-dev \
        libtiff-dev \
        libavformat-dev \
        libpq-dev \
    && rm -rf /var/lib/apt/lists/*
RUN apt-get --fix-missing update && apt-get --fix-broken install && apt-get install -y poppler-utils && apt-get install -y tesseract-ocr && \
    apt-get install -y libtesseract-dev && apt-get install -y libleptonica-dev && ldconfig && apt install -y libsm6 libxext6 && apt install -y python-opencv
RUN pip install -r requirements.txt

# command to run on container start
CMD [ "python", "./folder/main.py" ]

Answer

You have two problems here…

The primary problem is a strange one. The apt-get package tesseract-ocr-eng is installed as a transitive dependency of one of the other packages you install with apt-get:

# apt-get install tesseract-ocr-eng
...
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1).

and that package installs an English trained data file in the right place:

# ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
-rw-r--r-- 1 root root 4113088 Sep 15  2017 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata

and yet you get this error message suggesting that Tesseract can’t find this file.
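
To double-check from inside the container where the trained data actually lives and how large it is, a small script along these lines could help. This is only a sketch: the helper name find_traineddata and the list of directories it probes are my own assumptions, not something Tesseract or pytesseract provides. It also looks at the TESSDATA_PREFIX environment variable, both as the tessdata directory itself and as its parent, since the convention differs between Tesseract versions.

```python
import os
from pathlib import Path

# Directories to probe. These are assumptions; adjust them for your image.
DEFAULT_DIRS = [
    "/usr/share/tesseract-ocr/4.00/tessdata",
    "/usr/share/tessdata",
    "/usr/local/share/tessdata",
]


def find_traineddata(lang="eng", extra_dirs=(), env=None):
    """Return a list of (path, size_in_bytes) for each <lang>.traineddata found."""
    env = os.environ if env is None else env
    candidates = list(extra_dirs)

    # TESSDATA_PREFIX may point at the tessdata directory or at its parent.
    prefix = env.get("TESSDATA_PREFIX")
    if prefix:
        candidates += [prefix, os.path.join(prefix, "tessdata")]
    candidates += DEFAULT_DIRS

    found = []
    seen = set()
    for directory in candidates:
        path = Path(directory) / f"{lang}.traineddata"
        if path.is_file() and path not in seen:
            seen.add(path)
            found.append((str(path), path.stat().st_size))
    return found


if __name__ == "__main__":
    hits = find_traineddata("eng")
    if not hits:
        print("no eng.traineddata found in the probed directories")
    for path, size in hits:
        print(f"{path}: {size} bytes")
```

Running it before and after changing the Dockerfile makes it easy to see which file Tesseract is likely to pick up and whether its size changed.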

I did some searching, and after trying a number of different things that got Tesseract working, I settled on this as the most concise solution to your problem. Add this line near the end of your Dockerfile, for example just before the final CMD line that sets the command the container runs:

RUN wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata -O /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata 2> /dev/null

This command replaces the previously installed eng.traineddata file with the one from the tesseract-ocr/tessdata repository on GitHub. It is much larger than the previously installed file:

# ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
-rw-r--r-- 1 root root 23466654 Feb 14 20:17 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata

By replacing the previously installed eng.traineddata file with this new version, your code starts to run fine. I didn’t have your image data, obviously, so I had to change your code a bit to use my own image for testing. When I supplied an image with some text in it, I got back the text as the result of calling pytesseract.image_to_string. So this one fix should be all you need.

There is a second problem here. Your pytesseract.image_to_string call is being garbled somehow by the fact that you’re breaking it across multiple lines. To fix just this one issue, you can edit the call so that the string constant is all on one line:

infor = pytesseract.image_to_string(im,
                                     lang="eng",
                                     config='--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=][|~_}{=!#%&«§><:;—?¢°*@,')

When I made just this change, the part of the error message about “Can’t open …” went away. If you fix only that, you’re left with this error message:

pytesseract.pytesseract.TesseractError: (1, "Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")
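
Why a line break would produce the “Can’t open …” part can be illustrated with the standard library alone. As far as I can tell, pytesseract tokenizes the config string shell-style (with shlex.split) before handing it to the tesseract binary; treat that as an assumption rather than a documented guarantee. If whitespace slips in between tessedit_char_blacklist= and its value, the value becomes a separate argument, which Tesseract would then try to open as a config file. The sketch below uses a shortened, ASCII-only stand-in for your blacklist, and the "broken" string is a hypothetical reconstruction, since I don't know exactly how your original call was split.

```python
import shlex

# Shortened, ASCII-only stand-in for the blacklist in the question.
BLACKLIST = "][|~_}{=!#%&"

# Whole option on one line: the value stays attached to its key.
good = "--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=" + BLACKLIST

# Hypothetical broken variant: a line break plus indentation ended up
# between the key and its value.
broken = "--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=\n    " + BLACKLIST

good_args = shlex.split(good)
broken_args = shlex.split(broken)

# The last argument is the complete key=value pair.
print(good_args[-1])

# The key and the value are now two separate arguments.
print(broken_args[-2:])
```

Printing shlex.split(config) for your real config string is a quick way to confirm that the blacklist is still attached to tessedit_char_blacklist= before calling image_to_string.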

It’s interesting that if you apply just the first fix, both problems go away, as you don’t get an error message at all. I don’t know what’s up with that.

I believe that I’ve given you all that you need to know. If you have additional problems, let us know. If you want me to share my versions of your Dockerfile and main.py files, I can do that.

Happy Tesseracting!

PS: I would recommend that you move the installation calls in your Dockerfile, the calls to apt-get and pip, to the top of the file, with only the COPY of requirements.txt ahead of the pip call. Docker caches each build step and reuses it as long as that step and everything before it are unchanged, so if the parts specific to your application come later in the file, you can modify them without the long installation steps having to run again. This is an important practice to understand when building Docker images, and it will save you a lot of time otherwise spent watching long image builds. I did this right away when working on your problem, and I could rebuild and run each new version of the image in a few seconds rather than more than a minute.



Source: stackoverflow