I recently received the following error:
File "/usr/local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 287, in run_and_get_output
    run_tesseract(**kwargs)
File "/usr/local/lib/python3.8/site-packages/pytesseract/pytesseract.py", line 263, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, "read_params_file: Can't open ][|~_}{=!#%&«§><:;—?¢°*@, Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")
I have a Python file where I specify the pytesseract location:
pytesseract.pytesseract.tesseract_cmd = r"to/my/path"
I also included tesseract and pytesseract in my requirements, and I install tesseract-ocr in the Dockerfile.
I do not understand why this happens. Can anyone assist?
I also copied my tesseract-ocr folder into the image in the Dockerfile:
COPY tesseract-ocr .
Edited:
Below are my requirements:
opencv-python==4.5.1.48
openpyxl==3.0.6
packaging==20.8
pandas==1.2.1
pathlib==1.0.1
patsy==0.5.1
pdfminer.six==20200517
pdfplumber==0.5.25
Pillow==8.1.0
prov==2.0.0
pycryptodome==3.9.9
pydot==1.4.1
PyMuPDF==1.16.14
pyparsing==2.4.7
PyPDF2==1.26.0
pytesseract==0.3.7
tesseract
Below is my Dockerfile:
FROM python:3.8.7-slim
WORKDIR /usr/src/app
ARG src_folder= "folder/"
ARG src_ocr= "Tesseract-OCR/"
COPY ${src_folder} .
COPY ${src_ocr} .
COPY requirements.txt .
# Install all the required dependencies
RUN apt-get update \
    && apt-get install -y build-essential cmake git wget unzip yasm pkg-config libswscale-dev libtbb2 libtbb-dev libjpeg-dev libpng-dev libtiff-dev libavformat-dev libpq-dev \
    && rm -rf /var/lib/apt/lists/*
RUN apt-get --fix-missing update \
    && apt-get --fix-broken install \
    && apt-get install -y poppler-utils \
    && apt-get install -y tesseract-ocr \
    && apt-get install -y libtesseract-dev \
    && apt-get install -y libleptonica-dev \
    && ldconfig \
    && apt install -y libsm6 libxext6 \
    && apt install -y python-opencv
RUN pip install -r requirements.txt
# command to run on container start
CMD [ "python", "./folder/main.py" ]
Answer
You have two problems here…
The primary problem is a strange one. The apt-get package tesseract-ocr-eng is installed as a transitive dependency of one of the other packages you install with apt-get:
# apt-get install tesseract-ocr-eng
...
tesseract-ocr-eng is already the newest version (1:4.00~git30-7274cfa-1).
and that package installs an English trained data file in the right place:
# ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
-rw-r--r-- 1 root root 4113088 Sep 15 2017 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
and yet you get this error message suggesting that Tesseract can’t find this file.
I did some Googling, and after trying a number of different things that got Tesseract working, I settled on this as the most concise solution to your problem. Just add this line near the end of your Dockerfile, right before the final CMD line that sets the command to be executed:
RUN wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata -O /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata 2> /dev/null
This command replaces the previously installed eng.traineddata file with a copy downloaded from the tesseract-ocr/tessdata repository on GitHub. It is much larger than the previously installed file:
# ls -l /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
-rw-r--r-- 1 root root 23466654 Feb 14 20:17 /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata
By replacing the previously installed eng.traineddata file with this new version, your code starts to run fine. I didn’t have your image data, obviously, so I had to change your code a bit to use my own image for testing. When I supplied an image with some text in it, I got back the text as the result of calling pytesseract.image_to_string. So this one fix should be all you need.
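If you want to confirm which eng.traineddata file actually ends up in the image, a small check along these lines may help. It is only a sketch: find_traineddata is a hypothetical helper name I made up for illustration, and the default directory is simply the one from the listings above, which may differ for other Tesseract versions.

```python
from pathlib import Path

# Directory taken from the `ls -l` listings above (an assumption for other
# Tesseract versions, which may keep their data elsewhere).
DEFAULT_TESSDATA_DIRS = ["/usr/share/tesseract-ocr/4.00/tessdata"]


def find_traineddata(lang="eng", search_dirs=DEFAULT_TESSDATA_DIRS):
    """Return a list of (path, size_in_bytes) for each <lang>.traineddata found.

    Hypothetical helper for illustration; it only inspects the file system
    and does not call Tesseract itself.
    """
    found = []
    for directory in search_dirs:
        candidate = Path(directory) / f"{lang}.traineddata"
        if candidate.is_file():
            found.append((str(candidate), candidate.stat().st_size))
    return found


if __name__ == "__main__":
    matches = find_traineddata("eng")
    if not matches:
        print("eng.traineddata not found in the searched directories")
    for path, size in matches:
        print(f"{path}: {size} bytes")
```

Run inside the container, the reported size tells you which file you have: 4113088 bytes for the one installed by apt-get, 23466654 bytes for the downloaded replacement.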
There is a second problem here. Your pytesseract.image_to_string call is being garbled somehow by the fact that you’re breaking it across multiple lines. To fix just this one issue, you can edit the call so that the string constant is all on one line:
infor = pytesseract.image_to_string(im, lang="eng", config='--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=][|~_}{=!#%&«§><:;—?¢°*@,')
When I made just this change, the “Can’t open …” part of your error message went away. With only that fix applied, you’re left with this error message:
pytesseract.pytesseract.TesseractError: (1, "Failed loading language 'eng' Tesseract couldn't load any languages! Could not initialize tesseract.")
It’s interesting that if you apply just the first fix, both problems go away, as you don’t get an error message at all. I don’t know what’s up with that.
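Here is one possible explanation for the “Can’t open …” part, offered as a guess rather than something I verified against your code. As far as I know, pytesseract tokenizes the config string in a shell-like way (shlex.split in the versions I’m aware of), so any whitespace that a line break adds between tessedit_char_blacklist= and its value would turn the value into a separate argument, which Tesseract may then try to open as a config file. The broken string below is a hypothetical reconstruction, since I don’t know exactly how your original line was split.

```python
import shlex


def tokenize(config):
    """Split a config string the way a shell would (assumed to mirror pytesseract)."""
    return shlex.split(config)


# The config from the one-line call above.
one_line = "--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=][|~_}{=!#%&«§><:;—?¢°*@,"

# Hypothetical multi-line version: the line break inside the literal puts
# whitespace between the option name and its value.
broken = """--dpi 300 --psm 6 --oem 2 -c tessedit_char_blacklist=
    ][|~_}{=!#%&«§><:;—?¢°*@,"""

# Adjacent string literals inside parentheses are joined by Python with no
# extra whitespace, so the source line can be wrapped safely.
wrapped = (
    "--dpi 300 --psm 6 --oem 2 "
    "-c tessedit_char_blacklist="
    "][|~_}{=!#%&«§><:;—?¢°*@,"
)

if __name__ == "__main__":
    print(tokenize(one_line)[-1])  # blacklist still attached to its option
    print(tokenize(broken)[-1])    # blacklist characters alone, as a stray argument
    print(wrapped == one_line)
```

If that guess is right, it would also fit with both problems disappearing after the first fix: once the language data loads, Tesseract exits successfully and the stray-argument complaint no longer surfaces as an exception.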
I believe that I’ve given you all that you need to know. If you have additional problems, let us know. If you want me to share my versions of your Dockerfile and main.py files, I can do that.
Happy Tesseracting!
PS: I would recommend that you move the installation steps in your Dockerfile, the calls to apt-get and pip, to the top of the file. That way, Docker can reuse its cached layers for those steps when you later change the application-specific parts further down, and the image rebuilds quickly instead of repeating all of the long installation steps. This is an important practice to understand when building Docker images. It will save you a ton of time watching long image builds over and over again. I did this right away when working on your problem, and I could rebuild and run each new version of the image in just a few seconds rather than more than a minute.