
OpenCV tesserocr watermark detection

I have about 12,000 image links in my SQL table. The goal is to detect which of those images contain watermark text and which don't. All of the text and borders look like this.

I've tried OpenCV and tesserocr.


But it doesn't seem to recognize the text in the image at all.

My second approach was to use some external open API site.


It works, but it's super slow. For 11,000 images it would take a few days.


Answer

tesserocr isn't detecting any text because the text in the image is very small. When the text region is cropped out and passed on its own, pytesseract can extract the text. Detecting the text area with contours and dilation didn't work either, again because of the small text size. To find the text region, I used the EAST model to extract all text regions using this solution, then combined them into a single region. Passing the cropped combined region to Tesseract returns the text. To run this script, you need to download the model, which can be found here, and install the required dependencies.
Python Script:

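The steps described above (EAST detection, combining the regions, cropping, OCR) can be sketched roughly as follows. This is not the original script, only a minimal illustration. It assumes the commonly distributed `frozen_east_text_detection.pb` model file, a 320x320 network input, a 0.5 confidence threshold, and 5 pixels of padding; the function names `decode_east`, `merge_boxes`, and `extract_text` are made up for this sketch.

```python
import math

# Output layers of the standard frozen EAST model: text scores and box geometry.
EAST_LAYERS = ["feature_fusion/Conv_7/Sigmoid", "feature_fusion/concat_3"]


def decode_east(scores, geometry, min_conf=0.5):
    """Turn EAST score/geometry maps into (x1, y1, x2, y2) boxes in network-input pixels."""
    boxes = []
    rows, cols = len(scores[0][0]), len(scores[0][0][0])
    for y in range(rows):
        for x in range(cols):
            if scores[0][0][y][x] < min_conf:
                continue
            # Geometry channels: distances to top, right, bottom, left edges, then angle.
            top, right, bottom, left, angle = (float(geometry[0][i][y][x]) for i in range(5))
            cos, sin = math.cos(angle), math.sin(angle)
            # The feature map is 4x smaller than the network input.
            end_x = x * 4.0 + cos * right + sin * bottom
            end_y = y * 4.0 - sin * right + cos * bottom
            boxes.append((int(end_x - (left + right)), int(end_y - (top + bottom)),
                          int(end_x), int(end_y)))
    return boxes


def merge_boxes(boxes, width, height, pad=5):
    """Combine all boxes into one padded bounding region, clamped to the image."""
    if not boxes:
        return None
    x1 = max(min(b[0] for b in boxes) - pad, 0)
    y1 = max(min(b[1] for b in boxes) - pad, 0)
    x2 = min(max(b[2] for b in boxes) + pad, width)
    y2 = min(max(b[3] for b in boxes) + pad, height)
    return x1, y1, x2, y2


def extract_text(image_path, model_path="frozen_east_text_detection.pb", size=320):
    """Detect text with EAST, crop the combined region, and OCR it with pytesseract."""
    # Imported here so the helpers above work without OpenCV or Tesseract installed.
    import cv2
    import pytesseract

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    height, width = image.shape[:2]

    net = cv2.dnn.readNet(model_path)  # for many images, load the network once and reuse it
    blob = cv2.dnn.blobFromImage(image, 1.0, (size, size),
                                 (123.68, 116.78, 103.94), swapRB=True, crop=False)
    net.setInput(blob)
    scores, geometry = net.forward(EAST_LAYERS)

    # Scale boxes from network-input coordinates back to the original image.
    sx, sy = width / size, height / size
    boxes = [(int(x1 * sx), int(y1 * sy), int(x2 * sx), int(y2 * sy))
             for x1, y1, x2, y2 in decode_east(scores, geometry)]

    region = merge_boxes(boxes, width, height)
    if region is None:
        return ""
    x1, y1, x2, y2 = region
    if x2 <= x1 or y2 <= y1:
        return ""
    crop = cv2.cvtColor(image[y1:y2, x1:x2], cv2.COLOR_BGR2RGB)
    return pytesseract.image_to_string(crop).strip()
```

Calling `extract_text("image.jpg")` (a placeholder path) would return the recognized text, or an empty string when EAST finds no text region. Because all regions are merged into one box, no non-maximum suppression is applied here; that only works when the text sits in a single area of the image.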

Here's the output of the script:
Bounding box of the text regions
Cropped text region
Extracted text

Since you mention in the post that all of your images are similar, you can extract the text from them by repurposing this script. The script is fairly fast: it takes 2.1 seconds for this image.
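At roughly 2.1 seconds per image, 12,000 images would take about 7 hours if processed one after another. A rough sketch of how the per-image step might be applied to the whole set is below. It assumes the images have already been downloaded to local paths and that `extract_text` is whatever callable returns the OCR text for one path; `estimate_hours`, `flag_watermarked`, the worker count, and the `min_chars` threshold are all hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def estimate_hours(n_images, seconds_per_image, workers=1):
    """Rough wall-clock estimate, assuming perfect scaling across workers."""
    return n_images * seconds_per_image / workers / 3600


def flag_watermarked(paths, extract_text, workers=4, min_chars=3):
    """Map each path to True when OCR returns at least min_chars non-space characters."""
    paths = list(paths)

    def has_text(path):
        text = extract_text(path) or ""
        return len("".join(text.split())) >= min_chars

    # Threads are used here for simplicity; real speedup depends on the OCR backend.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(has_text, paths)))
```

The `min_chars` cutoff is there so that a stray character or whitespace from OCR noise does not count as a watermark; the right value would need to be checked against a sample of the real images.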

User contributions licensed under: CC BY-SA