The beauty of using open-source deep learning models for personal use.
Part 1: (The Problem and Its Dimensions)
A while ago, I had more than 10,000 screenshots taken over a long period of time, and I wanted to find the single screenshot that contained a specific word I was looking for.
Naturally, I thought I might have to abandon the idea.
If I spent one second on each screenshot looking for that word, I would spend about 3 hours searching continuously. That would be exhausting in practice: it's impossible to stay focused for 3 hours straight, and my accuracy searching with the naked eye would be very low.
That's when the idea came to me: I could use an open-source OCR (Optical Character Recognition) model called EasyOCR.
This model produces strong results on structured text in images, though it is less suited to text in natural scenes (like text on a sign).
I learned how to use it from this respected gentleman's video:
However, his setup in the video differed from what I needed in a few ways, given that my problem involved 10,000 images.
The problems were as follows:
The first was that I had to create a virtual environment. He ran his code in a Jupyter Notebook, and unfortunately I don't have much experience with it, since most of my work is in JavaScript or the Node.js environment with the terminal. I wanted to run everything in VS Code, so every package he used, I had to install manually from the terminal.
The second was that he downloaded the model, but some elements of the code were outdated, so I had to look up the latest versions of the packages in case anything had changed. (Not much had, but the video was three years old, so I had to make sure.)
The third issue was that his code worked on the path of one image at a time, which didn't solve my problem. (I would have had to feed the model the paths of ten thousand images one by one myself, which solved nothing.)
The fourth was that I wanted the result presented in a specific way: I didn't want it to modify the found image directly by drawing a green box on it (which is what his code did).
Finally, this isn't a difference so much as an opportunity: knowing my screenshots, I could filter them in a way that reduced the number of images to search from 10,000 to around 7,000.
Part 2: (The Solutions and the Outcome)
The Solutions Were as Follows:
I created the virtual environment.
I installed the packages he used, but in their updated versions.
I made sure the model worked on a random image by entering its path manually.
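That sanity check can be sketched as follows. This is a minimal sketch based on EasyOCR's documented `Reader`/`readtext` API, not the tutorial's exact code; the image path is hypothetical, and the `ocr_image` helper name is my own.

```python
def ocr_image(path, languages=('en',)):
    """Return all text EasyOCR recognizes in one image as a single string."""
    import easyocr  # assumes `pip install easyocr` has been run in the venv
    # gpu=False forces CPU mode; the model weights download on first use
    reader = easyocr.Reader(list(languages), gpu=False)
    # readtext returns a list of (bounding_box, text, confidence) tuples
    return ' '.join(text for _, text, _ in reader.readtext(str(path)))

# Example with a hypothetical path:
# print(ocr_image('screenshots/sample.png'))
```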
I filtered the images using rules based on my knowledge of them. For example, the screenshot I wanted was probably taken between certain dates, so there was no need to search before or after that range. I started filtering based on that.
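One way to sketch that date filter — assuming, as held for my screenshots, that a file's modification time matches when the screenshot was taken. The `filter_by_date` name and the example date range are illustrative, not from the original code:

```python
from datetime import datetime
from pathlib import Path

def filter_by_date(paths, start, end):
    """Keep only files whose modification time falls inside [start, end]."""
    kept = []
    for path in paths:
        # st_mtime is the file's last-modified timestamp in seconds
        taken = datetime.fromtimestamp(Path(path).stat().st_mtime)
        if start <= taken <= end:
            kept.append(path)
    return kept

# Example: only consider screenshots taken during 2023
# shortlist = filter_by_date(all_paths, datetime(2023, 1, 1), datetime(2023, 12, 31))
```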
After gaining a basic understanding of his code and what it does (since deep learning and related fields are not my area of study or specialization), I began modifying it to solve the problem of processing one image path at a time.
I told it to scan a directory I specified for any file with the extensions jpg, jpeg, or png, and to process all of them to produce the desired result.
The final condition I set was that after finding an image containing the text, it should copy it to a new folder under a completely new random (numeric) name, along with a .txt file containing the text found in the image.
To keep track of what it was doing, I added some print statements to indicate whether it found anything in the image it processed.
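Putting those modifications together, here is a minimal sketch of the batch loop. The names (`search_screenshots`, `read_text`) are my illustrative choices, not the original code; `read_text` stands in for the OCR call, e.g. a wrapper around EasyOCR's `readtext`.

```python
import random
import shutil
from pathlib import Path

IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png'}

def search_screenshots(source_dir, target_word, output_dir, read_text):
    """Scan source_dir for images, OCR each via read_text, and save matches.

    read_text(path) must return the recognized text of one image as a string.
    Each match is copied under a random numeric name, alongside a .txt file
    holding the text that was found.
    """
    source, output = Path(source_dir), Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    matches = []
    for image in sorted(source.iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue  # skip anything that isn't jpg/jpeg/png
        text = read_text(image)
        if target_word.lower() in text.lower():
            new_name = str(random.randint(100_000, 999_999))
            shutil.copy(image, output / f'{new_name}{image.suffix}')
            (output / f'{new_name}.txt').write_text(text)
            matches.append(image.name)
            print(f'Found "{target_word}" in {image.name}')
        else:
            print(f'Nothing in {image.name}')
    return matches
```

With EasyOCR, `read_text` could be something like `lambda p: ' '.join(t for _, t, _ in reader.readtext(str(p)))`.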
Of course, everything seemed great and smooth, but “little did he know his joy would not last.”
The Outcome:
The model worked perfectly and started producing results. So, what was the problem?
First: I spent about an hour setting this all up.
Second: After all this, it still took 4 hours to produce the result.
What I hadn't properly considered or accounted for was processing power. These models are not lightweight; to run quickly they need at least a decent GPU, which my poor PC didn't have.
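The lesson: check for a GPU before committing to a long OCR run. Here is a quick probe, assuming EasyOCR's PyTorch backend; it simply reports False if PyTorch isn't installed at all.

```python
def cuda_available():
    """Report whether PyTorch (EasyOCR's backend) can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return False  # no PyTorch, so no GPU acceleration either
    return torch.cuda.is_available()

# On a machine like mine this prints False, which predicts a slow CPU-only run.
print('GPU available:', cuda_available())
```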
In short, "I succeeded in failing."
But to be fair, I wasn't entirely at a loss because while it was running, I was already doing other things (studying – playing – having lunch – browsing TikTok, etc.). As I mentioned, there's no way I could stay focused for 3 hours on 10k screenshots to find one word.
In the end, I'm just an end user of this model and all deep learning models in general. As I said, this isn't my field of expertise (I only studied a bit from two courses in college and that was it).
However, it was a nice experience to use my limited knowledge to solve a personal problem with some small tweaks to open-source code.
Thankfully, everything is fine now, but if you know a place where I can buy a new CPU to replace the one that got fried, I'd really appreciate it 😂.
If you liked this blog post, give it a like and share, and don't forget to check out the GitHub repository and take a look at the code here: My Code On GitHub 💻
And Follow Me on LinkedIn for more: My LinkedIn Profile 🤖
Written by
Mohamed Abusaif
Back-End Developer, a Computer Science graduate from Benha University, and a tech geek who loves new technology topics. I love to try new stuff every day, gain new expertise, and expand my social network, sharing knowledge about the things I learn every day.