Numbers from screenshots with Tesseract OCR sandbox

DESCRIPTION

This repo contains everything needed to OCR images of numbers grabbed from the screen.

If you have a bunch of number images and want to convert them into plain text, this is what you need.

ON WINDOWS

HOW TO USE

Clone this repository.
Install tesseract-3.01. If it is no longer available, install tesseract from the distros subfolder.

After that, you have the following subfolders:

Samples

It is full of sample number images. It is convenient to OCR them all together, which is why I created the total.png file:

exp1 - as is

cd exp1 - as is

That folder contains run.cmd, which OCRs total.png. The resulting text is in total.txt. You can see the errors:

Tesseract recognizes 6 and 8 as 5 and misses the decimal dot (.).
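
For readers without access to run.cmd, the OCR step can be sketched in Python. This is a minimal sketch, not the script from this repo: it assumes the tesseract executable is on PATH and uses the basic "tesseract IMAGE OUTPUTBASE" invocation, which writes the text to OUTPUTBASE.txt. The helper names build_ocr_command and run_ocr are hypothetical.

```python
import subprocess
from pathlib import Path


def build_ocr_command(image, output_base=None):
    """Return the argument list for a plain `tesseract IMAGE OUTPUTBASE` call."""
    image = Path(image)
    # Tesseract appends ".txt" to the output base on its own.
    base = Path(output_base) if output_base else image.with_suffix("")
    return ["tesseract", str(image), str(base)]


def run_ocr(image, output_base=None):
    """Run tesseract and return the recognized text.

    Assumption: the tesseract executable is installed and on PATH.
    """
    command = build_ocr_command(image, output_base)
    subprocess.run(command, check=True)
    return Path(command[2] + ".txt").read_text()
```

Calling run_ocr("total.png") from inside the experiment folder would leave the text in total.txt, the same file name the run.cmd scripts produce.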

exp2 - trained

cd exp2 - trained

That folder contains train.cmd, which automatically trains tesseract for such images. Look through it and read the user guide to learn how to train tesseract.

To train tesseract automatically just launch train.cmd
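
To give an idea of what a training run involves, here is a rough sketch of the command sequence from the general Tesseract 3.0x training procedure. It is not a copy of train.cmd, and the exact flags differ between Tesseract versions, so check them against the user guide for the version you installed. The language name "num" and the font name "screen" are placeholders chosen for illustration.

```python
def training_commands(lang="num", font="screen", exp=0):
    """Return the command lines of a Tesseract 3.0x style training run.

    Assumption: `lang` and `font` are placeholder names, and the flags
    follow the 3.0x training guide; verify them for your version.
    """
    base = f"{lang}.{font}.exp{exp}"
    return [
        # 1. Generate a box file, which is then corrected by hand.
        ["tesseract", f"{base}.tif", base, "batch.nochop", "makebox"],
        # 2. Produce the .tr feature file from the image and the box file.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # 3. Collect the set of characters that occur in the box file.
        ["unicharset_extractor", f"{base}.box"],
        # 4. Cluster the features (font_properties is a hand-written file).
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # 5. Once the clustering outputs are renamed with the "<lang>."
        #    prefix, pack everything into <lang>.traineddata.
        ["combine_tessdata", f"{lang}."],
    ]
```

Printing " ".join(command) for each entry gives a list of one-liners that can be compared with what train.cmd actually runs.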

Launch run.cmd to OCR total.png with the trained tesseract. The resulting text is in total.txt. You can see the errors:

You can see that tesseract learned how to distinguish 6 and 8 from 5, but it still misses decimal dots (.).
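
Spotting such errors by eye gets tedious, so a small tally can help. The following is a minimal sketch, assuming you have the expected text of the image available as a string; it uses difflib from the Python standard library and is not the comparison method used in this repo. The function name tally_errors is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def tally_errors(expected, recognized):
    """Count (expected, recognized) mismatches between two strings.

    An empty string on either side means characters were dropped or inserted.
    """
    errors = Counter()
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        want, got = expected[i1:i2], recognized[j1:j2]
        if tag == "replace" and len(want) == len(got):
            # Same-length replacement: pair the characters one to one.
            errors.update(zip(want, got))
        else:
            # Dropped, inserted or unevenly replaced run of characters.
            errors[(want, got)] += 1
    return errors
```

For example, tally_errors("46", "45") counts one ("6", "5") confusion, and tally_errors("12.5", "125") counts one (".", "") entry, which is a missed decimal dot.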

exp3 - scaled

Since there are still errors, try scaling total.png. To do that, cd exp3 - scaled

It contains total-scaled.png, a fragment of which you can see below:

To OCR total-scaled.png, launch run.cmd. The resulting text is in total.txt. You can see the errors:

It confuses 7 with 2 and adds 3 redundant spaces between digits.
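
The redundant spaces can be cleaned up after recognition. Below is a minimal post-processing sketch, assuming each line of total.txt holds a single number, so that any gap between digits is an OCR artifact and not a separator between two numbers. The function name collapse_digit_spaces is hypothetical.

```python
import re


def collapse_digit_spaces(line):
    """Remove spaces and tabs that sit between two number characters.

    Assumption: the line holds one number, so such gaps are OCR noise.
    """
    # Lookarounds keep the digits (or dots) themselves and drop only the gap.
    return re.sub(r"(?<=[0-9.])[ \t]+(?=[0-9.])", "", line)
```

With that assumption, collapse_digit_spaces("1 2 3.4 5") gives "123.45". This does nothing about the 7 and 2 confusion, which has to be fixed on the recognition side.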

exp4 - resized

You can scale total.png in a different way: cd exp4 - resized. It contains total-resized.png, a fragment of which you can see below:

To OCR total-resized.png, launch run.cmd. The resulting text is in total.txt. You can see the errors:

exp5 - one by one

What happens if you OCR the number images one by one?

cd exp5 - one by one

It contains 10 sample images and the corresponding txt files with the recognition results.

To OCR them, launch run.cmd. Check the text files to find the errors. Some 2 and 3 digit numbers are not recognized at all!
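
One thing that may be worth trying for such tiny images is telling tesseract that each image is a single text line. The sketch below only builds the command lines. It assumes the page segmentation flag is spelled -psm, as in Tesseract 3.x (newer versions use --psm), and it is an experiment to try, not a confirmed fix for the failures described above. The function name single_line_commands is hypothetical.

```python
from pathlib import Path


def single_line_commands(images, psm_flag="-psm"):
    """Build one tesseract call per image, asking for single-line mode.

    Assumption: `psm_flag` matches the installed version
    ("-psm" for Tesseract 3.x, "--psm" for newer versions).
    """
    commands = []
    for image in images:
        image = Path(image)
        base = image.with_suffix("")  # tesseract adds ".txt" itself
        # Page segmentation mode 7 treats the image as a single text line.
        commands.append(["tesseract", str(image), str(base), psm_flag, "7"])
    return commands
```

Each command can then be passed to subprocess.run(command, check=True), and the results compared with the txt files already in the folder.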

exp6 - ten in line

What happens if you OCR all 10 images together in one line?

cd exp6 - ten in line

It contains teninline.png and the corresponding txt file with the recognition result.

To OCR it, launch run.cmd. Check the text file: it contains no errors!
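
If you recognize several numbers in one line, you then need to split the result back into individual values. A minimal sketch, assuming the numbers come out separated by whitespace and that you know how many of them the image contains (10 here). The function name split_line_result is hypothetical.

```python
def split_line_result(text, expected_count=10):
    """Split the OCR text of a one-line image into its individual numbers.

    Raises ValueError when the count is off, which usually means that
    tesseract merged two numbers or split one of them.
    """
    numbers = text.split()
    if len(numbers) != expected_count:
        raise ValueError(
            f"expected {expected_count} numbers, got {len(numbers)}"
        )
    return numbers
```

The count check is a cheap way to detect a bad recognition before the values are used anywhere else.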

ON LINUX

It takes little effort to port these cmd one-liners to bash. Write them, test them, and submit a pull request if you wish to contribute.

Zloy/tesseract-training
