
Prepare train data before running training #418


Open
Shebuka wants to merge 4 commits into main
Conversation

Shebuka commented Mar 24, 2025

Added an option to prepare the train and eval indices from the ground truth before starting the training with make prepare-data.

This gives you more control over what's used to evaluate the training, which is super handy if you have different sets of training data (like different fonts or special symbols). The new target walks the specified ground truth folder and all of its subfolders, gathers all the different sets of training data, combines them into one training set, and then lets you fine-tune what gets used for evaluation.
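For illustration only, here is a minimal shell sketch of such a per-subfolder split, assuming a layout like data/eng_sc-ground-truth/<subset>/*.gt.txt and tesstrain's default RATIO_TRAIN of 0.90; the folder names are hypothetical and this is not the PR's actual implementation:

# Sketch only: gather *.gt.txt transcriptions from every subfolder and
# split each subfolder's lines into train/eval lists by RATIO_TRAIN.
GT_DIR=data/eng_sc-ground-truth   # hypothetical ground truth folder
OUT=data/eng_sc                   # hypothetical output folder
RATIO_TRAIN=0.90                  # tesstrain's default train/eval ratio
mkdir -p "$OUT"
: > "$OUT/list.train"
: > "$OUT/list.eval"
for dir in "$GT_DIR"/*/; do
    files=$(mktemp)
    find "$dir" -name '*.gt.txt' | shuf > "$files"
    total=$(wc -l < "$files")
    train=$(awk -v t="$total" -v r="$RATIO_TRAIN" 'BEGIN { printf "%d", t * r }')
    head -n "$train" "$files" >> "$OUT/list.train"
    tail -n +"$((train + 1))" "$files" >> "$OUT/list.eval"
    rm -f "$files"
done

In tesstrain itself the lists reference the compiled .lstmf files rather than the .gt.txt transcriptions; the sketch only shows the per-subfolder ratio logic.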

zdenop (Contributor) commented Mar 25, 2025

Can you please provide information on how and why to use this new command with the provided example data?
E.g., now you can try Tesseract training like this:

git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
make tesseract-langdata
mkdir tessdata_best
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000

In which step should the user run make prepare-data and why?

Shebuka (Author) commented Mar 26, 2025

The ocrd set is not a useful example for prepare-data, as it is all in very similar scripts.

prepare-data is used as the first step, to fine-tune the train/eval split. The case I'm using it for right now is this:
My GT folder has 6 subfolders, each containing specific parts of the document I want to recognize, in two very different fonts. After running prepare-data the GT data is merged, but from each subfolder the train/eval entries are picked according to RATIO. Next, I open both the train and eval lists side by side in a text editor and fine-tune what goes into eval. This way I make sure that all special cases appear at least once in the eval set.

make prepare-data MODEL_NAME=eng_sc START_MODEL=eng TESSDATA=data/tessdata GROUND_TRUTH_DIR=data/eng_sc-ground-truth
# manually review and fine-tune the train/eval sets
make training MODEL_NAME=eng_sc START_MODEL=eng TESSDATA=data/tessdata GROUND_TRUTH_DIR=data/eng_sc-ground-truth

make training will also now skip creating the train/eval sets if they already exist.
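A minimal Makefile sketch of that skip behaviour, assuming the lists live under data/$(MODEL_NAME) as in tesstrain; the target names and recipes are illustrative, not the PR's actual code:

# Sketch only: because the lists are plain file targets with no
# prerequisites, make treats existing (possibly hand-edited) files as
# up to date and does not regenerate them on a later `make training`.
OUTPUT_DIR := data/$(MODEL_NAME)
LISTS := $(OUTPUT_DIR)/list.train $(OUTPUT_DIR)/list.eval

$(LISTS):
	@mkdir -p $(OUTPUT_DIR)
	# ... generate the initial RATIO-based train/eval split here ...

prepare-data: $(LISTS)

training: $(LISTS)
	# ... run the actual training steps here ...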

I can provide my training set if you want to explore this fine-tuning usage further.
