
Prepare train data before running training #418


Open
Shebuka wants to merge 4 commits into main
Conversation

Shebuka commented Mar 24, 2025

Added an option to prepare the train and eval indices from the ground truth before starting the training with make prepare-data.

This gives you more control over what's used to evaluate the training, which is super handy if you have different sets of training data (like different fonts or special symbols). The new target walks the specified ground truth folder and all of its subfolders, gathers all the different sets of training data, combines them into one training set, and then lets you fine-tune what gets used for evaluation.
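For illustration only, here is a minimal shell sketch of such a per-subfolder split, assuming a layout like data/eng_sc-ground-truth/<subset>/*.gt.txt and tesstrain's default RATIO_TRAIN of 0.90; the folder names are hypothetical and this is not the PR's actual implementation:

# Sketch only: gather *.gt.txt transcriptions from every subfolder and
# split each subfolder's lines into train/eval lists by RATIO_TRAIN.
GT_DIR=data/eng_sc-ground-truth   # hypothetical ground truth folder
OUT=data/eng_sc                   # hypothetical output folder
RATIO_TRAIN=0.90                  # tesstrain's default train/eval ratio
mkdir -p "$OUT"
: > "$OUT/list.train"
: > "$OUT/list.eval"
for dir in "$GT_DIR"/*/; do
    files=$(mktemp)
    find "$dir" -name '*.gt.txt' | shuf > "$files"
    total=$(wc -l < "$files")
    train=$(awk -v t="$total" -v r="$RATIO_TRAIN" 'BEGIN { printf "%d", t * r }')
    head -n "$train" "$files" >> "$OUT/list.train"
    tail -n +"$((train + 1))" "$files" >> "$OUT/list.eval"
    rm -f "$files"
done

In tesstrain itself the lists reference the compiled .lstmf files rather than the .gt.txt transcriptions; the sketch only shows the per-subfolder ratio logic.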

zdenop (Contributor) commented Mar 25, 2025

Can you please provide information on how and why to use this new command with the provided example data?
E.g., now you can try Tesseract training like this:

git clone --depth 1 https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
make tesseract-langdata
mkdir tessdata_best
wget https://github.com/tesseract-ocr/tessdata/raw/main/eng.traineddata -P tessdata_best
unzip ocrd-testset.zip -d data/ocrd-ground-truth
make training MODEL_NAME=ocrd TESSDATA=tessdata_best MAX_ITERATIONS=10000

In which step should the user run make prepare-data and why?

Shebuka (Author) commented Mar 26, 2025

The ocrd set is not a useful example for prepare-data, as it is all in very similar scripts.

prepare-data is used as the first step, to fine-tune the train/eval split. The case I'm using it for right now is this:
My GT folder has 6 subfolders, each containing specific parts of the document I want to recognize, in two very different fonts. After running prepare-data the GT data is merged, but from each subfolder the train/eval entries are picked according to RATIO. Next, I open both the train and eval lists side by side in a text editor and fine-tune what goes into eval. This way I make sure that all special cases appear at least once in the eval set.

make prepare-data MODEL_NAME=eng_sc START_MODEL=eng TESSDATA=data/tessdata GROUND_TRUTH_DIR=data/eng_sc-ground-truth
# manually review and fine-tune the train/eval sets
make training MODEL_NAME=eng_sc START_MODEL=eng TESSDATA=data/tessdata GROUND_TRUTH_DIR=data/eng_sc-ground-truth

make training will also now skip creating the train/eval sets if they already exist.
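A minimal Makefile sketch of that skip behaviour, assuming the lists live under data/$(MODEL_NAME) as in tesstrain; the target names and recipes are illustrative, not the PR's actual code:

# Sketch only: because the lists are plain file targets with no
# prerequisites, make treats existing (possibly hand-edited) files as
# up to date and does not regenerate them on a later `make training`.
OUTPUT_DIR := data/$(MODEL_NAME)
LISTS := $(OUTPUT_DIR)/list.train $(OUTPUT_DIR)/list.eval

$(LISTS):
	@mkdir -p $(OUTPUT_DIR)
	# ... generate the initial RATIO-based train/eval split here ...

prepare-data: $(LISTS)

training: $(LISTS)
	# ... run the actual training steps here ...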

I can provide my training set if you want to explore this fine-tuning usage further.
