Skip to content

Latest commit

 

History

History
74 lines (54 loc) · 2.94 KB

README.md

File metadata and controls

74 lines (54 loc) · 2.94 KB

gogosseract

Coverage Go Report Card Go Reference

A reimplementation of https://github.com/otiai10/gosseract without CGo, running Tesseract compiled to WASM with Emscripten via Wazero.

Tesseract is an Optical Character Recognition library written in C++.

The WASM is generated from my personal fork of robertknight's well written tesseract-wasm project.

Note that Tesseract is only compiled with support for the LSTM neural network OCR engine, and not for "classic" Tesseract.

Training Data

Tesseract requires training data in order to accurately recognize text. The official source is here. Strategies for dealing with this include downloading it at runtime, or embedding the file within your Go binary using go:embed at compile time.

Accuracy

Tesseract can work better if the input images are preprocessed. See this page for tips.

https://tesseract-ocr.github.io/tessdoc/ImproveQuality.html

Examples

Using Tesseract to parse text from an image.

    trainingDataFile, err := os.Open("eng.traineddata")
    handleErr(err)

    cfg := gogosseract.Config{
        Language: "eng",
        TrainingData: trainingDataFile,
    }
    // While Tesseract's logs are very useful for debugging, you have the option to silence or redirect it
    cfg.Stderr = io.Discard
    cfg.Stdout = io.Discard
    // Compile the Tesseract WASM and run it, loading in the TrainingData and setting any Config Variables provided
    tess, err := gogosseract.New(ctx, cfg)
    handleErr(err)

    imageFile, err := os.Open("image.png")
    handleErr(err)

    err = tess.LoadImage(ctx, imageFile, gogosseract.LoadImageOptions{})
    handleErr(err)

    text, err = tess.GetText(ctx, func(progress int32) { log.Printf("Tesseract parsing is %d%% complete.", progress) })
    handleErr(err)
    // Closing the Tesseract instance will clean up everything used by Tesseract and it's WASM module
    handleErr(tess.Close(ctx))

Using a Pool of Tesseract workers for thread safe concurrent image parsing.

    cfg := gogosseract.Config{
        Language: "eng",
        TrainingData: trainingDataFile,
    }
    // Create 10 Tesseract instances that can process image requests concurrently.
    pool, err := gogosseract.NewPool(ctx, 10, gogosseract.PoolConfig{Config: cfg})
    handleErr(err)
    // ParseImage loads the image and waits until the Tesseract worker sends back your result.
    hocr, err := pool.ParseImage(ctx, img, gogosseract.ParseImageOptions{
        IsHOCR: true,
    })
    handleErr(err)
    // Always remember to Close the pool to release resources
    handleErr(pool.Close())