Inspired by the original code
A small desktop utility to select a region of the screen, run OCR via OpenRouter vision models, and copy the result to the clipboard.
- `src/` - All source code packages
- `tests/` - Integration and debug test files
- Unit tests (`*_test.go`) remain with their respective packages
At a high level, the Windows app is structured as:
- `src/main`:
  - Parses flags and normalizes `--run-once`.
  - Ensures single resident instance via a TCP preflight on the configured port.
  - Loads configuration (`src/config`), enables DPI awareness, and configures logging (`src/logutil`).
  - Initializes the LLM client (`src/llm`) and performs a 1-token startup ping (blocking dialog on failure).
  - In resident mode:
    - Starts the central event loop (`src/eventloop`), the system tray (`src/tray`), and global hotkey listener (`src/hotkey`).
  - In `--run-once` mode:
    - First tries to delegate to a running resident via `src/singleinstance.Client`.
    - If no resident is available, runs a standalone capture+OCR flow.
- `src/eventloop`:
  - Owns the single-instance TCP server (`src/singleinstance.Server`).
  - Listens for:
    - Global hotkey triggers.
    - Delegated `--run-once` requests.
  - For each request:
    - Uses `src/overlay.Selector`/`src/gui` to run the interactive region selector.
    - Submits OCR work to a bounded worker pool (`src/worker` + `src/ocr` + `src/llm`).
    - Updates UI via `src/popup` and writes results via `src/clipboard`.
  - Enforces that only one OCR job runs at a time ("busy" behavior).
- `src/overlay` + `src/gui` + `src/screenshot`:
  - Implement the Windows overlay window and mouse-driven region selection.
  - Capture the selected region (multi-monitor aware) as PNG bytes for OCR.
- `src/llm` + `src/ocr`:
  - `src/ocr` captures the region and forwards it to `src/llm`.
  - `src/llm` calls the OpenRouter Chat Completions API with a strict OCR-style prompt and optional `PROVIDERS` routing.
- `src/singleinstance`:
  - TCP-based discovery and delegation so `--run-once` clients hand work to the resident when available.
- `src/tray` + `src/popup` + `src/notification`:
  - System tray icon, About/Exit menu, and small non-intrusive popups.
  - Countdown popup appears during OCR and is updated/closed when results arrive.
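
The "busy" behavior listed for `src/eventloop` can be pictured as a one-slot gate that rejects a new request while a job is in flight. The sketch below is only an illustration of that idea, not the project's actual code: `busyGate`, `TryAcquire`, and `Release` are hypothetical names.

```go
package main

import "fmt"

// busyGate is a hypothetical one-slot gate: at most one OCR job may hold it.
type busyGate struct {
	slot chan struct{}
}

func newBusyGate() *busyGate {
	return &busyGate{slot: make(chan struct{}, 1)}
}

// TryAcquire reports whether the caller obtained the slot. It never blocks,
// so a request that arrives while a job is running is rejected immediately.
func (g *busyGate) TryAcquire() bool {
	select {
	case g.slot <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees the slot. Call it exactly once per successful TryAcquire.
func (g *busyGate) Release() {
	<-g.slot
}

func main() {
	gate := newBusyGate()
	fmt.Println(gate.TryAcquire()) // true: first request starts a job
	fmt.Println(gate.TryAcquire()) // false: second request is told "busy"
	gate.Release()
	fmt.Println(gate.TryAcquire()) // true: accepted again once the job is done
}
```

A buffered channel of size one keeps the check and the claim atomic, which a plain boolean flag would not do without extra locking.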
The Linux CLI (src/cmd/cli) is a separate, GUI-free binary that reuses src/config and src/llm to run OCR on PNG input (file or stdin) and print plain-text or JSON output.
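
Both binaries reach OpenRouter through `src/llm`. As a rough sketch of what a vision request body could look like, the code below builds a Chat Completions payload with the PNG embedded as a base64 data URL. The field names follow the OpenAI-style chat schema that OpenRouter accepts; mapping `PROVIDERS` to a `provider.order` list is an assumption about how the app applies routing, and the prompt text, model name, and all Go identifiers are placeholders.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

type imageURL struct {
	URL string `json:"url"`
}

type contentPart struct {
	Type     string    `json:"type"`
	Text     string    `json:"text,omitempty"`
	ImageURL *imageURL `json:"image_url,omitempty"`
}

type message struct {
	Role    string        `json:"role"`
	Content []contentPart `json:"content"`
}

type providerPrefs struct {
	Order []string `json:"order"`
}

type chatRequest struct {
	Model    string         `json:"model"`
	Messages []message      `json:"messages"`
	Provider *providerPrefs `json:"provider,omitempty"`
}

// newOCRRequest assembles the request. The prompt below is a placeholder,
// not the app's actual OCR prompt.
func newOCRRequest(model string, png []byte, providers []string) chatRequest {
	dataURL := "data:image/png;base64," + base64.StdEncoding.EncodeToString(png)
	req := chatRequest{
		Model: model,
		Messages: []message{{
			Role: "user",
			Content: []contentPart{
				{Type: "text", Text: "Return only the text visible in the image."},
				{Type: "image_url", ImageURL: &imageURL{URL: dataURL}},
			},
		}},
	}
	// Assumption: an optional provider list becomes a routing preference.
	if len(providers) > 0 {
		req.Provider = &providerPrefs{Order: providers}
	}
	return req
}

func main() {
	req := newOCRRequest("example/vision-model", []byte("\x89PNG"), []string{"providerA", "providerB"})
	body, err := json.Marshal(req)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```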
- Go toolchain installed (`go build`) if you want to build from source; otherwise, use the `.exe` from the releases.
- Windows (current overlay/hotkey path targets Windows)
- OpenRouter API key and a vision-capable model (`:free` models are also supported, but not recommended)
A standalone CLI utility for Linux users:
# Build
cd src/cmd/cli
go build -o ocr-tool .
# Usage
./ocr-tool -file screenshot.png

See src/cmd/cli/README.md for details.
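
For orientation, the input check and output formatting of such a CLI might look like the sketch below. It is not taken from `src/cmd/cli`: `validatePNG`, `readInput`, and `formatResult` are hypothetical helpers, and the `text` JSON field name is an assumption. Only the `-file` flag shown above comes from the project.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"os"
)

// pngSignature is the fixed 8-byte header that starts every PNG file.
var pngSignature = []byte{0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'}

// validatePNG rejects input that does not start with the PNG signature.
func validatePNG(data []byte) error {
	if !bytes.HasPrefix(data, pngSignature) {
		return errors.New("input is not a PNG image")
	}
	return nil
}

// readInput reads from path, or from stdin when path is empty.
func readInput(path string, stdin io.Reader) ([]byte, error) {
	if path != "" {
		return os.ReadFile(path)
	}
	return io.ReadAll(stdin)
}

// formatResult returns the text unchanged or wrapped in a JSON object.
// The "text" field name is an assumption for this sketch.
func formatResult(text string, asJSON bool) (string, error) {
	if !asJSON {
		return text, nil
	}
	out, err := json.Marshal(map[string]string{"text": text})
	return string(out), err
}

func main() {
	// Simulate stdin carrying a PNG header so the sketch runs without a file.
	fake := bytes.NewReader(append([]byte{}, pngSignature...))
	data, err := readInput("", fake)
	if err != nil {
		panic(err)
	}
	fmt.Println(validatePNG(data) == nil) // true
	out, _ := formatResult("hello", true)
	fmt.Println(out) // {"text":"hello"}
}
```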
- Create a `.env` file in the same directory as the executable with the following required keys:
  - `OPENROUTER_API_KEY=`
  - `MODEL=` (e.g., `google/gemma-2-9b-it`)

  Alternatively, you can set each of these as an environment variable.
- Alternatively, you can point the app to a config file via an environment variable:
  - Set `SCREEN_OCR_LLM` to the full path of a `.env`-format file. If `.env` is not found in the executable directory, the app will load configuration from this path.
- You can also add these optional keys to your `.env` file to customize behavior:
  - `HOTKEY=Ctrl+Alt+q`
  - `ENABLE_FILE_LOGGING=true`
  - `PROVIDERS=providerA,providerB`
  - `OCR_DEADLINE_SEC=20` (default is 20 seconds if unset)
  - `SINGLEINSTANCE_PORT_START=49500`
  - `SINGLEINSTANCE_PORT_END=49550`
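
To make the key handling concrete, here is a minimal sketch of reading `.env`-format text and applying the documented defaults (20 seconds for `OCR_DEADLINE_SEC`, `Ctrl+Alt+q` for `HOTKEY`). It is an illustration under assumptions, not the code in `src/config`: the `Config` struct, `parseEnv`, and `buildConfig` are hypothetical, and quoted values and the `SCREEN_OCR_LLM` fallback are left out.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// Config is a hypothetical holder for a subset of the documented keys.
type Config struct {
	APIKey      string
	Model       string
	Hotkey      string
	Providers   []string
	DeadlineSec int
}

// parseEnv reads KEY=VALUE lines, skipping blanks and '#' comments.
func parseEnv(text string) map[string]string {
	vals := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		vals[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return vals
}

// buildConfig checks the required keys and fills in the documented defaults.
func buildConfig(vals map[string]string) (Config, error) {
	cfg := Config{
		APIKey:      vals["OPENROUTER_API_KEY"],
		Model:       vals["MODEL"],
		Hotkey:      "Ctrl+Alt+q",
		DeadlineSec: 20,
	}
	if cfg.APIKey == "" || cfg.Model == "" {
		return Config{}, fmt.Errorf("OPENROUTER_API_KEY and MODEL are required")
	}
	if v := vals["HOTKEY"]; v != "" {
		cfg.Hotkey = v
	}
	for _, p := range strings.Split(vals["PROVIDERS"], ",") {
		if p = strings.TrimSpace(p); p != "" {
			cfg.Providers = append(cfg.Providers, p)
		}
	}
	if v := vals["OCR_DEADLINE_SEC"]; v != "" {
		n, err := strconv.Atoi(v)
		if err != nil || n <= 0 {
			return Config{}, fmt.Errorf("invalid OCR_DEADLINE_SEC %q", v)
		}
		cfg.DeadlineSec = n
	}
	return cfg, nil
}

func main() {
	env := "OPENROUTER_API_KEY=sk-example\nMODEL=example/vision-model\nPROVIDERS=providerA,providerB\n"
	cfg, err := buildConfig(parseEnv(env))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```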
- Using Go directly:
  - On Windows (no console window): `go build -ldflags "-H=windowsgui" -o screen-ocr-llm.exe ./src/main`
  - On Linux/macOS: `go build -o screen-ocr-llm ./src/main`
- Using the Makefile (for a Windows GUI binary): `make build-windows`

  This creates a `screen-ocr-llm.exe` file that runs without a console window.
The application offers two primary modes of operation:
This is the standard mode for continuous, everyday use. The application runs quietly in the background, accessible via a system tray icon and a global hotkey.
- How to run: Execute the binary without any command-line flags: `./screen-ocr-llm.exe`
- Functionality:
  - Manages a system tray icon with "About" and "Exit" options.
  - Listens for a global hotkey (default: `Ctrl+Alt+q`) to start a screen capture.
  - After a region is selected, copies the extracted text to your clipboard and shows it in a brief popup notification.
  - Ensures that only one instance of the application is running at any time.
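
A hotkey string such as `Ctrl+Alt+q` has to be split into modifiers and a key before it can be registered with the OS. The sketch below shows one way to do that. It is not the parser in `src/hotkey`: the `Hotkey` struct and `parseHotkey` are hypothetical, and the accepted modifier names (`Ctrl`, `Alt`, `Shift`, case-insensitive) are an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// Hotkey is a hypothetical parsed form of a spec like "Ctrl+Alt+q".
type Hotkey struct {
	Ctrl, Alt, Shift bool
	Key              string
}

// parseHotkey splits the spec on '+' and requires exactly one non-modifier key.
func parseHotkey(spec string) (Hotkey, error) {
	var hk Hotkey
	for _, part := range strings.Split(spec, "+") {
		part = strings.ToLower(strings.TrimSpace(part))
		switch part {
		case "":
			return Hotkey{}, fmt.Errorf("empty segment in hotkey %q", spec)
		case "ctrl":
			hk.Ctrl = true
		case "alt":
			hk.Alt = true
		case "shift":
			hk.Shift = true
		default:
			if hk.Key != "" {
				return Hotkey{}, fmt.Errorf("hotkey %q names more than one key", spec)
			}
			hk.Key = part
		}
	}
	if hk.Key == "" {
		return Hotkey{}, fmt.Errorf("hotkey %q has no key", spec)
	}
	return hk, nil
}

func main() {
	hk, err := parseHotkey("Ctrl+Alt+q")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", hk)
}
```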
This mode is intended for single, on-demand captures initiated from the command line or within scripts.
- How to run: Execute the binary with the `--run-once` flag: `./screen-ocr-llm.exe --run-once`
- Functionality:
  - Bypasses the system tray and immediately prompts you to select a region on the screen.
  - Copies the resulting text to the clipboard.
  - Exits silently as soon as the capture and OCR process is finished.
The two modes are designed to work together intelligently to prevent conflicts and ensure smooth operation.
- When you start a new capture with `--run-once`, the application first checks if a resident instance is already running.
- If a resident instance is found, the `--run-once` process delegates the capture request to the running instance and exits. The resident application then takes over, presenting the screen selection UI.
- If no resident instance is active, the `--run-once` process will handle the capture itself in a temporary standalone mode before exiting.

This delegation mechanism ensures a stable and predictable user experience by guaranteeing that only one screen selection process can be active at a time.

- Startup validation: On launch, the app performs a minimal LLM connectivity check (1-token ping). If it fails, a blocking error dialog is shown and the app exits. In `--run-once`, if a resident is detected and the request is delegated, the client does not ping.
- High-DPI: The app enables DPI awareness and uses the full virtual screen for overlays and screenshots to work correctly on scaled multi-monitor setups.
- Logging: Controlled by `ENABLE_FILE_LOGGING`. When `false`, logs are suppressed; when `true`, logs are written to `screen_ocr_debug.log` with size-based rotation. In GUI builds, stdout/stderr are hidden, so enable file logging for diagnostics.
- Single Instance: The tool uses a loopback TCP port to enforce a single resident instance and to manage delegation from `--run-once` clients.
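
The loopback-port approach can be sketched as follows: the resident binds the first free port in the configured range, and a `--run-once` client probes the same range and hands off its request if something answers correctly. This is a simplified illustration, not the protocol used by `src/singleinstance`: the `CAPTURE`/`OK` line exchange, the 200 ms timeout, and the function names are all assumptions. The port range matches the example `SINGLEINSTANCE_PORT_START`/`SINGLEINSTANCE_PORT_END` values.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"time"
)

const ioTimeout = 200 * time.Millisecond

// listenInRange binds the first free loopback port in [start, end].
func listenInRange(start, end int) (net.Listener, int, error) {
	for port := start; port <= end; port++ {
		ln, err := net.Listen("tcp", fmt.Sprintf("127.0.0.1:%d", port))
		if err == nil {
			return ln, port, nil
		}
	}
	return nil, 0, fmt.Errorf("no free port in %d-%d", start, end)
}

// serve handles delegated requests until the listener is closed.
// The "CAPTURE"/"OK" exchange is an invented protocol for this sketch.
func serve(ln net.Listener, onCapture func()) {
	for {
		conn, err := ln.Accept()
		if err != nil {
			return // listener closed
		}
		go func(c net.Conn) {
			defer c.Close()
			c.SetDeadline(time.Now().Add(ioTimeout))
			line, err := bufio.NewReader(c).ReadString('\n')
			if err != nil || line != "CAPTURE\n" {
				return
			}
			onCapture()
			fmt.Fprint(c, "OK\n")
		}(conn)
	}
}

// tryDelegate reports whether a resident in the range acknowledged the request.
// Ports held by unrelated programs are skipped because they do not reply "OK".
func tryDelegate(start, end int) bool {
	for port := start; port <= end; port++ {
		conn, err := net.DialTimeout("tcp", fmt.Sprintf("127.0.0.1:%d", port), ioTimeout)
		if err != nil {
			continue
		}
		conn.SetDeadline(time.Now().Add(ioTimeout))
		fmt.Fprint(conn, "CAPTURE\n")
		reply, err := bufio.NewReader(conn).ReadString('\n')
		conn.Close()
		if err == nil && reply == "OK\n" {
			return true
		}
	}
	return false
}

func main() {
	ln, port, err := listenInRange(49500, 49550)
	if err != nil {
		fmt.Println("resident could not start:", err)
		return
	}
	defer ln.Close()
	go serve(ln, func() { fmt.Println("resident: starting region selection") })
	fmt.Println("resident listening on port", port)
	fmt.Println("delegated:", tryDelegate(49500, 49550))
}
```

When `tryDelegate` returns `false`, the client would fall back to the standalone capture flow described above.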