scalable, fault-tolerant vLLM-powered image captioning.
a fast websocket-based orchestrator paired with lightweight gpu workers, built for high-throughput batched caption generation through vLLM.
- orchestrator: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
- workers (vLLM): connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
- config-driven: all components read YAML config; flags can override.
no conda. just `venv` + `pip`.
python -m venv .venv
source .venv/bin/activate # windows: .venv\Scripts\activate
pip install caption-flow
- copy + edit the sample configs
cp examples/orchestrator/local_image_files.yaml my-orchestrator.yaml
cp examples/worker.yaml my-worker.yaml
cp examples/monitor.yaml my-monitor.yaml # optional terminal interface
set a unique shared token in both `my-orchestrator.yaml` and `my-worker.yaml` (see `auth.worker_tokens` in the orchestrator config and `worker.token` in the worker config).
if you use private hugging face datasets/models, export `HUGGINGFACE_HUB_TOKEN` before starting anything.
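for reference, the shared token might be wired up like this (an illustrative excerpt only; `auth.worker_tokens` entries may be richer than plain strings, so follow the sample configs for the exact schema):

# my-orchestrator.yaml (illustrative excerpt)
auth:
  worker_tokens:
    - "my-shared-secret"      # hand this value to each worker

# my-worker.yaml (illustrative excerpt)
worker:
  token: "my-shared-secret"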
- start the orchestrator
caption-flow orchestrator --config my-orchestrator.yaml
- start one or more vLLM workers
# gpu 0 on the same host
caption-flow worker --config my-worker.yaml --gpu-id 0
# your second GPU
caption-flow worker --config my-worker.yaml --gpu-id 1
# on a remote host
caption-flow worker --config my-worker.yaml --server ws://your.hostname.address:8765
- (optional) start the monitor
caption-flow monitor --config my-monitor.yaml
- export the data
% caption-flow export --help
Usage: caption-flow export [OPTIONS]

  Export caption data to various formats.

Options:
  --format [jsonl|json|csv|txt|huggingface_hub|all]
      Export format (default: jsonl)
      - jsonl: creates a JSON Lines file at the specified --output path
      - csv: exports CSV-compatible data columns to the --output path (metadata is incomplete)
      - json: creates a .json file for each sample inside the --output subdirectory with complete metadata; useful for webdatasets
      - txt: creates a .txt file for each sample inside the --output subdirectory containing ONLY captions
      - huggingface_hub: creates a dataset on Hugging Face Hub, optionally --private and --nsfw where necessary
      - all: creates all export formats in the specified --output directory
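a typical invocation, using the `--format` and `--output` options shown above, might look like:

# export all collected captions to a single JSON Lines file
caption-flow export --format jsonl --output ./captions.jsonl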
- websocket server (default `0.0.0.0:8765`) with three client roles: workers, data-feeders, and admin.
- dataset control: the orchestrator centrally defines the dataset (`huggingface` or `local`) and version/name. it chunk-slices shards and assigns work.
- data serving to remote workers: local files can be captioned by remote workers that don't have access to the same files, automatically.
- vLLM config broadcast: model, tp size, dtype, max seq len, memory targets, batching, sampling params, and inference prompts are all pushed to workers; workers can apply many changes without a model reload (see the sketch after this list).
- storage + checkpoints: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don’t double-work.
- auth: token lists for `worker`, `monitor`, and `admin` roles.
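to make this concrete, here is a rough sketch of the orchestrator settings described above; `auth.worker_tokens`, `vllm.batch_size`, and `vllm.inference_prompts` come from this README, while the other field names (and the model id) are illustrative guesses, so treat the files in `examples/` as the source of truth:

# my-orchestrator.yaml (illustrative sketch, not the authoritative schema)
auth:
  worker_tokens:
    - "my-shared-secret"                 # matches worker.token on each worker
vllm:
  model: Qwen/Qwen2-VL-2B-Instruct       # hypothetical model id
  batch_size: 8                          # tune for your VRAM
  inference_prompts:                     # one caption is generated per prompt
    - "describe this image in detail."
    - "write a short alt-text caption."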
- one process per gpu. select the device with `--gpu-id` (or `worker.gpu_id` in YAML).
- gets its marching orders from the orchestrator: dataset info, model, prompts, batch size, and sampling.
- resilient: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
- batched generate(): images are resized down for consistent batching; each image can get multiple captions (one per prompt).
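a minimal worker config sketch, assuming the `worker.token` and `worker.gpu_id` keys mentioned in this README; the `server` key name is a guess (the `--server` flag definitely exists), so check `examples/worker.yaml` for the real schema:

# my-worker.yaml (illustrative sketch)
worker:
  token: "my-shared-secret"      # must match an entry in the orchestrator's auth.worker_tokens
  gpu_id: 0                      # or override with --gpu-id at launch
server: ws://localhost:8765      # hypothetical key; the --server flag overrides it either way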
- hugging face hub or local URL-list datasets compatible with the `datasets` library
- webdataset shards containing full image data; these can also be hosted on the hub
- a local folder filled with images; the orchestrator will serve the data to workers
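as a rough illustration of how the orchestrator's dataset definition might look for the local-folder case (every field name below is hypothetical; `examples/orchestrator/local_image_files.yaml` shows the real layout):

# illustrative dataset stanza for the orchestrator config (hypothetical field names)
dataset:
  type: local                 # or: huggingface
  path: /data/my-images       # folder of images, served to remote workers over HTTP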
for any component, the CLI looks for config in this order (first match wins):
- `--config /path/to/file.yaml`
- `./<component>.yaml` (current directory)
- `~/.caption-flow/<component>.yaml`
- `$XDG_CONFIG_HOME/caption-flow/<component>.yaml`
- `/etc/caption-flow/<component>.yaml`
- any `$XDG_CONFIG_DIRS` entries under `caption-flow/`
- `./examples/<component>.yaml` (fallback)
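for example, dropping a per-user config into `~/.caption-flow/` lets you omit `--config` entirely:

# picked up automatically via ~/.caption-flow/<component>.yaml
mkdir -p ~/.caption-flow
cp my-worker.yaml ~/.caption-flow/worker.yaml
caption-flow worker --gpu-id 0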
use the built-in helpers during development:
# self-signed certs for quick local testing
caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
# inspect any certificate file
caption-flow inspect_cert ./certs/fullchain.pem
then point the orchestrator at the resulting cert/key (or run `--no-ssl` for dev-only `ws://`).
- multi-gpu: start one worker process per gpu (set `--gpu-id` or `worker.gpu_id`).
- throughput: tune `vllm.batch_size` in the orchestrator config (or override with `--batch-size` at worker start). higher isn’t always better; watch VRAM.
- prompts: add more strings under `vllm.inference_prompts` to get multiple captions per image; the worker returns only non-empty generations.
- private HF: if your dataset/model needs auth, export `HUGGINGFACE_HUB_TOKEN` before `caption-flow worker ...`.
- self-signed ssl: pass `--no-verify-ssl` to workers/monitors in dev.
- recovery: if you hard-crash mid-run, `caption-flow scan_chunks --fix` can reset abandoned chunks so the orchestrator can reissue them cleanly.
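putting a few of these together, a worker launch against a gated hugging face model with a per-worker batch-size override might look like:

# token and batch size are placeholders; both flags are described above
export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxxxxx
caption-flow worker --config my-worker.yaml --gpu-id 0 --batch-size 8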
- hot config reload via the admin websocket path.
- dedicated data-feeder clients (separate from gpu workers) that push samples into the orchestrator.
- richer monitor TUI.
PRs welcome. keep it simple and fast.
┌─────────────┐      WebSocket     ┌─────────────┐
│   Worker    │◄──────────────────►│             │
│             │                    │             │     ┌──────────────┐
│             │◄───────────────────│             │────►│Arrow/Parquet │
└─────────────┘  HTTP (img data)   │ Orchestrator│     │   Storage    │
                                   │             │     └──────────────┘
┌─────────────┐                    │             │
│   Worker    │◄──────────────────►│             │
│             │                    │             │
│             │◄───────────────────│             │
└─────────────┘  HTTP (img data)   └─────────────┘
                                          ▲
┌─────────────┐                           │
│   Monitor   │◄──────────────────────────┘
└─────────────┘
To contribute compute to a cluster:
- Install caption-flow:
pip install caption-flow
- Get a worker token from the project maintainer
- Run:
caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN
Your contributions will be tracked and attributed in the final dataset!
AGPLv3