scalable, fault-tolerant vLLM-powered image captioning.
a fast websocket-based orchestrator paired with lightweight gpu workers, built for high-throughput batched caption generation through vLLM.
- orchestrator: hands out work in chunked shards, collects captions, checkpoints progress, and keeps simple stats.
- workers (vLLM): connect to the orchestrator, stream in image samples, batch them, and generate 1..N captions per image using prompts supplied by the orchestrator.
- config-driven: all components read YAML config; flags can override.
no conda. just `venv` + `pip`.
python -m venv .venv
source .venv/bin/activate # windows: .venv\Scripts\activate
pip install caption-flow
- copy + edit the sample configs
cp examples/orchestrator/local_image_files.yaml my-orchestrator.yaml
cp examples/worker.yaml my-worker.yaml
cp examples/monitor.yaml my-monitor.yaml # optional terminal interface
set a unique shared token in both `my-orchestrator.yaml` and `my-worker.yaml` (see `auth.worker_tokens` in the orchestrator config and `worker.token` in the worker config).
if you use private hugging face datasets/models, export `HUGGINGFACE_HUB_TOKEN` before starting anything.
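for reference, the shared token might be wired up like this (an illustrative excerpt only; `auth.worker_tokens` entries may be richer than plain strings, so follow the sample configs for the exact schema):

# my-orchestrator.yaml (illustrative excerpt)
auth:
  worker_tokens:
    - "my-shared-secret"      # hand this value to each worker

# my-worker.yaml (illustrative excerpt)
worker:
  token: "my-shared-secret"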
- start the orchestrator
caption-flow orchestrator --config my-orchestrator.yaml
- start one or more vLLM workers
# gpu 0 on the same host
caption-flow worker --config my-worker.yaml --gpu-id 0
# your second GPU
caption-flow worker --config my-worker.yaml --gpu-id 1
# on a remote host
caption-flow worker --config my-worker.yaml --server ws://your.hostname.address:8765
- (optional) start the monitor
caption-flow monitor --config my-monitor.yaml
- export the data
% caption-flow export --help
Usage: caption-flow export [OPTIONS]

  Export caption data to various formats.

Options:
  --format [jsonl|json|csv|txt|huggingface_hub|all]
      Export format (default: jsonl)
      - jsonl: creates a JSON Lines file at the specified --output path
      - csv: exports CSV-compatible data columns to the --output path (metadata is incomplete)
      - json: creates a .json file for each sample inside the --output subdirectory with complete metadata; useful for webdatasets
      - txt: creates a .txt file for each sample inside the --output subdirectory containing ONLY captions
      - huggingface_hub: creates a dataset on Hugging Face Hub, optionally --private and --nsfw where necessary
      - all: creates all export formats in the specified --output directory
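a typical invocation, using the `--format` and `--output` options shown above, might look like:

# export all collected captions to a single JSON Lines file
caption-flow export --format jsonl --output ./captions.jsonl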
- websocket server (default `0.0.0.0:8765`) with three client roles: workers, data-feeders, and admin.
- dataset control: the orchestrator centrally defines the dataset (`huggingface` or `local`) and version/name. it chunk-slices shards and assigns work.
- data serving to remote workers: local files can be captioned by remote workers that don't have access to the same files, automatically.
- vLLM config broadcast: model, tp size, dtype, max seq len, memory targets, batching, sampling params, and inference prompts are all pushed to workers; workers can apply many changes without a model reload (see the sketch after this list).
- storage + checkpoints: captions buffer to disk with periodic checkpoints. chunk state is tracked so restarts don’t double-work.
- auth: token lists for `worker`, `monitor`, and `admin` roles.
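to make this concrete, here is a rough sketch of the orchestrator settings described above; `auth.worker_tokens`, `vllm.batch_size`, and `vllm.inference_prompts` come from this README, while the other field names (and the model id) are illustrative guesses, so treat the files in `examples/` as the source of truth:

# my-orchestrator.yaml (illustrative sketch, not the authoritative schema)
auth:
  worker_tokens:
    - "my-shared-secret"                 # matches worker.token on each worker
vllm:
  model: Qwen/Qwen2-VL-2B-Instruct       # hypothetical model id
  batch_size: 8                          # tune for your VRAM
  inference_prompts:                     # one caption is generated per prompt
    - "describe this image in detail."
    - "write a short alt-text caption."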
- one process per gpu. select the device with `--gpu-id` (or `worker.gpu_id` in YAML).
- gets its marching orders from the orchestrator: dataset info, model, prompts, batch size, and sampling.
- resilient: detects disconnects, abandons the current chunk cleanly, clears queues, reconnects, and resumes.
- batched generate(): images are resized down for consistent batching; each image can get multiple captions (one per prompt).
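a minimal worker config sketch, assuming the `worker.token` and `worker.gpu_id` keys mentioned in this README; the `server` key name is a guess (the `--server` flag definitely exists), so check `examples/worker.yaml` for the real schema:

# my-worker.yaml (illustrative sketch)
worker:
  token: "my-shared-secret"      # must match an entry in the orchestrator's auth.worker_tokens
  gpu_id: 0                      # or override with --gpu-id at launch
server: ws://localhost:8765      # hypothetical key; the --server flag overrides it either way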
- hugging face hub or local URL-list datasets compatible with the `datasets` library
- webdataset shards containing full image data; these can also be hosted on the hub
- a local folder filled with images; the orchestrator will serve the data to workers
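as a rough illustration of how the orchestrator's dataset definition might look for the local-folder case (every field name below is hypothetical; `examples/orchestrator/local_image_files.yaml` shows the real layout):

# illustrative dataset stanza for the orchestrator config (hypothetical field names)
dataset:
  type: local                 # or: huggingface
  path: /data/my-images       # folder of images, served to remote workers over HTTP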
for any component, the CLI looks for config in this order (first match wins):
- `--config /path/to/file.yaml`
- `./<component>.yaml` (current directory)
- `~/.caption-flow/<component>.yaml`
- `$XDG_CONFIG_HOME/caption-flow/<component>.yaml`
- `/etc/caption-flow/<component>.yaml`
- any `$XDG_CONFIG_DIRS` entries under `caption-flow/`
- `./examples/<component>.yaml` (fallback)
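for example, dropping a per-user config into `~/.caption-flow/` lets you omit `--config` entirely:

# picked up automatically via ~/.caption-flow/<component>.yaml
mkdir -p ~/.caption-flow
cp my-worker.yaml ~/.caption-flow/worker.yaml
caption-flow worker --gpu-id 0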
use the built-in helpers during development:
# self-signed certs for quick local testing
caption-flow generate_cert --self-signed --domain localhost --output-dir ./certs
# inspect any certificate file
caption-flow inspect_cert ./certs/fullchain.pem
then point the orchestrator at the resulting cert/key (or run `--no-ssl` for dev-only `ws://`).
- multi-gpu: start one worker process per gpu (set `--gpu-id` or `worker.gpu_id`).
- throughput: tune `vllm.batch_size` in the orchestrator config (or override with `--batch-size` at worker start). higher isn’t always better; watch VRAM.
- prompts: add more strings under `vllm.inference_prompts` to get multiple captions per image; the worker returns only non-empty generations.
- private HF: if your dataset/model needs auth, export `HUGGINGFACE_HUB_TOKEN` before `caption-flow worker ...`.
- self-signed ssl: pass `--no-verify-ssl` to workers/monitors in dev.
- recovery: if you hard-crash mid-run, `caption-flow scan_chunks --fix` can reset abandoned chunks so the orchestrator can reissue them cleanly.
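putting a few of these together, a worker launch against a gated hugging face model with a per-worker batch-size override might look like:

# token and batch size are placeholders; both flags are described above
export HUGGINGFACE_HUB_TOKEN=hf_xxxxxxxxxxxx
caption-flow worker --config my-worker.yaml --gpu-id 0 --batch-size 8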
- hot config reload via the admin websocket path.
- dedicated data-feeder clients (separate from gpu workers) that push samples into the orchestrator.
- richer monitor TUI.
PRs welcome. keep it simple and fast.
┌─────────────┐      WebSocket     ┌─────────────┐
│   Worker    │◄──────────────────►│             │
│             │                    │             │     ┌──────────────┐
│             │◄───────────────────│             │────►│Arrow/Parquet │
└─────────────┘  HTTP (img data)   │ Orchestrator│     │   Storage    │
                                   │             │     └──────────────┘
┌─────────────┐                    │             │
│   Worker    │◄──────────────────►│             │
│             │                    │             │
│             │◄───────────────────│             │
└─────────────┘  HTTP (img data)   └─────────────┘
                                          ▲
┌─────────────┐                           │
│   Monitor   │◄──────────────────────────┘
└─────────────┘
To contribute compute to a cluster:
- Install caption-flow:
pip install caption-flow
- Get a worker token from the project maintainer
- Run:
caption-flow worker --server wss://project.domain.com:8765 --token YOUR_TOKEN
Your contributions will be tracked and attributed in the final dataset!
AGPLv3