Releases · bghira/CaptionFlow
v0.4.2 - config parser + local filesystem processor fixes
What's Changed
- add warning + fallback for wrong orchestrator config layout by @bghira in #59
- attempt to resolve condition where local filesystem processor recaptions successful images by @bghira in #60
Full Changelog: v0.4.1...v0.4.2
v0.4.1 - bugfixes for export
What's Changed
- handle KeyError on reload for captionworker by @bghira in #57
- resolve error during export, unexpected arguments by @bghira in #58
Full Changelog: v0.4.0...v0.4.1
v0.4.0 - migration to Lance format
What's Changed
- use webshart caching iterator helper to avoid blocking by @bghira in #41
- worker split cache by @bghira in #42
- fix gpu_id based subdir by @bghira in #43
- use pylance to export instead of pandas by @bghira in #44 (see the reading sketch after this list)
- hf url dataset should use relative indexing by @bghira in #45
- cleanup examples by @bghira in #46
- add pytest-asyncio by @bghira in #47
- add tests for captionworker by @bghira in #48
- storage manager: use pylance for more optimised appends by @bghira in #50
- use correct starting index for chunks, add regression tests by @bghira in #51
- config reload drops auth by @bghira in #52
- fix tests taking forever by @bghira in #53
- test coverage improvements by @bghira in #54
- mark bad chunks by @bghira in #55
- add auth subcommand to cli module for managing tokens by @bghira in #56
Full Changelog: v0.3.4...v0.4.0
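Since v0.4.0 the exported caption data is stored in the Lance format and written with the pylance package rather than through pandas. Below is a minimal sketch of opening such an export with pylance; the dataset path and column names are illustrative assumptions, not CaptionFlow's actual export layout.

```python
# Sketch: reading a Lance-format export with the pylance package.
# "captions.lance" is an assumed path for illustration; inspect the real
# export for its actual location and schema.
import lance

ds = lance.dataset("captions.lance")  # open the Lance dataset
print(ds.schema)                      # list the stored columns

# Stream record batches instead of materialising the whole table at once,
# which is part of why Lance-based appends and exports scale better than a
# pandas round-trip.
for batch in ds.to_batches():
    df = batch.to_pandas()            # pyarrow.RecordBatch -> pandas.DataFrame
    print(df.head())
```

Reading in batches keeps memory flat even for large caption sets; `ds.to_table()` is also available when the export fits comfortably in RAM.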
v0.3.4 - even more scalability
What's Changed
Full Changelog: v0.3.3...v0.3.4b
v0.3.3
v0.3.2 - fix for webdataset caption job resumption
What's Changed
Full Changelog: v0.3.1...v0.3.2
v0.3.1 - lightweight captionworker
v0.3.0 - massive memory and throughput improvements
- Reimplemented the Hugging Face processor with a focus on memory reduction and throughput saturation; it can hit 5,000 captions/sec.
- Reimplemented the WebDataset processor to use the webshart library for a major throughput boost and reduced memory use, thanks to the spicy Rust implementation.
Overall, the orchestrator and worker each use about 0.5 GiB of memory to run, as opposed to several GiB previously.
Added a mock_results mode for the dataset loaders and caption generator to assist in rapid development iteration.
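The sketch below shows the general shape of such a mock mode; the class and option layout are hypothetical, and only the mock_results name comes from the release note.

```python
# Hypothetical sketch of a mock_results-style mode: when the flag is set, the
# caption generator returns canned captions instead of running model inference,
# so loader and storage code can be iterated on quickly without a GPU.
from dataclasses import dataclass


@dataclass
class MockableCaptionGenerator:
    mock_results: bool = False  # option name taken from the release note

    def caption(self, image_key: str) -> str:
        if self.mock_results:
            # Deterministic placeholder output for fast development iteration.
            return f"mock caption for {image_key}"
        raise NotImplementedError("real model inference would run here")


# Usage: enable the flag to exercise the pipeline end to end with fake outputs.
gen = MockableCaptionGenerator(mock_results=True)
print(gen.caption("shard-0001/000123.jpg"))
```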
What's Changed
Full Changelog: v0.2.4...v0.3.0
v0.2.4 - dataset viewer and export
What's Changed
- feature: add storage export subcommand by @bghira in #34
- add urwid based dataset viewer that uses term-image to render by @bghira in #35
Full Changelog: v0.2.3...v0.2.4
v0.2.3 - local file support, refactored storage backend, job distribution
What's Changed
- remove hf shardwise dataset support by @bghira in #25
- Refactor webdataset dataloader abstraction by @bghira in #26
- remove duplicate assignment; apply more consistent usage of dataclasses by @bghira in #27
- add rate tracking log outputs to the storage subsystem by @bghira in #28
- eliminate re-processing of samples that were already processed by a disconnecting worker by @bghira in #29
- state tracking fixes for worker & workunit tracker by @bghira in #30
- huggingface URL dataset processor v2 by @bghira in #31
- local dataset processor by @bghira in #32
- simplify schema management by @bghira in #33
Full Changelog: v0.2.2...v0.2.3