Releases: instructlab/sdg
v0.2.2
What's Changed
- Add data checkpointing capability by @shivchander, @derekhiggins, @markmc in #222
- Remove calls to logging.basicConfig on import by @tiran, @markmc in #194
- Add gen_kwargs support for ConditionalLLMBlock by @derekhiggins in #221
- tests: Add unit tests for taxonomy and model family by @hickeyma in #188
Full Changelog: v0.2.1...v0.2.2
v0.2.1
What's Changed
Full Changelog: v0.2.0...v0.2.1
New Features ✨
- Introduce a way to mix generated datasets before sending to training by @shivchander @khaledsulayman @abhi1092 @aakankshaduggal @bbrowning @markmc in #163 #215
- Introduce data mixing recipe yaml files by @shivchander @khaledsulayman @abhi1092 @aakankshaduggal @bbrowning @markmc in #203
- Add 4 new pipeline blocks by @abhi1092 @shivchander @derekhiggins @markmc in #182
- Generate data for model evaluation using the MMLU benchmark by @shivchander @abhi1092 @aakankshaduggal @derekhiggins @markmc in #180 #212 #209 #193
Fixes 🐛
- Remove temporary e2e hack to use knowledge v3 PR by @markmc in #187
- Remove sys_prompt from contexts.yaml by @shivchander @derekhiggins in #189
- Move Block._validate to llmblock by @abhi1092 @derekhiggins in #191
- generate_data: introduce argument
client
to replace 6 others by @makelinux @tiran in #114 - Fix logging string formatting by @derekhiggins in #197
- Add utility function to convert from Pandas dataframe to Hugging Face dataset by @hickeyma in #199
- Update ConditionalLLMBlock's config_paths schema by @derekhiggins in #211
- Move system pipelines to /usr/share/instructlab/sdg/pipelines by @markmc in #214
New Contributors
- @bbrowning made their first contribution in #163
- @hickeyma made their first contribution in #199
v0.2.0
⚠️ Introducing v3 knowledge format - no backwards compat for v1/v2 ⚠️
The newly introduced v3 knowledge format is incompatible with the previous v1 and v2 formats. As a result, all existing knowledge contributions must be re-formatted to comply with the v3 specifications.
For detailed information and guidelines on how to re-format your contributions, please refer to the issue discussion on GitHub.
What's Changed
- Add v3 knowledge schema support by @abhi1092, @shivchander, @aakankshaduggal, @russellb, @markmc, @derekhiggins in #161
Full Changelog: v0.1.3...v0.2.0
v0.1.3
What's Changed
- Add a YAML based file format for pipelines by @markmc in #86
- llmblock: Set a more reasonable default for num_tokens by @russellb in #125
- pipeline: Fail explicitly on an empty dataset by @russellb in #127
- Automate validation of pipeline configs by @russellb in #132
- Update grounded_skills.yaml to add seed value by @aakankshaduggal in #137
- ci: Run lint job if pipeline configs change by @russellb in #140
- Set gen_kwargs['n'] dynamically in the simple pipelines by @russellb in #144
- Add
model_prompt
config param for LLMBlock by @russellb in #141 - filterblock: add default_value for use with convert_dtype by @markmc in #143
- Export public APIs in top-level package by @tiran in #73
- Indent simple pipeline "principle" content by @derekhiggins in #150
- Move
gen_kwargs
down toLLMBlock
by @markmc in #146 - ci: run e2e on pipeline config related changes by @russellb in #151
- importblock: resolve circular import issue by @markmc in #153
- Remove unused requirements by @russellb in #152
- Drop
__index_level_0__
columns by @aakankshaduggal in #142 - Fix SamplePopulatorBlock by @markmc in #156
- Block Name In Errors by @gabe-l-hart in #155
- Load custom pipelines from shared data dir by @derekhiggins in #166
- LLMBlock concurrency by @gabe-l-hart in #157
New Contributors
- @tiran made their first contribution in #73
- @derekhiggins made their first contribution in #150
- @gabe-l-hart made their first contribution in #155
Full Changelog: v0.1.2...v0.1.3
v0.1.2
What's Changed
- Update messages datafile extension to be .jsonl by @Maxusmusti in #115
New Contributors
- @Maxusmusti made their first contribution in #115
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- Update generate_data.py to capture context key by @aakankshaduggal in #98
- Add CI workflow that runs the full SDG pipeline by @russellb in #93
- Remove two files that are now unused by @russellb in #104
- Batch support with vllm by @aakankshaduggal in #105
- converts dataset format messages required for training by @oindrillac in #94
Full Changelog: v0.1.0...v0.1.1
v0.1.0
This version introduces an effective rewrite of the library. There is a simple
pipeline aimed at maintaining compatibility with small environments supported by the ilab
CLI. There is also a new full
pipeline that is much more extensive and can produce higher quality results for environments capable of running it, along with the required teacher model, Mixtral-8x7b-instruct.
What's Changed
- Update e2e config to optimize pip caching by @nathan-weinberg in #44
- github: Automate some labels with mergify by @russellb in #40
- Add SDG library code by @shivchander, @aakankshaduggal, @oindrillac, et. al. in #42
- 📚 Adding Knowledge llm blocks by @abhi1092 in #50
- e2e: Fix permissions error by @russellb in #51
- Initial CLI integration with new SDG interfaces by @russellb in #46
- Fix dataset formatting for pipeline differences by @russellb in #57
- updates to grounded flow by @oindrillac, @shivchanderm, @oindrillac in #53
- e2e: Only run one job at a time for a given PR by @russellb in #68
- Fix prompt file paths for an installed library by @russellb in #67
- Resolve some trivial TODOs in generate_data() by @markmc in #74
- Fix mismatch in full pipeline outputs by @russellb in #75
- Updated chunking_document. by @PalmPalm7 in #65
- Handle type conversion errors in FilterByValueBlock by @russellb in #78
- Make SynthSkillsFlow honor the num_iters parameter by @russellb in #82
- Bump actions/download-artifact from 4.1.7 to 4.1.8 by @dependabot in #91
- Drop remaining import from main instructlab package by @russellb in #89
- generate_data: Fix check for
output
in results by @russellb in #71 - generate_data: fix support for multiple leaf nodes by @russellb in #85
- Allow FilterByValueBlock to handle one or many values by @russellb in #81
- Bump pypa/gh-action-pypi-publish from 1.8.14 to 1.9.0 by @dependabot in #24
- iterblock: remove duplicate line of code by @russellb in #83
New Contributors
- @shivchander made their first contribution in #42
- @aakankshaduggal made their first contribution in #42
- @abhi1092 made their first contribution in #50
- @oindrillac made their first contribution in #53
- @markmc made their first contribution in #74
- @PalmPalm7 made their first contribution in #65
Full Changelog: v0.0.4...v0.1.0
v0.0.4.1
Full Changelog: v0.0.4...v0.0.4.1
v0.0.4
v0.0.3
What's Changed
- Add e2e test workflow by @russellb in #33
- offshoot gen_test_data() from very long generate_data() by @makelinux in #15
- Add py.typed marker file by @russellb in #32
- Move some code from instructlab.utils by @russellb in #35
New Contributors
- @makelinux made their first contribution in #15
Full Changelog: v0.0.2...v0.0.3