v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)
Summary: 200+ files changed, with 18,535 additions and 3,720 deletions.
🔧 Major Refactors & Improvements
- 🔄 Sandbox Usability (#686):
  - Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
  - Includes the InternVL example as a showcase.
- 📘 DJ-Doc Redesign (#675):
  - Now with multilingual support (English / Chinese) and a modernized style.
- 📦 Dependency Management Update (#660, #680):
  - Migrated to uv for faster dependency resolution.
  - Added sub-groups for better organization.
🌍 New Features & Integrations (#683, #688, #692)
- 🆕 Additional Repo Supported:
  - Trinity-RFT now supported by Data-Juicer.
- 📜 DJ-Awesome-List:
  - A survey paper accepted by TPAMI'25!
- 🧪 Synthetic Benchmark Added:
  - DetailMaster – a new benchmark for synthetic data evaluation.
- 🛠️ New Operators Introduced (#673, #701):
  - llm_analysis_filter
  - general_field_filter
🚀 Core Optimizations & Bug Fixes
- ✅ Ray Executor Enhancements (#697):
  - File extension detection added.
  - Support for more data formats.
- ⏱️ Startup Time Optimization:
  - Improved startup performance. (#684)
- 🧠 Text Embedding Support:
  - Added support for text embedding via API and local model. (#681)
- 🐳 Docker Build Improvement:
  - Ignore installed distutils libraries during Docker image building. (#668)
- 🛠️ Mapper Module Fix:
  - Fixed issue with module initialization. (#700)
- 🗑️ Warning Suppression:
  - Suppressed unnecessary warnings from fasttext. (#696)
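
The file extension detection listed under Ray Executor Enhancements can be pictured with a small sketch. It is illustrative only: the function name `detect_data_format`, the extension-to-format mapping, and the fallback to a default format are all assumptions made for this example, not Data-Juicer's actual implementation or API.

```python
from pathlib import Path

# Hypothetical mapping from file extension to a loader format name.
# The formats actually supported by the Ray executor may differ.
EXTENSION_TO_FORMAT = {
    ".json": "json",
    ".jsonl": "json",
    ".csv": "csv",
    ".tsv": "csv",
    ".parquet": "parquet",
    ".txt": "text",
}


def detect_data_format(path: str, default: str = "json") -> str:
    """Guess a data format from the file extension (illustrative only)."""
    # Compare case-insensitively so "train.JSONL" and "train.jsonl" match.
    suffix = Path(path).suffix.lower()
    # Fall back to the caller-supplied default for unknown or missing extensions.
    return EXTENSION_TO_FORMAT.get(suffix, default)


if __name__ == "__main__":
    for name in ["data/train.JSONL", "data/part-0001.parquet", "data/noext"]:
        print(name, "->", detect_data_format(name))
```

A lookup table keyed on the lowercased suffix keeps the sketch easy to extend: supporting another format would only require one more entry in the mapping.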