- Many bug fixes across the repo, plus the following specifics.
- Enhanced CI/CD and Makefile improvements, including definition of top-level targets (clean, set-versions, build, publish, test)
- Automation of release process branch/tag management
- Documentation improvements
- Split libraries into 3 runtime-specific implementations
- Fix missing final count of processed and add percentages
- Improved fault tolerance in Python and Ray runtimes
- Report global DataAccess retry metric
- Support for binary data transforms
- Updated Ray version to 2.24
- Updated to PyArrow version 16.1.0
- Add KFP V2 support
- Create a distinct (timestamped) execution.log file for each retry
- Support for multiple inputs/outputs
- Added language/lang_id - detects language in documents
- Added universal/profiler - counts words/tokens in documents
- Converted the ingest2parquet tool to a transform named code2parquet
- Split transforms, as appropriate, into Python, Ray and/or Spark
- Added Spark implementations of filter, doc_id and noop transforms
- Switched from requirements.txt to a pyproject.toml file for each transform runtime
- Repository restructured to move KFP workflow definitions into the associated transform project directories