Here we discuss the structure, use and approach to code management in the repo.
There are various entry points that you can choose based on the use case. Each entry point has its own prerequisites and setup steps. The common prerequisites are:
- Python 3.10 or 3.11
- Docker/Podman
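The prerequisites above can be checked before running any setup steps. The following is a minimal sketch, not part of the repository: the function names `python_ok` and `find_container_tool` are hypothetical, and it assumes the container tools are exposed on `PATH` as `docker` or `podman`.

```python
import shutil
import sys

# Supported interpreter versions, as listed in the prerequisites above.
SUPPORTED_PYTHON = {(3, 10), (3, 11)}
# Assumption: either tool satisfies the Docker/Podman prerequisite.
CONTAINER_TOOLS = ("docker", "podman")


def python_ok(version=None):
    """Return True if the (major, minor) version is 3.10 or 3.11."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) in SUPPORTED_PYTHON


def find_container_tool(which=shutil.which):
    """Return the name of the first container tool found on PATH, or None."""
    for tool in CONTAINER_TOOLS:
        if which(tool):
            return tool
    return None


if __name__ == "__main__":
    print("Python version supported:", python_ok())
    print("Container tool found:", find_container_tool())
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the defaults inspect the running interpreter and the current `PATH`.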
Two important development tools will also be installed using the steps below:
pip install pre-commit
pip install twine
...
git clone git@github.com:IBM/data-prep-kit.git
cd data-prep-kit
pre-commit install
Please note that there are further installation steps for running the transforms in general, as documented here, and for running them on a local Kind cluster or on an existing Kubernetes cluster, as documented here.
- data_processing_lib - provides the core transform framework and library supporting data transformations in 3 runtimes
  - python
  - ray
  - spark
- transform
  - universal
    - noop
      - python
      - ray
      - spark
      - kfp_ray
    - ...
  - code
    - code_quality
      - ray
      - kfp_ray
    - ...
  - language
    - ...
- kfp - Kubeflow pipeline support
  - kfp_support_lib - Data Preparation Kit Library. KFP support
  - kfp_ray_components - Kubeflow pipeline components used in the pipelines
- scripts
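To make the transform layout in the listing above concrete, here is a small sketch that enumerates transform projects by runtime. It assumes the `<category>/<transform>/<runtime>` nesting shown in the listing. The function name `list_transform_projects` is hypothetical, and the root directory is passed in by the caller rather than assumed.

```python
from pathlib import Path

# Runtime directory names taken from the listing above.
RUNTIMES = {"python", "ray", "spark", "kfp_ray"}


def list_transform_projects(transforms_root):
    """Return sorted (category, transform, runtime) tuples under transforms_root.

    Assumes the <category>/<transform>/<runtime> nesting shown in the listing;
    directories at that depth with any other name are ignored.
    """
    root = Path(transforms_root)
    found = []
    for runtime_dir in root.glob("*/*/*"):
        if runtime_dir.is_dir() and runtime_dir.name in RUNTIMES:
            category, transform, runtime = runtime_dir.relative_to(root).parts
            found.append((category, transform, runtime))
    return sorted(found)
```

Called on the directory that holds the transform categories, this returns entries such as `("universal", "noop", "ray")` for each runtime directory that exists.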
Makefiles are used for operations performed across all projects in the directory tree. Two types of users are expected to use the Makefiles:
- administrators - perform git actions and release management
- developers - work with core libraries and transforms
Each directory has access to a make help target that shows all available targets.
Generally, administrators will issue make commands from the top of the repository to, for example, publish a new release. The top-level Makefile provides a set of targets that are executed recursively and are therefore expected to be implemented by sub-directories. These targets and their semantics, to be implemented as appropriate in the sub-directories, are as follows:
- clean - Restore the directory to as close to initial repository clone state as possible.
- build - Build all components contained in a given sub-directory. This might include pypi distributions, images, etc.
- test - Test all components contained in a given sub-directory.
- publish - Publish any components in sub-directory. This might include things published to pypi or the docker registry.
- set-versions - apply the DPK_VERSION to all published components.
Sub-directories are free to define these as empty/no-op targets, but generally are required to define them unless a parent directory does not recurse into the directory.
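Because the recursive targets listed above are expected in every sub-directory that a parent recurses into, it can help to check a directory's Makefile for them. The sketch below is illustrative only and not part of the repository. It uses a deliberately simplified regular expression that recognizes single-target rules of the form `name:` or `name::` and ignores included files, pattern rules and multi-target lines. The names `defined_targets` and `missing_targets` are hypothetical.

```python
import re
from pathlib import Path

# Targets the top-level Makefile is described as running recursively.
REQUIRED_TARGETS = ("clean", "build", "test", "publish", "set-versions")

# A rule line starts with a target name followed by ':' or '::'.
# The lookahead skips variable assignments such as 'X := 1' or 'X ::= 1'.
TARGET_LINE = re.compile(r"^([A-Za-z0-9_.-]+)\s*::?(?![:=])")


def defined_targets(makefile_text):
    """Return the set of target names declared in the Makefile text."""
    targets = set()
    for line in makefile_text.splitlines():
        match = TARGET_LINE.match(line)
        if match:
            targets.add(match.group(1))
    return targets


def missing_targets(directory):
    """Return the required targets that the directory's Makefile does not declare."""
    makefile = Path(directory) / "Makefile"
    if not makefile.is_file():
        return list(REQUIRED_TARGETS)
    declared = defined_targets(makefile.read_text())
    return [t for t in REQUIRED_TARGETS if t not in declared]
```

An empty result from `missing_targets` means every required target is at least declared; it does not tell you whether a target is a real implementation or an intentional no-op.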
Generally, developers will be working in a python project directory (e.g., data-processing-lib/python, transforms/universal/filter, etc.) and can issue the administrator's make targets (e.g., build, test, etc.) or others that might be defined locally (e.g., venv, test-image, test-src in transform projects). Key targets are as follows:
- venv - creates the virtual environment from either a pyproject.toml or requirements.txt file.
- publish - publish libraries or docker images as appropriate. This is generally only used during release generation.
If working with an IDE, one generally makes the venv, then configures the IDE to reference the venv, src and test directories.
Transform projects generally include the following project-specific targets for convenience, which are triggered by the test target.
- test-src - test python tests in the test directory
- test-image - build and test the docker image for the transform
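The relationship between test and the two targets above can be pictured as a small dependency graph. The sketch below models test-src and test-image as prerequisites of test, which is an assumption: the text only says they are triggered by test, and the actual Makefiles may wire them differently. The names `PREREQUISITES` and `run_order` are hypothetical.

```python
# Assumption: test-src and test-image are modeled as prerequisites of test.
PREREQUISITES = {"test": ["test-src", "test-image"]}


def run_order(target, prerequisites=PREREQUISITES):
    """Return targets in execution order: prerequisites first, then the target."""
    order = []
    in_progress = set()

    def visit(name):
        if name in order:
            return  # already scheduled
        if name in in_progress:
            raise ValueError(f"circular dependency at {name!r}")
        in_progress.add(name)
        for dep in prerequisites.get(name, ()):
            visit(dep)
        in_progress.discard(name)
        order.append(name)

    visit(target)
    return order
```

Under this model, requesting test schedules test-src and test-image first, while requesting test-src alone runs only that target.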
Please also consult transform project conventions for additional considerations when developing transforms.
The kfp_ray directories in the transform projects provide workflow- targets and are dedicated to handling the Kubeflow Pipelines workflows for the specified transforms.