From eaa235b8c927b27c1531847d76797ea60945295f Mon Sep 17 00:00:00 2001
From: Kelly A
Date: Thu, 17 Aug 2023 16:03:32 -0400
Subject: [PATCH] add cluster provisioning ADR

Signed-off-by: Kelly A
---
 docs/adrs/001-cluster-provisioning.md | 49 +++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 docs/adrs/001-cluster-provisioning.md

diff --git a/docs/adrs/001-cluster-provisioning.md b/docs/adrs/001-cluster-provisioning.md
new file mode 100644
index 0000000..e645553
--- /dev/null
+++ b/docs/adrs/001-cluster-provisioning.md
@@ -0,0 +1,49 @@
+# ADR 001: Ray Cluster Provisioning
+
+The current implementation of this module assumes that a Ray cluster already exists and is managed by the user. Each training is submitted to this pre-existing Ray cluster as a job, and the storage of training job state is delegated entirely to that cluster.
+
+### Current state: Assume existing Ray cluster
+
+Pros:
+
+* Simple (from the perspective of this module, at least)
+* Platform agnostic, i.e. the Ray cluster could be a user's laptop, KubeRay, an AWS offering, etc.
+
+Cons:
+
+* Shifts complexity to the user
+* A static Ray cluster could be squatting on precious GPUs when not actively training
+
+Possible mitigating enhancements:
+
+* Add cluster creation and management to an operator such as the Caikit operator.
+* Have clusters autoscale in order to minimize resource wastage. (We need to test how well this works in practice.)
+
+### Option: Spin up a new Ray cluster for every training
+
+Pros:
+* The size and resources of the cluster are fully customizable per job
+* Could be useful for very large training/tuning jobs
+
+Cons:
+* Caikit-ray-backend ceases to be platform agnostic, as it will have to embed logic for creating Ray clusters in one or more platform-specific ways (e.g. K8s, AWS, GCP, etc.)
+* We will need to add state management to the caikit ray module to keep pointers to the various Ray clusters
+* Ray clusters must be explicitly deleted. They will squat on precious resources like GPUs until they are deleted, and once a Ray cluster is deleted, information about the job run is lost with it. If we want to persist that information, it must be stored in the Caikit Ray backend's own data store.
+* Total overkill for things like single-GPU / single-node prompt tuning jobs
+
+
+## Decision
+
+Since our current use cases are `caikit-nlp` tuning jobs that run within one node, we will assume a **pre-existing** Ray cluster. Operators that install caikit into a K8s environment can take responsibility for pre-creating the Ray cluster for use by Caikit tuning jobs.
+
+
+## Status
+
+Approved
+
+
+## Consequences
+
+* caikit-ray-backend remains platform agnostic
+* If or when caikit runs very large multi-node tuning jobs, we may have to revisit this decision
+* The responsibility of creating Ray clusters falls on the user of Caikit (and/or future operators in the K8s context)
\ No newline at end of file
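
For reference, the sketch below shows what "submitting a training as a job to a pre-existing Ray cluster" can look like using Ray's job submission API. It is a minimal illustration of the decision above, not code from caikit-ray-backend; the dashboard address, entrypoint script, and runtime environment are hypothetical placeholders.

```python
# Minimal sketch of the "submit each training as a job to an existing Ray cluster"
# flow described in the ADR. Assumes Ray >= 2.x and a reachable, user-managed
# cluster; the address, entrypoint, and runtime_env are hypothetical placeholders.
from ray.job_submission import JobSubmissionClient, JobStatus

# Connect to the pre-existing cluster's job server; nothing is provisioned here.
client = JobSubmissionClient("http://127.0.0.1:8265")

# Submit one tuning run as a Ray job. Job state stays inside the Ray cluster.
submission_id = client.submit_job(
    entrypoint="python tune.py",  # hypothetical training entrypoint
    runtime_env={"working_dir": ".", "pip": ["caikit-nlp"]},
)

# The caller only needs the submission id to poll status or fetch logs later.
status: JobStatus = client.get_job_status(submission_id)
print(f"{submission_id}: {status}")
```

Because the cluster is user-managed, the module never creates or deletes Ray resources here, which is what keeps the backend platform agnostic.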