From eaa235b8c927b27c1531847d76797ea60945295f Mon Sep 17 00:00:00 2001
From: Kelly A
Date: Thu, 17 Aug 2023 16:03:32 -0400
Subject: [PATCH] add cluster provisioning ADR

Signed-off-by: Kelly A
---
 docs/adrs/001-cluster-provisioning.md | 49 +++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 docs/adrs/001-cluster-provisioning.md

diff --git a/docs/adrs/001-cluster-provisioning.md b/docs/adrs/001-cluster-provisioning.md
new file mode 100644
index 0000000..e645553
--- /dev/null
+++ b/docs/adrs/001-cluster-provisioning.md
@@ -0,0 +1,49 @@
+# ADR 001: Ray Cluster Provisioning
+
+The current implementation of this module assumes that a Ray cluster already exists and is managed by the user. Each training is submitted to this pre-existing Ray cluster as a job, and the storage of training job state is delegated entirely to that cluster.
+
+### Current state: Assume existing Ray cluster
+
+Pros:
+
+* Simple (from the perspective of this module, at least)
+* Platform agnostic, i.e. the Ray cluster could be a user's laptop, KubeRay, an AWS offering, etc.
+
+Cons:
+
+* Shifts complexity to the user
+* A static Ray cluster could be squatting on precious GPUs when not actively training
+
+Possible mitigating enhancements:
+
+* Add cluster creation and management to an operator such as the Caikit operator.
+* Have clusters autoscale in order to minimize resource wastage. (We need to test how well this works in practice.)
+
+### Option: Spin up a new Ray cluster for every training
+
+Pros:
+* The size and resources of the cluster are fully customizable per job
+* Could be useful for very large training/tuning jobs
+
+Cons:
+* Caikit-ray-backend ceases to be platform agnostic, as it will have to embed logic for creating Ray clusters in one or more platform-specific ways (e.g. K8s, AWS, GCP, etc.)
+* We will need to add state management to the caikit ray module to keep pointers to the various Ray clusters
+* Ray clusters must be explicitly deleted. They will squat on precious resources like GPUs until they are deleted, and once a Ray cluster is deleted, information about the job run is lost with it. If we want to persist that information, it must be stored in the Caikit Ray backend's own data store.
+* Total overkill for things like single-GPU / single-node prompt tuning jobs
+
+
+## Decision
+
+Since our current use cases are `caikit-nlp` tuning jobs that run within one node, we will assume a **pre-existing** Ray cluster. Operators that install caikit into a K8s environment can take responsibility for pre-creating the Ray cluster for use by Caikit tuning jobs.
+
+
+## Status
+
+Approved
+
+
+## Consequences
+
+* caikit-ray-backend remains platform agnostic
+* If or when caikit runs very large multi-node tuning jobs, we may have to revisit this decision
+* The responsibility of creating Ray clusters falls on the user of Caikit (and/or future operators in the K8s context)
\ No newline at end of file
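
For reference, the sketch below shows what "submitting a training as a job to a pre-existing Ray cluster" can look like using Ray's job submission API. It is a minimal illustration of the decision above, not code from caikit-ray-backend; the dashboard address, entrypoint script, and runtime environment are hypothetical placeholders.

```python
# Minimal sketch of the "submit each training as a job to an existing Ray cluster"
# flow described in the ADR. Assumes Ray >= 2.x and a reachable, user-managed
# cluster; the address, entrypoint, and runtime_env are hypothetical placeholders.
from ray.job_submission import JobSubmissionClient, JobStatus

# Connect to the pre-existing cluster's job server; nothing is provisioned here.
client = JobSubmissionClient("http://127.0.0.1:8265")

# Submit one tuning run as a Ray job. Job state stays inside the Ray cluster.
submission_id = client.submit_job(
    entrypoint="python tune.py",  # hypothetical training entrypoint
    runtime_env={"working_dir": ".", "pip": ["caikit-nlp"]},
)

# The caller only needs the submission id to poll status or fetch logs later.
status: JobStatus = client.get_job_status(submission_id)
print(f"{submission_id}: {status}")
```

Because the cluster is user-managed, the module never creates or deletes Ray resources here, which is what keeps the backend platform agnostic.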