Commit c0f7788

Merge branch 'kebe-temp'
2 parents e370fc6 + 2e97818 commit c0f7788

1 file changed: +42 -1 lines changed

packages/2024-08-21-kubecon-hk/slides.md

@@ -1233,6 +1233,20 @@ glowSeed: 230
12331233
12341234
</div>
12351235

1236+
<!--
1237+
Let’s dive into the core features of kcover.
1238+
1239+
[click] First is the “Firewatch of Workloads.” This feature continuously monitors the status of training workloads, such as PyTorchJobs, and promptly detects any error messages.
1240+
1241+
[click] Next, “Enhanced Observability” determines the status of jobs from several signals, such as logs and real-time system calls, making training jobs easier to observe.
1242+
1243+
[click] Through “Periodic Inspection,” we regularly test the status of jobs, the environment, and the infrastructure to ensure that the resources committed to training jobs meet the required conditions, so that training progresses smoothly.
1244+
1245+
[click] With “Cascading Shutdown,” when a fault prevents the training task from continuing, the entire task is shut down and restarted. This prevents the training framework from waiting on a non-working part, avoiding the waste of valuable hardware resources.
1246+
1247+
[click] Finally, “Intelli-Migration” intelligently assesses the health of nodes to determine whether they can continue running jobs, maximizing resource utilization while safeguarding training efficiency.
1248+
-->
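
The Cascading Shutdown idea in the notes above can be sketched roughly as follows. This is a minimal illustration, not kcover's implementation: the `Pod` class, its fields, and `pods_to_shut_down` are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    """Hypothetical, simplified view of one worker Pod in a training job."""
    name: str
    job: str
    phase: str  # e.g. "Running" or "Failed"


def pods_to_shut_down(pods: list[Pod], failed_pod: str) -> list[str]:
    """Return every Pod that belongs to the failed Pod's job.

    Shutting all of them down together is what lets the whole job restart,
    instead of healthy workers waiting on the broken one.
    """
    job = next(p.job for p in pods if p.name == failed_pod)
    return sorted(p.name for p in pods if p.job == job)


pods = [
    Pod("train-a-worker-0", "train-a", "Running"),
    Pod("train-a-worker-1", "train-a", "Failed"),
    Pod("train-b-worker-0", "train-b", "Running"),
]
print(pods_to_shut_down(pods, "train-a-worker-1"))
```

Only the Pods of the affected job are selected; the unrelated job `train-b` is left alone.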
1249+
12361250
---
12371251
class: py-10
12381252
---
@@ -1290,6 +1304,14 @@ class: py-10
12901304

12911305
</div>
12921306

1307+
<!--
1308+
The architecture of kcover consists of two parts: the Collector and the Controller (also known as the Recovery Manager).
1309+
1310+
The Collector runs as a DaemonSet on each node and is responsible for gathering information. This includes executing the dcgmi command, analyzing the logs and events of each Pod, and invoking some system calls, for example to check the status of PCIe devices, to determine the operational status of tasks. It reports any exceptional events back to the API server.
1311+
1312+
The Controller watches these events from the API server and further assesses the data collected by the Collector to determine whether a Job needs to be restarted. If a restart is required, it restarts the entire Job and may mark the node as unschedulable.
1313+
-->
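
The split between Collector and Controller described above might look something like the sketch below. Everything here is an assumption made for illustration: the `Event` and `Decision` types, the reason strings, and the rule that hardware faults cordon a node are not taken from kcover's code.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical record a Collector might report for one Pod."""
    job: str
    node: str
    reason: str  # illustrative values: "NCCLError", "GPUFault", "Info"


@dataclass
class Decision:
    restart_jobs: set[str] = field(default_factory=set)
    cordon_nodes: set[str] = field(default_factory=set)


# Assumption: which reasons justify a restart, and which also implicate the node.
FATAL_REASONS = {"NCCLError", "CUDAError", "GPUFault", "PCIeFault"}
HARDWARE_REASONS = {"GPUFault", "PCIeFault"}


def decide(events: list[Event]) -> Decision:
    """Controller side: turn reported events into restart and cordon decisions."""
    decision = Decision()
    for ev in events:
        if ev.reason in FATAL_REASONS:
            decision.restart_jobs.add(ev.job)
        if ev.reason in HARDWARE_REASONS:
            # Mark the node unschedulable so the restarted job avoids it.
            decision.cordon_nodes.add(ev.node)
    return decision


events = [
    Event(job="train-a", node="node-1", reason="NCCLError"),
    Event(job="train-b", node="node-2", reason="GPUFault"),
    Event(job="train-c", node="node-3", reason="Info"),
]
result = decide(events)
print(sorted(result.restart_jobs), sorted(result.cordon_nodes))
```

Keeping the decision logic in one place, separate from per-node data gathering, mirrors the two-part design in the notes: collectors only report, and a single controller decides.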
1314+
12931315
---
12941316
class: py-10
12951317
glow: right
@@ -1317,6 +1339,13 @@ glowSeed: 230
13171339

13181340
</v-clicks>
13191341

1342+
<!--
1343+
[click] Once a training task is labeled, kcover continuously analyzes information about it.
1344+
This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors), [click] and specific exit codes.
1345+
[click] If a problem is detected,
1346+
[click] we record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the task, allowing it to resume training from the last known state. [click] In addition, ongoing diagnostic tools analyze network status, GPU hardware status, PCIe status, and kernel status to keep the system operating under optimal conditions.
1347+
-->
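
As a rough illustration of the log and exit-code analysis mentioned in the notes above, the sketch below scans container log lines for a few error signatures. The regular expressions, the `classify` helper, and the choice of exit codes are assumptions for this example; kcover's actual detection rules may differ.

```python
import re

# Illustrative signatures only; real CUDA/NCCL/OOM messages vary by version.
PATTERNS = {
    "cuda": re.compile(r"CUDA error", re.IGNORECASE),
    "nccl": re.compile(r"NCCL (error|WARN)", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
}

# 137 = 128 + SIGKILL (9), commonly seen when a container is OOM-killed.
SUSPECT_EXIT_CODES = {137}


def classify(log_lines, exit_code=None):
    """Return the set of problem categories found in the logs and exit code."""
    found = set()
    for line in log_lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                found.add(name)
    if exit_code in SUSPECT_EXIT_CODES:
        found.add("exit_code")
    return found


logs = [
    "step 1200 loss=0.41",
    "NCCL WARN Call to connect returned Connection refused",
]
print(classify(logs, exit_code=1))
```

An empty result means nothing suspicious was matched; any non-empty result would be a candidate for the Collector to record as an event.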
1348+
13201349
---
13211350
class: py-10
13221351
---
@@ -1327,7 +1356,7 @@ class: py-10
13271356

13281357
#### Install
13291358

1330-
```shell
1359+
```shell {|3}
13311360
helm repo add baizeai https://baizeai.github.io/charts
13321361
helm repo update baizeai
13331362
helm -n kcover-system --create-namespace install kcover baizeai/kcover
@@ -1363,6 +1392,13 @@ metadata:
13631392

13641393
</div>
13651394

1395+
<!--
1396+
To start using kcover, first install it on your system with a few simple helm commands.
1397+
[click] You only need to execute the helm install command to install kcover on your cluster.
1398+
[click] Then, when submitting a training task such as a PyTorchJob, you only need to set a label on the job.
1399+
[click] This allows kcover to continuously monitor the job, ensuring that it can be quickly recovered after a failure without the need for manual intervention.
1400+
-->
1401+
13661402
---
13671403
class: py-10
13681404
---
@@ -1453,6 +1489,11 @@ class: py-10
14531489
</div>
14541490
</div>
14551491

1492+
<!--
1493+
That covers some of the current features and technical details of kcover.
1494+
[click] This project is now open source, and you can find it here. We warmly welcome contributions; suggestions and feedback are greatly appreciated.
1495+
-->
1496+
14561497
---
14571498
class: py-10
14581499
---
