packages/2024-08-21-kubecon-hk/slides.md
+42 -1
@@ -1233,6 +1233,20 @@ glowSeed: 230
 
 </div>
 
+<!--
+Let’s dive into the core features of kcover.
+
+[click] First, there is the “Firewatch of Workloads.” This feature continuously monitors the status of training workloads, such as PyTorchJobs, and promptly detects error messages.
+
+[click] Next, “Enhanced Observability” determines the status of jobs through several signals, such as logs and real-time system calls, which makes training jobs easier to observe.
+
+[click] Through “Periodic Inspection,” we regularly test the status of jobs, the environment, and the infrastructure to ensure that the resources committed to training jobs meet the required conditions, so that training progresses smoothly.
+
+[click] With “Cascading Shutdown,” when a fault prevents the training task from continuing, the entire task is shut down and restarted. This prevents the training framework from waiting on a non-working part and wasting valuable hardware resources.
+
+[click] Finally, “Intelli-Migration” assesses the health of nodes to determine whether they can continue running jobs, maximizing resource utilization while safeguarding training efficiency.
+-->
+
 ---
 class: py-10
 ---
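The “Cascading Shutdown” behavior described in the notes above can be illustrated with a small Python model. This is only a sketch under stated assumptions: `Pod`, its `phase` values, and `cascading_shutdown` are hypothetical names invented for illustration and are not taken from kcover's code.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    phase: str  # hypothetical phases: "Running", "Failed", "Terminating"


def cascading_shutdown(pods):
    """If any pod has failed, mark every still-running pod as Terminating.

    Returns the names of the pods that were stopped, so the caller can
    restart the whole job instead of leaving healthy workers waiting.
    """
    if not any(p.phase == "Failed" for p in pods):
        return []
    stopped = []
    for p in pods:
        if p.phase == "Running":
            p.phase = "Terminating"
            stopped.append(p.name)
    return stopped


pods = [
    Pod("worker-0", "Running"),
    Pod("worker-1", "Failed"),
    Pod("worker-2", "Running"),
]
stopped = cascading_shutdown(pods)
print(stopped)
```

The point of the sketch is that a single failed worker stops all of its peers, so no healthy worker keeps holding hardware while waiting for a peer that will never respond.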
@@ -1290,6 +1304,14 @@ class: py-10
 
 </div>
 
+<!--
+The architecture of kcover consists of two parts: the Collector and the Controller (also known as the Recovery Manager).
+
+The Collector runs as a DaemonSet on each node and is responsible for gathering information. This includes executing the dcgmi command, analyzing the logs and events of each Pod, and invoking system calls, for example to check the status of PCIe devices, to determine the operational status of tasks. It reports any exceptional events back to the API server.
+
+The Controller watches these events from the API server and further assesses the data collected by the Collector to determine whether a Job needs to be restarted. If a restart is required, it restarts the entire Job and may mark the node as unschedulable.
+-->
+
 ---
 class: py-10
 glow: right
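The Collector/Controller split described in the notes above can be sketched as a toy event flow in Python. Everything here is an assumption made for illustration: `Event`, `EventStore` (an in-memory stand-in for the API server), `collector_report`, `controller_reconcile`, and the reason strings are hypothetical and do not reflect kcover's actual types or decision rules.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    job: str        # name of the training job the event belongs to
    node: str       # node on which the Collector observed the problem
    reason: str     # free-form reason string (hypothetical values)
    hardware: bool  # True when the fault points at the node itself


@dataclass
class EventStore:
    """In-memory stand-in for the API server that both sides talk to."""
    events: list = field(default_factory=list)


def collector_report(store, event):
    """Collector side: publish an exceptional event."""
    store.events.append(event)


def controller_reconcile(store):
    """Controller side: drain pending events and decide what to do.

    Simplification: every reported event triggers a job restart, and
    only hardware faults make the node unschedulable.
    """
    restart_jobs, cordon_nodes = set(), set()
    for ev in store.events:
        restart_jobs.add(ev.job)
        if ev.hardware:
            cordon_nodes.add(ev.node)
    store.events.clear()
    return restart_jobs, cordon_nodes


store = EventStore()
collector_report(store, Event("job-a", "node-1", "NCCLTimeout", hardware=False))
collector_report(store, Event("job-b", "node-2", "GPUFault", hardware=True))
restart_jobs, cordon_nodes = controller_reconcile(store)
print(sorted(restart_jobs), sorted(cordon_nodes))
```

Keeping the two roles decoupled through a shared event store mirrors the description in the notes: the per-node component only reports, and the central component alone decides on restarts and on marking nodes unschedulable.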
@@ -1317,6 +1339,13 @@ glowSeed: 230
 
 </v-clicks>
 
+<!--
+[click] Once a training task is labeled, kcover continuously analyzes information about it.
+This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors), [click] as well as specific exit codes.
+[click] If a problem is detected,
+[click] we record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the task, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we analyze network status, GPU hardware status, PCIe status, and kernel status to keep the system operating in optimal condition.
+To start using kcover, first install it on your system with a few simple helm commands.
+[click] You only need to run the helm install command to install kcover on your cluster.
+[click] Then, when submitting a training task such as a PyTorchJob, you only need to set a label on the job.
+[click] This allows kcover to continuously monitor the job, so that it can be recovered quickly after a failure without manual intervention.
+-->
+
 ---
 class: py-10
 ---
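The log and exit-code analysis mentioned in the notes of the preceding hunks (CUDA, NCCL, or OOM errors, plus specific exit codes) could look roughly like the following Python sketch. The regular expressions, the `classify` helper, and the choice of exit code are assumptions made for illustration; kcover's real matching rules are not shown in this diff.

```python
import re

# Hypothetical patterns; real log formats vary by framework and version.
PATTERNS = {
    "cuda": re.compile(r"CUDA error", re.IGNORECASE),
    "nccl": re.compile(r"NCCL (error|WARN)", re.IGNORECASE),
    "oom": re.compile(r"out of memory|OOMKilled", re.IGNORECASE),
}

# 137 = 128 + 9 (SIGKILL), the exit code a container reports when it is killed,
# which is what happens when the kernel OOM killer terminates it.
SUSPICIOUS_EXIT_CODES = {137}


def classify(log_lines, exit_code=None):
    """Return a sorted list of fault categories found in logs and exit code."""
    found = set()
    for line in log_lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                found.add(name)
    if exit_code in SUSPICIOUS_EXIT_CODES:
        found.add("exit_code")
    return sorted(found)


faults = classify(
    ["RuntimeError: CUDA error: device-side assert triggered", "step 10 ok"],
    exit_code=137,
)
print(faults)
```

A result such as this would be the kind of signal a collector could turn into an event; how kcover weighs each category is outside what the notes describe.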
@@ -1453,6 +1489,11 @@ class: py-10
 </div>
 </div>
 
+<!--
+The above covers some of the current features and technical details of kcover.
+[click] This project is now open source, and you can find it here. We warmly welcome everyone to contribute; suggestions and feedback are greatly appreciated.