packages/2024-08-21-kubecon-hk/slides.md (+7 -7)
@@ -1236,13 +1236,13 @@ glowSeed: 230
<!--
Let’s dive into the core features of kcover.
- [click] Firstly, there is the “Firewatch of Workloads.” This feature will continuously monitor the status of training task workloads, such as PyTorchJobs, and promptly detect any error messages.
+ [click] Firstly, there is the “Firewatch of Workloads.” This feature will continuously monitor the status of training job workloads, such as PyTorchJobs, and promptly detect any error messages.
[click] Next, “Enhanced Observability” will be implemented by utilizing various means to determine the status of jobs, such as observing logs and real-time system calls, thus enhancing the observability of training jobs.
[click] Through “Periodic Inspection,” we will regularly test the status of jobs, the environment, or infrastructure to ensure that the resources committed to training jobs meet the required conditions, ensuring smooth training progress.
- [click] With “Cascading Shutdown,” when a fault occurs that prevents the training task from continuing, the entire task will be restarted through Cascading Shutdown. This prevents the training framework from waiting due to a non-working part, thus avoiding the waste of valuable hardware resources.
+ [click] With “Cascading Shutdown,” when a fault occurs that prevents the training job from continuing, the entire job will be restarted through Cascading Shutdown. This prevents the training framework from waiting due to a non-working part, thus avoiding the waste of valuable hardware resources.
[click] Finally, “Intelli-Migration” will intelligently assess the health status of nodes to determine whether they can continue running jobs, ensuring maximized resource utilization while safeguarding training efficiency.
-->
@@ -1307,9 +1307,9 @@ class: py-10
<!--
The architecture of Kcover consists of two parts: Collector and Controller (also known as Recovery Manager).
- The Collector runs as a Daemonset on each Node, responsible for gathering information. This includes executing the dcgmi command, analyzing the logs and events of each Pod, and invoking some system calls, such as checking the status of PCIE devices, to determine the operational status of tasks. It reports any exceptional events back to the APIServer.
+ The Collector runs as a Daemonset on each Node, responsible for gathering information. This includes executing the dcgmi command, analyzing the logs and events of each Pod, and invoking some system calls, such as checking the status of PCIE devices, to determine the operational status of jobs. It reports any exceptional events back to the APIServer.
- The Controller monitors these events from the APIServer and makes further assessments of the data collected by the Collector to determine whether a Job needs to be restarted. If a restart is required, it will execute the restart of the entire Job and may mark the node as unschedulable.
+ The Controller keeps an eye on events relayed by the APIServer and further analyzes the data gathered by the Collector. Based on this analysis, it decides if a Job needs to be restarted. If a restart is necessary, the Controller will reboot the entire Job and may also mark the node as unschedulable to prevent further assignments.
-->
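The speaker notes above describe the Collector reporting "exceptional events" back to the APIServer. As a rough, hypothetical illustration only (the reason, message, component, and object names below are assumptions, not kcover's documented output), such a report could take the shape of a standard Kubernetes core/v1 Event:

```yaml
# Hypothetical sketch of a node-level fault surfaced as a Kubernetes Event.
# The field structure is standard core/v1 Event; the specific values are assumed.
apiVersion: v1
kind: Event
metadata:
  name: demo-worker-0.gpu-fault        # assumed name
  namespace: training
type: Warning
reason: GPUHealthCheckFailed           # assumed reason string
message: "dcgmi health check reported an error on GPU 3; training may be stalled"
involvedObject:
  kind: Pod
  name: demo-pytorchjob-worker-0       # assumed Pod name
  namespace: training
source:
  component: kcover-collector          # assumed component name
  host: gpu-node-17
```

The Controller side would then watch for events of this kind and decide whether to restart the Job and cordon the node, as the notes describe.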
---
@@ -1340,10 +1340,10 @@ glowSeed: 230
</v-clicks>
<!--
- [click] Once a training task is labeled, kcover will continuously analyze this information.
+ [click] Once a training job is labeled, kcover will continuously analyze this information.
This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors, [click] as well as specific exit codes).
[click] If a problem is detected,
- [click] we will record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the task, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we will analyze network status, GPU hardware status, PCIE status, and kernel status to ensure that the system always operates at optimal conditions.
+ [click] we will record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the job, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we will analyze network status, GPU hardware status, PCIE status, and kernel status to ensure that the system always operates at optimal conditions.
-->
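For the "specific exit codes" mentioned in the notes above, one concrete and well-known signal is an OOM-killed container: Kubernetes records exit code 137 with reason OOMKilled in the Pod's container status. The Pod and container names below are made up for illustration:

```yaml
# Sketch of the container status a log/status watcher could key on after an OOM kill.
# exitCode 137 / reason OOMKilled is standard Kubernetes reporting; names are assumed.
status:
  containerStatuses:
    - name: pytorch                    # assumed container name
      ready: false
      restartCount: 1
      lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          finishedAt: "2024-08-21T08:15:00Z"
```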
---
@@ -1395,7 +1395,7 @@ metadata:
<!--
To start using kcover, you can initially install kcover onto your system with a few simple helm commands.
[click] You only need to execute the helm install command to install kcover on your cluster.
1398
-
[click] Subsequently, when submitting training tasks, such as a PyTorchJob, you only need to set a label for the job.
1398
+
[click] Subsequently, when submitting training jobs, such as a PyTorchJob, you only need to set a label for the job.
[click] This allows kcover to continuously monitor the job, ensuring that it can be quickly recovered after a failure without the need for manual intervention.
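As a minimal sketch of the labeling step described above (the label key and value here are assumptions for illustration; the actual key expected by kcover should be taken from its documentation, and the helm chart reference from the install command shown in the slides, which is not reproduced in this hunk), a labeled PyTorchJob could look like this:

```yaml
# Hypothetical example: submit the training job with a label so the recovery
# controller can watch it. The label key/value below are assumed, not verified.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-training-job
  labels:
    kcover.io/cascading-recovery: "true"   # assumed label key; check the kcover docs
spec:
  # pytorchReplicaSpecs omitted for brevity
```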