
Commit b7df1b3

tyoo
1 parent c0f7788 commit b7df1b3

File tree

1 file changed: +7 -7 lines changed

packages/2024-08-21-kubecon-hk/slides.md

+7 -7
@@ -1236,13 +1236,13 @@ glowSeed: 230
 <!--
 Let’s dive into the core features of kcover.

-[click] Firstly, there is the “Firewatch of Workloads.” This feature will continuously monitor the status of training task workloads, such as PyTorchJobs, and promptly detect any error messages.
+[click] Firstly, there is the “Firewatch of Workloads.” This feature will continuously monitor the status of training job workloads, such as PyTorchJobs, and promptly detect any error messages.

 [click] Next, “Enhanced Observability” will be implemented by utilizing various means to determine the status of jobs, such as observing logs and real-time system calls, thus enhancing the observability of training jobs.

 [click] Through “Periodic Inspection,” we will regularly test the status of jobs, the environment, or infrastructure to ensure that the resources committed to training jobs meet the required conditions, ensuring smooth training progress.

-[click] With “Cascading Shutdown,” when a fault occurs that prevents the training task from continuing, the entire task will be restarted through Cascading Shutdown. This prevents the training framework from waiting due to a non-working part, thus avoiding the waste of valuable hardware resources.
+[click] With “Cascading Shutdown,” when a fault occurs that prevents the training job from continuing, the entire job will be restarted through Cascading Shutdown. This prevents the training framework from waiting due to a non-working part, thus avoiding the waste of valuable hardware resources.

 [click] Finally, “Intelli-Migration” will intelligently assess the health status of nodes to determine whether they can continue running jobs, ensuring maximized resource utilization while safeguarding training efficiency.
 -->
@@ -1307,9 +1307,9 @@ class: py-10
 <!--
 The architecture of Kcover consists of two parts: Collector and Controller (also known as Recovery Manager).

-The Collector runs as a Daemonset on each Node, responsible for gathering information. This includes executing the dcgmi command, analyzing the logs and events of each Pod, and invoking some system calls, such as checking the status of PCIE devices, to determine the operational status of tasks. It reports any exceptional events back to the APIServer.
+The Collector runs as a Daemonset on each Node, responsible for gathering information. This includes executing the dcgmi command, analyzing the logs and events of each Pod, and invoking some system calls, such as checking the status of PCIE devices, to determine the operational status of jobs. It reports any exceptional events back to the APIServer.

-The Controller monitors these events from the APIServer and makes further assessments of the data collected by the Collector to determine whether a Job needs to be restarted. If a restart is required, it will execute the restart of the entire Job and may mark the node as unschedulable.
+The Controller keeps an eye on events relayed by the APIServer and further analyzes the data gathered by the Collector. Based on this analysis, it decides if a Job needs to be restarted. If a restart is necessary, the Controller will reboot the entire Job and may also mark the node as unschedulable to prevent further assignments.
 -->

 ---
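
As a rough aid to the architecture described in these notes, the sketch below shows one plausible shape for a node-level collector deployed as a DaemonSet. It is a minimal sketch only, not taken from the kcover chart: the names, namespace, image, and mounts are assumptions, and the real manifests come from the Helm chart mentioned later in the slides.

```yaml
# Minimal sketch, not the actual kcover manifests: an assumed shape for a
# node-level collector DaemonSet. Names, namespace, image, and mounts are
# placeholders chosen for illustration.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kcover-collector        # hypothetical name
  namespace: kcover-system      # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: kcover-collector
  template:
    metadata:
      labels:
        app: kcover-collector
    spec:
      hostPID: true             # assumption: node-level visibility for system checks
      containers:
        - name: collector
          image: example.com/kcover/collector:latest   # placeholder image
          securityContext:
            privileged: true    # assumption: needed to run dcgmi and query PCIe state
          volumeMounts:
            - name: pod-logs
              mountPath: /var/log/pods
              readOnly: true    # assumption: read Pod logs on the node for analysis
      volumes:
        - name: pod-logs
          hostPath:
            path: /var/log/pods
```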
@@ -1340,10 +1340,10 @@ glowSeed: 230
 </v-clicks>

 <!--
-[click] Once a training task is labeled, kcover will continuously analyze this information.
+[click] Once a training job is labeled, kcover will continuously analyze this information.
 This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors, [click] as well as specific exit codes).
 [click] If a problem is detected,
-[click] we will record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the task, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we will analyze network status, GPU hardware status, PCIE status, and kernel status to ensure that the system always operates at optimal conditions.
+[click] we will record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the job, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we will analyze network status, GPU hardware status, PCIE status, and kernel status to ensure that the system always operates at optimal conditions.
 -->

 ---
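
To make the "record the event through the Collector" step in these notes more concrete, here is a minimal sketch of the kind of Kubernetes Event a node collector could report after spotting an NCCL error in a worker's logs. Every field value is hypothetical; kcover's actual event reasons and messages are not shown in this diff.

```yaml
# Minimal sketch of a core/v1 Event that a collector could report to the
# APIServer; the reason, message, and names are hypothetical, not kcover's.
apiVersion: v1
kind: Event
metadata:
  name: demo-training-job-worker-0.nccl-error   # hypothetical
  namespace: default
type: Warning
reason: TrainingJobFailure                      # hypothetical reason string
message: "NCCL communication error found in container logs (exit code 1)"
involvedObject:
  apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  name: demo-training-job                       # hypothetical job name
```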
@@ -1395,7 +1395,7 @@ metadata:
 <!--
 To start using kcover, you can initially install kcover onto your system with a few simple helm commands.
 [click] You only need to execute the helm install command to install kcover on your cluster.
-[click] Subsequently, when submitting training tasks, such as a PyTorchJob, you only need to set a label for the job.
+[click] Subsequently, when submitting training jobs, such as a PyTorchJob, you only need to set a label for the job.
 [click] This allows kcover to continuously monitor the job, ensuring that it can be quickly recovered after a failure without the need for manual intervention.
 -->

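For the labeling step described in these notes, a minimal sketch of a labeled PyTorchJob follows. The label key is a placeholder, not the real kcover label (the actual helm install command and label appear on the slide itself, not in this diff), and the training spec is omitted.

```yaml
# Minimal sketch of the labeling step: a PyTorchJob carrying a label that a
# recovery controller such as kcover could watch for. The label key below is
# a placeholder, not the documented kcover label; spec is omitted for brevity.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: demo-training-job
  labels:
    kcover.example.io/cascading-shutdown: "true"   # hypothetical label key
# spec: (pytorchReplicaSpecs omitted)
```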