
Commit 9f449a3

committed Aug 20, 2024
docs: more comments
Signed-off-by: Neko Ayaka <neko@ayaka.moe>
1 parent b7df1b3 commit 9f449a3

File tree

1 file changed: +163, -34 lines changed


‎packages/2024-08-21-kubecon-hk/slides.md

+163-34
@@ -74,7 +74,7 @@ glowSeed: 205
7474
<!--
7575
Before we start, let's introduce ourselves.
7676
77-
[click] We are the software engineers come from DaoCloud. We are primarily focusing on field where we will cohere [click] Kubernetes and AI workloads together.
77+
[click] We are software engineers from DaoCloud, a company well known for its efforts in the open source and Kubernetes ecosystems. We are now primarily focused on the field where we bring [click] Kubernetes and AI workloads together.
7878
-->
7979

8080
---
@@ -120,7 +120,7 @@ glowSeed: 205
120120
<!--
121121
As background, [click] Kebe Liu is a member of the Istio Steering Committee; while working on AI, he has also been focusing on cloud native, Istio, eBPF, and other areas in recent years.
122122
123-
[click] Me, Fanshi Zhang, I am a software engineer at DaoCloud, focusing on AI and Kubernetes. I am also a contributor to the Kubernetes community.
123+
[click] Me, Fanshi Zhang, I am a software engineer at DaoCloud, focusing on AI and Kubernetes. I am also a contributor to the Kubernetes community, as well as a contributor to Go and Vue, and the founder of Nolebase, Guii, and Lintic.
124124
-->
125125

126126
---
@@ -132,7 +132,7 @@ class: flex justify-center items-center gap-20 px40 text-xl
132132
<!--
133133
Without further ado, let's jump right into the rabbit hole and see what we have prepared for you today.
134134
135-
The very first one thing to get off is distributed training...
135+
The very first thing to get off... is distributed training...
136136
-->
137137

138138
---
@@ -280,7 +280,9 @@ class: py-10
280280
</div>
281281
282282
<!--
283-
This is the "visualization" fundamental building block of every machine learning models. (or Hinton diagram if you like). With that, the concept of training is just splitting data into different slices and blocks (which we call batches), what to do? [click] we will then feed them into the [click] CPU or GPU hardware devices to do the computation as well as inference.
283+
This is the "visualization" of fundamental building block of every machine learning models. (or Hinton diagram if you like).
284+
285+
With that, the concept of training is just splitting data into slices and blocks (which we call batches). What do we do with them? [click] We then feed them into the [click] CPU or GPU hardware to do the computation, as well as inference.
284286
-->
285287

286288
---
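For readers skimming this diff, the note above boils down to a plain minibatch training loop. The following is only an illustrative PyTorch sketch; the model, data, and hyperparameters are made up and are not taken from the slides:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; the slides only describe the concept.
dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=64, shuffle=True)    # split the data into batches

device = "cuda" if torch.cuda.is_available() else "cpu"      # CPU or GPU hardware
model = nn.Linear(32, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for inputs, targets in loader:
    inputs, targets = inputs.to(device), targets.to(device)  # feed each batch to the device
    loss = loss_fn(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
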
@@ -362,9 +364,13 @@ glowSeed: 120
362364
</div>
363365

364366
<!--
365-
Surely we know what is model and what was doing during training. [click] But in modern days, models are getting larger and larger, [click] they wouldn't be able to fit into a single instance of GPU. Therefore, to deal with the "too large" problem, [click] we will need to distribute them to multiple GPU clusters.
367+
Surely we now understand what a model is and what happens during training. Here come the challenges.
368+
369+
[click] These days, models are getting larger and larger, [click] and they can no longer fit into a single GPU. Therefore, to deal with the "too large" problem, [click] we need to distribute them across multiple GPU clusters.
366370
367-
Ok, everything seems fine. More GPUs means faster training, right? Or is it? [click] It turns out the memory and power consumption will not be the only problems we will face, But also the failures.
371+
Ok, everything seems fine. More GPUs means faster training, right? Or does it?
372+
373+
[click] It turns out that memory and power consumption are not the only problems we will face; there are also failures.
368374
-->
369375

370376
---
@@ -399,9 +405,9 @@ glowSeed: 368
399405
</div>
400406

401407
<!--
402-
So, why do failures occur?
408+
Ok, so... why do failures occur?
403409
404-
Before we get into the rabbit hole any further, let's take a look at the common [click] hardware failures, [click] network issues, [click] or even software bugs.
410+
Before we dive much deeper, let's take a step back and look at the common issues: [click] hardware failures, [click] network issues, [click] and software bugs.
405411
-->
406412

407413
---
@@ -419,19 +425,6 @@ class: py-10
419425

420426
<div flex flex-col>
421427

422-
<v-clicks>
423-
424-
```txt {|4,6-9}
425-
[ 4254.197816] NVRM: GPU at PCI:0000:5d:00: GPU-f1906b9b-557a-e961-045c-9fe4be3ce012
426-
[ 4254.197854] NVRM: GPU Board Serial Number: 1653923026510
427-
[ 4254.197860] NVRM: Xid (PCI:0000:5d:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
428-
[ 4254.197871] NVRM: GPU 0000:5d:00.0: GPU has fallen off the bus.
429-
[ 4254.197878] NVRM: GPU 0000:5d:00.0: GPU serial number is 1653923026510.
430-
[ 4254.197913] NVRM: A GPU crash dump has been created. If possible, please run
431-
NVRM: nvidia-bug-report.sh as root to collect this data before
432-
NVRM: the NVIDIA kernel module is unloaded.
433-
```
434-
435428
```txt {|3,4-5}
436429
[14387.209961] NVRM: The NVIDIA GPU 0000:5d:00.0
437430
NVRM: (PCI ID: 10de:2330) installed in this system has
@@ -444,12 +437,23 @@ class: py-10
444437
[14387.573380] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.125.06 Tue May 30 04:58:48 UTC 2023
445438
```
446439

447-
</v-clicks>
440+
```txt {|4,6-9}
441+
[ 4254.197816] NVRM: GPU at PCI:0000:5d:00: GPU-f1906b9b-557a-e961-045c-9fe4be3ce012
442+
[ 4254.197854] NVRM: GPU Board Serial Number: 1653923026510
443+
[ 4254.197860] NVRM: Xid (PCI:0000:5d:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
444+
[ 4254.197871] NVRM: GPU 0000:5d:00.0: GPU has fallen off the bus.
445+
[ 4254.197878] NVRM: GPU 0000:5d:00.0: GPU serial number is 1653923026510.
446+
[ 4254.197913] NVRM: A GPU crash dump has been created. If possible, please run
447+
NVRM: nvidia-bug-report.sh as root to collect this data before
448+
NVRM: the NVIDIA kernel module is unloaded.
449+
```
448450

449451
</div>
450452

451453
<!--
454+
Let's take a look at this log, captured by executing `dmesg` to inspect the syslog. [click] We can see that the GPU has fallen off the bus, [click] and that the NVIDIA probe routine failed for 1 device(s).
452455
456+
Those are the common issues we face when dealing with GPUs and PCIe, from the perspective of the kernel.
453457
-->
454458

455459
---
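The Xid 79 line in the log above is exactly the kind of signal a node-level detector watches for. As a rough illustration only (this is not NPD's or Kcover's actual implementation; the pattern and fields are assumptions based on the log shown), a scanner could look like this:

```python
import re
import subprocess

# Match NVIDIA Xid events such as the "GPU has fallen off the bus" (Xid 79)
# line shown in the dmesg output above.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-f:.]+)\): (?P<xid>\d+)")

def scan_dmesg() -> list[dict]:
    """Return any Xid events found in the kernel ring buffer."""
    output = subprocess.run(["dmesg"], capture_output=True, text=True, check=True).stdout
    events = []
    for line in output.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append({"pci": match.group("bdf"), "xid": int(match.group("xid")), "raw": line})
    return events

if __name__ == "__main__":
    for event in scan_dmesg():
        # Fatal codes like Xid 79 are the ones worth cordoning a node over.
        print(event)
```

In practice this kind of check would run as a Node Problem Detector plugin or a collector sidecar rather than a standalone script.
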
@@ -469,7 +473,7 @@ class: py-10
469473

470474
<v-clicks>
471475

472-
```txt {|5,10-13}
476+
```txt {5,10-13}
473477
node-1:185:1027 [7] NCCL INFO [Service thread] Connection closed by localRank 0
474478
node-1:180:1028 [2] NCCL INFO [Service thread] Connection closed by localRank 0
475479
node-1:184:1030 [6] NCCL INFO [Service thread] Connection closed by localRank 0
@@ -494,7 +498,7 @@ NET/IB : Got completion from peer 10.42.0.2<47534> with error 5, opcode 48, len
494498
</div>
495499

496500
<!--
497-
501+
This is another one, related to NCCL, captured during some training experiments with PyTorch. [click] We can see that the connection was closed by localRank 0, and that the NCCL watchdog thread terminated with an exception: NCCL error: remote process exited or there was a network error.
498502
-->
499503

500504
---
@@ -538,7 +542,7 @@ RuntimeError: expected scalar type BFloat16 but found Float
538542
</div>
539543

540544
<!--
541-
545+
And the last one! [click] This is a software bug captured during the training process. [click] We can see the error: expected scalar type BFloat16 but found Float.
542546
-->
543547

544548
---
@@ -570,7 +574,21 @@ glow: right
570574
</v-clicks>
571575

572576
<!--
573-
[click] Instead of a normal deployment, distributed training jobs are more like a StatefulSet. [click] When bootstrapping, the main node (rank 0) will be the first to start, [click] then negotiate with other nodes (rank != 0) to join the training through NCCL. While both [click] calculate topology, [click] calculate connectivity, [click] calculate bandwidth, etc. [click] Once everyone is ready, minibatch will be calculated and sent to each node. [click] During training, every step, or epoch, a Ring AllReduce or AllReduce operation will be performed across the nodes.
577+
Hmm, so where are the so-called "irreversible" issues?
578+
579+
Sorry for spending so many slides building up the fundamentals; please allow me to explain how distributed training works in PyTorch with one additional slide.
580+
581+
Ok...
582+
583+
[click] Instead of a normal Deployment, distributed training jobs are more like a StatefulSet.
584+
585+
[click] When bootstrapping, the main node (rank 0) will be the first to start, [click] and then negotiate with the other nodes (rank != 0) to join the training through NCCL.
586+
587+
Along the way, both sides [click] calculate topology, [click] calculate connectivity, [click] calculate bandwidth, and so on.
588+
589+
[click] Once everyone is ready, minibatches will be calculated and sent to each node.
590+
591+
[click] During training, at every step or epoch, a Ring AllReduce or AllReduce operation is performed across the nodes.
574592
-->
575593

576594
---
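To make the bootstrap sequence in these notes concrete, here is a hedged sketch using the standard `torch.distributed` API. It assumes torchrun-style environment variables (`RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) are set by the launcher or the training operator; it is not the exact code behind the slides.

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])                 # rank 0 is the "main" node
world_size = int(os.environ["WORLD_SIZE"])

# Rank 0 starts first and acts as the rendezvous point; the other ranks
# (rank != 0) join it, and NCCL works out topology, connectivity and bandwidth.
dist.init_process_group(
    backend="nccl",
    rank=rank,
    world_size=world_size,
    timeout=timedelta(minutes=10),             # bound how long a hung peer can stall the group
)
torch.cuda.set_device(rank % torch.cuda.device_count())

# One training step's gradient synchronization: an AllReduce (ring-based
# under NCCL) sums the local tensors, then we average them.
grad = torch.ones(4, device="cuda") * rank     # stand-in for a real gradient tensor
dist.all_reduce(grad, op=dist.ReduceOp.SUM)
grad /= world_size

dist.destroy_process_group()
```
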
@@ -686,7 +704,21 @@ clicks: 7
686704
<div mt-12 />
687705

688706
<!--
707+
That's a lot of words. Don't worry, we've prepared a simulation animation to explain one of the toughest known issues in distributed training.
708+
709+
Let's look at the overview first.
710+
711+
[click] Say we have a distributed training job running on a GPU cluster; [click] pay attention to how the error propagates from one node to another...
689712
713+
[click] For the given nodes in the cluster running distributed training workloads, [click] the main node (rank 0) will be the first to start, then negotiate with the other nodes (rank != 0) to join the training through NCCL.
714+
715+
[click] Once everything is ready, training starts across the nodes.
716+
717+
[click] However... one of the nodes (or pods) encounters a critical issue due to NCCL or GPU failures. [click] Now the interesting part kicks in: see how the error propagates from one node to another?
718+
719+
That means that when one of the nodes fails in NCCL-based distributed training, every node requires the others to respond; while NCCL is hanging, nothing gets done, and everyone is waiting.
720+
721+
Ok, how can we resolve this issue? We need to kill ALL the related nodes (or pods) and restart them.
690722
-->
691723

692724
---
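The hang described above, where every healthy rank waits forever on a dead peer, is usually bounded by PyTorch's NCCL watchdog so the job can die quickly and be restarted as a whole. A minimal sketch, assuming a torchrun-style launcher and a recent PyTorch release (the environment variable name has changed across versions):

```python
import os
from datetime import timedelta

import torch
import torch.distributed as dist

# Let the NCCL watchdog tear this rank down when a collective times out,
# instead of leaving it blocked forever (recent PyTorch uses
# TORCH_NCCL_ASYNC_ERROR_HANDLING; older releases used NCCL_ASYNC_ERROR_HANDLING).
os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

# The timeout bounds how long this rank waits for its peers.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=5))

# If a peer has already crashed, this AllReduce never completes; once the
# timeout expires the watchdog aborts the process with a non-zero exit code,
# which is the signal a job-level controller needs to restart the whole pod set.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
dist.barrier()
```

Failing fast only solves half of the problem; something still has to notice the exit and restart all of the related pods together, which is the gap the rest of the talk is about.
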
@@ -772,6 +804,18 @@ glowSeed: 100
772804

773805
</div>
774806

807+
<!--
808+
There must be something wrong.
809+
810+
Let me explain it more:
811+
812+
First, [click] the distribution algorithm is implemented purely by PyTorch, or by NCCL itself. It's hard to debug, trace, manage, and control.
813+
814+
Second, [click] unlike today's Kubernetes Operators, healing and orchestration are still hard to achieve: it's hard to auto-heal, auto-recover, and auto-mitigate. Obviously, we have no way to detect what's happening.
815+
816+
Third, [click] detecting failures of drivers, hardware, GPUs, or even the network is still a challenge. It's hard to know the root cause, hard to collect the needed NPD events & logs, and there is a lack of observability.
817+
-->
818+
775819
---
776820
class: py-10
777821
clicks: 5
@@ -831,14 +875,28 @@ clicks: 5
831875

832876
</div>
833877

878+
<!--
879+
Things don't stop there. There is actually more.
880+
881+
Remember how the node (or pod) went wrong? When we recover the training job, checkpoint files must be transferred too! However,
882+
883+
[click] Checkpoints are large. For example, Llama 2 has roughly 83GB of checkpoint files.
884+
885+
[click] The bandwidth of NFS, shared volumes, and RDMA is limited. Saving checkpoint files of 80GB and above requires high-speed IO to reduce the downtime.
886+
887+
[click] Mitigation requires transferring data across nodes. If one of the GPU nodes goes down, hundreds of GB of files must be transferred to another node.
888+
889+
So IO and storage are other challenges we need to face.
890+
-->
891+
834892
---
835893
class: py-10
836894
clicks: 2
837895
---
838896

839897
# Tune the factors
840898

841-
<span>Checkpoints, weights are more even critical</span>
899+
<span>Mathematically...</span>
842900

843901
<div mt-12 v-click="2">
844902

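Since checkpoint size and IO dominate the recovery time discussed above, the usual pattern is to write a single checkpoint from rank 0 and have every rank read it back on restart. This is a generic sketch; the function names and state layout are mine, not from the slides:

```python
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path):
    """Write one checkpoint from rank 0 only, so N ranks don't multiply the IO cost."""
    if dist.get_rank() == 0:
        state = {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        }
        # For large models this file is tens of GB, so the storage bandwidth
        # (NFS, shared volume, RDMA-backed filesystem) dominates the downtime.
        torch.save(state, path)
    dist.barrier()  # make sure the file is complete before anyone moves on

def load_checkpoint(model, optimizer, path):
    """Every rank reads the same file and maps the tensors onto its own GPU."""
    state = torch.load(path, map_location="cuda")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```
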
@@ -878,6 +936,14 @@ $$
878936

879937
</div>
880938

939+
<!--
940+
Let's sum it up a bit more "mathematically": here we illustrate the factors involved when dealing with distributed training.
941+
942+
[click] With them, we can derive the formula for the training time cost.
943+
944+
I know this is hard to understand at a glance, so let's simplify it.
945+
-->
946+
881947
---
882948
class: py-10
883949
---
@@ -940,6 +1006,18 @@ class: py-10
9401006
9411007
</div>
9421008

1009+
<!--
1010+
In a nutshell, there are three major factors that we can improve:
1011+
1012+
[click] Reduce diagnostic time
1013+
1014+
[click] Reduce reconcile time
1015+
1016+
[click] Speed up checkpoints
1017+
1018+
Eventually, we can reduce the total training time cost.
1019+
-->
1020+
9431021
---
9441022
class: py-10
9451023
---
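The exact formula from the slide is not reproduced in this diff, so the following is only an illustrative decomposition consistent with the three levers named above (diagnostic time, reconcile time, checkpoint speed); the symbols are assumptions, not the talk's notation.

```latex
% T_compute: useful training time; N_ckpt, T_ckpt_save: checkpointing overhead;
% N_fail: number of failures during the run.
T_{\text{total}} \;\approx\; T_{\text{compute}}
  \;+\; N_{\text{ckpt}} \cdot T_{\text{ckpt\_save}}
  \;+\; N_{\text{fail}} \cdot \left( T_{\text{diagnose}} + T_{\text{reconcile}} + T_{\text{ckpt\_restore}} \right)
```

Shrinking any of the three terms inside the parentheses, or speeding up checkpoint writes, shrinks the total.
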
@@ -981,6 +1059,18 @@ They managed to automate most things...
9811059
[^1]: [The Technology Behind BLOOM Training](https://huggingface.co/blog/bloom-megatron-deepspeed)
9821060
[^2]: [Training chronicles](https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md)
9831061

1062+
<!--
1063+
With that said, we now understand the issues. Let's take a look at some state-of-the-art blog posts, experiments, and research.
1064+
1065+
The first one is from BigScience.
1066+
1067+
[click] They encountered hardware issues, with a GPU failure frequency of 1-2 per week. Each time a GPU failed, they would lose 1.5h of training to the hardware crash.
1068+
1069+
[click] They described the same issue I showed you before in the failure simulation section: sometimes the training gets stuck despite a crashed process, and it won't quit.
1070+
1071+
[click] Fortunately, they finally managed to automate most things! Yeeey!
1072+
-->
1073+
9841074
---
9851075
class: py-10
9861076
---
@@ -1022,6 +1112,18 @@ After improvements...
10221112

10231113
[^1]: [Introducing Meta Llama 3: The most capable openly available LLM to date](https://ai.meta.com/blog/meta-llama-3/)
10241114

1115+
<!--
1116+
Ok, what about Meta? They trained the Llama 3 405B model on a massive 24,000-GPU cluster.
1117+
1118+
Surely!
1119+
1120+
[click] They developed an advanced new training stack that automates [click] error detection, [click] handling, and [click] maintenance to maximize GPU uptime.
1121+
1122+
[click] They managed to detect [click] silent data corruption, and developed a new system to [click] speed up checkpointing and rollback.
1123+
1124+
[click] The improvements are huge: they managed to [click] achieve an effective training time of more than 95%, and [click] increased the efficiency of Llama 3 training by ~three times compared to Llama 2.
1125+
-->
1126+
10251127
---
10261128
class: py-10
10271129
glow: right
@@ -1103,6 +1205,16 @@ glow: right
11031205

11041206
</div>
11051207

1208+
<!--
1209+
Who else has tried to solve these issues? Clearly everyone understands the challenges now. There are two projects that tackle the problems from two different perspectives.
1210+
1211+
[click] JobSet, [click] a Kubernetes SIG project. It is easy to extend; however, it cannot handle events from pods or perform log analysis, and it cannot perform periodic inspection.
1212+
1213+
[click] DLRover, [click] a trainer-oriented project. It's PyTorch-native and ready to use out of the box; however, it cannot perform periodic inspection, and it is not extensible to various frameworks & scenarios, since it's built for PyTorch as an extended trainer.
1214+
1215+
There are also some arXiv papers that you can read to learn more about the research. I put them here for your reference.
1216+
-->
1217+
11061218
---
11071219
class: flex justify-center items-center gap-20 px40 text-xl
11081220
---
@@ -1138,6 +1250,8 @@ Since we have spent our time on layering concepts and knowledges, let's see what
11381250
[click] Introducing Kcover.
11391251
11401252
[click] This is our simple install-to-go plugin solution that combines both NPD (Node Problem Detector) and an operator.
1253+
1254+
To learn more, here's Kebe.
11411255
-->
11421256

11431257
---
@@ -1340,9 +1454,9 @@ glowSeed: 230
13401454
</v-clicks>
13411455

13421456
<!--
1343-
[click] Once a training job is labeled, kcover will continuously analyze this information.
1344-
This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors, [click] as well as specific exit codes).
1345-
[click] If a problem is detected,
1457+
[click] Once a training job is labeled, kcover will continuously analyze this information.
1458+
This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors, [click] as well as specific exit codes).
1459+
[click] If a problem is detected,
13461460
[click] we will record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the job, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we will analyze network status, GPU hardware status, PCIE status, and kernel status to ensure that the system always operates at optimal conditions.
13471461
-->
13481462

@@ -1393,9 +1507,9 @@ metadata:
13931507
</div>
13941508

13951509
<!--
1396-
To start using kcover, you can initially install kcover onto your system with a few simple helm commands.
1510+
To start using kcover, you can install it on your system with a few simple Helm commands.
13971511
[click] You only need to execute the helm install command to install kcover on your cluster.
1398-
[click] Subsequently, when submitting training jobs, such as a PyTorchJob, you only need to set a label for the job.
1512+
[click] Subsequently, when submitting training jobs, such as a PyTorchJob, you only need to set a label for the job.
13991513
[click] This allows kcover to continuously monitor the job, ensuring that it can be quickly recovered after a failure without the need for manual intervention.
14001514
-->
14011515

@@ -1458,6 +1572,10 @@ class: py-10
14581572

14591573
</div>
14601574

1575+
<!--
1576+
There is still much work to do: for example, more advanced event analysis, more types of analysis, and more integrated solutions.
1577+
-->
1578+
14611579
---
14621580
class: py-10
14631581
---
@@ -1490,8 +1608,8 @@ class: py-10
14901608
</div>
14911609

14921610
<!--
1493-
The above discusses some of the current features and technical details of kcover.
1494-
[click] This project is now open source, and you can find it at here. We warmly welcome everyone to help us; suggestions or feedback are greatly appreciated.
1611+
The above discusses some of the current features and technical details of Kcover.
1612+
[click] This project is now open source, and you can find it here. We warmly welcome everyone to help us; suggestions or feedback are greatly appreciated.
14951613
-->
14961614

14971615
---
@@ -1519,6 +1637,17 @@ class: py-10
15191637
</div>
15201638
</div>
15211639

1640+
<!--
1641+
We couldn't have made it without the communities. We want to give a shout-out to the community and discuss the following improvements together.
1642+
1643+
Here is the list:
1644+
1645+
- We propose a universal trainer health check implementation for PyTorch.
1646+
- Let's work together to build better analysis and root-cause debugging on top of Kubernetes.
1647+
- Try to expose more observability metrics for tracing, logging, and monitoring.
1648+
- How about implementing a stateless negotiator layer on top of TensorFlow, PyTorch, and JAX?
1649+
-->
1650+
15221651
---
15231652
class: py-10
15241653
---

0 commit comments
