packages/2024-08-21-kubecon-hk/slides.md (+163 −34)
@@ -74,7 +74,7 @@ glowSeed: 205
<!--
Before we start, let's introduce ourselves.

-[click] We are the software engineers come from DaoCloud. We are primarily focusing on field where we will cohere [click] Kubernetes and AI workloads together.
+[click] We are software engineers from DaoCloud, a company well known for its investment in the open source and Kubernetes ecosystems. We now focus primarily on bringing [click] Kubernetes and AI workloads together.
-->

---
@@ -120,7 +120,7 @@ glowSeed: 205
<!--
As background, [click] Kebe Liu is one of the members of the Istio Steering Committee; while working on AI, he has also been focusing on cloud native, Istio, eBPF, and other areas in recent years.

-[click] Me, Fanshi Zhang, I am a software engineer at DaoCloud, focusing on AI and Kubernetes. I am also a contributor to the Kubernetes community.
+[click] Me, Fanshi Zhang, I am a software engineer at DaoCloud, focusing on AI and Kubernetes. I am also a contributor to the Kubernetes community, as well as a contributor to Go and Vue, and the founder of Nolebase, Guii, and Lintic.

Without further ado, let's jump right into the rabbit hole and see what we have prepared for you today.

-The very first one thing to get off is distributed training...
+The very first thing to kick off... is distributed training...
-->

---
@@ -280,7 +280,9 @@ class: py-10
</div>

<!--
-This is the "visualization" fundamental building block of every machine learning models. (or Hinton diagram if you like). With that, the concept of training is just splitting data into different slices and blocks (which we call batches), what to do? [click] we will then feed them into the [click] CPU or GPU hardware devices to do the computation as well as inference.
+This is the "visualization" of the fundamental building block of every machine learning model (or a Hinton diagram, if you like).
+
+With that, the concept of training is just splitting data into slices and blocks (which we call batches). Then what? [click] We feed them into the [click] CPU or GPU hardware devices to do the computation, as well as inference.
-->

---
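To make the note above concrete, here is a minimal, illustrative PyTorch sketch of "splitting data into batches and feeding them to a CPU or GPU device"; the toy dataset, model, and hyperparameters are invented for this example and are not from the slides.

```python
# Minimal sketch of minibatch training: slice the data into batches and feed
# each batch to the selected device (illustrative toy example).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy dataset: 1,024 samples with 16 features, split into minibatches of 64.
dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Linear(16, 2).to(device)            # the "building block" shown on the slide
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for inputs, labels in loader:                  # one iteration per minibatch
    inputs, labels = inputs.to(device), labels.to(device)
    loss = loss_fn(model(inputs), labels)      # forward pass on the device
    optimizer.zero_grad()
    loss.backward()                            # backward pass
    optimizer.step()
```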
@@ -362,9 +364,13 @@ glowSeed: 120
</div>

<!--
-Surely we know what is model and what was doing during training. [click] But in modern days, models are getting larger and larger, [click] they wouldn't be able to fit into a single instance of GPU. Therefore, to deal with the "too large" problem, [click] we will need to distribute them to multiple GPU clusters.
+Surely we now understand what a model is and what happens during training. Here come the challenges.
+
+[click] These days, models are getting larger and larger, and [click] they no longer fit into a single GPU. Therefore, to deal with the "too large" problem, [click] we need to distribute them across multiple GPU clusters.

-Ok, everything seems fine. More GPUs means faster training, right? Or is it? [click] It turns out the memory and power consumption will not be the only problems we will face, But also the failures.
+Ok, everything seems fine. More GPUs means faster training, right? Or is it?
+
+[click] It turns out that memory and power consumption are not the only problems we will face, but also failures.
-->

---
@@ -399,9 +405,9 @@ glowSeed: 368
</div>

<!--
-So, why do failures occur?
+Ok, so... why do failures occur?

-Before we get into the rabbit hole any further, let's take a look at the common [click] hardware failures, [click] network issues, [click] or even software bugs.
+Before we dive much deeper, let's take a step back and look at the common issues: [click] hardware failures, [click] network issues, [click] and software bugs.
-->

---
@@ -419,19 +425,6 @@ class: py-10
<div flex flex-col>

-<v-clicks>
-
-```txt {|4,6-9}
-[ 4254.197816] NVRM: GPU at PCI:0000:5d:00: GPU-f1906b9b-557a-e961-045c-9fe4be3ce012
-[ 4254.197854] NVRM: GPU Board Serial Number: 1653923026510
-[ 4254.197860] NVRM: Xid (PCI:0000:5d:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
-[ 4254.197871] NVRM: GPU 0000:5d:00.0: GPU has fallen off the bus.
-[ 4254.197878] NVRM: GPU 0000:5d:00.0: GPU serial number is 1653923026510.
-[ 4254.197913] NVRM: A GPU crash dump has been created. If possible, please run
-               NVRM: nvidia-bug-report.sh as root to collect this data before
-               NVRM: the NVIDIA kernel module is unloaded.
-```
-
```txt {|3,4-5}
[14387.209961] NVRM: The NVIDIA GPU 0000:5d:00.0
               NVRM: (PCI ID: 10de:2330) installed in this system has
@@ -444,12 +437,23 @@ class: py-10
[14387.573380] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 525.125.06 Tue May 30 04:58:48 UTC 2023
```

-</v-clicks>
+```txt {|4,6-9}
+[ 4254.197816] NVRM: GPU at PCI:0000:5d:00: GPU-f1906b9b-557a-e961-045c-9fe4be3ce012
+[ 4254.197854] NVRM: GPU Board Serial Number: 1653923026510
+[ 4254.197860] NVRM: Xid (PCI:0000:5d:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
+[ 4254.197871] NVRM: GPU 0000:5d:00.0: GPU has fallen off the bus.
+[ 4254.197878] NVRM: GPU 0000:5d:00.0: GPU serial number is 1653923026510.
+[ 4254.197913] NVRM: A GPU crash dump has been created. If possible, please run
+               NVRM: nvidia-bug-report.sh as root to collect this data before
+               NVRM: the NVIDIA kernel module is unloaded.
+```

</div>

<!--
+Let's take a look at this log, which we captured by running `dmesg` to inspect the syslog. [click] We can see that the GPU has fallen off the bus, [click] and that the NVIDIA probe routine failed for 1 device(s).
+
+These are the common issues we face when dealing with GPUs and PCIe, from the kernel's perspective.
-->

---
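As a hedged illustration of what automating "run `dmesg` and look for Xid errors" could look like (this is not from the slides and not an existing NPD plugin; the regex and output format are assumptions for the sketch), here is a short script that scans the kernel log for NVIDIA Xid events:

```python
# Sketch: scan dmesg output for NVIDIA Xid events (illustrative; production detectors
# such as Node Problem Detector plugins are considerably more robust than this).
import re
import subprocess

XID_PATTERN = re.compile(r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<code>\d+),")

def find_xid_events() -> list[dict]:
    output = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
    events = []
    for line in output.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append({"pci": match.group("pci"), "xid": int(match.group("code")), "raw": line})
    return events

if __name__ == "__main__":
    for event in find_xid_events():
        # Xid 79 is the "GPU has fallen off the bus" case shown in the log above.
        print(f"GPU {event['pci']} reported Xid {event['xid']}: {event['raw']}")
```

Reading the kernel ring buffer usually requires elevated privileges, which is one reason checks like this tend to live in a privileged node agent rather than in the training pod itself.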
@@ -469,7 +473,7 @@ class: py-10

<v-clicks>

-```txt {|5,10-13}
+```txt {5,10-13}
node-1:185:1027 [7] NCCL INFO [Service thread] Connection closed by localRank 0
node-1:180:1028 [2] NCCL INFO [Service thread] Connection closed by localRank 0
node-1:184:1030 [6] NCCL INFO [Service thread] Connection closed by localRank 0
@@ -494,7 +498,7 @@ NET/IB : Got completion from peer 10.42.0.2<47534> with error 5, opcode 48, len
</div>

<!--
-
+This is another one, related to NCCL, captured during some training experiments with PyTorch. [click] We can see that the connection was closed by localRank 0, and that the NCCL watchdog thread terminated with an exception: NCCL error: remote process exited or there was a network error.
-->

---
@@ -538,7 +542,7 @@ RuntimeError: expected scalar type BFloat16 but found Float
</div>

<!--
-
+And the last one! [click] This is a software bug captured during the training process. [click] We can see that the error was caused by a dtype mismatch: the kernel expected scalar type BFloat16 but found Float.
-->

---
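As a hedged sketch of how this class of bug typically arises (an illustration, not the code from the actual training run; the exact error message wording varies between PyTorch versions):

```python
# Sketch: how a BFloat16/Float dtype mismatch typically shows up (illustrative only).
import torch
from torch import nn

layer = nn.Linear(8, 8).to(torch.bfloat16)   # weights converted to bfloat16
x = torch.randn(2, 8)                        # activations still float32

try:
    layer(x)                                 # mixing dtypes raises a RuntimeError
except RuntimeError as err:                  # message wording depends on the PyTorch version,
    print(err)                               # e.g. "expected scalar type BFloat16 but found Float"

# One way to fix it: keep inputs and weights in the same dtype, or let autocast
# handle the casting consistently around the matmul.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = nn.Linear(8, 8)(x)
print(out.dtype)                             # torch.bfloat16
```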
@@ -570,7 +574,21 @@ glow: right
</v-clicks>

<!--
-[click] Instead of a normal deployment, distributed training jobs are more like a StatefulSet. [click] When bootstrapping, the main node (rank 0) will be the first to start, [click] then negotiate with other nodes (rank != 0) to join the training through NCCL. While both [click] calculate topology, [click] calculate connectivity, [click] calculate bandwidth, etc. [click] Once everyone is ready, minibatch will be calculated and sent to each node. [click] During training, every step, or epoch, a Ring AllReduce or AllReduce operation will be performed across the nodes.
+Hmm, so where are the so-called "irreversible" issues?
+
+Sorry for spending so many slides building up the fundamentals; please allow me to explain how distributed training works in PyTorch with one additional slide.
+
+Ok...
+
+[click] Instead of a normal Deployment, distributed training jobs behave more like a StatefulSet.
+
+[click] When bootstrapping, the main node (rank 0) is the first to start, [click] and then it negotiates with the other nodes (rank != 0) to join the training through NCCL.
+
+Meanwhile, they both [click] calculate topology, [click] calculate connectivity, [click] calculate bandwidth, etc.
+
+[click] Once everyone is ready, minibatches are calculated and sent to each node.
+
+[click] During training, at every step, or epoch, a Ring AllReduce or AllReduce operation is performed across the nodes.
-->

---
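For reference, here is a hedged, minimal sketch of the bootstrap flow the note describes, using PyTorch DistributedDataParallel over NCCL; the model, data, and step count are placeholders, and the script assumes it is launched by `torchrun`, which provides the RANK/LOCAL_RANK/WORLD_SIZE environment variables.

```python
# Sketch: rank-0 rendezvous, NCCL negotiation, and per-step AllReduce via DDP.
# Intended to be launched with `torchrun --nproc-per-node=<gpus> train.py` (illustrative only).
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    # torchrun exports RANK / LOCAL_RANK / WORLD_SIZE; rank 0 hosts the rendezvous.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(nn.Linear(16, 16).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for _ in range(10):                        # each step trains on this rank's minibatch
        inputs = torch.randn(32, 16, device=local_rank)
        loss = model(inputs).square().mean()
        optimizer.zero_grad()
        loss.backward()                        # DDP performs the (Ring) AllReduce on gradients here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

An operator-managed launcher (or plain torchrun) starts one such process per GPU, which is why the whole group behaves like a StatefulSet rather than a stateless Deployment.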
@@ -686,7 +704,21 @@ clicks: 7
<div mt-12 />

<!--
+That's a lot of words. Don't worry, we've prepared a simulation animation to explain one of the toughest known issues in distributed training.
+
+Let's look at the overview first.
+
+[click] Say we have a distributed training job running on a GPU cluster. [click] Pay attention to how the error propagates from one node to another...
+
+[click] For the nodes in the cluster running distributed training workloads, [click] the main node (rank 0) is the first to start, then it negotiates with the other nodes (rank != 0) to join the training through NCCL.
+
+[click] Once everything is ready, training starts across the nodes.
+
+[click] However...... one of the nodes (or pods) hits some critical issue due to NCCL or a GPU failure. [click] Now the interesting part kicks in: see how the error propagates from one node to another?
+
+That means that when one node fails in NCCL-based distributed training, every node is waiting on the others to respond; while NCCL is hanging, nothing gets done, everyone is waiting.
+
+Ok, how can we resolve this issue? We need to kill ALL the related nodes (or pods) and restart them.
-->

---
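One common mitigation, sketched below under the assumption of a recent PyTorch release (the environment variable is spelled `NCCL_ASYNC_ERROR_HANDLING` on older versions and `TORCH_NCCL_ASYNC_ERROR_HANDLING` on newer ones), is to bound the collective timeout so a dead peer surfaces as an exception instead of an indefinite hang:

```python
# Sketch: make NCCL collectives fail fast instead of hanging forever when a peer dies
# (illustrative; check the env var spelling for your PyTorch version).
import os
from datetime import timedelta

import torch.distributed as dist

os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")

# A bounded timeout turns a silent hang into an exception that an external controller
# (for example a Kubernetes operator like kcover) can observe and act on, e.g. by
# restarting all related pods so training can resume from the last checkpoint.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```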
@@ -772,6 +804,18 @@ glowSeed: 100
</div>

+<!--
+There must be something wrong.
+
+Let me explain it a bit more:
+
+First, [click] the distribution algorithm is implemented purely by PyTorch, or by NCCL itself. It's hard to debug, trace, manage, and control.
+
+Second, [click] unlike today's Kubernetes Operators, healing and orchestration are still hard to achieve. It's hard to auto-heal, auto-recover, and auto-mitigate. Obviously, we have no way to detect what's happening.
+
+Third, [click] detecting failures of drivers, hardware, GPUs, or even the network is still a challenge. It's hard to find the root cause or collect the needed NPD events and logs, and there is a lack of observability.
+-->
+

---
class: py-10
clicks: 5
@@ -831,14 +875,28 @@ clicks: 5
</div>

+<!--
+Things didn't stop there. There is actually more.
+
+Remember how the node (or pod) went wrong? When we recover the training job, checkpoint files must be transferred too! However,
+
+[click] Checkpoints are large. For example, Llama 2 has roughly 83 GB of checkpoint files.
+
+[click] The bandwidth of NFS, shared volumes, and RDMA is limited. Saving checkpoint files of 80 GB and above requires high-speed IO to reduce the downtime.
+
+[click] Mitigation requires transferring them across nodes. If one of the GPU nodes goes down, hundreds of GB of files must be transferred to another node.
+
+So IO and storage are other challenges we need to face.
+-->
+

---
class: py-10
clicks: 2
---

# Tune the factors

-<span>Checkpoints, weights are more even critical</span>
+<span>Mathematically...</span>

<div mt-12 v-click="2">
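As a hedged sketch of the kind of checkpointing the note refers to (illustrative only; the path, layout, and shared-volume assumption are made up, and real jobs often use sharded or asynchronous checkpointing to cut the IO cost):

```python
# Sketch: periodic rank-0 checkpointing and resume over a shared volume (illustrative).
import os

import torch
import torch.distributed as dist

CKPT_PATH = "/mnt/shared/ckpt/latest.pt"      # e.g. an NFS / shared volume mount (assumed path)

def save_checkpoint(model, optimizer, step: int) -> None:
    if dist.get_rank() == 0:                  # only one rank writes the (potentially huge) file
        torch.save(
            {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
            CKPT_PATH,
        )
    dist.barrier()                            # everyone waits until the file is fully written

def load_checkpoint(model, optimizer) -> int:
    if not os.path.exists(CKPT_PATH):
        return 0                              # fresh start, nothing to resume
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]                      # resume from the last saved step
```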
@@ -878,6 +936,14 @@
</div>

+<!--
+Let's sum it up a bit more "mathematically". These are the factors we illustrated for dealing with distributed training.
+
+[click] With them, we can derive a formula for the training time cost.
+
+I know this is hard to grasp at a glance, so let's simplify it.
+-->
+

---
class: py-10
---
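The slide's actual formula is not visible in this diff, so here is a rough, assumed sketch of the kind of cost model the note describes, using the same factors the next slide simplifies into diagnosis, reconciliation, and checkpoint time:

```latex
$$
T_{\text{total}} \;\approx\;
\underbrace{T_{\text{compute}}}_{\text{useful training}}
\;+\; \underbrace{\frac{T_{\text{compute}}}{t_{\text{ckpt}}}\, T_{\text{save}}}_{\text{checkpointing overhead}}
\;+\; \underbrace{N_{\text{fail}} \left( T_{\text{diagnose}} + T_{\text{reconcile}} + T_{\text{restore}} + \tfrac{t_{\text{ckpt}}}{2} \right)}_{\text{cost per failure}}
$$
```

Here $t_{\text{ckpt}}$ is the checkpoint interval, so $t_{\text{ckpt}}/2$ is the expected work lost and recomputed per failure; shrinking $T_{\text{diagnose}}$, $T_{\text{reconcile}}$, and the checkpoint terms is exactly what the following slides target.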
@@ -940,6 +1006,18 @@ class: py-10
</div>

+<!--
+In a nutshell, there are three major factors that we can improve:
+
+[click] Reduce diagnostic time
+
+[click] Reduce reconcile time
+
+[click] Speed up checkpoints
+
+Eventually, we can reduce the total training time cost.
+-->
+

---
class: py-10
---
@@ -981,6 +1059,18 @@ They managed to automate most things...

+<!--
+With that said, we've understood the issues. Let's take a look at some of the state-of-the-art blog posts, tryouts, and research.
+
+The first one is from BigScience.
+
+[click] They encountered hardware issues, where the frequency of GPU failures was 1-2 a week. Each time a GPU failed, they would lose 1.5h of training to the hardware crash.
+
+[click] They described the same issue I showed you before in the failure simulation section. Sometimes the training gets stuck despite a crashing process, and it won't quit.
+
+[click] Fortunately, they finally managed to automate most things! Yeeey!
+-->
+

---
class: py-10
---
@@ -1022,6 +1112,18 @@ After improvements...

[^1]: [Introducing Meta Llama 3: The most capable openly available LLM to date](https://ai.meta.com/blog/meta-llama-3/)

+<!--
+Ok, what about Meta? They trained the Llama 3 405B model on a massive 24,000 GPU cluster.
+
+Surely!
+
+[click] They developed an advanced new training stack that automates [click] error detection, [click] handling, and [click] maintenance to maximize GPU uptime.
+
+[click] They managed to detect [click] silent data corruption, and developed a new system to [click] speed up checkpointing and rollback.
+
+[click] The improvements are huge: they managed to [click] achieve more than 95% effective training time, and [click] increased the efficiency of Llama 3 training by ~three times compared to Llama 2.
+-->
+

---
class: py-10
glow: right
@@ -1103,6 +1205,16 @@ glow: right
</div>

+<!--
+Who else has tried to solve these issues? Clearly everyone understands the challenges now. There are two projects that tackled the problems from two different perspectives.
+
+[click] JobSet, [click] a Kubernetes SIG project. It is easy to extend; however, it cannot handle events from pods, does not do log analysis, and cannot perform periodic inspection.
+
+[click] DLRover, [click] a trainer-oriented project. It's PyTorch native and ready to use out of the box; however, it cannot perform periodic inspection, and it is not extensible to various frameworks and scenarios since it's built for PyTorch, as an extended trainer.
+
+There are some arXiv papers that you can read to learn more about the research. I put them here for your reference.
@@ -1138,6 +1250,8 @@ Since we have spent our time on layering concepts and knowledges, let's see what
[click] Introducing Kcover.

[click] This is our simple install-to-go plugin solution that combines both NPD (Node Problem Detector) and an operator.
+
+To learn more, here's Kebe.
-->

---
@@ -1340,9 +1454,9 @@ glowSeed: 230
</v-clicks>

<!--
-[click] Once a training job is labeled, kcover will continuously analyze this information.
-This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors, [click] as well as specific exit codes).
-[click] If a problem is detected,
+[click] Once a training job is labeled, kcover will continuously analyze this information.
+This includes [click] node status, [click] container logs (such as CUDA, NCCL, or OOM errors, [click] as well as specific exit codes).
+[click] If a problem is detected,
[click] we will record the event through the Collector [click] and may initiate a Cascading Shutdown to restart the job, allowing it to resume training from the last known state. [click] Additionally, through ongoing diagnostic tools, we will analyze network status, GPU hardware status, PCIE status, and kernel status to ensure that the system always operates at optimal conditions.
-->
@@ -1393,9 +1507,9 @@ metadata:
</div>

<!--
-To start using kcover, you can initially install kcover onto your system with a few simple helm commands.
+To start using kcover, you can initially install kcover onto your system with a few simple helm commands.
[click] You only need to execute the helm install command to install kcover on your cluster.
-[click] Subsequently, when submitting training jobs, such as a PyTorchJob, you only need to set a label for the job.
+[click] Subsequently, when submitting training jobs, such as a PyTorchJob, you only need to set a label for the job.
[click] This allows kcover to continuously monitor the job, ensuring that it can be quickly recovered after a failure without the need for manual intervention.
-->
@@ -1458,6 +1572,10 @@ class: py-10
</div>

+<!--
+There is much work still to do, for example, more advanced event analysis, more types of analysis, and more integrated solutions.
+-->
+

---
class: py-10
---
@@ -1490,8 +1608,8 @@ class: py-10
</div>

<!--
-The above discusses some of the current features and technical details of kcover.
-[click] This project is now open source, and you can find it at here. We warmly welcome everyone to help us; suggestions or feedback are greatly appreciated.
+The above discusses some of the current features and technical details of Kcover.
+[click] This project is now open source, and you can find it here. We warmly welcome everyone to help us; suggestions or feedback are greatly appreciated.
-->

---
@@ -1519,6 +1637,17 @@ class: py-10
</div>
</div>

+<!--
+We couldn't have made it without the communities. We want to give a shout-out to the community and propose the following improvements to discuss.
+
+Here is the list:
+
+- We propose a universal trainer health check implementation for PyTorch.
+- Let's build better analysis and root-cause debugging on top of Kubernetes, together.
+- Try to expose more observability metrics for tracing, logging, and monitoring.
+- How about implementing a stateless negotiator layer on top of TensorFlow, PyTorch, and JAX?