@@ -1256,11 +1256,30 @@ class: py-10

---
class: py-10
+ glow: right
+ glowSeed: 230
---

# Why can it?

- <span >How it was implemented</span >
+ <span >Summing up the architecture</span >
+
+ <v-clicks depth =" 2 " >
+
+ - Once labelled, we watch the stopped <span text-sky-400 ><div inline-block i-carbon:cube translate-y-0.8 mr-1 />` Pod ` </span > and analyze:
+   - <span text-violet-300 ><div inline-block i-carbon:cloud-alerting translate-y-0.8 mr-2 />Node issues</span >
+   - <span text-purple-300 ><div inline-block i-carbon:ibm-open-enterprise-languages translate-y-0.8 mr-2 />Logs</span > (e.g. <span text =" [#64b023] " ><div inline-block translate-y-0.8 mr-1 i-bi:nvidia />CUDA</span >, <span text =" [#64b023] " ><div inline-block translate-y-0.8 mr-1 i-bi:nvidia />cuDNN</span >, <span text =" [#64b023] " ><div inline-block translate-y-0.8 mr-1 i-bi:nvidia />NCCL</span >, ` OOM ` errors)
+   - <span text-pink-300 ><div inline-block i-carbon:exit translate-y-0.8 mr-2 />Exit codes</span >
+ - Once an issue is identified:
+   - <span text-purple-300 ><div inline-block i-carbon:flow-stream-reference translate-y-0.8 mr-2 />an event is recorded</span > (e.g. container logs, syscalls)
+   - <span text-pink-300 >a cascading shutdown is triggered</span > (which results in the job being restarted by the <div i-devicon:kubernetes inline-block translate-y-0.5 mr-2 /><span text =" [#5791f7] " >Controller & Operator</span >)
+ - For continuous diagnostics, <span text =" [#64b023] " ><div inline-block translate-y-0.8 mr-1 i-bi:nvidia />` dcgmi ` </span >, <span text =" [#64b023] " ><div inline-block translate-y-0.8 mr-1 i-bi:nvidia />` nvidia-smi ` </span >, and <span text =" [#64b023] " ><div inline-block translate-y-0.8 mr-1 i-bi:nvidia />` nccl-test ` </span > are executed periodically to check:
+   - <span text-purple-300 ><div inline-block i-carbon:flow-stream-reference translate-y-0.8 mr-2 />Network & IO connectivity & throughput</span >
+   - <span text-indigo-300 ><div inline-block i-bi:gpu-card translate-y-0.8 mr-2 />GPU & VRAM health</span >
+   - <span text-blue-300 ><div inline-block i-carbon:fusion-blender translate-y-0.8 mr-2 />PCIe status</span >
+   - <span text-sky-300 ><div inline-block i-carbon:edge-node translate-y-0.8 mr-2 />Kernel modules status</span >
+
+ </v-clicks >
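
The log and exit-code analysis described on this slide could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the `diagnose` function, the `Diagnosis` type, the category names, and the regular expressions are all hypothetical, and a real deployment would need patterns tuned to its own workloads. The only fixed facts used are that exit code 137 corresponds to SIGKILL (128 + 9) and 139 to SIGSEGV (128 + 11).

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical log patterns, checked in order; the first match wins.
LOG_PATTERNS = [
    ("oom", re.compile(r"out of memory|\bOOM", re.IGNORECASE)),
    ("nccl", re.compile(r"NCCL (WARN|error)|ncclSystemError|ncclInternalError")),
    ("cudnn", re.compile(r"CUDNN_STATUS_\w+")),
    ("cuda", re.compile(r"CUDA error", re.IGNORECASE)),
]

# Exit codes above 128 mean the process died from signal (code - 128).
EXIT_CODE_HINTS = {
    137: "killed by SIGKILL (often an OOM kill)",
    139: "segmentation fault (SIGSEGV)",
}


@dataclass
class Diagnosis:
    category: str  # hypothetical label such as "oom" or "nccl"
    evidence: str  # the log line or exit-code hint that triggered it


def diagnose(exit_code: int, log_lines: List[str]) -> Optional[Diagnosis]:
    """Return a best-guess diagnosis for a stopped container, or None."""
    # Logs are more specific than exit codes, so inspect them first.
    for line in log_lines:
        for category, pattern in LOG_PATTERNS:
            if pattern.search(line):
                return Diagnosis(category, line.strip())
    if exit_code in EXIT_CODE_HINTS:
        return Diagnosis("exit-code", EXIT_CODE_HINTS[exit_code])
    if exit_code != 0:
        return Diagnosis("unknown", f"exit code {exit_code}")
    return None  # clean exit with no suspicious log lines
```

Checking logs before exit codes is a design choice made for this sketch: a message such as `CUDA out of memory` points at a cause, whereas exit code 137 only says the process was killed.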

---
class: py-10
@@ -1312,37 +1331,6 @@ metadata:
class: py-10
---

- # Let's build it together
-
- <span >Open sourced, already</span >
-
- <div flex >
- <div
- v-click="1" flex flex-col items-start transition duration-500 ease-in-out
- :class="$clicks < 1 ? 'translate-x--20' : 'translate-x-0'"
- >
- <div mt-10 flex gap-16>
- <img src="/kcover-repository-qr.png" w-60 />
- <div text-2xl flex items-center gap-2 mt-4>
- <div i-ri:github-fill /> <span underline decoration-dashed font-mono decoration-zinc-300 >BaizeAI/kcover</span >
- </div>
- </div>
- </div>
- </div>
-
- <div w-full absolute bottom-0 left-0 flex items-center transform =" translate-x--10 translate-y--10 " >
- <div w-full flex items-center justify-end gap-4 >
- <img src="/KubeCon.png" h-10>
- <img src="/CloudNativeCon.png" h="10.1">
- <img src="/OpenSourceSummit.png" h-9>
- <img src="/AI_dev.png" h-4>
- </div >
- </div >
-
- ---
- class: py-10
- ---
-
# Futures

<span >Foresight from our perspective</span >
@@ -1402,6 +1390,37 @@ class: py-10
class: py-10
---

+ # Let's build it together
+
+ <span >Open sourced, already</span >
+
+ <div flex >
+ <div
+ v-click="1" flex flex-col items-start transition duration-500 ease-in-out
+ :class="$clicks < 1 ? 'translate-x--20' : 'translate-x-0'"
+ >
+ <div mt-10 flex gap-16>
+ <img src="/kcover-repository-qr.png" w-60 />
+ <div text-2xl flex items-center gap-2 mt-4>
+ <div i-ri:github-fill /> <span underline decoration-dashed font-mono decoration-zinc-300 >BaizeAI/kcover</span >
+ </div>
+ </div>
+ </div>
+ </div>
+
+ <div w-full absolute bottom-0 left-0 flex items-center transform =" translate-x--10 translate-y--10 " >
+ <div w-full flex items-center justify-end gap-4 >
+ <img src="/KubeCon.png" h-10>
+ <img src="/CloudNativeCon.png" h="10.1">
+ <img src="/OpenSourceSummit.png" h-9>
+ <img src="/AI_dev.png" h-4>
+ </div >
+ </div >
+
+ ---
+ class: py-10
+ ---
+
# To community

<span >Let's improve it together</span >