Releases: Project-HAMi/HAMi
v2.7.1
What's Changed
🐛 Bug Fixes
Major:
- Update HAMi-core to fix vLLM-related issues #1381 and #1461 by @archlitchi in #1478
- Fix: Calculation error for quotas by @luohua13 in #1400
Others
- Fix release CI by @archlitchi in #1373
- Fix: failed clusterrolebinding when change release name or chart name by @FouoF in #1380
- fix: e2e ginkgo version mismatch by @FouoF in #1391
- fix: check pod nil in ReleaseNodeLock by @DSFans2014 in #1372
- fix: upgrade nvidia-mig-parted to v0.12.2 to solve security issues by @Shouren in #1388
- fix: scheduler flaky test by @FouoF in #1402
- Fix: After removing the device plugin from the gpu node, it can still… by @luohua13 in #1456
- Fix concurrent map iteration and map write fatal error. by @litaixun in #1452
- fix: fix typos by @DSFans2014 in #1434
- Fix CI error of the PR #1470, #1326, #1033 by @archlitchi in #1473
- Fix concurrent map read write fatal error. by @litaixun in #1476
- add podInfos in DeviceUsage to enhance scheduling decision by @Kyrie336 in #1362
- Update device-numa acquisition logic by @archlitchi in #1403
- Improved support for iluvatar GPUs by @qiangwei1983 in #1399
- Improve: Replace StrategicMergePatchType by MergePatchType by @luohua13 in #1431
- optimize schedule failure event by @Kyrie336 in #1444
- Release v2.7.1 by @archlitchi in #1480
New Contributors
- @luohua13 made their first contribution in #1400
- @qiangwei1983 made their first contribution in #1399
- @eltociear made their first contribution in #1412
- @daixiang0 made their first contribution in #1465
- @zhegemingzimeibanquan made their first contribution in #1419
Full Changelog: v2.7.0...v2.7.1
v2.7.0
What's Changed
✨ Key Features
- Metax sGPU topology aware by @Kyrie336 in #1193
- NVIDIA Resourcequota by @FouoF in #1359
- Kunlunxin topology-aware scheduling by @FouoF in #1141
- Kunlunxin vxpu support #1016 by @ouyangluwei163 @archlitchi in #1337
- Enflame GCU topology-awareness (#1040) by @zhaikangqi331 in #1334
- AWS-neuron device and device-core allocation by @archlitchi in #1238
- Aggregated Scheduling Failure Events by @Wangmin362 in #1333
✨ Other Features
- Optimize Fit-in-device logic to make it device-specific by @archlitchi in #1097
- feat(scheduler): make node lock timeout configurable by @Kevinz857 in #1117
- feature: mig mode-change #1116 by @ouyangluwei163 in #1124
- feat: Add new labels in .github/release.yml by @Shouren in #1066
- feat(scheduler-role): use a scoped-down role for scheduler by @Antvirf in #1152
- feat(helm): optionally disable admission webhook by @Antvirf in #1145
- remove redundant metrics for vgpu allocation by @FouoF in #1169
- refactor: clean up code and improve maintainability by @Wangmin362 in #1195
- refactor: Ranging over SplitSeq is more efficient by @Shouren in #1239
- feat: NodeLockTimeout set from env by @miaobyte in #1244
- refactor: move watchAndFeedback function to feedback.go by @miaobyte in #1248
- feat: add informer-based pod cache to reduce API server load by @miaobyte in #1250
- feat: Add option to disable device plugin at values.yaml. by @FouoF in #1274
- refactor(util/nodelock): replace manual polling with k8s.io/client-go/util/retry by @mayooot in #1252
- refactor: Remove annotation in Devices interfaces by @Shouren in #1343
- feat: update the Ascend910 scheduling policy by @DSFans2014 in #1344
- feat(nvidia): default gpucores=100 when memory is exclusive and cores… by @xrwang8 in #1354
🐛 Bug Fixes
- fix: Before executing MIG partitioning, suppress NVML usage in o… by @Goend in #1095
- Fix golint-CI by @archlitchi in #1127
- fix: override node score failure for kunlun #1137 by @ouyangluwei163 in #1138
- fix: Multi-node scoring nodes are inaccurate by @ouyangluwei163 in #1147
- fix: An error occurred while creating an Iluvatar pod by @ouyangluwei163 in #1149
- Fix e2e CI by @archlitchi in #1165
- fix: Add option for overwrite schedulerName by @Shouren in #1163
- fix: using go-safecast to fix incorrect conversion of numbers by @Shouren in #1183
- fix: deal with security issues reported by Trivy in image by @Shouren in #1189
- fix: wrong Pod's UID and empty Pod's name in log of webhook.go by @Shouren in #1092
- fix: concurrent map writes error in scheduler.calcScore #1269 by @Shouren in #1270
- fix: release dangling node lock by @peachest in #1271
- fix: retrieval of incorrect NUMA node information (issue #1275) by @abstractmj in #1276
- fix(security): resolve issues reported by Code scanning in Security by @Shouren in #1280
- fix: fix golangci-lint error by @DSFans2014 in #1319
- Fix: device allocation missing containers with no device request by @FouoF in #1299
- fix: update int8Slice to uint8Slice for better type clarity and consistency by @yxxhero in #1357
📚 Documentation
- documentation: add Known Issues for dynamic mig support by @Goend in #1122
- docs: fix broken link by @lixd in #1125
- clearly list supported devices doc references at README by @FouoF in #1155
- docs: update ascend910b-support docs by @DSFans2014 in #1321
🔨 Other Changes
- Prerelease-v2.6 by @archlitchi in #1108
- add new reviewers Shouren and ouyangluwei163 by @wawa0210 in #1131
- Support topology-awareness for Kunlunxin device by @archlitchi in #1121
- Support Metax sGPU Qos Policy by @Kyrie336 in #1123
- add global image for chart by @calvin0327 in #1133
- fix: Skip admission webhook when Pod's scheduler is already assigned. by @ghostloda in #1041
- Add node configs to docs by @wylswz in #1159
- build(deps): upgrade golang to 1.24.4 by @Shouren in #1172
- build(deps): Upgrade golang image in ci to 1.24.4 by @Shouren in #1176
- build(deps): Upgrade controller-runtime to 0.21.0 by @Shouren in #1171
- build(deps): Dump github.com/NVIDIA/nvidia-container-toolkit by @Shouren in #1170
- Add unit tests for Fit Function for enflame, hygon, metax, mthreads, nvidia by @Wangmin362 in #1199
- [Misc] update hami-core version by @chaunceyjiang in #1201
- Improve the impl of DevicePluginConfigs.Nodeconfig overwriting NvidiaConfig by @FouoF in #1158
- Add unit tests for cambricon's Fit Function by @Wangmin362 in #1198
- Add unit tests for Ascend's Fit Function by @Wangmin362 in #1197
- Fix unnecessary repeated computation when generating pod resource requests by @litaixun in #1215
- Fix the log message shown when updating node annotations by @litaixun in #1214
- If the memory requested for the MIG device is the same as the template value, it will result in CardNotFoundCustom Filter Rule. by @zgqqiang in #1179
- updated dri section to combine text for better readability by @mpetason in #1216
- feat: Add nvidia gpu topology scheduler by @fyp711 in #1028
- add issue translate robot by @wawa0210 in #1232
- add issue translate robot by @wawa0210 in #1234
- perf(util/nodelock): Use clientset Patch instead of Update. by @mayooot in #1192
- Update hami-core and fix readme documents by @archlitchi in #1240
- Update hami-core version to fix by @archlitchi in #1256
- [Snyk] Security upgrade tensorflow/tensorflow from latest-gpu to 2.20.0rc0-gpu by @wawa0210 in #1243
- feat: Add an action of 'Close stale issue and PRs' in github workflow by @Shouren in #1083
- Welcome fyp711 to become a HAMi member by @wawa0210 in #1288
- Add values readme by @clcc2019 in #1267
- Support Metax sGPU device health check by @Kyrie336 in #1295
- Optimize pkg/util.go and distribute logics to corresponding logics by @archlitchi in #1296
- cleanup: Clear and correct ascend device name by @FouoF in #1315
- bugfix: Nvidia card abnormal pod will still continue to schedule by @zgqqiang in #1336
- Fix CI, add 910B4-1 template and fix vGPUmonitor metrics error by @archlitchi in #1345
- add httpTargetPort to values.yaml by @flpanbin in #1356
- Update kunlunxin documents by @archlitchi in #1366
- update chart version and hami-core by @archlitchi in #1369
v2.5.3
What's Changed
🔨 Other Changes
- Release v2.5.1 - fix e2e workflow by @archlitchi in #1037
- Release v2.5.2 by @archlitchi in #1080
Full Changelog: v2.5.2...v2.5.3
v2.6.1
Full Changelog: v2.6.0...v2.6.1
v2.6.0
Key feature:
- Optimize scheduler log
- Support enflame gcu-share
- Support metax GPU and metax sGPU
- Helm chart adds a checksum annotation to restart HAMi components after ConfigMap modification
- Support for using RuntimeClass with nvidia devices
- Add support for profiling via net/http/pprof package
- Add nvidia gpu topology score registry to node
- Feat: vGPUmonitor support MigInfo metrics
Bug fix:
- Fix hang on driver 570+
- Fix device memory not counted properly in comfyUI task
- Fix cambricon devices not allocated properly
- Fix wrong log and container request device count error
- Fix inconsistent vgpu-devices-allocated annotations
- Fix removing node devices from node manager
- Fix: Dynamic GPU partitioning lacks single-GPU-level granularity
- Fix device memory count error on cuMallocAsync
- Fix scheduler crash if a 'mig' task running accidentally on a 'hami-core' GPU
- Fix multi-process device memory count
What's Changed
⬆️ Dependencies
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by @dependabot in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by @dependabot in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by @dependabot in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by @dependabot in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by @dependabot in #792
- Bump golang.org/x/net from 0.26.0 to 0.33.0 by @dependabot in #839
- Bump docker/build-push-action from 6.11.0 to 6.13.0 by @dependabot in #837
- Bump golang.org/x/net from 0.26.0 to 0.35.0 by @dependabot in #859
- Bump aquasecurity/trivy-action from 0.29.0 to 0.30.0 by @dependabot in #941
- Bump docker/login-action from 3.3.0 to 3.4.0 by @dependabot in #942
- Bump docker/build-push-action from 6.13.0 to 6.15.0 by @dependabot in #899
- build(deps): bump docker/build-push-action from 6.15.0 to 6.16.0 by @dependabot in #1024
- build(deps): bump docker/build-push-action from 6.16.0 to 6.17.0 by @dependabot in #1052
- build(deps): bump docker/build-push-action from 6.17.0 to 6.18.0 by @dependabot in #1091
🔨 Other Changes
- Fix Kubernetes version string handling by stripping metadata by @Nimbus318 in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by @archlitchi in #624
- feat: support device plugin daemonset update strategy by @devenami in #628
- add ut about schedule policy by @yt-huang in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by @haitwang-cloud in #626
- add ut for the scheduler by @shijinye in #645
- docs(issue-tmpl): add FAQ link to issue templates by @Nimbus318 in #647
- fix: filter device registry to node by @lengrongfu in #639
- Add self-hosted runner by @archlitchi in #659
- fix-example-yaml by @WQL782795 in #667
- update docs by @yangshiqi in #668
- add ut for ascend by @shijinye in #664
- optimization map init in test by @lengrongfu in #678
- Optimize monitor by @for800000 in #683
- fix code lint failed by @lengrongfu in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by @Nimbus318 in #687
- fix vGPUmonitor deviceidx is always 0 by @lengrongfu in #684
- add ut for pkg/scheduler/event.go by @Penguin-zlh in #688
- add ut for nodes by @shijinye in #695
- add license for pkg/scheduler/event_test.go by @Penguin-zlh in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by @lijm87 in #575
- add ut for device/nvidia by @shijinye in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by @yt-huang in #670
- Enable Dynamic-mig feature for HAMi by @archlitchi in #708
- Fix chart can not be deployed properly by @archlitchi in #711
- Fix NodeLock issue by @archlitchi in #714
- fix example yaml by @lixd in #709
- add ut for device/cambricon by @shijinye in #712
- Update dynamic mig documents and examples by @archlitchi in #718
- random time may be zero by @shijinye in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by @jiangsanyin in #543
- doc(README): add examples for GPU sharing and update-examples by @xiaoyao in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by @yt-huang in #673
- Add design document to 'dynamic-mig' feature by @archlitchi in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by @elrondwong in #724
- add ut for pkg/util/nodelock/nodelock.go by @learner0810 in #719
- test: add ut for pkg/version/version.go by @Penguin-zlh in #677
- Update on mig mode by @archlitchi in #726
- Update documents for config & config_cn by @archlitchi in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by @jingzhe6414 in #690
- fix device-plugin-version by @learner0810 in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by @chaunceyjiang in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by @elrondwong in #735
- support configuration resources limits and requests by @flpanbin in #739
- feat(test): add TestMarshalNodeDevices scenarios by @elrondwong in #747
- print flags for device-plugin and scheduler by @flpanbin in #756
- Fix typos, add more contributors and maintainers. by @yangshiqi in #765
- Add a mind map(Chinese and English) to help understand this project by @oceanweave in #764
- [Docs] update config pages by @windsonsea in #760
- add ut for device-map by @KubeKyrie in #762
- refactor(ci): use go.mod file for Go version in workflows by @yxxhero in #766
- support set log level for device plugin by @flpanbin in #771
- feat: Restart/Upgrade device-plugin will not affect services. by @chaunceyjiang in #767
- add ut nvml devices by @KubeKyrie in #773
- add ut for device-map by @KubeKyrie in #772
- Optimize the time format layout by @learner0810 in #741
- fix: nvidia-device-plugin no version info by @chaunceyjiang in #779
- HAMi supports e2e by @Rei1010 in #775
- Proposal: enable E2E test by @Rei1010 in #633
- add ut for device/iluvatar by @shijinye in #795
- add ut for device/hygon by @shijinye in #787
- add ut for pkg/monitor/nvidia/v1 by @shijinye in #780
- refactor(logging): enhance log messages for device resource counting by @haitwang-cloud in #778
- Enrich pod health check by @Rei1010 in #801
- docs: fix broken link by @lixd in #802
- Optimize the E2E execution logic by @Rei1010 in #803
- optimize MetricsBindAddress to MetricsBindPort by @phoenixwu0229 in #796
- fix: handle the node nil issue & E2E test failure ...
v2.5.2
Full Changelog: v2.5.1...v2.5.2
Fix device usage metrics (31992) being inaccessible
v2.5.1
What's Changed
🔨 Other Changes
- Release v2.5 by @archlitchi in #1034
- Update tag to v2.5.1 by @archlitchi in #1035
- Fix: Update handling of version strings in Helm template and helpers.tpl by @HJJ256 in #845
- Update libvgpu.so by @archlitchi in #876
- fix: Set passDeviceSpecsEnabled to false by default in device plugin by @Nimbus318 in #872
- fix: scheduler ignores KUBECONFIG env even if this environment variable is set by @Shouren in #681
- fix: correct device filter initialization order by @Nimbus318 in #857
- fix parseNvidiaNumaInfo index out of range by @flpanbin in #889
- Fix cambricon pods not being recognized by HAMi scheduler by @archlitchi in #947
- fix ubuntu base image in Dockerfile.withlib by @flpanbin in #944
- fix: Add error handling for nvml.Init in NvidiaDevicePlugin by @yxxhero in #982
- Fix device memory count error on cuMallocAsync by @archlitchi in #1029
- Bump golang.org/x/net from 0.26.0 to 0.33.0 by @dependabot in #839
Full Changelog: v2.5.0...v2.5.1
v2.5.0
Major features:
- Support dynamic mig feature, please refer to this document
- Reinstalling HAMi will NOT crash GPU tasks
- All configurations are now in a ConfigMap; you can customize the HAMi installation by modifying its content: see details
Major bug fixes:
- Fix an issue where hami-core gets stuck on tasks using 'cuMallocAsync'
- Fix hami-core getting stuck on images with a high glibc version, like 'tf-serving:latest'
What's Changed
⬆️ Dependencies
- Bump aquasecurity/trivy-action from 0.28.0 to 0.29.0 by @dependabot in #631
- Bump nvidia/cuda from 12.4.1-base-ubuntu22.04 to 12.6.3-base-ubuntu22.04 in /docker by @dependabot in #676
- Bump actions/upload-artifact from 4.4.3 to 4.5.0 by @dependabot in #717
- Bump docker/build-push-action from 6.9.0 to 6.10.0 by @dependabot in #644
- Bump docker/build-push-action from 6.10.0 to 6.11.0 by @dependabot in #792
🔨 Other Changes
- Fix Kubernetes version string handling by stripping metadata by @Nimbus318 in #623
- Update vGPUmonitor to add dynamic adjustment on core and memory limit by @archlitchi in #624
- feat: support device plugin daemonset update strategy by @devenami in #628
- add ut about schedule policy by @yt-huang in #638
- Fix: Refactor the license based on the approaches used in OpenSearch and ElasticSearch. by @haitwang-cloud in #626
- add ut for the scheduler by @shijinye in #645
- docs(issue-tmpl): add FAQ link to issue templates by @Nimbus318 in #647
- fix: filter device registry to node by @lengrongfu in #639
- Add self-hosted runner by @archlitchi in #659
- fix-example-yaml by @WQL782795 in #667
- update docs by @yangshiqi in #668
- add ut for ascend by @shijinye in #664
- optimization map init in test by @lengrongfu in #678
- Optimize monitor by @for800000 in #683
- fix code lint failed by @lengrongfu in #685
- fix(helm): Add NODE_NAME env var to the vgpu-monitor container from spec.nodeName by @Nimbus318 in #687
- fix vGPUmonitor deviceidx is always 0 by @lengrongfu in #684
- add ut for pkg/scheduler/event.go by @Penguin-zlh in #688
- add ut for nodes by @shijinye in #695
- add license for pkg/scheduler/event_test.go by @Penguin-zlh in #706
- fix: exception happen when creating multiple ascend-gpu pods concurrently by @lijm87 in #575
- add ut for device/nvidia by @shijinye in #657
- add ut for pkg/monitor/nvidia/v0/spec.go by @yt-huang in #670
- Enable Dynamic-mig feature for HAMi by @archlitchi in #708
- Fix chart can not be deployed properly by @archlitchi in #711
- Fix NodeLock issue by @archlitchi in #714
- fix example yaml by @lixd in #709
- add ut for device/cambricon by @shijinye in #712
- Update dynamic mig documents and examples by @archlitchi in #718
- random time may be zero by @shijinye in #697
- fix grafana dashboard and clarify dashboard usage more clearly. by @jiangsanyin in #543
- doc(README): add examples for GPU sharing and update-examples by @xiaoyao in #665
- add ut for github.com/Project-HAMi/HAMi/pkg/scheduler/pod.go by @yt-huang in #673
- Add design document to 'dynamic-mig' feature by @archlitchi in #725
- fix(doc): fix a typo and resolve markdown warnings in the tasklist by @elrondwong in #724
- add ut for pkg/util/nodelock/nodelock.go by @learner0810 in #719
- test: add ut for pkg/version/version.go by @Penguin-zlh in #677
- Update on mig mode by @archlitchi in #726
- Update documents for config & config_cn by @archlitchi in #729
- set PASS_DEVICE_SPECS ENV to device-plugin by @jingzhe6414 in #690
- fix device-plugin-version by @learner0810 in #743
- feat: Return the nodes that failed to be scheduled back to the scheduler by @chaunceyjiang in #746
- fix(log): fix missing log output in nvidiadeviceplugin server by @elrondwong in #735
- support configuration resources limits and requests by @flpanbin in #739
- feat(test): add TestMarshalNodeDevices scenarios by @elrondwong in #747
- print flags for device-plugin and scheduler by @flpanbin in #756
- Fix typos, add more contributors and maintainers. by @yangshiqi in #765
- Add a mind map(Chinese and English) to help understand this project by @oceanweave in #764
- [Docs] update config pages by @windsonsea in #760
- add ut for device-map by @KubeKyrie in #762
- refactor(ci): use go.mod file for Go version in workflows by @yxxhero in #766
- support set log level for device plugin by @flpanbin in #771
- feat: Restart/Upgrade device-plugin will not affect services. by @chaunceyjiang in #767
- add ut nvml devices by @KubeKyrie in #773
- add ut for device-map by @KubeKyrie in #772
- Optimize the time format layout by @learner0810 in #741
- fix: nvidia-device-plugin no version info by @chaunceyjiang in #779
- HAMi supports e2e by @Rei1010 in #775
- Proposal: enable E2E test by @Rei1010 in #633
- add ut for device/iluvatar by @shijinye in #795
- add ut for device/hygon by @shijinye in #787
- add ut for pkg/monitor/nvidia/v1 by @shijinye in #780
- refactor(logging): enhance log messages for device resource counting by @haitwang-cloud in #778
- Enrich pod health check by @Rei1010 in #801
- docs: fix broken link by @lixd in #802
- Optimize the E2E execution logic by @Rei1010 in #803
- optimize MetricsBindAddress to MetricsBindPort by @phoenixwu0229 in #796
- fix: handle the node nil issue & E2E test failure by @haitwang-cloud in #804
- add ut for device/mthreads by @shijinye in #808
- fix: Resolve formatting issue in ConfigMap causing display anomalies by @lixd in #814
- [docs] Update ascend910b-support.md by @windsonsea in #816
- Refine metrics logs by @haitwang-cloud in #817
- Update mig-related logics and refine logs by @archlitchi in #833
- Add 910B4 config to device-configmap for ascend by @lijm87 in #828
- [docs] fix: glibc version requirement in README by @chinaran in #826
- Update HAMi-core for v2.5.0 by @archlitchi in #834
- Fix multi-process device memory count issue by @archlitchi in #835
- bump version to v2.5.0 by @wawa0210 in #836
- Fix CI by @archlitchi in #838
- Fix CI release by @archlitchi in #840
- Fix release ci by @archlitchi in #841
- Fix Dockerfile to make CI pass by @archlitchi in #846
- Fix E2E failure with pod status check by @Rei1010 in htt...
v2.4.1
Major Features:
- Support Metax scheduling optimization
- Support Mthreads sGPU
- Add a configMap hami-scheduler-device for all configurations of HAMi
- Optimize installation process
Details
⬆️ Dependencies
- Bump actions/download-artifact from 3 to 4 by @dependabot in #529
- Bump docker/build-push-action from 6.8.0 to 6.9.0 by @dependabot in #528
- Bump actions/upload-artifact from 3.1.3 to 4.4.0 by @dependabot in #530
- Bump aquasecurity/trivy-action from 0.24.0 to 0.27.0 by @dependabot in #546
- Bump actions/upload-artifact from 4.4.0 to 4.4.3 by @dependabot in #541
- Bump ubuntu from 20.04 to 24.04 in /docker by @dependabot in #394
- Bump aquasecurity/trivy-action from 0.27.0 to 0.28.0 by @dependabot in #559
- Bump codecov/codecov-action from 4 to 5 by @dependabot in #613
🔨 Other Changes
- fix build badge status by @wawa0210 in #526
- update action-gh-release template file to more accurate matching by @wawa0210 in #527
- Refactor helm "Admission Webhook" config. by @4gt-104 in #532
- fix: error happen when allocate iluvatar device by @lijm87 in #522
- Fix code scanning alert-Incorrect conversion between integer types by @ghostloda in #556
- update hami-core version by @chaunceyjiang in #557
- Mthreads support by @archlitchi in #560
- Fix code scanning alert-Incorrect conversion between integer types by @ghostloda in #561
- update docs by @ghostloda in #567
- migrate hami slack to cncf hami group by @wawa0210 in #568
- Fix pod assignment issue when pod already has a node assigned by @chaunceyjiang in #564
- fix(scheduler): prevent array out-of-bounds when GPU containers are placed between non-GPU containers by @Nimbus318 in #572
- improve pkg/k8sutil/pod.go ut coverage by @wawa0210 in #570
- Metax GPU topo-awareness support by @archlitchi in #574
- Add WebUI to readme and readme_cn.md by @archlitchi in #578
- remove watermark of MetaX topo diagrams by @obnah in #581
- update HAMi Talks and References by @wawa0210 in #582
- fix: assign to wrong devices when 1 pod has 2+ containers requesting GPU by @joy717 in #593
- docs: fix deployments path in README by @dublc in #608
- Add unified configMap and update charts by @archlitchi in #614
- Fix configMap device-config not properly installed by @archlitchi in #616
- fix CI: race condition error by @archlitchi in #618
- Pre release to v2.4.1 by @archlitchi in #619
New Contributors
- @4gt-104 made their first contribution in #532
- @lijm87 made their first contribution in #522
- @ghostloda made their first contribution in #556
- @Nimbus318 made their first contribution in #572
- @obnah made their first contribution in #581
- @dublc made their first contribution in #608
Full Changelog: v2.4.0...v2.4.1
v2.4.0
What's Changed
✨ New Features
- Support huawei ascend 910p for GA by @peizhaoyou in #389, https://github.com/Project-HAMi/ascend-device-plugin
- Support for multiple versions of cudevshr for vGPUmonitor by @zoyopei in #458
- Add filter device when register node by uuid or index by @lengrongfu in #495
- Support Ascend custom configuration file settings for NPU virtualization by @wawa0210 in #510
- Add event handlers registration by @wawa0210 in #417
- Officially supports arm architecture
🐛 Bug Fixes
- fixed go build image version by @chaunceyjiang in #405
- fix: fix duplicate resource keys in configmap by @devenami in #422
- fix data race when read pods info by @lengrongfu in #419
- fix OpenSSF Best Practices by @wawa0210 in #478
- fix CI and go-lint check error by @archlitchi in #486
- fix trivy scan failed by @wawa0210 in #504 and #507
- Fix HAMi image is too large and uses inappropriate base image by @wawa0210 in #508
- fix device configmap by @zoyopei in #494
- fix chart lint always running when charts has no change by @wawa0210 in #501
🔨 Other Changes
- Proposal: support GPU Utilization Metrics by @chaunceyjiang in #258
- disable PreferredAllocation by @lengrongfu in #415
- optimization code by @lengrongfu in #401
- add vgpu doc and update the readme. by @william-wang in #430
- add hami vulnerability scan and report by @wawa0210 in #433
- add CodeQL analysis by @wawa0210 in #432
- update hami logo by @lengrongfu in #456
- add node record pod info by @lengrongfu in #451
- support code Coverage Analytics by @wawa0210 in #473
- add dev branch ci & remove unused binary by @archlitchi in #487
- refactoring ascend device code by @zoyopei in #492
- Remake README.md by @archlitchi in #489
- support pr commit can also build images by @wawa0210 in #499
- Add HAMi release process by @wawa0210 in #520
New Contributors
- @william-wang made their first contribution in #430
- @devenami made their first contribution in #422
Full Changelog: v2.3.13...v2.4.0