Fetch from nvidia Megatron-LM #5

Open: wants to merge 4,927 commits into base: load-iter
Commits
261b3ce
Merge branch 'weimingc/fix_real_quant' into 'main'
shanmugamr1992 Apr 15, 2025
f453b4d
ADLR/megatron-lm!3117 - ci: Fix publish notify job
ko3n1g Apr 15, 2025
65cf8d5
Merge branch 'ko3n1g/ci/fix-publish-notify-job' into 'main'
ko3n1g Apr 15, 2025
34b3723
ADLR/megatron-lm!3106 - ci: Upload pipeline telemetrics
ko3n1g Apr 16, 2025
9b9e374
Merge branch 'ko3n1g/ci/dashboard-functional-runs' into 'main'
ko3n1g Apr 16, 2025
69e284d
ADLR/megatron-lm!3118 - Fix `post_training/test_get_gpt_modelopt_spec…
jenchen13 Apr 16, 2025
d46f999
Merge branch 'fix_mo_spec_test' into 'main'
ko3n1g Apr 16, 2025
671f254
ADLR/megatron-lm!3023 - Remove legacy bert tests
shanmugamr1992 Apr 16, 2025
8579a5d
Merge branch 'remove-legacy-bert-test' into 'main'
shanmugamr1992 Apr 16, 2025
f5a57fe
ADLR/megatron-lm!2601 - Alit/config mamba head
JRD971000 Apr 16, 2025
ecf8a10
Merge branch 'alit/config_mamba_head' into 'main'
shanmugamr1992 Apr 16, 2025
cbbbacb
ADLR/megatron-lm!3125 - Update CODEOWNERS to make modelopt review on…
shanmugamr1992 Apr 16, 2025
4597aaa
Merge branch 'shanmugamr-main-patch-70610' into 'main'
shanmugamr1992 Apr 16, 2025
f26cc41
ADLR/megatron-lm!3119 - Run nemo2 tests instead of nemo1
chtruong814 Apr 16, 2025
0f52851
Merge branch 'chtruong/update-functional-for-nemo2' into 'main'
ko3n1g Apr 16, 2025
e1d58bc
ADLR/megatron-lm!2955 - Integrating paged attention feature of flash_…
kvareddy Apr 16, 2025
d2e3ffc
Merge branch 'vijay/unify_static_dynamic' into 'main'
ko3n1g Apr 16, 2025
d0534e9
ADLR/megatron-lm!2960 - add l2 norm in torch_norm.py for LLAMA-4 support
yaoyu-33 Apr 16, 2025
e6bd64c
Merge branch 'yuya/add_l2_norm' into 'main'
ko3n1g Apr 16, 2025
202ad22
ADLR/megatron-lm!3126 - fix: Improvements to the auto-reminder bot
ko3n1g Apr 16, 2025
8e0215c
Merge branch 'ko3n1g/fix/reminder-bot-final-review-date' into 'main'
ko3n1g Apr 16, 2025
966bb9a
ADLR/megatron-lm!2475 - Fix Gemma TRTLLM export
meatybobby Apr 16, 2025
9db6e55
Merge branch 'bobchen/fix_nemo2' into 'main'
jaredcasper Apr 16, 2025
c0b5c91
ADLR/megatron-lm!2691 - Fix MLA THD format support
Shunkangz Apr 17, 2025
0bbb642
Merge branch 'mla_PackedSeqParams' into 'main'
ko3n1g Apr 17, 2025
02c6d64
ADLR/megatron-lm!2914 - Dynamic inference example | Control checkpoin…
lmcafee-nvidia Apr 17, 2025
370508c
Merge branch 'lmcafee/ifb-broken-example-25.02' into 'main'
ko3n1g Apr 17, 2025
b799e3f
ADLR/megatron-lm!3057 - patch for fp8 primary weight custom fsdp support
shjwudp Apr 17, 2025
76d6bcf
Merge branch 'fp8_patch_for_cfsdp' into 'main'
ko3n1g Apr 17, 2025
35fd148
ADLR/megatron-lm!3129 - ci: Track info about MR
ko3n1g Apr 17, 2025
7935dcf
Merge branch 'ko3n1g/feat/track-info-about-merge-request' into 'main'
ko3n1g Apr 17, 2025
04a5957
ADLR/megatron-lm!3105 - ci: Handle nargs
ko3n1g Apr 17, 2025
55c968d
Merge branch 'ko3n1g/ci/handle-nargs' into 'main'
ko3n1g Apr 17, 2025
2922bb6
ADLR/megatron-lm!2871 - Fix optimizer cpu offload load checkpoint wit…
shjwudp Apr 18, 2025
e6d1ef9
Merge branch 'fix_optimizer_cpu_offload_load_hf_model' into 'main'
ko3n1g Apr 18, 2025
7b980e0
ADLR/megatron-lm!3005 - Allow log_config_to_disk to accept an optiona…
ZhiyuLi-Nvidia Apr 18, 2025
1f92849
Merge branch 'zhiyul/orthotope/config_logger' into 'main'
ko3n1g Apr 18, 2025
e362ed8
ADLR/megatron-lm!3128 - build: Add TE patches
ko3n1g Apr 18, 2025
0f3256b
Merge branch 'ko3n1g/build/te-patches' into 'main'
ko3n1g Apr 18, 2025
e7a324e
ADLR/megatron-lm!3140 - ci: Add Final Reviewer
ko3n1g Apr 18, 2025
a37c48b
Merge branch 'ko3n1g/ci/add-final-reviewer' into 'main'
ko3n1g Apr 18, 2025
3cc2e29
ADLR/megatron-lm!3141 - ci: Only allow running if target is ADLR/mega…
ko3n1g Apr 18, 2025
47b99b6
Merge branch 'ko3n1g/ci/not-on-forks' into 'main'
ko3n1g Apr 18, 2025
f37edff
ADLR/megatron-lm!3102 - Enable fusion for interleaved RoPE
tomlifu Apr 18, 2025
2f78e3b
Merge branch 'interleaved_fused_rope_lifuz' into 'main'
ko3n1g Apr 18, 2025
8b95cb3
ADLR/megatron-lm!2958 - Fix circular import introduced in !2518
lmcafee-nvidia Apr 19, 2025
1d840fa
Merge branch 'lmcafee/cyclic-import-2518' into 'main'
ko3n1g Apr 19, 2025
3461557
ADLR/megatron-lm!3055 - Simplify and improve perf dynamic
shanmugamr1992 Apr 19, 2025
a41d87a
Merge branch 'simplify_and_improve_perf_dynamic' into 'main'
shanmugamr1992 Apr 19, 2025
4393020
ADLR/megatron-lm!3090 - Allow T5 model to take in optional customized…
ZhiyuLi-Nvidia Apr 19, 2025
20dca24
Merge branch 'zhiyul/orthotope/t5' into 'main'
ko3n1g Apr 19, 2025
d5529e2
ADLR/megatron-lm!3144 - ci: Fix telemetrics for failing pipelines
ko3n1g Apr 19, 2025
df0430d
Merge branch 'ko3n1g/fix/telemetrics' into 'main'
ko3n1g Apr 19, 2025
c844db7
ADLR/megatron-lm!2843 - Mimo model structure and vision submodules
yashaswikarnati Apr 20, 2025
0a82295
Merge branch 'yash/mimo_initial_modules' into 'main'
jaredcasper Apr 20, 2025
d7af02d
ADLR/megatron-lm!3113 - Unit tests for Transformer block with user de…
yashaswikarnati Apr 21, 2025
2ccaf22
Merge branch 'yash/tests_and_hot_fixes' into 'main'
ko3n1g Apr 21, 2025
575e7a5
ADLR/megatron-lm!2658 - Fix how Cuda Graph decides if a module is tra…
guyueh1 Apr 21, 2025
4a9702c
Merge branch 'fix_cuda_graph_for_te_full_layer' into 'main'
ko3n1g Apr 21, 2025
49a870e
ADLR/megatron-lm!3148 - Re-enable beam search in textgen server
mathemakitten Apr 21, 2025
d4f7a56
Merge branch 'helenn-beam-search-fix' into 'main'
jaredcasper Apr 21, 2025
0ad06b5
ADLR/megatron-lm!2581 - Add support for ZeRO-2 with PyTorch FSDP2
santhnm2 Apr 22, 2025
1cb506c
Merge branch 'torch_fsdp_zero2' into 'main'
deepakn94 Apr 22, 2025
1f53a0e
ADLR/megatron-lm!3147 - fix(moe): Fix the assertion of moe_ffn_hidden…
yanring Apr 22, 2025
80b8b7a
Merge branch 'zijie/moe_ffn_size_fix' into 'main'
jaredcasper Apr 22, 2025
8c1e094
ADLR/megatron-lm!3069 - enable num_distributed_optimizer_instances fo…
xrennvidia Apr 22, 2025
b5a97f3
Merge branch 'xren/multi_dc_moe' into 'main'
deepakn94 Apr 22, 2025
825f76e
ADLR/megatron-lm!3108 - Swap ignore virtual defaults
skyw Apr 22, 2025
a045193
Merge branch 'skyw/swap_ignore_virtual_defaults' into 'main'
ko3n1g Apr 22, 2025
e5a34e3
ADLR/megatron-lm!3112 - build: Bump mamba
ko3n1g Apr 22, 2025
59503ed
Merge branch 'ko3n1g/build/bump-mamba' into 'main'
ko3n1g Apr 22, 2025
adca14d
ADLR/megatron-lm!3157 - ci: Fix workflow-constraint for not running o…
ko3n1g Apr 23, 2025
00efe37
Merge branch 'ko3n1g/ci/auto-reminder' into 'main'
ko3n1g Apr 23, 2025
1eaed21
Revert "ADLR/megatron-lm!2581 - Add support for ZeRO-2 with PyTorch F…
ko3n1g Apr 23, 2025
d905ac7
ci: Update golden values
ko3n1g Apr 23, 2025
4e6ae3a
ADLR/megatron-lm!2758 - Fix a bug about tp_comm_buffer_name in MultiL…
BestJuly Apr 25, 2025
daa824d
Merge branch 'lit/fix_te_overlap_assertion' into 'main'
ko3n1g Apr 25, 2025
ed46806
ADLR/megatron-lm!3014 - Fix issue loading model checkpoint saved with…
jstjohn Apr 25, 2025
0c7c163
Merge branch 'jstjohn/fix_param_aware_dtype' into 'main'
ko3n1g Apr 25, 2025
20c635b
ADLR/megatron-lm!2950 - feat(MoE): FP8 Support for Multi-Token-Predic…
BestJuly Apr 25, 2025
51903b2
Merge branch 'lit/deepseekv3_fp8' into 'main'
ko3n1g Apr 25, 2025
3d1ecd7
ADLR/megatron-lm!3096 - Fix checkpoint directory bug in distill night…
AAnoosheh Apr 25, 2025
65aa136
Merge branch 'aanoosheh/fix-ckpt-dir-bug' into 'main'
ko3n1g Apr 25, 2025
f7fdafd
ADLR/megatron-lm!2637 - [dist ckpt] Re-attempt !2493 + fixing merge c…
ananthsub Apr 25, 2025
8f6e830
Merge branch 'intra-parallel-2493' into 'main'
ananthsub Apr 25, 2025
99f43ae
ADLR/megatron-lm!3175 - ci: Control which checks per test to run
ko3n1g Apr 25, 2025
2bb62e7
Merge branch 'ko3n1g/ci/metrics-per-test' into 'main'
ko3n1g Apr 25, 2025
48cc46f
ADLR/megatron-lm!3155 - Fix the sync issue in `TemporalAsyncWorker`
sbak5 Apr 26, 2025
222adb8
Merge branch 'sbak/ckpt_sync_issue' into 'main'
ko3n1g Apr 26, 2025
c8f6279
ADLR/megatron-lm!2971 - Add ModelOpt speculative decoding finetune
yeyu-nvidia Apr 26, 2025
154a7a8
Merge branch 'yeyu/finetune' into 'main'
ko3n1g Apr 26, 2025
4ca4309
ADLR/megatron-lm!3083 - Moe fix for Llama4
yaoyu-33 Apr 26, 2025
aab56ce
Merge branch 'yuya/moe_sigmoid_fix' into 'main'
ko3n1g Apr 26, 2025
5fe1eeb
ADLR/megatron-lm!2910 - [custom FSDP] Support EP + FSDP training for …
shjwudp Apr 26, 2025
f7a25e5
Merge branch 'custom_fsdp_dsv3' into 'main'
ko3n1g Apr 26, 2025
a1843ac
ADLR/megatron-lm!3178 - Fix extra tokens in returned generation
mathemakitten Apr 26, 2025
ceed1b7
Merge branch 'helenn-fix-seqlen-chopping' into 'main'
jaredcasper Apr 26, 2025
b764f2d
ADLR/megatron-lm!3160 - Update current scaling supported TE version t…
thomasdhc Apr 26, 2025
57d21c3
Merge branch 'donghyukc/te_min_version' into 'main'
ko3n1g Apr 26, 2025
2bc6257
ADLR/megatron-lm!3121 - Seperate chunk allocator
shanmugamr1992 Apr 26, 2025
e733d7d
Merge branch 'seperate_chunk_allocator' into 'main'
ko3n1g Apr 26, 2025
4f16de3
ADLR/megatron-lm!3180 - Revert inference_context.is_decode_only() to …
mathemakitten Apr 26, 2025
33a193d
Merge branch 'helenn-fix-seqlenoffset' into 'main'
jaredcasper Apr 26, 2025
bc70535
ADLR/megatron-lm!3058 - [BUG FIX]: fix the bug of indices-to-multihot…
Apr 26, 2025
885a245
Merge branch 'incidices_to_multihot' into 'main'
ko3n1g Apr 26, 2025
8208937
ADLR/megatron-lm!3015 - Refactor Inference Process Groups by replacin…
ZhiyuLi-Nvidia Apr 26, 2025
7118d88
Merge branch 'zhiyul/orthotope/inference' into 'main'
ko3n1g Apr 26, 2025
9bb34bf
ADLR/megatron-lm!3179 - Update te patch to include 1626
thomasdhc Apr 26, 2025
2f4463e
Merge branch 'donghyukc/te_patch_update' into 'main'
ko3n1g Apr 26, 2025
4429e8e
Revert "Revert "ADLR/megatron-lm!2581 - Add support for ZeRO-2 with P…
ko3n1g Apr 26, 2025
3053031
ADLR/megatron-lm!3120 - Use FlashAttention 3 for inference
santhnm2 Apr 28, 2025
47e3bd3
Merge branch 'fa3_inference' into 'main'
ko3n1g Apr 28, 2025
8d1367f
ADLR/megatron-lm!3167 - No RoPE for Llama4
suiyoubi Apr 28, 2025
5807d1c
Merge branch 'aot/no_rope_llama4' into 'main'
ko3n1g Apr 28, 2025
72afd63
ADLR/megatron-lm!3010 - Enable --fp8-param-gather for NV sub-channel …
kunlunl Apr 28, 2025
1eb5fe5
Merge branch 'nv_subchannel_native_fp8' into 'main'
shanmugamr1992 Apr 28, 2025
cf6d208
ADLR/megatron-lm!3133 - fix: fix FP8 support in recompute; fix fused …
hxbai Apr 28, 2025
f6b042b
Merge branch 'hongxiaob/recompute_fp8_fix' into 'main'
ko3n1g Apr 28, 2025
f2c0f12
ADLR/megatron-lm!3189 - ci: Auto-apply most recent milestone
ko3n1g Apr 28, 2025
97d27d1
Merge branch 'ko3n1g/ci/auto-milestone' into 'main'
ko3n1g Apr 28, 2025
06a2dd5
ADLR/megatron-lm!3191 - fix: Correct date of review stage
ko3n1g Apr 28, 2025
9c5c870
Merge branch 'ko3n1g/fix/auto-reminder' into 'main'
ko3n1g Apr 28, 2025
97efad4
ADLR/megatron-lm!3130 - tests: Fix model-config test for nemo2
ko3n1g Apr 28, 2025
0a524aa
Merge branch 'ko3n1g/tests/fix-model-config-test' into 'main'
ko3n1g Apr 28, 2025
4b63750
ADLR/megatron-lm!3038 - chore: QA on 0.12 release
ko3n1g Apr 28, 2025
14c4946
Merge branch 'ko3n1g/chore/release-tests-0.12' into 'main'
ko3n1g Apr 28, 2025
06f23e3
ADLR/megatron-lm!3194 - ci: Update golden values
ko3n1g Apr 28, 2025
fb7c3f8
Merge branch 'ko3n1g/ci/fix-golden-values' into 'main'
ko3n1g Apr 28, 2025
a1a0df9
ADLR/megatron-lm!2863 - Tpoon/hf llava saver radio
Apr 29, 2025
672e2b8
Merge branch 'tpoon/hf_llava_saver_radio' into 'main'
jaredcasper Apr 29, 2025
4623d68
ADLR/megatron-lm!3135 - Only pad to batch max sequence length when us…
santhnm2 Apr 29, 2025
644f5d7
Merge branch 'cuda_graph_padding' into 'main'
deepakn94 Apr 29, 2025
1827be9
ADLR/megatron-lm!3136 - Compute fused bias-dropout-add in-place for i…
santhnm2 Apr 29, 2025
5b4f8ca
Merge branch 'inplace_bda' into 'main'
deepakn94 Apr 29, 2025
8a72bde
ADLR/megatron-lm!3196 - Fix failing distill test again
AAnoosheh Apr 29, 2025
faab15b
Merge branch 'aanoosheh/fix-teacher-test-load-dir' into 'main'
ko3n1g Apr 29, 2025
3729405
ADLR/megatron-lm!3097 - ci: Add checks for inference testing
ko3n1g Apr 29, 2025
ed679f9
Merge branch 'ko3n1g/ci/onboard-inference-testrules' into 'main'
ko3n1g Apr 29, 2025
eb397f7
ADLR/megatron-lm!3202 - build: Install latest nvidia-modelopt
ko3n1g Apr 29, 2025
a284b55
Merge branch 'ko3n1g/build/bump-modelopt' into 'main'
ko3n1g Apr 29, 2025
3ed2571
ADLR/megatron-lm!3200 - ci: Update golden values
ko3n1g Apr 29, 2025
093ddb7
Merge branch 'ko3n1g/ci/fix-golden-values-2' into 'main'
ko3n1g Apr 29, 2025
39fd897
ADLR/megatron-lm!3183 - Pass vp_stage value to pipeline first/last st…
skyw Apr 30, 2025
fd59dd8
Merge branch 'skyw/vp_cleanup_3rd_try' into 'main'
ko3n1g Apr 30, 2025
f55397d
ADLR/megatron-lm!3214 - Add minus_sqrt as a WSD learning rate decay o…
deepakn94 Apr 30, 2025
d86da87
Merge branch 'dnarayanan/sqrt_decay' into 'main'
deepakn94 Apr 30, 2025
414ee9a
ADLR/megatron-lm!3203 - ci: Build container for publish step
ko3n1g Apr 30, 2025
543643b
Merge branch 'ko3n1g/ci/build-for-publish' into 'main'
ko3n1g Apr 30, 2025
c970ca9
ADLR/megatron-lm!3201 - ci: Wait for resources
ko3n1g Apr 30, 2025
48f20d3
Merge branch 'ko3n1g/ci/wait-for-resources' into 'main'
ko3n1g Apr 30, 2025
0c50f0b
ADLR/megatron-lm!3212 - fix: Fixes issue to maintain backward compati…
terrykong Apr 30, 2025
b3d384b
Merge branch 'tk/fix-te-1.13.0' into 'main'
shanmugamr1992 Apr 30, 2025
3806ec2
ADLR/megatron-lm!3139 - move reporting loss all-reduce to the end of …
xrennvidia Apr 30, 2025
cf12e78
Merge branch 'xren/multi_dc_loss' into 'main'
ko3n1g Apr 30, 2025
7828cd5
ADLR/megatron-lm!3193 - substitute nemo1 tests with nemo2 tests
ko3n1g May 1, 2025
82aeb83
Merge branch 'ko3n1g/feat/nemo2_tests' into 'main'
ko3n1g May 1, 2025
e85bb68
ADLR/megatron-lm!3223 - ci: Update sampling rate
ko3n1g May 1, 2025
0f3a56c
Merge branch 'ko3n1g/ci/update-golden-values-nightly-2' into 'main'
ko3n1g May 1, 2025
40f4cf2
ADLR/megatron-lm!3206 - chore: Update changelog 0.9
ko3n1g May 1, 2025
0d3843d
Merge branch 'ko3n1g/chore/update-changelog-0.9' into 'main'
ko3n1g May 1, 2025
624ddc2
ADLR/megatron-lm!3207 - chore: Update changelog 0.11
ko3n1g May 1, 2025
fc0346e
Merge branch 'ko3n1g/chore/update-changelog-0.11' into 'main'
ko3n1g May 1, 2025
d5f61d7
ADLR/megatron-lm!3229 - chore: Update changelog 0.12
ko3n1g May 1, 2025
909691e
Merge branch 'ko3n1g/chore/update-changelog-0.12' into 'main'
ko3n1g May 1, 2025
eec34ff
ADLR/megatron-lm!3209 - Fix moe score calculation
yaoyu-33 May 1, 2025
1e291ac
Merge branch 'yuya/sigmoid_gather_fix' into 'main'
deepakn94 May 1, 2025
a58f381
Revert "ADLR/megatron-lm!3193 - substitute nemo1 tests with nemo2 tests"
ko3n1g May 1, 2025
850fc61
ADLR/megatron-lm!2999 - adding symmetric memory all reduce for inference
wdykas May 1, 2025
7608f27
Merge branch 'symmetric-allreduce' into 'main'
ko3n1g May 1, 2025
26ad379
ADLR/megatron-lm!3238 - Remove openwebtext for having scurity vulnera…
shanmugamr1992 May 1, 2025
16fd8f7
Merge branch 'securityFix' into 'main'
jaredcasper May 1, 2025
3383a10
ADLR/megatron-lm!3138 - Hybrid functional tests
deepakn94 May 1, 2025
54afff4
Merge branch 'dnarayanan/hybrid_functional_tests' into 'main'
deepakn94 May 1, 2025
23e2e09
ADLR/megatron-lm!3236 - ci: Update golden values of nightly after 3139
ko3n1g May 2, 2025
b2e41cb
Merge branch 'ko3n1g/ci/update-golden-values-nightly-3' into 'main'
ko3n1g May 2, 2025
823d8d8
ADLR/megatron-lm!3237 - ci(fix): get_all expert reviewers
ko3n1g May 2, 2025
fdd88b2
Merge branch 'ko3n1g/fix/auto-reminder-2' into 'main'
ko3n1g May 2, 2025
dc092cb
ADLR/megatron-lm!3143 - Fix TP comm overlap with shared_expert_overlap
gdengk May 2, 2025
e56d0cf
Merge branch 'gaod/llama4/tp_overlap_fix' into 'main'
ericharper May 2, 2025
7acc83f
ADLR/megatron-lm!3244 - ci: Fixes to review-reminder
ko3n1g May 2, 2025
e087e89
Merge branch 'ko3n1g/ci/review-reminder' into 'main'
ko3n1g May 2, 2025
0504e97
ADLR/megatron-lm!2592 - Fix QK layer scaling for PP > 1
nick-knight May 3, 2025
89d758b
Merge branch 'nknight/qk-layer-scaling-fix' into 'main'
ko3n1g May 3, 2025
afb755f
ADLR/megatron-lm!3156 - Various bugfixes in megatron/core/distributed
shifangx May 5, 2025
8640c31
Merge branch 'shifang/fix_some_bug' into 'main'
deepakn94 May 5, 2025
ce70b8d
ADLR/megatron-lm!3230 - Pass vp_stage during model initialization
skyw May 5, 2025
16089c8
Merge branch 'skyw/vp_stage_in_transformer' into 'main'
ko3n1g May 5, 2025
e46c0b5
ADLR/megatron-lm!3250 - ci: Fix publish docs
ko3n1g May 5, 2025
10b5c58
Merge branch 'ko3n1g/ci/fix-publish-docs' into 'main'
ko3n1g May 5, 2025
f5c735a
ADLR/megatron-lm!3248 - ci: Remove broken MoE tests from LTS
ko3n1g May 6, 2025
6ebfb46
Merge branch 'ko3n1g/ci/remove-moe-tests-from-lts' into 'main'
ko3n1g May 6, 2025
7dcfe47
ADLR/megatron-lm!3256 - tests: Fix `gpt3_345m_nightly_dgx_a100_1N8G_m…
ko3n1g May 6, 2025
50d6262
Merge branch 'ko3n1g/ci/fix-nightly-runs-2' into 'main'
deepakn94 May 6, 2025
3ac41f7
ADLR/megatron-lm!3225 - ci: onboard T5 memory test
ko3n1g May 6, 2025
2fc0d68
Merge branch 'ko3n1g/ci/onboard-memory-tests' into 'main'
ko3n1g May 6, 2025
be878db
ADLR/megatron-lm!3257 - ci: Provide easier tooling for local runs
ko3n1g May 6, 2025
3235d43
Merge branch 'ko3n1g/ci/easier-local-runs' into 'main'
ko3n1g May 6, 2025
342cb69
ADLR/megatron-lm!3228 - Remove unintentionally leftover lines in Mode…
AAnoosheh May 7, 2025
07c682a
Merge branch 'aanoosheh/remove-modelopt-speculate-assert' into 'main'
jaredcasper May 7, 2025
f586001
ADLR/megatron-lm!2652 - feat: use multi-storage client in checkpointing
shunjiad May 8, 2025
763a4ab
Merge branch 'integrate-multi-storage-client' into 'main'
ko3n1g May 8, 2025
09c8397
ADLR/megatron-lm!3263 - ci: Fixes to the release
ko3n1g May 8, 2025
d1d9cac
Merge branch 'ko3n1g/ci/fix-release-2' into 'main'
ko3n1g May 8, 2025
c15ebb2
ADLR/megatron-lm!3235 - ADLR/megatron-lm!3193 - substitute nemo1 test…
ko3n1g May 8, 2025
eebd745
Merge branch 'ko3n1g/ci/onboard-nemo2-tests' into 'main'
ko3n1g May 8, 2025
18bac32
ADLR/megatron-lm!3270 - remove from recipe
ko3n1g May 8, 2025
378259d
Merge branch 'ko3n1g/skyw/deprecate_legacy_model_in_functional_tests'…
ko3n1g May 8, 2025
9d65d25
ADLR/megatron-lm!3261 - Fix attention_mask shapes in Attention unit test
santhnm2 May 8, 2025
3b6f038
Merge branch 'fix_attention_unit_test' into 'main'
ko3n1g May 8, 2025
53bfad2
ADLR/megatron-lm!3210 - Updated setup instructions in README.md
sbhavani May 8, 2025
bcbede5
Merge branch 'main' into 'main'
ko3n1g May 8, 2025
09add09
ADLR/megatron-lm!3151 - Disable cudagraphs when pipeline parallel mic…
mathemakitten May 9, 2025
5124103
Merge branch 'helenn-guard-cudagraphs-pp-microbatching' into 'main'
jaredcasper May 9, 2025
f8c8c9c
ADLR/megatron-lm!2812 - Inference functional test: 580M Minitron
mathemakitten May 9, 2025
f25dceb
Merge branch 'helenn-inference-functional-test' into 'main'
ko3n1g May 9, 2025
16aeade
Revert "ADLR/megatron-lm!2812 - Inference functional test: 580M Minit…
chtruong814 May 10, 2025
861f574
ADLR/megatron-lm!2812 - Inference functional test: 580M Minitron
mathemakitten May 9, 2025
b6212fd
ADLR/megatron-lm!3277 - Invalidate cached SSM tensors if batch size c…
santhnm2 May 12, 2025
d68c474
Merge branch 'mamba_variable_batch_size_fix' into 'main'
shanmugamr1992 May 12, 2025
0b084c6
ADLR/megatron-lm!3291 - ci: Move unit test logic to file
ko3n1g May 12, 2025
460e961
Merge branch 'ko3n1g/ci/unit-tests-script' into 'main'
ko3n1g May 12, 2025
f8b1172
ADLR/megatron-lm!3243 - Adapt _write_item call to new signature with …
skierat May 12, 2025
a3609ee
Merge branch 'skierat/write_item_signature' into 'main'
ko3n1g May 12, 2025
d87ba91
ADLR/megatron-lm!2711 - Add in-process restart
szmigacz May 13, 2025
0bdebc0
Merge branch 'inprocess_mr' into 'main'
deepakn94 May 13, 2025
5c7ecad
ci(hotfix): Update Dockerfile.ci.dev
ko3n1g May 13, 2025
e41dde6
Revert "ADLR/megatron-lm!2711 - Add in-process restart"
ko3n1g May 13, 2025
f61b17c
ADLR/megatron-lm!3292 - ci: Run on multiple clusters
ko3n1g May 13, 2025
c552e21
Merge branch 'ko3n1g/ci/multi-cluster' into 'main'
ko3n1g May 13, 2025
55343df
ADLR/megatron-lm!3302 - ci: Allow specific TE-ref
ko3n1g May 13, 2025
d50e830
Merge branch 'ko3n1g/ci/te-nightly' into 'main'
ko3n1g May 13, 2025
8c4875f
ADLR/megatron-lm!3299 - ci(fix): Write logs to log_dir
ko3n1g May 13, 2025
d6eb60b
Merge branch 'ko3n1g/ci/unit-tests-locally' into 'main'
ko3n1g May 13, 2025
c58e57f
ADLR/megatron-lm!3253 - Address dist checkpointing PyT 24.08 failure
ananthsub May 14, 2025
4a114e6
Merge branch 'dist-ckpt-2408' into 'main'
deepakn94 May 14, 2025
d2cbe5a
ADLR/megatron-lm!3307 - ci(hotfix): Downstream pipeline
ko3n1g May 14, 2025
53d55fb
Merge branch 'ko3n1g/ci/fix-downstream-pipeline' into 'main'
ko3n1g May 14, 2025
9c586bf
ADLR/megatron-lm!3308 - MR feedback: added units for arguments, optio…
rhewett-nv May 14, 2025
8416bff
Merge branch 'inprocess_mr' into 'main'
ko3n1g May 14, 2025
07b1992
ADLR/megatron-lm!2966 - Allow process group as optional argument for …
ZhiyuLi-Nvidia May 16, 2025
175497e
Merge branch 'zhiyul/orthotope/ssm' into 'main'
ko3n1g May 16, 2025
7f9f2bf
ADLR/megatron-lm!2588 - Add NVTX ranges to categorize execution
May 16, 2025
8a9e864
Merge branch 'llama31_automated_breakdown' into 'main'
jaredcasper May 16, 2025
1ff5a37
ADLR/megatron-lm!3116 - Move fsdp 2 import from _composable to public
BoxiangW May 16, 2025
ed0d528
Merge branch 'boxiangw/public_fsdp_import' into 'main'
ko3n1g May 16, 2025
d70e2e4
ADLR/megatron-lm!3321 - ci: Add nemo-image to `ci-rebuild-mcore-nemo-…
ko3n1g May 16, 2025
054fad5
Merge branch 'ko3n1g/ci/fix-rebuild-job' into 'main'
ko3n1g May 16, 2025
e494219
ADLR/megatron-lm!3197 - ci: Re-enable tests that failed on memory
ko3n1g May 16, 2025
bfc751a
Merge branch 'ko3n1g/ci/re-enable-broken-tests' into 'main'
ko3n1g May 16, 2025
a73b4d2
tests: Disable flaky test
ko3n1g May 16, 2025
6 changes: 6 additions & 0 deletions .coveragerc
@@ -0,0 +1,6 @@
[html]
directory = coverage

[run]
data_file = .coverage_$LOCAL_RANK
relative_files = true
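The `data_file` value embeds `$LOCAL_RANK`, so each worker process in a multi-GPU run should end up with its own coverage data file. A minimal sketch of how such a name resolves, assuming the variable is substituted from the environment when the config is read; the helper `resolve_data_file` is hypothetical and uses `string.Template` only as a stand-in for that substitution:

```python
from string import Template


def resolve_data_file(pattern: str, env: dict) -> str:
    """Expand $VAR references in a config value (hypothetical helper).

    This is a stand-in for the config reader's own substitution; its
    handling of undefined variables may differ from what is shown here.
    """
    # safe_substitute leaves unknown variables untouched instead of raising.
    return Template(pattern).safe_substitute(env)


# One data file per local rank, so concurrent workers do not overwrite each other.
names = [
    resolve_data_file(".coverage_$LOCAL_RANK", {"LOCAL_RANK": str(rank)})
    for rank in range(4)
]
print(names)
```

The resulting per-rank files are the ones the `.coverage_*` entry in `.gitignore` keeps out of version control.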
4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
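Two of these settings appear to overlap: `E501` (line too long) is in `extend-ignore`, and `F401` (unused import) is ignored globally as well as per file for `__init__.py`. A small sketch that reads the same values with the standard-library `configparser` and reports the overlap; the variable names are illustrative, and this is not how flake8 itself parses its options:

```python
import configparser

# The same settings as the .flake8 file above, inlined for the sketch.
FLAKE8_CONFIG = """
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
"""

parser = configparser.ConfigParser()
parser.read_string(FLAKE8_CONFIG)
section = parser["flake8"]

max_len = section.getint("max-line-length")
ignored = [code.strip() for code in section["extend-ignore"].split(",")]

# per-file-ignores has the form "<filename pattern>:<codes>".
file_pattern, per_file_codes = section["per-file-ignores"].split(":", 1)
redundant = [code for code in per_file_codes.split(",") if code in ignored]

print(f"max line length: {max_len}, E501 ignored: {'E501' in ignored}")
print(f"per-file codes already ignored globally: {redundant}")
```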
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement
  request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
  - cron: '15 18 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
    - uses: actions/stale@v5
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        days-before-stale: 60
        stale-issue-message: 'Marking as stale. No activity in 60 days.'
        stale-pr-message: 'Marking as stale. No activity in 60 days.'
        stale-issue-label: 'stale'
        stale-pr-label: 'stale'
        remove-stale-when-updated: true
        operations-per-run: 1000
        days-before-close: -1
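Although the header comment says the workflow "warns and then closes", `days-before-close: -1` means, per the actions/stale documentation, that items are labeled stale but never closed automatically. A minimal sketch of that decision, assuming inactivity is measured from the last update; the function `stale_action` is hypothetical and the exact boundary handling in the real action may differ:

```python
from datetime import datetime, timedelta, timezone

DAYS_BEFORE_STALE = 60   # days-before-stale
DAYS_BEFORE_CLOSE = -1   # days-before-close; negative disables auto-close


def stale_action(last_activity: datetime, now: datetime, is_stale: bool) -> str:
    """Return the action for one issue or PR (hypothetical helper)."""
    idle = now - last_activity
    if not is_stale:
        # Not yet labeled: label it once it has been idle long enough.
        return "mark-stale" if idle >= timedelta(days=DAYS_BEFORE_STALE) else "none"
    if DAYS_BEFORE_CLOSE < 0:
        # Already labeled, but closing is disabled by the negative value.
        return "none"
    return "close" if idle >= timedelta(days=DAYS_BEFORE_CLOSE) else "none"


now = datetime(2025, 5, 16, 18, 15, tzinfo=timezone.utc)  # the cron fires at 18:15 UTC
print(stale_action(now - timedelta(days=61), now, is_stale=False))
print(stale_action(now - timedelta(days=400), now, is_stale=True))
```

With `remove-stale-when-updated: true`, any new activity removes the label, so the 60-day count effectively restarts.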
14 changes: 13 additions & 1 deletion .gitignore
@@ -1,2 +1,14 @@
__pycache__

*.so
build
.coverage_*
*.egg-info
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err
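To see roughly which paths these patterns cover, here is a simplified sketch using `fnmatch`. It is an approximation only: real gitignore matching has more rules (negation, anchoring, `**`), and the helper `is_ignored` is hypothetical. Patterns ending in `/` are treated as directory-only.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# The patterns from the .gitignore above.
PATTERNS = [
    "__pycache__", "*.so", "build", ".coverage_*", "*.egg-info", "*~",
    "slurm*", "logs", ".vscode", "local/", ".gitmodules", "wandb/",
    "onelogger.log", "onelogger.err",
]


def is_ignored(path: str) -> bool:
    """Approximate check: does any path component match a pattern?"""
    parts = PurePosixPath(path).parts
    for pattern in PATTERNS:
        dir_only = pattern.endswith("/")
        name = pattern.rstrip("/")
        # A directory-only pattern cannot match the final (file) component.
        candidates = parts[:-1] if dir_only else parts
        if any(fnmatchcase(part, name) for part in candidates):
            return True
    return False


for example in [".coverage_3", "wandb/run/log.txt", "megatron/training.py"]:
    print(example, is_ignored(example))
```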