Changelog

NVIDIA Megatron Core 0.9.0

Uneven pipeline parallelism
- Enable pipeline parallelism where first and last ranks have fewer transformer layers than the intermediate ranks
Per layer CUDAGraph support for GPT training with Transformer Engine modules
Enable different TP sizes for the vision encoder
Enable pipeline parallelism for T5 & Llava models
Support multi-tile multi-image input in Llava models
MoE
- FP8 support
- Runtime upcycling support
- Dispatcher implementation optimizations
- Shared expert support with overlapping optimizations
  - Qwen Model support
Known Issues
- When using sequence parallel, during the transformer block forward pass, dropout is not using the appropriate rng context.

NVIDIA Megatron Core 0.8.0

Multimodal
- Added initial support for training vision language models using the LLaVA architecture
- Added initial support for inference with multimodal inputs
- End-to-end multimodal example from data collection to training to evaluation is provided in examples/multimodal
MoE
- Context Parallel support.
- Distributed checkpoint support for grouped GEMM.
Mamba

NVIDIA Megatron Core 0.7.0

MoE
- Token drop support
- Several efficiency optimizations
- Improved model parallelism
- Memory optimizations
Distributed checkpointing
- Enabled for Retro
- Asynchronous checkpoint saving
Several minor bug fixes, speed improvements, and memory optimizations

NVIDIA Megatron Core 0.6.0

MoE (Mixture of Experts)
- Performance optimization
  - Communication optimization for multi GPU and Single GPU
  - 23% improvement (323 TFLOPS/GPU) over MCore 0.5.0 on Mixtral with Hopper BF16
  - GroupedMLP enhancement for Hopper
  - DP Overlapping. Support overlapping computation with gradient reduction and parameter gathering.
- All-to-All based Token Dispatcher
- Layer-wise logging for load balancing loss.
- Improved expert parallel support including distributed optimizer.
Distributed optimizer
RETRO
- Data processing
BERT
- Distributed checkpointing
Dist checkpointing
- PyTorch native distributed backend
- Improved saving/loading speed
TensorRT-LLM Export
- Integration with TensorRT Model Optimizer Post-training quantization (PTQ)
- Text generation driver to perform PTQ in Megatron-LM
- Llama2 and Nemotron3-8b examples to use TensorRT-LLM unified build API to build engine after training.
Several minor enhancements, bug fixes, and documentation updates

NVIDIA Megatron Core 0.5.0

Key Features and Enhancements

Megatron core documentation is now live!

Model Features

MoE (Mixture of Experts)
- Support for Z-loss, Load balancing and Sinkhorn
- Layer and communications refactor
- Richer parallelism mappings and EP can be combined with other model parallel techniques for larger MoE variants, e.g. EP + TP + DP + SP + PP
- Token dropless architecture with Top-K routing
- Performance optimization with with GroupedGEMM when number of local experts is > 1
- Distributed checkpointing
Interleaved rotary embedding

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CHANGELOG.md

CHANGELOG.md

Changelog

NVIDIA Megatron Core 0.9.0

NVIDIA Megatron Core 0.8.0

NVIDIA Megatron Core 0.7.0

NVIDIA Megatron Core 0.6.0

NVIDIA Megatron Core 0.5.0

Key Features and Enhancements

Model Features

Datasets

Parallelism

Performance

General Improvements

NVIDIA Megatron Core 0.4.0

Key Features and Enhancements

Models

Parallelism

Datasets

Files

CHANGELOG.md

Latest commit

History

CHANGELOG.md

File metadata and controls

Changelog

NVIDIA Megatron Core 0.9.0

NVIDIA Megatron Core 0.8.0

NVIDIA Megatron Core 0.7.0

NVIDIA Megatron Core 0.6.0

NVIDIA Megatron Core 0.5.0

Key Features and Enhancements

Model Features

Datasets

Parallelism

Performance

General Improvements

NVIDIA Megatron Core 0.4.0

Key Features and Enhancements

Models

Parallelism

Datasets