Dynamo Release v0.5.1
Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models at data-center scale. It is an open-source-first project under the Apache 2.0 license, built in Rust for performance and Python for extensibility. Dynamo is available as pip wheels and as containers from NVIDIA NGC.
Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Release Highlights
This release delivers major advances in KV routing capabilities with the new vLLM prefill router and commit router, comprehensive canary health checks across all backends, and significant tool calling enhancements. We strengthened production reliability with request cancellation support, improved Kubernetes deployment workflows, and expanded multinode capabilities. Lastly, we enhanced KVBM performance with vectorized memory transfers and tighter integration with TensorRT-LLM v1.1.0rc5.
Major Features and Improvements
1. Advanced KV Routing & Cache Management
KV Router
- Introduced vLLM prefill router for optimized prefill phase handling (#3155)
- Implemented KV commit router for improved cache consistency (#3024)
- Added router benchmarking capabilities with mooncake-style testing (#3068, #2828)
- Enabled router to optionally skip tracking active blocks during prefill and cached blocks during decode (#3135)
- Router replicas with state-sharing for improved scalability (continued from v0.4.1)
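At its core, KV-aware routing scores workers by how many of a request's leading prefix blocks they already hold in cache, then sends the request to the best match. The sketch below is an illustrative simplification of that idea, not Dynamo's router implementation; the function name and data shapes are assumptions for the example.

```python
from typing import Dict, List, Set, Tuple

def route_by_kv_overlap(
    prefix_blocks: List[int],
    worker_caches: Dict[str, Set[int]],
) -> Tuple[str, int]:
    """Pick the worker whose cache covers the most leading prefix blocks.

    Illustrative simplification: a real KV router also weighs load,
    queue depth, and active blocks, and learns cache contents from
    KV events rather than holding them in a dict.
    """
    best_worker, best_overlap = "", -1
    for worker, cached in worker_caches.items():
        overlap = 0
        for block in prefix_blocks:  # overlap must be contiguous from the start
            if block in cached:
                overlap += 1
            else:
                break
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker, best_overlap

# Worker w0 holds the first two prefix blocks, w1 only the first.
worker, hits = route_by_kv_overlap([101, 102, 103], {"w0": {101, 102}, "w1": {101}})
```

Routing on contiguous prefix overlap (rather than total cache hits) matters because a cached block is only reusable if every earlier block in the prefix is also cached.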
KVBM (KV Block Manager)
- Implemented vectorized copy between pinned memory and device memory for improved transfer performance (#2989)
- Enhanced KVBM transfer context v2 (#2873)
- Added KV indexer metrics for better observability (#2905)
- Updated integration with TensorRT-LLM v1.1.0rc5 connector API (#2979, #3119)
- Improved error handling with early stop for missing CPU/disk configuration (#2997)
2. Enhanced Health Checks & Reliability
Canary Health Checks
- Implemented canary health check framework (#2903)
- Added TensorRT-LLM canary health check with BOS token support (#3082, #3145)
- Deployed SGLang canary health check (#3103, #3123)
- Enabled vLLM prefill-specific health check payload (#3126)
Request Management
- Added request cancellation support for unary requests (#3004)
- Enabled vLLM abort while engine generates next token (#3102)
- Implemented router-level request rejection for better resource management
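In a Kubernetes deployment, a canary health check is typically surfaced through a standard probe against the worker's health endpoint. The fragment below is an illustrative sketch only; the path, port, and timing values are assumptions for the example, not values documented in this release.

```yaml
# Illustrative sketch: endpoint path, port, and timings are assumptions,
# not documented values from this release.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```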
3. Tool Calling & Reasoning Enhancements
- Enabled tool calling with stream=True support (#2932)
- Added Deepseek V3.1 tool parser with library refactoring (#2832)
- Implemented Granite class reasoning parser (#2936)
- Enhanced GPT-OSS frontend with Harmony tool calling and reasoning parsers (#2999)
- Added finish reason tool_calls for non-streaming responses (#3087)
- Fixed null tools processing via minijinja (#3340)
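To show the shape of a streaming tool-calling request against the OpenAI-compatible frontend, here is a minimal sketch of a `/v1/chat/completions` request body. The model name and tool definition are hypothetical; the payload follows the standard OpenAI chat-completions schema rather than anything Dynamo-specific.

```python
import json

# Hypothetical tool definition in the standard OpenAI function-calling
# schema; the model name and function are illustrative, not from this release.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body for POST /v1/chat/completions with streaming enabled.
request_body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [get_weather],
    "tool_choice": "auto",
    "stream": True,
}

print(json.dumps(request_body, indent=2))
```

With `stream=True`, tool-call arguments arrive incrementally across chunks and must be accumulated client-side before the finish reason `tool_calls` is emitted.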
4. Kubernetes & Deployment Improvements
Grove Integration
- Updated to official Grove 0.1.0-alpha release (#3030)
- Added planner manifest support for Grove (#3203)
Deployment Enhancements
- Installed Dynamo operator cluster-wide by default (#3199)
- Added multinode K8s examples for TensorRT-LLM and vLLM (#3100)
- Enabled in-cluster performance benchmarks with kubectl one-liner (#3144)
- Implemented namespace isolation for improved multi-tenancy (#2394, #2970)
- Added a virtual connector for third-party deployments (#2913)
- Improved SGLang multinode handling in operator (#3151)
5. Observability & Metrics
- Added HTTP queue metrics for NIM frontend request tracking (#2914)
- Implemented NIM frontend runtime config metrics with periodic polling (#3107)
- Added metrics labels for multimodal workloads (#2835)
- Implemented frontend disconnect metrics (#2953)
- Unified component metric names to prevent Kubernetes label collisions (continued from v0.4.1)
6. Frontend & Model Support
- Added support for serving multiple models from a single endpoint (continued from v0.4.1)
- Implemented --custom-jinja-template argument for custom chat templates (#2829)
- Added chat_template_kwargs parameter to v1/chat/completions (#3016)
- Enabled framework tokenization/detokenization (#3134)
- Implemented ModelExpress Dynamo integration (#3191)
- Added SLA Planner support for TensorRT-LLM (#2980) and SGLang MoE models (#3185)
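As a sketch of the new chat_template_kwargs parameter, the request body below passes a template variable through to the model's chat template. The `enable_thinking` key and model name are assumptions for illustration; which keys are accepted depends entirely on the chat template in use.

```python
import json

# Sketch of a /v1/chat/completions body using chat_template_kwargs.
# The "enable_thinking" key and model name are illustrative assumptions;
# accepted keys depend on the model's chat template.
request_body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}

print(json.dumps(request_body))
```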
7. Performance & Optimization
- Refactored discovery ModelManager to use parking_lot::RwLock (#2902)
- Ported vLLM port allocator to Rust bindings for improved performance (#3125)
- Implemented JailedStream for better resource management (#3034)
- Added generic tensor type for inference (#2746)
- Updated benchmarking and deployment utilities (#2933, #2973, #3098)
8. Bug Fixes
- Fixed OpenAI-compliant usage stats for streaming responses (#3022)
- Resolved token loss bug in final packet (#2985)
- Fixed aggregate logprobs calculation (#2928)
- Corrected Harmony parser streaming behavior (#3074)
- Fixed router slot manager force expire requests (#2840)
- Resolved metrics collection and namespace sanitization issues (#2868)
- Fixed polling from exhausted stream in preprocessor (#3349)
- Addressed KVBM fully contiguous memory region size bug (#3175)
Documentation
- Revamped Kubernetes documentation (#3173)
- Created deployment and benchmarking recipes for Llama3-70B and GPT-OSS-120B (#2792)
- Added AWS ECS deployment example for Dynamo vLLM (#2415, #3381)
- Published Python runtime request cancellation examples (#2893)
- Added health check and structured logs documentation (#2805)
- Created mermaid diagrams showcasing KV router features (#3184)
- Updated consistent hashing documentation for KV events (#2981)
- Published profiling-related documentation updates (#2816)
- Fixed broken links and Sphinx structural errors (#3186, #3342)
Build, CI, and Test
- Restructured TensorRT-LLM and SGLang to follow container strategy structure (#3009, #2803)
- Moved to ARC runners for CI (#2904)
- Added SGLang functional tests (#2943)
- Implemented fault injection tests for Kubernetes (#3194)
- Added concurrency checks to auto-cancel running actions (#2438)
- Created broken links checker (#2927)
- Converted vLLM multimodal examples to pytest framework (continued from v0.4.1)
- Updated TensorRT-LLM to v1.1.0rc5 (#3119)
Migration Notes
- Component metric names continue to use the dynamo_component_* pattern. Ensure dashboards and alerting rules are updated accordingly.
- The Dynamo operator now installs cluster-wide by default. If namespace-scoped installation is required, use the appropriate Helm values.
- TensorRT-LLM has been updated to v1.1.0rc5, which includes KVBM integration changes. Review the updated connector API if using custom integrations.
- The Multinode Multimodal Guide works only with release v0.5.0. Users requiring multinode multimodal functionality should continue using v0.5.0 until support is restored in a future release.
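For the metric-naming note above, dashboards and alerting rules should match the dynamo_component_* prefix. The rule below is a hypothetical sketch: the exact metric name is illustrative of the naming pattern, not an actual metric from this release.

```yaml
# Hypothetical Prometheus rule; the metric name illustrates the
# dynamo_component_* pattern and is not an exact metric from this release.
groups:
  - name: dynamo-components
    rules:
      - alert: DynamoComponentNoTraffic
        expr: rate(dynamo_component_requests_total[5m]) == 0
        for: 10m
        labels:
          severity: warning
```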
Looking Forward
This release strengthens Dynamo's production readiness with advanced KV routing, comprehensive health monitoring, and robust request management. The enhanced Kubernetes integration and multinode support enable seamless scaling for enterprise deployments. With improved observability and the new prefill router, teams can now optimize both throughput and latency for diverse workload patterns. These capabilities set the stage for even more sophisticated routing strategies and performance optimizations in future releases!
Release Assets
Python Wheels:
- ai-dynamo v0.5.1
- ai-dynamo-runtime v0.5.1
Rust Crates:
- dynamo-runtime v0.5.1
- dynamo-async-openai v0.5.1
- dynamo-llm v0.5.1
- dynamo-parsers v0.5.1
Containers:
- TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1 (NGC)
- vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1 (NGC)
- SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1 (NGC)
- Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.5.1 (NGC)
Helm Charts:
Contributors
We welcome new contributors in this release: @blarson-b10, @lixuwei2333, @GavinZhu-GMI, @nv-hwoo
Full Changelog: v0.5.0...v0.5.1