Dynamo Release v0.5.1
Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models at data-center scale. It is an open-source-first project under the Apache 2.0 license, built in Rust for performance and Python for extensibility. Dynamo is available as pip wheels and as containers from NVIDIA NGC.
Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):
- NVIDIA TensorRT-LLM
- vLLM
- SGLang
Release Highlights
This release delivers major advances in KV routing capabilities with the new vLLM prefill router and commit router, comprehensive canary health checks across all backends, and significant tool calling enhancements. We strengthened production reliability with request cancellation support, improved Kubernetes deployment workflows, and expanded multinode capabilities. Lastly, we enhanced KVBM performance with vectorized memory transfers and tighter integration with TensorRT-LLM v1.1.0rc5.
Major Features and Improvements
1. Advanced KV Routing & Cache Management
KV Router
- Introduced vLLM prefill router for optimized prefill phase handling (#3155)
- Implemented KV commit router for improved cache consistency (#3024)
- Added router benchmarking capabilities with mooncake-style testing (#3068, #2828)
- Enabled router to optionally skip tracking active blocks during prefill and cached blocks during decode (#3135)
- Router replicas with state-sharing for improved scalability (continued from v0.4.1)
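At its core, KV-aware routing scores workers by how many of a request's leading prefix blocks they already hold in cache, then sends the request to the best match. The sketch below is an illustrative simplification of that idea, not Dynamo's router implementation; the function name and data shapes are assumptions for the example.

```python
from typing import Dict, List, Set, Tuple

def route_by_kv_overlap(
    prefix_blocks: List[int],
    worker_caches: Dict[str, Set[int]],
) -> Tuple[str, int]:
    """Pick the worker whose cache covers the most leading prefix blocks.

    Illustrative simplification: a real KV router also weighs load,
    queue depth, and active blocks, and learns cache contents from
    KV events rather than holding them in a dict.
    """
    best_worker, best_overlap = "", -1
    for worker, cached in worker_caches.items():
        overlap = 0
        for block in prefix_blocks:  # overlap must be contiguous from the start
            if block in cached:
                overlap += 1
            else:
                break
        if overlap > best_overlap:
            best_worker, best_overlap = worker, overlap
    return best_worker, best_overlap

# Worker w0 holds the first two prefix blocks, w1 only the first.
worker, hits = route_by_kv_overlap([101, 102, 103], {"w0": {101, 102}, "w1": {101}})
```

Routing on contiguous prefix overlap (rather than total cache hits) matters because a cached block is only reusable if every earlier block in the prefix is also cached.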
KVBM (KV Block Manager)
- Implemented vectorized copy between pinned memory and device memory for improved transfer performance (#2989)
- Enhanced KVBM transfer context v2 (#2873)
- Added KV indexer metrics for better observability (#2905)
- Updated integration with TensorRT-LLM v1.1.0rc5 connector API (#2979, #3119)
- Improved error handling with early stop for missing CPU/disk configuration (#2997)
2. Enhanced Health Checks & Reliability
Canary Health Checks
- Implemented canary health check framework (#2903)
- Added TensorRT-LLM canary health check with BOS token support (#3082, #3145)
- Deployed SGLang canary health check (#3103, #3123)
- Enabled vLLM prefill-specific health check payload (#3126)
Request Management
- Added request cancellation support for unary requests (#3004)
- Enabled vLLM abort while engine generates next token (#3102)
- Implemented router-level request rejection for better resource management
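In a Kubernetes deployment, a canary health check is typically surfaced through a standard probe against the worker's health endpoint. The fragment below is an illustrative sketch only; the path, port, and timing values are assumptions for the example, not values documented in this release.

```yaml
# Illustrative sketch: endpoint path, port, and timings are assumptions,
# not documented values from this release.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
```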
3. Tool Calling & Reasoning Enhancements
- Enabled tool calling with stream=True support (#2932)
- Added Deepseek V3.1 tool parser with library refactoring (#2832)
- Implemented Granite class reasoning parser (#2936)
- Enhanced GPT-OSS frontend with Harmony tool calling and reasoning parsers (#2999)
- Added finish reason tool_calls for non-streaming responses (#3087)
- Fixed null tools processing via minijinja (#3340)
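To show the shape of a streaming tool-calling request against the OpenAI-compatible frontend, here is a minimal sketch of a `/v1/chat/completions` request body. The model name and tool definition are hypothetical; the payload follows the standard OpenAI chat-completions schema rather than anything Dynamo-specific.

```python
import json

# Hypothetical tool definition in the standard OpenAI function-calling
# schema; the model name and function are illustrative, not from this release.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body for POST /v1/chat/completions with streaming enabled.
request_body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [get_weather],
    "tool_choice": "auto",
    "stream": True,
}

print(json.dumps(request_body, indent=2))
```

With `stream=True`, tool-call arguments arrive incrementally across chunks and must be accumulated client-side before the finish reason `tool_calls` is emitted.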
4. Kubernetes & Deployment Improvements
Grove Integration
- Updated to official Grove 0.1.0-alpha release (#3030)
- Added planner manifest support for Grove (#3203)
Deployment Enhancements
- Installed Dynamo operator cluster-wide by default (#3199)
- Added multinode K8s examples for TensorRT-LLM and vLLM (#3100)
- Enabled in-cluster performance benchmarks with kubectl one-liner (#3144)
- Implemented namespace isolation for improved multi-tenancy (#2394, #2970)
- Added a virtual connector for third-party deployments (#2913)
- Improved SGLang multinode handling in operator (#3151)
5. Observability & Metrics
- Added HTTP queue metrics for NIM frontend request tracking (#2914)
- Implemented NIM frontend runtime config metrics with periodic polling (#3107)
- Added metrics labels for multimodal workloads (#2835)
- Implemented frontend disconnect metrics (#2953)
- Unified component metric names to prevent Kubernetes label collisions (continued from v0.4.1)
6. Frontend & Model Support
- Added support for serving multiple models from a single endpoint (continued from v0.4.1)
- Implemented --custom-jinja-template argument for custom chat templates (#2829)
- Added chat_template_kwargs parameter to v1/chat/completions (#3016)
- Enabled framework tokenization/detokenization (#3134)
- Implemented ModelExpress Dynamo integration (#3191)
- Added SLA Planner support for TensorRT-LLM (#2980) and SGLang MoE models (#3185)
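As a sketch of the new chat_template_kwargs parameter, the request body below passes a template variable through to the model's chat template. The `enable_thinking` key and model name are assumptions for illustration; which keys are accepted depends entirely on the chat template in use.

```python
import json

# Sketch of a /v1/chat/completions body using chat_template_kwargs.
# The "enable_thinking" key and model name are illustrative assumptions;
# accepted keys depend on the model's chat template.
request_body = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Hello"}],
    "chat_template_kwargs": {"enable_thinking": False},
}

print(json.dumps(request_body))
```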
7. Performance & Optimization
- Refactored discovery ModelManager to use parking_lot::RwLock (#2902)
- Ported vLLM port allocator to Rust bindings for improved performance (#3125)
- Implemented JailedStream for better resource management (#3034)
- Added generic tensor type for inference (#2746)
- Updated benchmarking and deployment utilities (#2933, #2973, #3098)
8. Bug Fixes
- Fixed OpenAI-compliant usage stats for streaming responses (#3022)
- Resolved token loss bug in final packet (#2985)
- Fixed aggregate logprobs calculation (#2928)
- Corrected Harmony parser streaming behavior (#3074)
- Fixed router slot manager force expire requests (#2840)
- Resolved metrics collection and namespace sanitization issues (#2868)
- Fixed polling from exhausted stream in preprocessor (#3349)
- Addressed KVBM fully contiguous memory region size bug (#3175)
Documentation
- Revamped Kubernetes documentation (#3173)
- Created deployment and benchmarking recipes for Llama3-70B and GPT-OSS-120B (#2792)
- Added AWS ECS deployment example for Dynamo vLLM (#2415, #3381)
- Published Python runtime request cancellation examples (#2893)
- Added health check and structured logs documentation (#2805)
- Created mermaid diagrams showcasing KV router features (#3184)
- Updated consistent hashing documentation for KV events (#2981)
- Published profiling-related documentation updates (#2816)
- Fixed broken links and Sphinx structural errors (#3186, #3342)
Build, CI, and Test
- Restructured TensorRT-LLM and SGLang to follow container strategy structure (#3009, #2803)
- Moved to ARC runners for CI (#2904)
- Added SGLang functional tests (#2943)
- Implemented fault injection tests for Kubernetes (#3194)
- Added concurrency checks to auto-cancel running actions (#2438)
- Created broken links checker (#2927)
- Converted vLLM multimodal examples to pytest framework (continued from v0.4.1)
- Updated TensorRT-LLM to v1.1.0rc5 (#3119)
Migration Notes
- Component metric names continue to use the dynamo_component_* pattern. Ensure dashboards and alerting rules are updated accordingly.
- The Dynamo operator now installs cluster-wide by default. If namespace-scoped installation is required, use the appropriate Helm values.
- TensorRT-LLM has been updated to v1.1.0rc5, which includes KVBM integration changes. Review the updated connector API if using custom integrations.
- The Multinode Multimodal Guide works only with release v0.5.0. Users requiring multinode multimodal functionality should continue using v0.5.0 until support is restored in a future release.
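For the metric-naming note above, dashboards and alerting rules should match the dynamo_component_* prefix. The rule below is a hypothetical sketch: the exact metric name is illustrative of the naming pattern, not an actual metric from this release.

```yaml
# Hypothetical Prometheus rule; the metric name illustrates the
# dynamo_component_* pattern and is not an exact metric from this release.
groups:
  - name: dynamo-components
    rules:
      - alert: DynamoComponentNoTraffic
        expr: rate(dynamo_component_requests_total[5m]) == 0
        for: 10m
        labels:
          severity: warning
```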
Looking Forward
This release strengthens Dynamo's production readiness with advanced KV routing, comprehensive health monitoring, and robust request management. The enhanced Kubernetes integration and multinode support enable seamless scaling for enterprise deployments. With improved observability and the new prefill router, teams can now optimize both throughput and latency for diverse workload patterns. These capabilities set the stage for even more sophisticated routing strategies and performance optimizations in future releases!
Release Assets
Python Wheels:
- ai-dynamo v0.5.1
- ai-dynamo-runtime v0.5.1
Rust Crates:
- dynamo-runtime v0.5.1
- dynamo-async-openai v0.5.1
- dynamo-llm v0.5.1
- dynamo-parsers v0.5.1
Containers:
- TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1 (NGC)
- vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1 (NGC)
- SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1 (NGC)
- Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.5.1 (NGC)
Helm Charts:
Contributors
We welcome new contributors in this release: @blarson-b10, @lixuwei2333, @GavinZhu-GMI, @nv-hwoo
Full Changelog: v0.5.0...v0.5.1