
Releases: ai-dynamo/dynamo

Dynamo v0.5.1

14 Oct 02:02
3ecc1fb

Dynamo Release v0.5.1

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models at data-center scale. It's an open-source-first project under the Apache 2.0 license, built in Rust for performance and Python for extensibility. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release delivers major advances in KV routing capabilities with the new vLLM prefill router and commit router, comprehensive canary health checks across all backends, and significant tool calling enhancements. We strengthened production reliability with request cancellation support, improved Kubernetes deployment workflows, and expanded multinode capabilities. Lastly, we enhanced KVBM performance with vectorized memory transfers and tighter integration with TensorRT-LLM v1.1.0rc5.

Major Features and Improvements

1. Advanced KV Routing & Cache Management

KV Router

  • Introduced vLLM prefill router for optimized prefill phase handling (#3155)
  • Implemented KV commit router for improved cache consistency (#3024)
  • Added router benchmarking capabilities with mooncake-style testing (#3068, #2828)
  • Enabled router to optionally skip tracking active blocks during prefill and cached blocks during decode (#3135)
  • Router replicas with state-sharing for improved scalability (continued from v0.4.1)
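The router items above center on one idea: send a request to the worker that already holds most of its KV blocks, weighed against that worker's current load. A minimal sketch of that scoring, with hypothetical names and a simple load penalty (not Dynamo's actual API):

```python
# Illustrative KV-aware worker selection: score each worker by how many of
# the request's KV block hashes it already caches, minus a load penalty.

def select_worker(request_blocks, workers, load_weight=1.0):
    """workers: {worker_id: {"cached": set of block hashes, "active": int}}"""
    def score(wid):
        overlap = len(request_blocks & workers[wid]["cached"])
        return overlap - load_weight * workers[wid]["active"]
    return max(workers, key=score)

workers = {
    "w0": {"cached": {1, 2, 3}, "active": 4},  # more cache hits, but busy
    "w1": {"cached": {1, 2}, "active": 0},     # fewer hits, idle
}
# w1 wins: two cached blocks and no active load beats three blocks under load.
print(select_worker({1, 2, 3, 4}, workers))  # → w1
```

The skip-tracking option above (#3135) effectively changes which blocks feed into this kind of overlap calculation during prefill and decode.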

KVBM (KV Block Manager)

  • Implemented vectorized copy between pinned memory and device memory for improved transfer performance (#2989)
  • Enhanced KVBM transfer context v2 (#2873)
  • Added KV indexer metrics for better observability (#2905)
  • Updated integration with TensorRT-LLM v1.1.0rc5 connector API (#2979, #3119)
  • Improved error handling with early stop for missing CPU/disk configuration (#2997)

2. Enhanced Health Checks & Reliability

Canary Health Checks

  • Implemented canary health check framework (#2903)
  • Added TensorRT-LLM canary health check with BOS token support (#3082, #3145)
  • Deployed SGLang canary health check (#3103, #3123)
  • Enabled vLLM prefill-specific health check payload (#3126)

Request Management

  • Added request cancellation support for unary requests (#3004)
  • Enabled vLLM abort while engine generates next token (#3102)
  • Implemented router-level request rejection for better resource management
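Cancellation of a unary request follows the usual async pattern: the caller cancels an in-flight task and the worker frees its slot instead of finishing the generation. A generic asyncio sketch of that pattern (not Dynamo's runtime API):

```python
import asyncio

# Generic request-cancellation pattern: cancel an in-flight task and
# recover cleanly so the engine slot can be released immediately.

async def generate(prompt: str) -> str:
    await asyncio.sleep(10)  # stand-in for engine work
    return f"response to {prompt!r}"

async def main() -> str:
    task = asyncio.create_task(generate("hello"))
    await asyncio.sleep(0.01)
    task.cancel()            # e.g. the client disconnected
    try:
        return await task
    except asyncio.CancelledError:
        return "cancelled"   # free resources instead of completing

print(asyncio.run(main()))   # → cancelled
```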

3. Tool Calling & Reasoning Enhancements

  • Enabled tool calling with stream=True support (#2932)
  • Added Deepseek V3.1 tool parser with library refactoring (#2832)
  • Implemented Granite class reasoning parser (#2936)
  • Enhanced GPT-OSS frontend with Harmony tool calling and reasoning parsers (#2999)
  • Added finish reason tool_calls for non-streaming responses (#3087)
  • Fixed null tools processing via minijinja (#3340)
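With stream=True, tool calls arrive as OpenAI-style chunk deltas that the client must reassemble: an early delta carries the call id and function name, later deltas append argument fragments, and the stream finishes with finish_reason "tool_calls". A sketch of that reassembly, using hypothetical delta payloads:

```python
import json

# Reassemble a streamed tool call from OpenAI-style chunk deltas.
chunks = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city": '}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '"Paris"}'}}]},
]

calls = {}
for delta in chunks:
    for tc in delta["tool_calls"]:
        call = calls.setdefault(tc["index"], {"id": None, "name": None, "arguments": ""})
        if "id" in tc:
            call["id"] = tc["id"]
        fn = tc.get("function", {})
        if fn.get("name"):
            call["name"] = fn["name"]
        call["arguments"] += fn.get("arguments", "")

args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args["city"])  # → get_weather Paris
```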

4. Kubernetes & Deployment Improvements

Grove Integration

  • Updated to official Grove 0.1.0-alpha release (#3030)
  • Added planner manifest support for Grove (#3203)

Deployment Enhancements

  • Installed Dynamo operator cluster-wide by default (#3199)
  • Added multinode K8s examples for TensorRT-LLM and vLLM (#3100)
  • Enabled in-cluster performance benchmarks with kubectl one-liner (#3144)
  • Implemented namespace isolation for improved multi-tenancy (#2394, #2970)
  • Added virtual connector for 3rd party deployments (#2913)
  • Improved SGLang multinode handling in operator (#3151)

5. Observability & Metrics

  • Added HTTP queue metrics for NIM frontend request tracking (#2914)
  • Implemented NIM FE runtime config metrics with periodic polling (#3107)
  • Added metrics labels for multimodal workloads (#2835)
  • Implemented frontend disconnect metrics (#2953)
  • Unified component metric names to prevent Kubernetes label collisions (continued from v0.4.1)

6. Frontend & Model Support

  • Added support for serving multiple models from single endpoint (continued from v0.4.1)
  • Implemented --custom-jinja-template argument for custom chat templates (#2829)
  • Added chat_template_kwargs parameter to v1/chat/completions (#3016)
  • Enabled framework tokenization/detokenization (#3134)
  • Implemented ModelExpress Dynamo integration (#3191)
  • Added SLA Planner support for TensorRT-LLM (#2980) and SGLang MoE models (#3185)
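The new chat_template_kwargs parameter lets a request pass extra variables into the model's chat template. A sketch of a request body for the OpenAI-compatible endpoint; the model name, kwarg, and URL in the comment are illustrative, and the accepted keys depend on the model's own chat template:

```python
import json

# Example request body using chat_template_kwargs (values are illustrative).
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {"enable_thinking": False},
    "stream": False,
}

body = json.dumps(payload)
# POST this body to the frontend's /v1/chat/completions endpoint, e.g.
# requests.post("http://localhost:8000/v1/chat/completions", data=body,
#               headers={"Content-Type": "application/json"})
print(json.loads(body)["chat_template_kwargs"])
```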

7. Performance & Optimization

  • Refactored discovery ModelManager to use parking_lot::RwLock (#2902)
  • Ported vLLM port allocator to Rust bindings for improved performance (#3125)
  • Implemented JailedStream for better resource management (#3034)
  • Added generic tensor type for inference (#2746)
  • Updated benchmarking and deployment utilities (#2933, #2973, #3098)

8. Bug Fixes

  • Fixed OpenAI-compliant usage stats for streaming responses (#3022)
  • Resolved token loss bug in final packet (#2985)
  • Fixed aggregate logprobs calculation (#2928)
  • Corrected Harmony parser streaming behavior (#3074)
  • Fixed router slot manager force expire requests (#2840)
  • Resolved metrics collection and namespace sanitization issues (#2868)
  • Fixed polling from exhausted stream in preprocessor (#3349)
  • Addressed KVBM fully contiguous memory region size bug (#3175)

Documentation

  • Revamped Kubernetes documentation (#3173)
  • Created deployment and benchmarking recipes for Llama3-70B and GPT-OSS-120B (#2792)
  • Added AWS ECS deployment example for Dynamo vLLM (#2415, #3381)
  • Published Python runtime request cancellation examples (#2893)
  • Added health check and structured logs documentation (#2805)
  • Created mermaid diagrams showcasing KV router features (#3184)
  • Updated consistent hashing documentation for KV events (#2981)
  • Published profiling-related documentation updates (#2816)
  • Fixed broken links and Sphinx structural errors (#3186, #3342)
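For context on the consistent-hashing item above: a hash ring maps each key to a stable owner, so events for the same KV block always land on the same consumer even as members come and go. A generic sketch (node names and virtual-node count are illustrative, not Dynamo's implementation):

```python
import bisect
import hashlib

# Generic consistent-hash ring: each key deterministically maps to one node,
# and adding/removing a node only remaps a small fraction of keys.

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Place vnodes points per node on the ring for smoother balance.
        self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        i = bisect.bisect(self.keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["router-0", "router-1", "router-2"])
# Events for the same KV block always hash to the same replica.
print(ring.owner("block:42") == ring.owner("block:42"))  # → True
```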

Build, CI, and Test

  • Restructured TensorRT-LLM and SGLang to follow container strategy structure (#3009, #2803)
  • Moved to ARC runners for CI (#2904)
  • Added SGLang functional tests (#2943)
  • Implemented fault injection tests for Kubernetes (#3194)
  • Added concurrency checks to auto-cancel running actions (#2438)
  • Created broken links checker (#2927)
  • Converted vLLM multimodal examples to pytest framework (continued from v0.4.1)
  • Updated TensorRT-LLM to v1.1.0rc5 (#3119)

Migration Notes

  • Component metric names continue to use the dynamo_component_* pattern. Ensure dashboards and alerting rules are updated accordingly.
  • The Dynamo operator now installs cluster-wide by default. If namespace-scoped installation is required, use the appropriate Helm values.
  • TensorRT-LLM has been updated to v1.1.0rc5, which includes KVBM integration changes. Review the updated connector API if using custom integrations.
  • The Multinode Multimodal Guide works only with release v0.5.0. Users requiring multinode multimodal functionality should continue using v0.5.0 until support is restored in a future release.

Looking Forward
This release strengthens Dynamo's production readiness with advanced KV routing, comprehensive health monitoring, and robust request management. The enhanced Kubernetes integration and multinode support enable seamless scaling for enterprise deployments. With improved observability and the new prefill router, teams can now optimize both throughput and latency for diverse workload patterns. These capabilities set the stage for even more sophisticated routing strategies and performance optimizations in future releases!

Release Assets
Python Wheels:

Containers:

  • TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
  • vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
  • SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1
  • Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.5.1

Helm Charts:

Contributors
We welcome new contributors in this release: @blarson-b10, @lixuwei2333, @GavinZhu-GMI, @nv-hwoo
Full Changelog: v0.5.0...v0.5.1

Dynamo Release v0.5.0

18 Sep 22:47
65f12d7

Dynamo 0.5.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details).

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release introduces TRT-LLM integration for KV cache management, gRPC support, and tool-calling capabilities. We also delivered major improvements to system reliability, including request cancellation and improved observability.


Major Features and Improvements

1. Fault Tolerance & Observability

  • Implemented end-to-end request cancellation (#2158, #2500) with Python context propagation
  • Implemented DRT shutdown on vLLM engine failures (#2698)
  • Added fast-fail validation for NATS JetStream requirements to prevent silent failures (#2590)
  • Unified metrics across all components with model labels for vLLM (#2474), TensorRT-LLM (#2666), and SGLang (#2679)
  • Standardized Prometheus metrics naming and sanitization with KvStats integration (#2733, #2704)
  • Added automatic uptime tracking and auto-start of metrics collection upon NATS service creation (#2587, #2664), improving observability readiness

2. Kubernetes Deployments

  • Integrated Grove and KAI scheduler into Dynamo Cloud Helm chart for multi-node deployments (#2755)
  • Implemented auto-injection of kai-scheduler annotations and labels with parent DGD Kubernetes name support (#2748, #2774)
  • Deployed Dynamo EPP-aware gateway with prevention of double-tokenization for optimized routing (#2633, #2559)
  • Integrated Model Express client for optimized model downloads with URL injection support (#2574, #2769)

3. KV Cache Management & Transfer

  • Integrated Dynamo KVBM connector API with TensorRT-LLM for G2-G3 offloading and onboarding (#2544)
  • Added support for user selection among multiple KV transfer connectors (nixl, kvbm, lmcache) (#2517)
  • Added detailed KV Block Manager metrics for match, offload, and onboard operations (#2626, #2673)

4. Planning & Routing

Router

  • Separated the frontend and Router via Python bindings for KvPushRouter, so the two can be scaled independently (#2658, #2548)
  • Implemented warm restarts via durable KV event consumers and radix snapshotting for router persistence (#2756, #2740, #2800)

Planner

  • Added comprehensive tests for replica calculation and planner scaling with automated Kubernetes deployment validation (#2525)
  • Added SLA planner dry-run mode with a CLI to simulate workloads, generate plots, and expose optional Prometheus metrics (#2557)

5. Others

Tool Calling

  • Introduced parsers library (#2542) supporting multiple reasoning and tool-calling formats.
  • Implemented multiple tool-calling parsers, including Pythonic (#2788), Harmony (#2796), and JSON-based parsers with normal text parsing alongside tool calls (#2709)
  • Added support for separating reasoning from visible text (#2555) along with GPT-OSS reasoning parser integration (#2656)
  • Added support for custom logits processors in the TensorRT-LLM backend, enabling in-place logits modification during generation (#2613, #2702)
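The parsers above all solve the same problem: splitting a model's raw output into visible text and structured tool calls. A toy version, assuming a hypothetical <tool_call>…</tool_call> delimiter (real models each use their own format, which is why the library ships per-model parsers):

```python
import json
import re

# Toy parser separating visible text from a JSON tool call.
# The <tool_call> tag format here is illustrative only.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse(output: str):
    tool_calls = [json.loads(m) for m in TOOL_RE.findall(output)]
    text = TOOL_RE.sub("", output).strip()
    return text, tool_calls

text, calls = parse(
    'Let me check that. '
    '<tool_call>{"name": "get_time", "arguments": {"tz": "UTC"}}</tool_call>'
)
print(text)              # → Let me check that.
print(calls[0]["name"])  # → get_time
```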

Multimodal Support Expansion

  • Added complete multimodal deployment examples for Llava and Qwen, with video support using vLLM v1 (#2628, #2694, #2738)
  • Added Encode Worker and NIXL support for TensorRT-LLM multimodal disaggregated flows (#2452)

Infrastructure & Performance

  • Added comprehensive KServe gRPC support for industry-standard model inference protocol (#2638)
  • Enhanced Hugging Face integration with HF_HOME and HF_ENDPOINT environment variable support (#2642, #2637)

Developer Experience

  • Added Devcontainer improvements with enhanced documentation and SGLang-specific setup (#2255, #2578, #2741)
  • Added logging setup for Kubernetes with Loki integration and Grafana dashboards (#2699)
  • Added benchmarking guide with GenAI-Perf integration and automated performance comparison (#2620)
  • Updated TensorRT-LLM to 1.0.0rc6 and simplified Eagle model configuration (#2606, #2661)

Bug Fixes

  • Improved Hugging Face download speeds with better API client configuration (#2566)
  • Added missing Prometheus to runtime images for SGLang and general runtime (#2565, #2689)
  • Fixed kv-event-config command line respect and environment variable overrides (#2627, #2640)
  • Enhanced pytest robustness and parsing errors with proper timeout handling (#2676, #2572)
  • Resolved metrics registration timing issues and prevented early returns from affecting measurements (#2664, #2576)

Documentation

  • Created SNS aggregated Kubernetes example and simplified the Sphinx build process (#2773, #2519)
  • Streamlined cloud installation documentation and deployment guides (#2818)
  • Updated benchmarking framework documentation with comprehensive deployment guides (#2620)
  • Updated supported models documentation for multimodal and SGLang container build instructions (#2651, #2707)

Build, CI, and Test

  • Added replica calculation and planner scaling tests with automated Kubernetes deployment validation (#2525)
  • Added vLLM sanity testing support on GitHub Actions with build optimizations (#2526)
  • Optimized CI job execution for docs-only changes and Rust-specific changes (#2775)
  • Enabled KVBM in vLLM container with improved virtual environment handling (#2763)
  • Enhanced test reliability with proper KVBM test exclusions and determinism testing (#2611)
  • Fixed concurrency settings to prevent main branch run cancellations (#2780)
  • Improved container build process with default dev builds for vLLM (#2837)

Migration Notes

  • Parser Integration: New parsing capabilities require updated CLI flags for reasoning and tool calling features
  • Container Updates: Runtime images now include Prometheus by default - review monitoring configurations

Looking Forward

This release sets the stage for more features in our H2 roadmap, including KVBM performance benchmarking, end-to-end performance work, and improved fault tolerance with request rejection at every level. We will focus on significantly updating documentation and examples for a better experience, and will include Kubernetes benchmark scripts for the most popular models.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Contributors

We welcome new contributors in this release:
@jasonqinzhou, @michaelfeil, @ahinsutime, @bhuvan002, @WaelBKZ, @hhk7734, @Michaelgathara, @KavinKrishnan, @michaelshin

Full Changelog: v0.4.1...v0.5.0

Dynamo Release v0.4.1

28 Aug 00:33
9f68e83

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details)

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release brings substantial performance improvements for Deepseek R1, improved fault tolerance capabilities with high availability router testing, and groundbreaking KV cache management features. We've also significantly enhanced our Kubernetes deployment story with Grove integration and the new Inference Gateway, while expanding multimodal support across multiple backends.


Major Features and Improvements

1. Model Performance Breakthroughs

  • Achieved significant DeepSeek R1 wide-EP performance gains with both SGLang (#2223) and TRT-LLM (#2387)
  • Added TRT-LLM support for variable sliding window attention (VSWA) for Gemma3 models (#2134)
  • Launched Day-0 support and a deployment guide for GPT-OSS 120B on Blackwell GPUs (#2297)

2. Fault Tolerance & Observability Improvements

  • Introduced testing for multiple KV routers and frontends for high availability (#2324)
  • Completed end-to-end request migration testing with vLLM (#2177), ensuring seamless failover
  • Added router-level request rejection (#2465) for better resource management under load
  • Unified NATS, DRT & component metrics (#2292) for comprehensive system monitoring
  • Made health checks more flexible with parameterized /health and /live endpoints (#2230)

3. Enhanced Kubernetes Deployments

Grove

  • Unlocked multi-node support through Grove integration (#2269, #2405)
  • Provided workaround for component scaling when using Grove (#2531)

Inference Gateway

  • Launched Dynamo integration with API Gateway featuring EPP customization (#2345)

4. Advanced KV Cache Management & Transfer

KV Block Manager

  • First release of KV Block Manager (KVBM) with vLLM, supporting tiered storage across HBM (G1), host memory (G2), and local disk (G3) (#2258)

LMCache integration

  • Successfully integrated LMCache for improved cache efficiency (#2079)

5. Intelligent Planning & Routing

Router

  • Enabled router replicas with state-sharing for improved scalability (#2264)

Planner

  • Extended SLA Planner integration to support SGLang dense models (#2421)

6. Others

Multimodal model support

  • Shipped multimodal examples with vLLM v1 (#2040)
  • Added comprehensive Llava model deployment example with vLLM v1 (#2628)
  • Brought multimodal support to TRT-LLM backend (#2195)

Guided decoding

  • Implemented frontend support for Structured Output and Guided Decoding (#2380)

Frontend improvements

  • Added capability to serve multiple models from a single endpoint (#2418)
  • Introduced LLM metrics for non-streaming requests (#2427)

Bug fixes

  • Resolved metrics collection timeout issues (#2480, #2506)
  • Standardized component metric names to dynamo_component_* pattern, preventing Kubernetes label collisions (#2180)
  • Fixed runtime error propagation in endpoint.rs (#2156)
  • Corrected processor/router unit queuing behavior with NATS (#1787)
  • Added missing dependencies to SGLang runtime build (#2279)
  • Improved HuggingFace token handling in preprocessor tests (#2321)
  • Implemented detokenize stream functionality (#2413)

Documentation

  • Created comprehensive TRT-LLM deployment examples for Kubernetes (#2133)
  • Authored SGLang deployment guide (#2238)
  • Developed MetricsRegistry API guides (#2159, #2160)
  • Published guide for collecting and viewing Dynamo metrics in Kubernetes (#2271)
  • Released Dynamo Inference Gateway documentation (#2257, #2260)
  • Created SGLang hicache example and guide (#2388)

Build, CI, and Test

  • Implemented KV routing tests for SGLang (#2424)
  • Completed request migration end-to-end testing with vLLM (#2177)
  • Converted vLLM multimodal example to pytest framework (#2451)
  • Added ZMQ library support for TRT-LLM's UCX connection establishment (#2381)
  • Created unit tests for SLA planner's interpolator (#2505)

Migration Notes

Component metric names have been standardized to the dynamo_component_* pattern. Users monitoring these metrics should update their dashboards and alerting rules accordingly.

Looking Forward

This release sets the foundation for even more ambitious features in our H2 roadmap. The new KV cache management capabilities and multi-node support open doors for larger-scale Dynamo deployments, while our enhanced observability features ensure you can confidently run Dynamo in production.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Contributors

We welcome new contributors in this release:
@qimcis, @yinggeh, @da-x, @elyasmnvidian, @ryan-lempka, @JesseStutler, @nate-martinez, @suzusuzu

Full Changelog: v0.4.0...v0.4.1

Dynamo Release v0.4.0

12 Aug 06:29
73bcc3b

Dynamo 0.4.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Major Features and Improvements

Increasing Framework Support

  • vLLM Updates

    • Added E2E integration tests (#1935) and multimodal example with Llama4 Maverick (#1990)
    • Prefill-aware routing for improved performance (#1895)
    • Configurable namespace support for vLLM examples (#1909)
    • Routing via ApproxKvIndexer with use_kv_events flag (#1869)
    • Updated all vLLM examples to new UX (#1756)
  • SGLang Updates

    • Receive KV metrics from scheduler (#1789)
    • Disaggregated deployment examples (#2137)
    • Launch and deploy examples added (#2068)
  • TRT-LLM Updates

    • New/speculative decoding example: Llama-4 + Eagle-3 (#1828)
  • Routing Performance

    • Removed router hot-path lock for faster request handling (#1963)
    • Added radix tree dumps as router events (#2057)

UX Updates

  • Migration to New Python UX

    • Updated all Python launch flows to the new UX structure (#2003), including refactoring vLLM backend integration (#1983).
    • Removed outdated examples that relied on the old UX (#1899).
  • CLI and Packaging Enhancements

    • Added Python bindings for Dynamo CLI tools (#1799).
    • Updated Python packaging to align with the new UX (#2054).
    • Introduced a Python frontend/ingress node for easier deployment integration (#1912).
    • Added a convenience script to uninstall Dynamo Deploy CRDs (#1933).
  • Kubernetes Deployment UX

    • Enhanced Helm chart flexibility:
      • Added ability to override any podSpec property (#2116).
      • Enabled Helm upgrade via deploy script for smoother iteration (#1936).
      • Added Grove scheduling support to the graph Helm chart (#1954).
    • Introduced Kubernetes deployment examples for vLLM, SGLang, and TRT-LLM (#2062, #2133).
    • New Hello World Kubernetes deployment example (#1854).
  • Examples & Docs Overhaul

    • Hello World Python binding example (#2083).
    • Documentation updated for UX (#2070), reorganized example READMEs (#2174), and refactored core README structure (#2141).

Deployment, Kubernetes, and CLI

  • Helm and Graph Deployments

    • Liveness/readiness probes in graph Helm chart (#1888)
    • Added ability to override any podSpec property (#2116)
    • Support for Grove scheduling in Helm (#1954)
  • Planner and Profiling

    • Deploy SLA profiler and SLA planner to Kubernetes (#2030, #2135)

Performance and Observability

  • Structured Logging Improvements

    • Enhanced structured JSONL logs with span start/close events, trace ID/span ID injection, duration formatting in microseconds, and improved context capture for distributed tracing workflows (PR #2061).
  • Tokenizer & Runtime

    • De-tokenize performance improved by ~50% (#1868)
    • Runtime now uses all available parallelism (#1858)
  • Metrics

    • Hierarchical Prometheus metrics registry (#2008)
    • Generic ingress handler metrics (#2090)

Bug Fixes

  • Fixed GPU resource specifications in LLM deployments (#1812)
  • Corrected vLLM, SGLang, and TRTLLM deployment issues, including container builds, runtime packaging, and helm chart updates (#1942, #2062, #1825)
  • Addressed port conflicts, deterministic port assignments, and health check improvements (#1937, #1996)
  • Improved error handling for empty message lists and invalid configurations (#2067, #2071)
  • Fixed nil pointer dereference issues in the Dynamo controller (#2299, #2335)
  • Locked dependencies to avoid breaking changes (e.g., Triton 3.4.0 w/ TRT-LLM 1.0.0) (#2233)

Documentation

  • Guides and Examples

    • New hello world Python binding example (#2083)
    • Added multinode, disaggregated, and Grove deployment guides (#2155, #2086)
    • Added AKS/EKS deployment guides (#2080)
  • Docs Restructuring

    • Updated for new Python UX (#2070)
    • Refactored README and reorganized examples (#2141, #2174)

Build, CI, and Test

  • Added support for SGLang runtime image builds (#1770)
  • Optional TRTLLM dependency and custom build support (#2113)
  • New end-to-end router tests with mockers (#2073)
  • Fixed vLLM builds for Blackwell GPUs (#2020)

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Open Issues

  • The x86 TRT-LLM container image is not compatible out of the box with B200; the dev container still works for B200/GB200

Contributors

We welcome new contributors in this release:
@umang-kedia-hpe, @Ethan-ES, @messiaen, @galletas1712, @mc-nv, @zaristei, @jhaotingc, @saurabh-nvidia.

For the full list of changes, see the changelog.

Dynamo Release v0.3.2

18 Jul 05:21
50f3636

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Major Features and Improvements

Engine Support and Routing

  • Added an example standalone router for use outside of Dynamo (#1409).
  • The new SLA-based planner dynamically manages resource allocation based on service-level objectives (#1420).
  • Data-parallel vLLM worker setups are now supported (#1513).
  • SGLang support was extended for DeepEP deployments (#1120).
  • Clean shutdown is now available for vllm_v1 and SGLang engines (#1562, #1764).
  • Experimental support for WideEP with EPLB aggregation and disaggregation is now available for TRTLLM (#1652, #1690).
  • Approximate KV cache residency and predicted active KV blocks for improved routing efficiency (#1636, #1638, #1731).

Observability and Metrics

  • Native DCGM and Prometheus integration enables hardware metrics collection and export. Optional Grafana dashboards are provided (#1488, #1701, #1788).
  • New Grafana dashboards offer composite software and hardware system visibility (#1788).
  • Batch /completions endpoint and speculative decoding metrics are now supported for vLLM (#1626, #1549).

Deployment, Kubernetes, and CLI

  • The Kubernetes operator now supports custom entrypoints, command overrides, and simplified graph deployments (#1396, #1708, #1877, #1893).
  • Example manifests for multimodal and minimal deployments were added (#1836, #1872).
  • Graph Helm chart logic, resource requests, and health probes were improved (#1877, #1888).
  • Two new Helm charts are introduced in this release, dynamo-platform and dynamo-crds, enabling modular and robust Kubernetes deployments for a variety of topologies and operational requirements.
  • The dynamo-run command line interface now supports the --version flag and improved error handling and validation (#1596, #1674, #1623).
  • Docker and Kubernetes deployment workflows were streamlined. Helm charts and container images were improved (#1742, #1796, #1840, #1841).

Developer Experience

  • Embedding request handling was improved with frontend tokenization (#1494).
  • OpenAI API request validation is now available (#1674).
  • Batch embedding and parallel tokenization improve efficiency for batch inference and embedding (#1657).
  • The /responses endpoint and additional API features were added (#1694).

Bug Fixes

  • Issues related to GPU resource specifications in deployments, container builds, and runtime were fixed (#1826, #1792, #1546).
  • Helm chart logic, resource requests, and health probes were corrected (#1877, #1893).
  • Error handling and model loading were improved for multimodal and distributed deployments (#1545).
  • Metrics publishing and logging were fixed for vLLM, SGLang, and OpenAI endpoints (#1864, #1649, #1639).
  • Process cleanup issues were resolved in tests (#1801).

Documentation

  • Documentation updates include new guides for Ray setup, architecture diagrams, and deployment modes (#1947, #1697).
  • Benchmarking, troubleshooting, and advanced usage scenario documentation was enhanced.
  • Notes were added to deprecate outdated connectors (#1964, #1959).

Build, CI, and Test

  • Dependency upgrades include protobuf, nats, and etcd (#1876, #1744).
  • CI coverage now includes GPU-based and multi-engine tests.
  • Container builds now use distroless images for improved security and efficiency (#1570, #1569).
  • Added fault tolerance tests (#1444)

Known Issues

  • KVBM is supported only with Python 3.12.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:

Contributors

Thank you to all contributors for this release. For a full list, refer to the changelog.

Dynamo Release v0.3.1

01 Jul 17:59
e117295

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Dynamo v0.3.1 features:

  • Functional DeepSeek R1 disaggregated serving with wide EP using SGLang
  • Functional EPD disaggregation with video model (Llava video 7B)
  • Proof of concept inference gateway support
  • Prebuilt Dynamo + vLLM container
    • We plan to release these pre-built containers in the coming days
  • Amazon Linux support

Future plans
Dynamo Roadmap

Known Issues

  • KVBM is supported only with Python 3.12

What's Changed

🚀 Features & Improvements

🐛 Bug Fixes


Dynamo Release v0.3.0

05 Jun 20:51
15ca948

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Dynamo v0.3.0 features:

  • Dynamo run with KV routing and multiple model support! guide
  • Vllm v1 engine support! example
  • Sglang with DP attention! example
  • SLA based planner! guide
  • Optimized embedding transfer for multi-modal! example
  • Dynamo deploy update command! guide
  • Model caching using Fluid! guide
  • FluxCD guide to managing custom resources guide

Future plans
Dynamo Roadmap

Known Issues

  • KVBM is supported only with Python 3.12

What's Changed

🚀 Features & Improvements

🐛 Bug Fixes


Dynamo Release v0.2.1

22 May 23:45
b950ec5

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity. As a vendor-neutral serving framework, Dynamo supports multiple LLM inference engines, including TensorRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.

Dynamo v0.2.1 features:

  • KV Block Manager (intro)
  • Improved vLLM performance by avoiding re-initializing sampling params
  • SGLang support (README.md)
  • Multi-modal E/P/D disaggregation (README.md)
  • LeaderWorkerSet support on Kubernetes
  • Qwen3, Gemma3, and Llama4 support in Dynamo Run
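
The KV Block Manager's central job is placing KV blocks across memory tiers and demoting cold blocks to a slower tier when a faster one fills up. A minimal two-tier sketch of that policy (the tier names, LRU demotion, and class shape are illustrative assumptions about the general technique, not KVBM's actual implementation, which also spans SSD and object storage):

```python
from collections import OrderedDict

# Illustrative two-tier KV block store with LRU demotion from the fast
# tier ("gpu") to a slower tier ("host"). All names and the promotion
# policy here are assumptions, not KVBM's real interface.

class TieredBlockStore:
    def __init__(self, gpu_capacity):
        self.gpu = OrderedDict()   # block_id -> data, maintained in LRU order
        self.host = {}             # overflow tier
        self.gpu_capacity = gpu_capacity

    def put(self, block_id, data):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)     # refresh recency
            self.gpu[block_id] = data
            return
        if len(self.gpu) >= self.gpu_capacity:
            victim_id, victim = self.gpu.popitem(last=False)  # LRU victim
            self.host[victim_id] = victim                     # demote
        self.gpu[block_id] = data

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:              # promote back on reuse
            data = self.host.pop(block_id)
            self.put(block_id, data)
            return data
        return None

store = TieredBlockStore(gpu_capacity=2)
store.put("b0", b"kv0"); store.put("b1", b"kv1"); store.put("b2", b"kv2")
print("b0" in store.host)  # True: b0 was demoted to the host tier
```

Promotion on reuse matters as much as demotion: a block that comes back into the working set should migrate up the hierarchy rather than stay on the slow tier.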

Future plans

Dynamo Roadmap

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)

What's Changed

🚀 Features & Improvements

🐛 Bug Fixes

  • fix: Extract tokenizer from GGUF for Qwen3 and Gemma3 arch by @grahamking in #1011

Other Changes


Dynamo Release v0.2.0

01 May 00:33
ca728f6

Dynamo v0.2.0 features:

  • GB200 support with ARM builds (note: currently requires a container build)
  • Planner: new experimental support for spinning workers up and down based on load
  • Improved Kubernetes deployment workflow
    • Installation wizard to enable easy configuration of Dynamo on your Kubernetes cluster
    • CLI to manage your operator-based deployments
    • Consolidated Custom Resources for Dynamo deployments
    • Documentation improvements (including a Minikube guide for installing the Dynamo Platform)
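
Scaling workers up and down based on load, as the experimental Planner does, is at heart a thresholded control loop over a load signal. A minimal sketch of such a heuristic (the thresholds, the queue-depth metric, and the function name are illustrative assumptions, not the Planner's actual interface):

```python
# Illustrative load-based scaling heuristic: add a worker when the
# per-worker queue depth is high, remove one when it is low. The
# metric and thresholds are assumptions, not Dynamo Planner's API.

def plan_replicas(current, queued_requests, high=8, low=2,
                  min_workers=1, max_workers=16):
    per_worker = queued_requests / max(current, 1)
    if per_worker > high and current < max_workers:
        return current + 1          # scale up
    if per_worker < low and current > min_workers:
        return current - 1          # scale down
    return current                  # hold steady

print(plan_replicas(current=2, queued_requests=30))  # 3: 15 queued per worker
print(plan_replicas(current=4, queued_requests=2))   # 3: load has drained
```

The gap between the high and low thresholds provides hysteresis, preventing the controller from oscillating when load hovers near a single cut-off.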

Future plans

Dynamo Roadmap

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)
  • Benchmarks on internal clusters show a 15% degradation from the results displayed in the summary graphs for multi-node 70B; this is under investigation.
  • TensorRT-LLM examples are not working in this release but are being fixed in main.

What's Changed


Dynamo Release v0.1.1

16 Apr 20:44
926370b

Dynamo v0.1.1 features:

  • Benchmarking guides for single- and multi-node disaggregation on H100 (vLLM)
  • TensorRT-LLM support for KV-aware routing
  • TensorRT-LLM support for disaggregation
  • ManyLinux and Ubuntu 22.04 support for wheels and crates
  • Unified logging for Python and Rust

Future plans

  • Instructions for reproducing benchmark guides on GCP and AWS
  • KV Cache Manager as a standalone repository under the ai-dynamo organization, providing functionality for storing and evicting KV cache across multiple memory tiers: GPU, system memory, local SSD, and object storage
  • Searchable user guides and documentation
  • Multi-node instances for large models
  • An initial Planner version supporting dynamic scaling of prefill/decode (P/D) workers. This early release of the Dynamo Planner, another core component, will feature heuristic-based dynamic allocation of GPU workers between prefill and decode tasks, as well as model and fleet configuration adjustments based on user traffic patterns. Our vision is to evolve the Planner into a reinforcement learning platform that lets users define objectives, then tunes and optimizes performance policies automatically based on system feedback.
  • vLLM 1.0 support with NIXL and KV Cache Events
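
Allocating GPU workers between prefill and decode, as described above, can likewise be framed as a heuristic that compares each pool's pressure against its latency target and shifts capacity toward whichever pool is furthest over budget. A sketch under assumed metric names and targets (TTFT for prefill, ITL for decode; none of this is the shipped Planner interface):

```python
# Illustrative prefill/decode rebalancing: move one GPU toward the pool
# that is furthest over its latency target. Metric names and the target
# values are assumptions chosen for illustration.

def rebalance(prefill_gpus, decode_gpus, ttft_ms, itl_ms,
              ttft_target_ms=500.0, itl_target_ms=50.0):
    prefill_pressure = ttft_ms / ttft_target_ms   # >1 means over budget
    decode_pressure = itl_ms / itl_target_ms
    if prefill_pressure > decode_pressure and decode_gpus > 1:
        return prefill_gpus + 1, decode_gpus - 1  # shift a GPU to prefill
    if decode_pressure > prefill_pressure and prefill_gpus > 1:
        return prefill_gpus - 1, decode_gpus + 1  # shift a GPU to decode
    return prefill_gpus, decode_gpus

print(rebalance(2, 6, ttft_ms=1200, itl_ms=40))  # (3, 5): TTFT over budget
```

Normalizing each metric by its own target puts the two pressures on a common scale, so the same comparison works even though TTFT and ITL differ by an order of magnitude.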

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)
  • Benchmarks on internal clusters show a 15% degradation from the results displayed in the summary graphs for multi-node 70B; this is under investigation.

What's Changed
