
Releases: ai-dynamo/dynamo

Dynamo v0.5.1

14 Oct 02:02
3ecc1fb

Dynamo Release v0.5.1

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models at data-center scale. It's an open-source-first project under the Apache 2.0 license, built in Rust for performance and Python for extensibility. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release delivers major advances in KV routing capabilities with the new vLLM prefill router and commit router, comprehensive canary health checks across all backends, and significant tool calling enhancements. We strengthened production reliability with request cancellation support, improved Kubernetes deployment workflows, and expanded multinode capabilities. Lastly, we enhanced KVBM performance with vectorized memory transfers and tighter integration with TensorRT-LLM v1.1.0rc5.

Major Features and Improvements

1. Advanced KV Routing & Cache Management

KV Router

  • Introduced vLLM prefill router for optimized prefill phase handling (#3155)
  • Implemented KV commit router for improved cache consistency (#3024)
  • Added router benchmarking capabilities with mooncake-style testing (#3068, #2828)
  • Enabled router to optionally skip tracking active blocks during prefill and cached blocks during decode (#3135)
  • Router replicas with state-sharing for improved scalability (continued from v0.4.1)
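The router items above center on one idea: send a request to the worker that already holds most of its KV blocks, weighed against that worker's current load. A minimal sketch of that scoring, with hypothetical names and a simple load penalty (not Dynamo's actual API):

```python
# Illustrative KV-aware worker selection: score each worker by how many of
# the request's KV block hashes it already caches, minus a load penalty.

def select_worker(request_blocks, workers, load_weight=1.0):
    """workers: {worker_id: {"cached": set of block hashes, "active": int}}"""
    def score(wid):
        overlap = len(request_blocks & workers[wid]["cached"])
        return overlap - load_weight * workers[wid]["active"]
    return max(workers, key=score)

workers = {
    "w0": {"cached": {1, 2, 3}, "active": 4},  # more cache hits, but busy
    "w1": {"cached": {1, 2}, "active": 0},     # fewer hits, idle
}
# w1 wins: two cached blocks and no active load beats three blocks under load.
print(select_worker({1, 2, 3, 4}, workers))  # → w1
```

The skip-tracking option above (#3135) effectively changes which blocks feed into this kind of overlap calculation during prefill and decode.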

KVBM (KV Block Manager)

  • Implemented vectorized copy between pinned memory and device memory for improved transfer performance (#2989)
  • Enhanced KVBM transfer context v2 (#2873)
  • Added KV indexer metrics for better observability (#2905)
  • Updated integration with TensorRT-LLM v1.1.0rc5 connector API (#2979, #3119)
  • Improved error handling with early stop for missing CPU/disk configuration (#2997)

2. Enhanced Health Checks & Reliability

Canary Health Checks

  • Implemented canary health check framework (#2903)
  • Added TensorRT-LLM canary health check with BOS token support (#3082, #3145)
  • Deployed SGLang canary health check (#3103, #3123)
  • Enabled vLLM prefill-specific health check payload (#3126)

Request Management

  • Added request cancellation support for unary requests (#3004)
  • Enabled vLLM abort while engine generates next token (#3102)
  • Implemented router-level request rejection for better resource management
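Cancellation of a unary request follows the usual async pattern: the caller cancels an in-flight task and the worker frees its slot instead of finishing the generation. A generic asyncio sketch of that pattern (not Dynamo's runtime API):

```python
import asyncio

# Generic request-cancellation pattern: cancel an in-flight task and
# recover cleanly so the engine slot can be released immediately.

async def generate(prompt: str) -> str:
    await asyncio.sleep(10)  # stand-in for engine work
    return f"response to {prompt!r}"

async def main() -> str:
    task = asyncio.create_task(generate("hello"))
    await asyncio.sleep(0.01)
    task.cancel()            # e.g. the client disconnected
    try:
        return await task
    except asyncio.CancelledError:
        return "cancelled"   # free resources instead of completing

print(asyncio.run(main()))   # → cancelled
```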

3. Tool Calling & Reasoning Enhancements

  • Enabled tool calling with stream=True support (#2932)
  • Added Deepseek V3.1 tool parser with library refactoring (#2832)
  • Implemented Granite class reasoning parser (#2936)
  • Enhanced GPT-OSS frontend with Harmony tool calling and reasoning parsers (#2999)
  • Added finish reason tool_calls for non-streaming responses (#3087)
  • Fixed null tools processing via minijinja (#3340)
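With stream=True, tool calls arrive as OpenAI-style chunk deltas that the client must reassemble: an early delta carries the call id and function name, later deltas append argument fragments, and the stream finishes with finish_reason "tool_calls". A sketch of that reassembly, using hypothetical delta payloads:

```python
import json

# Reassemble a streamed tool call from OpenAI-style chunk deltas.
chunks = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '{"city": '}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": '"Paris"}'}}]},
]

calls = {}
for delta in chunks:
    for tc in delta["tool_calls"]:
        call = calls.setdefault(tc["index"], {"id": None, "name": None, "arguments": ""})
        if "id" in tc:
            call["id"] = tc["id"]
        fn = tc.get("function", {})
        if fn.get("name"):
            call["name"] = fn["name"]
        call["arguments"] += fn.get("arguments", "")

args = json.loads(calls[0]["arguments"])
print(calls[0]["name"], args["city"])  # → get_weather Paris
```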

4. Kubernetes & Deployment Improvements

Grove Integration

  • Updated to official Grove 0.1.0-alpha release (#3030)
  • Added planner manifest support for Grove (#3203)

Deployment Enhancements

  • Installed Dynamo operator cluster-wide by default (#3199)
  • Added multinode K8s examples for TensorRT-LLM and vLLM (#3100)
  • Enabled in-cluster performance benchmarks with kubectl one-liner (#3144)
  • Implemented namespace isolation for improved multi-tenancy (#2394, #2970)
  • Added virtual connector for 3rd party deployments (#2913)
  • Improved SGLang multinode handling in operator (#3151)

5. Observability & Metrics

  • Added HTTP queue metrics for NIM frontend request tracking (#2914)
  • Implemented NIM FE runtime config metrics with periodic polling (#3107)
  • Added metrics labels for multimodal workloads (#2835)
  • Implemented frontend disconnect metrics (#2953)
  • Unified component metric names to prevent Kubernetes label collisions (continued from v0.4.1)

6. Frontend & Model Support

  • Added support for serving multiple models from single endpoint (continued from v0.4.1)
  • Implemented --custom-jinja-template argument for custom chat templates (#2829)
  • Added chat_template_kwargs parameter to v1/chat/completions (#3016)
  • Enabled framework tokenization/detokenization (#3134)
  • Implemented ModelExpress Dynamo integration (#3191)
  • Added SLA Planner support for TensorRT-LLM (#2980) and SGLang MoE models (#3185)
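The new chat_template_kwargs parameter lets a request pass extra variables into the model's chat template. A sketch of a request body for the OpenAI-compatible endpoint; the model name, kwarg, and URL in the comment are illustrative, and the accepted keys depend on the model's own chat template:

```python
import json

# Example request body using chat_template_kwargs (values are illustrative).
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "chat_template_kwargs": {"enable_thinking": False},
    "stream": False,
}

body = json.dumps(payload)
# POST this body to the frontend's /v1/chat/completions endpoint, e.g.
# requests.post("http://localhost:8000/v1/chat/completions", data=body,
#               headers={"Content-Type": "application/json"})
print(json.loads(body)["chat_template_kwargs"])
```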

7. Performance & Optimization

  • Refactored discovery ModelManager to use parking_lot::RwLock (#2902)
  • Ported vLLM port allocator to Rust bindings for improved performance (#3125)
  • Implemented JailedStream for better resource management (#3034)
  • Added generic tensor type for inference (#2746)
  • Updated benchmarking and deployment utilities (#2933, #2973, #3098)

8. Bug Fixes

  • Fixed OpenAI-compliant usage stats for streaming responses (#3022)
  • Resolved token loss bug in final packet (#2985)
  • Fixed aggregate logprobs calculation (#2928)
  • Corrected Harmony parser streaming behavior (#3074)
  • Fixed router slot manager force expire requests (#2840)
  • Resolved metrics collection and namespace sanitization issues (#2868)
  • Fixed polling from exhausted stream in preprocessor (#3349)
  • Addressed KVBM fully contiguous memory region size bug (#3175)

Documentation

  • Revamped Kubernetes documentation (#3173)
  • Created deployment and benchmarking recipes for Llama3-70B and GPT-OSS-120B (#2792)
  • Added AWS ECS deployment example for Dynamo vLLM (#2415, #3381)
  • Published Python runtime request cancellation examples (#2893)
  • Added health check and structured logs documentation (#2805)
  • Created mermaid diagrams showcasing KV router features (#3184)
  • Updated consistent hashing documentation for KV events (#2981)
  • Published profiling-related documentation updates (#2816)
  • Fixed broken links and Sphinx structural errors (#3186, #3342)
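For context on the consistent-hashing item above: a hash ring maps each key to a stable owner, so events for the same KV block always land on the same consumer even as members come and go. A generic sketch (node names and virtual-node count are illustrative, not Dynamo's implementation):

```python
import bisect
import hashlib

# Generic consistent-hash ring: each key deterministically maps to one node,
# and adding/removing a node only remaps a small fraction of keys.

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Place vnodes points per node on the ring for smoother balance.
        self.points = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [p for p, _ in self.points]

    def owner(self, key: str) -> str:
        i = bisect.bisect(self.keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["router-0", "router-1", "router-2"])
# Events for the same KV block always hash to the same replica.
print(ring.owner("block:42") == ring.owner("block:42"))  # → True
```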

Build, CI, and Test

  • Restructured TensorRT-LLM and SGLang to follow container strategy structure (#3009, #2803)
  • Moved to ARC runners for CI (#2904)
  • Added SGLang functional tests (#2943)
  • Implemented fault injection tests for Kubernetes (#3194)
  • Added concurrency checks to auto-cancel running actions (#2438)
  • Created broken links checker (#2927)
  • Converted vLLM multimodal examples to pytest framework (continued from v0.4.1)
  • Updated TensorRT-LLM to v1.1.0rc5 (#3119)

Migration Notes

  • Component metric names continue to use the dynamo_component_* pattern. Ensure dashboards and alerting rules are updated accordingly.
  • The Dynamo operator now installs cluster-wide by default. If namespace-scoped installation is required, use the appropriate Helm values.
  • TensorRT-LLM has been updated to v1.1.0rc5, which includes KVBM integration changes. Review the updated connector API if using custom integrations.
  • The Multinode Multimodal Guide works only with release v0.5.0. Users requiring multinode multimodal functionality should continue using v0.5.0 until support is restored in a future release.

Looking Forward
This release strengthens Dynamo's production readiness with advanced KV routing, comprehensive health monitoring, and robust request management. The enhanced Kubernetes integration and multinode support enable seamless scaling for enterprise deployments. With improved observability and the new prefill router, teams can now optimize both throughput and latency for diverse workload patterns. These capabilities set the stage for even more sophisticated routing strategies and performance optimizations in future releases!

Release Assets
Python Wheels:

Containers:

  • TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1
  • vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1
  • SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1
  • Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.5.1

Helm Charts:

Contributors
We welcome new contributors in this release: @blarson-b10, @lixuwei2333, @GavinZhu-GMI, @nv-hwoo
Full Changelog: v0.5.0...v0.5.1

Dynamo Release v0.5.0

18 Sep 22:47
65f12d7

Dynamo 0.5.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details).

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release introduces TRT-LLM integration for KV cache management, gRPC support, and tool-calling capabilities. We also delivered major improvements to system reliability, including request cancellation and improved observability.


Major Features and Improvements

1. Fault Tolerance & Observability

  • Implemented end-to-end request cancellation (#2158, #2500) with Python context propagation
  • Implemented DRT shutdown on vLLM engine failures (#2698)
  • Added fast-fail validation for NATS JetStream requirements to prevent silent failures (#2590)
  • Unified metrics across all components with model labels for vLLM (#2474), TensorRT-LLM (#2666), and SGLang (#2679)
  • Standardized Prometheus metrics naming and sanitization with KvStats integration (#2733, #2704)
  • Added automatic uptime tracking and auto-start of metrics collection upon NATS service creation (#2587, #2664), improving observability readiness

2. Kubernetes Deployments

  • Integrated Grove and KAI scheduler into Dynamo Cloud Helm chart for multi-node deployments (#2755)
  • Implemented auto-injection of kai-scheduler annotations and labels with parent DGD Kubernetes name support (#2748, #2774)
  • Deployed Dynamo EPP-aware gateway with prevention of double-tokenization for optimized routing (#2633, #2559)
  • Integrated Model Express client for optimized model downloads with URL injection support (#2574, #2769)

3. KV Cache Management & Transfer

  • Integrated Dynamo KVBM connector API with TensorRT-LLM for G2-G3 offloading and onboarding (#2544)
  • Added support for user selection among multiple KV transfer connectors (nixl, kvbm, lmcache) (#2517)
  • Added detailed KV Block Manager metrics for match, offload, and onboard operations (#2626, #2673)

4. Planning & Routing

Router

  • Separated the frontend and Router via Python bindings for KvPushRouter, so the two can be scaled independently (#2658, #2548)
  • Implemented warm restarts via durable KV event consumers and radix snapshotting for router persistence (#2756, #2740, #2800)

Planner

  • Added comprehensive tests for replica calculation and planner scaling with automated Kubernetes deployment validation (#2525)
  • Added SLA planner dry-run mode with a CLI to simulate workloads, generate plots, and expose optional Prometheus metrics (#2557)

5. Others

Tool Calling

  • Introduced parsers library (#2542) supporting multiple reasoning and tool-calling formats.
  • Implemented multiple tool-calling parsers, including Pythonic (#2788), Harmony (#2796), and JSON-based parsers with normal text parsing alongside tool calls (#2709)
  • Added support for separating reasoning from visible text (#2555) along with GPT-OSS reasoning parser integration (#2656)
  • Added support for custom logits processors in the TensorRT-LLM backend, enabling in-place logits modification during generation (#2613, #2702)
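The parsers above all solve the same problem: splitting a model's raw output into visible text and structured tool calls. A toy version, assuming a hypothetical <tool_call>…</tool_call> delimiter (real models each use their own format, which is why the library ships per-model parsers):

```python
import json
import re

# Toy parser separating visible text from a JSON tool call.
# The <tool_call> tag format here is illustrative only.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def parse(output: str):
    tool_calls = [json.loads(m) for m in TOOL_RE.findall(output)]
    text = TOOL_RE.sub("", output).strip()
    return text, tool_calls

text, calls = parse(
    'Let me check that. '
    '<tool_call>{"name": "get_time", "arguments": {"tz": "UTC"}}</tool_call>'
)
print(text)              # → Let me check that.
print(calls[0]["name"])  # → get_time
```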

Multimodal Support Expansion

  • Added complete multimodal deployment examples for Llava and Qwen, with video support using vLLM v1 (#2628, #2694, #2738)
  • Added Encode Worker and NIXL support for TensorRT-LLM multimodal disaggregated flows (#2452)

Infrastructure & Performance

  • Added comprehensive KServe gRPC support for industry-standard model inference protocol (#2638)
  • Enhanced Hugging Face integration with HF_HOME and HF_ENDPOINT environment variable support (#2642, #2637)

Developer Experience

  • Added Devcontainer improvements with enhanced documentation and SGLang-specific setup (#2255, #2578, #2741)
  • Added logging setup for Kubernetes with Loki integration and Grafana dashboards (#2699)
  • Added benchmarking guide with GenAI-Perf integration and automated performance comparison (#2620)
  • Updated TensorRT-LLM to 1.0.0rc6 and simplified Eagle model configuration (#2606, #2661)

Bug Fixes

  • Improved Hugging Face download speeds with better API client configuration (#2566)
  • Added missing Prometheus to runtime images for SGLang and general runtime (#2565, #2689)
  • Fixed kv-event-config command line respect and environment variable overrides (#2627, #2640)
  • Enhanced pytest robustness and parsing errors with proper timeout handling (#2676, #2572)
  • Resolved metrics registration timing issues and prevented early returns from affecting measurements (#2664, #2576)

Documentation

  • Created SNS aggregated Kubernetes example and simplified the Sphinx build process (#2773, #2519)
  • Streamlined cloud installation documentation and deployment guides (#2818)
  • Updated benchmarking framework documentation with comprehensive deployment guides (#2620)
  • Updated supported models documentation for multimodal and SGLang container build instructions (#2651, #2707)

Build, CI, and Test

  • Added replica calculation and planner scaling tests with automated Kubernetes deployment validation (#2525)
  • Added vLLM sanity testing support on GitHub Actions with build optimizations (#2526)
  • Optimized CI job execution for docs-only changes and Rust-specific changes (#2775)
  • Enabled KVBM in vLLM container with improved virtual environment handling (#2763)
  • Enhanced test reliability with proper KVBM test exclusions and determinism testing (#2611)
  • Fixed concurrency settings to prevent main branch run cancellations (#2780)
  • Improved container build process with default dev builds for vLLM (#2837)

Migration Notes

  • Parser Integration: New parsing capabilities require updated CLI flags for reasoning and tool calling features
  • Container Updates: Runtime images now include Prometheus by default - review monitoring configurations

Looking Forward

This release sets the stage for more features in our H2 roadmap, including KVBM performance benchmarking, end-to-end performance work, and improved fault tolerance with request rejection at every level. We will focus on significantly updating documentation and examples for a better experience, and will include Kubernetes benchmark scripts for the most popular models.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Contributors

We welcome new contributors in this release:
@jasonqinzhou, @michaelfeil, @ahinsutime, @bhuvan002, @WaelBKZ, @hhk7734, @Michaelgathara, @KavinKrishnan, @michaelshin

Full Changelog: v0.4.1...v0.5.0

Dynamo Release v0.4.1

28 Aug 00:33
9f68e83

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details)

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release brings substantial performance improvements for Deepseek R1, improved fault tolerance capabilities with high availability router testing, and groundbreaking KV cache management features. We've also significantly enhanced our Kubernetes deployment story with Grove integration and the new Inference Gateway, while expanding multimodal support across multiple backends.


Major Features and Improvements

1. Model Performance Breakthroughs

  • Achieved significant DeepSeek R1 wide-EP performance gains with both SGLang (#2223) and TRT-LLM (#2387)
  • Added TRT-LLM support for variable sliding window attention (VSWA) for Gemma3 models (#2134)
  • Launched Day-0 support and a deployment guide for GPT-OSS 120B on Blackwell GPUs (#2297)

2. Fault Tolerance & Observability Improvements

  • Introduced testing for multiple KV routers and frontends for high availability (#2324)
  • Completed end-to-end request migration testing with vLLM (#2177), ensuring seamless failover
  • Added router-level request rejection (#2465) for better resource management under load
  • Unified NATS, DRT & component metrics (#2292) for comprehensive system monitoring
  • Made health checks more flexible with parameterized /health and /live endpoints (#2230)

3. Enhanced Kubernetes Deployments

Grove

  • Unlocked multi-node support through Grove integration (#2269, #2405)
  • Provided workaround for component scaling when using Grove (#2531)

Inference Gateway

  • Launched Dynamo integration with API Gateway featuring EPP customization (#2345)

4. Advanced KV Cache Management & Transfer

KV Block Manager

  • First release of KV Block Manager (KVBM) with vLLM, supporting tiered storage across HBM (G1), host memory (G2), and local disk (G3) (#2258)

LMCache integration

  • Successfully integrated LMCache for improved cache efficiency (#2079)

5. Intelligent Planning & Routing

Router

  • Enabled router replicas with state-sharing for improved scalability (#2264)

Planner

  • Extended SLA Planner integration to support SGLang dense models (#2421)

6. Others

Multimodal model support

  • Shipped multimodal examples with vLLM v1 (#2040)
  • Added comprehensive Llava model deployment example with vLLM v1 (#2628)
  • Brought multimodal support to TRT-LLM backend (#2195)

Guided decoding

  • Implemented frontend support for Structured Output and Guided Decoding (#2380)

Frontend improvements

  • Added capability to serve multiple models from a single endpoint (#2418)
  • Introduced LLM metrics for non-streaming requests (#2427)

Bug fixes

  • Resolved metrics collection timeout issues (#2480, #2506)
  • Standardized component metric names to dynamo_component_* pattern, preventing Kubernetes label collisions (#2180)
  • Fixed runtime error propagation in endpoint.rs (#2156)
  • Corrected processor/router unit queuing behavior with NATS (#1787)
  • Added missing dependencies to SGLang runtime build (#2279)
  • Improved HuggingFace token handling in preprocessor tests (#2321)
  • Implemented detokenize stream functionality (#2413)

Documentation

  • Created comprehensive TRT-LLM deployment examples for Kubernetes (#2133)
  • Authored SGLang deployment guide (#2238)
  • Developed MetricsRegistry API guides (#2159, #2160)
  • Published guide for collecting and viewing Dynamo metrics in Kubernetes (#2271)
  • Released Dynamo Inference Gateway documentation (#2257, #2260)
  • Created SGLang hicache example and guide (#2388)

Build, CI, and Test

  • Implemented KV routing tests for SGLang (#2424)
  • Completed request migration end-to-end testing with vLLM (#2177)
  • Converted vLLM multimodal example to pytest framework (#2451)
  • Added ZMQ library support for TRT-LLM's UCX connection establishment (#2381)
  • Created unit tests for SLA planner's interpolator (#2505)

Migration Notes

Component metric names have been standardized to the dynamo_component_* pattern. Users monitoring these metrics should update their dashboards and alerting rules accordingly.

Looking Forward

This release sets the foundation for even more ambitious features in our H2 roadmap. The new KV cache management capabilities and multi-node support open doors for larger-scale Dynamo deployments, while our enhanced observability features ensure you can confidently run Dynamo in production.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Contributors

We welcome new contributors in this release:
@qimcis, @yinggeh, @da-x, @elyasmnvidian, @ryan-lempka, @JesseStutler, @nate-martinez, @suzusuzu

Full Changelog: v0.4.0...v0.4.1

Dynamo Release v0.4.0

12 Aug 06:29
73bcc3b

Dynamo 0.4.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Major Features and Improvements

Increasing Framework Support

  • vLLM Updates

    • Added E2E integration tests (#1935) and multimodal example with Llama4 Maverick (#1990)
    • Prefill-aware routing for improved performance (#1895)
    • Configurable namespace support for vLLM examples (#1909)
    • Routing via ApproxKvIndexer with use_kv_events flag (#1869)
    • Updated all vLLM examples to new UX (#1756)
  • SGLang Updates

    • Receive KV metrics from scheduler (#1789)
    • Disaggregated deployment examples (#2137)
    • Launch and deploy examples added (#2068)
  • TRT-LLM Updates

    • New/speculative decoding example: Llama-4 + Eagle-3 (#1828)
  • Routing Performance

    • Removed router hot-path lock for faster request handling (#1963)
    • Added radix tree dumps as router events (#2057)

UX Updates

  • Migration to New Python UX

    • Updated all Python launch flows to the new UX structure (#2003), including refactoring vLLM backend integration (#1983).
    • Removed outdated examples that relied on the old UX (#1899).
  • CLI and Packaging Enhancements

    • Added Python bindings for Dynamo CLI tools (#1799).
    • Updated Python packaging to align with the new UX (#2054).
    • Introduced a Python frontend/ingress node for easier deployment integration (#1912).
    • Added a convenience script to uninstall Dynamo Deploy CRDs (#1933).
  • Kubernetes Deployment UX

    • Enhanced Helm chart flexibility:
      • Added ability to override any podSpec property (#2116).
      • Enabled Helm upgrade via deploy script for smoother iteration (#1936).
      • Added Grove scheduling support to the graph Helm chart (#1954).
    • Introduced Kubernetes deployment examples for vLLM, SGLang, and TRT-LLM (#2062, #2133).
    • New Hello World Kubernetes deployment example (#1854).
  • Examples & Docs Overhaul

    • Hello World Python binding example (#2083).
    • Documentation updated for UX (#2070), reorganized example READMEs (#2174), and refactored core README structure (#2141).

Deployment, Kubernetes, and CLI

  • Helm and Graph Deployments

    • Liveness/readiness probes in graph Helm chart (#1888)
    • Added ability to override any podSpec property (#2116)
    • Support for Grove scheduling in Helm (#1954)
  • Planner and Profiling

    • Deploy SLA profiler and SLA planner to Kubernetes (#2030, #2135)

Performance and Observability

  • Structured Logging Improvements

    • Enhanced structured JSONL logs with span start/close events, trace ID/span ID injection, duration formatting in microseconds, and improved context capture for distributed tracing workflows (PR #2061).
  • Tokenizer & Runtime

    • De-tokenize performance improved by ~50% (#1868)
    • Runtime now uses all available parallelism (#1858)
  • Metrics

    • Hierarchical Prometheus metrics registry (#2008)
    • Generic ingress handler metrics (#2090)

Bug Fixes

  • Fixed GPU resource specifications in LLM deployments (#1812)
  • Corrected vLLM, SGLang, and TRTLLM deployment issues, including container builds, runtime packaging, and helm chart updates (#1942, #2062, #1825)
  • Addressed port conflicts, deterministic port assignments, and health check improvements (#1937, #1996)
  • Improved error handling for empty message lists and invalid configurations (#2067, #2071)
  • Fixed nil pointer dereference issues in the Dynamo controller (#2299, #2335)
  • Locked dependencies to avoid breaking changes (e.g., Triton 3.4.0 w/ TRT-LLM 1.0.0) (#2233)

Documentation

  • Guides and Examples

    • New hello world Python binding example (#2083)
    • Added multinode, disaggregated, and Grove deployment guides (#2155, #2086)
    • Added AKS/EKS deployment guides (#2080)
  • Docs Restructuring

    • Updated for new Python UX (#2070)
    • Refactored README and reorganized examples (#2141, #2174)

Build, CI, and Test

  • Added support for SGLang runtime image builds (#1770)
  • Optional TRTLLM dependency and custom build support (#2113)
  • New end-to-end router tests with mockers (#2073)
  • Fixed vLLM builds for Blackwell GPUs (#2020)

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Open Issues

  • The x86 TRT-LLM container image is not compatible out of the box with B200; the dev container still works for B200/GB200

Contributors

We welcome new contributors in this release:
@umang-kedia-hpe, @Ethan-ES, @messiaen, @galletas1712, @mc-nv, @zaristei, @jhaotingc, @saurabh-nvidia.

For the full list of changes, see the changelog.

Dynamo Release v0.3.2

18 Jul 05:21
50f3636

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Major Features and Improvements

Engine Support and Routing

  • Added an example standalone router for use outside of Dynamo (#1409).
  • The new SLA-based planner dynamically manages resource allocation based on service-level objectives (#1420).
  • Data-parallel vLLM worker setups are now supported (#1513).
  • SGLang support was extended for DeepEP deployments (#1120).
  • Clean shutdown is now available for vllm_v1 and SGLang engines (#1562, #1764).
  • Experimental support for WideEP with EPLB aggregation and disaggregation is now available for TRTLLM (#1652, #1690).
  • Approximate KV cache residency and predicted active KV blocks for improved routing efficiency (#1636, #1638, #1731).

Observability and Metrics

  • Native DCGM and Prometheus integration enables hardware metrics collection and export. Optional Grafana dashboards are provided (#1488, #1701, #1788).
  • New Grafana dashboards offer composite software and hardware system visibility (#1788).
  • Batch /completions endpoint and speculative decoding metrics are now supported for vLLM (#1626, #1549).

Deployment, Kubernetes, and CLI

  • The Kubernetes operator now supports custom entrypoints, command overrides, and simplified graph deployments (#1396, #1708, #1877, #1893).
  • Example manifests for multimodal and minimal deployments were added (#1836, #1872).
  • Graph Helm chart logic, resource requests, and health probes were improved (#1877, #1888).
  • Two new Helm charts are introduced in this release, dynamo-platform and dynamo-crds, enabling modular and robust Kubernetes deployments for a variety of topologies and operational requirements.
  • The dynamo-run command line interface now supports the --version flag and improved error handling and validation (#1596, #1674, #1623).
  • Docker and Kubernetes deployment workflows were streamlined. Helm charts and container images were improved (#1742, #1796, #1840, #1841).

Developer Experience

  • Embedding request handling was improved with frontend tokenization (#1494).
  • OpenAI API request validation is now available (#1674).
  • Batch embedding and parallel tokenization improve efficiency for batch inference and embedding (#1657).
  • The /responses endpoint and additional API features were added (#1694).

Bug Fixes

  • Issues related to GPU resource specifications in deployments, container builds, and runtime were fixed (#1826, #1792, #1546).
  • Helm chart logic, resource requests, and health probes were corrected (#1877, #1893).
  • Error handling and model loading were improved for multimodal and distributed deployments (#1545).
  • Metrics publishing and logging were fixed for vLLM, SGLang, and OpenAI endpoints (#1864, #1649, #1639).
  • Process cleanup issues were resolved in tests (#1801).

Documentation

  • Documentation updates include new guides for Ray setup, architecture diagrams, and deployment modes (#1947, #1697).
  • Benchmarking, troubleshooting, and advanced usage scenario documentation was enhanced.
  • Notes were added to deprecate outdated connectors (#1964, #1959).

Build, CI, and Test

  • Dependency upgrades include protobuf, nats, and etcd (#1876, #1744).
  • CI coverage now includes GPU-based and multi-engine tests.
  • Container builds now use distroless images for improved security and efficiency (#1570, #1569).
  • Added fault tolerance tests (#1444)

Known Issues

  • KVBM is supported only with Python 3.12.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:

Contributors

Thank you to all contributors for this release. For a full list, refer to the changelog.

Dynamo Release v0.3.1

01 Jul 17:59
e117295

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Dynamo v0.3.1 features:

  • Functional DeepSeek R1 disaggregated serving with wide EP using SGLang
  • Functional EPD disaggregation with video model (Llava video 7B)
  • Proof of concept inference gateway support
  • Prebuilt Dynamo + vLLM container
    • We plan to release these pre-built containers in the coming days
  • Amazon Linux support

Future plans
Dynamo Roadmap

Known Issues

  • KVBM is supported only with Python 3.12

What's Changed

🚀 Features & Improvements

🐛 Bug Fixes


Dynamo Release v0.3.0

05 Jun 20:51
15ca948

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity.

As a vendor-neutral serving framework, Dynamo supports multiple large language model (LLM) inference engines to varying degrees:

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Dynamo v0.3.0 features:

  • Dynamo run with KV routing and multiple model support! guide
  • Vllm v1 engine support! example
  • Sglang with DP attention! example
  • SLA based planner! guide
  • Optimized embedding transfer for multi-modal! example
  • Dynamo deploy update command! guide
  • Model caching using Fluid! guide
  • FluxCD guide to managing custom resources guide

Future plans
Dynamo Roadmap

Known Issues

  • KVBM is supported only with Python 3.12

What's Changed

🚀 Features & Improvements

🐛 Bug Fixes


Dynamo Release v0.2.1

22 May 23:45
b950ec5

Dynamo is an open source project under the Apache 2.0 license. The primary distribution is done through pip wheels with minimal binary size. The ai-dynamo GitHub organization hosts two repositories: Dynamo and NIXL. Dynamo is designed as the next-generation inference server, building upon the foundation of NVIDIA® Triton Inference Server™. While Triton focuses on single-node inference deployments, we're integrating its robust capabilities into Dynamo over the next several months. We'll maintain support for Triton while providing a clear migration path for existing users once Dynamo achieves feature parity. As a vendor-neutral serving framework, Dynamo supports multiple LLM inference engines, including TensorRT-LLM, vLLM, and SGLang, with varying degrees of maturity and support.

Dynamo v0.2.1 features:

  • KV Block Manager (intro)
  • Improved vLLM performance by avoiding re-initializing sampling params
  • SGLang support (README.md)
  • Multi-modal E/P/D disaggregation (README.md)
  • LeaderWorkerSet support on Kubernetes
  • Qwen3, Gemma3, and Llama4 support in Dynamo Run
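
The KV Block Manager's central job is placing KV blocks across memory tiers and demoting cold blocks to a slower tier when a faster one fills up. A minimal two-tier sketch of that policy (the tier names, LRU demotion, and class shape are illustrative assumptions about the general technique, not KVBM's actual implementation, which also spans SSD and object storage):

```python
from collections import OrderedDict

# Illustrative two-tier KV block store with LRU demotion from the fast
# tier ("gpu") to a slower tier ("host"). All names and the promotion
# policy here are assumptions, not KVBM's real interface.

class TieredBlockStore:
    def __init__(self, gpu_capacity):
        self.gpu = OrderedDict()   # block_id -> data, maintained in LRU order
        self.host = {}             # overflow tier
        self.gpu_capacity = gpu_capacity

    def put(self, block_id, data):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)     # refresh recency
            self.gpu[block_id] = data
            return
        if len(self.gpu) >= self.gpu_capacity:
            victim_id, victim = self.gpu.popitem(last=False)  # LRU victim
            self.host[victim_id] = victim                     # demote
        self.gpu[block_id] = data

    def get(self, block_id):
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        if block_id in self.host:              # promote back on reuse
            data = self.host.pop(block_id)
            self.put(block_id, data)
            return data
        return None

store = TieredBlockStore(gpu_capacity=2)
store.put("b0", b"kv0"); store.put("b1", b"kv1"); store.put("b2", b"kv2")
print("b0" in store.host)  # True: b0 was demoted to the host tier
```

Promotion on reuse matters as much as demotion: a block that comes back into the working set should migrate up the hierarchy rather than stay on the slow tier.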

Future plans

Dynamo Roadmap

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)

What's Changed

🚀 Features & Improvements

🐛 Bug Fixes

  • fix: Extract tokenizer from GGUF for Qwen3 and Gemma3 arch by @grahamking in #1011

Other Changes


Dynamo Release v0.2.0

01 May 00:33
ca728f6

Dynamo v0.2.0 features:

  • GB200 support with ARM builds (note: currently requires a container build)
  • Planner: new experimental support for spinning workers up and down based on load
  • Improved Kubernetes deployment workflow
    • Installation wizard to enable easy configuration of Dynamo on your Kubernetes cluster
    • CLI to manage your operator-based deployments
    • Consolidated Custom Resources for Dynamo deployments
    • Documentation improvements (including a Minikube guide for installing the Dynamo Platform)
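
Scaling workers up and down based on load, as the experimental Planner does, is at heart a thresholded control loop over a load signal. A minimal sketch of such a heuristic (the thresholds, the queue-depth metric, and the function name are illustrative assumptions, not the Planner's actual interface):

```python
# Illustrative load-based scaling heuristic: add a worker when the
# per-worker queue depth is high, remove one when it is low. The
# metric and thresholds are assumptions, not Dynamo Planner's API.

def plan_replicas(current, queued_requests, high=8, low=2,
                  min_workers=1, max_workers=16):
    per_worker = queued_requests / max(current, 1)
    if per_worker > high and current < max_workers:
        return current + 1          # scale up
    if per_worker < low and current > min_workers:
        return current - 1          # scale down
    return current                  # hold steady

print(plan_replicas(current=2, queued_requests=30))  # 3: 15 queued per worker
print(plan_replicas(current=4, queued_requests=2))   # 3: load has drained
```

The gap between the high and low thresholds provides hysteresis, preventing the controller from oscillating when load hovers near a single cut-off.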

Future plans

Dynamo Roadmap

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)
  • Benchmarks on internal clusters show a 15% degradation from the results displayed in the summary graphs for multi-node 70B; this is under investigation.
  • TensorRT-LLM examples are not working in this release but are being fixed in main.

What's Changed


Dynamo Release v0.1.1

16 Apr 20:44
926370b

Dynamo v0.1.1 features:

  • Benchmarking guides for single- and multi-node disaggregation on H100 (vLLM)
  • TensorRT-LLM support for KV-aware routing
  • TensorRT-LLM support for disaggregation
  • ManyLinux and Ubuntu 22.04 support for wheels and crates
  • Unified logging for Python and Rust

Future plans

  • Instructions for reproducing benchmark guides on GCP and AWS
  • KV Cache Manager as a standalone repository under the ai-dynamo organization, providing functionality for storing and evicting KV cache across multiple memory tiers: GPU, system memory, local SSD, and object storage
  • Searchable user guides and documentation
  • Multi-node instances for large models
  • An initial Planner version supporting dynamic scaling of prefill/decode (P/D) workers. This early release of the Dynamo Planner, another core component, will feature heuristic-based dynamic allocation of GPU workers between prefill and decode tasks, as well as model and fleet configuration adjustments based on user traffic patterns. Our vision is to evolve the Planner into a reinforcement learning platform that lets users define objectives, then tunes and optimizes performance policies automatically based on system feedback.
  • vLLM 1.0 support with NIXL and KV Cache Events
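
Allocating GPU workers between prefill and decode, as described above, can likewise be framed as a heuristic that compares each pool's pressure against its latency target and shifts capacity toward whichever pool is furthest over budget. A sketch under assumed metric names and targets (TTFT for prefill, ITL for decode; none of this is the shipped Planner interface):

```python
# Illustrative prefill/decode rebalancing: move one GPU toward the pool
# that is furthest over its latency target. Metric names and the target
# values are assumptions chosen for illustration.

def rebalance(prefill_gpus, decode_gpus, ttft_ms, itl_ms,
              ttft_target_ms=500.0, itl_target_ms=50.0):
    prefill_pressure = ttft_ms / ttft_target_ms   # >1 means over budget
    decode_pressure = itl_ms / itl_target_ms
    if prefill_pressure > decode_pressure and decode_gpus > 1:
        return prefill_gpus + 1, decode_gpus - 1  # shift a GPU to prefill
    if decode_pressure > prefill_pressure and prefill_gpus > 1:
        return prefill_gpus - 1, decode_gpus + 1  # shift a GPU to decode
    return prefill_gpus, decode_gpus

print(rebalance(2, 6, ttft_ms=1200, itl_ms=40))  # (3, 5): TTFT over budget
```

Normalizing each metric by its own target puts the two pressures on a common scale, so the same comparison works even though TTFT and ITL differ by an order of magnitude.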

Known Issues

  • Benchmark guides are still being validated on public cloud instances (GCP / AWS)
  • Benchmarks on internal clusters show a 15% degradation from the results displayed in the summary graphs for multi-node 70B; this is under investigation.

What's Changed
