example demonstrating how to train CosmosReason2 Eagle3 #920

Open
skierat wants to merge 3 commits into NVIDIA:main from skierat:skierat/CR2_BYOC

Conversation

@skierat

@skierat skierat commented Feb 23, 2026

What does this PR do?

Type of change: New example

Overview:
Adds examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb, a step-by-step Jupyter notebook that walks through the full EAGLE3 draft-head training workflow for nvidia/Cosmos-Reason2-8B. The notebook covers:

  1. Installing dependencies
  2. Authenticating with Hugging Face
  3. Preparing training data from the Nemotron-Post-Training-Dataset-v2 (chat split) using a curated row-selection mapping (guides/nemotron_mapping.csv)
  4. Inspecting the bundled EAGLE3 config (guides/CR2_eagle_config.json) tuned for Cosmos-Reason2 (YaRN RoPE, FlexAttention, reduced draft vocabulary)
  5. (Optional) Calibrating the draft vocabulary to 32k tokens for faster training and inference
  6. Launching training via launch_train.sh with FSDP2 multi-GPU support
  7. Exporting the checkpoint to HF format and serving with vLLM

Also includes guides/nemotron_mapping.csv and guides/CR2_eagle_config.json as companion files.
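For reference, the row-to-conversation conversion in the data-preparation step can be sketched roughly as follows. This is an illustrative assumption, not the exact implementation in add_nemotron_chat.py: the input field names (`messages`, `role`, `content`) and the output schema are hypothetical.

```python
import hashlib
import json

def to_standard_conversation(row: dict) -> dict:
    """Convert a raw chat row into a standardized conversation record."""
    turns = [
        {"role": msg["role"], "content": msg["content"]}
        for msg in row["messages"]
    ]
    # Deterministic unique ID derived from the conversation content,
    # so re-running the prep step produces stable IDs.
    digest = hashlib.sha256(
        json.dumps(turns, sort_keys=True).encode("utf-8")
    ).hexdigest()[:16]
    return {"conversation_id": digest, "conversations": turns}
```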

Usage

Open and run examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb cell by cell. After training, serve the exported checkpoint with:

vllm serve nvidia/Cosmos-Reason2-8B \
    --host 0.0.0.0 \
    --port 8000 \
    --speculative-model export/cosmos-reason2-8b-eagle3 \
    --num-speculative-tokens 3 \
    --dtype bfloat16
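Once the server is up, it can be queried via vLLM's OpenAI-compatible chat completions endpoint. The sketch below only builds the request (host and port mirror the flags above); uncomment the last lines against a running server.

```python
import json
import urllib.request

# Build (but do not send) a chat-completions request against the server above.
payload = {
    "model": "nvidia/Cosmos-Reason2-8B",
    "messages": [
        {"role": "user", "content": "Explain speculative decoding in one sentence."}
    ],
    "max_tokens": 128,
}
req = urllib.request.Request(
    "http://0.0.0.0:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment once the server is running
# print(json.load(response)["choices"][0]["message"]["content"])
```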

Testing

Tested end-to-end on 4x B100 GPUs. The exported checkpoint was validated with specdec_bench.

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: Yes (the notebook is self-documenting)
  • Did you update Changelog?: No

Additional Information

Cosmos-Reason2-8B requires at least one 80 GB GPU (H100/A100).

Summary by CodeRabbit

Release Notes

  • New Features
    • Added comprehensive training guide and configuration for speculative decoding on Cosmos-Reason2-8B models
    • Introduced dataset preparation utility to convert Nemotron chat conversations into standardized format
    • Included example configuration and training instructions for EAGLE3 draft model acceleration

@skierat skierat requested a review from a team as a code owner February 23, 2026 15:16
@skierat skierat requested a review from h-guo18 February 23, 2026 15:16
@copy-pr-bot

copy-pr-bot bot commented Feb 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

Introduces a complete training workflow for EAGLE3 speculative decoding, including a configuration file, Jupyter notebook documenting the training process, and a data preparation utility for ingesting Nemotron chat conversations into a standardized format.

Changes

  • Configuration & Training Guide (examples/speculative_decoding/guides/CR2_eagle_config.json, examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb): New Eagle model configuration with hyperparameters (vocab size, RoPE scaling, attention implementation) and a comprehensive Jupyter notebook documenting the end-to-end training workflow, including dependencies, data preparation, model configuration inspection, calibration, training execution via accelerate/FSDP, and checkpoint export.
  • Data Preparation Utility (examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py): New script to ingest Nemotron-Post-Training-Dataset-v2 chat data with dual access modes (mapping-based and streaming), conversation parsing with role/content field handling, unique conversation ID generation, and a CLI with configurable output options.

Sequence Diagram

sequenceDiagram
    actor User
    participant Notebook as Training Notebook
    participant DataPrep as add_nemotron_chat.py
    participant HF as Hugging Face Hub
    participant Training as launch_train.sh
    participant Model as Eagle Model

    User->>Notebook: Execute training workflow
    Notebook->>DataPrep: Prepare input conversations
    DataPrep->>HF: Resolve Nemotron dataset URLs
    HF-->>DataPrep: Return parquet shard URLs
    DataPrep->>DataPrep: Parse conversations (mapping/streaming mode)
    DataPrep-->>Notebook: Output standardized conversations
    Notebook->>Notebook: Inspect CR2_eagle_config.json
    Notebook->>Notebook: Optional: calibrate draft vocabulary
    Notebook->>Training: Launch training with accelerate/FSDP
    Training->>Model: Train Eagle head on Cosmos-Reason2-8B
    Model-->>Training: Trained checkpoint
    Training-->>Notebook: Export to HF-compatible format
    Notebook-->>User: Training complete, ready for deployment

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main change: adding an example demonstrating EAGLE3 training on the Cosmos-Reason2 model, which aligns with all substantive changes in the PR.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (3)
examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb (2)

90-96: os.chdir("..") is fragile and depends on the Jupyter kernel's working directory.

This assumes the notebook is opened/run with cwd = guides/. If a user opens Jupyter from the speculative_decoding/ root (the common default), os.chdir("..") navigates one level above the intended directory, breaking every subsequent relative path (prepare_input_conversations/add_nemotron_chat.py, guides/CR2_eagle_config.json, etc.).

Consider anchoring relative to the notebook file itself:

♻️ Proposed fix
-import os
-
-# Run from the speculative_decoding root so that relative paths resolve correctly.
-os.chdir("..")
-print("Working directory:", os.getcwd())
+import os
+from pathlib import Path
+
+# Navigate to the speculative_decoding root regardless of where Jupyter was launched.
+os.chdir(Path(__file__).parent.parent if "__file__" in dir() else Path.cwd().parent)
+print("Working directory:", os.getcwd())

Or more robustly, use an absolute path derived from globals()["_dh"][0] (the notebook's directory as set by Jupyter):

import os
from pathlib import Path

notebook_dir = Path(globals().get("_dh", [Path.cwd()])[0])
os.chdir(notebook_dir.parent)
print("Working directory:", os.getcwd())
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 90 - 96, The current os.chdir("..") is fragile; replace it by
deriving the notebook directory from Jupyter’s runtime state and changing to its
parent: obtain notebook_dir = Path(globals().get("_dh", [Path.cwd()])[0]) (or
equivalent using globals()["_dh"][0] with a safe fallback), then call
os.chdir(notebook_dir.parent) and keep the existing os.getcwd() print; update
imports to include pathlib.Path if not present and remove the hardcoded
os.chdir("..") call (refer to os.chdir, os.getcwd, globals(), and _dh in the
notebook).

162-166: EAGLE_CONFIG is defined twice with the same value.

Lines 163–165 (Step 4) and line 236 (Step 6 setup cell) both assign EAGLE_CONFIG = "guides/CR2_eagle_config.json". The second definition is harmless but creates an implicit assumption that users must run Step 4 first, and it makes the notebook slightly harder to follow. The Step 4 definition is sufficient; the Step 6 cell comment makes it clear the value is passed to launch_train.sh.

Also applies to: 235-237

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 162 - 166, Remove the duplicate assignment of EAGLE_CONFIG by
keeping the first definition (EAGLE_CONFIG = "guides/CR2_eagle_config.json") and
deleting the later, redundant re-assignment in the subsequent cell; ensure any
comments in that later cell reference the existing variable rather than
redefining it so users can run cells independently without relying on a prior
re-definition of EAGLE_CONFIG.
examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py (1)

166-166: async def main contains no await — drop the async keyword.

Every I/O operation inside (open(), load_dataset(), tqdm) is synchronous. The asyncio.run() trampoline adds overhead and implies async behavior that doesn't exist.

♻️ Proposed fix
-async def main(args: argparse.Namespace) -> None:
+def main(args: argparse.Namespace) -> None:
-    asyncio.run(main(args))
+    main(args)

And remove the import asyncio since it would no longer be needed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
at line 166, The function main is declared async but contains no await and
should be a regular synchronous function: change "async def main(...)" to "def
main(...)", remove the unused "import asyncio" and replace any invocation of
"asyncio.run(main(...))" with a direct call to "main(...)" (references: main,
asyncio.run, import asyncio, and the synchronous calls open(), load_dataset(),
tqdm).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02fa362 and 12fda6a.

⛔ Files ignored due to path filters (1)
  • examples/speculative_decoding/guides/nemotron_mapping.csv is excluded by !**/*.csv
📒 Files selected for processing (3)
  • examples/speculative_decoding/guides/CR2_eagle_config.json
  • examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb
  • examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py

Comment on lines +315 to +327
"metadata": {
  "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
    "name": "python3"
  },
  "language_info": {
    "name": "python",
    "version": "3.10.0"
  }
},
"nbformat": 4,
"nbformat_minor": 5

⚠️ Potential issue | 🟡 Minor

Several cells are missing the id field required by nbformat 4.5.

nbformat_minor: 5 declares the notebook as nbformat 4.5, which mandates that every cell carry a unique id string. Cells at the Step 1 markdown, Step 1 install code, Step 2 markdown, Step 2 login code, and the Step 7 export code block are all missing id fields. nbformat.validate() will fail with a ValidationError for each missing cell ID, which will break papermill, nbmake, and similar CI-validation tools.

Add a unique id to each affected cell, e.g.:

{
  "cell_type": "markdown",
  "id": "a1b2c3d4",
  ...
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/speculative_decoding/guides/train_eagle_head_cosmos_reason2.ipynb`
around lines 315 - 327, The notebook declares nbformat_minor: 5 so each cell
must include a unique "id" field; add a short unique string "id" to the missing
cells named in the review (Step 1 markdown, Step 1 install code, Step 2
markdown, Step 2 login code, and Step 7 export code block) so their JSON objects
include an "id": "<unique_id>" property; ensure each id is unique within the
notebook (e.g., "step1-md-1", "step1-install-1", "step2-md-1", "step2-login-1",
"step7-export-1") to satisfy nbformat 4.5 validation and avoid papermill/nbmake
failures.
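A minimal way to backfill the missing cell IDs is to treat the notebook as plain JSON, so no nbformat dependency is needed. This is a one-off sketch, not part of the PR; the commented usage lines assume the notebook's filename.

```python
import json  # used in the example usage below
import uuid

def add_missing_cell_ids(nb: dict) -> dict:
    """Give every cell a unique "id" field, as nbformat 4.5 requires."""
    for cell in nb["cells"]:
        cell.setdefault("id", uuid.uuid4().hex[:8])
    return nb

# Example one-off usage against the notebook file on disk:
# with open("train_eagle_head_cosmos_reason2.ipynb") as f:
#     nb = json.load(f)
# with open("train_eagle_head_cosmos_reason2.ipynb", "w") as f:
#     json.dump(add_missing_cell_ids(nb), f, indent=1)
```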

Comment on lines +87 to +98
parser.add_argument(
    "--mapping-file",
    type=Path,
    default=None,
    help=(
        "Path to a CSV file containing one 0-based dataset row index per line, "
        "e.g.:\n 429801\n 271023\n 432477\n"
        "Rows are loaded in the order they appear in the file. "
        "When provided, the dataset is downloaded (not streamed) to allow random access "
        "and --max-samples is ignored."
    ),
)

⚠️ Potential issue | 🟡 Minor

Mapping file parsed as bare integers — will crash with ValueError if the CSV has a header row.

The --mapping-file argument is documented as "a CSV file" and references guides/nemotron_mapping.csv, a .csv-extension file. The parsing at lines 184–186 does int(line.strip()) unconditionally; any non-numeric line (e.g., a column header like "index") raises ValueError: invalid literal for int().

🛡️ Proposed fix — skip non-numeric lines and document the expected format precisely
         with args.mapping_file.open("r", encoding="utf-8") as f:
             ordered_source_indices: list[int] = [
-                int(line.strip()) for line in f if line.strip()
+                int(line.strip()) for line in f if line.strip().isdigit()
             ]

Alternatively, update the help text to explicitly state that the file must contain no header — one 0-based integer per line — and rename it with a .txt extension to eliminate ambiguity.

Also applies to: 183-186

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@examples/speculative_decoding/prepare_input_conversations/add_nemotron_chat.py`
around lines 87 - 98, The mapping-file reader currently calls int(line.strip())
for each CSV row which will raise ValueError on headers or non-numeric lines;
update the logic that processes the --mapping-file input (the
parser.add_argument('--mapping-file', ...) and the code block that converts file
lines to ints) to robustly skip blank lines and non-numeric rows — either by
wrapping int(...) in a try/except ValueError and ignoring/logging invalid lines
or by filtering with a numeric check (e.g., str.strip().lstrip('-').isdigit()),
and also update the --mapping-file help text to explicitly state the expected
format (one 0-based integer per line, no header) so callers know the
requirement.
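A header-tolerant variant of the mapping-file reader could look like the sketch below. This is illustrative only; the real flag handling lives in add_nemotron_chat.py.

```python
from pathlib import Path

def read_mapping(mapping_file: Path) -> list[int]:
    """Read one 0-based row index per line, skipping blanks and non-numeric rows."""
    indices: list[int] = []
    for line in mapping_file.read_text(encoding="utf-8").splitlines():
        token = line.strip()
        if token.lstrip("-").isdigit():  # tolerates a header row or stray text
            indices.append(int(token))
    return indices
```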
