Add stepwise intermediate reward for RL #526

Shaobo-Zhou · 2025-11-26T16:11:16Z

Description

This PR introduces a shaped, step-wise reward signal for the RL-based compiler.

For the figures of merit expected_fidelity and estimated_success_probability, the reward is computed in two regimes:

Exact regime (native + mapped circuits)
If the circuit consists only of device-native gates and respects the device’s coupling map, the step reward is based on the change in the exact calibration-aware metric between successive steps.
Approximate regime (non-native / unmapped circuits)
If the circuit still contains non-native gates or violates the device topology, a conservative canonical cost model is used to approximate the expected fidelity and ESP. The intermediate reward is then derived from the change in this approximate metric.

Checklist:

The pull request only contains commits that are focused and relevant to this change.
I have added appropriate tests that cover the new/changed functionality.
I have updated the documentation to reflect these changes.
I have added entries to the changelog for any noteworthy additions, changes, fixes, or removals.
I have added migration instructions to the upgrade guide (if needed).
The changes follow the project's style guidelines and introduce no new warnings.
The changes are fully tested and pass the CI checks.
I have reviewed my own code changes.

Update action space and feature space Update actions Update action space

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Remove example_test.py Remove example_test.py

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations

Signed-off-by: Shaobo-Zhou <[email protected]>

Fix bugs Fix bugs Fix bugs

…e gate check

Fix windows runtime warning problem Fix windows runtime warning issue

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (3)

src/mqt/predictor/rl/predictorenv.py (2)
269-269: Remove unused noqa directive.

Static analysis indicates SLF001 is not enabled, making this directive unnecessary.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
256-264: Clarify symmetric reward scaling intent.

Both positive and negative delta branches multiply by reward_scale, while no_effect_penalty is only used for zero delta. The naming suggests asymmetric behavior but implementation is symmetric. Consider renaming for clarity:
-        self.reward_scale = 1
-        self.no_effect_penalty = 0.001
+        self.delta_scale = 1
+        self.no_progress_penalty = 0.001
And update usage accordingly. This makes the symmetric scaling intent explicit.
src/mqt/predictor/rl/cost_model.py (1)
300-301: Remove redundant initialization of t_hat.

Line 300 initializes t_hat = 0.0 but it's immediately overwritten on line 301. This is dead code.
-    t_hat = 0.0
     t_hat = (n_1q * tau1_avg + n_2q * tau2_avg) / k_eff

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c42b41 and 8259b4d.

📒 Files selected for processing (5)

src/mqt/predictor/reward.py (1 hunks)
src/mqt/predictor/rl/cost_model.py (1 hunks)
src/mqt/predictor/rl/predictorenv.py (10 hunks)
tests/compilation/test_predictor_rl.py (5 hunks)
tests/hellinger_distance/test_estimated_hellinger_distance.py (1 hunks)

🧰 Additional context used

🧠 Learnings (4)

📚 Learning: 2025-11-01T15:57:31.153Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.

Applied to files:

src/mqt/predictor/rl/cost_model.py

📚 Learning: 2025-10-09T13:20:11.483Z

Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.

Applied to files:

src/mqt/predictor/rl/cost_model.py

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (4)

tests/hellinger_distance/test_estimated_hellinger_distance.py (2)

src/mqt/predictor/ml/predictor.py (2)

compile_training_circuits (206-240)

generate_training_data (242-300)

tests/device_selection/test_predictor_ml.py (2)

path_uncompiled_circuits (27-29)

path_compiled_circuits (33-35)

tests/compilation/test_predictor_rl.py (2)

src/mqt/predictor/rl/cost_model.py (2)

canonical_cost (171-183)

get_cost_table (150-168)

src/mqt/predictor/rl/predictorenv.py (1)

calculate_reward (272-356)

src/mqt/predictor/rl/predictorenv.py (5)

src/mqt/predictor/rl/actions.py (4)

CompilationOrigin (92-98)

DeviceDependentAction (136-146)

PassType (101-110)

Action (114-127)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (1)

create_feature_dict (72-94)

src/mqt/predictor/rl/parsing.py (1)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

src/mqt/predictor/reward.py (2)

tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

device (48-50)

tests/compilation/test_reward.py (1)

device (38-40)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (13)

tests/hellinger_distance/test_estimated_hellinger_distance.py (1)

219-226: LGTM! API calls correctly updated to use keyword arguments.

The compile_training_circuits and generate_training_data calls now properly use keyword arguments, aligning with the updated function signatures in src/mqt/predictor/ml/predictor.py. The removal of num_workers is consistent with the new NO_PARALLEL constant handling parallelism internally.

src/mqt/predictor/rl/cost_model.py (4)

41-147: Well-structured canonical cost tables with clear documentation.

The per-device cost tables are well-organized with clear comments explaining the decomposition rationale. The registry pattern with DEVICE_CANONICAL_COSTS provides clean extensibility.

150-168: Safe fallback behavior with appropriate warnings.

The unknown-device fallback to Torino with both a warning and log message ensures visibility while maintaining functionality. The stacklevel=3 is appropriate for the call chain.

186-218: Robust gate cost estimation with conservative fallbacks.

The iteration pattern correctly uses the modern CircuitInstruction access. The conservative fallback for unknown gates (3×1q or 1×2q+4×1q) is reasonable for approximation purposes.

221-247: Clean implementation of approximate fidelity and ESP calculations.

Both functions properly clamp outputs to [0, 1] and handle edge cases. The ESP model's idle-time penalty with liveness modulation is a reasonable approximation approach.

Also applies to: 250-308

src/mqt/predictor/reward.py (1)

206-212: Proper handling of discrete time unit with safe fallback.

The device.dt or 1.0 fallback correctly handles backends without discrete-time calibration, preventing TypeError when device.dt is None. The dt-scaled duration is appropriately used in the exponential decay calculation.

tests/compilation/test_predictor_rl.py (3)

156-167: Good coverage for unknown-device fallback and unknown-gate defaults.

The test properly verifies that unknown devices trigger a UserWarning and fall back to TORINO_CANONICAL_COSTS, and that unknown gates return (0, 0). This aligns with the behavior defined in get_cost_table and canonical_cost.

170-219: Comprehensive test coverage for dual-path reward calculation.

The test covers all three modes (exact, approx, auto) for ESP and verifies that critical_depth always returns exact results. The addition of the auto-exact path test (lines 197-200) addresses the previous review suggestion. The monkeypatch approach for forcing the approx path is appropriate.

108-115: Good addition of mapping validation.

The CheckMap assertion ensures the compiled circuit respects the device's coupling map, complementing the existing native-gates check.

src/mqt/predictor/rl/predictorenv.py (4)

272-356: Well-designed dual-path reward calculation with proper mode handling.

The calculate_reward method cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto mode correctly uses _is_native_and_mapped to determine the appropriate path. The device_id parameter is now correctly passed to both approx_expected_fidelity and approx_estimated_success_probability.

608-730: Thorough device calibration caching with proper error handling.

The _ensure_device_averages_cached method properly aggregates 1q/2q error rates, durations, and coherence times. Key strengths:

Handles dt scaling for durations correctly

Tries flipped orientation for uni-directional couplings (lines 690-691)

Uses specific exceptions (KeyError, AttributeError) per past review feedback

Provides informative error messages when calibration data is missing

732-741: Clean helper for native/mapped circuit detection.

The _is_native_and_mapped method properly combines both checks using GatesInBasis and CheckMap passes, returning a boolean suitable for auto-mode reward selection.

190-270: Solid step-wise reward shaping implementation.

The refactored step() method implements the shaped reward signal as described in the PR objectives:

Pre-action reward evaluation establishes baseline

Post-action delta computation drives intermediate rewards

The prev_kind == "approx" and new_kind == "exact" check (line 253-254) correctly handles regime transitions

Terminal actions use exact metrics for final reward

One note: The logging statements at lines 210 and 241 use f-strings (flagged by Ruff G004), but this is a stylistic preference and doesn't affect functionality.

src/mqt/predictor/rl/predictorenv.py

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (2)

src/mqt/predictor/rl/predictorenv.py (2)
269-269: Remove unused noqa directive.

Static analysis indicates this directive is unnecessary since SLF001 is not enabled in the linter configuration.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
458-458: Type annotation mismatch persists.

The variable is annotated as PropertySet | None but initialized to {}. Since PropertySet is a Qiskit dict-like class, consider initializing to None and updating the downstream logic, or change the annotation to dict[str, Any] | None to match the runtime type.
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
This would require updating line 477/484 to handle the None initialization:
pm_property_set = dict(pm.property_set) if hasattr(pm, "property_set") else None

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4b5bf6f and 14aa3ac.

📒 Files selected for processing (2)

src/mqt/predictor/rl/predictorenv.py (10 hunks)
tests/compilation/test_predictor_rl.py (5 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (6)

src/mqt/predictor/reward.py (4)

crit_depth (40-43)

esp_data_available (229-324)

estimated_success_probability (102-226)

expected_fidelity (46-74)

src/mqt/predictor/rl/actions.py (5)

CompilationOrigin (92-98)

DeviceDependentAction (136-146)

PassType (101-110)

get_actions_by_pass_type (462-467)

Action (114-127)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (2)

create_feature_dict (72-94)

get_path_training_circuits (107-109)

src/mqt/predictor/rl/parsing.py (1)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.7)

tests/compilation/test_predictor_rl.py

227-227: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04

🔇 Additional comments (9)

src/mqt/predictor/rl/predictorenv.py (5)

179-188: LGTM!

The renamed variables (reward_scale, no_effect_penalty) are clearer than the previous names. The device-average caching variables are well-structured for the approximate reward path.

272-356: LGTM!

The dual-path reward calculation is well-structured with clear mode selection logic. The device_id is now correctly passed to the approximate reward functions, and the fallback behavior for unsupported reward functions is appropriately logged.

495-526: LGTM!

The refactored method properly accepts the property set as a parameter, improving testability and decoupling from the PassManager. The early return for falsy pm_property_set handles edge cases appropriately.

732-741: LGTM!

Clean helper method that encapsulates the native-and-mapped check logic. The use of Qiskit's analysis passes (GatesInBasis, CheckMap) is appropriate.

253-254: Important regime-switch handling.

Setting delta_reward = 0.0 when transitioning from approximate to exact mode prevents numerical artifacts from comparing metrics computed with different methods. This is a sound design choice.

tests/compilation/test_predictor_rl.py (4)

35-44: LGTM!

The new imports support testing the cost model fallback behavior and the dual-path reward computation. The TYPE_CHECKING pattern for MonkeyPatch is correctly used.

105-115: LGTM!

Adding explicit validation via GatesInBasis and CheckMap strengthens the test by verifying the compiled circuit meets device constraints. The descriptive assertion messages aid debugging.

156-167: LGTM!

Good coverage of cost model edge cases: unknown device fallback and unknown gate default. The test properly verifies both the warning emission and the returned values.

170-235: LGTM!

Comprehensive test coverage for the dual-path reward computation:

ESP exact/approx/auto modes

Auto mode selecting exact for native+mapped circuits (addresses previous review feedback)

RuntimeError on broken Target API

Critical depth always returning exact

The monkeypatch usage for forcing the approximate path is appropriate.

coderabbitai · 2025-12-04T10:02:31Z

src/mqt/predictor/rl/predictorenv.py

+        if not p1 and not p2:
+            msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward."
+            raise RuntimeError(msg)


🧹 Nitpick | 🔵 Trivial

Consider differentiating empty 1q vs 2q calibration data.

The check if not p1 and not p2 raises a RuntimeError, but a device with only single-qubit gates (no 2q gates) is valid. The error message could be misleading if p2 is empty by design rather than missing calibration.

Consider logging a warning instead of raising if one list is populated:

if not p1 and not p2: msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward." raise RuntimeError(msg) + elif not p1: + logger.warning("No single-qubit calibration data found; _p1_avg will be 0.0") + elif not p2: + logger.warning("No two-qubit calibration data found; _p2_avg will be 0.0")

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around lines 706-708, the current check treats missing 1q or 2q calibration the same and raises a RuntimeError when both p1 and p2 are falsy; change the logic so that you only raise when both p1 and p2 are empty, but if one is populated and the other is empty emit a warning (include which calibration is missing: "missing 1q" or "missing 2q") and continue using the available calibration to compute the approximate reward; ensure the warning uses the module/logger available in this file and updates the message to be unambiguous about whether the absence is expected (e.g., device has no 2q gates) or truly missing.

@Shaobo-Zhou the suggestion from @coderabbitai here seems to make sense to me.

@burgholzer Thank you for the confirmation!

@Shaobo-Zhou The rationale here is that some quantum devices may legitimately have only single-qubit operations available (or conversely, a device definition might only expose 2q calibration), and the current check would incorrectly raise an error in such cases.

The recommended change would:

Only raise RuntimeError when both p1 and p2 are completely empty (true missing calibration data)

Log a warning when one list is empty but the other has data (partial calibration, which may be expected)

Continue execution with whatever calibration data is available

Here's the implementation approach:

if not p1 and not p2: msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward." raise RuntimeError(msg) elif not p1: logger.warning("No single-qubit calibration data found in Target; _p1_avg will be 0.0") elif not p2: logger.warning("No two-qubit calibration data found in Target; _p2_avg will be 0.0")

This ensures the method is more robust and provides clearer diagnostics about what calibration data is actually missing versus what might be legitimately absent by device design.

coderabbitai · 2025-12-04T10:02:31Z

tests/compilation/test_predictor_rl.py

+        RuntimeError,
+        match=re.escape("Device target does not expose the required Target API for approximate reward computation."),
+    ):
+        broken_predictor.env._ensure_device_averages_cached()  # noqa: SLF001


🧹 Nitpick | 🔵 Trivial

Remove unused noqa directive.

Static analysis indicates SLF001 is not enabled, making this directive unnecessary.

- broken_predictor.env._ensure_device_averages_cached() # noqa: SLF001 + broken_predictor.env._ensure_device_averages_cached()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

broken_predictor.env._ensure_device_averages_cached() # noqa: SLF001

broken_predictor.env._ensure_device_averages_cached()

🧰 Tools

🪛 Ruff (0.14.7)

227-227: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

🤖 Prompt for AI Agents

In tests/compilation/test_predictor_rl.py around line 227, the call to broken_predictor.env._ensure_device_averages_cached() includes an unnecessary "# noqa: SLF001" directive; remove the trailing noqa directive from that line so the code reads simply broken_predictor.env._ensure_device_averages_cached() and commit the change.

SLF is enabled in the ruff rules of the project.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check the ruff configuration in pyproject.toml cat pyproject.toml | grep -A 50 "\[tool.ruff"

Length of output: 2538

@burgholzer, you're absolutely right! I've confirmed that SLF (flake8-self) is indeed enabled in the extend-select list in pyproject.toml. The # noqa: SLF001 directive is necessary to suppress the private member access warning for _ensure_device_averages_cached().

My original suggestion to remove the directive was incorrect. The RUF100 warning appears to be a false positive in this case—the directive is required and should remain. Thank you for the correction!

🧬 Code Graph Analysis Results

tests/compilation/test_predictor_rl.py

<no_relevant_code_snippets>
</no_relevant_code_snippets>

✏️ Learnings added

Learnt from: burgholzer Repo: munich-quantum-toolkit/predictor PR: 526 File: tests/compilation/test_predictor_rl.py:225-225 Timestamp: 2025-12-25T13:28:02.860Z Learning: In the munich-quantum-toolkit/predictor repository, Ruff is configured with the "SLF" (flake8-self) rule category enabled in extend-select. SLF001 (private member access) is active, so `# noqa: SLF001` directives are necessary when accessing private methods/attributes (e.g., `_ensure_device_averages_cached()`), even if Ruff reports RUF100 warnings suggesting the directive is unused.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: flowerthrower Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 23 File: docs/conf.py:110-130 Timestamp: 2025-12-14T15:23:54.712Z Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the Ruff configuration has 'ALL' rules enabled with only specific rules disabled. PLR6301 (no-self-use) is active, so `# noqa: PLR6301` directives are necessary for methods that don't use self, even if Ruff reports RUF100 warnings suggesting the directive is unused.

Learnt from: burgholzer Repo: munich-quantum-toolkit/qmap PR: 846 File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34 Timestamp: 2025-11-27T21:26:39.677Z Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Learnt from: burgholzer Repo: munich-quantum-toolkit/core PR: 1326 File: python/mqt/core/__init__.py:22-22 Timestamp: 2025-11-24T10:19:41.147Z Learning: In the munich-quantum-toolkit/core repository, Ruff is configured with 'ALL' rules enabled by default, and only specific rules are selectively disabled. When reviewing changes that enable previously-disabled rules (like PLC0415), noqa directives for those rules become necessary and should be retained.

coderabbitai

Actionable comments posted: 5

♻️ Duplicate comments (2)

src/mqt/predictor/rl/predictorenv.py (2)
269-269: Unused noqa directive.

Static analysis confirms SLF001 is not enabled. Remove the directive.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
458-458: Type annotation mismatch.

The variable is annotated as PropertySet | None but initialized to an empty dict. Either initialize as None or change the annotation to dict[str, Any]:
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
Then adjust the stochastic branch return type expectation accordingly.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 14aa3ac and 04fde70.

📒 Files selected for processing (1)

src/mqt/predictor/rl/predictorenv.py (10 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (4)

src/mqt/predictor/reward.py (1)

crit_depth (40-43)

src/mqt/predictor/rl/helper.py (1)

get_path_training_circuits (107-109)

src/mqt/predictor/rl/parsing.py (2)

final_layout_bqskit_to_qiskit (186-226)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

210-210: Logging statement uses f-string

(G004)

241-241: Logging statement uses f-string

(G004)

269-269: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04

🔇 Additional comments (3)

src/mqt/predictor/rl/predictorenv.py (3)

45-75: LGTM on import organization.

The imports are well-organized, with appropriate separation of type-checking imports and runtime imports. The new imports for approx_expected_fidelity, approx_estimated_success_probability, and calc_supermarq_features align with the dual-path reward computation feature.

272-356: Well-structured dual-path reward computation.

The calculate_reward method cleanly separates exact (calibration-based) and approximate (cost-model-based) paths. The auto-mode logic for selecting the appropriate path based on circuit state is sound. Previous feedback about missing device_id has been addressed.

495-526: LGTM on layout postprocessing refactor.

The method now accepts pm_property_set directly as a dict, which is cleaner than passing a PassManager. The early return on line 501-502 properly handles the None/empty case.

src/mqt/predictor/rl/predictorenv.py

coderabbitai · 2025-12-04T10:25:15Z

src/mqt/predictor/rl/predictorenv.py

+    def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool:
+        check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names)
+        check_nat_gates(qc)
+        only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"]
+
+        check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+        check_mapping(qc)
+        mapped = check_mapping.property_set["is_swap_mapped"]
+
+        return bool(only_nat_gates and mapped)


🧹 Nitpick | 🔵 Trivial

Consider caching pass instances for performance.

_is_native_and_mapped creates new GatesInBasis and CheckMap instances on every call. During RL training, this method is invoked frequently via calculate_reward. Consider caching the pass instances as instance variables:

+ def __init__(self, ...): + ... + # Cached analysis passes for _is_native_and_mapped + self._gates_in_basis_check: GatesInBasis | None = None + self._check_map: CheckMap | None = None def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool: - check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names) + if self._gates_in_basis_check is None: + self._gates_in_basis_check = GatesInBasis(basis_gates=self.device.operation_names) + check_nat_gates = self._gates_in_basis_check check_nat_gates(qc) only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"] - check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map())) + if self._check_map is None: + self._check_map = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map())) + check_mapping = self._check_map check_mapping(qc) mapped = check_mapping.property_set["is_swap_mapped"] return bool(only_nat_gates and mapped)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/predictorenv.py around lines 732-741, the method _is_native_and_mapped currently instantiates GatesInBasis and CheckMap on every call which is costly during RL training; instead, create and store these pass instances as persistent attributes (e.g., self._gates_in_basis and self._check_map) during the environment initialization (or the first use), reuse them in _is_native_and_mapped, and only rebuild/update them if the device or its coupling map changes (add a simple invalidation/update path). This removes repeated allocations and speeds up frequent calls.

@Shaobo-Zhou the suggestion from @coderabbitai here seems to make sense to me.

@burgholzer Thank you for the feedback! The caching optimization should help reduce overhead during RL training episodes, especially since _is_native_and_mapped is called at least once per step via calculate_reward (in "auto" mode).

@Shaobo-Zhou, if you decide to implement this, one additional note: since GatesInBasis and CheckMap are analysis passes that modify their internal property_set when called, the caching approach will work fine as long as we call them each time with the current circuit—the property set will be updated with fresh results for each circuit.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <[email protected]>

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (5)

src/mqt/predictor/rl/predictorenv.py (5)
17-18: Empty conditional block should be removed.

This empty block with just pass serves no purpose. If it was a placeholder for future type-only imports on Python 3.11+, add a comment explaining the intent; otherwise remove the block entirely.
-if sys.version_info >= (3, 11) and TYPE_CHECKING:  # pragma: no cover
-    pass
274-274: Remove unused noqa directive.

Static analysis flags this noqa: SLF001 as unused. The directive can be removed.
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
463-463: Type annotation mismatch persists.

The variable is annotated as PropertySet | None but initialized as {} (empty dict). Later it's assigned dict(pm.property_set). For type consistency, consider:
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
Then ensure the downstream code handles None appropriately (which it already does at line 506).

711-713: Consider logging a warning for partial calibration data.

The check raises RuntimeError when both p1 and p2 are empty, which is correct. However, if only one list is populated (e.g., a device with only single-qubit gates), a warning would be helpful for debugging.
         if not p1 and not p2:
             msg = "No valid 1q/2q calibration data found in Target, cannot compute approximate reward."
             raise RuntimeError(msg)
+        if not p1:
+            logger.warning("No single-qubit calibration data found; _p1_avg will be 0.0")
+        elif not p2:
+            logger.warning("No two-qubit calibration data found; _p2_avg will be 0.0")
737-746: Consider caching analysis pass instances for performance.

_is_native_and_mapped creates new GatesInBasis and CheckMap instances on every call. During RL training, this method is invoked frequently via calculate_reward. Caching these passes would improve performance.

Initialize cached passes in __init__:
# In __init__:
self._gates_in_basis_check: GatesInBasis | None = None
self._check_map: CheckMap | None = None
Then lazily initialize in _is_native_and_mapped:
 def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool:
-    check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names)
+    if self._gates_in_basis_check is None:
+        self._gates_in_basis_check = GatesInBasis(basis_gates=self.device.operation_names)
+    check_nat_gates = self._gates_in_basis_check
     check_nat_gates(qc)
     only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"]

-    check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+    if self._check_map is None:
+        self._check_map = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+    check_mapping = self._check_map
     check_mapping(qc)
     mapped = check_mapping.property_set["is_swap_mapped"]

     return bool(only_nat_gates and mapped)

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04fde70 and 6a104b8.

📒 Files selected for processing (1)

src/mqt/predictor/rl/predictorenv.py (11 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (5)

src/mqt/predictor/reward.py (3)

crit_depth (40-43)

estimated_success_probability (102-226)

expected_fidelity (46-74)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (2)

create_feature_dict (72-94)

get_path_training_circuits (107-109)

src/mqt/predictor/rl/parsing.py (3)

final_layout_bqskit_to_qiskit (186-226)

final_layout_pytket_to_qiskit (167-183)

postprocess_vf2postlayout (229-250)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.7)

src/mqt/predictor/rl/predictorenv.py

214-214: Logging statement uses f-string

(G004)

245-245: Logging statement uses f-string

(G004)

274-274: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14

🔇 Additional comments (6)

src/mqt/predictor/rl/predictorenv.py (6)

88-89: LGTM: Penalty sign corrected and parameters made configurable.

The no_effect_penalty is now correctly initialized as a negative value (-0.001), and both reward_scale and no_effect_penalty are exposed as constructor parameters for tuning during RL experimentation. This addresses previous review feedback.

257-259: Good handling of regime transition.

The comment clearly explains why delta_reward = 0.0 when transitioning from approximate to exact metrics—the two aren't directly comparable. This is the correct approach to avoid misleading reward signals.

277-319: Well-structured dual-path reward computation.

The calculate_reward method cleanly implements the exact vs. approximate reward paths with a clear mode selection mechanism. The fallback for unsupported reward functions with a warning is good defensive coding.

500-531: LGTM: Layout postprocessing signature updated correctly.

The method signature now accepts dict[str, Any] | None which aligns with how it's called (passing dict(pm.property_set)). The early return for falsy pm_property_set is a good guard.

229-231: Good defensive normalization for high-level gates.

Decomposing unitary and clifford gates before reward computation ensures the reward functions can properly analyze the circuit. This prevents errors from gates that don't have calibration data.

130-133: Verify whether this action registration order change breaks pre-trained model compatibility.

Registering OPT actions before LAYOUT actions changes the indices in action_set. While web search suggests PassType operates as a flexible tagging system (with the RL agent dynamically determining execution order), the action indices themselves may still be significant if:

Pre-trained models encode action indices as state features

Saved experience or replay buffers reference specific action indices

Model weights learned associations with particular action index ranges

Confirm whether existing trained agents or datasets depend on action registration indices, or if this reordering is safe.

src/mqt/predictor/rl/predictorenv.py

Signed-off-by: Shaobo-Zhou <[email protected]>

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

pyproject.toml (1)
268-269: Remove duplicate [tool.check-sdist] configuration.

This section duplicates lines 97-98. TOML does not allow duplicate table headers, and parsers will reject this file, causing build failures.

Apply this diff to remove the duplicate:
-[tool.check-sdist]
-sdist-only = ["src/mqt/predictor/_version.py"]

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a104b8 and ae43600.

📒 Files selected for processing (2)

CHANGELOG.md (2 hunks)
pyproject.toml (2 hunks)

🧰 Additional context used

🧠 Learnings (6)

📚 Learning: 2025-10-10T08:09:54.528Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: bindings/CMakeLists.txt:0-0
Timestamp: 2025-10-10T08:09:54.528Z
Learning: In the Munich Quantum Toolkit (MQT) Core project, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which automatically prefixes all CMake `DESTINATION` paths with `mqt/core/` during wheel installation. Therefore, CMake install destinations are relative to the `mqt/core` package namespace, not `site-packages`.

Applied to files:

pyproject.toml

📚 Learning: 2025-10-10T08:10:16.394Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1246
File: test/python/na/test_na_fomac.py:35-0
Timestamp: 2025-10-10T08:10:16.394Z
Learning: In the munich-quantum-toolkit/core repository, scikit-build-core is configured with `wheel.install-dir = "mqt/core"` in pyproject.toml, which means CMake `install()` commands with `DESTINATION <path>` install files relative to `mqt/core/` in the wheel, making them accessible via `files("mqt.core").joinpath("<path>")`.

Applied to files:

pyproject.toml

📚 Learning: 2025-12-02T10:08:36.022Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: .github/workflows/cd.yml:40-48
Timestamp: 2025-12-02T10:08:36.022Z
Learning: In munich-quantum-toolkit workflows, the reusable-python-packaging-sdist.yml workflow uploads the sdist artifact with the name `cibw-sdist` (or `dev-cibw-sdist` for pull requests), which means it is covered by the `cibw-*` glob pattern when downloading artifacts for deployment.

Applied to files:

pyproject.toml

📚 Learning: 2025-10-14T14:37:38.047Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.

Applied to files:

CHANGELOG.md

📚 Learning: 2025-12-05T17:45:37.602Z

Learnt from: denialhaag
Repo: munich-quantum-toolkit/core PR: 1360
File: .github/workflows/reusable-mlir-tests.yml:40-43
Timestamp: 2025-12-05T17:45:37.602Z
Learning: In the munich-quantum-toolkit/core repository, patch releases of LLVM dependencies don't require documentation updates, changelog entries, or additional tests beyond what's validated by passing CI checks.

Applied to files:

CHANGELOG.md

📚 Learning: 2025-12-01T11:00:40.342Z

Learnt from: flowerthrower
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: CHANGELOG.md:18-18
Timestamp: 2025-12-01T11:00:40.342Z
Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the CHANGELOG.md intentionally references the parent MQT Core repository's release notes (https://github.com/munich-quantum-toolkit/core/releases) because the plugin repository is based on work performed in the parent repository.

Applied to files:

CHANGELOG.md

🔇 Additional comments (3)

pyproject.toml (2)

97-98: LGTM!

The check-sdist configuration correctly ensures the hatch-vcs generated version file is included in source distributions.

130-130: LGTM!

The warning filter appropriately suppresses expected warnings when tests exercise devices without defined canonical cost tables, consistent with the fallback behavior introduced in this PR.

CHANGELOG.md (1)

50-51: Changelog entries and PR links look good.

Both new changelog entries are properly added to the Unreleased section, and the corresponding PR link references are correctly defined in the PR links section. The entries follow the existing format and conventions used elsewhere in the file. Once the formatting issue on line 15 is fixed, this will be ready.

CHANGELOG.md

Shaobo-Zhou · 2025-12-17T15:00:52Z

@coderabbitai review

coderabbitai · 2025-12-17T15:01:00Z

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (1)

src/mqt/predictor/rl/predictorenv.py (1)
462-498: Consider clarifying the type annotation for pm_property_set.

Line 463 declares pm_property_set: PropertySet | None but initializes it to {} (an empty dict). The downstream code (lines 482, 489) converts pm.property_set to dict[str, Any], and line 503 checks for truthiness rather than None. For consistency:
-        pm_property_set: PropertySet | None = {}
+        pm_property_set: dict[str, Any] | None = None
Then ensure downstream code handles None appropriately (which it appears to do at line 506).

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ae43600 and b2197b3.

📒 Files selected for processing (3)

CHANGELOG.md (2 hunks)
pyproject.toml (1 hunks)
src/mqt/predictor/rl/predictorenv.py (11 hunks)

🧰 Additional context used

🧠 Learnings (5)

📚 Learning: 2025-10-14T14:37:38.047Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/yaqs PR: 212
File: CHANGELOG.md:12-15
Timestamp: 2025-10-14T14:37:38.047Z
Learning: In the munich-quantum-toolkit/yaqs project, changelog entries follow the template: "- $TITLE ([#$NUMBER]($URL)) ([**AUTHOR**](https://github.com/$AUTHOR))". Issue references should not be included in changelog entries; the PR number is sufficient for traceability.

Applied to files:

CHANGELOG.md

📚 Learning: 2025-12-05T17:45:37.602Z

Learnt from: denialhaag
Repo: munich-quantum-toolkit/core PR: 1360
File: .github/workflows/reusable-mlir-tests.yml:40-43
Timestamp: 2025-12-05T17:45:37.602Z
Learning: In the munich-quantum-toolkit/core repository, patch releases of LLVM dependencies don't require documentation updates, changelog entries, or additional tests beyond what's validated by passing CI checks.

Applied to files:

CHANGELOG.md

📚 Learning: 2025-12-01T11:00:40.342Z

Learnt from: flowerthrower
Repo: munich-quantum-toolkit/core-plugins-catalyst PR: 1
File: CHANGELOG.md:18-18
Timestamp: 2025-12-01T11:00:40.342Z
Learning: In the munich-quantum-toolkit/core-plugins-catalyst repository, the CHANGELOG.md intentionally references the parent MQT Core repository's release notes (https://github.com/munich-quantum-toolkit/core/releases) because the plugin repository is based on work performed in the parent repository.

Applied to files:

CHANGELOG.md

📚 Learning: 2025-11-27T21:26:39.677Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/qmap PR: 846
File: python/mqt/qmap/plugins/qiskit/sc/load_calibration.py:34-34
Timestamp: 2025-11-27T21:26:39.677Z
Learning: In the qmap project, the Ruff linter has the "PL" (pylint) rule category enabled, which includes PLC0415 (import-outside-top-level). Therefore, `# noqa: PLC0415` directives on lazy imports are appropriate and necessary, not unused.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

📚 Learning: 2025-11-04T14:26:25.420Z

Learnt from: marcelwa
Repo: munich-quantum-toolkit/core PR: 1243
File: test/python/qdmi/qiskit/conftest.py:11-19
Timestamp: 2025-11-04T14:26:25.420Z
Learning: In the munich-quantum-toolkit/core repository, Qiskit is always available as a dependency during testing, so import guards for qiskit-dependent imports in test files (e.g., test/python/qdmi/qiskit/*.py) are not necessary.

Applied to files:

src/mqt/predictor/rl/predictorenv.py

🧬 Code graph analysis (1)

src/mqt/predictor/rl/predictorenv.py (6)

src/mqt/predictor/reward.py (3)

crit_depth (40-43)

estimated_success_probability (102-226)

expected_fidelity (46-74)

src/mqt/predictor/rl/actions.py (3)

CompilationOrigin (92-98)

PassType (101-110)

Action (114-127)

src/mqt/predictor/rl/cost_model.py (2)

approx_estimated_success_probability (250-308)

approx_expected_fidelity (221-247)

src/mqt/predictor/rl/helper.py (1)

create_feature_dict (72-94)

src/mqt/predictor/rl/parsing.py (1)

final_layout_bqskit_to_qiskit (186-226)

src/mqt/predictor/utils.py (1)

calc_supermarq_features (93-145)

🪛 Ruff (0.14.8)

src/mqt/predictor/rl/predictorenv.py

214-214: Logging statement uses f-string

(G004)

245-245: Logging statement uses f-string

(G004)

274-274: Unused noqa directive (non-enabled: SLF001)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 900000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14
GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04

🔇 Additional comments (8)

pyproject.toml (1)

127-128: LGTM! Test warning filter appropriately scoped.

The new warning filter suppresses expected UserWarning messages from the canonical cost model when testing with devices that lack predefined cost tables. This is a reasonable approach to reduce test noise while still catching unexpected warnings.

CHANGELOG.md (2)

14-15: LGTM! Changelog entries follow project conventions.

Both entries are properly formatted with the correct template structure. The previous formatting issue has been resolved.

50-51: LGTM! PR link references are correct.

The PR links are properly formatted and correspond to the new changelog entries.
src/mqt/predictor/rl/predictorenv.py (5)
25-26: LGTM! Import additions support new dual-path reward functionality.

The new imports (PropertySet, CheckMap, GatesInBasis, cost_model functions, calc_supermarq_features) are necessary for the dual-path reward computation and device-averages caching introduced in this PR.

Also applies to: 45-48, 58-69, 75-75

88-89: LGTM! Reward shaping parameters are now configurable.

The new reward_scale and no_effect_penalty parameters allow tuning reward behavior during RL training. The negative default for no_effect_penalty correctly penalizes actions that produce no improvement. The instance variables properly support the dual-path reward computation and device-averages caching.

Also applies to: 97-98, 183-192

194-276: LGTM! Dual-path reward computation is well-structured.

The rewritten step() method correctly implements shaped intermediate rewards with proper handling of:

Exact vs. approximate reward paths

Regime transitions (zeroing delta when switching from approx to exact)

Symmetric scaling for positive/negative deltas

No-effect penalty for actions that don't change the circuit

The logic is clear and the approach aligns with the PR objectives.

Minor: Remove unused noqa directive and use lazy logging.

Static analysis flags three minor issues:

Line 274: The # noqa: SLF001 directive is unused and should be removed.

Lines 214, 245: f-string logging should use lazy % formatting for performance (e.g., logger.info("Applying %s", self.action_set[action].name)).

Apply this diff:
-        logger.info(f"Applying {self.action_set[action].name}")
+        logger.info("Applying %s", self.action_set[action].name)
         
         # ... (code omitted)
         
-            logger.info(f"Final reward ({final_kind}): {final_val}")
+            logger.info("Final reward (%s): %s", final_kind, final_val)
             
         # ... (code omitted)
         
-        self.state._layout = self.layout  # noqa: SLF001
+        self.state._layout = self.layout
277-361: LGTM! Dual-path reward calculation is well-implemented.

The calculate_reward() method correctly implements the dual-path (exact vs. approximate) reward computation:

The mode parameter allows flexible control over which path to use

The "auto" mode intelligently selects based on whether the circuit is native and mapped

Device-specific cost tables are correctly selected via device_id parameter

Unsupported reward functions gracefully fall back to exact computation with a warning

The implementation aligns well with the PR objectives for shaped intermediate rewards.

500-531: LGTM! Layout postprocessing refactor aligns with property set handling.

The signature change to accept dict[str, Any] | None instead of a PassManager is a cleaner design that decouples layout extraction from the PassManager object. The logic correctly handles missing property sets and properly constructs the TranspileLayout.

src/mqt/predictor/rl/predictorenv.py

flowerthrower

Hey @Shaobo-Zhou once again thank you very much for your continued efforts - LGTM! There is really only a small nitpick from my end left.

coderabbitai

Actionable comments posted: 2

📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8a80d2 and 7810d4d.

📒 Files selected for processing (1)

src/mqt/predictor/rl/cost_model.py (1 hunks)

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2025-10-09T13:20:11.483Z

Learnt from: DRovara
Repo: munich-quantum-toolkit/core PR: 1108
File: mlir/test/Dialect/MQTOpt/Transforms/lift-measurements.mlir:269-288
Timestamp: 2025-10-09T13:20:11.483Z
Learning: In the MQT MLIR dialect, the `rz` gate should not be included in the `DIAGONAL_GATES` set for the `ReplaceBasisStateControlsWithIfPattern` because its operator matrix does not have the required shape | 1 0 | / | 0 x | for the targets-as-controls optimization. It is only included in `LiftMeasurementsAboveGatesPatterns` where the matrix structure requirement differs.

Applied to files:

src/mqt/predictor/rl/cost_model.py

📚 Learning: 2025-11-01T15:57:31.153Z

Learnt from: burgholzer
Repo: munich-quantum-toolkit/core PR: 1283
File: src/qir/runtime/QIR.cpp:196-201
Timestamp: 2025-11-01T15:57:31.153Z
Learning: In the QIR runtime (src/qir/runtime/QIR.cpp), the PRX gate (__quantum__qis__prx__body) is an alias for the R gate (Phased X-Rotation) and should call runtime.apply<qc::R>(theta, phi, qubit), not runtime.apply<qc::RX>() which is a single-parameter rotation gate.

Applied to files:

src/mqt/predictor/rl/cost_model.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)

GitHub Check: 🐍 Test (ubuntu-24.04) / 🐍 ubuntu-24.04
GitHub Check: 🐍 Test (windows-2022) / 🐍 windows-2022
GitHub Check: 🐍 Test (macos-14) / 🐍 macos-14

🔇 Additional comments (5)

src/mqt/predictor/rl/cost_model.py (5)

1-36: LGTM: Clean module structure and imports.

The module-level organization is solid: clear documentation, appropriate use of TYPE_CHECKING for forward references, and a logger for diagnostics. The CanonicalCostTable type alias using Mapping correctly conveys immutability.

42-148: LGTM: Device-specific cost tables are well-documented.

The canonical cost tables correctly reflect different native gate sets across devices (CZ for Torino, iSWAP for Ankaa3, CZ for Emerald). The decomposition comments (e.g., CX → H•CZ•H ⇒ 6×1q + 1×2q) provide clear rationale. Gate coverage differences are intentional and align with each device's architecture.

151-169: LGTM: Safe fallback with appropriate warnings.

The unknown-device fallback to Torino is well-communicated via both warnings.warn and logger. The dual notification ensures visibility in different deployment contexts.

172-185: LGTM: Simple and correct delegation.

The function correctly wraps get_cost_table and provides the documented (0, 0) fallback for unknown gates.

223-249: LGTM: Correct fidelity approximation with defensive bounds.

The multiplicative fidelity model F = (1−p₁)^n₁ · (1−p₂)^n₂ is standard for independent gate errors. Defensive clamping and max() calls ensure numerical robustness.

coderabbitai · 2025-12-18T19:10:46Z

src/mqt/predictor/rl/cost_model.py

+def estimate_counts(
+    qc: QuantumCircuit,
+    *,
+    cost_table: CanonicalCostTable,
+) -> tuple[int, int]:
+    """Estimate canonical (n_1q, n_2q) counts for a circuit.
+
+    Uses the provided ``cost_table`` where available and a simple, conservative
+    fallback otherwise (3*1q for unknown 1q gates, 1*2q + 4*1q for unknown 2q gates).
+    """
+    n_1q = 0
+    n_2q = 0
+
+    for circuit_instr in qc.data:
+        name = circuit_instr.operation.name
+        qargs = circuit_instr.qubits
+
+        # Ignore non-unitary / timing-only ops for this count
+        if name in ("barrier", "delay", "measure"):
+            continue
+
+        cost = cost_table.get(name)
+        if cost is None:
+            # Conservative fallback by arity (only used for gates missing in the table)
+            if len(qargs) == 1:
+                n_1q += 3
+            elif len(qargs) == 2:
+                n_2q += 1
+                n_1q += 4
+        else:
+            n_1q += cost[0]
+            n_2q += cost[1]
+    return n_1q, n_2q


🧹 Nitpick | 🔵 Trivial

Consider logging unknown gates.

The conservative fallback for gates not in the cost table is reasonable, but silently falling back (lines 210-216) could hide unexpected gates or make debugging difficult. Consider adding a debug or warning log when the fallback is triggered.

🔎 Suggested addition:

cost = cost_table.get(name) if cost is None: + logger.debug( + f"Gate '{name}' not in cost table; using conservative fallback: " + f"{'(3, 0)' if len(qargs) == 1 else '(4, 1)'}" + ) # Conservative fallback by arity (only used for gates missing in the table) if len(qargs) == 1: n_1q += 3

🤖 Prompt for AI Agents

In src/mqt/predictor/rl/cost_model.py around lines 188 to 220, the fallback path for gates missing from the cost_table silently applies conservative counts; modify this to emit a log message when the fallback is triggered that includes the gate name and its arity (len(qargs)). Add a module logger (logging.getLogger(__name__)) if not present, and call logger.debug or logger.warning inside the if cost is None branch (before incrementing n_1q/n_2q) so developers can see which unknown gates caused the fallback; keep message concise and avoid sensitive data.

src/mqt/predictor/rl/cost_model.py

burgholzer

@Shaobo-Zhou just did a very quick review here in order to not hold this up for too long.
I think this will probably need one more round of iteration until it is ready to be merged. At the moment it feels like it's cutting a few corners too many.

Edit: There were also quite some CodeRabbit comments left, which should ideally be resolved.

src/mqt/predictor/rl/predictorenv.py

burgholzer · 2025-12-25T12:19:29Z

tests/compilation/test_predictor_rl.py

+        RuntimeError,
+        match=re.escape("Device target does not expose the required Target API for approximate reward computation."),
+    ):
+        broken_predictor.env._ensure_device_averages_cached()  # noqa: SLF001


SLF is enabled in the ruff rules of the project.

burgholzer · 2025-12-25T13:16:37Z

docs/setup.md

 from mqt.bench import get_benchmark, BenchmarkLevel

-uncompiled_qc = get_benchmark("ghz", level=BenchmarkLevel.ALG, circuit_size=5)
+uncompiled_qc = get_benchmark("ghz", level=BenchmarkLevel.INDEP, circuit_size=5)


We should not be changing the level of benchmarks supported by the predictor. At least, we should not lower it.
If the predictor worked before with the ALG circuits, then we should ensure that it continues to work with them in any new version.

burgholzer · 2025-12-25T13:19:13Z

src/mqt/predictor/rl/cost_model.py

This feels extremely hard coded and unnecessarily rigid.
We have very similar functionality like that programmatically in QCEC: https://github.com/munich-quantum-toolkit/qcec/blob/main/python/mqt/qcec/compilation_flow_profiles.py

These profiles can be cached for all known devices/gate sets or they can be generated on demand for new/unknown targets.
I'd argue that a similar solution should be adopted here to completely avoid the fallback behavior.

burgholzer · 2025-12-25T13:20:38Z

src/mqt/predictor/rl/predictorenv.py

+        if self.reward_function not in {"expected_fidelity", "estimated_success_probability"}:
+            if self.reward_function == "critical_depth":


this does not handle the two Hellinger fidelity metrics explicitly. Thus, they unnecessarily raise a warning in the code.

burgholzer · 2025-12-25T13:22:05Z

src/mqt/predictor/rl/predictorenv.py

+        if not p1 and not p2:
+            msg = "No valid 1q/2q calibration data found in Target; cannot compute approximate reward."
+            raise RuntimeError(msg)


@Shaobo-Zhou the suggestion from @coderabbitai here seems to make sense to me.

burgholzer · 2025-12-25T13:22:12Z

src/mqt/predictor/rl/predictorenv.py

+    def _is_native_and_mapped(self, qc: QuantumCircuit) -> bool:
+        check_nat_gates = GatesInBasis(basis_gates=self.device.operation_names)
+        check_nat_gates(qc)
+        only_nat_gates = check_nat_gates.property_set["all_gates_in_basis"]
+
+        check_mapping = CheckMap(coupling_map=CouplingMap(self.device.build_coupling_map()))
+        check_mapping(qc)
+        mapped = check_mapping.property_set["is_swap_mapped"]
+
+        return bool(only_nat_gates and mapped)


@Shaobo-Zhou the suggestion from @coderabbitai here seems to make sense to me.

burgholzer · 2025-12-25T13:23:09Z

pyproject.toml

    'ignore:.*qiskit.providers.models is deprecated since Qiskit 1.2*:DeprecationWarning:',
    'ignore:.*The class ``qiskit.qobj.*`` is deprecated as of Qiskit 1.3.*:DeprecationWarning:',
    'ignore:.*The property ``qiskit.circuit.instruction.Instruction.*`` is deprecated as of qiskit 1.3.0.*:DeprecationWarning:',
+    "ignore:.*No canonical cost table defined for device.*:UserWarning:",


This ignore should not be necessary anymore if the programmatic approach for determining the canonical cost is implemented.

burgholzer · 2025-12-25T13:23:38Z

src/mqt/predictor/rl/cost_model.py

+            if len(qargs) == 1:
+                n_1q += 3
+            elif len(qargs) == 2:
+                n_2q += 1
+                n_1q += 4


This should probably handle 3qb+ gates.

burgholzer · 2025-12-25T13:25:05Z

CHANGELOG.md

 ### Changed

+- ✨ Improve RL reward design by adding intermediate rewards ([#526]) ([**@Shaobo-Zhou**])
+- 🔧 Change test circuit level for RL predictor from ALG to INDEP ([#449]) ([**@Shaobo-Zhou**])


I suspect this can be removed if the respective review comment is addressed.

Shaobo Zhou and others added 30 commits March 29, 2025 19:20

Update predictor(adding callbacks)

129b60f

Update

08889bd

Restore helper.py and predictor.py to match upstream

e2ff3fe

Merge remote-tracking branch 'upstream/main'

1c32d15

Implement new mapping actions

78dc1aa

Update action space and feature space Update actions Update action space

Fix: resolve pre-commit issues and add missing annotations

a3ba836

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Remove example_test.py Remove example_test.py

Fix: resolve pre-commit issues and add missing annotations

5935e6f

Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations Fix: resolve pre-commit issues and add missing annotations

Fix: resolve pre-commit issues and add missing annotations

f71fb29

Fix: resolve pre-commit issues and add missing annotations

3c7592b

Fix mypy errors

6db5c27

Fix mypy errors

47841c5

Fix dependencies issues

b1ac8ce

Fix dependency issues

5f8473c

Add missing zip file

7491ec0

Fix issue with Python 3.13

3346842

Merge branch 'main' into hybrid-mapping

6f7a73c

Signed-off-by: Shaobo-Zhou <[email protected]>

Remove Python 3.13 from noxfile.py due to compatibility issue

6c67349

Skip minimums session on Windows due to CI slowness

2692b96

Fix bugs

f4874e6

Fix bugs Fix bugs Fix bugs

Fix bugs

54eec91

Use default Qiskit settings for VF2Layout and add assertion for nativ…

845f7de

…e gate check

Debug

3418936

Fix missing argument

ae870cc

Fix warning issues

861bc62

Fix window runtime warning problem

fa989b6

Fix window runtime warning problem

405bd39

Add time limit for VF2PostLayout

7b2f321

Fix windows runtime warning problem

b67d0a6

Fix windows runtime warning problem Fix windows runtime warning issue

Add new actions

bf7c9ee

Add new actions

6d2733f

coderabbitai bot requested changes Dec 3, 2025

View reviewed changes

src/mqt/predictor/rl/predictorenv.py Outdated Show resolved Hide resolved

Shaobo-Zhou added 2 commits December 4, 2025 00:58

Fix warnings

4b5bf6f

Improve coverage

14aa3ac

coderabbitai bot requested changes Dec 4, 2025

View reviewed changes

Improve coverage

04fde70

coderabbitai bot requested changes Dec 4, 2025

View reviewed changes

Shaobo-Zhou and others added 2 commits December 4, 2025 12:02

Update src/mqt/predictor/rl/predictorenv.py

1df8071

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: Shaobo-Zhou <[email protected]>

Fixes

6a104b8

coderabbitai bot requested changes Dec 4, 2025

View reviewed changes

src/mqt/predictor/rl/predictorenv.py Show resolved Hide resolved

Merge branch 'main' into new_RL

ae43600

Signed-off-by: Shaobo-Zhou <[email protected]>

coderabbitai bot requested changes Dec 17, 2025

View reviewed changes

CHANGELOG.md Outdated Show resolved Hide resolved

Shaobo-Zhou added 2 commits December 17, 2025 15:48

Fixes

b91f4f4

Fix format

b2197b3

coderabbitai bot requested changes Dec 17, 2025

View reviewed changes

src/mqt/predictor/rl/predictorenv.py Outdated Show resolved Hide resolved

src/mqt/predictor/rl/predictorenv.py Show resolved Hide resolved

Shaobo-Zhou added 2 commits December 17, 2025 16:49

Fixes

331972a

Fixes

b8a80d2

flowerthrower approved these changes Dec 18, 2025

View reviewed changes

Update comments

7810d4d

coderabbitai bot requested changes Dec 18, 2025

View reviewed changes

flowerthrower requested a review from burgholzer December 19, 2025 12:29

burgholzer added 6 commits December 25, 2025 11:15

Merge branch 'main' into new_RL

8b3ecaf

✏️ curate changelog

bc49c63

✏️ minimize unnecessary whitespace changes

5cd1bf7

⏪ revert Windows workaround

5a895d0

🏷️ Various typing fixes

1ecb3e2

⏪ avoid CUDA out of memory error

dde6f89

burgholzer added the feature New feature or request label Dec 25, 2025

burgholzer requested changes Dec 25, 2025

View reviewed changes

	broken_predictor.env._ensure_device_averages_cached() # noqa: SLF001
	broken_predictor.env._ensure_device_averages_cached()

		if self.reward_function not in {"expected_fidelity", "estimated_success_probability"}:
		if self.reward_function == "critical_depth":

Uh oh!

Add stepwise intermediate reward for RL #526

Are you sure you want to change the base?

Add stepwise intermediate reward for RL #526

Uh oh!

Conversation

Shaobo-Zhou commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklist:

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 25, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Shaobo-Zhou commented Dec 17, 2025

Uh oh!

coderabbitai bot commented Dec 17, 2025

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

flowerthrower left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Shaobo-Zhou commented Nov 26, 2025 •

edited

Loading

coderabbitai bot Dec 4, 2025 •

edited

Loading

coderabbitai bot Dec 4, 2025 •

edited

Loading

coderabbitai bot Dec 4, 2025 •

edited

Loading

flowerthrower left a comment •

edited

Loading

burgholzer left a comment •

edited

Loading