
Conversation

Contributor

@ChiahsinChu ChiahsinChu commented Mar 20, 2025

Overview

This PR adds data modifier plugin functionality to the PyTorch implementation of DeepMD. The feature allows on-the-fly data modification during training and inference, enabling advanced data manipulation.

Key Changes

1. Added Data Modifier to Training Pipeline

  • File: deepmd/pt/entrypoints/main.py
    • Added imports for data modifier functionality (get_data_modifier)
    • Added modifier initialization in get_trainer() function
    • Added modifier parameter to data loader initialization for both training and validation datasets
    • Enhanced model freezing process to include modifier handling with temporary file management

2. Added Data Modifier to Inference

  • File: deepmd/pt/infer/deep_eval.py
    • Added modifier loading and handling in DeepEval class
    • Enhanced model loading process to handle extra files containing modifier data
    • Added modifier application in inference methods to modify model predictions

3. Implemented Data Modifier Framework

  • File: deepmd/pt/modifier/__init__.py (entirely new)
    • Created base class BaseModifier with registration system (a registration sketch follows this list)
    • Implemented three specific modifier types:
      • ModifierRandomTester: Applies random scaling to energy/force/virial data for testing
      • ModifierZeroTester: Zeroes out energy/force/virial data for testing
      • ModifierScalingTester: Applies scaled model predictions as data modifications
    • Added comprehensive argument parsing for modifier configuration

4. Added Data Modifier Tests

  • File: source/tests/pt/test_data_modifier.py (entirely new)
    • Created comprehensive test suite for data modifier functionality
    • Tests include:
      • Modifier initialization and data modification verification
      • Ensuring data modification is applied only once
      • Testing inference with data modification by verifying scaled model predictions
    • Added helper methods for test data management and comparison
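
A minimal sketch of what a custom modifier might look like under this registration system is shown below. BaseModifier, forward, and BaseModifier.register appear in this PR; the exact forward signature and return convention here are assumptions inferred from the test modifiers, not the authoritative interface:

import torch

from deepmd.pt.modifier import BaseModifier


@BaseModifier.register("my_shift")  # hypothetical type name
class MyShiftModifier(BaseModifier):
    """Toy modifier predicting a constant per-frame energy shift."""

    def __init__(self, shift: float = 0.1) -> None:
        super().__init__()
        self.shift = shift

    def forward(
        self,
        coord: torch.Tensor,
        atype: torch.Tensor,
        box: torch.Tensor = None,
        fparam: torch.Tensor = None,
        aparam: torch.Tensor = None,
        do_atomic_virial: bool = False,
    ) -> dict:
        nframes = coord.shape[0]
        # modify_data() subtracts these outputs from the training labels;
        # the model wrapper adds them back to predictions at inference.
        return {
            "energy": self.shift * torch.ones((nframes, 1), dtype=coord.dtype),
            "force": torch.zeros_like(coord),
            "virial": torch.zeros((nframes, 9), dtype=coord.dtype),
        }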

Summary by CodeRabbit

  • New Features

    • Pluggable data modifier API: create, attach, serialize, and embed modifiers with models; modifiers propagate through loaders, the model wrapper, and inference (see the config sketch after this summary).
  • Behavior

    • Modifiers can be preloaded, applied and optionally cached during data loading; their outputs adjust energies/forces/virials during training and inference.
    • Frozen model export/import preserves embedded modifiers.
  • Tests

    • End-to-end tests covering zeroing, scaling, deterministic/random modifiers and frozen-model inference.
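
For orientation, a hedged sketch of where such a modifier might be configured. The factory reads a "type" field from the model parameters, so a training input could carry a section along these lines; every key besides "type" is an assumption:

# Hypothetical fragment of the training input's "model" section
model_params = {
    "descriptor": ...,  # usual sections, unchanged
    "fitting_net": ...,
    "modifier": {
        "type": "zero_tester",  # a name registered via BaseModifier.register
    },
}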


Contributor

coderabbitai bot commented Mar 20, 2025

📝 Walkthrough

Adds a data-modifier subsystem (BaseModifier + factory), integrates modifiers into DeepmdData/datasets/loaders with optional caching and preload, propagates modifiers through training, inference, freezing (.pth extra file), updates ModelWrapper to apply modifiers at inference, and adds unit tests.

Changes

Cohort / File(s) Summary
Modifier package
deepmd/pt/modifier/__init__.py
New factory get_data_modifier(...) exported; __all__ includes BaseModifier and get_data_modifier.
BaseModifier implementation
deepmd/pt/modifier/base_modifier.py
New BaseModifier (torch.nn.Module) with abstract forward, serialize/deserialize, and modify_data helper applying/subtracting modifier outputs.
Deepmd data core
deepmd/utils/data.py
Add use_modifier_cache, _modified_frame_cache; change get_item_*/get_single_frame to accept num_worker; apply modifier (threaded), cache results, and add preload_and_modify_all_data_torch(num_worker) (sketched after this table).
Dataset & DataLoader wiring
deepmd/pt/utils/dataset.py, deepmd/pt/utils/dataloader.py
Datasets/loader constructors accept optional modifier and forward it; DpLoaderSet.preload_and_modify_all_data_torch() added; propagate NUM_WORKERS to item calls.
Training preparation
deepmd/pt/train/training.py
Attach per-model data requirements and call preload_and_modify_all_data_torch() on training/validation datasets; removed _validation_data and _data_requirement from single_model_stat signature/usage.
Entrypoint & packaging
deepmd/pt/entrypoints/main.py
Import and instantiate modifier via get_data_modifier, pass to loader construction, and serialize modifier into scripted model extra_files as data_modifier.pth when freezing.
Model wrapper / runtime
deepmd/pt/train/wrapper.py
ModelWrapper accepts/stores optional modifier; in inference-only flow computes modifier outputs and combines them with model outputs.
Inference loading & runtime
deepmd/pt/infer/deep_eval.py, deepmd/pt/infer/inference.py
.pth loader reads data_modifier.pth extra file and reconstructs modifier (in-memory); DeepEval/Tester store/use loaded/jitable modifiers.
Tests
source/tests/pt/test_data_modifier.py
New tests: three modifier implementations (random, zero, scaling), plugin registrations, and unit tests covering modification, caching, single-application guarantees, and inference integration.
Paddle utils update
deepmd/pd/utils/dataset.py
Propagate NUM_WORKERS to paddle dataset call sites (get_item_paddle(index, max(1, NUM_WORKERS))).
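
The "Deepmd data core" row above is the heart of the data-path change; below is a self-contained sketch of the per-frame caching flow it describes. The attribute and method names (use_modifier_cache, _modified_frame_cache, get_single_frame, modify_data) come from this table, while the body is illustrative rather than the actual implementation:

import copy
from concurrent.futures import ThreadPoolExecutor


class FrameCacheSketch:
    """Toy stand-in for DeepmdData illustrating the assumed caching flow."""

    def __init__(self, modifier=None, use_cache=True) -> None:
        self.modifier = modifier
        self.use_modifier_cache = use_cache and modifier is not None
        self._modified_frame_cache = {}

    def _load_frame(self, index):
        return {"index": index}  # placeholder for real frame loading

    def get_single_frame(self, index, num_worker=1):
        if self.use_modifier_cache and index in self._modified_frame_cache:
            # Serve a deep copy so callers cannot mutate the cached frame.
            return copy.deepcopy(self._modified_frame_cache[index])
        frame = self._load_frame(index)
        if self.modifier is not None:
            # Run modify_data in a worker thread; .result() re-raises errors.
            with ThreadPoolExecutor(max_workers=max(1, num_worker)) as ex:
                ex.submit(self.modifier.modify_data, frame, self).result()
            if self.use_modifier_cache:
                self._modified_frame_cache[index] = frame
        return frame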

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor CLI as Entrypoint
  participant Factory as get_data_modifier
  participant Loader as DpLoaderSet / Dataset
  participant Data as DeepmdData
  participant Trainer as ModelWrapper
  participant Freezer as Script Export (.pth)
  participant Eval as DeepEval

  CLI->>Factory: request modifier from model params
  Factory-->>CLI: BaseModifier instance (jitable?)
  CLI->>Loader: construct loaders with modifier
  Loader->>Data: attach modifier to dataset
  Loader->>Data: preload_and_modify_all_data_torch(num_worker)
  Data->>Data: apply modifier per-frame (threaded) and cache results
  CLI->>Trainer: instantiate ModelWrapper with modifier
  Trainer->>Trainer: compute model_pred
  Trainer->>Trainer: compute modifier_pred and combine outputs
  CLI->>Freezer: freeze -> serialize model + extra_files["data_modifier.pth"]
  Eval->>Freezer: load .pth (with extra_files)
  Freezer-->>Eval: provide `data_modifier.pth` bytes
  Eval->>Factory: torch.jit.load(bytes) -> modifier instance
  Eval->>Trainer: instantiate ModelWrapper with loaded modifier

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • njzjz
  • wanghan-iapcm
  • iProzd

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 66.10%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check ✅ Passed: the title 'feat(pt): add plugin for data modifier' accurately and concisely describes the main change in this PR, adding a data modifier plugin system to the PyTorch implementation.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (2)

41-44: Consider simplifying the box assignment.
A ternary operator can reduce verbosity here:

- if data["box"] is None:
-     box = None
- else:
-     box = data["box"][:get_nframes, :]
+ box = None if data["box"] is None else data["box"][:get_nframes, :]


47-47: Remove or use the nframes variable.
Currently, nframes = coord.shape[0] is not used, which may confuse future maintainers.


📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between baaaa17 and e6bca31.

📒 Files selected for processing (4)
  • deepmd/pt/modifier/__init__.py (1 hunks)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/pt/train/training.py (7 hunks)
  • deepmd/pt/utils/stat.py (1 hunks)
🧰 Additional context used
🧬 Code Definitions (2)
deepmd/pt/modifier/__init__.py (1)
deepmd/pt/modifier/base_modifier.py (1) (1)
  • BaseModifier (9-56)
deepmd/pt/train/training.py (3)
deepmd/pt/modifier/base_modifier.py (2) (2)
  • BaseModifier (9-56)
  • modify_data (14-56)
deepmd/pd/train/training.py (1) (1)
  • get_additional_data_requirement (1163-1187)
deepmd/pd/utils/stat.py (1) (1)
  • make_stat_input (40-85)
🪛 Ruff (0.8.2)
deepmd/pt/modifier/base_modifier.py

41-44: Use ternary operator box = None if data["box"] is None else data["box"][:get_nframes, :] instead of if-else-block

Replace if-else-block with box = None if data["box"] is None else data["box"][:get_nframes, :]

(SIM108)


47-47: Local variable nframes is assigned to but never used

Remove assignment to unused variable nframes

(F841)

⏰ Context from checks skipped due to timeout of 90000ms (29)
🔇 Additional comments (11)
deepmd/pt/modifier/__init__.py (1)

1-8: Good job exposing the BaseModifier API.
This new __init__.py cleanly re-exports the BaseModifier class and ensures users can import it directly from deepmd.pt.modifier.

deepmd/pt/utils/stat.py (2)

50-51: Conditional logging is well-handled.
Only logging when nbatches > 0 helps keep logs cleaner in scenarios where no batches are processed.


56-59: Logic for handling nbatches == -1 is clear and correct.
This new condition ensures the entire dataset is used when nbatches is -1. No issues found.

deepmd/pt/train/training.py (8)

39-41: Import of BaseModifier is appropriate.
This import makes the newly introduced functionality available where needed.


140-149: Modifier parameter handling is well-structured.
The assertion preventing usage in multi-task scenarios is clear and avoids incompatible configurations.


231-231: Defaulting modifier to None is appropriate.
Makes the modifier usage optional without complicating the training interface.


239-250: Verify data modification logic.
Applying modifier.modify_data to every system might lead to repeated transformations if single_model_stat is called multiple times. Confirm this matches your intended workflow.


345-345: Single-model signature usage is consistent.
Passing modifier=self.modifier ensures the same modifier instance is applied throughout the training flow.


384-384: Multi-task signature usage is consistent.
Again, passing modifier=self.modifier allows uniform data processing across tasks if needed.


1075-1081: Storing the data_modifier state is a good idea.
Consider providing a loading mechanism in the future so that data_modifier can be restored automatically.


1389-1400: Factory function for data modifiers looks good.
Encapsulates logic for dynamically obtaining modifier classes, making the code more extensible.

@codecov
Copy link

codecov bot commented Mar 20, 2025

Codecov Report

❌ Patch coverage is 42.04852% with 215 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.94%. Comparing base (fe1662d) to head (352c149).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
source/tests/pt/test_data_modifier.py 3.46% 195 Missing ⚠️
deepmd/pt/modifier/base_modifier.py 72.58% 17 Missing ⚠️
deepmd/pt/modifier/__init__.py 81.81% 2 Missing ⚠️
deepmd/utils/data.py 96.42% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4661      +/-   ##
==========================================
- Coverage   82.15%   81.94%   -0.21%     
==========================================
  Files         709      712       +3     
  Lines       72468    72824     +356     
  Branches     3616     3615       -1     
==========================================
+ Hits        59535    59679     +144     
- Misses      11769    11982     +213     
+ Partials     1164     1163       -1     


Collaborator

@wanghan-iapcm wanghan-iapcm left a comment


please add UT for the implementation.

@wanghan-iapcm wanghan-iapcm requested review from iProzd and njzjz March 21, 2025 03:53
@njzjz njzjz linked an issue Mar 21, 2025 that may be closed by this pull request
@njzjz njzjz requested a review from Copilot March 21, 2025 17:19
Contributor

Copilot AI left a comment


Pull Request Overview

This PR introduces a new plugin for modifying data in the PyTorch backend and integrates it into the training workflow. Key changes include:

  • Creation of a new BaseModifier class and registration in the modifier package.
  • Integration of the data modifier into the training process, including saving its state.
  • Minor adjustments to the statistics data preparation in the utils module.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
deepmd/pt/modifier/__init__.py Exposes BaseModifier for external use.
deepmd/pt/modifier/base_modifier.py Adds a new BaseModifier class for data modification.
deepmd/pt/train/training.py Integrates the data modifier into training data preparation and model saving.
deepmd/pt/utils/stat.py Tweaks logging and batch calculation in the statistics utility.
Comments suppressed due to low confidence (2)

deepmd/pt/modifier/base_modifier.py:9

  • Ensure that make_base_modifier() returns a valid class to use for multiple inheritance with torch.nn.Module. If it does not, consider revising the inheritance structure or renaming for clarity.
class BaseModifier(torch.nn.Module, make_base_modifier()):

deepmd/pt/modifier/base_modifier.py:40

  • The variable get_nframes is explicitly set to None, which will slice the full array; if a limit on the number of frames was intended, assign get_nframes an appropriate value.
coord = data["coord"][:get_nframes, :]

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
deepmd/pt/utils/dataset.py (1)

74-75: Remove or implement the commented-out clear_modified_frame_cache method.

Similar to the dataloader, this commented-out method suggests incomplete implementation. Either implement the functionality if needed, or remove the comment to reduce clutter.

This is the same issue as in deepmd/pt/utils/dataloader.py lines 245-247. Consider a unified approach to cache management across both files.

deepmd/pt/modifier/base_modifier.py (1)

50-50: Remove or use the commented-out nframes variable.

The variable nframes is calculated but immediately commented out and never used. Either remove the comment and use the variable, or delete the line entirely to reduce clutter.

Based on learnings, if this is intentionally kept for future use, please clarify with a TODO comment.

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (1)

17-17: Clarify the purpose of the unused data_sys parameter.

The data_sys parameter is flagged as unused by static analysis. If it's intended for future use or required by subclasses, consider adding a docstring note or a TODO comment. Otherwise, consider removing it to reduce the parameter surface.

deepmd/pt/utils/dataloader.py (1)

245-247: Remove or implement the commented-out clear_modified_frame_cache method.

The commented-out method suggests incomplete implementation. Either implement the cache-clearing functionality if it's needed, or remove the comment to keep the codebase clean.

If cache clearing is required in the future, consider opening an issue to track the feature.

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5369e80 and 859292f.

📒 Files selected for processing (6)
  • deepmd/pt/entrypoints/main.py (4 hunks)
  • deepmd/pt/modifier/__init__.py (1 hunks)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/pt/train/training.py (2 hunks)
  • deepmd/pt/utils/dataloader.py (4 hunks)
  • deepmd/pt/utils/dataset.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/modifier/__init__.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (4)
deepmd/pt/train/training.py (3)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (498-513)
deepmd/pt/entrypoints/main.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-60)
deepmd/utils/plugin.py (1)
  • get_class_by_type (144-154)
deepmd/pt/utils/dataloader.py (1)
  • DpLoaderSet (65-247)
deepmd/pt/utils/dataset.py (2)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-60)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
deepmd/pt/utils/dataloader.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-60)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (498-513)
🪛 Ruff (0.14.8)
deepmd/pt/entrypoints/main.py

116-116: Avoid specifying long messages outside the exception class

(TRY003)

deepmd/pt/modifier/base_modifier.py

17-17: Unused method argument: data_sys

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (9)
deepmd/pt/entrypoints/main.py (3)

111-121: LGTM: Data modifier instantiation is well-structured.

The get_data_modifier function correctly validates the presence of the type field, instantiates the modifier using the plugin system, and moves it to the appropriate device.


129-133: LGTM: Modifier wiring integrates cleanly.

The modifier is correctly extracted from model parameters and instantiated only when present, maintaining backward compatibility for configurations without modifiers.


164-180: LGTM: Modifier is correctly passed to both training and validation data loaders.

The modifier is properly threaded through to both DpLoaderSet instances, ensuring consistent data preprocessing across training and validation.

deepmd/pt/train/training.py (2)

353-355: LGTM: Preload and modify step correctly positioned.

The preload_and_modify_all_data() calls are correctly placed after data requirements are added by single_model_stat and before data loaders are constructed. This ensures the modifier operates on the complete, configured data before training begins.


394-396: LGTM: Multi-model path mirrors single-model preload logic.

The multi-model code path correctly applies preload_and_modify_all_data() to each model's training and validation data, maintaining consistency with the single-model approach.

deepmd/pt/utils/dataloader.py (2)

82-102: LGTM: Modifier parameter correctly threaded through DpLoaderSet.

The modifier parameter is properly added to the constructor signature, stored, and forwarded to each DeepmdDataSetForLoader instance. The optional typing (BaseModifier | None) maintains backward compatibility.


241-243: LGTM: Preload method delegates correctly.

The preload_and_modify_all_data() method properly iterates over all systems and delegates to each system's preload method, enabling batch preprocessing before training.

deepmd/pt/utils/dataset.py (2)

22-40: LGTM: Modifier integration is clean and complete.

The modifier parameter is properly added to the constructor, documented, stored as an instance attribute, and forwarded to the underlying DeepmdData instance. This enables the data modification pipeline while maintaining a clear separation of concerns.


71-72: LGTM: Preload method correctly delegates to the data system.

The preload_and_modify_all_data() method properly delegates to self._data_system.preload_and_modify_all_data(), maintaining the abstraction layers and enabling efficient batch preprocessing.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
deepmd/pt/modifier/base_modifier.py (2)

43-43: Clarify or remove the get_nframes pattern.

Setting get_nframes = None and then using it in slicing operations (e.g., [:get_nframes, :]) is equivalent to [:None, :] which means "all frames". This pattern appears to be a placeholder for future functionality but currently adds no value and may confuse readers.

If this is intentional for future extension, consider adding a comment explaining the planned usage. Otherwise, simplify to direct slicing:

-        get_nframes = None
-        coord = data["coord"][:get_nframes, :]
+        coord = data["coord"]

54-59: Consider adding shape validation before reshape.

The code assumes that the outputs from forward() are compatible with the target shapes in data. If a subclass returns tensors with incompatible shapes, the reshape() operations will fail with potentially unclear error messages.

Consider adding assertions to provide clearer error messages:

if "find_energy" in data and data["find_energy"] == 1.0:
    expected_shape = data["energy"].shape
    assert tot_e.numel() == np.prod(expected_shape), \
        f"Energy shape mismatch: forward returned {tot_e.shape}, expected {expected_shape}"
    data["energy"] -= tot_e.reshape(expected_shape)
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 859292f and c69102c.

📒 Files selected for processing (4)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/pt/utils/dataloader.py (4 hunks)
  • deepmd/pt/utils/dataset.py (2 hunks)
  • source/tests/pt/test_data_modifier.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (3)
deepmd/pt/utils/dataset.py (2)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-59)
deepmd/utils/data.py (2)
  • DeepmdData (34-1069)
  • preload_and_modify_all_data (498-513)
deepmd/pt/modifier/base_modifier.py (1)
deepmd/dpmodel/modifier/base_modifier.py (1)
  • make_base_modifier (17-77)
deepmd/pt/utils/dataloader.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (12-59)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data (498-513)
🪛 Ruff (0.14.8)
source/tests/pt/test_data_modifier.py

81-81: Unused method argument: data_sys

(ARG002)


141-141: Unused method argument: data_sys

(ARG002)

deepmd/pt/modifier/base_modifier.py

17-17: Unused method argument: data_sys

(ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (9)
deepmd/pt/modifier/base_modifier.py (2)

1-15: LGTM!

The imports and class structure are clean. The BaseModifier correctly inherits from torch.nn.Module and the dpmodel base class.


36-41: Good defensive check.

The early return when no find_* flags are present avoids unnecessary computation.

source/tests/pt/test_data_modifier.py (3)

28-35: LGTM!

The plugin registrations for test modifiers are correctly structured.


159-173: Good test setup.

The test configuration properly loads a base config and customizes training parameters for testing.


175-223: Excellent test coverage.

The tests effectively validate:

  1. That data modification is applied during initialization (zero_tester)
  2. That modification happens exactly once and persists through training (random_tester)

This covers the key behaviors of the modifier system.

deepmd/pt/utils/dataloader.py (2)

28-30: Clean integration of modifier support.

The modifier parameter is properly typed, documented in the docstring (implicitly through the system parameter), and correctly propagated to the dataset constructor.

Also applies to: 82-102


241-243: LGTM!

The preload_and_modify_all_data method correctly delegates to each underlying system, providing a convenient way to preload and modify all data in the loader set.

deepmd/pt/utils/dataset.py (2)

12-14: Excellent integration of modifier support.

The modifier parameter is properly:

  • Typed with BaseModifier | None
  • Documented in the docstring
  • Stored as an instance variable
  • Propagated to the DeepmdData constructor

The implementation is clean and consistent with the overall modifier pattern.

Also applies to: 22-44


71-72: LGTM!

The preload_and_modify_all_data method correctly delegates to the underlying data system, maintaining a clean separation of concerns.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
deepmd/pt/modifier/base_modifier.py (1)

93-107: Clarify the obscure get_nframes = None usage.

Line 93 sets get_nframes = None and then uses it in slicing operations like data["coord"][:get_nframes, :]. When get_nframes is None, the slice [:None, :] is equivalent to [:, :] (selecting all frames).

This pattern is intentional but not immediately obvious to readers.

Consider adding a comment to clarify:

-        get_nframes = None
+        get_nframes = None  # None in slice [:None] means select all frames
         coord = data["coord"][:get_nframes, :]

Or use a more explicit pattern:

-        get_nframes = None
-        coord = data["coord"][:get_nframes, :]
-        atype = data["atype"][:get_nframes, :]
+        # Process all frames in the batch
+        coord = data["coord"]
+        atype = data["atype"]
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c69102c and d804866.

📒 Files selected for processing (3)
  • deepmd/pt/modifier/base_modifier.py (1 hunks)
  • deepmd/utils/data.py (3 hunks)
  • source/tests/pt/test_data_modifier.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (66-117)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data (71-72)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data (241-243)
🪛 Ruff (0.14.8)
source/tests/pt/test_data_modifier.py

52-55, 89-92: Unused method arguments: atype, box, fparam, aparam (ARG002)

59, 96: Unused method argument: data_sys (ARG002)

deepmd/pt/modifier/base_modifier.py

66: Unused method argument: data_sys (ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (8)
deepmd/utils/data.py (3)

142-147: LGTM! Safe initialization of modifier caching.

The caching mechanism is initialized correctly with appropriate guards. The _modified_frame_cache is only created when a modifier exists, and all cache accesses in get_single_frame properly check self.modifier is not None before accessing the cache.


387-394: LGTM! Efficient cache lookup with proper guards.

The early return from cache is well-guarded with all necessary conditions and provides a good optimization for repeated frame access.


498-514: LGTM! Well-implemented preload method with progress logging.

The preload method correctly handles early returns, efficiently skips already-cached frames, and provides useful progress feedback. The implementation aligns well with the broader data loading pipeline shown in the related files.

deepmd/pt/modifier/base_modifier.py (1)

1-65: LGTM! Well-structured base class with proper abstractions.

The class correctly inherits from torch.nn.Module and uses @abstractmethod to enforce subclass implementation of forward(). The serialization methods follow standard patterns.

source/tests/pt/test_data_modifier.py (4)

29-36: LGTM! Plugin registrations correctly implemented.

Both modifier plugins are properly registered with the modifier_args_plugin registry and return empty argument lists, which is appropriate for these test modifiers.


39-73: LGTM! Test modifier correctly implements custom logic.

The ModifierRandomTester class correctly:

  • Sets modifier_type = "random_tester" matching its registration name (line 47)
  • Overrides modify_data with custom randomization logic (lines 59-73)
  • The forward method returns minimal data since modify_data is completely overridden and never calls the base implementation

The unused parameters flagged by static analysis (atype, box, fparam, aparam in forward, and data_sys in modify_data) are required by the base class interface but not used in this simple test implementation. This is acceptable for test code.


76-110: LGTM! Test modifier correctly implements zeroing behavior.

The ModifierZeroTester class correctly sets modifier_type = "zero_tester" matching its registration and implements the expected zeroing logic in modify_data. Like ModifierRandomTester, it completely overrides the base modify_data implementation, so the minimal forward method is acceptable.


113-187: LGTM! Comprehensive test coverage for modifier integration.

The test cases effectively verify:

  1. test_init_modify_data - confirms that zero_tester correctly zeros out training and validation data (lines 130-147)
  2. test_full_modify_data - validates that random_tester produces consistent results before and after trainer.run(), which tests the caching mechanism (lines 149-178)

The tests cover the critical integration points between modifiers, data loading, and the training pipeline. The cleanup in tearDown is thorough.

- Add data modifier support in model inference pipeline
- Enable saving and loading data modifiers with frozen models
- Add ModifierScalingTester for scaling model predictions as data modification
- Update test cases to verify data modifier functionality in inference
- Enhance modifier argument registration with documentation

This allows data modifiers to be applied during model inference and preserves them when saving frozen models, giving consistent behavior across training and inference (see the extra-files sketch below).
…to save the data modification before training or to perform modification on-the-fly.
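
For reference, a minimal sketch of the TorchScript extra-files mechanism these commits build on. torch.jit.save and torch.jit.load with _extra_files are standard PyTorch APIs, and the file name data_modifier.pth comes from this PR; the toy module is a stand-in for the real scripted model and modifier:

import io

import torch


class _Toy(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x


scripted_model = torch.jit.script(_Toy())
scripted_modifier = torch.jit.script(_Toy())  # stand-in for a jitable modifier

# Freezing: serialize the modifier into an in-memory buffer and embed the
# bytes in the frozen model as an extra file.
buffer = io.BytesIO()
torch.jit.save(scripted_modifier, buffer)
torch.jit.save(
    scripted_model,
    "frozen_model.pth",
    _extra_files={"data_modifier.pth": buffer.getvalue()},
)

# Loading: request the extra file back; torch fills the dict value in place.
extra_files = {"data_modifier.pth": ""}
model = torch.jit.load("frozen_model.pth", _extra_files=extra_files)
if extra_files["data_modifier.pth"]:
    modifier = torch.jit.load(io.BytesIO(extra_files["data_modifier.pth"]))
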
@ChiahsinChu ChiahsinChu force-pushed the devel-modifier-plugin branch from f402bab to d4919c3 on January 6, 2026 09:11
@ChiahsinChu ChiahsinChu requested a review from iProzd January 6, 2026 09:12
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/pt/train/wrapper.py (1)

191-194: Consider validating modifier prediction keys.

The code assumes all keys in modifier_pred exist in model_pred. If a modifier returns unexpected keys, this will raise a KeyError at line 194. While this may be acceptable for controlled scenarios, consider adding a defensive check if robustness is preferred.

🔎 Optional defensive check
 if self.modifier is not None:
     modifier_pred = self.modifier(**input_dict)
     for k, v in modifier_pred.items():
-        model_pred[k] = model_pred[k] + v
+        if k in model_pred:
+            model_pred[k] = model_pred[k] + v
source/tests/pt/test_data_modifier.py (1)

96-97: Note: __new__ methods appear redundant but may be required.

All three test modifier classes define __new__ methods that simply call super().__new__(cls). While this seems redundant, it may be required by the plugin registration system. If not needed, these methods can be removed to simplify the code.

Also applies to: 142-143, 184-185

📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35b7d84 and f402bab.

📒 Files selected for processing (2)
  • deepmd/pt/train/wrapper.py
  • source/tests/pt/test_data_modifier.py
🧰 Additional context used
🧬 Code graph analysis (1)
source/tests/pt/test_data_modifier.py (5)
deepmd/pt/infer/deep_eval.py (2)
  • DeepEval (87-863)
  • eval (321-395)
deepmd/pt/entrypoints/main.py (2)
  • main (543-602)
  • freeze (382-408)
deepmd/pt/modifier/base_modifier.py (3)
  • BaseModifier (30-187)
  • forward (78-87)
  • modify_data (90-187)
deepmd/utils/data.py (2)
  • DeepmdData (35-1079)
  • add (150-203)
deepmd/pt/train/wrapper.py (1)
  • forward (155-205)
🪛 Ruff (0.14.10)
source/tests/pt/test_data_modifier.py

96, 142, 184: Unused static method arguments: args, kwargs (ARG004)

113-118, 155-160: Unused method arguments: coord, atype, box, fparam, aparam, do_atomic_virial (ARG002)

123, 165: Unused method argument: data_sys (ARG002)

🔇 Additional comments (2)
source/tests/pt/test_data_modifier.py (2)

314-384: LGTM - Complex but correct test logic.

This test verifies that a frozen model with an embedded modifier correctly applies the modifier during inference. The test:

  1. Trains a base model and freezes it (used as the modifier source)
  2. Trains a new model with a scaling_tester modifier referencing the base model
  3. Verifies that inference applies the modifier correctly: output = base_output + scaled_modifier_output

The test logic is sound and properly validates the inference path in wrapper.py.


111-121: Note: Unused parameters are expected for interface compliance.

The static analysis warnings about unused parameters in forward() methods are false positives. These parameters are required to match the BaseModifier abstract interface. The random_tester and zero_tester modifiers only modify training data via modify_data() and don't perform inference-time modifications, hence their forward() methods return empty dictionaries.

Also applies to: 153-163

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @deepmd/pt/infer/deep_eval.py:
- Around line 175-189: The .pt loading path creates a local variable modifier
but never sets the instance attribute, causing self.modifier to be undefined
later; change the .pt branch so it assigns the loaded modifier to self.modifier
(not just a local modifier), ensure self.modifier is initialized to None before
the conditional if needed, and pass self.modifier into ModelWrapper
(ModelWrapper(model, modifier=self.modifier)) so both the instance attribute and
wrapper receive the same value when using torch.jit.load with extra_files.
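
A sketch of the fix described above, with names taken from the comment; the exact surrounding code in deep_eval.py is assumed:

import io

import torch


def load_model_with_modifier(model_file):
    """Sketch of the corrected .pth branch (hypothetical free-function form)."""
    extra_files = {"data_modifier.pth": ""}
    model = torch.jit.load(model_file, _extra_files=extra_files)
    modifier = None  # initialized on every path, mirroring self.modifier = None
    if extra_files["data_modifier.pth"]:
        modifier = torch.jit.load(io.BytesIO(extra_files["data_modifier.pth"]))
    # In DeepEval this pair becomes:
    #     self.modifier = modifier
    #     self.dp = ModelWrapper(model, modifier=self.modifier)
    return model, modifier
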
🧹 Nitpick comments (1)
deepmd/pt/utils/dataset.py (1)

74-75: Add docstring for the new public method.

The preload_and_modify_all_data_torch method is a new public API but lacks documentation explaining its purpose, when to call it, and its behavior.

🔎 Proposed docstring
     def preload_and_modify_all_data_torch(self) -> None:
+        """Preload and apply modifier to all frames in the dataset.
+        
+        This method should be called before training to apply any data
+        modifications and optionally cache the results for improved performance.
+        Uses worker threads to avoid CUDA re-initialization issues.
+        """
         self._data_system.preload_and_modify_all_data_torch(max(1, NUM_WORKERS))
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f402bab and d4919c3.

📒 Files selected for processing (12)
  • deepmd/pd/utils/dataset.py
  • deepmd/pt/entrypoints/main.py
  • deepmd/pt/infer/deep_eval.py
  • deepmd/pt/infer/inference.py
  • deepmd/pt/modifier/__init__.py
  • deepmd/pt/modifier/base_modifier.py
  • deepmd/pt/train/training.py
  • deepmd/pt/train/wrapper.py
  • deepmd/pt/utils/dataloader.py
  • deepmd/pt/utils/dataset.py
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • deepmd/pd/utils/dataset.py
  • deepmd/pt/infer/inference.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2024-10-16T21:50:10.680Z
Learnt from: njzjz
Repo: deepmodeling/deepmd-kit PR: 4226
File: deepmd/dpmodel/model/make_model.py:370-373
Timestamp: 2024-10-16T21:50:10.680Z
Learning: In `deepmd/dpmodel/model/make_model.py`, the variable `nall` assigned but not used is intentional and should not be flagged in future reviews.

Applied to files:

  • deepmd/pt/modifier/base_modifier.py
🧬 Code graph analysis (8)
deepmd/pt/utils/dataset.py (4)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (30-187)
deepmd/utils/data.py (3)
  • DataRequirementItem (1082-1162)
  • DeepmdData (35-1079)
  • preload_and_modify_all_data_torch (517-532)
deepmd/pd/utils/dataset.py (1)
  • DeepmdDataSetForLoader (17-56)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data_torch (241-243)
deepmd/pt/train/training.py (2)
deepmd/pd/train/training.py (2)
  • get_additional_data_requirement (1222-1246)
  • single_model_stat (209-239)
deepmd/pt/utils/dataset.py (2)
  • add_data_requirement (58-72)
  • preload_and_modify_all_data_torch (74-75)
deepmd/pt/modifier/base_modifier.py (2)
deepmd/dpmodel/modifier/base_modifier.py (1)
  • make_base_modifier (17-77)
deepmd/utils/data.py (1)
  • DeepmdData (35-1079)
deepmd/pt/utils/dataloader.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (30-187)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data_torch (74-75)
deepmd/utils/data.py (1)
  • preload_and_modify_all_data_torch (517-532)
deepmd/pt/entrypoints/main.py (2)
deepmd/pt/modifier/__init__.py (1)
  • get_data_modifier (17-23)
deepmd/pt/infer/inference.py (1)
  • Tester (28-76)
deepmd/pt/infer/deep_eval.py (1)
deepmd/pt/train/wrapper.py (1)
  • ModelWrapper (16-217)
source/tests/pt/test_data_modifier.py (4)
deepmd/pt/entrypoints/main.py (3)
  • main (543-602)
  • freeze (382-408)
  • get_trainer (100-216)
deepmd/pt/modifier/base_modifier.py (3)
  • BaseModifier (30-187)
  • forward (78-87)
  • modify_data (90-187)
deepmd/utils/data.py (1)
  • DeepmdData (35-1079)
deepmd/pt/train/training.py (1)
  • get_data (1247-1292)
deepmd/pt/modifier/__init__.py (2)
deepmd/pt/modifier/base_modifier.py (1)
  • BaseModifier (30-187)
deepmd/utils/plugin.py (1)
  • get_class_by_type (144-154)
🪛 Ruff (0.14.10)
deepmd/pt/modifier/base_modifier.py

90: Unused method argument: data_sys (ARG002)

163-166, 172-175, 181-184: Avoid specifying long messages outside the exception class (TRY003)

source/tests/pt/test_data_modifier.py

96, 142, 184: Unused static method arguments: args, kwargs (ARG004)

113-118, 155-160: Unused method arguments: coord, atype, box, fparam, aparam, do_atomic_virial (ARG002)

123, 165: Unused method argument: data_sys (ARG002)

deepmd/pt/modifier/__init__.py

22: Avoid specifying long messages outside the exception class (TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (27)
🔇 Additional comments (15)
deepmd/utils/data.py (3)

143-148: LGTM: Modifier cache initialization.

The logic correctly initializes the cache only when a modifier is present and respects the modifier's use_cache preference. This provides flexibility for users to control memory usage.


256-287: LGTM: Consistent method signature updates.

Both get_item_torch and get_item_paddle now accept the num_worker parameter and delegate to get_single_frame. The default value maintains backward compatibility.


517-533: LGTM: Preload method is well-structured.

The method efficiently handles the preloading workflow with appropriate guards (early return when caching is disabled) and helpful progress logging for large datasets.

deepmd/pt/modifier/__init__.py (1)

17-23: LGTM: Factory function is correctly implemented.

The function properly:

  • Creates a defensive copy to prevent parameter mutation
  • Validates the required "type" field
  • Delegates to the plugin registry for instantiation

The static analysis hint (TRY003) about the error message is a minor style preference and doesn't affect functionality.

deepmd/pt/utils/dataloader.py (2)

28-30: LGTM: Clean modifier integration.

The modifier parameter is properly threaded through the data loader initialization with appropriate type hints and backward-compatible defaults.

Also applies to: 89-89, 101-101


241-243: LGTM: Preload delegation is straightforward.

The method correctly delegates preloading to each system in the dataset. The naming is consistent with the underlying API.

deepmd/pt/entrypoints/main.py (3)

118-123: LGTM: Modifier creation is correctly guarded.

The code properly checks for modifier configuration, creates the modifier via the factory function, and ensures it's on the correct device.


158-158: LGTM: Modifier propagation to data loaders.

The modifier is consistently passed to both training and validation data loaders, ensuring uniform data modification during training.

Also applies to: 168-168


387-407: LGTM: Improved modifier serialization using in-memory buffer.

The freeze function now correctly uses io.BytesIO for modifier serialization instead of temporary files, addressing the previous review concern about resource cleanup. This is a cleaner and safer approach.

deepmd/pt/infer/deep_eval.py (1)

2-2: LGTM: Modifier deserialization logic (aside from initialization issue).

The modifier loading implementation correctly:

  • Uses io.BytesIO for in-memory deserialization
  • Checks for modifier presence in extra_files
  • Passes the modifier to ModelWrapper

Once the self.modifier initialization issue is addressed, this implementation will be complete.

Note: This approval is contingent on fixing the self.modifier initialization issue flagged in the separate comment.

Also applies to: 175-188

deepmd/pt/train/wrapper.py (1)

191-194: LGTM! Modifier application logic is correct.

The modifier predictions are correctly added to the model predictions during inference. This aligns with the training-time behavior where modifier outputs are subtracted from labels, ensuring consistent treatment of modifier corrections across training and inference.
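
In symbols, writing m(x) for the modifier output and f(x) for the network: training labels become y - m(x), so the network fits the residual, and inference returns f(x) + m(x), recovering the original target scale. (This reading is pieced together from the review comments, not quoted from the code.)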

deepmd/pt/train/training.py (2)

335-344: Correct ordering of data preparation steps.

The data requirements are properly constructed, attached to datasets, and then all data is preloaded and modified before statistics computation. This ensures that model statistics (mean/std for normalization) are computed on the modified data, which is the correct behavior.


379-390: Correct multi-task data preparation flow.

Each model's data requirements are properly established and data is preloaded/modified before statistics computation, maintaining consistency with the single-task path.

deepmd/pt/modifier/base_modifier.py (1)

161-187: Good error handling for missing modifier outputs.

The method properly validates that the modifier provides all required outputs (energy, force, virial) when their corresponding flags are set, with clear error messages identifying the missing keys.

source/tests/pt/test_data_modifier.py (1)

314-384: Well-structured inference test.

The test properly validates the modifier's behavior during inference by training a model with a modifier and verifying that the final predictions correctly include the scaled modifier contributions. The test setup is comprehensive and the assertions are appropriate.

@ChiahsinChu ChiahsinChu force-pushed the devel-modifier-plugin branch from d4919c3 to 352c149 on January 6, 2026 11:49
if self.modifier is not None:
with ThreadPoolExecutor(max_workers=num_worker) as executor:
# Apply modifier if it exists
future = executor.submit(

Check notice (Code scanning / CodeQL): Unused local variable. Variable future is not used.
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI Agents
In @deepmd/utils/data.py:
- Around line 502-512: The submitted future from ThreadPoolExecutor for
modifier.modify_data is never awaited so exceptions are swallowed; after
submitting (future = executor.submit(self.modifier.modify_data, frame_data,
self)), call future.result() to re-raise any exceptions and ensure the modifier
completed before continuing, then proceed with the existing caching logic that
writes to self._modified_frame_cache when self.use_modifier_cache is True.
- Around line 400-406: The cached frame is currently returned directly from the
modified frame cache (check involving self.use_modifier_cache, self.modifier and
self._modified_frame_cache[index]), allowing callers to mutate the cached numpy
arrays; change the return so you return a deep copy of the cached frame (e.g.,
use a deep copy utility to duplicate nested dicts/arrays) before returning to
preserve cache immutability and prevent in-place modifications from corrupting
the cache.
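
Putting the two fixes together, the relevant stretch of code might end up looking roughly like this; frame_data, index, num_worker, and the cache attributes are taken from the excerpts above, the rest is assumed:

import copy
from concurrent.futures import ThreadPoolExecutor


def get_modified_frame(data_sys, frame_data, index, num_worker=1):
    """Sketch combining both suggested fixes (hypothetical free-function form)."""
    cached = data_sys._modified_frame_cache.get(index)
    if cached is not None:
        # Fix 2: return a deep copy so callers cannot mutate the cache.
        return copy.deepcopy(cached)
    with ThreadPoolExecutor(max_workers=max(1, num_worker)) as executor:
        future = executor.submit(data_sys.modifier.modify_data, frame_data, data_sys)
        future.result()  # Fix 1: re-raise worker exceptions instead of dropping them
    if data_sys.use_modifier_cache:
        data_sys._modified_frame_cache[index] = frame_data
    return frame_data
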
🧹 Nitpick comments (1)
source/tests/pt/test_data_modifier.py (1)

96-97: Remove unnecessary __new__ method.

The __new__ method simply calls super().__new__(cls), which is Python's default behavior. This method can be removed entirely.

🔎 Proposed fix
 @BaseModifier.register("random_tester")
 class ModifierRandomTester(BaseModifier):
-    def __new__(cls, *args, **kwargs):
-        return super().__new__(cls)
-
     def __init__(
📜 Review details

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d4919c3 and 352c149.

📒 Files selected for processing (3)
  • deepmd/pt/train/wrapper.py
  • deepmd/utils/data.py
  • source/tests/pt/test_data_modifier.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/pt/train/wrapper.py
🧰 Additional context used
🧬 Code graph analysis (1)
deepmd/utils/data.py (3)
deepmd/pt/modifier/base_modifier.py (1)
  • modify_data (90-187)
deepmd/pt/utils/dataset.py (1)
  • preload_and_modify_all_data_torch (74-75)
deepmd/pt/utils/dataloader.py (1)
  • preload_and_modify_all_data_torch (241-243)
🪛 Ruff (0.14.10)
source/tests/pt/test_data_modifier.py

96, 142, 184: Unused static method arguments: args, kwargs (ARG004)

113-118, 155-160: Unused method arguments: coord, atype, box, fparam, aparam, do_atomic_virial (ARG002)

123, 165: Unused method argument: data_sys (ARG002)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (28)
🔇 Additional comments (5)
deepmd/utils/data.py (2)

256-270: LGTM! Clean refactoring to centralize frame loading.

Both get_item_torch and get_item_paddle now properly delegate to get_single_frame with the new num_worker parameter, which centralizes the frame loading and modification logic.

Also applies to: 272-287


515-530: LGTM! Preload method correctly caches all frames.

The preload method correctly iterates through all frames, applies modifications, and caches results. Progress logging every 100 frames is reasonable for typical dataset sizes.

source/tests/pt/test_data_modifier.py (3)

64-91: LGTM! Plugin registrations are properly documented.

The three modifier plugin registration functions (modifier_random_tester, modifier_zero_tester, modifier_scaling_tester) are well-structured with clear documentation and appropriate argument specifications.


253-384: Excellent test coverage for modifier functionality.

The three test methods provide comprehensive coverage:

  • test_init_modify_data validates that the zero modifier correctly zeros out training and validation data.
  • test_full_modify_data ensures modification is applied consistently and only once.
  • test_inference performs end-to-end validation with model training, freezing, and scaled predictions.

The parameterization across batch sizes and cache settings strengthens the test suite.


386-401: LGTM! Robust cleanup with proper error handling.

The tearDown method correctly uses try-except blocks to ensure all cleanup attempts are made, even if individual file removals fail. This prevents test artifacts from accumulating.

@njzjz njzjz added this pull request to the merge queue Jan 7, 2026
Merged via the queue into deepmodeling:master with commit dfeba54 Jan 7, 2026
58 checks passed
@ChiahsinChu ChiahsinChu deleted the devel-modifier-plugin branch January 8, 2026 02:04
Development

Successfully merging this pull request may close these issues.

[Feature Request] Data modifier in pytorch
