Refactor DataSampleSchema to use tensors #348

Merged
merged 11 commits into from
Mar 20, 2025

Conversation

laserkelvin

This PR refactors DataSampleSchema to use Annotated abstractions instead of field and model validators, and replaces the shape checking with methods in the new matsciml.datasets.validators module.

The intentions behind these changes are:

  1. Ensure that tensors are used first and foremost. Other array types introduce barriers to usage in the full pipeline, because the data has to be moved to devices, cast to the expected dtypes, etc.
  2. The tensor annotations in the model schema are now more semantically meaningful; e.g. CoordTensor carries more meaning than NDArray['*, 3']. For the user/developer, these effectively "named" tensors in the annotations are likely easier to work with.
  3. Reusable validator workflows: nominally, they can be used with batched data as well.
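To illustrate the pattern described in the list above, here is a minimal, dependency-free sketch. It is not the code from this PR: the real schema works with tensors and its validators live in matsciml.datasets.validators, whereas this sketch uses nested Python lists as a stand-in and runs the checks by hand in a dataclass. The names `CoordArray`, `AtomicNumberArray`, `check_coords`, `check_atomic_numbers`, and `Sample` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Annotated, get_type_hints

# Stand-in for a tensor type: nested lists. This is an assumption made to
# keep the sketch free of third-party dependencies.
Array = list


def check_coords(value: Array) -> Array:
    """Require shape [N, 3]: every row must have exactly three entries."""
    if not all(len(row) == 3 for row in value):
        raise ValueError("coordinates must have shape [N, 3]")
    return value


def check_atomic_numbers(value: Array) -> Array:
    """Require a flat sequence of positive integers."""
    if not all(isinstance(z, int) and z > 0 for z in value):
        raise ValueError("atomic numbers must be positive integers")
    return value


# "Named" annotations: each alias bundles a base type with its validator,
# so the schema reads by meaning rather than by raw shape.
CoordArray = Annotated[Array, check_coords]
AtomicNumberArray = Annotated[Array, check_atomic_numbers]


@dataclass
class Sample:
    """Hypothetical schema; fields are validated through their annotations."""

    coords: CoordArray
    atomic_numbers: AtomicNumberArray

    def __post_init__(self) -> None:
        # Look up each field's Annotated metadata and run the validators.
        hints = get_type_hints(type(self), include_extras=True)
        for name, hint in hints.items():
            for validator in getattr(hint, "__metadata__", ()):
                setattr(self, name, validator(getattr(self, name)))


sample = Sample(
    coords=[[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
    atomic_numbers=[1, 1],
)

# The validators are plain functions, so they can be reused outside the
# schema, for example over each entry of a batch.
batch = [sample.coords, [[1.0, 2.0, 3.0]]]
validated_batch = [check_coords(coords) for coords in batch]
```

Because the checks are attached to the type alias rather than to one schema class, any other structure that annotates a field with `CoordArray` gets the same validation without repeating it.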

@laserkelvin added the data (Issues related to data loading, pipelining, etc.) and code maintenance (Issue/PR for refactors, code clean up, etc.) labels Mar 19, 2025
Collaborator

@smiret-intel smiret-intel left a comment

Looks like a good validation setup for commonly used Tensors. Can you add some brief documentation on the types for future users?

@laserkelvin
Author

> Looks like a good validation setup for commonly used Tensors. Can you add some brief documentation on the types for future users?

They're not in the main pipeline just yet, but I've already started documenting them in the readthedocs

@laserkelvin laserkelvin merged commit 3374292 into IntelLabs:main Mar 20, 2025
2 of 3 checks passed
@laserkelvin laserkelvin deleted the sample-schema-allow-tensor branch March 20, 2025 00:36