Standardization of properties #190
-
Another structure to reckon with is the set of intermediate steps within an SCF cycle. I think we should step away from a capture-the-world perspective on file processing. Most intermediate data is not relevant, and only serves to inflate statistics. From a plotting perspective, the only relevant data are the final data (converged or not), as well as the convergence metrics. The latter are easily plotted.
-
@ndaelman-hu Thank you for this very nice summary! I will refer to it in the coming days when I am assessing the schema developments.
-
Minutes from today's meeting on `PhysicalProperty`

Conveying Semantics
Historical Summary
The legacy schema (`run`) conveyed the physical semantics via attribute (quantity or subsection) names and definitions. Plotting was fully controlled by the GUI and the quantities it read in.

The first implementation of `PhysicalProperty` tried to standardize the schema via a semantics that mimics plotting, providing the standard fields `value` (dependent variable) and (independent) `variables`. Metadata was largely left non-standardized, except for some (a) context qualifiers, e.g. `type`, `label`; or (b) organizational metadata, e.g. `name`, `iri`, `is_derived`. The physical semantics were fully denoted by the section itself. It also applied an inverted schema, where some constraints were set via normalization, after the data had been provided.
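As a rough, hypothetical sketch of that first design (the class layout and field shapes are assumptions for illustration in the `nomad.metainfo` style, not the actual definitions):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection

class Variables(MSection):
    # Hypothetical container for one independent variable
    name = Quantity(type=str)
    values = Quantity(type=np.float64, shape=['*'])

class PhysicalProperty(MSection):
    # Dependent data; the open shape covers scalars, arrays, grids, ...
    value = Quantity(type=np.float64, shape=['*'])
    # (a) context qualifiers
    type = Quantity(type=str)
    label = Quantity(type=str)
    # (b) organizational metadata
    name = Quantity(type=str)
    iri = Quantity(type=str)
    is_derived = Quantity(type=bool)
    # Independent variables, later removed from the base section (see below)
    variables = SubSection(sub_section=Variables.m_def, repeats=True)
```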
This inverted schema was removed due to conflicts with `MappingAnnotation`. It also runs counter to how typical type checkers operate.

Likewise, `variables` has been removed and is left up to the user to specify. Indeed, most simple properties are sufficiently specified with `value` and do not require any additional variables. These corrections diverge from the `DatasetTemplate`, which formalizes the relationships between values and variables.

Planning
At this stage, we are looking into a (set of) general-usage template(s) for complex properties. We identified a couple of cases where we have to make a trade-off:

`value` could be shared among several attributes, e.g. eigenvalues, eigenstates, and occupations. These may stand alone in their own sections but, as they share structure and metadata, could also be grouped together; a grouped variant is sketched below. Perhaps this should underpin our organizational principle.

Future implementations should only suggest standardized semantics, e.g. via the documentation. Experience shows that enforcing behavior via normalization is time-consuming to implement, error-prone, and may limit interoperability down the line.
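As mentioned above, a grouped section might look as follows (a minimal, hypothetical sketch; the class name, field names, and shapes are assumptions):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity

class ElectronicEigenstates(MSection):
    # All arrays share the same (sampling point, band) structure,
    # so grouping them avoids repeating that metadata per attribute.
    n_bands = Quantity(type=int)
    eigenvalues = Quantity(type=np.float64, shape=['*', 'n_bands'])
    occupations = Quantity(type=np.float64, shape=['*', 'n_bands'])
```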
Moreover, we prefer to retain a single `PhysicalProperty` base section that handles as many edge cases as possible, i.e. a single-section-definition policy. When generating different templates for each case, we would also need an assembly strategy (e.g. multiple inheritance) for re-using the templates in overlapping use cases, as sketched below. This directive may also apply to subsections that are not `PhysicalProperty` themselves, but serve as subsections within it.
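To illustrate what such an assembly strategy could look like (purely hypothetical template and class names; this assumes sections can be composed via plain multiple inheritance):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection, SectionProxy

class ValueTemplate(MSection):
    # Template carrying the dependent data
    value = Quantity(type=np.float64, shape=['*'])

class DecomposableTemplate(MSection):
    # Template for properties that break down into parts
    contributions = SubSection(
        sub_section=SectionProxy('DecomposableTemplate'), repeats=True
    )

class Energy(ValueTemplate, DecomposableTemplate):
    # An overlapping use case assembled from both templates
    pass
```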
Breaking down Properties

Properties like the DOS, band structure, energies, forces, etc. may be flexibly broken down into subtypes or contributions.
Analogous to the aforementioned approach, we can standardize their attribute name to `contributions` or `components`. There are several caveats, though.
Structuring the Contributions

Each contribution mimics the top-level schema. Data-wise, some attributes will differ, especially those to be accumulated (see below), while others may stay unaltered. Should we fall back to `Reference` for the latter (sketched below)? While this matches NOMAD best practices, it requires modifying the schema, thus violating the single-section-definition policy.

It may also become hard to preemptively identify the unaltered attributes. Or, a property may only differ in so few cases that we still prefer deferring to `Reference` for performance and storage reasons.
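For instance, deferring unaltered attributes to a reference could look like this (a hypothetical sketch; the class names and the choice of fields are assumptions):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, Reference

class ElectronicDOS(MSection):
    # Hypothetical parent property
    energies = Quantity(type=np.float64, shape=['*'])
    value = Quantity(type=np.float64, shape=['*'])

class DOSContribution(MSection):
    # Data that differs per contribution is stored directly ...
    value = Quantity(type=np.float64, shape=['*'])
    # ... while unaltered attributes (here, the energy grid) defer to the
    # parent via a reference, at the cost of modifying the schema.
    parent = Quantity(type=Reference(ElectronicDOS.m_def))
```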
Responsibility Distribution with Superstructure

`outputs` is already a `list[list[PhysicalProperty]]`, where each system refers to an element in the outer list, i.e. an `output`. Systems may be introduced via the hierarchy (i.e. sub-systems), time, or other options. We should clarify which structures are captured by the outer list (using a multi-key?) vs `PhysicalProperty.contributions`. A particular edge case is the atom- / element-decomposed DOS, etc.
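Schematically, re-using the `PhysicalProperty` sketch from above (variable names are illustrative only):

```python
# The outer list indexes systems (sub-systems, time steps, ...);
# each inner list collects the properties reported for that system.
total_energy = PhysicalProperty(name='energy')
subsystem_energy = PhysicalProperty(name='energy', label='subsystem A')

outputs: list[list[PhysicalProperty]] = [
    [total_energy],      # full system
    [subsystem_energy],  # one sub-system
]
```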
Nested vs Flat

Should we allow for tree-like decomposition, or constrain it to a list? To prevent overly deep structures and allow scanning at multiple levels, I'd argue for a list with a multi-key. This, of course, raises the same question of "how to define such a key that may stem from multiple (complex) sources". One option would be to annotate a few quantities and provide a representation method that converts them to `tuple[str, ...]`.
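One possible reading of that option (the annotation mechanism and field names here are assumptions, not an existing API):

```python
from nomad.metainfo import MSection, Quantity

class Contribution(MSection):
    # Hypothetical annotation: which quantities feed the multi-key
    _key_quantities = ('system_label', 'orbital')

    system_label = Quantity(type=str)  # e.g. a sub-system or atom label
    orbital = Quantity(type=str)       # e.g. 's', 'p', 'd'

    def multi_key(self) -> tuple[str, ...]:
        # Flatten the annotated quantities into a scannable key,
        # e.g. ('Fe1', 'd') for the d-projected DOS of atom Fe1
        return tuple(str(getattr(self, q) or '') for q in self._key_quantities)
```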
Normalization Logic

In many cases, the decomposition is actually a partition, and some attributes should accumulate to their top-level counterpart. One example would be the projection of the DOS, or energy contributions, as defined by the Hamiltonian.
It's hard to gauge whether the data supplied matches these criteria. We may denote the distinction via the attribute name: `contributions` when we expect a partition vs `components` otherwise.

For `contributions`, then, we should verify consistency, or fill in the total value when it is missing. The normalization approach is complicated here due to the top-down execution order: this matches the direction of consistency checks, but filling in missing data goes in the opposite direction.

Since the accumulation function can vary (e.g. between `sum` and `mean`), it should be defined by the contribution section and called by the total section, as sketched below. This requirement (superficially) conflicts with the single-section-definition policy.
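A minimal sketch of that division of responsibilities, assuming a self-referential `Energy` section and a plain `normalize` hook (both hypothetical; this deliberately ignores the actual normalization machinery):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection, SectionProxy

class Energy(MSection):
    value = Quantity(type=np.float64)
    contributions = SubSection(sub_section=SectionProxy('Energy'), repeats=True)

    # Defined on the contribution side; could be a mean for intensive data
    @staticmethod
    def accumulate(values: list[float]) -> float:
        return float(np.sum(values))

    # Called on the total side, top-down
    def normalize(self) -> None:
        if not self.contributions:
            return
        total = self.accumulate([c.value for c in self.contributions])
        if self.value is None:
            self.value = total  # fill in the missing total
        elif not np.isclose(self.value, total):
            raise ValueError('contributions do not partition the total value')
```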