Standardization of properties #190
-
Another structure to reckon with is the set of intermediate steps within an SCF cycle. I think we should step away from a capture-the-world perspective on file processing. Most intermediate data is not relevant, and only serves to inflate statistics. From a plotting perspective, the only relevant data are the final data (converged or not), as well as the convergence metrics. The latter are easily plotted.
-
@ndaelman-hu Thank you for this very nice summary! I will refer to it in the coming days when I am assessing the schema developments.
-
Minutes from today's meeting on `PhysicalProperty`

Conveying Semantics
Historical Summary
The legacy schema (`run`) conveyed the physical semantics via attribute (quantity or subsection) names and definitions. Plotting was fully controlled by the GUI and the quantities it read in.

The first implementation of `PhysicalProperty` tried to standardize the schema via a semantics that mimics plotting, providing the standard fields `value` (dependent variable) and (independent) `variables`. Metadata was largely left non-standardized, except for some (a) context qualifiers, e.g. `type`, `label`; or (b) organizational metadata, e.g. `name`, `iri`, `is_derived`. The physical semantics were fully denoted by the section itself. It also applied an inverted schema, where some constraints were set via normalization, after the data had been provided.
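As a rough, hypothetical sketch of that first design (the class layout and field shapes are assumptions for illustration in the `nomad.metainfo` style, not the actual definitions):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection

class Variables(MSection):
    # Hypothetical container for one independent variable
    name = Quantity(type=str)
    values = Quantity(type=np.float64, shape=['*'])

class PhysicalProperty(MSection):
    # Dependent data; the open shape covers scalars, arrays, grids, ...
    value = Quantity(type=np.float64, shape=['*'])
    # (a) context qualifiers
    type = Quantity(type=str)
    label = Quantity(type=str)
    # (b) organizational metadata
    name = Quantity(type=str)
    iri = Quantity(type=str)
    is_derived = Quantity(type=bool)
    # Independent variables, later removed from the base section (see below)
    variables = SubSection(sub_section=Variables.m_def, repeats=True)
```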
This inverted schema was removed due to conflicts with `MappingAnnotation`. It also runs counter to how typical type checkers operate.

Likewise, `variables` has been removed and is left up to the user to specify. Indeed, most simple properties are sufficiently specified with `value` and do not require any additional variables. These corrections diverge from the `DatasetTemplate`, which formalizes the relationships between values and variables.

Planning
At this stage, we are looking into a (set of) general-usage template(s) for complex properties. We identified a couple of cases where we have to make a trade-off:

`value` could be shared among several attributes, e.g. eigenvalues, eigenstates, and occupations. These may stand alone in their own sections but, as they share structure and metadata, could also be grouped together; a grouped variant is sketched below. Perhaps this should underpin our organizational principle.

Future implementations should only suggest standardized semantics, e.g. via the documentation. Experience shows that enforcing behavior via normalization is time-consuming to implement, error-prone, and may limit interoperability down the line.
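As mentioned above, a grouped section might look as follows (a minimal, hypothetical sketch; the class name, field names, and shapes are assumptions):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity

class ElectronicEigenstates(MSection):
    # All arrays share the same (sampling point, band) structure,
    # so grouping them avoids repeating that metadata per attribute.
    n_bands = Quantity(type=int)
    eigenvalues = Quantity(type=np.float64, shape=['*', 'n_bands'])
    occupations = Quantity(type=np.float64, shape=['*', 'n_bands'])
```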
Moreover, we prefer to retain a single `PhysicalProperty` base section that handles as many edge cases as possible, i.e. a single-section-definition policy. When generating different templates for each case, we would also need an assembly strategy (e.g. multiple inheritance) for re-using the templates in overlapping use cases, as sketched below. This directive may also apply to subsections that are not `PhysicalProperty` themselves, but serve as subsections within it.
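To illustrate what such an assembly strategy could look like (purely hypothetical template and class names; this assumes sections can be composed via plain multiple inheritance):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection, SectionProxy

class ValueTemplate(MSection):
    # Template carrying the dependent data
    value = Quantity(type=np.float64, shape=['*'])

class DecomposableTemplate(MSection):
    # Template for properties that break down into parts
    contributions = SubSection(
        sub_section=SectionProxy('DecomposableTemplate'), repeats=True
    )

class Energy(ValueTemplate, DecomposableTemplate):
    # An overlapping use case assembled from both templates
    pass
```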
Breaking down Properties

Properties like the DOS, band structure, energies, forces, etc. may be flexibly broken down into subtypes or contributions.
Analogous to the aforementioned approach, we can standardize their attribute name to `contributions` or `components`. There are several caveats, though.
Structuring the Contributions

Each contribution mimics the top-level schema. Data-wise, some attributes will differ, especially those to be accumulated (see below), while others may stay unaltered. Should we fall back to `Reference` for the latter (sketched below)? While this matches NOMAD best practices, it requires modifying the schema, thus violating the single-section-definition policy.

It may also become hard to preemptively identify the unaltered attributes. Or, a property may only differ in so few cases that we still prefer deferring to `Reference` for performance and storage reasons.
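For instance, deferring unaltered attributes to a reference could look like this (a hypothetical sketch; the class names and the choice of fields are assumptions):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, Reference

class ElectronicDOS(MSection):
    # Hypothetical parent property
    energies = Quantity(type=np.float64, shape=['*'])
    value = Quantity(type=np.float64, shape=['*'])

class DOSContribution(MSection):
    # Data that differs per contribution is stored directly ...
    value = Quantity(type=np.float64, shape=['*'])
    # ... while unaltered attributes (here, the energy grid) defer to the
    # parent via a reference, at the cost of modifying the schema.
    parent = Quantity(type=Reference(ElectronicDOS.m_def))
```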
Responsibility Distribution with Superstructure

`outputs` is already a `list[list[PhysicalProperty]]`, where each system refers to an element in the outer list, i.e. an `output`. Systems may be introduced via the hierarchy (i.e. sub-systems), time, or other options. We should clarify which structures are captured by the outer list (using a multi-key?) vs `PhysicalProperty.contributions`. A particular edge case is the atom- / element-decomposed DOS, etc.
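Schematically, re-using the `PhysicalProperty` sketch from above (variable names are illustrative only):

```python
# The outer list indexes systems (sub-systems, time steps, ...);
# each inner list collects the properties reported for that system.
total_energy = PhysicalProperty(name='energy')
subsystem_energy = PhysicalProperty(name='energy', label='subsystem A')

outputs: list[list[PhysicalProperty]] = [
    [total_energy],      # full system
    [subsystem_energy],  # one sub-system
]
```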
Nested vs Flat

Should we allow for tree-like decomposition, or constrain it to a list? To prevent overly deep structures and allow scanning at multiple levels, I'd argue for a list with a multi-key. This, of course, raises the same question of "how to define such a key that may stem from multiple (complex) sources". One option would be to annotate a few quantities and provide a representation method that converts them to `tuple[str, ...]`.
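One possible reading of that option (the annotation mechanism and field names here are assumptions, not an existing API):

```python
from nomad.metainfo import MSection, Quantity

class Contribution(MSection):
    # Hypothetical annotation: which quantities feed the multi-key
    _key_quantities = ('system_label', 'orbital')

    system_label = Quantity(type=str)  # e.g. a sub-system or atom label
    orbital = Quantity(type=str)       # e.g. 's', 'p', 'd'

    def multi_key(self) -> tuple[str, ...]:
        # Flatten the annotated quantities into a scannable key,
        # e.g. ('Fe1', 'd') for the d-projected DOS of atom Fe1
        return tuple(str(getattr(self, q) or '') for q in self._key_quantities)
```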
Normalization Logic

In many cases, the decomposition is actually a partition, and some attributes should accumulate to their top-level counterpart. One example would be the projection of the DOS, or energy contributions, as defined by the Hamiltonian.
It's hard to gauge whether the data supplied matches these criteria. We may denote the distinction via the attribute name: `contributions` when we expect a partition vs `components` otherwise.

For `contributions`, then, we should verify consistency, or fill in the total value when it is missing. The normalization approach is complicated here due to the top-down execution order: this matches the direction of consistency checks, but filling in missing data goes in the opposite direction.

Since the accumulation function can vary (e.g. between `sum` and `mean`), it should be defined by the contribution section and called by the total section, as sketched below. This requirement (superficially) conflicts with the single-section-definition policy.
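A minimal sketch of that division of responsibilities, assuming a self-referential `Energy` section and a plain `normalize` hook (both hypothetical; this deliberately ignores the actual normalization machinery):

```python
import numpy as np
from nomad.metainfo import MSection, Quantity, SubSection, SectionProxy

class Energy(MSection):
    value = Quantity(type=np.float64)
    contributions = SubSection(sub_section=SectionProxy('Energy'), repeats=True)

    # Defined on the contribution side; could be a mean for intensive data
    @staticmethod
    def accumulate(values: list[float]) -> float:
        return float(np.sum(values))

    # Called on the total side, top-down
    def normalize(self) -> None:
        if not self.contributions:
            return
        total = self.accumulate([c.value for c in self.contributions])
        if self.value is None:
            self.value = total  # fill in the missing total
        elif not np.isclose(self.value, total):
            raise ValueError('contributions do not partition the total value')
```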