
provenance template #4118

@bouweandela

As part of developing a provenance recording solution for the REF (Climate-REF/climate-ref#26 and Climate-REF/climate-ref#162), @aspinuso suggested that we write down a provenance template describing how ESMValCore records provenance, to facilitate discussion (see here for an introduction and here and here for IPCC examples). A general introduction to ESMValCore provenance is available here.

Here is a first attempt at a provenance template in PROV-N format:

document
  prefix var <http://openprovenance.org/var#>
  prefix attribute <https://www.esmvaltool.org/attribute>
  prefix preprocessor <https://www.esmvaltool.org/preprocessor>

  activity(var:diagnosticTask, -, -)
  activity(var:preprocessingTask, -, -)
  activity(var:software, -, -)
  agent(
    var:diagnosticAuthor,
    [
      attribute:email='var:emailDiagnosticAuthor',
      attribute:orcid='var:orcidDiagnosticAuthor',
      attribute:github='var:githubDiagnosticAuthor',
      attribute:institute='var:instituteDiagnosticAuthor'
    ]
  )
  agent(
    var:recipeAuthor,
    [
      attribute:email='var:emailRecipeAuthor',
      attribute:orcid='var:orcidRecipeAuthor',
      attribute:github='var:githubRecipeAuthor',
      attribute:institute='var:instituteRecipeAuthor'
    ]
  )
  agent(var:project)
  entity(
    var:inputFile,
    [
      attribute:Conventions='var:inputFileConventions',
      attribute:branch_time='var:inputFileBranchTime',
      attribute:cmor_version='var:inputFileCMORVersion',
      attribute:model_id='var:inputFileModelId',
      ...
    ]
  )
  entity(
    var:preprocessedFile,
    [
      attribute:Conventions='var:inputFileConventions',
      attribute:branch_time='var:inputFileBranchTime',
      attribute:cmor_version='var:inputFileCMORVersion',
      attribute:model_id='var:inputFileModelId',
      ...
      preprocessor:regrid='var:regridPreprocessorSettings',
      preprocessor:convert_units='var:convertUnitsPreprocessorSettings',
      ...
    ]
  )
  entity(
    var:resultFile,
    [
      attribute:caption='var:resultCaption',
      attribute:domains='var:resultDomains',
      attribute:realm='var:resultRealm',
      attribute:references='var:resultReferences',
      ...
    ]
  )
  entity(
    var:recipe,
    [
      attribute:description='var:recipeDescription',
      attribute:references='var:recipeReferences'
    ]
  )
  wasDerivedFrom(var:preprocessedFile, var:inputFile, var:preprocessingTask, -, -)
  wasDerivedFrom(var:resultFile, var:preprocessedFile, var:diagnosticTask, -, -)
  wasAttributedTo(var:recipe, var:recipeAuthor)
  wasAttributedTo(var:recipe, var:project)
  wasAttributedTo(var:resultFile, var:recipeAuthor)
  wasAttributedTo(var:resultFile, var:diagnosticAuthor)
  wasStartedBy(var:preprocessingTask, var:recipe, var:software, -)
  wasStartedBy(var:diagnosticTask, var:recipe, var:software, -)
endDocument

Note that this describes the current implementation, which may not be optimal.
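To make the role of the `var:` placeholders concrete, here is a minimal sketch of template expansion in plain Python. It is not how ESMValCore or any PROV-Template tool implements expansion: the statements are simplified from the template above (attributes omitted), the `file` and `task` prefixes and all bound names are hypothetical, and only one binding per variable is handled, whereas real template expansion can bind a variable to several values.

```python
import re

# A few statements simplified from the template above (attributes omitted).
TEMPLATE = """\
entity(var:inputFile)
entity(var:preprocessedFile)
activity(var:preprocessingTask, -, -)
wasDerivedFrom(var:preprocessedFile, var:inputFile, var:preprocessingTask, -, -)
"""

# Hypothetical bindings: the "file" and "task" prefixes and the names are
# illustrative only and are not taken from ESMValCore.
BINDINGS = {
    "inputFile": "file:tas_input",
    "preprocessedFile": "file:tas_preprocessed",
    "preprocessingTask": "task:preprocess_tas",
}


def expand(template: str, bindings: dict[str, str]) -> str:
    """Replace each var:<name> placeholder with its single bound value."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no binding for template variable var:{name}")
        return bindings[name]

    return re.sub(r"\bvar:(\w+)", substitute, template)


expanded = expand(TEMPLATE, BINDINGS)
print(expanded)
```

Raising on an unbound variable is a deliberate choice in this sketch: a placeholder that silently survives expansion would produce a record that still looks like a template.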

The items in the attribute namespace are pretty much free-form. Controlled vocabularies prescribe the required global attributes for datasets such as CMIP5, CMIP6, obs4MIPs, and CORDEX, but not for other data, and experience has shown that even where a controlled vocabulary exists, data often does not comply with it. The attributes of the resultFile will always contain certain items, but users may add more by specifying them in the recipe under the diagnostic script. To make linked data 'work', it would be nice to use a proper namespace as suggested in #649 (review), but this is challenging because of the variation in available attributes.
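As a rough illustration of why the free-form attributes are hard to pin down, the sketch below compares a file's global attributes against a list of required names and renders whatever is present as a PROV-N attribute list. The `REQUIRED` set is hypothetical (it just reuses the attribute names from the template above; a real controlled vocabulary defines its own list), and the helper names and example values are made up. It also assumes attribute names are valid PROV-N local names, which free-form attributes may not be.

```python
# Hypothetical required attributes, borrowed from the names used in the
# template above; a real controlled vocabulary would define its own list.
REQUIRED = {"Conventions", "branch_time", "cmor_version", "model_id"}


def check_attributes(attributes, required=REQUIRED):
    """Report required attributes that are missing and extra ones present."""
    present = set(attributes)
    return {
        "missing": sorted(required - present),
        "extra": sorted(present - required),
    }


def to_prov_n(attributes, prefix="attribute"):
    """Render attributes as a PROV-N attribute list with string values."""
    pairs = []
    for name, value in sorted(attributes.items()):
        # Escape backslashes and double quotes inside the string literal.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        pairs.append(f'{prefix}:{name}="{escaped}"')
    return "[" + ", ".join(pairs) + "]"


# Made-up global attributes of an input file.
attrs = {
    "Conventions": "CF-1.4",
    "model_id": "EXAMPLE-MODEL",
    "comment": "free-form note",
}

report = check_attributes(attrs)
print(report)
print(to_prov_n(attrs))
```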

More complicated templates than the above are possible. For example, when using a multi-model preprocessor function like multi_model_statistics, there will be an extra intermediate preprocessedFile in the provenance record, and when using the ancestors feature, resultFiles may be derived from other resultFiles as well as from preprocessedFiles.
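A sketch of what the extra derivation statements could look like in these two cases, generated with plain Python. All variable names other than `var:resultFile`, `var:preprocessingTask`, and `var:diagnosticTask` are hypothetical, and attaching the multi-model step to `var:preprocessingTask` is an assumption; the records ESMValCore actually writes may be shaped differently.

```python
def derivation_statements(derivations):
    """Return one PROV-N wasDerivedFrom statement per (product, source) pair."""
    statements = []
    for product, (activity, sources) in derivations.items():
        for source in sources:
            statements.append(
                f"wasDerivedFrom({product}, {source}, {activity}, -, -)"
            )
    return statements


# Hypothetical chain: two input files are preprocessed individually, the
# intermediate files are combined by a multi-model step, and the result file
# is derived from the combined file and from an ancestor result file.
DERIVATIONS = {
    "var:intermediateFileA": ("var:preprocessingTask", ["var:inputFileA"]),
    "var:intermediateFileB": ("var:preprocessingTask", ["var:inputFileB"]),
    "var:multiModelFile": (
        "var:preprocessingTask",
        ["var:intermediateFileA", "var:intermediateFileB"],
    ),
    "var:resultFile": (
        "var:diagnosticTask",
        ["var:multiModelFile", "var:ancestorResultFile"],
    ),
}

statements = derivation_statements(DERIVATIONS)
print("\n".join(statements))
```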

See also #29 and #649 for previous discussion.
