RFC-84 : Schema Annotation #84

Champion: @andrea-gioia

Summary

A dataset schema is the definition of how data is organized within a dataset. A schema definition can be used to validate the structure of shared data (schema validation), annotate that structure with metadata (schema annotation), or create the structure on a specific datastore (schema creation or data definition language).
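As a minimal sketch of these three uses, the Python below applies one schema definition to validation, annotation, and creation. The `customer` dataset, its fields, the type mappings, and all function names are assumptions made for illustration; none of them comes from DPDS.

```python
# Hypothetical schema for a "customer" dataset, written as a JSON Schema-style dict.
CUSTOMER_SCHEMA = {
    "title": "customer",
    "type": "object",
    "properties": {
        "id": {"type": "integer", "description": "Unique customer identifier"},
        "email": {"type": "string", "description": "Contact e-mail address"},
    },
    "required": ["id"],
}

# Assumed mappings from schema types to Python and SQL types (illustrative only).
PYTHON_TYPES = {"integer": int, "string": str}
SQL_TYPES = {"integer": "INTEGER", "string": "TEXT"}


def validate(record, schema):
    """Schema validation: return the list of problems found in a record."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, PYTHON_TYPES[spec["type"]]):
            errors.append(f"wrong type for field: {name}")
    return errors


def describe(schema):
    """Schema annotation: collect the human-readable description of each field."""
    return {
        name: spec.get("description", "")
        for name, spec in schema["properties"].items()
    }


def to_ddl(schema):
    """Schema creation: render a CREATE TABLE statement for the dataset."""
    required = set(schema.get("required", []))
    columns = []
    for name, spec in schema["properties"].items():
        not_null = " NOT NULL" if name in required else ""
        columns.append(f"{name} {SQL_TYPES[spec['type']]}{not_null}")
    return f"CREATE TABLE {schema['title']} ({', '.join(columns)});"
```

In this sketch, `validate({"id": 1, "email": "a@b.c"}, CUSTOMER_SCHEMA)` returns an empty list, and `to_ddl(CUSTOMER_SCHEMA)` returns a single `CREATE TABLE` statement built from the same definition.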

In DPDS, dataset schemas are used to describe the data shared by ports. The schema definition is not part of the specification: it is delegated to the specification used to describe the API of the specific port. The schema definition format therefore depends on the format that the API specification uses to serialize exchanged data. The following table shows some commonly used schema definition formats:

| Schema definition format | Data serialization format |
|--------------------------|---------------------------|
| JSON Schema              | JSON                      |
| YAML Schema              | YAML                      |
| AVRO Schema              | AVRO                      |
| XML Schema Definition    | XML                       |

We want to define a sub-specification of DPDS to standardize the way schemas are annotated.

Motivation

Through schema annotation, it is possible to provide metadata that describes the underlying data beyond the way it is structured. Examples of information that can be encoded through schema annotations are:

  • human-readable descriptions of the dataset and its fields
  • data usage guidelines
  • technical information on how data is stored
  • constraints on data exchanged that can be then converted into quality checks
  • how the dataset and its fields can be linked to external datasets (syntactic linking)
  • how the dataset and its fields can be linked to external ontologies (semantic linking)
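To make the list above concrete, here is a hedged sketch of an annotated schema and of how constraints could be turned into quality checks. `description`, `minimum`, and `pattern` are standard JSON Schema keywords; every `x-...` keyword, the `order` dataset, the `example.org` URL, and the helper functions are hypothetical placeholders, not annotations defined by this RFC.

```python
import re

# Hypothetical annotated schema. Every "x-..." keyword is an invented placeholder.
ORDER_SCHEMA = {
    "title": "order",
    "description": "Orders placed through the web shop",        # description
    "x-usage": "Aggregate before sharing outside the domain",   # usage guideline
    "type": "object",
    "properties": {
        "order_id": {
            "type": "string",
            "description": "Order identifier",
            "pattern": "^ORD-[0-9]{6}$",                         # constraint
        },
        "amount": {
            "type": "number",
            "description": "Order total",
            "minimum": 0,                                        # constraint
            "x-storage": {"column": "AMT"},                      # technical info
        },
        "customer_id": {
            "type": "integer",
            "description": "Customer who placed the order",
            "x-link": {"dataset": "customer", "field": "id"},    # syntactic link
            "x-concept": "https://example.org/ontology#Customer",  # semantic link
        },
    },
}


def quality_checks(schema):
    """Turn field constraints into labelled check functions over a row."""
    checks = {}
    for name, spec in schema["properties"].items():
        if "minimum" in spec:
            low = spec["minimum"]
            checks[f"{name} >= {low}"] = lambda row, n=name, m=low: row[n] >= m
        if "pattern" in spec:
            regex = re.compile(spec["pattern"])
            checks[f"{name} matches {spec['pattern']}"] = (
                lambda row, n=name, r=regex: r.search(row[n]) is not None
            )
    return checks


def failed_checks(row, schema):
    """Return the labels of the quality checks that a row does not pass."""
    return [
        label
        for label, check in quality_checks(schema).items()
        if not check(row)
    ]


def links(schema):
    """Collect the syntactic links declared on the fields."""
    return {
        name: spec["x-link"]
        for name, spec in schema["properties"].items()
        if "x-link" in spec
    }
```

A row such as `{"order_id": "123", "amount": -1, "customer_id": 7}` fails both derived checks, while `links(ORDER_SCHEMA)` reports that `customer_id` points at the `id` field of a `customer` dataset.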

While each schema definition format has its own way of defining annotations, it is important to define a common vocabulary for the information that can be used to enrich the schema definition. In other words, defining the admissible annotations and their meaning is important.
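As an illustration of why a shared vocabulary helps, the sketch below renders the same field description into two formats: JSON Schema carries it in the `description` keyword, while Avro carries it in the `doc` attribute. The neutral `FIELDS` structure, the type mapping, and the function names are assumptions for this example only.

```python
# Hypothetical common vocabulary: one format-neutral record per field.
FIELDS = [
    {"name": "id", "type": "integer", "description": "Unique customer identifier"},
    {"name": "email", "type": "string", "description": "Contact e-mail address"},
]

# Assumed mapping from the neutral types to Avro primitive types.
AVRO_TYPES = {"integer": "long", "string": "string"}


def to_json_schema(title, fields):
    """JSON Schema: the description goes in the "description" keyword."""
    return {
        "title": title,
        "type": "object",
        "properties": {
            f["name"]: {"type": f["type"], "description": f["description"]}
            for f in fields
        },
    }


def to_avro_schema(name, fields):
    """Avro: the description goes in the "doc" attribute of each field."""
    return {
        "type": "record",
        "name": name,
        "fields": [
            {
                "name": f["name"],
                "type": AVRO_TYPES[f["type"]],
                "doc": f["description"],
            }
            for f in fields
        ],
    }
```

Both outputs carry the same meaning under different keywords, which is the gap a common annotation vocabulary would close.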

Describing the schema of a dataset using a standard schema definition format and annotations is better than defining a custom structure to store schema metadata, for the following reasons:

  • it is possible to have only one schema for serialization, validation, and metadata
  • metadata is embedded in the schema shared between producer and consumers. Once the consumer gets the schema, it also gets its metadata, no matter what tool manages the schema repository
  • using a standard schema format makes it possible to leverage the ecosystem of tools supporting the format (e.g., validator, linter, registry, gift, etc.)
