
Formatter / schema.org / Add croissant spec 🥐 #8939

Draft: wants to merge 2 commits into base: main

Conversation

@fxprunayre (Member) commented Jul 16, 2025

"Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools. Croissant builds on schema.org, and its Dataset vocabulary, a widely used format to represent datasets on the Web, and make them searchable." https://docs.mlcommons.org/croissant/

Croissant extends schema.org. This pull request reviews the current schema.org formatter so that it supports the additional 🥐 metadata available in ISO records. It mainly adds:

  • croissant fileObject based on online resources with a download protocol
  • croissant recordSet based on the feature catalogue
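As a rough illustration of the two additions above, the following Python sketch builds a Croissant-style JSON-LD fragment from a simplified record. It is not the XSLT formatter from this pull request: the input dictionary layout, the `WWW:DOWNLOAD` protocol prefix, the choice to bind the record set to the first downloadable file, and the reduced `@context` are all assumptions made for the example.

```python
import json

# Simplified JSON-LD context (assumption); the Croissant spec defines a fuller one.
CONTEXT = {
    "@vocab": "https://schema.org/",
    "sc": "https://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
}


def file_object(resource):
    """Map one online resource (hypothetical dict layout) to a FileObject."""
    return {
        "@type": "cr:FileObject",
        "@id": resource["name"],
        "name": resource["name"],
        "contentUrl": resource["url"],
        "encodingFormat": resource["mime_type"],
    }


def record_set(name, file_id, attributes):
    """Map feature catalogue attributes to a RecordSet with one Field each."""
    return {
        "@type": "cr:RecordSet",
        "@id": name,
        "name": name,
        "field": [
            {
                "@type": "cr:Field",
                "@id": f"{name}/{attr['name']}",
                "name": attr["name"],
                "dataType": attr["data_type"],
                "source": {
                    "fileObject": {"@id": file_id},
                    "extract": {"column": attr["name"]},
                },
            }
            for attr in attributes
        ],
    }


def to_croissant(record):
    dataset = {"@context": CONTEXT, "@type": "sc:Dataset", "name": record["title"]}
    # Only resources with a download protocol become FileObjects.
    downloads = [
        r for r in record.get("online_resources", [])
        if r["protocol"].startswith("WWW:DOWNLOAD")  # assumed protocol prefix
    ]
    if downloads:
        dataset["distribution"] = [file_object(r) for r in downloads]
        catalogue = record.get("feature_catalogue")
        if catalogue:
            # Assumption: the catalogue describes the first downloadable file.
            dataset["recordSet"] = [
                record_set(catalogue["name"], downloads[0]["name"], catalogue["attributes"])
            ]
    return dataset


example = {
    "title": "Groundwater wells",
    "online_resources": [
        {"name": "wells.csv", "url": "https://example.org/data/wells.csv",
         "mime_type": "text/csv", "protocol": "WWW:DOWNLOAD"},
        {"name": "portal", "url": "https://example.org/portal",
         "mime_type": "text/html", "protocol": "WWW:LINK"},
    ],
    "feature_catalogue": {
        "name": "wells",
        "attributes": [
            {"name": "well_id", "data_type": "sc:Text"},
            {"name": "depth_m", "data_type": "sc:Float"},
        ],
    },
}

print(json.dumps(to_croissant(example), indent=2))
```

In this sketch the `portal` link is left out of `distribution` because its protocol is not a download protocol, and no `recordSet` is emitted when the record has no feature catalogue.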

Refactor the JSON-LD formatter to use the same base formatter for both ISO19139 and ISO19115-3, which makes maintenance easier (as is already done for the citation and DCAT formatters).

Improve formatters that produce JSON output: ensure the output is valid JSON, format it, and log any error so that failures can be tracked and poorly handled encoding can be fixed.
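The validate, format, and log step can be sketched as follows in Python. This only illustrates the idea and is not the code in this pull request: the function name, the logger name, and the choice to return `None` for an invalid document are assumptions.

```python
import json
import logging

logger = logging.getLogger("formatter.json")  # hypothetical logger name


def validate_and_format(raw_text, record_id):
    """Return pretty-printed JSON, or None if the formatter output is invalid.

    raw_text is the text produced by the transformation; record_id is only
    used to make the log entry traceable.
    """
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as err:
        # Log where parsing failed so the encoding issue can be tracked down.
        logger.error(
            "Invalid JSON for record %s at line %d, column %d: %s",
            record_id, err.lineno, err.colno, err.msg,
        )
        return None
    # Re-serialise to get consistent indentation; keep non-ASCII text readable.
    return json.dumps(parsed, indent=2, ensure_ascii=False)


print(validate_and_format('{"name": "Forêt", "keywords": ["a", "b"]}', "rec-1"))
print(validate_and_format('{"name": "unescaped "quote" in title"}', "rec-2"))
```

The second call shows the kind of failure this targets: an unescaped quote in a text value produces output that is not valid JSON, which is logged instead of being returned to the client unnoticed.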

schema.org improvements:

  • inLanguage corresponds to the resource language, not the metadata language.
  • Dispatch parties by role instead of only using producer (e.g. provider, producer, copyrightHolder, publisher, author, funder).
  • Do not generate an element (e.g. temporalCoverage) if there is no corresponding element in the input document.
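A minimal Python sketch of these three rules, for illustration only. The role-to-property table is hypothetical (the actual mapping lives in the formatter and may differ), the input dictionary layout is invented for the example, and roles without a mapping are simply skipped here.

```python
# Hypothetical mapping from ISO role codes to schema.org properties.
ROLE_TO_PROPERTY = {
    "resourceProvider": "provider",
    "originator": "producer",
    "owner": "copyrightHolder",
    "publisher": "publisher",
    "author": "author",
    "funder": "funder",
}


def to_schema_org(record):
    doc = {"@context": "https://schema.org", "@type": "Dataset", "name": record["title"]}

    # inLanguage comes from the resource language, never the metadata language.
    if record.get("resource_language"):
        doc["inLanguage"] = record["resource_language"]

    # Dispatch each party to the property matching its role.
    for party in record.get("parties", []):
        prop = ROLE_TO_PROPERTY.get(party["role"])
        if prop is None:
            continue  # assumption: unmapped roles are ignored in this sketch
        doc.setdefault(prop, []).append({"@type": "Organization", "name": party["name"]})

    # Emit temporalCoverage only when the input has a temporal extent.
    extent = record.get("temporal_extent")
    if extent:
        doc["temporalCoverage"] = f"{extent['begin']}/{extent['end']}"

    return doc


record = {
    "title": "Groundwater wells",
    "metadata_language": "fre",
    "resource_language": "eng",
    "parties": [
        {"role": "publisher", "name": "Org A"},
        {"role": "funder", "name": "Org B"},
    ],
}

print(to_schema_org(record))
```

Because the sample record has no temporal extent, the result contains no `temporalCoverage` key at all, rather than an empty value.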

Similar initiatives:

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chunks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

Funded by BRGM.

@fxprunayre fxprunayre added this to the 4.4.9 milestone Jul 16, 2025

Formatters producing JSON may produce an invalid document, because the XSLT process outputs text that is written directly to the response. Ensure the JSON is valid and format it. Log any error so that failures can be monitored and the formatter improved where encoding is not handled well.

In a future version, consider using XSLT 3.0, which supports JSON output (https://www.w3.org/TR/xslt-30/#json).