Formatter / schema.org / Add croissant spec 🥐 #8939

fxprunayre · 2025-07-16T15:32:27Z

"Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools. Croissant builds on schema.org, and its Dataset vocabulary, a widely used format to represent datasets on the Web, and make them searchable." https://docs.mlcommons.org/croissant/

Croissant extends schema.org. This improvement reviews the current schema.org formatter to support additional 🥐 metadata that is available in the ISO format. It mainly adds:

croissant fileObject based on online resources with a download protocol
croissant recordSet based on the feature catalogue
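To make the two additions concrete, here is a minimal Python sketch of how online resources and feature catalogue attributes could map to Croissant `distribution` and `recordSet` entries. It is not the formatter's XSLT. The input dictionaries, the `build_croissant_parts` helper, the "protocol contains DOWNLOAD" test and the `sc:Text` default are all assumptions made for illustration; only the `cr:FileObject`, `cr:RecordSet` and `cr:Field` types come from the Croissant vocabulary.

```python
def build_croissant_parts(online_resources, feature_attributes):
    """Build hypothetical Croissant `distribution` and `recordSet` entries."""
    distribution = []
    for index, res in enumerate(online_resources):
        # Assumption: only resources with a download protocol become FileObjects.
        if "DOWNLOAD" not in (res.get("protocol") or "").upper():
            continue
        distribution.append({
            "@type": "cr:FileObject",
            "@id": f"file-{index}",
            "name": res.get("name") or f"file-{index}",
            "contentUrl": res["url"],
            # Assumption: fall back to a generic MIME type when none is known.
            "encodingFormat": res.get("mimeType", "application/octet-stream"),
        })

    record_set = []
    if feature_attributes and distribution:
        # Assumption: fields are read from the first downloadable file.
        source_id = distribution[0]["@id"]
        record_set.append({
            "@type": "cr:RecordSet",
            "@id": "records",
            "name": "records",
            "field": [
                {
                    "@type": "cr:Field",
                    "@id": f"records/{attr['name']}",
                    "name": attr["name"],
                    "dataType": attr.get("dataType", "sc:Text"),
                    "source": {
                        "fileObject": {"@id": source_id},
                        "extract": {"column": attr["name"]},
                    },
                }
                for attr in feature_attributes
            ],
        })
    return {"distribution": distribution, "recordSet": record_set}
```

In this sketch a resource with a view-only protocol is skipped, and no `recordSet` is produced when the record has no feature catalogue or no downloadable file.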

Refactor the JSON-LD formatter to use the same base formatter for both ISO19139 and ISO19115-3, which eases maintenance (similar to the citation and DCAT formatters).

Improve formatters producing JSON output by ensuring the output is valid JSON, formatting it, and logging any error so that errors can be tracked and poorly handled encoding can be fixed.
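The validate, format and log step could look like the following Python sketch. The actual change lives in the Java/XSLT formatter code, so the function name, the logger name and the choice to return `None` on failure are assumptions for illustration only.

```python
import json
import logging

logger = logging.getLogger("formatter.json")  # hypothetical logger name


def validate_and_format(raw_text, record_id="unknown"):
    """Return pretty-printed JSON, or None when the text is not valid JSON."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError as err:
        # Log enough context to find the badly encoded value later.
        logger.error(
            "Invalid JSON for record %s at line %d, column %d: %s",
            record_id, err.lineno, err.colno, err.msg,
        )
        return None
    # Re-serialize so the response is consistently indented.
    return json.dumps(parsed, indent=2, ensure_ascii=False)
```

A typical failure this would catch is an unescaped double quote inside a title, which text-based output can easily emit.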

schema.org improvements:

inLanguage corresponds to the resource language, not the metadata language.
Dispatch parties by role instead of only using producer (e.g. provider, producer, copyrightHolder, publisher, author, funder).
Do not generate an element (e.g. temporalCoverage) if there is no corresponding element in the input document.
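The three schema.org rules above can be sketched in Python as follows. The input record shape, the `to_schema_org` helper and the role-to-property table are hypothetical; in particular, which ISO role code maps to which schema.org property is an assumption here, not the mapping used by the formatter.

```python
# Hypothetical mapping from ISO role codes to schema.org properties.
ROLE_TO_PROPERTY = {
    "author": "author",
    "publisher": "publisher",
    "funder": "funder",
    "rightsHolder": "copyrightHolder",
    "resourceProvider": "provider",
    "originator": "producer",
}


def to_schema_org(record):
    """Build a schema.org Dataset dict from a simplified metadata record."""
    doc = {"@context": "https://schema.org", "@type": "Dataset"}

    # Use the resource language; the metadata language is ignored on purpose.
    if record.get("resourceLanguage"):
        doc["inLanguage"] = record["resourceLanguage"]

    # Dispatch each party to the property matching its role.
    for party in record.get("parties", []):
        prop = ROLE_TO_PROPERTY.get(party.get("role"))
        if prop is None:
            continue  # assumption: unmapped roles are skipped
        doc.setdefault(prop, []).append(
            {"@type": "Organization", "name": party["name"]}
        )

    # Only emit temporalCoverage when the input has a temporal extent.
    if record.get("temporalExtent"):
        begin, end = record["temporalExtent"]
        doc["temporalCoverage"] = f"{begin}/{end}"

    return doc
```

The point of the last rule is that a missing input element leaves the key out entirely, rather than producing an empty or placeholder value.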

Similar initiatives:

Checklist

Funded by BRGM.

"Croissant 🥐 is a high-level format for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file; it works with existing datasets to make them easier to find, use, and support with tools. Croissant builds on schema.org, and its Dataset vocabulary, a widely used format to represent datasets on the Web, and make them searchable. " Croissant is extending schema.org, this improvement review the current schema.org formatter to support additional 🥐 metadata available in ISO format. This is mainly about adding: * croissant fileObject based on online resources with a download protocol * croissant recordSet based on the feature catalogue Refactor JSON-LD formatter for using same base formatter for both ISO19139 and ISO19115-3 to facilitate maintenance (similar to citation and DCAT formatter).

Formatters producing JSON may produce an invalid document, because the XSLT process outputs text that is written directly to the response. Ensure the JSON is valid and format it. Log any error so that failures can be monitored and the formatter improved where encoding is poorly handled. In a future version, consider using XSLT 3.0, which supports JSON output (https://www.w3.org/TR/xslt-30/#json).

sonarqubecloud · 2025-07-16T15:55:06Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

fxprunayre added this to the 4.4.9 milestone Jul 16, 2025

fxprunayre added 2 commits July 16, 2025 17:47

fxprunayre force-pushed the 44-croissant branch from 7a0880a to 4a7ea6c Compare July 16, 2025 15:48
