Standardization and best practices #1

samikanza · 2024-10-21T11:20:07Z

samikanza
Oct 21, 2024
Maintainer

MADICES 2024 Recap
We delved into the realm of semantic annotation, emphasizing the adoption of RDF serialization formats for data representation, using JSON-LD as a working example, to enhance the interpretability and reusability of research datasets. Participants explored methodologies for incorporating metadata, handling missing values, and standardizing units for measurements. Tools and guidelines were developed to streamline the annotation process and ensure compatibility across different knowledge domains.
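
As a rough illustration of the kind of annotation discussed above, here is a minimal Python sketch that builds a JSON-LD document for a dataset with two measured quantities. The choices in it are assumptions made for this example, not workshop outcomes: it uses the schema.org vocabulary (`Dataset`, `PropertyValue`, `unitCode`), QUDT unit IRIs for units, and the convention of omitting a key when its value is missing. The helper names and the sample values are hypothetical.

```python
import json


def annotate_measurement(name, value, unit_iri=None):
    """Build a schema.org PropertyValue node; keys with missing values are omitted."""
    node = {"@type": "PropertyValue", "name": name}
    if value is not None:
        node["value"] = value
    if unit_iri is not None:
        node["unitCode"] = unit_iri
    return node


def build_dataset(title, measurements):
    """Wrap the measurements in a JSON-LD Dataset using schema.org as default vocabulary."""
    return {
        "@context": {"@vocab": "https://schema.org/"},
        "@type": "Dataset",
        "name": title,
        "variableMeasured": [annotate_measurement(*m) for m in measurements],
    }


# Hypothetical sample data: the temperature reading is missing.
doc = build_dataset(
    "Example cycling run",
    [
        ("cell voltage", 3.71, "http://qudt.org/vocab/unit/V"),
        ("cell temperature", None, "http://qudt.org/vocab/unit/DEG_C"),
    ],
)
print(json.dumps(doc, indent=2))
```

Omitting the key keeps the unit annotation for the second quantity while making it explicit that no value was recorded; other conventions for missing values are equally possible.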

MADICES 2025 Focus Area
We want to focus on further refinement of protocols and guidelines for semantic annotation and interoperable data exchange. Recent discussions with members at the Future Labs Live 2024 event in Basel highlighted the interest in semantic interoperability of research data, and the need for standards and tooling that improve ease of use and lower the barrier to semantically annotating datasets.

Please include links, comments, and discuss plans for this focus area below.

hampusnasstrom · 2024-11-04T10:27:51Z

hampusnasstrom
Nov 4, 2024

There is a lot of focus on the semantic web with annotated data sets serialized as RDF, JSON-LD, or XML. However, I don't think we can expect every scientist to search through ontologies and connect them. This is rather a task for data stewards, but the unfortunate reality is that most groups don't have access to one, and won't for the foreseeable future. I therefore believe that we need a way to Find, Access, Interoperate, and Reuse not only the definitions but also collections of them in schemas/forms.

1 reply

edan-bainglass Apr 25, 2025
Maintainer

I don't think we can expect every scientist to search through ontologies and connect them

This burden is likely a non-starter for most researchers. Too busy doing research and writing papers :) Key is to find a way to integrate these tasks seamlessly into research workflows. At AiiDA, we are looking into integrating semantics directly into our data and logic pipelines, such that any research carried out via the AiiDA engine is automatically semantic!

One challenge, as you've pointed out, is which ontologies to use? How to handle "redundant" ontologies, i.e., reinvented ontologies (due perhaps to lack of awareness of other works) or mildly different ontologies? How do we map these without burdening users, such that they are easily connected to the "global research knowledge"?

We've been discussing internally how LLMs might assist in such tasks. AI data stewards? Surely this would be discussed at Future Labs Live (I believe it was already in 2024). Best to keep an eye on this and try to drag a few experts to MADICES :)

PeterKraus · 2024-11-04T16:16:59Z

PeterKraus
Nov 4, 2024
Maintainer

One of the goals I have for MADICES-2025 is to understand how I need to annotate my NetCDF datasets so that the annotations are useful downstream. By this I mean the mechanics of it, not figuring out the ontology labels etc.

I am happy to provide some example files, and prepare a spreadsheet matching data headers to ontology entries. These annotations could be implemented into either yadg or tomato.
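
To make the "mechanics" question concrete, here is a minimal Python sketch of one way a header-to-ontology table could be applied to per-variable attributes. Everything in it is an assumption for illustration: the CSV layout, the `https://example.org/onto#` IRIs, the variable names, and the `ontology_iri` attribute name are hypothetical, and plain dicts stand in for NetCDF variable attributes. It does not reflect how yadg or tomato work.

```python
import csv
import io

# Hypothetical mapping table; in practice this would be exported from the spreadsheet.
MAPPING_CSV = """header,iri,unit
Ewe,https://example.org/onto#WorkingElectrodePotential,V
I,https://example.org/onto#ElectricCurrent,A
"""


def load_mapping(text):
    """Parse the header-to-ontology table into a dict keyed by data header."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["header"]: {"iri": row["iri"], "units": row["unit"]} for row in reader}


def annotate(variable_attrs, mapping):
    """Return a copy of the per-variable attributes with ontology IRIs added where known."""
    annotated = {}
    for name, attrs in variable_attrs.items():
        new_attrs = dict(attrs)  # leave the input untouched
        entry = mapping.get(name)
        if entry is not None:
            new_attrs["ontology_iri"] = entry["iri"]
            # Only fill in units when the file does not already declare them.
            new_attrs.setdefault("units", entry["units"])
        annotated[name] = new_attrs
    return annotated


# Plain dicts standing in for the attributes of three variables in a file.
variables = {"Ewe": {"units": "V"}, "I": {}, "uts": {"units": "s"}}
result = annotate(variables, load_mapping(MAPPING_CSV))
for name, attrs in result.items():
    print(name, attrs)
```

Variables without an entry in the table pass through unchanged, so partially annotated files remain valid input downstream.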

0 replies

ml-evs · 2024-11-04T20:42:25Z

ml-evs
Nov 4, 2024
Maintainer

My aim is similar to Peter's: how can I best annotate public or published entries (e.g. samples, devices, their relationships and attached measurements) in datalab instances so that they can be used downstream, even if semantic annotations are missing or incomplete?

0 replies

samikanza · 2025-04-23T10:25:58Z

samikanza
Apr 23, 2025
Maintainer Author

We are also keen to discuss interoperability between ELNs! There are clearly a few different efforts here. On the open-source side there is the ELN Consortium, with the ELNFileFormat. At the industry conferences I attend, I am also seeing a lot of discussion about the Allotrope Foundation and various companies working with the large-scale commercial ELNs to provide an archive format for moving data between them.

Also interested in discussing common templates across ELNs

2 replies

cbarillari Apr 23, 2025

I am also very much interested in this topic. In the openBIS team we have done some work in this direction and our developers could contribute to discussions on this topic.

edan-bainglass Apr 25, 2025
Maintainer

We need to understand precisely where the .eln file format was lacking, which led us at PREMISE to pursue a custom extension of the RO Crate spec. It is important that we connect our PREMISE team with that of the ELN Consortium (particularly @SteffenBrinckmann), so that we sync up efforts and avoid confusion. @cbarillari, it would be great if you could ping your team about this topic so that they can comment.

bdeadman · 2025-04-24T15:12:41Z

bdeadman
Apr 24, 2025

We (https://github.com/open-reaction-database, ORD) are interested in this topic. The ORD is an open-access repository and schema for chemical reaction data, particularly experimental data on small-molecule reactions. We use Protocol Buffers (similar to XML) for storing and exchanging reaction datasets, and also provide an object-relational mapper to unpack the data into a PostgreSQL database.

Our interest is in working with other reaction data formats to ensure interoperability, but also thinking how we could fit into upstream and downstream data standards, and applications.

1 reply

edan-bainglass Apr 25, 2025
Maintainer

Pinging @fabioacl from the PREMISE team. One of our work packages addressed authoring semantically annotated schemas (and standards for doing so) for microscopy data (see here), particularly studying molecules on surfaces. The schemas are being used to demonstrate interoperability between ELNs (openBIS in this case), where experimental data/metadata is being stored in the developed standards, and simulation engines (AiiDA in this case), where STM simulations are carried out to elucidate experimental results.

A key deliverable for us is the demonstration of data/metadata exchange across these platforms, for example:

Lab results to openBIS
openBIS objects to AiiDA for simulation
AiiDA simulation results back to openBIS
openBIS/AiiDA data/metadata to public archives

We also do something similar for battery research (another PREMISE work package), where battery testing experiments are orchestrated by a workflow engine, the data of which is exchanged with the openBIS ELN, again in PREMISE-developed standards/protocols. Pinging @NukP.

In both cases, we are exploring two paths:

Direct one-to-one interoperability using dedicated APIs from both platforms
A generalized platform-agnostic means of exchanging resources by means of RO Crates
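
For readers unfamiliar with the second path, here is a minimal Python sketch of the skeleton of an RO-Crate 1.1 metadata file (`ro-crate-metadata.json`). It is not the PREMISE extension and not the output of openBIS or AiiDA; the function name, file path, and labels are hypothetical. It is also deliberately incomplete: the RO-Crate specification asks for further root properties such as a description, publication date, and license, which are left out here for brevity.

```python
import json


def build_ro_crate(name, files):
    """Assemble a skeletal RO-Crate 1.1 metadata document as a Python dict.

    `files` is a list of (relative_path, label, media_type) tuples.
    """
    file_entities = [
        {"@id": path, "@type": "File", "name": label, "encodingFormat": fmt}
        for path, label, fmt in files
    ]
    graph = [
        # Metadata descriptor: identifies this file as an RO-Crate description.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # Root dataset: lists the payload files by reference.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "hasPart": [{"@id": e["@id"]} for e in file_entities],
        },
    ]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": graph + file_entities,
    }


# Hypothetical export containing a single data file.
crate = build_ro_crate(
    "STM measurement export",
    [("data/scan_001.csv", "Raw scan", "text/csv")],
)
print(json.dumps(crate, indent=2))
```

Because the crate is plain JSON-LD, either platform can produce or consume it without knowing about the other's API, which is the appeal of the platform-agnostic route.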

Happy to share more here, in upcoming events, and of course at the workshop for those interested :)

juan-fuentes-sis · 2025-04-25T09:20:23Z

juan-fuentes-sis
Apr 25, 2025

We really need a dedicated working group focused on advancing standardization efforts to improve the interoperability of systems, particularly with respect to data formats and associated tools.

As many of you have already mentioned, either directly or indirectly, we are facing a fragmented landscape. Moreover, placing the burden of dataset annotation solely on individual scientists is not a realistic approach.

In this context, I believe it is essential for the working group to:

Identify the key stakeholders currently active in this space and the data formats they support.
Assess the advantages and limitations of the principal formats in use today.
Catalogue the tools and systems that support these formats.

Based on this analysis, the group should then develop informed recommendations regarding formats and tools for future adoption.

I would be glad to contribute to this effort, and I am willing to take a leading role in organizing and steering the working group.

0 replies

SteffenBrinckmann · 2025-04-27T11:20:52Z

SteffenBrinckmann
Apr 27, 2025

We should acknowledge that the ELNs have different goals with different advantages and disadvantages (as in rock-paper-scissors, none is superior to all). If the ELNs have different functionalities, they necessarily have different requirements when it comes to exchange formats, extractors, etc. As such, a single exchange module for all ELNs likely does not exist. We should accept these differences and not force alignment where it is difficult or impossible.

We should also acknowledge that there are different budgets: some ELNs have one person doing development, design, teaching, ..., while on the other hand we have multi-billion-euro companies. What can be achieved by one might not be feasible for the other, including participating in all meetings and working groups. Plus there is a difference in data structure: some follow RDF and some do not, with all 256 shades in between.

I think we are all aware of xkcd 927 and do not want to explicitly create new formats. But that also means that there are already 14 formats, and people chose different ones for different reasons, and that is OK.

I am aware of 3 ELN exchange formats, in alphabetical order: allotrope (multi-billion company, quite strong RDF), eln (no funding, open consortium, semantic annotation is possible but not necessary) and the openBIS suggestion from Andreas (quite strong RDF). Please correct me if there are more or you do not agree with my categorization.

What often triggers me in these discussions are general statements that focus on one's own solution, e.g. "FAIR research data is important and hence we have to use my product XYZ" (I haven't heard this specific argument and do not want to put anybody on the spot), because they imply that other products do not have the same goal. I think we all share almost 100% of our goals but try different paths to get there. We should learn openly and ask: why did you not choose this or that solution? But let's accept the differences and refrain from such statements.

The open discussion should also lead to an intermediate review: as a community we have been in this RDM process for ~5 years. Some things were successful, others not (my TAPIR project on extractors). We should try to identify the reasons for success or failure (e.g. an unpolished product due to insufficient funding) and try to learn from them. That could lead to some best practices for RDM project developers, which I would find really interesting.

Peace.

1 reply

samikanza Jul 3, 2025
Maintainer Author

@SteffenBrinckmann completely agree with your categorisation, and all your points!

Is anyone interested in doing some hacking using these different approaches? And trying to dig deeper into the actual practicalities of exchanging information between some different ELNs?
