I. Lightweight metadata schema for parsers #7

ml-evs · 2022-12-01T15:30:39Z

ml-evs
Dec 1, 2022
Maintainer

A lightweight metadata schema for parsers and associated tooling for software libraries to self-report:

The file formats they support (e.g., output files from a particular experimental apparatus, or log files from a computational chemistry code)
The shape and semantics of the data models produced, via existing formats and tooling for data description: for example, self-contained formats (HDF5 and STAR, and their respective domain-specific derivatives NeXus and CIF), schemas (JSONSchema, XSD) and semantic data (RDF, JSON-LD & CSV-LD). Depending on the output format, such metadata can be provided in-band or out-of-band in a well-defined location (e.g., a separate file, a persistent URL), following the upcoming recommendations of the MaRDA Data Dictionaries WG5.
Any additional metadata required for re-use, such as code versions and environments, source and code archive URLs, and bibliographic data following, for example, Dublin Core.
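
To make the three items above concrete, here is a minimal sketch of what such a self-report record might look like. Everything in it is an assumption for illustration: the class name `ParserReport`, its field names, and the example.org URLs are hypothetical and are not part of any agreed schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParserReport:
    """Hypothetical self-report record; all field names are illustrative."""

    name: str
    version: str
    source_url: str
    # File formats the parser claims to support (free-text identifiers here).
    supported_formats: List[str] = field(default_factory=list)
    # Out-of-band pointer to a description of the output data model, if any.
    output_schema_url: Optional[str] = None
    # Bibliographic data, loosely following Dublin Core element names.
    dublin_core: Dict[str, str] = field(default_factory=dict)


report = ParserReport(
    name="example-parser",
    version="0.1.0",
    source_url="https://example.org/example-parser",
    supported_formats=["example-instrument-log"],
    output_schema_url="https://example.org/schemas/example-output.json",
    dublin_core={"creator": "A. Developer", "title": "Example parser"},
)

# Serialise to JSON so that tooling in any language can read the report.
report_json = json.dumps(asdict(report), indent=2)
print(report_json)
```

A flat, JSON-serialisable record like this keeps the barrier low for library authors; whether the real schema should be flat or nested is an open question.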

markus1978 · 2022-12-01T15:44:11Z

markus1978
Dec 1, 2022

In NOMAD we use regular expressions on file names, MIME types, and file contents to match parsers to files. This allows us to determine, with reasonable effort, whether a parser is applicable. It's not always 100% correct, but it is practical.
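
A minimal sketch of this kind of regex-based matching is shown below. It is not NOMAD's actual code or API: the `ParserMatcher` class, the `find_parsers` helper, and the two example rules are hypothetical and only illustrate the idea of combining file name, MIME type, and content patterns.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParserMatcher:
    """Hypothetical matching rule; not NOMAD's actual classes or API."""

    parser_name: str
    filename_re: str = r".*"
    mime_re: str = r".*"
    content_re: Optional[str] = None  # searched in the head of the file

    def matches(self, filename: str, mime_type: str, head: str) -> bool:
        if not re.fullmatch(self.filename_re, filename):
            return False
        if not re.fullmatch(self.mime_re, mime_type):
            return False
        # MULTILINE lets "^" match at the start of any line in the head.
        if self.content_re is not None and not re.search(
            self.content_re, head, re.MULTILINE
        ):
            return False
        return True


def find_parsers(
    matchers: List[ParserMatcher], filename: str, mime_type: str, head: str
) -> List[str]:
    """Return the names of all parsers whose rules match, in list order."""
    return [m.parser_name for m in matchers if m.matches(filename, mime_type, head)]


matchers = [
    ParserMatcher("espresso-log", r".*\.log", r"text/.*", r"^instrument:"),
    ParserMatcher("generic-csv", r".*\.csv", r"text/(csv|plain)"),
]

hits = find_parsers(
    matchers, "run1.log", "text/plain", "name: x\ninstrument: gaggia classic\n"
)
print(hits)
```

Returning every match rather than the first one leaves the "not always 100% correct" cases visible to the caller, who can then pick or ask the user.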

2 replies

ml-evs Dec 1, 2022
Maintainer Author

Is the FAIRMat work going to extend this approach for experimental data formats? Or is the push to adopt something sensible like NeXuS/HDF5 from the source instead?

markus1978 Dec 1, 2022

We use Nexus in two ways. First, we interpret Nexus data if someone uses Nexus already. Here we map between our schema language and the Nexus one to directly interpret Nexus HDF5 similarly to our parsed NOMAD data.

Secondly, most labs don't yet have/use Nexus. Here, our goals are very similar to those of this WG. We developed a Python framework that makes it "easy" to implement converters for your typical instrument formats (csv/xyz/hdf5-based) that target Nexus files in the respective Nexus schema. The additional challenge here is that data for one Nexus file comes from different sources (e.g. instrument + human input/ELN).

markus1978 · 2022-12-01T15:52:18Z

markus1978
Dec 1, 2022

The usual schema-language suspects (such as those listed above) are typically not tailored to scientific data, in either expressiveness or scale. E.g. it is cumbersome to define multi-dimensional data and handle them efficiently (json, xml, csv instead of hdf5, pandas, numpy, ...). Maybe we should not just treat the output as another stream/file, but as data in e.g. a Python runtime.

4 replies

ml-evs Dec 1, 2022
Maintainer Author

This is definitely a big outstanding issue in the existing tech. I went looking for HDF5 schema languages a while ago, but it seems like they mostly settled on domain-profile-ish things like NetCDF, NeXus etc. Is that a fair summary? There is also the frictionless ecosystem (https://framework.frictionlessdata.io/) for tabular data schemas (with support out to many other serialization formats) and LinkML (https://linkml.io/linkml/) (with support for linked data), which we can investigate and perhaps ask the relevant communities about.

I agree that the output format schema doesn't have to be tied to any file or format, though this would make it harder to use from different languages etc unless we borrow stuff from a multi-lingual data runtime (making up terms here) like Spark?

PeterKraus Dec 1, 2022
Maintainer

I think it might be better to leave the "output schema" to the parsers for now (i.e. v0 of the schema), and focus on the schema to be able to run the parsers as-is with an arbitrary input file. We lose chain-ability and the more practical aspects of the "uniform" interface, but avoid pushing any actual hard work onto parsers (supporting a new output format, whatever it is) or the supposedly lightweight extractor (converting parser specific output to the new format).

The first "incremental" improvement for v1 could be that the output conforms.

markus1978 Dec 1, 2022

We dove, delved, dived, ... went very deeply into the Nexus schema system. It has a rich schema language, but it's tied to HDF5 (fields, groups, and primitive types) and a basic experiment vocabulary (experiment, sample, data, instrument). Therefore, I am a bit sceptical whether this is generic enough for this WG.

markus1978 Dec 1, 2022

I agree with @PeterKraus, tackling the input side is more feasible and interesting in the beginning.

ml-evs · 2022-12-01T16:39:19Z

ml-evs
Dec 1, 2022
Maintainer Author

Related to the discussion of multi-dimensional data was the idea of calculating derived quantities. To clarify in a comment some potential situations...

Let's say you have some file in a (purposefully gross) home-baked format that looks something like:

<magic bit that identifies format>
name: morning espresso
instrument: gaggia classic
experimenter: matthew evans
target_brew_temperature: 94 degrees C
# start csv block
# pressure profile (bar), temperature profile (C)
9.01, 94
9.00, 94
8.99, 93
., . 
., .
., .
<large 10 TB array of P vs T sampled every μs>
# end csv block

Are we putting any constraint on what a "parser" or "metadata extractor" (by our definition) should return?

e.g.,

1. Just the metadata
2. Metadata + every value from the arrays
3. Metadata + arrays converted to Pa and Kelvin
4. Metadata + every value from the arrays + average temperature calculated over the raw values
5. Metadata + some downsampled copy of the arrays + derived values from those
6. A subset of the metadata that the parser understands (i.e., maybe it only looks up certain keys from the header)

We obviously want to allow/encourage 1, 2 and 3. Perhaps 4 too, if the average value is well-described. 5 would be useful for actual re-use, yet the conversion is somehow lossy, as is 6.

The answer, I guess, would be "whatever the parser wants", but should these differences be somehow expressed in the schema, to make re-use easier? (e.g., an ELN probably does not want to store/index huge useless arrays in primary data, but automated use of a parser that does 2/3/4 would do this by default)

5 replies

PeterKraus Dec 1, 2022
Maintainer

I actually think you're missing a 0: just the schema, which would just tell you what to do to get the metadata+data, perhaps even what the format of the (meta-)data is, and an optional prompt to do 1 and above.

I'm not sure 3 is worth the CPU hours, if 2 includes unit annotation (in my view, units are data, not metadata). You probably wouldn't want to convert Hartrees or keV into Joules by default, but retain the ability to do so via something like pint.

I don't see any value in 4 - 6 to be honest. In my view, that's the job of the software that calls the extractor (whether it's the ELN in an automated way or the user in a semi-manual way). The only case I can think of where 5 might be useful is if the data files are huge, so that returning a copy, or even loading the data into memory, is not possible.
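
As an illustration of option 2 with unit annotation, the sketch below parses a shortened copy of the made-up espresso format and returns the arrays in their original units, with the unit stored next to the values rather than converted. The `parse` function, the `COLUMN_RE` pattern, and the output layout are assumptions for this example only; the magic first line and the elided rows of the original sample are left out.

```python
import re

SAMPLE = """\
name: morning espresso
instrument: gaggia classic
experimenter: matthew evans
target_brew_temperature: 94 degrees C
# start csv block
# pressure profile (bar), temperature profile (C)
9.01, 94
9.00, 94
8.99, 93
# end csv block
"""

# Column header of the form "<label> (<unit>)".
COLUMN_RE = re.compile(r"\s*(?P<label>.+?)\s*\((?P<unit>[^)]*)\)\s*")


def parse(text):
    """Split the made-up format into metadata and unit-annotated columns."""
    metadata, columns = {}, []
    in_block = False
    for line in text.splitlines():
        if line.startswith("# start csv block"):
            in_block = True
        elif line.startswith("# end csv block"):
            in_block = False
        elif in_block and line.startswith("#"):
            # The comment line inside the block names the columns and units.
            for part in line.lstrip("# ").split(","):
                m = COLUMN_RE.fullmatch(part)
                columns.append({"label": m["label"], "unit": m["unit"], "values": []})
        elif in_block and line.strip():
            # Values are kept in their original units; no conversion is done.
            for column, value in zip(columns, line.split(",")):
                column["values"].append(float(value))
        elif ":" in line:
            key, _, value = line.partition(":")
            metadata[key.strip()] = value.strip()
    return {"metadata": metadata, "data": columns}


result = parse(SAMPLE)
print(result["metadata"]["instrument"], [c["unit"] for c in result["data"]])
```

A caller that does want Pa and Kelvin can convert afterwards from the `unit` strings, for example with a units library, without the parser paying that cost by default.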

ml-evs Dec 1, 2022
Maintainer Author

Good point on 0, I guess that is somewhat implicit in the WP1 framing already.

I agree that in many cases these things are not "worth it" or useful, but might reasonably be default behavior of a parser (e.g., any normal person might record the P, T profile at a sensible polling rate, so the parser would always assume it is easy enough to not care and would always do some expensive operation to convert and return every value).

So I guess my question (not super well formulated, I admit) is: do we add this as a hard constraint of expected behavior, express it through the schema somehow, or allow free rein depending on the intention of the parser developer? (which may differ from that of the user, if e.g., the parser knows that such high fidelity is useless [could also be something like stripping 10 TB of zeros from where a sensor was turned off])

Somehow the concept of "parser" to me invokes the idea of a function that operates on data in such a way that the size of the data returned is approximately the same, or is explicitly a subset (e.g., just what the parser deems is metadata). It also invokes the idea of data that is sufficiently small that these concerns are irrelevant, which is why I want to address it...

I can think of a lot of cases when parsing data between servers/browsers where 5 would be the most useful, e.g., taking an echem file from a user and plotting it, where (server's data file parsing capability) > (data file size) >> (ability to plot/analyse the data in the browser). Having an opinionated parser here would be useful.
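
For the plotting case, option 5 could be as simple as the decimation sketched below. This is only one possible approach under stated assumptions: the `downsample` helper and its `max_points` parameter are hypothetical, and plain striding can miss short spikes, so a real implementation might prefer min/max bucketing.

```python
import math


def downsample(values, max_points):
    """Keep at most `max_points` evenly spaced samples (plain decimation).

    Returns the reduced list and the stride used, so that the caller can
    record how lossy the reduction was.
    """
    if max_points < 1:
        raise ValueError("max_points must be at least 1")
    if len(values) <= max_points:
        return list(values), 1
    stride = math.ceil(len(values) / max_points)
    return list(values[::stride]), stride


# Synthetic pressure trace standing in for a large measured array.
pressure = [9.0 + 0.001 * i for i in range(10_000)]
reduced, stride = downsample(pressure, max_points=500)
print(len(reduced), stride)
```

Returning the stride alongside the data is one way of making the lossy step explicit in the output, which is the concern raised above about options 5 and 6.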

kjappelbaum Dec 3, 2022

"I'm not sure 3 is worth the CPU hours, if 2 includes unit annotation (in my view, units are data, not metadata). You probably wouldn't want to convert Hartrees or keV into Joules by default, but retain the ability to do so via something like pint." (quoting @PeterKraus)

Agree. We used to do such conversions in cheminfo, but now we try to write parsers such that they return an as-close-as-possible mapping of the original file to JSON (while also splitting into what is "data" and what is "metadata"). Other packages then consume and potentially process this JSON.

markus1978 Dec 9, 2022

In many cases, that "magic bit that identifies the format" does not exist. For example, there are many .csv files flying around with nothing to them but table headers.

ml-evs Dec 9, 2022
Maintainer Author

Indeed, the point of my made-up format is that it at least elides the issue of file type detection, which will probably need a whole other discussion...

ml-evs · 2022-12-08T23:11:40Z

ml-evs
Dec 8, 2022
Maintainer Author

After the first office hours today we made repos for each of the main topics. I would hope that this discussion thread can continue for general stuff, but if we want to comment on specific ideas/code then we can use PRs over at https://github.com/marda-alliance/metadata_extractors_schema (you will see a similar comment on each of the other threads with the appropriate link).

0 replies

ml-evs · 2022-12-16T14:31:27Z

ml-evs
Dec 16, 2022
Maintainer Author

After some feedback from @PeterKraus, I've just merged a draft file type schema at https://github.com/marda-alliance/metadata_extractors_schema that we can begin iterating on. The schema is authored as YAML using LinkML and can be converted into many different formats (e.g., JSONSchema, auto-generated Python models etc).

We will work a bit before the next meeting to generate a demo of the registry and API based on this schema.

Please take a look if you are interested!

0 replies

PeterKraus · 2023-01-25T06:06:42Z

PeterKraus
Jan 25, 2023
Maintainer

Parsing of coupled files

This was a discussion item raised in our Office Hours on 2023-01-24: What do we do for FileTypes where the full information required to parse the (meta)-data is split between multiple files?

As an example, @ml-evs mentioned the Bruker Topspin files, where for example the nmrglue library requires the parent folder or "stem" of the files:

https://nmrglue.readthedocs.io/en/latest/examples/proc_bruker_1d.html#instructions

https://github.com/jjhelmus/nmrglue/blob/a3494234497a4507537c8fc5a89081bdb09b1954/nmrglue/fileio/bruker.py#L294-L296

I see two options on how to deal with this:

1) define a FileType for each file type, and allow Extractors to require multiple FileTypes, i.e. punt this to the Extractor level
2) introduce a one-way (or two-way) dependency on other FileTypes into the "main" FileType class

I'm in favour of 1), as in both cases we would still need to touch the Extractor schema (either to allow multiple FileTypes, or to understand these dependencies). The added benefit of 1) is that it might eventually be possible to parse those FileTypes separately and extract something (metadata?), so etching such a dependency into stone might be shortsighted.
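
A rough sketch of the first option, in which an Extractor declares several required FileTypes and resolves them within a folder, might look like this. The `FileType` and `Extractor` classes here are simplified stand-ins rather than the draft schema's actual classes, and the file names `params.txt` and `signal.bin` are made up for the example.

```python
import tempfile
from dataclasses import dataclass
from fnmatch import fnmatch
from pathlib import Path
from typing import Dict, Tuple


@dataclass(frozen=True)
class FileType:
    """Hypothetical FileType entry: an identifier plus a file name glob."""

    id: str
    glob: str


@dataclass
class Extractor:
    """Hypothetical Extractor that needs one file of each listed FileType."""

    name: str
    required: Tuple[FileType, ...]

    def resolve(self, folder) -> Dict[str, Path]:
        """Map each required FileType id to a file in `folder`, or raise."""
        files = sorted(p for p in Path(folder).iterdir() if p.is_file())
        found = {}
        for file_type in self.required:
            hits = [p for p in files if fnmatch(p.name, file_type.glob)]
            if not hits:
                raise FileNotFoundError(
                    f"no file of type {file_type.id!r} in {folder}"
                )
            found[file_type.id] = hits[0]
        return found


coupled = Extractor(
    name="coupled-example",
    required=(FileType("parameters", "params.txt"), FileType("signal", "*.bin")),
)

with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "params.txt").write_text("points: 3\n")
    (Path(folder) / "signal.bin").write_bytes(b"\x00\x01\x02")
    resolved = coupled.resolve(folder)
    resolved_names = {key: path.name for key, path in resolved.items()}

print(resolved_names)
```

Because each FileType stays independent, a second Extractor could later list only `parameters` and pull out metadata on its own, which is the flexibility argued for above.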

0 replies

PeterKraus · 2023-01-25T06:16:17Z

PeterKraus
Jan 25, 2023
Maintainer

Metadata vs data

OK, at the last Office Hours (2023-01-24), we agreed it is time we discuss this. This might be of particular importance for resolving #7 (comment), as I imagine a two-pass extraction might go a long way:

first pass: extract metadata, see array sizes, if sane continue, if dumb abort
second pass: extract data, optionally with any down-/re-sampling we choose to implement
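
The two-pass idea could be sketched as follows. The `ToyExtractor` class, its `extract_metadata` and `extract_data` methods, and the `MAX_VALUES` threshold are all hypothetical; they stand in for whatever interface the Extractors end up exposing.

```python
from dataclasses import dataclass
from typing import List

MAX_VALUES = 1_000_000  # arbitrary "sane" limit chosen for this sketch


@dataclass
class Metadata:
    columns: List[str]
    n_rows: int


class ToyExtractor:
    """Hypothetical extractor exposing separate metadata and data passes."""

    def __init__(self, rows):
        self._rows = rows

    def extract_metadata(self):
        # Cheap pass: report the shape without returning the values.
        return Metadata(columns=["pressure", "temperature"], n_rows=len(self._rows))

    def extract_data(self, stride=1):
        # Expensive pass: return every `stride`-th row.
        return self._rows[::stride]


def two_pass(extractor, max_values=MAX_VALUES, allow_downsampling=True):
    """First pass checks array sizes; second pass extracts (reduced) data."""
    meta = extractor.extract_metadata()
    n_values = meta.n_rows * len(meta.columns)
    if n_values <= max_values:
        return meta, extractor.extract_data()
    if not allow_downsampling:
        raise ValueError(f"{n_values} values exceed the limit of {max_values}")
    stride = -(-n_values // max_values)  # ceiling division
    return meta, extractor.extract_data(stride=stride)


rows = [(9.0, 94.0)] * 1000
meta, data = two_pass(ToyExtractor(rows), max_values=200)
print(meta.n_rows, len(data))
```

Here the caller, not the Extractor, decides what counts as "sane", which fits the view expressed earlier in the thread that such choices belong to the software calling the extractor.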

The questions on which I would like opinions from "the community" are:

1. should we, in this MaRDA WG, respect what the Extractors consider data|metadata or roll our own?
2. how do we come up with a useful Metadata schema?
3. should the API developed in this MaRDA WG provide any required elements of the Metadata, calculating things such as array sizes if they're not returned by the Extractors?

1 reply

PeterKraus Jan 25, 2023
Maintainer

I'll put my 2 cents here:

1. I think we should trust the Extractors to know what they're doing, at least for now. However, if the Extractor does not provide an idea about what kind of data (column headers, numbers of rows) it can extract from the provided files, this should be scraped "somehow" from the data, perhaps as an optional feature.
2. LinkML gets us a long way for semantic equivalency. I think a first step from the MaRDA WG side could be a lightweight "equivalency" Metadata schema; a second step would be a way to provide a mapping between this Metadata schema and the extracted metadata entries, which should be implemented in each Extractor separately.
3. Yes.
