Replies: 7 comments 12 replies
-
|
In NOMAD we use regular expressions on file names, MIME types, and file contents to match parsers to files. This allows us to determine, with reasonable effort, whether a parser is applicable. It's not always 100% correct, but it is practical. |
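To make the matching idea concrete, here is a minimal sketch of regex-based parser matching. It is not NOMAD's actual implementation or API: the `ParserRule` class, the `candidate_parsers` function, and both example rules are hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ParserRule:
    """Hypothetical matching rule: every pattern that is set must match."""
    name: str
    filename_re: Optional[re.Pattern] = None
    mime_re: Optional[re.Pattern] = None
    content_re: Optional[re.Pattern] = None

    def matches(self, filename: str, mime_type: str, head: str) -> bool:
        # `head` is only the first chunk of the file, so matching stays cheap.
        checks = (
            (self.filename_re, filename),
            (self.mime_re, mime_type),
            (self.content_re, head),
        )
        # Note: a rule with no patterns at all would match every file.
        return all(
            pattern.search(text) is not None
            for pattern, text in checks
            if pattern is not None
        )


# Made-up rules; real rules would be supplied by each parser.
RULES: List[ParserRule] = [
    ParserRule(
        name="example-xml",
        filename_re=re.compile(r"\.xml$"),
        mime_re=re.compile(r"(application|text)/xml"),
        content_re=re.compile(r"<calculation\b"),
    ),
    ParserRule(
        name="example-log",
        filename_re=re.compile(r"\.log$"),
        content_re=re.compile(r"^Program ExampleCode", re.MULTILINE),
    ),
]


def candidate_parsers(filename: str, mime_type: str, head: str) -> List[str]:
    """Return the names of all rules that accept the file."""
    return [rule.name for rule in RULES if rule.matches(filename, mime_type, head)]
```

A file can match several rules or none, which fits the "not always 100% correct, but practical" trade-off: the caller still has to decide what to do with ambiguous or empty results.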
-
|
The usual schema-language suspects (such as those listed above) are typically not tailored to scientific data, in either expressiveness or scale. E.g., it is cumbersome to define multi-dimensional data and handle them efficiently (json, xml, csv instead of hdf5, pandas, numpy, ...). Maybe we should not just treat the output as another stream/file, but also as data in, e.g., a Python runtime. |
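As a rough illustration of "output as data in a Python runtime", the sketch below keeps a multi-dimensional block as a flat typed buffer plus a shape, and only produces nested lists when a JSON-style consumer asks for them. The `Grid` class is a hypothetical stand-in; in practice numpy or hdf5 would fill this role, and the standard library is used here only to keep the example dependency-free.

```python
import json
from array import array
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Grid:
    """Hypothetical container: row-major float64 values with an explicit shape."""
    shape: Tuple[int, ...]
    values: array  # expected typecode 'd'

    def __post_init__(self) -> None:
        expected = 1
        for n in self.shape:
            expected *= n
        if len(self.values) != expected:
            raise ValueError(f"shape {self.shape} needs {expected} values")

    def at(self, *index: int) -> float:
        """Look up one element without building nested lists."""
        if len(index) != len(self.shape):
            raise IndexError("wrong number of indices")
        flat = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError("index out of range")
            flat = flat * n + i
        return self.values[flat]

    def to_nested(self):
        """Expand to nested lists, e.g. for JSON serialisation."""
        def build(offset: int, dims: Tuple[int, ...]):
            if len(dims) == 1:
                return list(self.values[offset:offset + dims[0]])
            stride = 1
            for n in dims[1:]:
                stride *= n
            return [build(offset + k * stride, dims[1:]) for k in range(dims[0])]
        return build(0, self.shape)


grid = Grid(shape=(2, 3), values=array("d", [0, 1, 2, 3, 4, 5]))
as_json = json.dumps(grid.to_nested())  # text form, only when needed
```

The point of the design is that the in-memory form is the primary output and the serialised form is derived from it, rather than the other way round.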
-
|
Related to the discussion of multi-dimensional data was the idea of calculating derived quantities. To clarify in a comment some potential situations... Let's say you have some file in a (purposefully gross) home-baked format that looks something like: Are we putting any constraint on what a "parser" or "metadata extractor" (by our definition) should return? e.g.,
We obviously want to allow/encourage 1, 2 and 3. Perhaps 4 as well, if the average value is well-described. 5 would be useful for actual re-use, yet the conversion is somewhat lossy, as is 6. The answer, I guess, would be "whatever the parser wants", but should these differences be somehow expressed in the schema, to make re-use easier? (E.g., an ELN probably does not want to store/index huge useless arrays in primary data, but automated use of a parser that does 2/3/4 would do this by default.) |
-
|
After the first office hours today we made repos for each of the main topics. I would hope that this discussion thread can continue for general stuff, but if we want to comment on specific ideas/code then we can use PRs over at https://github.com/marda-alliance/metadata_extractors_schema (you will see a similar comment on each of the other threads with the appropriate link). |
-
|
After some feedback from @PeterKraus, I've just merged a draft file type schema at https://github.com/marda-alliance/metadata_extractors_schema that we can begin iterating on. The schema is authored as YAML using LinkML and can be converted into many different formats (e.g., JSONSchema, auto-generated Python models etc). We will work a bit before the next meeting to generate a demo of the registry and API based on this schema. Please take a look if you are interested! |
-
Parsing of coupled files
This was a discussion item raised in our Office Hours on 2023-01-24: What do we do for As an example, @ml-evs mentioned the Bruker Topspin files, where the nmrglue library requires the parent folder or "stem" of the files: https://nmrglue.readthedocs.io/en/latest/examples/proc_bruker_1d.html#instructions
I see two options on how to deal with this:
I'm in favour of 1), as in both cases we would still need to touch the |
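For the coupled-files case, one possible approach is to let the caller point at either the folder or any single file inside it, and have the extractor resolve the companions itself. The sketch below is only an assumption about how that could look: `resolve_dataset` is a hypothetical helper, and the default names `fid` and `acqus` are assumed to be the relevant files of a Bruker 1D dataset.

```python
from pathlib import Path
from typing import Dict, Sequence, Union


def resolve_dataset(
    path: Union[str, Path],
    required: Sequence[str] = ("fid", "acqus"),  # assumed companion file names
) -> Dict[str, Path]:
    """Map each required file name to its path in the dataset folder.

    `path` may be the folder itself or any one file inside it.
    """
    p = Path(path)
    folder = p if p.is_dir() else p.parent
    found = {
        name: folder / name
        for name in required
        if (folder / name).is_file()
    }
    missing = [name for name in required if name not in found]
    if missing:
        raise FileNotFoundError(
            f"{folder} is missing coupled file(s): {', '.join(missing)}"
        )
    return found
```

Whichever entry point is used, the extractor ends up with the same folder, which could then be handed to a library that expects the parent directory rather than an individual file.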
-
Metadata vs data
OK, at the last Office Hours (2023-01-24) we agreed it is time we discuss this. This might be of particular importance for resolving #7 (comment), as I imagine a two-pass extraction might go a long way:
The questions I would like opinions on from "the community" are:
|
-
A lightweight metadata schema for parsers and associated tooling for software libraries to self-report: