@LucasZimm this is an issue to compile some of the changes to be worked on during the next weeks. Sorry in advance, because this is going to be a long issue, but I need to show what I mean.
As of now, we have successfully managed to:
- Create an `AbstractParser` and its `parse(files, collection, logger)` functionality: we even created a `masterdata-parser-example` which users can fork (or use as a template) to create their own repos with the proper structure.
- Add entry-point loading in the `openbis-upload-helper`.
- Create a functional app (albeit with the changes we still need to make regarding style and minor backend features).
- Parse our `CollectionType` (its objects and relationships) into the openBIS `Collection`.
- Add validation for `CONTROLLEDVOCABULARY` data types.
This is already good, as users can now create their own parsers with a more OOP-friendly way of handling data. For example:
```python
from bam_masterdata.datamodel.object_types import Chemical, Instrument
from bam_masterdata.parsing import AbstractParser


class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        chemical = Chemical(name="Example Chemical")
        chemical_id = collection.add(chemical)
        instrument = Instrument(name="Example Instrument")
        instrument_id = collection.add(instrument)
        _ = collection.add_relationship(chemical_id, instrument_id)
        logger.info("Parsing finished: Added example chemical and instrument.")
```
From this, users only need to find the classes they want to populate in `bam_masterdata/datamodel/object_types.py` and import them in their parsers.
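For context, the contract that parser authors implement can be imagined as a minimal abstract base class like the following. This is only a sketch: the actual `bam_masterdata.parsing.AbstractParser` may carry more methods or different signatures.

```python
from abc import ABC, abstractmethod


class AbstractParser(ABC):
    """Sketch of the parser contract (hypothetical; the real
    bam_masterdata.parsing.AbstractParser may differ in details)."""

    @abstractmethod
    def parse(self, files, collection, logger):
        """Read the raw `files`, populate `collection` with objects and
        relationships, and report progress through `logger`."""
        raise NotImplementedError
```

Subclasses like `MasterdataParserExample` then only have to override `parse`, which is what makes the fork-a-template workflow possible.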
Now, we could add another feature to that list by improving the `files` reading functionality. The idea would be to give users an interface to load files into a specific class, which returns a dictionary with the parsed quantities. An easy example: imagine `files[0]` is always a JSON file. We could have a class that returns the loaded dictionary in one line:
```python
class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        json_dict = JSONParser(file=files[0]).data
        ...  # other logic here
```
Of course, this is a very easy example, but there are more complex ones we could give similar support, so that parsed info is always returned as a dictionary in the format `<WHATEVER-FILE-FORMAT>Parser(file=...).data`.
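Such a `JSONParser` could be as small as the sketch below. The class name and the `.data` attribute follow the example above, but nothing here is a final implementation.

```python
import json


class JSONParser:
    """Hypothetical JSONParser: wraps a JSON file and exposes the loaded
    dictionary through `.data`, as in the one-liner above."""

    def __init__(self, file):
        self.file = file

    @property
    def data(self):
        # Load lazily, so constructing the parser object is cheap
        with open(self.file) as handle:
            return json.load(handle)
```

With this in place, `JSONParser(file=files[0]).data` hands the user a plain dictionary they can feed into the object types.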
Imagine now we have a free-text file (like `example.txt`). In this case, users should still be able to specify what to load based on regex. Something like:
```python
class TextParser(FileParser):
    def quantities(self):
        return [
            Quantity("name-of-the-dict-key", r"regex to be matched"),  # ... more args
            ...
        ]


class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        txt_dict = TextParser(file=files[0]).data
```
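A regex-driven `FileParser` along those lines could be sketched as follows. All names here (`Quantity`, `FileParser`, the one-capture-group convention) are assumptions for illustration, not an existing API:

```python
import re
from dataclasses import dataclass


@dataclass
class Quantity:
    """Hypothetical Quantity: maps a dict key to the regex whose first
    capture group extracts its value."""
    name: str
    pattern: str


class FileParser:
    """Hypothetical base class: subclasses declare quantities(), and .data
    runs every regex over the file contents and collects the matches."""

    def __init__(self, file):
        self.file = file

    def quantities(self):
        return []

    @property
    def data(self):
        with open(self.file) as handle:
            text = handle.read()
        parsed = {}
        for quantity in self.quantities():
            match = re.search(quantity.pattern, text)
            if match:
                parsed[quantity.name] = match.group(1)
        return parsed


class TextParser(FileParser):
    def quantities(self):
        return [
            Quantity("temperature", r"Temperature:\s*(\S+)"),
        ]
```

With a file containing a line like `Temperature: 300K`, `TextParser(file=...).data` would return a dictionary keyed by the quantity names, which matches the `.data` convention of the JSON case.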
This would also be an awesome framework. Then users would only need to know:
- How to import `object_types` from the `bam-masterdata` package
- How to load and parse specific quantities from their different file formats (text, JSON, Excel, binary, HDF5...)
Now, we have two options:
1. We code this ourselves.
2. We use the NOMAD functionalities; see the files in https://github.com/nomad-coe/nomad/tree/develop/nomad/parsing/file_parser
I am more inclined to go with option 2, as I know the infrastructure works as in the example I showed above, and the main developer is a very professional guy. But we need a certain flexibility in handling this, because there is room for improvement on NOMAD's side.