
Parsing various file formats #164

@JosePizarro3

@LucasZimm this is an issue to compile some of the changes to be worked on during the next weeks. Apologies in advance: this is going to be a long issue, but I need to show what I mean.

As of now, we have successfully managed to:

  • Create an AbstractParser and its parse(files, collection, logger) functionality; we even created a masterdata-parser-example which users can fork (or use as a template) to create their own repos with the proper structure.
  • Add entry-point loading in the openbis-upload-helper.
  • Create a functional app (although we still need to make some changes regarding style and minor backend features).
  • Parse our CollectionType (its objects and relationships) into the openBIS Collection.
  • Add validation for the CONTROLLEDVOCABULARY data type.

This is already good, as users can now create their own parsers with a more OOP-friendly way of handling data, for example:

from bam_masterdata.datamodel.object_types import Chemical, Instrument
from bam_masterdata.parsing import AbstractParser


class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        # Instantiate the object types and add them to the collection
        chemical = Chemical(name="Example Chemical")
        chemical_id = collection.add(chemical)
        instrument = Instrument(name="Example Instrument")
        instrument_id = collection.add(instrument)
        # Link the two objects inside the collection
        _ = collection.add_relationship(chemical_id, instrument_id)
        logger.info("Parsing finished: Added example chemical and instrument.")

From this, users only need to find the classes they want to populate in bam_masterdata/datamodel/object_types.py and import them in their parsers.

Now, we could add another feature as a bullet point there by improving the file-reading functionality. The idea would be to provide an interface that lets users load files into a specific class, so that a dictionary with the parsed quantities is returned. An easy example: imagine files[0] is always a JSON file. We could have a class that returns the loaded dictionary in one line:

class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        json_dict = JSONParser(file=files[0]).data
        ... # other logic here

Of course, this is a very easy example, but there are more complex formats we could support in a similar way, so that parsed info is always returned as a dictionary in the form <WHATEVER-FILE-FORMAT>Parser(file=...).data.
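
Just to sketch the idea in case we code it ourselves (option 1 below): everything here is hypothetical except the JSONParser name and the .data interface; FileParser is the base class also used in the text example further down.

import json
from abc import ABC, abstractmethod


class FileParser(ABC):
    """Base class: subclasses load `file` and expose the parsed content via `.data`."""

    def __init__(self, file: str):
        self.file = file
        self._data = None

    @property
    def data(self) -> dict:
        # Parse the file lazily the first time `.data` is accessed
        if self._data is None:
            self._data = self.load()
        return self._data

    @abstractmethod
    def load(self) -> dict:
        """Format-specific loading logic implemented by each subclass."""


class JSONParser(FileParser):
    def load(self) -> dict:
        with open(self.file, encoding="utf-8") as handle:
            return json.load(handle)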

Now imagine we have a free-text file (like example.txt). In this case, users should still be able to specify what to load based on regular expressions. Something like:

class TextParser(FileParser):
    def quantities(self):
        return [
            Quantity("name-of-the-dict-key, r"regex to be matched", ... more args),
            ...
        ]

class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        txt_dict = TextParser(file=files[0]).data
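
Continuing the sketch from above (again hypothetical, assuming we implement it ourselves), the Quantity class and the regex matching could look roughly like this; TextParser would then inherit from a text-specific base such as RegexTextParser instead of FileParser directly:

import re
from dataclasses import dataclass


@dataclass
class Quantity:
    """Maps a dict key to a regex; the first capturing group becomes the value."""
    name: str
    pattern: str
    repeats: bool = False  # collect every match instead of only the first one


class RegexTextParser(FileParser):
    """Runs every Quantity regex over the file content and builds the dict."""

    def quantities(self) -> list:
        return []  # users override this, as in the TextParser example above

    def load(self) -> dict:
        with open(self.file, encoding="utf-8") as handle:
            text = handle.read()
        data = {}
        for quantity in self.quantities():
            matches = [m.group(1) for m in re.finditer(quantity.pattern, text)]
            if matches:
                data[quantity.name] = matches if quantity.repeats else matches[0]
        return data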

This would give us a really nice framework. Users would then only need to know:

  • How to import object_types from the bam-masterdata package
  • How to load and parse specific quantities from their different file formats (text, JSON, Excel, binary, HDF5, ...)

Now, we have two options:

  1. We code this ourselves.
  2. We use the NOMAD functionalities. See the files in https://github.com/nomad-coe/nomad/tree/develop/nomad/parsing/file_parser

I am more inclined to go with option 2, as I know that the infrastructure works as in the example I showed above and that the main developer is very professional, but we need a certain flexibility to handle this, because there is room for improvement on NOMAD's side.
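
For reference, and to be double-checked against the linked repository, my recollection of NOMAD's file_parser interface is roughly the following (the quantity names, regexes and file name are made up for illustration):

from nomad.parsing.file_parser import TextParser, Quantity

# Quantities are defined via regex, very much like the TextParser example above
mainfile_parser = TextParser(quantities=[
    Quantity("chemical_name", r"chemical\s*=\s*(\S+)"),
    Quantity("temperature", r"temperature\s*=\s*([\d\.]+)", repeats=True),
])

mainfile_parser.mainfile = "example.txt"  # point the parser to the file to read
chemical_name = mainfile_parser.get("chemical_name")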
