
Parsing various file formats #164

@JosePizarro3

@LucasZimm this is an issue to compile some of the changes to be worked on during the next weeks. Apologies in advance: this is going to be a long issue, but I need to show what I mean.

As of now, we have successfully managed to:

  • Create an AbstractParser and its parse(files, collection, logger) functionality; we even created a masterdata-parser-example which users can fork (or use as a template) to create their own repos with the proper structure.
  • Add entry-point loading in the openbis-upload-helper.
  • Create a functional app (although we still need to make some changes regarding style and minor backend features).
  • Parse our CollectionType (its objects and relationships) into the openBIS Collection.
  • Add validation for the CONTROLLEDVOCABULARY data type.

This is already good, as users can now create their own parsers with a more OOP-friendly way of handling data, for example:

from bam_masterdata.datamodel.object_types import Chemical, Instrument
from bam_masterdata.parsing import AbstractParser


class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        # Instantiate the object types and add them to the collection
        chemical = Chemical(name="Example Chemical")
        chemical_id = collection.add(chemical)
        instrument = Instrument(name="Example Instrument")
        instrument_id = collection.add(instrument)
        # Link the two objects inside the collection
        _ = collection.add_relationship(chemical_id, instrument_id)
        logger.info("Parsing finished: Added example chemical and instrument.")

From this, users only need to find the classes they want to populate in bam_masterdata/datamodel/object_types.py and import them in their parsers.

Now, we could add another feature as a bullet point there by improving the file-reading functionality. The idea would be to provide an interface that lets users load files into a specific class, so that a dictionary with the parsed quantities is returned. An easy example: imagine files[0] is always a JSON file. We could have a class that returns the loaded dictionary in one line:

class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        json_dict = JSONParser(file=files[0]).data
        ... # other logic here

Of course, this is a very easy example, but there are more complex formats we could support in a similar way, so that parsed info is always returned as a dictionary in the form <WHATEVER-FILE-FORMAT>Parser(file=...).data.
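
Just to sketch the idea in case we code it ourselves (option 1 below): everything here is hypothetical except the JSONParser name and the .data interface; FileParser is the base class also used in the text example further down.

import json
from abc import ABC, abstractmethod


class FileParser(ABC):
    """Base class: subclasses load `file` and expose the parsed content via `.data`."""

    def __init__(self, file: str):
        self.file = file
        self._data = None

    @property
    def data(self) -> dict:
        # Parse the file lazily the first time `.data` is accessed
        if self._data is None:
            self._data = self.load()
        return self._data

    @abstractmethod
    def load(self) -> dict:
        """Format-specific loading logic implemented by each subclass."""


class JSONParser(FileParser):
    def load(self) -> dict:
        with open(self.file, encoding="utf-8") as handle:
            return json.load(handle)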

Now imagine we have a free-text file (like example.txt). In this case, users should still be able to specify what to load based on regular expressions. Something like:

class TextParser(FileParser):
    def quantities(self):
        return [
            Quantity("name-of-the-dict-key, r"regex to be matched", ... more args),
            ...
        ]

class MasterdataParserExample(AbstractParser):
    def parse(self, files, collection, logger):
        txt_dict = TextParser(file=files[0]).data
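
Continuing the sketch from above (again hypothetical, assuming we implement it ourselves), the Quantity class and the regex matching could look roughly like this; TextParser would then inherit from a text-specific base such as RegexTextParser instead of FileParser directly:

import re
from dataclasses import dataclass


@dataclass
class Quantity:
    """Maps a dict key to a regex; the first capturing group becomes the value."""
    name: str
    pattern: str
    repeats: bool = False  # collect every match instead of only the first one


class RegexTextParser(FileParser):
    """Runs every Quantity regex over the file content and builds the dict."""

    def quantities(self) -> list:
        return []  # users override this, as in the TextParser example above

    def load(self) -> dict:
        with open(self.file, encoding="utf-8") as handle:
            text = handle.read()
        data = {}
        for quantity in self.quantities():
            matches = [m.group(1) for m in re.finditer(quantity.pattern, text)]
            if matches:
                data[quantity.name] = matches if quantity.repeats else matches[0]
        return data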

This would give us a really nice framework. Users would then only need to know:

  • How to import object_types from the bam-masterdata package
  • How to load and parse specific quantities from their different file formats (text, JSON, Excel, binary, HDF5, ...)

Now, we have two options:

  1. We code this ourselves.
  2. We use the NOMAD functionalities. See the files in https://github.com/nomad-coe/nomad/tree/develop/nomad/parsing/file_parser

I am more inclined to go with option 2, as I know that the infrastructure works as in the example I showed above and that the main developer is very professional, but we need a certain flexibility to handle this, because there is room for improvement on NOMAD's side.
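
For reference, and to be double-checked against the linked repository, my recollection of NOMAD's file_parser interface is roughly the following (the quantity names, regexes and file name are made up for illustration):

from nomad.parsing.file_parser import TextParser, Quantity

# Quantities are defined via regex, very much like the TextParser example above
mainfile_parser = TextParser(quantities=[
    Quantity("chemical_name", r"chemical\s*=\s*(\S+)"),
    Quantity("temperature", r"temperature\s*=\s*([\d\.]+)", repeats=True),
])

mainfile_parser.mainfile = "example.txt"  # point the parser to the file to read
chemical_name = mainfile_parser.get("chemical_name")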
