More closely follow file system in RO-Crate metadata json #543

Merged
merged 5 commits into from
Jul 29, 2025

@HLWeil HLWeil commented Jul 28, 2025

connected to nfdi4plants/arc-export#54

Background

In ARC Scaffold Annotation Tables, Data objects can be used as Inputs or Outputs. In the actual ARC Filesystem, these Data annotations can refer to three different entities:

  • File
  • File Fragment
  • Folder

All of these make sense and there is currently no intention to change this annotation in the ARC Scaffold.

The problem arises when mapping the ARC Scaffold metadata to the ARC RO-Crate metadata JSON. The RO-Crate is meant to be used for various tasks and should be as semantically sound as possible. This is not the case at the moment: Folder objects in the Annotation Table are simply mapped to Files in the RO-Crate, which loses both the knowledge of their actual type and the files they contain (see nfdi4plants/arc-export#54 for why this is problematic).

Implemented Solution

We don't want to access the actual Filesystem when mapping the metadata from Scaffold to RO-Crate, so in this PR I implemented logic that uses the In-Memory Filesystem stored as a field in the ARC object. From there we can check whether an object annotated as Data is actually a file or a folder and map it accordingly.

If the object is a folder, I currently add a second type, Dataset, in addition to File, and create and reference all files it contains (including those in subfolders) via the hasPart property.
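
The mapping rule described above can be sketched roughly as follows. This is a minimal illustration in Python, not the code from this PR: the function name `map_data_entity` and the representation of the In-Memory Filesystem as a plain set of relative file paths (`file_paths`) are assumptions made for the example.

```python
from typing import Any


def map_data_entity(path: str, file_paths: set[str]) -> list[dict[str, Any]]:
    """Map one annotated Data path to RO-Crate entities.

    `file_paths` stands in for the in-memory filesystem: the set of all
    file paths known to the ARC, relative to the ARC root (an assumption).
    """
    prefix = path.rstrip("/") + "/"
    # Any known file below `path` means that `path` is a folder.
    children = sorted(p for p in file_paths if p.startswith(prefix))

    if not children:
        # Plain file (or a path unknown to the filesystem): keep the File mapping.
        return [{"@id": path, "@type": "File", "name": path}]

    # Folder: one File entity per contained file, plus the folder itself,
    # typed as both File and Dataset and linking its files via hasPart.
    entities: list[dict[str, Any]] = [
        {"@id": c, "@type": "File", "name": c} for c in children
    ]
    entities.append({
        "@id": path,
        "@type": ["File", "Dataset"],
        "name": path,
        "hasPart": [{"@id": c} for c in children],
    })
    return entities


# Hypothetical in-memory filesystem for the ABC.D example.
fs = {
    "assays/MyAssay/dataset/ABC.D/SubFile.txt",
    "assays/MyAssay/dataset/ABC.D/SubFolder/SubSubFile.txt",
    "assays/MyAssay/isa.assay.xlsx",
}
folder_entities = map_data_entity("assays/MyAssay/dataset/ABC.D", fs)
file_entities = map_data_entity("assays/MyAssay/isa.assay.xlsx", fs)
```

Because the check only looks at paths held in memory, no disk access is needed; a path that is not a prefix of any known file simply stays a File.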

E.g. when we have an annotation table that references the folder ABC.D as Input [Data], which in turn contains the two files SubFile.txt and SubFolder/SubSubFile.txt, the output RO-Crate metadata will contain the following objects:

[
    {
      "@id": "assays/MyAssay/dataset/ABC.D/SubFile.txt",
      "@type": "File",
      "name": "assays/MyAssay/dataset/ABC.D/SubFile.txt"
    },
    {
      "@id": "assays/MyAssay/dataset/ABC.D/SubFolder/SubSubFile.txt",
      "@type": "File",
      "name": "assays/MyAssay/dataset/ABC.D/SubFolder/SubSubFile.txt"
    },
    {
      "@id": "assays/MyAssay/dataset/ABC.D",
      "@type": [
        "File",
        "Dataset"
      ],
      "name": "assays/MyAssay/dataset/ABC.D",
      "hasPart": [
        {
          "@id": "assays/MyAssay/dataset/ABC.D/SubFile.txt"
        },
        {
          "@id": "assays/MyAssay/dataset/ABC.D/SubFolder/SubSubFile.txt"
        }
      ]
    },
    {
      "@id": "#Process_A_MyAssay_MyTable_1",
      "@type": "LabProcess",
      "name": "MyTable_1",
      "object": {
        "@id": "assays/MyAssay/dataset/ABC.D"
      },
      "result": [],
      "executesLabProtocol": {
        "@id": "#Protocol_MyTable"
      }
    }
]

To go full circle, the file and folder objects referenced in the ARC RO-Crate will also be put into the Filesystem of the ARC Scaffold object.
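
A rough sketch of this reverse direction, again in Python and under stated assumptions: the helper `collect_paths` is hypothetical, the graph is taken to be a flat list of entity dictionaries, and how the PR actually builds the Scaffold Filesystem from these paths is not shown here.

```python
from typing import Any


def collect_paths(graph: list[dict[str, Any]]) -> tuple[set[str], set[str]]:
    """Return (file paths, folder paths) referenced by File/Dataset entities."""
    files: set[str] = set()
    folders: set[str] = set()
    for entity in graph:
        types = entity.get("@type", [])
        if isinstance(types, str):
            types = [types]  # normalize a single type to a list
        entity_id = entity.get("@id", "")
        if entity_id.startswith("#"):
            continue  # local identifiers (e.g. processes) are not paths
        if "Dataset" in types:
            folders.add(entity_id)
        elif "File" in types:
            files.add(entity_id)
    return files, folders


graph = [
    {"@id": "assays/MyAssay/dataset/ABC.D/SubFile.txt", "@type": "File"},
    {"@id": "assays/MyAssay/dataset/ABC.D/SubFolder/SubSubFile.txt", "@type": "File"},
    {"@id": "assays/MyAssay/dataset/ABC.D", "@type": ["File", "Dataset"]},
    {"@id": "#Process_A_MyAssay_MyTable_1", "@type": "LabProcess"},
]
files, folders = collect_paths(graph)
```

Entities typed as both File and Dataset are treated as folders here, so the double typing does not cause a folder to be registered as a file.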

@HLWeil HLWeil requested review from Freymaurer and floWetzels July 28, 2025 10:45
HLWeil commented Jul 28, 2025

A short note on why I used the double type File and Dataset on these folder objects:

The same object satisfies two distinct use cases, which according to the semantics and the profiles each require their own type:

  • Dataset or git perspective: The .D object itself is not tracked by git or git-lfs, so it is just a container for the files that are actually tracked.
  • File or ISA perspective: The .D folder is a singular research entity referenced by the provenance graph. We type these objects as MediaObjects.

@@ -1772,6 +1772,29 @@ let tests_ROCrate =
Expect.sequenceEqual inputCol.Cells expectedCells "First table input column should have correct cells"
/// Assays
Expect.equal arc.AssayCount 2 "ARC should contain 2 assays"
ftestCase "IncludeFilesystem" <| fun _ ->
Collaborator

This is still an ftestCase? I would assume this throws on CI. If not, we can pass Pyxpecto an argument to throw if ftests exist.

Collaborator

I assume this is critical, as it blocks nearly 1900 tests from running.

Member Author

Thanks for noticing! That's quite a big whoopsie on my side. And yes, could you add the argument in a commit to this PR after I fix this specific case?

@Freymaurer Freymaurer self-requested a review July 29, 2025 06:48
@HLWeil HLWeil merged commit 2f33c59 into main Jul 29, 2025
2 checks passed