Context
Note: Conversation started via email, but redirected here for transparency and easy cross-referencing. Everyone please feel free to contribute to the discussion!
There have been multiple conversations about issues with the dataset (e.g., the exact same images being labeled as both M0 and M12: #39) that are being fixed locally, without knowing whether the error is also being fixed at the source and at the institutions using the data for analysis.
In addition, some of the issues are reported to us from a site other than the source site (e.g., #13), so we end up fixing things on our internal server without knowing whether the exact same corrections are also being made at the source site.
Problem: Given that the dataset is not being synced across the multiple user sites, we end up with multiple versions of the dataset that are not being tracked, potentially leading to errors and lack of reproducibility.
Solutions
We should look at ways to version-track the data and its usage across all the user sites. The earlier the better: as time passes, errors accumulate, making it increasingly difficult to reconstruct the history.
Track source dataset with git-annex
git-annex is a popular technology for version-tracking datasets, built on top of git. It is notably used by DataLad, a reference tool in the neuroimaging community for sharing data and performing reproducible science. An excellent solution would be to convert the source repository into a git-annex repository, and make modifications through regular git commit/push operations that are trackable.
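For illustration, here is a minimal sketch of what the initial conversion could look like at the source site (assuming git and git-annex are installed; the folder name and file paths are hypothetical):

cd canproco                       # hypothetical dataset folder
git init
git annex init "source site"
git annex add .                   # large image files are moved into the annex
git commit -m "Initial import of the dataset"
# Subsequent corrections then become regular, trackable commits, e.g.:
git annex unlock sub-xxx/ses-M12/anat/sub-xxx_ses-M12_T1w.nii.gz    # hypothetical file to fix
cp /path/to/corrected_T1w.nii.gz sub-xxx/ses-M12/anat/sub-xxx_ses-M12_T1w.nii.gz
git annex add sub-xxx/ses-M12/anat/sub-xxx_ses-M12_T1w.nii.gz
git commit -m "Replace duplicated M0/M12 image (#39)"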
Several levels of permissions are possible:
- The most restrictive: the source site would be the only one with R/W permission, and no other site would have either R or W permission (due to network access limitations at the source site, for security reasons). The source site would manage the git-annex repository and distribute (e.g., via secured SFTP) specific versions of the repository (with a specific commit SHA).
- Less restrictive: the source site has R/W access, and some sites have R access, so that they can git-annex checkout/pull a specific version of the repository. Pros: less manual work to distribute the data. Cons: possible security concerns from the source IT management team.
- Even less restrictive: the source site has R/W access, and some sites have R/W access, so that they can fetch data and push contributions (e.g., manual segmentations, see below). Pros: less manual work, less prone to human error when copying from a collaborating site to the source site. Cons: security issues (likely not going to work).
I think that option 1 is the most realistic/reasonable given the IT context.
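To make option 1 more concrete, here is a rough sketch of how the source site could pin a specific version and package it for SFTP distribution (the tag name, working directory, and archive name are illustrative assumptions, not an agreed convention):

git tag CANPROCO_vX.Y                      # pin the released state
git rev-parse CANPROCO_vX.Y                # record the exact commit SHA to share with user sites
tar --exclude='.git' -czhf ../CANPROCO_vX.Y.tar.gz .   # -h dereferences annex symlinks so actual file content is packaged
# The archive (and the commit SHA) can then be transferred over the secured SFTP channel.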
Create manual checksum
If a git-annex repository is not possible, or while it is being implemented, a "quick and dirty" solution is for the source site to create a checksum of all files in the dataset (recursively), which could be done with:
find * -type f -exec shasum -b -a 256 {} \; > CANPROCO_vX.Y
The sums can then be verified at the collaborating sites with:
shasum -c -a 256 CANPROCO_vX.Y
Additional usage
We should also consider that other sites might contribute to the dataset, e.g., with manual segmentation labels. Using git-annex would be a means to push those segmentations to the source repository, so that they could also serve other sites for analysis.
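As an illustration, assuming a collaborating site were granted write access (or prepared the files for the source site to import), a contribution could look like the following sketch; the derivatives path and filename are hypothetical and follow a BIDS-like convention:

git annex add derivatives/labels/sub-xxx/ses-M0/anat/sub-xxx_ses-M0_lesion-manual.nii.gz
git commit -m "Add manual lesion segmentation for sub-xxx"
git annex sync --content            # pushes both the git history and the annexed file content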
Resource
Examples of multi-site data managed with git-annex
- https://github.com/spine-generic/data-multi-subject#spine-generic-public-database-multi-subject (source code for deploying git-annex server is available in the documentation of the repository)