Context
Note: Conversation started via email, but redirected here for transparency and easy cross-referencing. Everyone please feel free to contribute to the discussion!
There have been multiple conversations about issues with the dataset (e.g., the exact same images being labeled as both M0 and M12: #39) that are being fixed locally, without knowing whether the error is also being fixed at the source and at the institutions using the data for analysis.
In addition, some of the issues are reported to us from a site other than the source site (e.g., #13), so we end up fixing things on our internal server without knowing whether the exact same corrections are also being made at the source site.
Problem: Given that the dataset is not being synced across the multiple user sites, we end up with multiple versions of the dataset that are not being tracked, potentially leading to errors and lack of reproducibility.
Solutions
We should look at ways to version-track the data and its usage across all the user sites. The earlier the better: as time passes, errors accumulate, making it increasingly difficult to reconstruct the history.
Track source dataset with git-annex
git-annex is a popular technology for version-tracking datasets, built on top of git. It is notably used by DataLad, a reference tool in the neuroimaging community for sharing data and performing reproducible science. An excellent solution would be to convert the source repository into a git-annex repository, and make modifications through regular git commit/push operations that are trackable.
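For illustration, here is a minimal sketch of what the initial conversion could look like at the source site (assuming git and git-annex are installed; the folder name and file paths are hypothetical):

cd canproco                       # hypothetical dataset folder
git init
git annex init "source site"
git annex add .                   # large image files are moved into the annex
git commit -m "Initial import of the dataset"
# Subsequent corrections then become regular, trackable commits, e.g.:
git annex unlock sub-xxx/ses-M12/anat/sub-xxx_ses-M12_T1w.nii.gz    # hypothetical file to fix
cp /path/to/corrected_T1w.nii.gz sub-xxx/ses-M12/anat/sub-xxx_ses-M12_T1w.nii.gz
git annex add sub-xxx/ses-M12/anat/sub-xxx_ses-M12_T1w.nii.gz
git commit -m "Replace duplicated M0/M12 image (#39)"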
Several levels of permissions are possible:
- The most restrictive: the source site would be the only one with R/W permission, and no other site would have either R or W permission (due to network access limitations at the source site, for security reasons). The source site would manage the git-annex repository and distribute (e.g., via secured SFTP) specific versions of the repository (with a specific commit SHA).
- Less restrictive: the source site has R/W access, and some sites have R access, so that they can git-annex checkout/pull a specific version of the repository. Pros: less manual work to distribute the data. Cons: possible security concerns from the source IT management team.
- Even less restrictive: the source site has R/W access, and some sites have R/W access, so that they can fetch data and push contributions (e.g., manual segmentations, see below). Pros: less manual work, less prone to human error when copying from a collaborating site to the source site. Cons: security issues (likely not going to work).
I think that option 1 is the most realistic/reasonable given the IT context.
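To make option 1 more concrete, here is a rough sketch of how the source site could pin a specific version and package it for SFTP distribution (the tag name, working directory, and archive name are illustrative assumptions, not an agreed convention):

git tag CANPROCO_vX.Y                      # pin the released state
git rev-parse CANPROCO_vX.Y                # record the exact commit SHA to share with user sites
tar --exclude='.git' -czhf ../CANPROCO_vX.Y.tar.gz .   # -h dereferences annex symlinks so actual file content is packaged
# The archive (and the commit SHA) can then be transferred over the secured SFTP channel.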
Create manual checksum
If a git-annex repository is not possible, or while it is being implemented, a "quick and dirty" solution is for the source site to create a checksum of all files in the dataset (recursively), which could be done with:
find * -type f -exec shasum -b -a 256 {} \; > CANPROCO_vX.Y
The sums can then be verified at the collaborating sites with:
shasum -c -a 256 CANPROCO_vX.Y
Additional usage
We should also consider that other sites might contribute to the dataset, e.g., with manual segmentation labels. Using git-annex would be a means to push those segmentations to the source repository, so that they could also serve other sites for analysis.
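As an illustration, assuming a collaborating site were granted write access (or prepared the files for the source site to import), a contribution could look like the following sketch; the derivatives path and filename are hypothetical and follow a BIDS-like convention:

git annex add derivatives/labels/sub-xxx/ses-M0/anat/sub-xxx_ses-M0_lesion-manual.nii.gz
git commit -m "Add manual lesion segmentation for sub-xxx"
git annex sync --content            # pushes both the git history and the annexed file content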
Resource
Examples of multi-site data managed with git-annex
- https://github.com/spine-generic/data-multi-subject#spine-generic-public-database-multi-subject (source code for deploying git-annex server is available in the documentation of the repository)