Add instructions for using DVC for versioning data #13

Open
shntnu opened this issue Mar 21, 2021 · 2 comments
@shntnu
Member

shntnu commented Mar 21, 2021

@gwaygenomics said this in broadinstitute/lincs-cell-painting#60 (comment):

We might at some point also consider moving from Git LFS to DVC. It was super easy to set up, and it plays very nicely with AWS. I did this in the grit-benchmark repo (in broadinstitute/grit-benchmark#28).

The file pointer is in a readable format (a YAML file):

outs:
- md5: c53856c1596f00a67a636389716d8219
  size: 26948901
  path: cellhealth_single_cell_umap_embeddings_SQ00014610_chr2.tsv.gz
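
To make the pointer concrete: the md5 field is a content hash of the tracked file, so a local copy can be checked against it. The sketch below is one way to do that in shell. `check_dvc_md5` is a hypothetical helper (not part of DVC), and it assumes GNU `md5sum` is available and that the pointer describes a single file. For a binary file such as a `.tsv.gz` the plain MD5 should match; older DVC versions normalized line endings in text files before hashing, so a mismatch on a text file is not necessarily corruption.

```shell
# Hypothetical helper (not part of DVC): compare a file's MD5 with the hash
# recorded in its .dvc pointer. Assumes GNU md5sum and a single-file pointer.
check_dvc_md5() {
  cdm_pointer="$1"   # path to the pointer, e.g. some_file.tsv.gz.dvc
  cdm_target="$2"    # path to the data file itself

  # Take the hash from the first "md5:" entry, with or without the
  # leading "- " list marker used in the pointer's outs section.
  cdm_expected=$(awk '
    $1 == "-" && $2 == "md5:" { print $3; exit }
    $1 == "md5:"              { print $2; exit }
  ' "$cdm_pointer")

  # md5sum prints "<hash>  <filename>"; keep only the hash.
  cdm_actual=$(md5sum "$cdm_target" | awk '{ print $1 }')

  # Succeeds (exit status 0) only when both hashes are identical.
  [ -n "$cdm_expected" ] && [ "$cdm_expected" = "$cdm_actual" ]
}
```

For the pointer above, this would be called as `check_dvc_md5 cellhealth_single_cell_umap_embeddings_SQ00014610_chr2.tsv.gz.dvc cellhealth_single_cell_umap_embeddings_SQ00014610_chr2.tsv.gz && echo "hash matches"`.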

Steps

  1. Read the docs https://dvc.org/doc/start
  2. Create a destination prefix (a "folder") on S3, which will be the remote storage location for dvc.
  3. Add the dvc and dvc-s3 dependencies
  4. Update your .gitignore to ignore the files you previously tracked with Git LFS
  5. Follow steps here https://dvc.org/doc/start and here https://dvc.org/doc/start/data-versioning
@shntnu
Member Author

shntnu commented Jun 1, 2021

@gwaygenomics bumping this because you asked for it in some other thread

@gwaybio
Member

gwaybio commented Jun 16, 2021

Steps:

  • Add dvc and dvc-s3 to the conda environment
  • Navigate to the top folder of the repository and run dvc init
  • Add and commit all the auto-generated files
  • Navigate to the folder you want to track with DVC
  • Run dvc add <FOLDER>. (Note: this might throw an error if Git is already tracking these files. If it does, DVC will print instructions on how to fix it. After fixing the error, run dvc add <FOLDER> again.)
  • DVC will need to compute hashes for all the files so this might take a while
  • Run git add <FOLDER>.dvc .gitignore and commit. This will add the dvc pointer and ignore the actual data
  • Add the s3 bucket location to the dvc remote (e.g. dvc remote add -d cellpainting-datasets s3://cellpainting-datasets/lincs-cell-painting/.dvc/cache)
  • Sync the github repo to aws (e.g. aws s3 sync lincs-cell-painting/ s3://cellpainting-datasets/lincs-cell-painting)
    • NOTE: make sure you have run aws configure already!
  • Run dvc push
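
The steps above can be sketched as a single shell function. This is only an illustration under some assumptions: `dvc_track_and_push` and its three arguments are hypothetical names, the data folder is assumed to sit directly under the repository root (so no `cd` is needed), and the commit messages are made up. Committing `.dvc/config` after adding the remote is not in the list above; it is included here because `dvc remote add` writes the remote to that file, and people who clone the repo need it to find the remote. The `aws s3 sync` step is left as a comment because the example in the list is run from the parent directory of the repo.

```shell
# Hypothetical wrapper around the maintainer-side steps. Run from the
# repository root, with dvc and dvc-s3 installed and `aws configure` done.
dvc_track_and_push() {
  dtp_folder="$1"        # folder to track, directly under the repo root
  dtp_remote_name="$2"   # e.g. cellpainting-datasets
  dtp_remote_url="$3"    # e.g. s3://cellpainting-datasets/lincs-cell-painting/.dvc/cache

  # Initialize DVC and commit the files it generates.
  dvc init || return 1
  git add .dvc .dvcignore && git commit -m "Initialize DVC" || return 1

  # Hash the data and write <folder>.dvc. If Git already tracks the folder,
  # DVC reports an error with a suggested fix (typically a
  # `git rm -r --cached` on the folder); fix it and rerun.
  dvc add "$dtp_folder" || return 1

  # Commit the pointer and the .gitignore entry that hides the real data.
  git add "$dtp_folder.dvc" .gitignore || return 1
  git commit -m "Track $dtp_folder with DVC" || return 1

  # Register the S3 location as the default remote (-d).
  dvc remote add -d "$dtp_remote_name" "$dtp_remote_url" || return 1

  # Assumption, not in the list above: commit the remote configuration.
  git add .dvc/config && git commit -m "Configure DVC remote" || return 1

  # The list also syncs the repo to S3 from its parent directory, e.g.
  #   aws s3 sync lincs-cell-painting/ s3://cellpainting-datasets/lincs-cell-painting

  # Upload the cached data to the remote.
  dvc push
}
```

With the values from the list, the call would look like `dvc_track_and_push <FOLDER> cellpainting-datasets s3://cellpainting-datasets/lincs-cell-painting/.dvc/cache`.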

For a user to access these files, all they need to do is clone the repo and then run dvc pull. That's it!
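
The user side is short enough to sketch directly. `fetch_dvc_data` and its two arguments are hypothetical; the repository URL is whatever repo holds the .dvc pointers. It assumes dvc and dvc-s3 are installed and, if the bucket is not public, that AWS credentials with read access are already configured.

```shell
# Hypothetical helper: clone a repo and download its DVC-tracked data.
fetch_dvc_data() {
  fdd_repo_url="$1"   # Git URL of the repository (placeholder)
  fdd_dest="$2"       # local directory to clone into

  git clone "$fdd_repo_url" "$fdd_dest" || return 1

  # Run dvc pull inside the clone; the subshell leaves the caller's
  # working directory unchanged.
  ( cd "$fdd_dest" && dvc pull )
}
```

Usage would be something like `fetch_dvc_data <REPO_URL> my-clone`, after which the files referenced by the .dvc pointers should appear in the working tree.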
