Metadata should be uploaded to storage system after result files are uploaded #24

Open
wshands opened this issue Oct 3, 2017 · 4 comments

wshands commented Oct 3, 2017

The metadata describing the result files produced by a pipeline is currently uploaded before the result files themselves are uploaded to the storage system. These steps should be reversed, because the result file upload is the step more likely to fail. If it fails, we are left with metadata indicating that a result file is in the storage system when it is not. When that metadata is later used to locate files to download, the attempted download of the missing file fails. In addition, the browser will display details for a file that is not actually in the storage system.
However, if the result files are uploaded before the metadata and the result file upload fails, the upload stops and no metadata is uploaded for the pipeline results. If the metadata upload fails instead, which is unlikely, the result files will exist in the storage system, but neither the user nor the browser will know about them, and the pipeline will simply need to be rerun.
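
A minimal sketch of the proposed ordering, for illustration only. The names `upload_results_then_metadata`, `upload_file`, `upload_metadata`, and `UploadError` are hypothetical and are not part of dockstore-tool-runner or the ICGC clients; the two callables stand in for whatever actually performs the uploads.

```python
from typing import Callable, Iterable, List


class UploadError(Exception):
    """Hypothetical error raised when an upload step fails."""


def upload_results_then_metadata(
    result_files: Iterable[str],
    upload_file: Callable[[str], None],
    upload_metadata: Callable[[List[str]], None],
) -> List[str]:
    """Upload every result file first, then the metadata describing them.

    If any file upload raises, the exception propagates and the metadata
    step is never reached, so no metadata can point at a missing file.
    """
    uploaded: List[str] = []
    for path in result_files:
        upload_file(path)  # assumed to raise UploadError on failure
        uploaded.append(path)
    # Reached only when every result file upload succeeded.
    upload_metadata(uploaded)
    return uploaded
```

With this ordering, a failed metadata upload leaves files in storage that nothing references, which is the trade-off described above.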

caaespin commented Oct 3, 2017

@wshands @GPelayo, it was my understanding that the original design was to upload the metadata first and then upload the file to the storage system, supposedly because a metadata file uploaded with no accompanying file is a better outcome than a file with no accompanying metadata and therefore no way to trace it in the storage system. With untraceable files, you can imagine a lot of garbage accumulating over time and increasing production's costs. This was my understanding, but I may be wrong. @briandoconnor might be a better resource to answer that.

wshands commented Oct 3, 2017

Yes. Also, the dcc-metadata-client produces the upload manifest used by the icgc-storage-client, which does the upload in dockstore tool runner. So is the metadata upload wired to happen before the result file upload?

wshands commented Oct 3, 2017

More discussion is needed on this.

caaespin commented Oct 3, 2017

@wshands, the dockstore-tool-runner first registers the file using the dcc-metadata-client (an icgc tool that talks to the metadata server of redwood). Upon successful registration of the data in redwood, the dockstore-tool-runner then runs the icgc-storage-client tool to do the actual upload.
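
The current flow can be sketched roughly as follows. This is an illustration under assumptions, not the dockstore-tool-runner code: `register_then_upload` and its two callables are hypothetical stand-ins for invoking the dcc-metadata-client and the icgc-storage-client, and the returned status strings are made up for the example.

```python
from typing import Callable


def register_then_upload(
    register_metadata: Callable[[], bool],
    upload_files: Callable[[], bool],
) -> str:
    """Register metadata first, then upload files only if that succeeded.

    Each callable is assumed to return True on success and False on failure.
    """
    if not register_metadata():
        # Nothing was registered, so the upload is never attempted.
        return "registration_failed"
    if not upload_files():
        # The state this issue is about: metadata exists, file does not.
        return "metadata_without_file"
    return "ok"
```

For example, `register_then_upload(lambda: True, lambda: False)` returns `"metadata_without_file"`, the case where the metadata points at a file that never reached storage.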
