-
Notifications
You must be signed in to change notification settings - Fork 14
ACTIN-2312 Align shared data buckets between EMC and NKI #112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…shared-data-external, with the content format of shared-data. Remove clinical.json from shared-data-external as it contains the same as patient_record.json Remove creation and usage of evaluation summary and details.
@@ -16,11 +16,11 @@ if [[ -z "${namespace}" ]]; then | |||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this script?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This script creates the files in the orange subfolder. We agreed with Korneel to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oki! I couldn't remember if we wanted to keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe here in the title and logging we can also make it clearer that it is pushed to the shared external bucket and not directly to an archival bucket? Eg 'Copying' instead of 'Archiving' in logging?
@@ -1,53 +1,37 @@ | |||
#!/usr/bin/env bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should rename the EMC and NKI scripts, to have a similar name? I think that would be clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ninajacobs Not quite related to this PR, but at this point can we delete the pull and approve scripts for the NKI feed? I can take care of that separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbartletthmf you're asking since we're using the UI now right? If you don't see a reason why we should need these anymore in the future, then I think so, yes! I indeed haven't used the scripts for this the last times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ninajacobs These comments are confusing, are you saying you don't use share_report_and_clinical_and_matching_snapshot_with_nki or pull and approve scripts for the NKI?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andreia-hartwig We're talking about pull_transformed_nki_feed
and approve_transformed_nki_feed
, which I'll delete separate from this PR. Sorry for the confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to rename share_report_and_clinical_and_matching_snapshot_with_nki and it's EMC counterpart to something similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw the EMC script has 'archiving' in the name and I don't think it archives (anymore), only sharing data to external shared bucket? how about share_actin_report_and_relevant_data_with_emc and share_actin_report_and_relevant_data_with_nki ?
@@ -38,7 +38,7 @@ if [[ -z "${orange_no_germline_json}" && -z "${orange_no_germline_pdf}" ]]; then | |||
warn "Skipping archiving of ORANGE data for ${patient} as they do not exist" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
finding one more 'archiving' that can be replaced by 'copying' :)
Like the new names, LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will you update the EMC and NKI Confluence pages too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only once it's merged
…t to a versioned folder.
@jbartletthmf @ninajacobs Please review the latest commit where I introduce a latest folder, new files to a latest folder moving the previous latest are moved to a versioned folder. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's unfortunate that Google storage doesn't support symlinks or atomic move operations, because having to move the latest folder every time there's an update isn't ideal. Given those constraints, I don't see a better way to have a constant path for the latest data. Some alternatives would be to:
- store the latest version in a separate file at the top level of the directory, so consumers of the data could read that file first to determine the path
- resolve the latest path at access time by listing the directories
Both of those options simplify updates but complicate reads. Since we update in one place and may read from many, it probably makes sense to stick with the current approach.
if [[ "$version_folder" != "1" ]]; then | ||
info "Moving ACTIN data from $shared_actin_data_external_latest_uri to $shared_actin_data_external_uri" | ||
gsutil -m mv "$shared_actin_data_external_latest_uri/\*" "$shared_actin_data_external_uri/" | ||
fi | ||
|
||
info "Copying ACTIN data for ${patient} to ${shared_actin_data_external_latest_uri}" | ||
gsutil cp "${patient_record_json}" "${shared_actin_data_external_latest_uri}/" | ||
gsutil cp "${treatment_match_json}" "${shared_actin_data_external_latest_uri}/" | ||
gsutil cp "${actin_report_pdf}" "${shared_actin_data_external_latest_uri}/" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need some checks to handle if one of these operations fails? We wouldn't want to overwrite files if the archiving step is unsuccessful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably that is best. I will add.
if [[ "$version_folder" != "1" ]]; then | ||
info "Moving ACTIN data from $shared_actin_data_external_latest_uri to $shared_actin_data_external_uri" | ||
gsutil -m mv "$shared_actin_data_external_latest_uri/\*" "$shared_actin_data_external_uri/" | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how practical this is, but it would be nice if this code could be centralized and reused instead of implemented in each file. It would be even better if we had a well-tested application to manage files and sharing, but now I'm just being unreasonable 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try to generalize it so I can create a reusable function.
version_folder=$(basename "${shared_actin_data_external_uri%/}") | ||
shared_actin_data_external_latest_uri="${shared_actin_data_external_uri/%${version_folder}/}latest" | ||
|
||
if [[ "$version_folder" != "1" ]]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So nothing is moved when the folder equals 1? Does that mean that the first archived version will be 2? Should we subtract 1 so the archives start from 1 instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the folder is 1 it means it's the first version and there is no existing latest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, and in that case, we skip archiving and just copy the new files to latest. On the next run, there still won't be a folder called 1, so will we try with version_folder 1 again? It doesn't look like the version management is aware of latest/
, which would mean nothing gets archived.
Alternatively, if the version gets bumped to 2, it looks like the first archive operation will be to copy from latest/
to 2/
.
I also don't fully like the idea of move the latest folder every time but adding the latest folder was a request from Jana to simplify reading the data. I think this could also be useful at UMCU, for example, so I was trying to put the effort on the writing side since we only do that once. I even considered always creating a versioned folder and duplicating the files in a latest folder, but since most patients will only have one version, that felt like overkill and a waste of resources. |
Another alternative would be to have a URL that resolves to the latest version, but I don't know if that would support all the existing (and future) use cases. |
…us latest to a versioned folder." This reverts commit 40820cd.
@jbartletthmf Can you please review the latest commit? I've added a manifest file to help identify the latest version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much tidier! Nicely done 🙂
This PR includes:
Related PR's:
https://github.com/hartwigmedical/actin-pipelines/pull/271
hartwigmedical/actin#1266
https://github.com/hartwigmedical/actin-infrastructure/pull/253