Skip to content

ACTIN-2312 Align shared data buckets between EMC and NKI #112

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

andreia-hartwig
Copy link
Contributor

@andreia-hartwig andreia-hartwig commented Jul 21, 2025

This PR includes:

  • Align shared data buckets between EMC and NKI, using only shared-data-external, with the content format of shared-data
  • Remove clinical.json from shared-data-external as it contains the same as patient_record.json
  • Remove creation and usage of evaluation summary and details

Related PR's:
https://github.com/hartwigmedical/actin-pipelines/pull/271
hartwigmedical/actin#1266
https://github.com/hartwigmedical/actin-infrastructure/pull/253

andreia-hartwig and others added 3 commits July 20, 2025 14:33
…shared-data-external, with the content format of shared-data.

Remove clinical.json from shared-data-external as it contains the same as patient_record.json
Remove creation and usage of evaluation summary and details.
@@ -16,11 +16,11 @@ if [[ -z "${namespace}" ]]; then
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this script?

Copy link
Contributor Author

@andreia-hartwig andreia-hartwig Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This script creates the files in the orange subfolder. We agreed with Korneel to keep it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oki! I couldn't remember if we wanted to keep it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe here in the title and logging we can also make it clearer that it is pushed to the shared external bucket and not directly to an archival bucket? Eg 'Copying' instead of 'Archiving' in logging?

@@ -1,53 +1,37 @@
#!/usr/bin/env bash
Copy link
Contributor

@ninajacobs ninajacobs Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should rename the EMC and NKI scripts, to have a similar name? I think that would be clearer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ninajacobs Not quite related to this PR, but at this point can we delete the pull and approve scripts for the NKI feed? I can take care of that separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbartletthmf you're asking since we're using the UI now right? If you don't see a reason why we should need these anymore in the future, then I think so, yes! I indeed haven't used the scripts for this the last times

Copy link
Contributor Author

@andreia-hartwig andreia-hartwig Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ninajacobs These comments are confusing, are you saying you don't use share_report_and_clinical_and_matching_snapshot_with_nki or pull and approve scripts for the NKI?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andreia-hartwig We're talking about pull_transformed_nki_feed and approve_transformed_nki_feed, which I'll delete separate from this PR. Sorry for the confusion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to rename share_report_and_clinical_and_matching_snapshot_with_nki and it's EMC counterpart to something similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw the EMC script has 'archiving' in the name and I don't think it archives (anymore), only sharing data to external shared bucket? how about share_actin_report_and_relevant_data_with_emc and share_actin_report_and_relevant_data_with_nki ?

@@ -38,7 +38,7 @@ if [[ -z "${orange_no_germline_json}" && -z "${orange_no_germline_pdf}" ]]; then
warn "Skipping archiving of ORANGE data for ${patient} as they do not exist"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finding one more 'archiving' that can be replaced by 'copying' :)

Like the new names, LGTM!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you update the EMC and NKI Confluence pages too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only once it's merged

@andreia-hartwig
Copy link
Contributor Author

@jbartletthmf @ninajacobs Please review the latest commit where I introduce a latest folder, new files to a latest folder moving the previous latest are moved to a versioned folder.

Copy link
Contributor

@jbartletthmf jbartletthmf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate that Google storage doesn't support symlinks or atomic move operations, because having to move the latest folder every time there's an update isn't ideal. Given those constraints, I don't see a better way to have a constant path for the latest data. Some alternatives would be to:

  1. store the latest version in a separate file at the top level of the directory, so consumers of the data could read that file first to determine the path
  2. resolve the latest path at access time by listing the directories

Both of those options simplify updates but complicate reads. Since we update in one place and may read from many, it probably makes sense to stick with the current approach.

Comment on lines 39 to 47
if [[ "$version_folder" != "1" ]]; then
info "Moving ACTIN data from $shared_actin_data_external_latest_uri to $shared_actin_data_external_uri"
gsutil -m mv "$shared_actin_data_external_latest_uri/\*" "$shared_actin_data_external_uri/"
fi

info "Copying ACTIN data for ${patient} to ${shared_actin_data_external_latest_uri}"
gsutil cp "${patient_record_json}" "${shared_actin_data_external_latest_uri}/"
gsutil cp "${treatment_match_json}" "${shared_actin_data_external_latest_uri}/"
gsutil cp "${actin_report_pdf}" "${shared_actin_data_external_latest_uri}/"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need some checks to handle if one of these operations fails? We wouldn't want to overwrite files if the archiving step is unsuccessful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably that is best. I will add.

Comment on lines 34 to 37
if [[ "$version_folder" != "1" ]]; then
info "Moving ACTIN data from $shared_actin_data_external_latest_uri to $shared_actin_data_external_uri"
gsutil -m mv "$shared_actin_data_external_latest_uri/\*" "$shared_actin_data_external_uri/"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how practical this is, but it would be nice if this code could be centralized and reused instead of implemented in each file. It would be even better if we had a well-tested application to manage files and sharing, but now I'm just being unreasonable 😛

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to generalize it so I can create a reusable function.

version_folder=$(basename "${shared_actin_data_external_uri%/}")
shared_actin_data_external_latest_uri="${shared_actin_data_external_uri/%${version_folder}/}latest"

if [[ "$version_folder" != "1" ]]; then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nothing is moved when the folder equals 1? Does that mean that the first archived version will be 2? Should we subtract 1 so the archives start from 1 instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the folder is 1 it means it's the first version and there is no existing latest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, and in that case, we skip archiving and just copy the new files to latest. On the next run, there still won't be a folder called 1, so will we try with version_folder 1 again? It doesn't look like the version management is aware of latest/, which would mean nothing gets archived.

Alternatively, if the version gets bumped to 2, it looks like the first archive operation will be to copy from latest/ to 2/.

@andreia-hartwig
Copy link
Contributor Author

It's unfortunate that Google storage doesn't support symlinks or atomic move operations, because having to move the latest folder every time there's an update isn't ideal. Given those constraints, I don't see a better way to have a constant path for the latest data. Some alternatives would be to:

  1. store the latest version in a separate file at the top level of the directory, so consumers of the data could read that file first to determine the path
  2. resolve the latest path at access time by listing the directories

Both of those options simplify updates but complicate reads. Since we update in one place and may read from many, it probably makes sense to stick with the current approach.

I also don't fully like the idea of move the latest folder every time but adding the latest folder was a request from Jana to simplify reading the data. I think this could also be useful at UMCU, for example, so I was trying to put the effort on the writing side since we only do that once. I even considered always creating a versioned folder and duplicating the files in a latest folder, but since most patients will only have one version, that felt like overkill and a waste of resources.

@jbartletthmf
Copy link
Contributor

Another alternative would be to have a URL that resolves to the latest version, but I don't know if that would support all the existing (and future) use cases.

@andreia-hartwig
Copy link
Contributor Author

@jbartletthmf Can you please review the latest commit? I've added a manifest file to help identify the latest version.

Copy link
Contributor

@jbartletthmf jbartletthmf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much tidier! Nicely done 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants