- File obtained from [here](https://pedcbioportal.kidsfirstdrc.org/webAPI#using-data-access-tokens) by clicking on `Download Token`. The file is reusable until it expires, after which a new one will have to be downloaded.
- `chopaws` (https://github.research.chop.edu/devops/aws-auth-cli), needed for SAML key generation for S3 upload
- `IGOR` (https://github.com/d3b-center/d3b-cli-igor)
- Access to the https://github.com/d3b-center/aws-infra-pedcbioportal-import repo for server loading
- Access to the `postgres` D3b Warehouse database at `d3b-warehouse-aurora-prd.d3b.io`. You need at least read access to tables in the `bix_workflows` schema (a quick access check is sketched after this list)
- [cbioportal git repo](https://github.com/cBioPortal/cbioportal), needed to validate the final study output
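
One quick way to confirm that warehouse access works is a read-only query against the `bix_workflows` schema; this is only a sketch, with placeholder user and database names (only the host and schema come from this document):

```sh
# Placeholder user and database names; substitute your own credentials.
# Lists the bix_workflows tables visible to your account.
psql -h d3b-warehouse-aurora-prd.d3b.io -U YOUR_USER -d YOUR_DATABASE \
  -c "SELECT table_name FROM information_schema.tables WHERE table_schema = 'bix_workflows';"
```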
## I have everything and I know what I am doing
[cBio load package v5.4.10](https://github.com/cBioPortal/cbioportal/releases/tag/v5.4.10) is used in this tool.
Refer to [INSTALL.md](https://github.com/kids-first/kf-cbioportal-etl/INSTALL.md) and [setup.py](https://github.com/kids-first/kf-cbioportal-etl/setup.py) for more details.
## Install tool
Run on the `Mgmt-Console-Dev-chopd3bprod@684194535433` EC2 instance.
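
The authoritative install steps live in [INSTALL.md](https://github.com/kids-first/kf-cbioportal-etl/INSTALL.md); as a rough sketch, assuming a standard pip-based install from a checkout of this repo:

```sh
# Sketch only -- see INSTALL.md and setup.py for the authoritative steps
git clone https://github.com/kids-first/kf-cbioportal-etl.git
cd kf-cbioportal-etl
pip install .
```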
- Copy the `credentials_templates/template.db.ini` template to `/path/to/db.ini` and replace placeholders with your credentials (a hypothetical filled-in example is sketched after this list).
- Copy the `credentials_templates/template.sevenbridges.ini` template to `~/.sevenbridges/credentials` and replace placeholders with your credentials.
- Download a reusable access token for PedcBioPortal `cbioportal_data_access_token.txt` from [here](https://pedcbioportal.kidsfirstdrc.org/webAPI#using-data-access-tokens).
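
As a purely hypothetical illustration of the first item above, a filled-in `db.ini` might look like the following; the section and key names here are placeholders, so follow whatever `credentials_templates/template.db.ini` actually defines:

```ini
; Hypothetical layout -- the real section and key names come from credentials_templates/template.db.ini
[postgres]
hostname = d3b-warehouse-aurora-prd.d3b.io
username = YOUR_USER
password = YOUR_PASSWORD
database = YOUR_DATABASE
```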
### Steps Argument
The `--steps` argument specifies which steps of the pipeline to run. It is outlined as follows:
- `1` - Get study metadata
- `2` - Compare current DWH data vs cBioPortal build
- `3` - Get files from manifest
- `4` - Check downloaded files
- `5` - Build genomic file package
You can specify the steps in one of the following ways:
- **Run a single step**:
```bash
--steps 1
```
This will only execute Step 1 (Get study metadata).
- **Run multiple steps**:
```bash
--steps 2 3 4
```
This will execute Steps 2, 3, and 4 in sequence.
- **Run the whole ETL**:
```bash
--steps all
```
This will execute Steps 1 through 5.
## Run manually without tool installation
Below assumes you have already created the necessary tables from dbt
1. Run commands as outlined in [scripts/get_study_metadata.py](#scriptsget_study_metadatapy). Copy/move those files to the cBio loader ec2 instance
1. Recommended, but not required: run [scripts/diff_studies.py](docs/DIFF_STUDY_CLINICAL.md). It will give a summary of metadata changes between what is currently loaded and what you plan to load, to potentially flag any suspicious changes
1. Copy over the appropriate aws account key and download files. Example using `pbta_all` study:
   `aws_bucket_key_pairs.txt` is a headerless TSV file with bucket name + object prefix and AWS profile name pairs
1. Copy and edit `cbioportal_etl/STUDY_CONFIGS/pbta_all_data_processing_config.json` and `cbioportal_etl/STUDY_CONFIGS/pbta_all_case_meta_config.json` as needed
1. Run the pipeline script - ignore the manifest section; it is a placeholder for a better-functioning download method
```sh
cbioportal_etl/scripts/genomics_file_cbio_package_build.py -t cbio_file_name_id.txt -c pbta_all_case_meta_config.json -d pbta_all_data_processing_config.json -f both
```
1. Check logs and outputs for errors, especially `validator.errs` and `validator.out` (assuming everything else went fine), to see if any `ERROR` popped up that would prevent the package from loading properly once it is pushed to the bucket and the Jenkins import job is run
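
One simple way to scan those logs, assuming you are in the package build directory (the file names are just the ones mentioned above):

```sh
# Surface any ERROR lines that would block the load; grep exits non-zero if none are found
grep -Hn "ERROR" validator.errs validator.out
```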
```
processed
...
└── meta_sv.txt
```
Note! Most other studies won't have a timeline set of files.
### Upload the final packages
Upload all of the directories named as study short names to `s3://kf-strides-232196027141-cbioportal-studies/studies/`. You may need to set and/or copy your AWS SAML key before uploading. See the "access to https://github.com/d3b-center/aws-infra-pedcbioportal-import repo" bullet point in [Software Prerequisites](#software-prerequisites) to load the study.
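
For example, for the `pbta_all` study used above, a sync might look like this; the profile name is a placeholder for whatever profile your SAML key was written to:

```sh
# Upload one prepared study package; --profile is a placeholder for your SAML-generated profile
aws s3 sync pbta_all/ s3://kf-strides-232196027141-cbioportal-studies/studies/pbta_all/ --profile YOUR_SAML_PROFILE
```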
### Load into QA/PROD
An AWS Step Function exists to load studies onto the QA and PROD servers.
- The AWS `stateMachinePedcbioImportservice` Step Function service is used to view and manage running jobs
- To repeat a load, click on the ▶️ icon in the git repo to select the job you want to re-run
- *Note*: if your branch's importStudies.txt is the same as main's, you may have to trigger it yourself. To do so, go to [actions](https://github.com/d3b-center/aws-infra-pedcbioportal-import/actions), choose which action you want from the left panel, then from the drop-down in the right panel, pick which branch you want that action to run on
## Details - ETL Steps
Use this section as a reference in case your overconfidence got the best of you
### REFS
In case you want to use different reference inputs...
- From data_processing_config.json `bed_genes`:
  - This is used to collate ControlFreeC results into gene hits
### Starting file inputs
Most starting files are exported from the D3b Warehouse. An example of file exports can be found in `cbioportal_etl/scripts/export_clinical.sh`; we now use `cbioportal_etl/scripts/get_study_metadata.py` to get the files.
However, a python wrapper script that leverages the `x_case_meta_config.json` is recommended for each study.
#### - Data clinical sample sheet

This is the cBioportal-formatted sample sheet that follows guidelines from [here](https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#clinical-sample-columns)
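
As a rough, hand-made illustration of that layout (values are tab-separated in the real file; the third attribute and all values below are made up, and the actual sheet is exported from the D3b Warehouse):

```text
#Patient Identifier    Sample Identifier    Tumor Descriptor
#Patient identifier    Sample identifier    Tumor descriptor for the sample
#STRING    STRING    STRING
#1    1    1
PATIENT_ID    SAMPLE_ID    TUMOR_DESCRIPTOR
PT_0001    PT_0001_Tumor_1    Initial CNS Tumor
```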
#### - Data clinical patient sheet
This is the cBioportal-formatted patient sheet that follows guidelines from [here](https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#clinical-patient-columns)
#### - Genomics metadata file
Seemingly redundant, this file contains the file locations, BS IDs, file type, and cBio-formatted sample IDs of all inputs.
It helps simplify integration with the downstream tools.
This is the file that goes in as the `-t` arg in all the data collating tools.
### User-edited
#### - Data processing config file
This is a JSON-formatted file that has tool paths, reference paths, and runtime params.
An example is given in `cbioportal_etl/STUDY_CONFIGS/pbta_all_data_processing_config.json`.
This section here:
```json
"file_loc_defs": {
@@ -163,7 +266,7 @@ Will likely need the most editing existing based on your input, and should only
163
266
#### - Metadata processing config file
This is a JSON config file with the file descriptions and case lists required by cBioPortal.
An example is given in `cbioportal_etl/STUDY_CONFIGS/pbta_all_case_meta_config.json`.
Within this file is a `_doc` section with a decent explanation of the file format and layout.
Be sure to review all data types to be loaded by reviewing all `meta_*` entries to see if they match the incoming data.
Likely personalized edits would occur in the following fields:
- `short_name`: This is the short version. By default, it should be the same as `cancer_study_identifier`
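
For example, for the `pbta_all` study used elsewhere in this doc, the two fields would simply match; this fragment is illustrative, and the exact nesting follows the example config in `cbioportal_etl/STUDY_CONFIGS/pbta_all_case_meta_config.json`:

```json
{
  "cancer_study_identifier": "pbta_all",
  "short_name": "pbta_all"
}
```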
### Pipeline script
After downloading the genomic files and the files above as needed, and editing the config files as needed, this script should generate and validate the cBioPortal load package.
Currently, file locations are still too volatile to trust making downloading part of the pipeline. Using various combinations of bucket and SBG file ID pulls will eventually get you everything.