
Commit b45a2d9

Merge pull request #72 from kids-first/jw-etl-tool-2
converted etl to standalone tool and updated readme.md
2 parents f2d51eb + 151a83d commit b45a2d9

76 files changed: +1119 -868 lines


INSTALL.md

+49
@@ -0,0 +1,49 @@
## Prerequisites

1. **System-Level Dependencies**:
   Not strictly required, but may be needed depending on your system:
   - `pkg-config`: May be required for building some Python libraries.
   - `libmysqlclient-dev`: May be required for `mysqlclient`.
   - `build-essential`: Provides `gcc` for compiling Python extensions.
   - Install these on Ubuntu/Debian:
     ```bash
     sudo apt update
     sudo apt install pkg-config libmysqlclient-dev build-essential
     ```

2. **db.ini File**:
   - Create a `db.ini` file and paste the following, replacing `<host>`, `<user>`, and `<password>` with your actual credentials:
     ```plaintext
     [postgresql]
     database=postgres
     host=<host>
     user=<user>
     password=<password>
     ```
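   Optionally, sanity-check these credentials before going further. This is a minimal sketch, not part of the original setup, assuming the `psql` client is installed and the placeholders are filled in:
     ```bash
     # Hypothetical connectivity check using the same values as db.ini
     psql "host=<host> dbname=postgres user=<user> password=<password>" -c 'SELECT 1;'
     ```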

3. **Seven Bridges Credentials**:
   - Set up credentials:
     ```bash
     mkdir -p ~/.sevenbridges
     vim ~/.sevenbridges/credentials
     ```
   - Paste this into the credentials file (replacing `<token>` with your actual token):
     ```plaintext
     [default]
     api_endpoint = <api_endpoint>
     auth_token = <token>
     advance_access = false
     ```
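   For a non-interactive setup (for example, when provisioning a fresh EC2 instance), the same file can be written with a heredoc; `<api_endpoint>` and `<token>` remain placeholders you must fill in:
     ```bash
     mkdir -p ~/.sevenbridges
     cat > ~/.sevenbridges/credentials <<'EOF'
     [default]
     api_endpoint = <api_endpoint>
     auth_token = <token>
     advance_access = false
     EOF
     chmod 600 ~/.sevenbridges/credentials  # keep the token readable only by you
     ```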

4. **Seven Bridges Tools**:
   - Install the Seven Bridges CLI tools (not needed for the ETL, but included in case you want them installed):
     ```bash
     bash -c 'curl https://igor.sbgenomics.com/downloads/sb/install.sh -sSf | sudo -H sh'
     pip3 install pipx
     pipx ensurepath
     source ~/.bashrc
     pipx install sbpack
     ```

5. **PedcBioPortal Access Token**:
   - Required for running Step 2.
   - Obtain the file from [here](https://pedcbioportal.kidsfirstdrc.org/webAPI#using-data-access-tokens), then click `Download Token`. The file is reusable until it expires, at which point a new one must be downloaded.
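   As a quick, hedged way to confirm the token works, you can query the PedcBioPortal REST API. This assumes a stock cBioPortal API and that the downloaded file contains the raw token value; adjust if the file wraps the token in other fields:
     ```bash
     curl -s -H "Authorization: Bearer $(cat cbioportal_data_access_token.txt)" \
       "https://pedcbioportal.kidsfirstdrc.org/api/studies" | head -c 500
     ```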

README.md

+125 -22
@@ -4,29 +4,129 @@ Further general loading notes can be found in this [Notion page](https://www.not
 See [below](#collaborative-and-publication-workflows) for special cases like publications or collaborative efforts
 ## Software Prerequisites
 + `python3` v3.5.3+
-+ `numpy`, `pandas`, `scipy`
 + `bedtools` (https://bedtools.readthedocs.io/en/latest/content/installation.html)
-+ `chopaws` https://github.research.chop.edu/devops/aws-auth-cli needed for saml key generation for s3 upload
++ `IGOR` https://github.com/d3b-center/d3b-cli-igor
 + Access to the https://github.com/d3b-center/aws-infra-pedcbioportal-import repo for server loading
 + Access to the `postgres` D3b Warehouse database at `d3b-warehouse-aurora-prd.d3b.io`. Need at least read access to tables in the `bix_workflows` schema
-+ [cbioportal git repo](https://github.com/cBioPortal/cbioportal) needed to validate the final study output
 
-## I have everything and I know I am doing
+[cBio load package v5.4.10](https://github.com/cBioPortal/cbioportal/releases/tag/v5.4.10) is used in this tool.
+Refer to [INSTALL.md](https://github.com/kids-first/kf-cbioportal-etl/INSTALL.md) and [setup.py](https://github.com/kids-first/kf-cbioportal-etl/setup.py) for more details.
+
+
+## Install tool
+Run on the `Mgmt-Console-Dev-chopd3bprod@684194535433` EC2 instance:
+```sh
+git clone https://github.com/kids-first/kf-cbioportal-etl.git
+pip install /path/to/kf-cbioportal-etl/
+```
+If the install was successful, you should be able to run `cbioportal_etl --help`, which will give you the following menu:
+```
+usage: cbioportal_etl [-h] [--steps {1,2,3,4,5,all} [{1,2,3,4,5,all} ...]] -db DB_INI [-p PROFILE] [-mc META_CONFIG] [-r REF_DIR] [-a] [-u URL] -s STUDY [-t TOKEN] [-ds DATASHEET_SAMPLE] [-dp DATASHEET_PATIENT] [-m MANIFEST] [-f FILE_TYPES] [-at AWS_TBL]
+                      [-sp SBG_PROFILE] [-c CBIO] [-ao] [-rm] [-d] [-o] [-ms MANIFEST_SUBSET] [-dc DATA_CONFIG] [-dgd {both,kf,dgd}] [-l]
+
+Run cBioPortal ETL pipeline
+
+optional arguments:
+  -h, --help            show this help message and exit
+  --steps {1,2,3,4,5,all} [{1,2,3,4,5,all} ...]
+                        Steps to execute (e.g., 1 2 3 or all)
+  -db DB_INI, --db-ini DB_INI
+                        Database config file
+  -p PROFILE, --profile PROFILE
+                        Profile name (default: postgresql)
+  -mc META_CONFIG, --meta-config META_CONFIG
+                        Metadata configuration file. Default: value inputted for --study + '_case_meta_config.json'
+  -r REF_DIR, --ref-dir REF_DIR
+                        Reference directory. Defaults to tool's ref dir if not provided.
+  -a, --all             Include all relevant files, not just status=active files, NOT RECOMMENDED
+  -u URL, --url URL     URL to search against
+  -s STUDY, --study STUDY
+                        Cancer study ID
+  -t TOKEN, --token TOKEN
+                        Token file obtained from Web API. Required if running Step 2
+  -ds DATASHEET_SAMPLE, --datasheet-sample DATASHEET_SAMPLE
+                        File containing cBio-formatted sample metadata (default: datasheets/data_clinical_sample.txt from Step 1 output)
+  -dp DATASHEET_PATIENT, --datasheet-patient DATASHEET_PATIENT
+                        File containing cBio-formatted patient metadata (default: datasheets/data_clinical_patient.txt from Step 1 output)
+  -m MANIFEST, --manifest MANIFEST
+                        Manifest file (default: cbio_file_name_id.txt from Step 1 output)
+  -f FILE_TYPES, --file-types FILE_TYPES
+                        Comma-separated file types to download
+  -at AWS_TBL, --aws-tbl AWS_TBL
+                        AWS table with bucket name and keys
+  -sp SBG_PROFILE, --sbg-profile SBG_PROFILE
+                        SBG profile name
+  -c CBIO, --cbio CBIO  cBio manifest to limit downloads
+  -ao, --active-only    Only include active files
+  -rm, --rm-na          Remove entries where file_id and s3_path are NA
+  -d, --debug           Enable debug mode
+  -o, --overwrite       Overwrite files if they already exist
+  -ms MANIFEST_SUBSET, --manifest-subset MANIFEST_SUBSET
+                        Check that files were downloaded. Default: manifest_subset.tsv from Step 3
+  -dc DATA_CONFIG, --data-config DATA_CONFIG
+                        Data processing configuration file. Default: value inputted for --study + '_data_processing_config.json'
+  -dgd {both,kf,dgd}, --dgd-status {both,kf,dgd}
+                        Flag to determine load will have pbta/kf + dgd (both), kf/pbta only (kf), dgd-only (dgd). Default: kf
+  -l, --legacy          Enable legacy mode
+```
+
+## Run tool
+```sh
+cbioportal_etl \
+  --steps all \
+  --db-ini /path/to/db.ini \
+  --token /path/to/cbioportal_data_access_token.txt \
+  --study oligo_nation \
+  --sbg-profile default
+```
+
+### Required credentials files
+- Copy the `credentials_templates/template.db.ini` template to `/path/to/db.ini` and replace placeholders with your credentials.
+- Copy the `credentials_templates/template.sevenbridges.ini` template to `~/.sevenbridges/credentials` and replace placeholders with your credentials.
+- Download a reusable access token for PedcBioPortal, `cbioportal_data_access_token.txt`, from [here](https://pedcbioportal.kidsfirstdrc.org/webAPI#using-data-access-tokens).
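
A minimal sketch of those copy steps, assuming you run from the repo root (the destination paths are placeholders):
```sh
cp credentials_templates/template.db.ini /path/to/db.ini
mkdir -p ~/.sevenbridges
cp credentials_templates/template.sevenbridges.ini ~/.sevenbridges/credentials
# edit both files to replace the placeholders, and keep the downloaded
# cbioportal_data_access_token.txt somewhere you can pass to --token
```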
+
+### Steps Argument
+The `--steps` argument specifies which steps of the pipeline to run. It is outlined as follows:
+- `1` - Get study metadata
+- `2` - Compare current DWH data vs cBioPortal build
+- `3` - Get files from manifest
+- `4` - Check downloaded files
+- `5` - Build genomic file package
+
+You can specify the steps in one of the following ways:
+- **Run a single step**:
+  ```bash
+  --steps 1
+  ```
+  This will only execute Step 1 (Get study metadata).
+- **Run multiple steps**:
+  ```bash
+  --steps 2 3 4
+  ```
+  This will execute Steps 2, 3, and 4 in sequence.
+
+- **Run the whole ETL**:
+  ```bash
+  --steps all
+  ```
+  This will execute Steps 1 through 5.
+
+## Run manually without tool installation
 Below assumes you have already created the necessary tables from dbt
 1. Run commands as outlined in [scripts/get_study_metadata.py](#scriptsget_study_metadatapy). Copy/move those files to the cBio loader ec2 instance
 1. Recommended, but not required: run [scripts/diff_studies.py](docs/DIFF_STUDY_CLINICAL.md). It will give a summary of metadata changes between what is currently loaded and what you plan to load, to potentially flag any suspicious changes
 1. Copy over the appropriate aws account key and download files. Example using the `pbta_all` study:
 
 ```sh
-python3 scripts/get_files_from_manifest.py -s turbo -m cbio_file_name_id.txt -r
+python3 cbioportal_etl/scripts/get_files_from_manifest.py -s turbo -m cbio_file_name_id.txt -r
 ```
 `aws_bucket_key_pairs.txt` is a headerless tsv file with bucket name + object prefixes and aws profile name pairs
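
For illustration only, a hypothetical `aws_bucket_key_pairs.txt` might look like this (tab-separated, no header; bucket name + object prefix, then aws profile name; the values here are made up):
```
kf-study-us-east-1-prd-sd-xxxxxxxx/harmonized	turbo
kf-strides-study-us-east-1-prd-sd-yyyyyyyy/source	saml
```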
 
-1. Copy and edit `STUDY_CONFIGS/pbta_all_data_processing_config.json` and `STUDY_CONFIGS/pbta_all_case_meta_config.json` as needed
+1. Copy and edit `cbioportal_etl/STUDY_CONFIGS/pbta_all_data_processing_config.json` and `cbioportal_etl/STUDY_CONFIGS/pbta_all_case_meta_config.json` as needed
 1. Run the pipeline script - ignore the manifest section, it is a placeholder for a better-functioning download method
 
 ```sh
-scripts/genomics_file_cbio_package_build.py -t cbio_file_name_id.txt -c pbta_all_case_meta_config.json -d pbta_all_data_processing_config.json -f both
+cbioportal_etl/scripts/genomics_file_cbio_package_build.py -t cbio_file_name_id.txt -c pbta_all_case_meta_config.json -d pbta_all_data_processing_config.json -f both
 ```
 1. Check logs and outputs for errors, especially `validator.errs` and `validator.out`, assuming everything else went fine, to see if any `ERROR` popped up that would prevent the package from loading properly once pushed to the bucket and the Jenkins import job is run
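
A hedged one-liner for that check, run from the package build directory (`ERROR` casing can vary, hence `-i`):
```sh
grep -iE "error" validator.errs validator.out \
  && echo "fix errors before loading" \
  || echo "no validator errors found"
```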
@@ -75,7 +175,7 @@ processed
 └── meta_sv.txt
 ```
 Note! Most other studies won't have a timeline set of files.
-## Upload the final packages
+### Upload the final packages
 Upload all of the directories named as study short names to `s3://kf-strides-232196027141-cbioportal-studies/studies/`. You may need to set and/or copy your aws saml key before uploading. See the "access to https://github.com/d3b-center/aws-infra-pedcbioportal-import repo" bullet point in [Software Prerequisites](#software-prerequisites) to load the study.
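
As an illustrative sketch (the study directory and profile name are placeholders, not defaults), the upload amounts to a recursive copy per study:
```sh
aws s3 cp oligo_nation/ s3://kf-strides-232196027141-cbioportal-studies/studies/oligo_nation/ \
  --recursive --profile <saml_profile>
```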
 ### Load into QA/PROD
 An AWS step function exists to load studies on to the QA and PROD servers.
@@ -85,10 +185,10 @@ An AWS step function exists to load studies on to the QA and PROD servers.
 + aws `stateMachinePedcbioImportservice` Step function service is used to view and manage running jobs
 + To repeat a load, click on the ▶️ icon in the git repo to select the job you want to re-run
 + *Note*: if your branch's importStudies.txt is the same as main, you may have to trigger it yourself. To do so, go to [actions](https://github.com/d3b-center/aws-infra-pedcbioportal-import/actions), on the left panel choose which action you want, then from the drop down in the right panel, pick which branch you want that action to run on
-# Details
+## Details - ETL Steps
 Use this section as a reference in case your overconfidence got the best of you
 
-## REFS
+### REFS
 In case you want to use different reference inputs...
 - From data_processing_config.json `bed_genes`:
   - This is used to collate ControlFreeC results into gene hits
@@ -103,11 +203,11 @@ cat *genomic* | cut -f 15 | cut -f 1-3 -d "/" | sort | uniq > aws_bucket_key_pairs.txt
 ```
 Just remove the `s3_path` and `None` entries
 
-## Starting file inputs
-Most starting files are exported from the D3b Warehouse. An example of file exports can be found here `scripts/export_clinical.sh`, we now use `scripts/get_study_metadata.py` to get the files.
+### Starting file inputs
+Most starting files are exported from the D3b Warehouse. An example of file exports can be found in `cbioportal_etl/scripts/export_clinical.sh`; we now use `cbioportal_etl/scripts/get_study_metadata.py` to get the files.
 However, a python wrapper script that leverages the `x_case_meta_config.json` is recommended for each study.
 
-### scripts/get_study_metadata.py
+### cbioportal_etl/scripts/get_study_metadata.py
 ```
 usage: get_study_metadata.py [-h] [-d DB_INI] [-p PROFILE] [-c CONFIG_FILE]
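
For example, an illustrative invocation (flags per the usage line above; the config path is hypothetical):
```sh
python3 cbioportal_etl/scripts/get_study_metadata.py -d db.ini -p postgresql \
  -c cbioportal_etl/STUDY_CONFIGS/pbta_all_case_meta_config.json
```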
@@ -128,17 +228,20 @@ optional arguments:
 ### From D3b Warehouse
 #### - Data clinical sample sheet
 This is the cBioportal-formatted sample sheet that follows guidelines from [here](https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#clinical-sample-columns)
+
 #### - Data clinical patient sheet
 This is the cBioportal-formatted patient sheet that follows guidelines from [here](https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#clinical-patient-columns)
+
 #### - Genomics metadata file
 Seemingly redundant, this file contains the file locations, BS IDs, file types, and cBio-formatted sample IDs of all inputs.
 It helps simplify the process to integrate better into the downstream tools.
 This is the file that goes in as the `-t` arg in all the data collating tools
+
 ### User-edited
 #### - Data processing config file
 
 This is a json-formatted file that has tool paths, reference paths, and run time params.
-An example is given in `STUDY_CONFIGS/pbta_all_data_processing_config.json`.
+An example is given in `cbioportal_etl/STUDY_CONFIGS/pbta_all_data_processing_config.json`.
 This section here:
 ```json
 "file_loc_defs": {
@@ -163,7 +266,7 @@ Will likely need the most editing based on your input, and should only
 #### - Metadata processing config file
 
 This is a json config file with the file descriptions and case lists required by cBioPortal.
-An example is given in `STUDY_CONFIGS/pbta_all_case_meta_config.json`.
+An example is given in `cbioportal_etl/STUDY_CONFIGS/pbta_all_case_meta_config.json`.
 Within this file is a `_doc` section with a decent explanation of the file format and layout.
 Be sure to review all data types to be loaded by reviewing all `meta_*` entries to see if they match the incoming data.
 Likely personalized edits would occur in the following fields:
@@ -177,10 +280,10 @@ Likely personalized edits would occur in the following fields:
 + `short_name`: This is the short version. By default, should be the same as `cancer_study_identifier`
 
 
-## Pipeline script
+### Pipeline script
 After downloading the genomic files and the files above as needed, and properly editing the config files, this script should generate and validate the cBioPortal load package
 
-### scripts/get_files_from_manifest.py
+### cbioportal_etl/scripts/get_files_from_manifest.py
 Currently, file locations are still too volatile to make downloading part of the pipeline. Using various combinations of bucket and sbg file ID pulls will eventually get you everything
 ```
 usage: get_files_from_manifest.py [-h] [-m MANIFEST] [-f FTS] [-p PROFILE] [-s SBG_PROFILE] [-c CBIO] [-a] [-d]
@@ -204,7 +307,7 @@ optional arguments:
   -d, --debug           Just output manifest subset to see what would be grabbed
   -o, --overwrite       If set, overwrite if file exists
 ```
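
An illustrative invocation (the file-type values are drawn from the configs below and the profile is a placeholder):
```sh
python3 cbioportal_etl/scripts/get_files_from_manifest.py -m cbio_file_name_id.txt \
  -f RSEM_gene,annofuse_filtered_fusions_tsv -s <sbg_profile>
```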
-### scripts/genomics_file_cbio_package_build.py
+### cbioportal_etl/scripts/genomics_file_cbio_package_build.py
 ```
 usage: genomics_file_cbio_package_build.py [-h] [-t TABLE] [-m MANIFEST] [-c CBIO_CONFIG] [-d DATA_CONFIG] [-f [{both,kf,dgd}]]
@@ -231,13 +334,13 @@
 Check the pipeline log output for any errors that might have occurred.
 
 
-## Congratulations, you did it!
+### Congratulations, you did it!
 
-# Collaborative and Publication Workflows
+## Collaborative and Publication Workflows
 These are highly specialized cases in which all or most of the data come from a third party, and therefore require specific case-by-case protocols.
 
-## OpenPedCan
+### OpenPedCan
 See [OpenPedCan README](COLLABORATIONS/openTARGETS/README.md)
 
-## OpenPBTA
+### OpenPBTA
 See [OpenPBTA README](COLLABORATIONS/openPBTA/README.md)

STUDY_CONFIGS/aml_sd_pet7q6f2_2018_data_processing_config.json

-30

This file was deleted.

13 files renamed without changes.
@@ -0,0 +1,30 @@
{
  "bedtools": "bedtools",
  "cp_only_script": "scripts/get_cbio_copy_only_num.pl",
  "bed_genes": "REFS/Homo_sapiens.GRCh38.105.chr.gtf_genes.bed",
  "hugo_tsv": "REFS/HUGO_2021-06-01_EntrezID.tsv",
  "entrez_tsv": "REFS/EntrezGeneId_HugoGeneSymbol_2021-06-01.txt",
  "rna_ext_list": {
    "expression": "rsem.genes.results.gz",
    "fusion": "annoFuse_filter.tsv"
  },
  "file_loc_defs": {
    "_comment": "edit the values based on existing/anticipated source file locations, relative to working directory of the script being run",
    "mafs": {
      "kf": ["annotated_public_outputs"],
      "header": "REFS/maf_KF_CONSENSUS.txt"
    },
    "rsem": "RSEM_gene",
    "fusion": "annofuse_filtered_fusions_tsv"
  },
  "dl_file_type_list": ["RSEM_gene", "annofuse_filtered_fusions_tsv"],
  "ens_gene_list": "REFS/gencode27_gene_list.txt",
  "script_dir": "scripts/",
  "cbioportal_validator": "external_scripts/cbioportal-5.4.10/core/src/main/scripts/importer/validateData.py",
  "cna_flag": 0,
  "cnv_high_gain": 4,
  "cnv_min_len": 50000,
  "rna_flag": 1,
  "cpus": 8,
  "threads": 40
}
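
Not part of the ETL itself, but a quick way to confirm a hand-edited config is still valid JSON, assuming `jq` is installed (the filename here is hypothetical):
```sh
jq empty my_study_data_processing_config.json && echo "valid JSON"
```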

STUDY_CONFIGS/bllnos_sd_z6mwd3h0_2018_data_processing_config.json → cbioportal_etl/STUDY_CONFIGS/bllnos_sd_z6mwd3h0_2018_data_processing_config.json

+8 -8
@@ -1,9 +1,9 @@
 {
   "bedtools": "bedtools",
-  "cp_only_script": "/home/ubuntu/tools/kf-cbioportal-etl/scripts/get_cbio_copy_only_num.pl",
-  "bed_genes": "/home/ubuntu/tools/kf-cbioportal-etl/REFS/Homo_sapiens.GRCh38.105.chr.gtf_genes.bed",
-  "hugo_tsv": "/home/ubuntu/tools/kf-cbioportal-etl/REFS/HUGO_2021-06-01_EntrezID.tsv",
-  "entrez_tsv": "/home/ubuntu/tools/kf-cbioportal-etl/REFS/EntrezGeneId_HugoGeneSymbol_2021-06-01.txt",
+  "cp_only_script": "scripts/get_cbio_copy_only_num.pl",
+  "bed_genes": "REFS/Homo_sapiens.GRCh38.105.chr.gtf_genes.bed",
+  "hugo_tsv": "REFS/HUGO_2021-06-01_EntrezID.tsv",
+  "entrez_tsv": "REFS/EntrezGeneId_HugoGeneSymbol_2021-06-01.txt",
   "rna_ext_list": {
     "expression": "rsem.genes.results.gz",
     "fusion": "annoFuse_filter.tsv"
@@ -17,7 +17,7 @@
     "_comment": "edit the values based on existing/anticipated source file locations, relative to working directory of the script being run",
     "mafs": {
       "kf": ["annotated_public_outputs"],
-      "header": "/home/ubuntu/tools/kf-cbioportal-etl/REFS/maf_KF_CONSENSUS_r105.txt"
+      "header": "REFS/maf_KF_CONSENSUS_r105.txt"
     },
     "cnvs": {
       "pval": "ctrlfreec_pval",
@@ -28,9 +28,9 @@
   },
   "dl_file_type_list": ["annotated_public_outputs",
     "ctrlfreec_pval","ctrlfreec_info","ctrlfreec_bam_seg"],
-  "ens_gene_list": "/home/ubuntu/tools/kf-cbioportal-etl/REFS/gencode27_gene_list.txt",
-  "script_dir": "/home/ubuntu/tools/kf-cbioportal-etl/scripts/",
-  "cbioportal_validator": "/home/ubuntu/tools/cbioportal/core/src/main/scripts/importer/validateData.py",
+  "ens_gene_list": "REFS/gencode27_gene_list.txt",
+  "script_dir": "scripts/",
+  "cbioportal_validator": "external_scripts/cbioportal-5.4.10/core/src/main/scripts/importer/validateData.py",
   "cna_flag": 1,
   "cnv_high_gain": 4,
   "cnv_min_len": 50000,
