Commit b3984d7

Merge pull request #33 from kids-first/feature/mb-missed-updates-improvements
Feature/mb missed updates improvements
2 parents 0155bde + fe57d8f commit b3984d7

15 files changed (+422, -682 lines)

COLLABORATIONS/openPBTA/openpbta_case_meta_config.json

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@
     }
   },
   "study": {
-    "description": "The Open Pediatric Brain Tumor Atlas (OpenPBTA) Project is a global open science initiative led by <a href=\"https://www.ccdatalab.org/\">Alex's Lemonade Stand Childhood Cancer Data Lab (CCDL)</a> and <a href=\"https://www.chop.edu/\">Children's Hospital of Philadelphia's</a> <a href=\"https://d3b.center/\">Center for Data-Driven Discovery</a> to comprehensively define the molecular landscape of tumors of 943 patients from the <a href=\"http://cbtn.org\">Children's Brain Tumor Network</a> and the <a href=\"http://www.pnoc.us/\">Pacific Pediatric Neuro-oncology Consortium</a> through real-time, <a href=\"https://github.com/AlexsLemonade/OpenPBTA-analysis\">collaborative analyses</a> and <a href=\"https://github.com/AlexsLemonade/OpenPBTA-manuscript\"> collaborative manuscript writing</a> on GitHub. The study loaded matches that of v22. For updates, please see here: <a href=\"https://tinyurl.com/55cxz9am\">Release Notes</a>",
+    "description": "The Open Pediatric Brain Tumor Atlas (OpenPBTA) Project is a global open science initiative led by <a href=\"https://www.ccdatalab.org/\">Alex's Lemonade Stand Childhood Cancer Data Lab (CCDL)</a> and <a href=\"https://www.chop.edu/\">Children's Hospital of Philadelphia's</a> <a href=\"https://d3b.center/\">Center for Data-Driven Discovery</a> to comprehensively define the molecular landscape of tumors of 943 patients from the <a href=\"http://cbtn.org\">Children's Brain Tumor Network</a> and the <a href=\"http://www.pnoc.us/\">Pacific Pediatric Neuro-oncology Consortium</a> through real-time, <a href=\"https://github.com/AlexsLemonade/OpenPBTA-analysis\">collaborative analyses</a> and <a href=\"https://github.com/AlexsLemonade/OpenPBTA-manuscript\">collaborative manuscript writing</a> on GitHub. The study loaded matches that of v23. For updates, please see here: <a href=\"https://tinyurl.com/55cxz9am\">Release Notes</a>",
     "groups": "PUBLIC",
     "cancer_study_identifier": "openpbta",
     "type_of_cancer": "brain",

README.md

Lines changed: 18 additions & 7 deletions
@@ -8,10 +8,9 @@ Below assumes you have already created the necessary tables from dbt
 1. Copy over the appropriate aws account key and download files. Example using `pbta_all` study:

    ```sh
-   python3 ~/tools/kf-cbioportal-etl/scripts/get_files_from_manifest.py -m genomics_file_manifest.txt -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_pval,ctrlfreec_info,ctrlfreec_bam_seg -p saml 2> pbta_dl.log & # -p aws download
-   python3 /home/ubuntu/tools/kf-cbioportal-etl/scripts/get_files_from_manifest.py -m pnoc_sb_subset -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_bam_seg,ctrlfreec_info,ctrlfreec_pval -s turbo -a -c cbio_file_name_id.txt 2> pnoc_sb_dl.err # -s sbg download
-   python3 ~/tools/kf-cbioportal-etl/scripts/get_files_from_manifest.py -m dgd_genomics_file_manifest.txt -f DGD_MAF,DGD_FUSION -p d3b 2> dgd_dl.log &
+   python3 scripts/get_files_from_manifest.py -m cbtn_genomics_file_manifest.txt,pnoc_genomics_file_manifest.txt,x01_genomics_file_manifest.txt,dgd_genomics_file_manifest.txt -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_pval,ctrlfreec_info,ctrlfreec_bam_seg,annotated_public -t aws_buckets_key_pairs.txt -s turbo -c cbio_file_name_id.txt
    ```
+   `aws_bucket_key_pairs.txt` is a headerless tsv file with bucket name and aws profile name pairs

 1. Copy and edit `REFS/data_processing_config.json` and `REFS/pbta_all_case_meta_config.json` as needed
 1. Run pipeline script - ignore manifest section, it is a placeholder for a better function download method
@@ -67,14 +66,25 @@ In case you want to use different reference inputs...
 ```sh
 cat Homo_sapiens.GRCh38.105.chr.gtf | perl -e 'while(<>){@a=split /\t/; if($a[2] eq "gene" && $a[8] =~ /gene_name/){print $_;}}' | convert2bed -i gtf --attribute-key=gene_name > Homo_sapiens.GRCh38.105.chr.gtf_genes.bed
 ```
+To get the aws bucket prefixes to which keys (meaning aws profile names) need to be added:
+```sh
+cat *genomic* | cut -f 15 | cut -f 1-3 -d "/" | sort | uniq > aws_bucket_key_pairs.txt
+```
+Just remove the `s3_path` and `None` entries
+

 ## Software Prerequisites

 + `python3` v3.5.3+
 + `numpy`, `pandas`, `scipy`
 + `bedtools` (https://bedtools.readthedocs.io/en/latest/content/installation.html)
 + `chopaws` https://github.research.chop.edu/devops/aws-auth-cli needed for saml key generation for s3 upload
-+ access to https://aws-infra-jenkins-service.kf-strides.org to start cbio load into QA and/or prod using the `d3b-center-aws-infra-pedcbioportal-import` task
++ access to the https://github.com/d3b-center/aws-infra-pedcbioportal-import repo. To start a load job:
+  + Create a branch and edit the `import_studies.txt` file with the study name you wish to load. Can be an MSKCC datahub link or a local study name
+  + Push the branch to remote - this will kick off a github action to load into QA
+  + To load into prod, make a PR. On merge, the load to prod will kick off
+  + The aws `stateMachinePedcbioImportservice` Step Function service is used to view and manage running jobs
+  + To repeat a load, click on the ▶️ icon in the git repo to select the job you want to re-run
 + Access to the `postgres` D3b Warehouse database at `d3b-warehouse-aurora-prd.d3b.io`. Need at least read access to tables with the `bix_workflows` schema
 + [cbioportal git repo](https://github.com/cBioPortal/cbioportal) needed to validate the final study output
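The prefix-extraction and cleanup steps in the hunk above can be combined into a single pass. A minimal sketch, assuming the manifests keep the s3 path in column 15 as in the README's command (the `grep` filter pattern is an assumption based on the note about removing `s3_path` and `None` entries):

```shell
# Extract unique bucket prefixes (first three "/"-delimited fields of column 15),
# dropping the `s3_path` header token and `None` placeholders in one pass.
cat *genomic* \
  | cut -f 15 \
  | cut -f 1-3 -d "/" \
  | sort -u \
  | grep -Ev '^(s3_path|None)$' \
  > aws_bucket_key_pairs.txt
```

After this, only the aws profile name column would still need to be filled in by hand.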

@@ -112,6 +122,7 @@ Seemingly redundant, this file contains the file locations, BS IDs, file type, a
 It helps simplify the process to integrate better into the downstream tools.
 This is the file that goes in as the `-t` arg in all the data collating tools
 #### - Sequencing center info resource file
+DEPRECATED and will be removed from future releases
 This is a simple file with BS IDs and sequencing center IDs and locations.
 It is necessary to patch in a required field for the fusion data
 #### - Data gene matrix - *OPTIONAL*
@@ -211,7 +222,7 @@ optional arguments:
 Check the pipeline log output for any errors that might have occurred.

 ## Upload the final packages
-Upload all of the directories named as study short names to `s3://kf-cbioportal-studies/public/`. You may need to set and/or copy aws your saml key before uploading. Next, edit the file in that bucket called `importStudies.txt` located at `s3://kf-cbioportal-studies/public/importStudies.txt`, with the names of all of the studies you wish to updated/upload. Lastly, go to https://jenkins.kids-first.io/job/d3b-center-aws-infra-pedcbioportal-import/job/master/, click on build. At the `Promotion kf-aws-infra-pedcbioportal-import-asg to QA` and `Promotion kf-aws-infra-pedcbioportal-import-asg to PRD`, the process will pause, click on the box below it to affirm that you want these changes deployed to QA and/or PROD respectively. If both, you will have to wait for the QA job to finish first before you get the prompt for PROD.
+Upload all of the directories named as study short names to `s3://kf-cbioportal-studies/public/`. You may need to set and/or copy your aws saml key before uploading. Next, edit the file in that bucket called `importStudies.txt`, located at `s3://kf-cbioportal-studies/public/importStudies.txt`, with the names of all of the studies you wish to update/upload. Lastly, follow the directions referenced in [Software Prerequisites](#software-prerequisites) to load the study.
 ## Congratulations, you did it!

 # Collaborative and Publication Workflows
# Collaborative and Publication Workflows
@@ -220,7 +231,7 @@ These are highly specialized cases in which all or most of the data come from a
 ## OpenTargets
 This project is organized much like OpenPBTA in which all genomics data for each assay-type are collated into one giant table.
 In general, this fits cBioPortal well.
-Input files mostly come from a "subdirectory" from within `s3://kf-openaccess-us-east-1-prd-pbta/`, consisting of:
+Input files mostly come from a "subdirectory" within `s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/`, consisting of:
 - `histologies.tsv`
 - `snv-consensus-plus-hotspots.maf.tsv.gz`
 - `consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz`
@@ -270,7 +281,7 @@ To create the histologies file, recommended method is to:
 1. Run `Rscript --vanilla pedcbio_sample_name_col.R --hist_dir path-to-hist-dir`. Histologies file must be `histologies.tsv`, modify file name or create sym link if needed. Results will be in `results` as `histologies-formatted-id-added.tsv`

 ### Inputs
-Inputs are located in the old Kids First AWS account (`538745987955`) in this general bucket location: `s3://kf-openaccess-us-east-1-prd-pbta/open-targets/`.
+Inputs are located in the old D3b AWS account (`684194535433`) in this general bucket location: `s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/`.
 Clinical data with cBio names are obtained from the `histologies-formatted-id-added.tsv` file, as noted in [Prep Work section](#prep-work).
 Genomic data generally obtained as such:
 - Somatic variant calls: merged maf

REFS/aws_bucket_key_pairs.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+s3://cds-246-phs002517-p30-fy20	NCI-AR
+s3://cds-246-phs002517-sequencefiles-p30-fy20	NCI-AR
+s3://cds-306-phs002517-x01	NCI-X01
+s3://d3b-cds-working-bucket	d3b
+s3://kf-study-us-east-1-prd-sd-8y99qzjj	saml
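Since this new file pairs each bucket with an aws profile name, a download wrapper would look up the profile for a given bucket before fetching. A minimal sketch under that assumption (the `lookup_profile` helper name is hypothetical, not part of the repo):

```shell
# Return the aws profile name paired with a bucket in aws_bucket_key_pairs.txt.
# Assumes the headerless, whitespace-separated "bucket profile" layout shown above.
lookup_profile() {
  # $1 = bucket (e.g. s3://d3b-cds-working-bucket), $2 = pairs file
  awk -v b="$1" '$1 == b { print $2 }' "$2"
}

lookup_profile "s3://d3b-cds-working-bucket" aws_bucket_key_pairs.txt
```

The resolved name could then feed an `aws s3 cp --profile` call for that bucket's files.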

STUDY_CONFIGS/case_cptac_meta_config.json

Lines changed: 4 additions & 4 deletions
@@ -112,12 +112,12 @@
     }
   },
   "study": {
-    "_comment": "If a big study being split into many, make cancer_study_identifer blank, dx will be used",
-    "description": ["Genomic characterization through proteimics. Samples provided by the <a href=\"http://CBTTC.org\">Children's Brain Tumor Tissue Consortium</a> and its partners via the <a href=\"http://kidsfirstdrc.org\">Gabriella Miller Kids First Data Resource Center</a>. Updated Februrary 1, 2020 from last load, July 2019"],
+    "_comment": "see https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#cancer-study for detailed specifics",
+    "description": "Genomic characterization through proteomics. Samples provided by the <a href=\"http://CBTTC.org\">Children's Brain Tumor Tissue Consortium</a> and its partners via the <a href=\"http://kidsfirstdrc.org\">Gabriella Miller Kids First Data Resource Center</a>. Updated February 1, 2020 from last load, July 2019",
     "groups": "PUBLIC",
     "cancer_study_identifier": "cptac_cbttc",
-    "dir_suffix": "",
-    "name_append": "(CBTTC, PBTA, Provisional)"
+    "reference_genome": "hg38",
+    "display_name": "Proteomics (CBTTC, PBTA, Provisional)"
   },
   "cases_3way_complete": {
     "stable_id": "3way_complete",

STUDY_CONFIGS/case_pbta_by_dx_meta_config.json

Lines changed: 0 additions & 164 deletions
This file was deleted.
