Commit b3984d7

Merge pull request #33 from kids-first/feature/mb-missed-updates-improvements
Feature/mb missed updates improvements
2 parents 0155bde + fe57d8f commit b3984d7

15 files changed (+422, -682 lines)

COLLABORATIONS/openPBTA/openpbta_case_meta_config.json

Lines changed: 1 addition & 1 deletion
@@ -112,7 +112,7 @@
     }
   },
   "study": {
-    "description": "The Open Pediatric Brain Tumor Atlas (OpenPBTA) Project is a global open science initiative led by <a href=\"https://www.ccdatalab.org/\">Alex's Lemonade Stand Childhood Cancer Data Lab (CCDL)</a> and <a href=\"https://www.chop.edu/\">Children's Hospital of Philadelphia's</a> <a href=\"https://d3b.center/\">Center for Data-Driven Discovery</a> to comprehensively define the molecular landscape of tumors of 943 patients from the <a href=\"http://cbtn.org\">Children's Brain Tumor Network</a> and the <a href=\"http://www.pnoc.us/\">Pacific Pediatric Neuro-oncology Consortium</a> through real-time, <a href=\"https://github.com/AlexsLemonade/OpenPBTA-analysis\">collaborative analyses</a> and <a href=\"https://github.com/AlexsLemonade/OpenPBTA-manuscript\"> collaborative manuscript writing</a> on GitHub. The study loaded matches that of v22. For updates, please see here: <a href=\"https://tinyurl.com/55cxz9am\">Release Notes</a>",
+    "description": "The Open Pediatric Brain Tumor Atlas (OpenPBTA) Project is a global open science initiative led by <a href=\"https://www.ccdatalab.org/\">Alex's Lemonade Stand Childhood Cancer Data Lab (CCDL)</a> and <a href=\"https://www.chop.edu/\">Children's Hospital of Philadelphia's</a> <a href=\"https://d3b.center/\">Center for Data-Driven Discovery</a> to comprehensively define the molecular landscape of tumors of 943 patients from the <a href=\"http://cbtn.org\">Children's Brain Tumor Network</a> and the <a href=\"http://www.pnoc.us/\">Pacific Pediatric Neuro-oncology Consortium</a> through real-time, <a href=\"https://github.com/AlexsLemonade/OpenPBTA-analysis\">collaborative analyses</a> and <a href=\"https://github.com/AlexsLemonade/OpenPBTA-manuscript\">collaborative manuscript writing</a> on GitHub. The study loaded matches that of v23. For updates, please see here: <a href=\"https://tinyurl.com/55cxz9am\">Release Notes</a>",
     "groups": "PUBLIC",
     "cancer_study_identifier": "openpbta",
     "type_of_cancer": "brain",

README.md

Lines changed: 18 additions & 7 deletions
@@ -8,10 +8,9 @@ Below assumes you have already created the necessary tables from dbt
 1. Copy over the appropriate aws account key and download files. Example using `pbta_all` study:

    ```sh
-   python3 ~/tools/kf-cbioportal-etl/scripts/get_files_from_manifest.py -m genomics_file_manifest.txt -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_pval,ctrlfreec_info,ctrlfreec_bam_seg -p saml 2> pbta_dl.log & # -p aws download
-   python3 /home/ubuntu/tools/kf-cbioportal-etl/scripts/get_files_from_manifest.py -m pnoc_sb_subset -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_bam_seg,ctrlfreec_info,ctrlfreec_pval -s turbo -a -c cbio_file_name_id.txt 2> pnoc_sb_dl.err # -s sbg download
-   python3 ~/tools/kf-cbioportal-etl/scripts/get_files_from_manifest.py -m dgd_genomics_file_manifest.txt -f DGD_MAF,DGD_FUSION -p d3b 2> dgd_dl.log &
+   python3 scripts/get_files_from_manifest.py -m cbtn_genomics_file_manifest.txt,pnoc_genomics_file_manifest.txt,x01_genomics_file_manifest.txt,dgd_genomics_file_manifest.txt -f RSEM_gene,annofuse_filtered_fusions_tsv,annotated_public_outputs,ctrlfreec_pval,ctrlfreec_info,ctrlfreec_bam_seg,annotated_public -t aws_buckets_key_pairs.txt -s turbo -c cbio_file_name_id.txt
    ```
+   `aws_bucket_key_pairs.txt` is a headerless tsv file with bucket name and aws profile name pairs

 1. Copy and edit `REFS/data_processing_config.json` and `REFS/pbta_all_case_meta_config.json` as needed
 1. Run pipeline script - ignore manifest section, it is a placeholder for a better function download method
@@ -67,14 +66,25 @@ In case you want to use different reference inputs...
 ```sh
 cat Homo_sapiens.GRCh38.105.chr.gtf | perl -e 'while(<>){@a=split /\t/; if($a[2] eq "gene" && $a[8] =~ /gene_name/){print $_;}}' | convert2bed -i gtf --attribute-key=gene_name > Homo_sapiens.GRCh38.105.chr.gtf_genes.bed
 ```
+To get the aws bucket prefixes to which keys (meaning aws profile names) need to be added:
+```sh
+cat *genomic* | cut -f 15 | cut -f 1-3 -d "/" | sort | uniq > aws_bucket_key_pairs.txt
+```
+Just remove the `s3_path` and `None` entries
+

 ## Software Prerequisites

 + `python3` v3.5.3+
 + `numpy`, `pandas`, `scipy`
 + `bedtools` (https://bedtools.readthedocs.io/en/latest/content/installation.html)
 + `chopaws` https://github.research.chop.edu/devops/aws-auth-cli needed for saml key generation for s3 upload
-+ access to https://aws-infra-jenkins-service.kf-strides.org to start cbio load into QA and/or prod using the `d3b-center-aws-infra-pedcbioportal-import` task
++ access to the https://github.com/d3b-center/aws-infra-pedcbioportal-import repo. To start a load job:
+  + Create a branch and edit the `import_studies.txt` file with the study name you wish to load. Can be an MSKCC datahub link or a local study name
+  + Push the branch to remote - this will kick off a github action to load into QA
+  + To load into prod, make a PR. On merge, the load to prod will kick off
+  + The aws `stateMachinePedcbioImportservice` Step Function service is used to view and manage running jobs
+  + To repeat a load, click on the ▶️ icon in the git repo to select the job you want to re-run
 + Access to the `postgres` D3b Warehouse database at `d3b-warehouse-aurora-prd.d3b.io`. Need at least read access to tables with the `bix_workflows` schema
 + [cbioportal git repo](https://github.com/cBioPortal/cbioportal) needed to validate the final study output
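The prefix-extraction and cleanup steps in the hunk above can be combined into a single pass. A minimal sketch, assuming the manifests keep the s3 path in column 15 as in the README's command (the `grep` filter pattern is an assumption based on the note about removing `s3_path` and `None` entries):

```shell
# Extract unique bucket prefixes (first three "/"-delimited fields of column 15),
# dropping the `s3_path` header token and `None` placeholders in one pass.
cat *genomic* \
  | cut -f 15 \
  | cut -f 1-3 -d "/" \
  | sort -u \
  | grep -Ev '^(s3_path|None)$' \
  > aws_bucket_key_pairs.txt
```

After this, only the aws profile name column would still need to be filled in by hand.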

@@ -112,6 +122,7 @@ Seemingly redundant, this file contains the file locations, BS IDs, file type, a
 It helps simplify the process to integrate better into the downstream tools.
 This is the file that goes in as the `-t` arg in all the data collating tools
 #### - Sequencing center info resource file
+DEPRECATED and will be removed from future releases
 This is a simple file with BS IDs and sequencing center IDs and locations.
 It is necessary to patch in a required field for the fusion data
 #### - Data gene matrix - *OPTIONAL*
@@ -211,7 +222,7 @@ optional arguments:
 Check the pipeline log output for any errors that might have occurred.

 ## Upload the final packages
-Upload all of the directories named as study short names to `s3://kf-cbioportal-studies/public/`. You may need to set and/or copy aws your saml key before uploading. Next, edit the file in that bucket called `importStudies.txt` located at `s3://kf-cbioportal-studies/public/importStudies.txt`, with the names of all of the studies you wish to updated/upload. Lastly, go to https://jenkins.kids-first.io/job/d3b-center-aws-infra-pedcbioportal-import/job/master/, click on build. At the `Promotion kf-aws-infra-pedcbioportal-import-asg to QA` and `Promotion kf-aws-infra-pedcbioportal-import-asg to PRD`, the process will pause, click on the box below it to affirm that you want these changes deployed to QA and/or PROD respectively. If both, you will have to wait for the QA job to finish first before you get the prompt for PROD.
+Upload all of the directories named as study short names to `s3://kf-cbioportal-studies/public/`. You may need to set and/or copy your aws saml key before uploading. Next, edit the file in that bucket called `importStudies.txt`, located at `s3://kf-cbioportal-studies/public/importStudies.txt`, with the names of all of the studies you wish to update/upload. Lastly, follow the directions referenced in [Software Prerequisites](#software-prerequisites) to load the study.
 ## Congratulations, you did it!

 # Collaborative and Publication Workflows
# Collaborative and Publication Workflows
@@ -220,7 +231,7 @@ These are highly specialized cases in which all or most of the data come from a
 ## OpenTargets
 This project is organized much like OpenPBTA in which all genomics data for each assay-type are collated into one giant table.
 In general, this fits cBioPortal well.
-Input files mostly come from a "subdirectory" from within `s3://kf-openaccess-us-east-1-prd-pbta/`, consisting of:
+Input files mostly come from a "subdirectory" within `s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/`, consisting of:
 - `histologies.tsv`
 - `snv-consensus-plus-hotspots.maf.tsv.gz`
 - `consensus_wgs_plus_cnvkit_wxs_x_and_y.tsv.gz`
@@ -270,7 +281,7 @@ To create the histologies file, recommended method is to:
 1. Run `Rscript --vanilla pedcbio_sample_name_col.R --hist_dir path-to-hist-dir`. Histologies file must be `histologies.tsv`, modify file name or create sym link if needed. Results will be in `results` as `histologies-formatted-id-added.tsv`

 ### Inputs
-Inputs are located in the old Kids First AWS account (`538745987955`) in this general bucket location: `s3://kf-openaccess-us-east-1-prd-pbta/open-targets/`.
+Inputs are located in the old D3b AWS account (`684194535433`) in this general bucket location: `s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/`.
 Clinical data with cBio names are obtained from the `histologies-formatted-id-added.tsv` file, as noted in [Prep Work section](#prep-work).
 Genomic data generally obtained as such:
 - Somatic variant calls: merged maf

REFS/aws_bucket_key_pairs.txt

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+s3://cds-246-phs002517-p30-fy20	NCI-AR
+s3://cds-246-phs002517-sequencefiles-p30-fy20	NCI-AR
+s3://cds-306-phs002517-x01	NCI-X01
+s3://d3b-cds-working-bucket	d3b
+s3://kf-study-us-east-1-prd-sd-8y99qzjj	saml
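Since this new file pairs each bucket with an aws profile name, a download wrapper would look up the profile for a given bucket before fetching. A minimal sketch under that assumption (the `lookup_profile` helper name is hypothetical, not part of the repo):

```shell
# Return the aws profile name paired with a bucket in aws_bucket_key_pairs.txt.
# Assumes the headerless, whitespace-separated "bucket profile" layout shown above.
lookup_profile() {
  # $1 = bucket (e.g. s3://d3b-cds-working-bucket), $2 = pairs file
  awk -v b="$1" '$1 == b { print $2 }' "$2"
}

lookup_profile "s3://d3b-cds-working-bucket" aws_bucket_key_pairs.txt
```

The resolved name could then feed an `aws s3 cp --profile` call for that bucket's files.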

STUDY_CONFIGS/case_cptac_meta_config.json

Lines changed: 4 additions & 4 deletions
@@ -112,12 +112,12 @@
     }
   },
   "study": {
-    "_comment": "If a big study being split into many, make cancer_study_identifer blank, dx will be used",
-    "description": ["Genomic characterization through proteimics. Samples provided by the <a href=\"http://CBTTC.org\">Children's Brain Tumor Tissue Consortium</a> and its partners via the <a href=\"http://kidsfirstdrc.org\">Gabriella Miller Kids First Data Resource Center</a>. Updated Februrary 1, 2020 from last load, July 2019"],
+    "_comment": "see https://docs.cbioportal.org/5.1-data-loading/data-loading/file-formats#cancer-study for detailed specifics",
+    "description": "Genomic characterization through proteomics. Samples provided by the <a href=\"http://CBTTC.org\">Children's Brain Tumor Tissue Consortium</a> and its partners via the <a href=\"http://kidsfirstdrc.org\">Gabriella Miller Kids First Data Resource Center</a>. Updated February 1, 2020 from last load, July 2019",
     "groups": "PUBLIC",
     "cancer_study_identifier": "cptac_cbttc",
-    "dir_suffix": "",
-    "name_append": "(CBTTC, PBTA, Provisional)"
+    "reference_genome": "hg38",
+    "display_name": "Proteomics (CBTTC, PBTA, Provisional)"
   },
   "cases_3way_complete": {
     "stable_id": "3way_complete",

STUDY_CONFIGS/case_pbta_by_dx_meta_config.json

Lines changed: 0 additions & 164 deletions
This file was deleted.
