Skip to content

Commit 6b5a3bb

Browse files
authored
Merge pull request #57 from kids-first/feature/mb-add-openpedcan_v14
🚀 Add OpenPedCan v15
2 parents 028564c + 3874f3d commit 6b5a3bb

11 files changed

+441
-184
lines changed

COLLABORATIONS/openTARGETS/README.md

+22-11
Original file line numberDiff line numberDiff line change
@@ -13,18 +13,20 @@ Genomic data generally obtained as such:
1313
- Copy number: tsv file with copy number, ploidy, and GISTIC-style information in maf-like format (each call is a row)
1414
- RNA expression: tpm values from rsem stored an `.rds` object
1515
- RNA fusion: annoFuse output
16-
For example, for v12, bucket s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/:
16+
For example, for v14, bucket s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v14/:
1717
```
18-
consensus_wgs_plus_cnvkit_wxs.tsv.gz
18+
consensus_wgs_plus_cnvkit_wxs_plus_freec_tumor_only.tsv.gz
1919
fusion-dgd.tsv.gz
2020
fusion-putative-oncogenic.tsv
2121
gene-expression-rsem-tpm-collapsed.rds
22-
tcga-gene-expression-rsem-tpm-collapsed.rds
22+
tcga_gene-expression-rsem-tpm-collapsed.rds
23+
gtex_gene-expression-rsem-tpm-collapsed.rds
2324
snv-consensus-plus-hotspots.maf.tsv.gz
25+
snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz
2426
```
2527

2628
### Prep work
27-
The histologies file needs `formatted_sample_id` added and likely a blacklist from the D3b Warehouse or some other source to supress duplicate RNA libraries from different sequencing methods.
29+
The histologies file needs `formatted_sample_id` added and likely a blacklist from the D3b Warehouse or some other source to suppress duplicate RNA libraries from different sequencing methods.
2830
Since we are not handling `Methylation` yet, it is recommended those entries be removed ahead of time.
2931
To create the histologies file, recommended method is to:
3032
1. `docker pull pgc-images.sbgenomics.com/d3b-bixu/open-pedcan:latest` OR if you have R installed locally, ensure the following libraries are installed:
@@ -38,12 +40,12 @@ To create the histologies file, recommended method is to:
3840
1. Pull the OpenPedCan repo (warning, it's 12GB ): https://github.com/PediatricOpenTargets/OpenPedCan-analysis, or just download the script from `analyses/pedcbio-sample-name/pedcbio_sample_name_col.R`
3941
1. Export from D3b Warehouse the latest existing cBio IDs to use for population. Ensure that the output is csv double-quoted. Currently that can be obtained using the sql command:
4042
```sql
41-
43+
with custom as (
4244
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
4345
from prod_cbio.aml_sd_pet7q6f2_2018_cbio_sample
4446
union
4547
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
46-
from prod_cbio.aml_sd_z6mwd3h0_2018_cbio_sample
48+
from prod_cbio.bllnos_sd_z6mwd3h0_2018_cbio_sample
4749
union
4850
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
4951
from prod_cbio.x01_fy16_nbl_maris_cbio_sample
@@ -68,13 +70,18 @@ To create the histologies file, recommended method is to:
6870
union
6971
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
7072
from prod_cbio.chdm_phs001643_2018_cbio_sample
73+
union
74+
select participant_id, formatted_sample_id, specimen_id, analyte_types, normal_bs_id, normal_sample_id
75+
from prod_cbio.pbta_mioncoseq_cbio_sample
76+
)
77+
select * from custom
7178
7279
```
73-
1. Get a blacklist from D3b Warehouse, exporting table `bix_workflows.cbio_hide_reasons
80+
1. Get a blacklist from D3b Warehouse, exporting table `bix_workflows.cbio_hide_reasons`
7481
7582
### Run as standalone
7683
1. Download from https://github.com/PediatricOpenTargets/OpenPedCan-analysis the `analyses/pedcbio-sample-name/pedcbio_sample_name_col.R` or run from repo if you have it
77-
1. Run `Rscript --vanilla pedcbio_sample_name_col.R -i path-to-histolgies-file.tsv -n path-to-cbio-names.csv -b Methylation`
84+
1. Run `Rscript --vanilla pedcbio_sample_name_col.R -i path-to-histologies-file.tsv -n path-to-cbio-names.csv -b 'Methylation,Phospho-Proteomics,Whole Cell Proteomics,miRNA-Seq'`
7885
OR
7986
### Run in repo
8087
1. Either run an interactive docker or using your local R, and ensure to mount a volume that will have the repo and whatever input histologies file you end up using, i.e. `docker run -it --mount type=bind,source=/home/ubuntu,target=/WORK pgc-images.sbgenomics.com/d3b-bixu/open-pedcan:latest /bin/bash`
@@ -86,7 +93,11 @@ TCGA data are kept in a seprate matrix from everything else. We need to merge th
8693
```sh
8794
Rscript COLLABORATIONS/openTARGETS/merge_rsem_rds.R --first_file gene-expression-rsem-tpm-collapsed.rds --second_file tcga-gene-expression-rsem-tpm-collapsed.rds --output_fn gene_tcga_expression_common_merge.rds
8895
```
89-
96+
UPDATE: GTEx is also in a seprate matrix, so run again currently to make the "final" merge before conversion
97+
```sh
98+
Rscript COLLABORATIONS/openTARGETS/merge_rsem_rds.R --first_file gene_tcga_expression_common_merge.rds --second_file gtex_gene-expression-rsem-tpm-collapsed.rds --output_fn gene_tcga_gtex_expression_common_merge.rds
99+
```
100+
```
90101
91102
### File Transformation
92103
It's recommended to put datasheets in a dir called `datasheets`, downloaded files in it's own dir (in v12 it's `GF_INPUTS`) and the rest of the processed outputs into it's own dir (`study_build` for v12) to keep things sane and also be able to leverage existing study build script in `scripts/organize_upload_packages.py`
@@ -140,7 +151,7 @@ optional arguments:
140151
```
141152
_NOTE_ for v11 input, I ran the following command `zcat snv-dgd.maf.tsv.gz | perl -e '$skip = <>; $skip= <>; while(<>){print $_;}' | gzip -c >> snv-consensus-plus-hotspots.maf.tsv.gz` to add DGD data
142153
143-
_NOTE_ for v12 input,I would have following command `python3 ~/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/add_dgd_maf_to_openpedcan.py -i /home/ubuntu/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/maf_openpedcan_v12_header.txt -c openpedcan_v12.maf -t ../bs_id_sample_map.txt -m ../GF_INPUTS/snv-dgd.maf.tsv.gz` to add DGD data, which is more robust - however, there are data issues with DGD, so it was left out
154+
_NOTE_ for v15 input,I would have following command `python3 ~/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/append_maf_to_existing.py -i /home/ubuntu/tools/kf-cbioportal-etl/COLLABORATIONS/openTARGETS/maf_openpedcan_v15_header.txt -c openpedcan_v15.maf -t ../INPUT_PREP/bs_id_sample_map.txt -m ../INPUTS/snv-mutect2-tumor-only-plus-hotspots.maf.tsv.gz` to add tumor-only data, which is more robust
144155
145156
Example run:
146157
`python3 COLLABORATIONS/openTARGETS/rename_filter_maf.py -m bs_id_sample_map.txt -v snv-consensus-plus-hotspots.maf.tsv.gz -s 1 -n openpedcan_v12`
@@ -189,7 +200,7 @@ Options:
189200
Show this help message and exit
190201
```
191202
Example run:
192-
`Rscript COLLABORATIONS/openTARGETS/rename_export_rsem.R --rna_rds gene_tcga_expression_common_merge.rds --map_id bs_id_sample_map.txt --type openpedcan_v11 --computeZscore R 2> rna_convert.errs`
203+
`Rscript COLLABORATIONS/openTARGETS/rename_export_rsem.R --rna_rds gene_tcga_gtex_expression_common_merge.rds --map_id bs_id_sample_map.txt --type openpedcan_v15 --computeZscore R 2> rna_convert.errs`
193204
194205
#### 5. scripts/convert_fusion_as_sv.py
195206

COLLABORATIONS/openTARGETS/add_dgd_maf_to_openpedcan.py COLLABORATIONS/openTARGETS/append_maf_to_existing.py

+29-11
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
#!/usr/bin/env python3
22
"""
3-
Helper script to append DGD data to an existing merged maf file.
3+
Helper script to append a maf to an existing maf file.
44
Uses filter_entry to filter out undesired calls like in other mafs
55
"""
66
import sys
@@ -82,6 +82,15 @@
8282
v_idx = m_header.index("Variant_Classification")
8383
h_idx = m_header.index("Hugo_Symbol")
8484

85+
# need to also pop entrez ID if exists, as process_maf_entry() will do that to data
86+
try:
87+
m_header.pop(m_header.index("Entrez_Gene_Id"))
88+
except Exception as e:
89+
print(e, file=sys.stderr)
90+
# bug fix for OpenPedCan, position will be one less after process_maf_entry
91+
n_ref_ct_idx = m_header.index('n_ref_count')
92+
n_alt_ct_idx = m_header.index('n_alt_count')
93+
8594
for i in range(len(m_header)):
8695
if m_header[i] in h_dict:
8796
h_dict[m_header[i]] = i
@@ -90,16 +99,25 @@
9099
to_print = []
91100
datum = process_maf_entry(data, maf_exc, v_idx, h_idx, tid_idx, eid_idx, bs_cbio_key)
92101
# Set tumor barcode to cBio ID
93-
if datum:
94-
for item in header:
95-
if h_dict[item] != None:
96-
to_print.append(datum[h_dict[item]])
97-
else:
98-
to_print.append("")
99-
print("\t".join(to_print), file=append_maf)
100-
else:
101-
skipped += 1
102+
try:
103+
if datum:
104+
for item in header:
105+
if h_dict[item] != None:
106+
to_print.append(datum[h_dict[item]])
107+
else:
108+
to_print.append("")
109+
# bug fix for maf format in OpenPedCan
110+
for i in [n_ref_ct_idx, n_alt_ct_idx]:
111+
if to_print[i] == "NA":
112+
to_print[i] = ""
113+
print("\t".join(to_print), file=append_maf)
114+
else:
115+
skipped += 1
116+
except Exception as e:
117+
print (e)
118+
pdb.set_trace()
119+
hold = 1
102120
sys.stderr.write("Processed " + maf_fn + "\n")
103-
sys.stderr.write("Skipped " + str(skipped) + " entries meeting exlusion criteria\n")
121+
sys.stderr.write("Skipped " + str(skipped) + " entries meeting exclusion criteria\n")
104122

105123
append_maf.close()

COLLABORATIONS/openTARGETS/clinical_to_datasheets.py

+42-26
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,5 @@
11
import sys
22
import argparse
3-
import json
43
import math
54
import re
65
import pdb
@@ -153,17 +152,17 @@ def build_header(header_list, entry):
153152

154153
# track some sample info so that somatic events can be collapsed
155154
samp_dict = {}
156-
s_type = header.index('sample_type')
157-
comp = header.index('composition')
158-
bs_id = header.index('Kids_First_Biospecimen_ID')
159-
exp = header.index('experimental_strategy')
155+
s_type_index = header.index('sample_type')
156+
comp_index = header.index('composition')
157+
bs_id_index = header.index('Kids_First_Biospecimen_ID')
158+
exp_index = header.index('experimental_strategy')
160159

161160
# experimental strategy may be too vague, use to tell if RNA
162161
# related to recent addition of DGD samples, need gene matrix file
163-
rna_lib = header.index('RNA_library')
164-
cohort = header.index('cohort')
162+
rna_lib_index = header.index('RNA_library')
163+
cohort_index = header.index('cohort')
165164
a_idx = header.index('aliquot_id')
166-
cbio_id = header.index('formatted_sample_id')
165+
cbio_id_index = header.index('formatted_sample_id')
167166
data_gene = open('data_gene_matrix_CHOP.txt', 'w')
168167
data_gene.write('SAMPLE_ID\tmutations\n')
169168

@@ -173,20 +172,20 @@ def build_header(header_list, entry):
173172

174173
for data in clin_data:
175174
info = data.rstrip('\n').split('\t')
176-
if info[s_type] == "Normal" and info[exp] != 'RNA-Seq':
175+
if info[s_type_index] == "Normal" and info[exp_index] != 'RNA-Seq':
177176
continue
178-
if info[bs_id] in blacklist_dict:
179-
sys.stderr.write('Skipping output of ' + info[bs_id] + ' because in blacklist for reason ' + blacklist_dict[info[bs_id]] + '\n')
177+
if info[bs_id_index] in blacklist_dict:
178+
sys.stderr.write('Skipping output of ' + info[bs_id_index] + ' because in blacklist for reason ' + blacklist_dict[info[bs_id_index]] + '\n')
180179
continue
181180
# adjust exp value if targeted sequencing
182-
if info[exp] == 'Targeted Sequencing':
183-
if info[rna_lib] != 'NA':
184-
info[exp] = 'RNA-Seq'
181+
if info[exp_index] == 'Targeted Sequencing':
182+
if info[rna_lib_index] != 'NA':
183+
info[exp_index] = 'RNA-Seq'
185184
# if DGD DNA, add to gene matrix
186-
elif info[cohort] == 'DGD':
185+
elif info[cohort_index] == 'DGD':
187186
# parse aliquot for panel type, i.e. ET_242MFKXW_DGD_STNGS_93
188-
test = re.match('.*_DGD_(\w+)_\d+', info[a_idx])
189-
data_gene.write(info[cbio_id] + '\tCHOP-' + test.group(1) + '\n')
187+
test = re.match(r'.*_DGD_(\w+)_\d+', info[a_idx])
188+
data_gene.write(info[cbio_id_index] + '\tCHOP-' + test.group(1) + '\n')
190189

191190
pt_id = info[header.index("Kids_First_Participant_ID")]
192191
if pt_id in pt_id_dict:
@@ -214,12 +213,19 @@ def build_header(header_list, entry):
214213
# age_at_last_known = float(value)
215214
# age_at_last_known = str(math.floor(age_at_last_known/365.25))
216215
value = str(math.floor(float(value)/365.25))
217-
elif header[i] == "PFS_days" and value != "NA":
216+
elif header[i] == "EFS_days" and value != "NA":
218217
value = str(math.floor(float(value)/30.5))
219218
# d_free_mos = value
220219
elif header[i] == "tumor_descriptor":
221220
if value in tumor_descriptor_dict:
222221
value = tumor_descriptor_dict[value]
222+
elif header[i] == "EFS_event_type":
223+
if value == "Not Applicable":
224+
value = "0:No Event"
225+
elif value == "Not Reported":
226+
value = "NA"
227+
else:
228+
value = "1:" + value
223229
# replace status with NA if value not acceptable
224230
elif header[i] == "OS_status":
225231
if value not in ["LIVING", "DECEASED", "NA"]:
@@ -238,11 +244,11 @@ def build_header(header_list, entry):
238244
if samp_id not in samp_dict:
239245
samp_dict[samp_id] = sample_to_print
240246
id_mapping[samp_id] = []
241-
id_mapping[samp_id].append(info[bs_id])
242-
if info[exp] == "RNA-Seq":
243-
bs_type[info[bs_id]] = "RNA"
247+
id_mapping[samp_id].append(info[bs_id_index])
248+
if info[exp_index] == "RNA-Seq":
249+
bs_type[info[bs_id_index]] = "RNA"
244250
else:
245-
bs_type[info[bs_id]] = "DNA"
251+
bs_type[info[bs_id_index]] = "DNA"
246252
# cycle through sample IDs to see if there's matched DNA/RNA and if one can be made
247253
check = {}
248254
for samp_id in id_mapping:
@@ -254,10 +260,20 @@ def build_header(header_list, entry):
254260
spec = id_mapping[samp_id][0] + ";" + id_mapping[samp_id][1]
255261
if bs_type[id_mapping[samp_id][0]] == "RNA":
256262
spec = id_mapping[samp_id][1] + ";" + id_mapping[samp_id][0]
257-
samp_dict[samp_id][0] = spec
263+
samp_dict[samp_id][1] = spec
258264
elif len(id_mapping[samp_id]) > 2:
259-
# QC check, only one or two biospec per sample ID
265+
# QC check, only one or two biospec per sample ID, unless it's new DGD RNA + separate fusion biospecimen
260266
sys.stderr.write("Saw more than two biospecimens for " + samp_id + ": " + ",".join(id_mapping[samp_id]) + "\n")
267+
if "DGD" in samp_id:
268+
# If two RNA types and is DGD, throw a note to check
269+
check_type = {"DNA": [], "RNA": []}
270+
for bs_id in id_mapping[samp_id]:
271+
check_type[bs_type[bs_id]].append(bs_id)
272+
if len(check_type["DNA"]) == 1 and len(check_type["RNA"]) == 2:
273+
spec = ";".join(check_type["DNA"] + check_type["RNA"])
274+
samp_dict[samp_id][1] = spec
275+
sys.stderr.write("Could be a DGD fusion + bulk RNA, may be ok\n")
276+
261277
# exit(1)
262278
else:
263279
# skip cell line re-matching
@@ -272,9 +288,9 @@ def build_header(header_list, entry):
272288
mapping_file = open("bs_id_sample_map.txt", "w")
273289
mapping_file.write("BS_ID\tSample Type\tCbio ID\n")
274290
for samp_id in id_mapping:
275-
for bs_id in id_mapping[samp_id]:
291+
for bs_id_index in id_mapping[samp_id]:
276292
try:
277-
mapping_file.write("\t".join([bs_id, bs_type[bs_id], samp_id]) + "\n")
293+
mapping_file.write("\t".join([bs_id_index, bs_type[bs_id_index], samp_id]) + "\n")
278294
except Exception as e:
279295
sys.stderr.write(str(e) + "\n")
280296
pdb.set_trace()

COLLABORATIONS/openTARGETS/cnv_to_tables.py

+5-4
Original file line numberDiff line numberDiff line change
@@ -52,12 +52,13 @@ def collate_data(cnv_fn):
5252
ploidy = data[p_idx]
5353
cn = data[c_idx]
5454
try:
55-
gistic = qual_to_gistic[data[s_idx]]
55+
qual = data[s_idx]
56+
if qual != "NA":
57+
qual = qual.lower()
58+
gistic = qual_to_gistic[qual]
5659
except Exception as e:
57-
sys.stderr.write(str(e) + "\nInvalid value for gistic, skipping " + line.decode())
60+
sys.stderr.write(str(e) + "\nInvalid value for gistic, skipping " + line)
5861
continue
59-
# pdb.set_trace()
60-
# hold=1
6162
if samp_id not in ploidy_dict:
6263
ploidy_dict[samp_id] = ploidy
6364
if gene not in cn_dict:

COLLABORATIONS/openTARGETS/header_desc.tsv

+4-2
Original file line numberDiff line numberDiff line change
@@ -3,7 +3,8 @@ germline_sex_estimate 0 1 STRING 1 Female
33
cancer_predispositions 0 1 STRING 2 None documented
44
OS_days OS_MONTHS Overall survival in months since initial diagnosis 0 1 NUMBER 3 NA just convert to months
55
OS_status OS_STATUS Overall patient survival status 0 1 STRING 4 LIVING
6-
PFS_days PFS_MONTHS Progression free (months) since initial treatment 0 1 NUMBER 5 Not Reported just convert to months
6+
EFS_days EFS_MONTHS Event free (months) since initial treatment 0 1 NUMBER 5 Not Reported just convert to months
7+
EFS_event_type EFS_STATUS Event free status (months) since initial treatment 0 1 STRING 5 Not Reported
78
ethnicity ETHNICITY 0 1 STRING 6 Not Hispanic or Latino
89
race RACE 0 1 STRING 7 White
910
primary_site TUMOR_SITE 0 1 STRING 8
@@ -18,7 +19,8 @@ harmonized_diagnosis CANCER_TYPE_DETAILED 1 0 STRING 10 Adenoma
1819
primary_site TUMOR_TISSUE_SITE 1 0 STRING 9 Suprasellar/Hypothalamic/Pituitary
1920
tumor_descriptor TUMOR_TYPE 1 0 STRING 8 Initial CNS Tumor
2021
composition SAMPLE_TYPE 1 0 STRING 7 Solid Tissue
21-
cohort COHORT Source study cohort name 1 0 STRING 6
22+
cohort COHORT Source study cohort name 1 0 STRING 6
23+
sub_cohort SUB_COHORT Source study sub-cohort name 1 0 STRING 6
2224
CNS_region 1 0 STRING 5 Suprasellar
2325
tumor_ploidy 1 0 NUMBER 4 3
2426
tumor_fraction 1 0 NUMBER 3 0.476369391
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,2 @@
1+
#version 2.4
2+
Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome Start_Position End_Position Strand Variant_Classification Variant_Type Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 dbSNP_RS dbSNP_Val_Status Tumor_Sample_Barcode Matched_Norm_Sample_Barcode Match_Norm_Seq_Allele1 Match_Norm_Seq_Allele2 Tumor_Validation_Allele1 Tumor_Validation_Allele2 Match_Norm_Validation_Allele1 Match_Norm_Validation_Allele2 Verification_Status Validation_Status Mutation_Status Sequencing_Phase Sequence_Source Validation_Method Score BAM_File Sequencer Tumor_Sample_UUID Matched_Norm_Sample_UUID HGVSc HGVSp HGVSp_Short Transcript_ID Exon_Number t_depth t_ref_count t_alt_count n_depth n_ref_count n_alt_count all_effects Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation ALLELE_NUM DISTANCE STRAND_VEP SYMBOL SYMBOL_SOURCE HGNC_ID BIOTYPE CANONICAL CCDS ENSP SWISSPROT TREMBL UNIPARC RefSeq SIFT PolyPhen EXON INTRON DOMAINS AF AFR_AF AMR_AF ASN_AF EAS_AF EUR_AF SAS_AF AA_AF EA_AF CLIN_SIG SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO MINIMISED GENE_PHENO FILTER flanking_bps vcf_id vcf_qual gnomAD_AF gnomAD_AFR_AF gnomAD_AMR_AF gnomAD_ASJ_AF gnomAD_EAS_AF gnomAD_FIN_AF gnomAD_NFE_AF gnomAD_OTH_AF gnomAD_SAS_AF HGVSg vcf_pos gnomad_3_1_1_AC gnomad_3_1_1_AN gnomad_3_1_1_AF gnomad_3_1_1_nhomalt gnomad_3_1_1_AC_popmax gnomad_3_1_1_AN_popmax gnomad_3_1_1_AF_popmax gnomad_3_1_1_nhomalt_popmax gnomad_3_1_1_AC_controls_and_biobanks gnomad_3_1_1_AN_controls_and_biobanks gnomad_3_1_1_AF_controls_and_biobanks gnomad_3_1_1_AF_non_cancer gnomad_3_1_1_primate_ai_score gnomad_3_1_1_splice_ai_consequence MQ MQ0 CAL HotSpotAllele

0 commit comments

Comments
 (0)