This repository contains scripts to produce variant-to-disease (V2D) association tables for Open Targets Genetics.
Changes made (Jan 2019):
- All outputs in Apache Parquet format
- Top loci table contains harmonised effect size, 95% CI and direction
- GWAS Catalog sub-phenotypes (in
P-VALUE (TEXT)
column) are split into separatestudy_ids
- All
variant_id
s are decomposed intochrom
,pos
,ref
,alt
- GWAS Catalog curated associations are clustered to remove non-independent loci
- Usage
- Tables produced by workflow
# Set parameters.
export INSTANCE_NAME=v2d_data
export INSTANCE_ZONE=europe-west1-d
# Create the instance and SSH.
gcloud compute instances create \
${INSTANCE_NAME} \
--project=open-targets-genetics-dev \
--zone=${INSTANCE_ZONE} \
--machine-type=n1-highmem-32 \
--service-account=426265110888-compute@developer.gserviceaccount.com \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--create-disk=auto-delete=yes,boot=yes,device-name=${INSTANCE_NAME},image=projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210927,mode=rw,size=2000,type=projects/open-targets-eu-dev/zones/europe-west1-d/diskTypes/pd-balanced
gcloud compute ssh --zone ${INSTANCE_ZONE} ${INSTANCE_NAME}
# Setup on gcloud if needed
bash setup_gcloud.sh
# Authenticate google cloud storage if needed
gcloud auth application-default login
# Set up the instance.
sudo apt update
sudo apt install -yf \
openjdk-13-jre-headless \
python3-pip
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh
# Install dependencies into isolated environment
conda env create -n v2d_data --file environment.yaml
conda activate v2d_data
# Alter configuration file
nano config.yaml
# Remove downloads from any previous runs
rm -r www.ebi.ac.uk/gwas/
# Execute workflows
# May want to start tmux session to avoid problems if connection is lost
# (I've gotten snakemake problems on subsequent attempts when this happens too)
tmux
# May want to use a smaller machine for step 1, then scale up to more
# cores for step 2, and back down to a small machine for step 3
export PYSPARK_SUBMIT_ARGS="--driver-memory 100g pyspark-shell"
export VERSION_DATE=`date +%y%m%d`
mkdir -p logs/$VERSION_DATE
# Run workflow
time snakemake -s 1_make_tables.Snakefile --config version=$VERSION_DATE --cores all | tee logs/$VERSION_DATE/1_make_tables.log 2>&1 # Takes a couple hours
time snakemake -s 2_calculate_LD_table.Snakefile --config version=$VERSION_DATE --cores all 2>&1 | tee logs/$VERSION_DATE/2_calculate_LD_table.log 2>&1 # Takes ~7 hrs on 31 cores
# Reduce machine size in Google VM instance
# This step only uses 1 core actually - but I'm not sure how much memory
cores=3
time snakemake -s 3_make_overlap_table.Snakefile --config version=$VERSION_DATE --cores all 2>&1 | tee logs/$VERSION_DATE/3_make_overlap_table.log 2>&1 # Takes a couple hours
# Upload output dir to google cloud storage
gsutil -m rsync -r output/$VERSION_DATE gs://genetics-portal-dev-staging/v2d/$VERSION_DATE
List of loci associated with disease. Currently this data comes from two sources: (i) GWAS Catalog, (ii) Neale et al. UK Biobank summary statistics (version 1).
file schema: schema
--------------------------------------------------------------------------------
study_id: OPTIONAL BINARY L:STRING R:0 D:1
chrom: OPTIONAL BINARY L:STRING R:0 D:1
pos: OPTIONAL INT64 R:0 D:1
ref: OPTIONAL BINARY L:STRING R:0 D:1
alt: OPTIONAL BINARY L:STRING R:0 D:1
rsid: OPTIONAL BINARY L:STRING R:0 D:1
direction: OPTIONAL BINARY L:STRING R:0 D:1
beta: OPTIONAL DOUBLE R:0 D:1
beta_ci_lower: OPTIONAL DOUBLE R:0 D:1
beta_ci_upper: OPTIONAL DOUBLE R:0 D:1
odds_ratio: OPTIONAL DOUBLE R:0 D:1
oddsr_ci_lower: OPTIONAL DOUBLE R:0 D:1
oddsr_ci_upper: OPTIONAL DOUBLE R:0 D:1
pval_mantissa: OPTIONAL DOUBLE R:0 D:1
pval_exponent: OPTIONAL INT64 R:0 D:1
study_id
: unique identifier for studyvariant_id_b37
: chrom_pos_ref_alt (build 37) identifier for variant. RSID to variant ID mapping is non-unique, therefore multiple IDs may exist separated by ';'rsid
: RSID for each corresponding variant ID, separated by ';'direction
([+, -, null]): direction of the effect wrt to the alt allelebeta
(float, nullable): beta effect size for quantitative traitoddsr
(float, nullable): OR effect size for binary traitci_lower
(float, nullable): Lower 95% confidence interval of effect estimateci_upper
(float, nullable): Upper 95% confidence interval of effect estimatepval_mantissa
: the p-value coefficient when written in scientfic notationpval_exponent
: the p-value exponent (base 10)
- Download all associations
- Map rsIDs and chr:pos fields to variant_ids (GRCh37)
- This is complicated by the fact that:
- some GWAScat loci have mulitple rsid assignments
- some rows don't have rsids, will use chrom:pos (GRCh37 assumed) instead
- rsIDs/chrom:pos can map to multiple variant ids (e.g. rs773154059)
- Steps:
- Load GWAS Catalog data and explode multiSNPs into separate rows
- Load HRC site list + Ensembl VCF and split multiallelic sites into multiple rows
- For each assoc in gwascat, extract rsID from SNP_ID_CURRENT or SNPS
- Variants are first joined to HRC sitelist VCF, then unmerged rows are joined to the Ensembl VCF. Join on:
- Where RSID starts with "rs", join to VCF on rsid
- Where RSID starts with "chr", split into chrom and pos then join to VCF on these fields
- Make variant IDs in the format chr_pos_ref_alt
- Use "risk allele" from GWAS Catalog to see if it is concordant with alt and ref. Keep those that are concordant or ambiguous (missing), remove those that are not concordant.
- Identify studies with an abnormally large number of reported loci compared to number of loci after distance-based clustering (±500kb). Criteria for "abnormal" are set to number of loci > 10, and decrease in loci count after clustering of >10%. For "abnormal" studies (N≈120), apply distance based clustering.
- Notes:
- Two associations don't have the same number of SNPs in SNPS, CHR_ID and CHR_POS. These have been dropped as they are ambiguous.
- I am assuming that variants with id e.g. chr21:43693789 are from build GRCh37.
- Associations that do not have an rsID or chr:pos will be dropped
- Associations that do not have a match in the Ensembl VCF will be dropped.
- Create standard format containing all required columns
Steps:
- Remove rows with no mapped
variant_id_b37
- Extract p-value mantissa and exponenet
- Remove rows with no mapped
- Append UKB top loci from the finemapping conditional analysis
Steps:
- Download finemapping results from GCS (
gs://genetics-portal-input/uk_biobank_analysis/em21/neale_summary_statistics_20170915/finemapping/results/neale_ukb_uk10kref_180502.crediblesets.long.varids.tsv.gz
) - Select rows where
locus_index_varid == varid
as these will be the top loci - Make study_id from trait name (prepend "NEALEUKB_")
- Set any p-values that == 0 to minimum float
- Remove any rows where nominal p-value is > 1e-5
- Extract pvalue mantissa and exponent.
- Download finemapping results from GCS (
- GWAS Catalog RSIDs are mapped to variant IDs using (i) HRC sitelist VCF, (ii) Ensembl VCF. Therefore, if the variant is not in either of these it will be dropped from the top loci table.
- There can be multiple P-values for each (study, variant). This occurs where different subphenotypes (
P-VALUE (TEXT)
) are reported (e.g. different sexs, subpopulations). Occasionally there are multiple P-values for the same subphenotye also For the time being I will drop these. - We have decided not to extract effect sizes/OR and SE. This this can't be done in a consistent way across studies, because:
- Both alleles are not reported so the direction of the effect cannot be inferred unambiguously
- The 95% CI column str in poorly formatted making it very difficult to parse
- Many do not report effects
Information about each study found in the top loci table.
file schema: schema
--------------------------------------------------------------------------------
study_id: OPTIONAL BINARY L:STRING R:0 D:1
pmid: OPTIONAL BINARY L:STRING R:0 D:1
pub_date: OPTIONAL BINARY L:STRING R:0 D:1
pub_journal: OPTIONAL BINARY L:STRING R:0 D:1
pub_title: OPTIONAL BINARY L:STRING R:0 D:1
pub_author: OPTIONAL BINARY L:STRING R:0 D:1
trait_reported: OPTIONAL BINARY L:STRING R:0 D:1
trait_efos: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL BINARY L:STRING R:1 D:3
ancestry_initial: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL BINARY L:STRING R:1 D:3
ancestry_replication: OPTIONAL F:1
.list: REPEATED F:1
..item: OPTIONAL BINARY L:STRING R:1 D:3
n_initial: OPTIONAL INT64 R:0 D:1
n_replication: OPTIONAL INT64 R:0 D:1
n_cases: OPTIONAL INT64 R:0 D:1
trait_category: OPTIONAL BINARY L:STRING R:0 D:1
num_assoc_loci: OPTIONAL INT64 R:0 D:1
study_id
: unique identifier for studypmid
: pubmed ID (GWAS Catalog studies only)pub_date
: publication datepub_journal
: publication journalpub_title
: publication titlepub_author
: publication authortrait_reported
: trait reported in publicationtrait_mapped
: ontology trait labeltrait_efos
: EFO short_formancestry_initial
: ancestry of initial GWAS sample, separated by ';'ancestry_replication
: ancestry of replication GWAS sample, separated by ';'n_initial
: GWAS initial sample sizen_replication
: GWAS replication sample sizen_cases
: number of cases. Warning: there is currently no easy way to get this information from GWAS Catalog, therefore it is set to nullnum_assoc_loci
(int): total number of associated loci for this study in the top loci table
Merges to top loci table using study_id
.
Steps:
- Get GWAS Catalog study and ancestry information.
- Combined ancestry information into a single field.
- Make trait_code hash using 16 characters from
hasher.hexdigest()[0:16]
- Remove rows with no sample size information (see note below)
- For studies with multiple different values in
P-VALUE (TEXT)
(multi-trait studies), split each phenotype into its own study ID.
Notes:
- There are 18 studies where sample size == 0 because ancestry in API is null. Annalisa has recommended these studies be removed, as they are old curations that are not consistent with the rest of the catalog.
Steps
- Get and merge self-reported and ICD10 manual EFO curations
- Map EFOs to mapped_trait
- Load Neale et al manifest
- Merge EFOs with manifest
- Extract required columns
- Create trait code using field code from Neale et al.
- Add trait categories field, acquired from Oxford BIG server
Todo:
- Fix field
S72 EFO_000855
in EFO curation (short_form not recognised)
Credible set analysis results used to link index variants to tag variants. Full finemapping methods can be seen here: https://github.com/opentargets/finemapping
file schema: schema
--------------------------------------------------------------------------------
study_id: OPTIONAL BINARY L:STRING R:0 D:1
lead_chrom: OPTIONAL BINARY L:STRING R:0 D:1
lead_pos: OPTIONAL INT64 R:0 D:1
lead_ref: OPTIONAL BINARY L:STRING R:0 D:1
lead_alt: OPTIONAL BINARY L:STRING R:0 D:1
tag_chrom: OPTIONAL BINARY L:STRING R:0 D:1
tag_pos: OPTIONAL INT64 R:0 D:1
tag_ref: OPTIONAL BINARY L:STRING R:0 D:1
tag_alt: OPTIONAL BINARY L:STRING R:0 D:1
log10_ABF: OPTIONAL DOUBLE R:0 D:1
posterior_prob: OPTIONAL DOUBLE R:0 D:1
study_id
: unique identifier for studyindex_variantid_b37
: unique variant identifier for index variant, chrom_pos_ref_alt (build 37)tag_variantid_b37
: unique variant identifier for tagging variant, chrom_pos_ref_alt (build 37)log10_ABF
: log10 of the approximate Bayes factor for this tagging variantposterior_prob
: posterior probability of causality for this tagging variant compared to other variants at this locus
Table is pre-filtered to contain only 95% credible set variants. Merges to top loci table using study_id
and index_variantid_b37
.
Steps
- Downlaod from from GCS
- Make study ID frrom the field code
- Filter to only keep those in 95% credible sets
- Put into a standardised format (above)
Table of LD values linking index varaints to tag variants. I
root
|-- study_id: string (nullable = true)
|-- lead_chrom: string (nullable = true)
|-- lead_pos: integer (nullable = true)
|-- lead_ref: string (nullable = true)
|-- lead_alt: string (nullable = true)
|-- tag_chrom: string (nullable = true)
|-- tag_pos: integer (nullable = true)
|-- tag_ref: string (nullable = true)
|-- tag_alt: string (nullable = true)
|-- overall_r2: double (nullable = true)
|-- pics_mu: double (nullable = true)
|-- pics_postprob: double (nullable = true)
|-- pics_95perc_credset: boolean (nullable = true)
|-- pics_99perc_credset: boolean (nullable = true)
|-- AFR_1000G_prop: double (nullable = true)
|-- AMR_1000G_prop: double (nullable = true)
|-- EAS_1000G_prop: double (nullable = true)
|-- EUR_1000G_prop: double (nullable = true)
|-- SAS_1000G_prop: double (nullable = true)
study_id
: unique identifier for studyindex_variantid_b37
: unique variant identifier for index variant, chrom_pos_ref_alt (build 37)tag_variantid_b37
: unique variant identifier for tagging variant, chrom_pos_ref_alt (build 37)overall_r2
: overall R2 averaged of superpopulationspics_mu
: Mu statistic from PICS calculationpics_postprob
: PICS posterior probability that this variant is causalpics_95perc_credset
: If this variant is in 95% credible setpics_99perc_credset
: If this variant is in 99% credible setAFR_1000G_prop
: proportion of sample from AFR superpopulationAMR_1000G_prop
: proportion of sample from AMR superpopulationEAS_1000G_prop
: proportion of sample from EAS superpopulationEUR_1000G_prop
: proportion of sample from EUR superpopulationSAS_1000G_prop
: proportion of sample from SAS superpopulation
Table is pre-filtered to only contain R2 > 0.5.
LD calculated for each (study_id, index_variant) pair using 1000 Genomes p3.
Methods:
- GWAS Catalog populations mapped to 1000 Genomes super populations
- Manually curated. Suppl fig 6 from here https://arxiv.org/pdf/1805.03233.pdf is useful.
- Curations in
configs/gwascat_superpopulation_lut.curated.tsv
- Output list of index variants with super population proportions for each study (
ld_analysis_input.tsv.gz
). Proportions are calulcated across both initial and replication samples. If a compound ancestry is provided by GWAS Catalog e.g. "popA, popB=100", then the sample size is split equally amoungst them, i.e. popA=50, popB=50. This file is used as the LD analysis input manifest. - Prepare 1000 Genomes haplotypes. Script available here: 1. Download 2. Normalise to reference genome 3. Extract samples separately for each super population 4. Filter to remove MAF < 1% and genotyping-rate < 5% 5. Convert to bed, bim, fam
- Use plink to calculate correlation coefficients (R) for each index study in each 1000G superpopulation.
- Merge index variant correlation tables to input manifest
- Fisher-Z transform R coefficients.
- For each study take weighted average across populations weighting by sample size.
- Inverse-Fisher-Z transform to get R values
- Conduct PICS finemapping analysis
Notes:
- Studies with missing or NR ancestries will be assumed to be European
Table containing metrics on the degree of (tag variant) overlap between index variants with 5Mb of each other.
study_id_A
: locusA unique identifier for studyindex_variantid_b37_A
: locusA unique variant identifier, chrom_pos_ref_alt (build 37)study_id_B
: locusB unique identifier for studyindex_variantid_b37_B
: locusB unique variant identifier, chrom_pos_ref_alt (build 37)set_type
: one of 'finemapping', 'ld' or 'combined' (see below)distinct_A
: number of tag variants distinct to locusAoverlap_AB
: number of tag variants overlapping locusA and locusBdistinct_B
: number of tag variants distinct to locusB
Only loci with >=1 overlapping tag variants are stored.
Table showing the number of overlapping tag variants for each (study_id, index_variant) found within 5Mb of each other. This calculated using: (i) only finemapping sets, (ii) only LD sets, (iii) a combination of finemapping and LD, prefering finemapping over LD where available. The 'combined' set is used for the Genetics Portal.
- GCST006612 1_55505647_G_T effect allele=T -0.325
- GCST002898 1_55505647_G_T effect allele=T -0.53
- GCST005194_1 1_55505647_G_T effect allele=T -0.282