-
Notifications
You must be signed in to change notification settings - Fork 36
[SCAVENGE] Fine mapping analysis of GWAS variants prior to SCAVENGE analysis
SCAVENGE accepts finemapped data as input. Typically finemapping is performed on regions of significant association identified in a GWAS. This page serves as a very brief reference and helping hand on the complicated topic of finemapping, it will:
- provide links to common finemapping methods which include documentation on their own use,
- brings attention to some areas that users may want to pay close attention to or have questions on,
- and provides a basic example approach in R of approximate bayes factor (ABF) finemapping
Finally it is worth mentioning again, this is a complicated topic and there are many excellent articles on the area [LINK]
Q. which finemapping tool do I use?
- There are many available tools, with many available backends and statistical frameworks. Anecdotally, the most commonly used are [FINEMAP], [susieR] and the ABF approach we describe below but there are others (links below). [PolyFun] and [PAINTOR] leverage additional annotation data to inform priors. [MsCAVIAR] is an example that takes input from multiple studies. Approximate bayes factor (ABF) finemapping (described in part originally here [Wakefield paper]) is unique in that it makes inferences of variant casual probability without LD reference information. You might consider running multiple approaches for an 'ensemble' approach to your 95% credible set, by averaging out the posterior inclusion probabilities of a given variant across the methods used for example (though this is currently not commonly done).
Q. how do I choose the areas to finemap?
- Some studies take the variant with the lowest p-value within a visible manhattan peak and consider a window around that variant. We recommend using [GCTA-COJO] to identify conditionally independent loci, and use those identified SNPs as so-called 'sentinel' variants which the finemapping window is formed around. This method requires a reference panel with which to draw LD information from, this should as closely as possible match the population represented in the GWAS results.
Q. how widely do I base a finemapping window?
- Typically, GWAS significant (or suggestive) lead SNPs are taken and a window of 1MBp created. Some studies consider a region of 1MBp either side (2MBp in total), you should try to choose a window you believe will encapsulate variants in LD with the lead.
Q. some tools allow me to select the number of causal snps, what do I choose?
- This is a choice that should probably reflect the complexity of the region you want to finemap, leaving this as the default (usually n=5) or setting to n=1 (i.e. the 'single-variant assumption') should make no appreciable difference to your SCAVENGE results.
As mentioned above, approximate bayes factor (ABF) finemapping is unique in that it makes inferences of variant casual probability without LD reference information. This is beneficial in situations where an LD reference panel is not available or where the GWAS summary statistics in question are from a large meta-analysis of heterogenous populations that cannot be split up to allow an LD reference panel to correctly match. Not using LD reference information is a disadvantage also, since causal probability inferences may not be as accurate. It is also more complicated to run in a complex association area, for example where you have more than one conditionally independent identified variant in the same window (this is explained further below). We present an overview of this approach here, since it is a flexible approach.
- Filter summary statistics to reasonable thresholds on MAF / imputation INFO scores e.t.c.
- If possible, run GCTA COJO
--cojo-slct
which requires an approximate LD reference panel, and which will identify a set of variants which are conditionally independent lead variants - you might choose to set a threshold of <5e-8 (genome wide significant) or <1e-6 (genome wide suggestive) in--cojo-p
. Documentation for [GCTA-COJO] is available. Lead/Sentinel variants will be found in the output.jma
files. If not possible, consider identifying your lead variants (which will dictate where the finemapping windows will lie) by another method, such as lowest p-value in a given region.
- Load your summary statistics into R, use the previously identified 'sentinel' SNPs to define the windows you expect to finemap.
# load in summary statistics
full_summarystats <- data.table::fread("sumstats.tsv")
# columns may include: SNP, CHR, BP, NEA, EA, EA_FRQ, BETA, SE, P, N
# to clarify: BP is genomic position coordinates, NEA is the non-tested allele (ie. typically the genome reference), EA is the tested allele, EA_FRQ is the frequency of the EA
# load in a table of sentinel SNPs (from step 2.)
topsnp <- data.table::fread("sentinel_snps.tsv")
# columns may include: GENE, SNP, CHR, POS
# make a list object to collect the results
result <- list()
# use a for loop to iterate through the sentinel variants
for (i in 1:nrow(topsnp)) {
# set your window size here this is a 1MBp total window
window_size <- (1e6 / 2)
window_name <- topsnp[i,]$GENE
leadsnp_name <- topsnp[i,]$SNP
leadsnp_pos <- topsnp[i,]$POS
leadsnp_chr <- topsnp[i,]$CHR
pos_min <- leadsnp_pos - window_size
pos_max <- leadsnp_pos + window_size
print(paste0("working on: ", window_name))
windowss <- full_summarystats %>%
distinct(SNP, .keep_all=TRUE) %>%
drop_na() %>%
filter( CHR==leadsnp_chr & (between(BP, pos_min, pos_max)) ) %>%
mutate(MAF = ifelse(EA_FRQ>0.5, 1-EA_FRQ, EA_FRQ)) %>%
filter(MAF>0.01)
coloc_ds_input = list(
# this will change depending on whether your statistics are based on a quantitative GWAS trait or binary trait (see the above link to coloc package for details)
type="quant",
pvalues=windowss$P,
N=windowss$N,
MAF=windowss$MAF,
beta=windowss$BETA,
varbeta=(windowss$SE^2),
#s=NULL,
#sdy=NULL,
snp=windowss$SNP
)
priornum = 1e-04
result[[paste0(window_name)]] <- coloc::finemap.abf(coloc_ds_input, p1 = priornum) %>%
dplyr::filter(as.character(prior) == as.character(priornum)) %>%
arrange(-SNP.PP) %>% # arrange in descending order
mutate(SNP.PP.cumsum = cumsum(SNP.PP)) %>% # calculate cumsum
mutate(incl95cs = ifelse( round(SNP.PP.cumsum, 2) <= 0.95, TRUE, FALSE )) %>% # label whether in the 95%CS or not
mutate(LDsentinel = ifelse (snp == leadsnp_name, TRUE, FALSE)) # label which is the lead snp
print("done")
}
# save results
saveRDS(result, "ABF-rawresults.rds")
# or merge and then save as a table
result.df <- data.table::rbindlist(result)
- Merge the results into a data frame, group by the region ID, calculate cumulative sums or posterior probabilities, filter to 95% credible sets
N.B. with this ABF approach, if you identify 2 or more conditionally independent variants from GCTA-COJO within each other's finemapping windows, you might consider conditioning upon the other variant(s) to retrieve a set of conditioned summary statistics for that region. This situation is a little out of the scope of this basic article aimed at being a helping hand in finemapping, but this can be performed using the GCTA-COJO software with the --cojo-cond
flag indicating the other sentinel variants that are within the window of interest, and by reading in the .cma
results as summary statistics for that sentinel's region instead of the raw summary statistics used for other regions. It is also worth mentioning, that if you have a region with >=3 conditionally identified variants in the same window - you might consider seeking an alternative approach that uses LD information (FINEMAP/susieR e.t.c.) since it begins to get complicated with the ABF method.
This web resource and vignette compiliation shows how to reproduce results of SCAVENGE analysis with monocyte count on a 10X PBMC dataset [Vignette-pdf], [Vignette-R markdown code].
For details, please explore the sections on the right-hand side!