Skip to content

[SCAVENGE] Fine mapping analysis of GWAS variants prior to SCAVENGE analysis

FLYu edited this page Dec 1, 2022 · 1 revision

Finemapping - a brief note on approaches

SCAVENGE accepts finemapped data as input. Typically finemapping is performed on regions of significant association identified in a GWAS. This page serves as a very brief reference and helping hand on the complicated topic of finemapping, it will:

  • provide links to common finemapping methods which include documentation on their own use,
  • brings attention to some areas that users may want to pay close attention to or have questions on,
  • and provides a basic example approach in R of approximate bayes factor (ABF) finemapping

Finally it is worth mentioning again, this is a complicated topic and there are many excellent articles on the area [LINK]

Points to consider when finemapping

Q. which finemapping tool do I use?

  • There are many available tools, with many available backends and statistical frameworks. Anecdotally, the most commonly used are [FINEMAP], [susieR] and the ABF approach we describe below but there are others (links below). [PolyFun] and [PAINTOR] leverage additional annotation data to inform priors. [MsCAVIAR] is an example that takes input from multiple studies. Approximate bayes factor (ABF) finemapping (described in part originally here [Wakefield paper]) is unique in that it makes inferences of variant casual probability without LD reference information. You might consider running multiple approaches for an 'ensemble' approach to your 95% credible set, by averaging out the posterior inclusion probabilities of a given variant across the methods used for example (though this is currently not commonly done).

Q. how do I choose the areas to finemap?

  • Some studies take the variant with the lowest p-value within a visible manhattan peak and consider a window around that variant. We recommend using [GCTA-COJO] to identify conditionally independent loci, and use those identified SNPs as so-called 'sentinel' variants which the finemapping window is formed around. This method requires a reference panel with which to draw LD information from, this should as closely as possible match the population represented in the GWAS results.

Q. how widely do I base a finemapping window?

  • Typically, GWAS significant (or suggestive) lead SNPs are taken and a window of 1MBp created. Some studies consider a region of 1MBp either side (2MBp in total), you should try to choose a window you believe will encapsulate variants in LD with the lead.

Q. some tools allow me to select the number of causal snps, what do I choose?

  • This is a choice that should probably reflect the complexity of the region you want to finemap, leaving this as the default (usually n=5) or setting to n=1 (i.e. the 'single-variant assumption') should make no appreciable difference to your SCAVENGE results.

Example 1. ABF finemapping.

As mentioned above, approximate bayes factor (ABF) finemapping is unique in that it makes inferences of variant casual probability without LD reference information. This is beneficial in situations where an LD reference panel is not available or where the GWAS summary statistics in question are from a large meta-analysis of heterogenous populations that cannot be split up to allow an LD reference panel to correctly match. Not using LD reference information is a disadvantage also, since causal probability inferences may not be as accurate. It is also more complicated to run in a complex association area, for example where you have more than one conditionally independent identified variant in the same window (this is explained further below). We present an overview of this approach here, since it is a flexible approach.

  1. Filter summary statistics to reasonable thresholds on MAF / imputation INFO scores e.t.c.
  1. If possible, run GCTA COJO --cojo-slct which requires an approximate LD reference panel, and which will identify a set of variants which are conditionally independent lead variants - you might choose to set a threshold of <5e-8 (genome wide significant) or <1e-6 (genome wide suggestive) in --cojo-p. Documentation for [GCTA-COJO] is available. Lead/Sentinel variants will be found in the output .jma files. If not possible, consider identifying your lead variants (which will dictate where the finemapping windows will lie) by another method, such as lowest p-value in a given region.
  1. Load your summary statistics into R, use the previously identified 'sentinel' SNPs to define the windows you expect to finemap.
# load in summary statistics
full_summarystats <- data.table::fread("sumstats.tsv")
# columns may include: SNP, CHR, BP, NEA, EA, EA_FRQ, BETA, SE, P, N
# to clarify: BP is genomic position coordinates, NEA is the non-tested allele (ie. typically the genome reference), EA is the tested allele, EA_FRQ is the frequency of the EA
# load in a table of sentinel SNPs (from step 2.)
topsnp <- data.table::fread("sentinel_snps.tsv")
# columns may include: GENE, SNP, CHR, POS
  1. There are many ways to conduct the actual finemapping step, including manual calculations, however the [coloc] R package has a helpful function that will perform this step for you finemap.abf(). An alternative is the [GTX] R package which has a function abf.Wakefield()
# make a list object to collect the results
result <- list()
# use a for loop to iterate through the sentinel variants
for (i in 1:nrow(topsnp)) {
  # set your window size here this is a 1MBp total window
  window_size <- (1e6 / 2)
  window_name <- topsnp[i,]$GENE
  leadsnp_name <- topsnp[i,]$SNP
  leadsnp_pos <- topsnp[i,]$POS
  leadsnp_chr <- topsnp[i,]$CHR
  pos_min <- leadsnp_pos - window_size
  pos_max <- leadsnp_pos + window_size

  print(paste0("working on: ", window_name))
  
  windowss <- full_summarystats %>% 
    distinct(SNP, .keep_all=TRUE) %>%
    drop_na() %>%
    filter( CHR==leadsnp_chr & (between(BP, pos_min, pos_max)) ) %>%
    mutate(MAF = ifelse(EA_FRQ>0.5, 1-EA_FRQ, EA_FRQ)) %>%
    filter(MAF>0.01)

  coloc_ds_input = list(
    # this will change depending on whether your statistics are based on a quantitative GWAS trait or binary trait (see the above link to coloc package for details)
    type="quant",
    pvalues=windowss$P,
    N=windowss$N,
    MAF=windowss$MAF,
    beta=windowss$BETA,
    varbeta=(windowss$SE^2),
    #s=NULL,
    #sdy=NULL,
    snp=windowss$SNP
  )
  
  priornum = 1e-04

  result[[paste0(window_name)]] <- coloc::finemap.abf(coloc_ds_input, p1 = priornum) %>% 
                                    dplyr::filter(as.character(prior) == as.character(priornum)) %>%
                                    arrange(-SNP.PP) %>% # arrange in descending order
                                    mutate(SNP.PP.cumsum = cumsum(SNP.PP)) %>% # calculate cumsum
                                    mutate(incl95cs = ifelse( round(SNP.PP.cumsum, 2) <= 0.95, TRUE, FALSE )) %>% # label whether in the 95%CS or not
                                    mutate(LDsentinel = ifelse (snp == leadsnp_name, TRUE, FALSE)) # label which is the lead snp
print("done")
}

# save results
saveRDS(result, "ABF-rawresults.rds")
# or merge and then save as a table
result.df <- data.table::rbindlist(result)

  1. Merge the results into a data frame, group by the region ID, calculate cumulative sums or posterior probabilities, filter to 95% credible sets

N.B. with this ABF approach, if you identify 2 or more conditionally independent variants from GCTA-COJO within each other's finemapping windows, you might consider conditioning upon the other variant(s) to retrieve a set of conditioned summary statistics for that region. This situation is a little out of the scope of this basic article aimed at being a helping hand in finemapping, but this can be performed using the GCTA-COJO software with the --cojo-cond flag indicating the other sentinel variants that are within the window of interest, and by reading in the .cma results as summary statistics for that sentinel's region instead of the raw summary statistics used for other regions. It is also worth mentioning, that if you have a region with >=3 conditionally identified variants in the same window - you might consider seeking an alternative approach that uses LD information (FINEMAP/susieR e.t.c.) since it begins to get complicated with the ABF method.

SCAVENGE Tutorial

This web resource and vignette compiliation shows how to reproduce results of SCAVENGE analysis with monocyte count on a 10X PBMC dataset [Vignette-pdf], [Vignette-R markdown code].