
Poor inversion recall performance in BAM files with a high amount of variants #114

Open
Rapsssito opened this issue Feb 10, 2023 · 3 comments

@Rapsssito

I am working with a large number of genomes (BAM files). I have noticed that BRASS's recall on inversions drops sharply once a sample passes a certain "threshold" of variants. This only happens with certain BAMs that contain a noticeably higher number of variants.

I have taken a closer look at the code, and I believe it is related to this section in metropolis_hastings_inversions.R:

# Write out results
output_file = sub(".inversions.pdf", ".is_fb_artefact.txt", pdf_file)
if (bad_groups_count >= 50) {
    write.table(
        data.frame(
            d[, 7],                                                     # ID
            mcmc_res[["artefact_prob"]][, 1],                           # Probability of being true
            rank(1 - mcmc_res[["artefact_prob"]][, 1]) < threshold_idx # Whether the rearrangement is to be kept
        ),
        output_file,
        row.names = F,
        col.names = F,
        sep = "\t",
        quote = F
    )
} else {
    write.table(
        data.frame(
            d[, 7],            # ID
            rep(1, nrow(d)),   # All assigned probability 1
            rep(TRUE, nrow(d)) # All kept
        ),
        output_file,
        row.names = F,
        col.names = F,
        sep = "\t",
        quote = F
    )
}

In these BAMs with a higher number of variants, the bad_groups_count variable is key. When it reaches 50 or more, the script switches to a precision-oriented approach and discards almost all inversions:

rank(1 - mcmc_res[["artefact_prob"]][,1]) < threshold_idx  # Whether the rearrangement is to be kept

From reading the rest of the code, it looks like bad_groups_count is simply the count of variants that have 4 tumor reads and are shorter than 1e5 bp. From the outside, this looks like a very arbitrary choice that unnecessarily ties BRASS's recall performance to the number of variants in the genome (there might be more to it, but I could not find it).
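For illustration, here is a minimal self-contained R sketch of the filter's behaviour above the threshold. The data and the `threshold_idx` value are made up; only the `rank(...)` expression mirrors the script:

```r
# Sketch (made-up data) of the rank-based filter in metropolis_hastings_inversions.R.
set.seed(42)
n <- 200                    # number of candidate fold-back inversions (hypothetical)
artefact_prob <- runif(n)   # stand-in for mcmc_res[["artefact_prob"]][, 1]
threshold_idx <- 10         # hypothetical cut-off; the real value is computed by the script

# Only the (threshold_idx - 1) candidates with the highest probability of being
# true survive -- a fixed count, independent of how many candidates there are.
keep <- rank(1 - artefact_prob) < threshold_idx
sum(keep)  # 9, no matter how large n is
```

This is what makes the behaviour feel like an auto-penalty: the number of kept events is bounded by `threshold_idx`, not proportional to the number of real inversions.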

We are developing a benchmarking platform for somatic variant calling, and this auto-penalty greatly affects BRASS's position in the ranking. Before publishing the results, we wanted to let you know, as it looks like a very easy fix.

@yl3
Contributor

yl3 commented Mar 26, 2023

This script was written to address a small proportion of samples that are affected by library construction artefacts that manifest as locally inverted read pairs. When the rate of such artefactual fold-back read pairs exceeds a certain level, they can generate apparent clusters of fold-back read pairs, which Brass then calls as hundreds or even thousands of fold-back inversions (I have seen a sample with as many as ≥100,000 in the PCAWG data).

When a sample contains more than 50 fold-back inversions, as defined by the thresholds you referred to, the script assumes the sample is affected by a "significant number" of fold-back inversion artefacts and attempts to model and remove them. This is why you see a sudden change in the number of called fold-back inversions at 50. You are right that the threshold of 50 is arbitrary...

Just out of curiosity, how do you ascertain the biological presence/absence of a fold-back inversion in your benchmarking data?

@Rapsssito
Author

@yl3, thanks for the answer!

(...) You are right that the threshold of 50 is arbitrary...

Is there a possibility to add an extra parameter to avoid the activation of this "hard mode"?
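As a rough illustration of what such a parameter could look like (purely hypothetical — neither the flag name nor this parsing exists in BRASS today):

```r
# Hypothetical sketch only: exposing the hard-coded 50 as a command-line option.
args <- commandArgs(trailingOnly = TRUE)
bad_groups_threshold <- 50  # current hard-coded default in the script

opt <- grep("^--bad-groups-threshold=", args, value = TRUE)
if (length(opt) == 1) {
    bad_groups_threshold <- as.numeric(sub("^--bad-groups-threshold=", "", opt))
}

# The script would then test against the parameter instead of the literal 50:
#   if (bad_groups_count >= bad_groups_threshold) { ... }
# Passing --bad-groups-threshold=Inf would disable the "hard mode" entirely.
```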

Just out of curiosity, how do you ascertain the biological presence/absence of a fold-back inversion in your benchmarking data?

We work with previously validated datasets (such as PCAWG), and we also create our own genomes, either by introducing the variants manually or by copying them from another genome. We use these genomes to "compress" the benchmarking dataset, so we don't have to run all the variant callers on 50+ genomes.

@davidrajones

@Rapsssito I will put this request into our prioritisation queue. However, if and when it gets done is entirely down to other priorities set by the scientists, I'm afraid.

We are of course open to receiving pull requests with the requested changes in the meantime.
