-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Poor inversion recall performance in BAM files with a high amount of variants #114
Comments
This script was written to address a small proportion of samples that are affected by library construction artifacts that manifest are locally inverted read pairs. When the rate of such artifactual fold-back read pairs exceed a certain level, they can generate apparent clusters of fold-back read pairs which are called by Brass as hundreds or even thousands of fold-back inversions (I have seen a sample with as many as ≥100,000 in the PCAWG data). When a sample contains more than 50 fold-back inversions as defined by the thresholds you referred to, the script assumes that a given sample is affected by a "significant number" of fold-back inversion artefacts and will attempt to model and remove them. This is why you see a sudden change in the number of called fold-back inversions at 50. You are right that the threshold of 50 is arbitrary... Just out of curiosity, how do you ascertain the biological presence/absence of a fold-back inversion in your benchmarking data? |
@yl3, thanks for the answer!
Is there a possibility to add an extra parameter to avoid the activation of this "hard mode"?
We work with previously validated datasets (such as PCAWG) and we also create our own genomes. We either introduce the variants manually or copy them from other genome. We use these genomes to "compress" the benchmarking dataset. This way we don't have to run all the variant callers on 50+ genomes. |
@Rapsssito I will put this request into our prioritisation queue. However, if and when it will gets done is entirely down to other priorities set by the scientists I'm afraid. We are of course open to receiving pull requests with the requested changes in in the mean time. |
I am working with large number of genomes (BAM files). I have noticed that BRASS greatly reduces its recall performance on inversions once it passes a certain "threshold" of variants. This only happens with certain BAMs that contain a noticeable higher amount of variants.
I have taken a closer look at the code, and I believe it is related to this section in metropolis_hastings_inversions.R:
BRASS/perl/share/Rscripts/metropolis_hastings_inversions.R
Lines 324 to 352 in dd0e1c1
In these BAMs with higher amount of variants, the
bad_groups_count
variable is key. When is higher than 50, it causes the script to follow a more precision-based approach and skips almost all inversions:From reading the rest of the code, it looks like
bad_groups_count
is just the count of all variants that have 4 tumor reads and are less than 1e5 in length. From the outside, it looks like a very arbitrary choice that unnecessarily links BRASS' recall performance to the number of variants in the genome (they might be more to it, but I could not find it).We are developing a benchmarking platform for somatic variant calling and this auto-penalty greatly affects BRASS's position in the ranking. Before publishing the results, we wanted to let you know as it looks like a very easy fix.
The text was updated successfully, but these errors were encountered: