-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Can AMULET identify doublets with relatively low read counts from cells with a wide range of read counts #17
Comments
Ideally, singlets with UQ = 30000 wouldn't be identified as a doublet by AMULET since there shouldn't be many instances of overlaps >2. If it is identified as a doublet by AMULET, then the overlaps are likely driven by repetitive regions, copy number variations, or sequencing errors. AMULET tries to take repetitive regions into account by filtering regions identified by repeat masker. For better identifying doublets within 3 < log10(UQ + 1) < 4, you could try running AMULET on just those subset of cells. For this, you can provide a different singlecell.csv file with only cells in this range. It would adjust the background estimate used for the poisson distribution. Changing the statistical method to something that accounts for the total unique fragments within the cell does sound like a great way to improve the method! |
The expected number (2) is based on the biology of how many copies of chromosomes are within diploid cells. What this means is that this number should only change based on the number of chromosome copies within the cells being sequenced. For example, finding multiplets in proliferating diploid cells will need to adjust this number since they may have up to 4 copies. Consequently, running AMULET on cells such as PBMCs may detect proliferating cells. Unfortunately, AMULET doesn't have a built in option for accounting for sequencing depth/unique fragments within the cells. This would be a future work/extension of the method. However, I imagine that the AMULET output files that provide the overlap count information in conjunction with the number of unique reads within the cell can be easily utilized to apply a different statistical method for identifying the most probable multiplets. |
Thank you and good to know this! Can you suggest some statistical method for this situation? |
AMULET does remove the reads that are marked as duplicates. My first thoughts are that there is either a bug in the code that's producing these outputs, or the BAM file does not have the duplicates flagged. Do you mind sharing what version of AMULET you are running and what the inputs are (e.g., Cell Ranger ATAC 2.0 etc)? |
I am running AMULET-v1.1. The original inputs are count matrix . |
Processing the bam file to mark the duplicates would allow the tool to run. However, the algorithm to mark the duplicates would need to be aware that not all reads come from the same cell. Reads with the same start/end positions from different cells should not be marked as duplicates, but otherwise would be without this awareness. An alternative solution is to update AMULET to include an extra step to save the last processed read for each cell and skip reads with the same start/end position as the previous read from the corresponding cell. Hoping to find time to include this as an option for AMULET. |
Right! Could you suggest some tool that does not mark reads with identical start/end positions from different cells as duplicates? I am considering Picard, but I am not sure if it has such functionality. |
Hi! I am applying AMULET on snATAC-seq data generated from my lab. For a dataset like this, after filtering with TSSE >= 10 and log10(UQ + 1) >= 3 (UQ means the number of uniquely mapped fragments), the cells still have a wide range of UQ.
Since the number of overlaps (genomic regions with >2 overlapping reads) for a cell positively correlates with the UQ of the cell, cells have high UQ tend to have a larger number of overlaps. Because the majority of cells fall in the range of 3 < log10(UQ + 1) < 4, I believe the majority of doublets are also there (at least with the same order of magnitude). However, with a doublet formed by two singlets each with UQ = 3000 and a singlet with UQ = 30000, AMULET would very probably locate more overlaps in the singlet than in the doublet. The cells whose UQ > 10000 tend to have more overlaps and are thus more likely to be identified as doublets.
The AMULET paper indicates that 25K median valid reads per nucleus is optimal, but is there a way to maximize the probability of finding doublets in the range of 3 < log10(UQ + 1) < 4 in this dataset given that there are cells in range 4 < log10(UQ + 1) < 5?
The text was updated successfully, but these errors were encountered: