-
Notifications
You must be signed in to change notification settings - Fork 11
Description
I am using cellSNP is a prior step to vireo and it works very well. But it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I think a lot of time is wasted in
https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68
There's overhead to opening a bam file, and overhead to accessing references. This is done once per SNP, so the overhead is dramatic for say the 1kGenome SNPs. A quick fix can be to change
if chrom != None:
if chrom not in samFile.references:
if chrom.startswith("chr"):
chrom = chrom.split("chr")[1]
else:
chrom = "chr" + chrom
if chrom not in samFile.references:
print("Can't find references %s in samFile" %chrom)
return None, None
to
if chrom != None:
sam_refs = samFile.references
if chrom not in sam_refs:
if chrom.startswith("chr"):
chrom = chrom.split("chr")[1]
else:
chrom = "chr" + chrom
if chrom not in sam_refs:
print("Can't find references %s in samFile" %chrom)
return None, None
Which in a small test case I have with 5000 cells, 20000 reads and with 100,000 SNPs resulted in a 45% speedup. Another more dramatic speedup is to avoid re-opening the BAM files over and over, I played around with creating a global that stores the currently open filename, chromosome and file object. Then returning the global if the request is identical
def check_pysam_chrom(samFile, chrom=None):
"""Chech if samFile is a file name or pysam object, and if chrom format.
"""
global CHROM_CACHE
global CHROM_CACHE_ID
global CHROM_CHACH_SAMFILE
CHROM_CACHE_SAMFILE = samFile
if CHROM_CACHE is not None:
if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
return CHROM_CACHE, CHROM_CACHE_ID
if type(samFile) == str:
ftype = samFile.split(".")[-1]
if ftype != "bam" and ftype != "sam" and ftype != "cram" :
print("Error: file type need suffix of bam, sam or cram.")
sys.exit(1)
if ftype == "cram":
samFile = pysam.AlignmentFile(samFile, "rc")
elif ftype == "bam":
samFile = pysam.AlignmentFile(samFile, "rb")
else:
samFile = pysam.AlignmentFile(samFile, "r")
if chrom != None:
if chrom not in samFile.references:
if chrom.startswith("chr"):
chrom = chrom.split("chr")[1]
else:
chrom = "chr" + chrom
if chrom not in samFile.references:
print("Can't find references %s in samFile" %chrom)
return None, None
CHROM_CACHE = samFile
CHROM_CACHE_ID = chrom
return samFile, chrom
This results in a 3.5x speedup under a single thread, but I don't know how the global is handled under multiple threads so I cannot recommend this solution.
The last thing I tried, was to remove the top level thread pools and push the threading down to the lower level by using
samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)
however this resulted in negative performance impacts, for me it was slower on all thread counts higher than 1.