Optimisation suggestions

I am using cellSNP is a prior step to vireo and it works very well. But it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I think a lot of time is wasted in 

https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68

There's overhead to opening a bam file, and overhead to accessing references. This is done once per SNP, so the overhead is dramatic for say the 1kGenome SNPs. A quick fix can be to change

    if chrom != None:
        if chrom not in samFile.references:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
        if chrom not in samFile.references:
            print("Can't find references %s in samFile" %chrom)
            return None, None

to 

    if chrom != None:
        sam_refs = samFile.references
        if chrom not in sam_refs:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
        if chrom not in sam_refs:
            print("Can't find references %s in samFile" %chrom)
            return None, None

Which in a small test case I have with 5000 cells, 20000 reads and with 100,000 SNPs resulted in a 45% speedup. Another more dramatic speedup is to avoid re-opening the BAM files over and over, I played around with creating a global that stores the currently open filename, chromosome and file object. Then returning the global if the request is identical

    def check_pysam_chrom(samFile, chrom=None):
        """Chech if samFile is a file name or pysam object, and if chrom format. 
        """
        global CHROM_CACHE
        global CHROM_CACHE_ID
        global CHROM_CHACH_SAMFILE

        CHROM_CACHE_SAMFILE = samFile

        if CHROM_CACHE is not None:
            if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
                return CHROM_CACHE, CHROM_CACHE_ID

        if type(samFile) == str:
            ftype = samFile.split(".")[-1]
            if ftype != "bam" and ftype != "sam" and ftype != "cram" :
                print("Error: file type need suffix of bam, sam or cram.")
                sys.exit(1)
            if ftype == "cram":
                samFile = pysam.AlignmentFile(samFile, "rc")
            elif ftype == "bam":
                samFile = pysam.AlignmentFile(samFile, "rb")
            else:
                samFile = pysam.AlignmentFile(samFile, "r")

        if chrom != None:
            if chrom not in samFile.references:
                if chrom.startswith("chr"):
                    chrom = chrom.split("chr")[1]
                else:
                    chrom = "chr" + chrom
            if chrom not in samFile.references:
                print("Can't find references %s in samFile" %chrom)
                return None, None
        
        CHROM_CACHE = samFile
        CHROM_CACHE_ID = chrom
        return samFile, chrom

This results in a 3.5x speedup under a single thread, but I don't know how the global is handled under multiple threads so I cannot recommend this solution.

The last thing I tried, was to remove the top level thread pools and push the threading down to the lower level by using

    samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)

however this resulted in negative performance impacts, for me it was slower on all thread counts higher than 1.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Optimisation suggestions #1

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Optimisation suggestions #1

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions