Skip to content

Optimisation suggestions #1

@Shians

Description

@Shians

I am using cellSNP is a prior step to vireo and it works very well. But it is incredibly slow and I think there's a lot of room for speeding it up. I don't program in Python and don't have much experience profiling Python code, but I think a lot of time is wasted in

https://github.com/huangyh09/cellSNP/blob/master/cellSNP/utils/pileup_utils.py#L44-L68

There's overhead to opening a bam file, and overhead to accessing references. This is done once per SNP, so the overhead is dramatic for say the 1kGenome SNPs. A quick fix can be to change

if chrom != None:
    if chrom not in samFile.references:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in samFile.references:
        print("Can't find references %s in samFile" %chrom)
        return None, None

to

if chrom != None:
    sam_refs = samFile.references
    if chrom not in sam_refs:
        if chrom.startswith("chr"):
            chrom = chrom.split("chr")[1]
        else:
            chrom = "chr" + chrom
    if chrom not in sam_refs:
        print("Can't find references %s in samFile" %chrom)
        return None, None

Which in a small test case I have with 5000 cells, 20000 reads and with 100,000 SNPs resulted in a 45% speedup. Another more dramatic speedup is to avoid re-opening the BAM files over and over, I played around with creating a global that stores the currently open filename, chromosome and file object. Then returning the global if the request is identical

def check_pysam_chrom(samFile, chrom=None):
    """Chech if samFile is a file name or pysam object, and if chrom format. 
    """
    global CHROM_CACHE
    global CHROM_CACHE_ID
    global CHROM_CHACH_SAMFILE

    CHROM_CACHE_SAMFILE = samFile

    if CHROM_CACHE is not None:
        if (samFile == CHROM_CACHE_SAMFILE) and (chrom == CHROM_CACHE_ID):
            return CHROM_CACHE, CHROM_CACHE_ID

    if type(samFile) == str:
        ftype = samFile.split(".")[-1]
        if ftype != "bam" and ftype != "sam" and ftype != "cram" :
            print("Error: file type need suffix of bam, sam or cram.")
            sys.exit(1)
        if ftype == "cram":
            samFile = pysam.AlignmentFile(samFile, "rc")
        elif ftype == "bam":
            samFile = pysam.AlignmentFile(samFile, "rb")
        else:
            samFile = pysam.AlignmentFile(samFile, "r")

    if chrom != None:
        if chrom not in samFile.references:
            if chrom.startswith("chr"):
                chrom = chrom.split("chr")[1]
            else:
                chrom = "chr" + chrom
        if chrom not in samFile.references:
            print("Can't find references %s in samFile" %chrom)
            return None, None
    
    CHROM_CACHE = samFile
    CHROM_CACHE_ID = chrom
    return samFile, chrom

This results in a 3.5x speedup under a single thread, but I don't know how the global is handled under multiple threads so I cannot recommend this solution.

The last thing I tried, was to remove the top level thread pools and push the threading down to the lower level by using

samFile = pysam.AlignmentFile(samFile, "rb", threads = nproc)

however this resulted in negative performance impacts, for me it was slower on all thread counts higher than 1.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions