
Accuracy #59

@ralphmatar


Hi, I am using stringMLST and noticed that I got different results when running the same sample more than once. The database had changed (been updated) between runs. What surprised me was that some of the STs assigned in the first run were completely different in the second. Is this normal?

Activity

ar0ch (Member) commented on Aug 9, 2024

Hi Ralph,

stringMLST should be deterministic given the same reads and k-mer db. If you update the database, a different ST may be called because additional gene sequences and alleles are available; I don't think this is super surprising. I would be surprised if running stringMLST on the same sample with the same db gave different results. That should really only happen if there's significant contamination, and even then it should be rare.

❯ mkdir -p stringMLST_analysis; cd stringMLST_analysis
stringMLST.py --getMLST -P neisseria/nmb --species neisseria
Preparing: neisseria
	Database ready for neisseria
	neisseria/nmb

~/stringMLST_analysis took 23s
❯ wget -qqq ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR026/ERR026529/ERR026529_2.fastq.gz

~/stringMLST_analysis took 16s
❯ for n ({1..5}); stringMLST.py --predict -P neisseria/nmb -1 ERR026529_1.fastq.gz -2 ERR026529_2.fastq.gz && echo '-----'

Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174
-----
Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174
-----
Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174
-----
Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174
-----
Sample	abcZ	adk	aroE	fumC	gdh	pdhC	pgm	ST
ERR026529	231	180	306	612	269	277	260	10174
-----
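
To see which loci changed between two runs (for example, before and after a database update), the result tables can be compared column by column. The following is a minimal sketch, not part of stringMLST: it assumes the tab-separated output format shown above, and the function names `parse_stringmlst_table` and `diff_calls` are hypothetical helpers.

```python
import csv
import io


def parse_stringmlst_table(text):
    """Parse tab-separated stringMLST output into {sample: {column: value}}."""
    reader = csv.DictReader(io.StringIO(text.strip()), delimiter="\t")
    return {
        row["Sample"]: {k: v for k, v in row.items() if k != "Sample"}
        for row in reader
    }


def diff_calls(run_a, run_b):
    """Return {sample: {column: (value_a, value_b)}} for columns that differ."""
    diffs = {}
    for sample in sorted(set(run_a) & set(run_b)):
        changed = {
            col: (run_a[sample][col], run_b[sample].get(col))
            for col in run_a[sample]
            if run_a[sample][col] != run_b[sample].get(col)
        }
        if changed:
            diffs[sample] = changed
    return diffs


# Table copied from the run above.
run1 = parse_stringmlst_table(
    "Sample\tabcZ\tadk\taroE\tfumC\tgdh\tpdhC\tpgm\tST\n"
    "ERR026529\t231\t180\t306\t612\t269\t277\t260\t10174\n"
)
print(diff_calls(run1, run1))  # {} (identical runs, nothing differs)
```

If only one or two alleles differ after a database update, newly added alleles are a plausible explanation; differences between runs against the same database would be worth reporting.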

ralphmatar (Author) commented on Aug 18, 2024

I appreciate the clarification. I have 2 more questions if you don't mind.

  1. Is stringMLST applicable to Oxford Nanopore Technologies (long) reads? I have sequenced amplicons of the 7 housekeeping genes.
  2. I also ran RAxML on variant calls from mpileup files covering all FASTQ files. Some samples have the same predicted ST, yet they do not seem to cluster together in the tree.

ralphmatar (Author) commented on Aug 18, 2024

Screenshot from 2024-08-18 18-35-39
This is an example

ar0ch (Member) commented on Aug 22, 2024

  1. Is stringMLST applicable to Oxford Nanopore Technologies (long) reads? I have sequenced amplicons of the 7 housekeeping genes.

Technically yes, though it's sensitive to read errors, which tend to be more prevalent in ONT reads, and it converges to a solution better with a higher read count, which can sometimes be an issue with ONT data. If you have high-coverage, error-corrected ONT data you should be fine.
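
As a rough way to check read count and base quality before running stringMLST on ONT data, something like the sketch below could be used. It assumes 4-line FASTQ records with Phred+33 quality encoding; `fastq_stats` is a hypothetical helper, not a stringMLST function, and no particular threshold is implied.

```python
def fastq_stats(lines):
    """Return (read_count, mean_phred) for 4-line FASTQ records (Phred+33 assumed)."""
    lines = [ln.rstrip("\n") for ln in lines]
    n_reads = 0
    n_bases = 0
    total_q = 0
    # Each record is: header, sequence, separator, quality string.
    for i in range(0, len(lines) - 3, 4):
        qual = lines[i + 3]
        n_reads += 1
        n_bases += len(qual)
        total_q += sum(ord(c) - 33 for c in qual)
    mean_q = total_q / n_bases if n_bases else 0.0
    return n_reads, mean_q


# Tiny in-memory example; the reads are made up for illustration.
example = ["@r1", "ACGT", "+", "IIII", "@r2", "AC", "+", "5+"]
n_reads, mean_q = fastq_stats(example)
print(n_reads, round(mean_q, 2))
```

For a gzipped file, the lines can be supplied with `gzip.open(path, "rt")`. Note that an arithmetic mean of Phred scores is only a coarse indicator, since Phred is a logarithmic scale.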

  2. I also ran RAxML on variant calls from mpileup files covering all FASTQ files. Some samples have the same predicted ST, yet they do not seem to cluster together in the tree.

Variant calling provides much more fine-grained data: remember, an ST is 7 data points, while variant calls could number in the hundreds or thousands. STs are an approximation of genetic relatedness (more phenotype than genotype). Samples with the same ST may not cluster on the same branch, but they'll likely appear in the same subtree.
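
The difference in resolution can be illustrated with a small sketch. The allele profile is the one from the run earlier in this thread; the sample names, variant positions, and bases are made up purely for illustration, and `snp_distance` is a hypothetical helper.

```python
def snp_distance(calls_a, calls_b):
    """Count positions where two samples' variant calls differ.

    Each argument maps a genome position to the called base; a position
    missing from one sample is treated as matching the reference there.
    """
    positions = set(calls_a) | set(calls_b)
    return sum(calls_a.get(p, "ref") != calls_b.get(p, "ref") for p in positions)


# Two samples with the same 7-locus profile (abcZ, adk, aroE, fumC, gdh, pdhC, pgm).
profile_a = (231, 180, 306, 612, 269, 277, 260)
profile_b = (231, 180, 306, 612, 269, 277, 260)

# Hypothetical genome-wide variant calls for the same two samples.
calls_a = {1042: "A", 88213: "T", 150977: "G"}
calls_b = {1042: "A", 91230: "C"}

print(profile_a == profile_b)          # True: same ST
print(snp_distance(calls_a, calls_b))  # 3: still distinguishable by variants
```

A tree built from the variant calls uses these extra differences, so two isolates with identical STs can still end up on separate branches.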

          Accuracy · Issue #59 · jordanlab/stringMLST