Releases: blahah/transrate

v1.0.0 alpha 1

17 Oct 10:29

transrate v1.0.0 alpha 1

This is the first alpha release of transrate v1.

To install this pre-release, use the following command:

$ gem uninstall transrate
$ gem install --pre transrate --version 1.0.0.alpha1
$ transrate --install-deps

This is an alpha release, so we expect there to be bugs. Please report any problems on the issue tracker.

New features

The Transrate score

The Transrate score is an estimate of the probability that the assembly is correct. A score is produced for the whole assembly and for each contig. The scoring process uses the reads that were used to generate the assembly as evidence, so to get a Transrate score you need to run transrate in read-metrics mode (by passing in the reads with --left and --right).

The assembly score

The assembly score allows you to compare two or more assemblies made with the same reads. The score is designed so that an increased score is very likely to correspond to an assembly that is more biologically accurate.

The score is calculated as the geometric mean of all contig scores multiplied by the proportion of input reads that provide positive support for the assembly.
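
To make the arithmetic concrete, here is a minimal Python sketch of that calculation. It is not transrate's own code: the function name, its arguments, and the choice to return 0.0 for an empty assembly or a zero contig score are assumptions made for illustration.

```python
import math

def assembly_score(contig_scores, supporting_reads, total_reads):
    """Geometric mean of the contig scores, multiplied by the
    proportion of input reads that support the assembly.

    Hypothetical helper: names and edge-case handling are assumptions.
    """
    if not contig_scores or total_reads <= 0:
        return 0.0
    if any(score <= 0 for score in contig_scores):
        # A single zero score drives the geometric mean to zero.
        return 0.0
    # Average the logs to avoid underflow when multiplying many small scores.
    mean_log = sum(math.log(score) for score in contig_scores) / len(contig_scores)
    return math.exp(mean_log) * (supporting_reads / total_reads)
```

For example, two contigs scoring 0.25 and 1.0 have a geometric mean of 0.5. If half of the input reads support the assembly, the assembly score is 0.5 x 0.5 = 0.25.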

Thus, the score captures how confident you can be in what was assembled, as well as how complete the assembly is.

The contig score

Contig scores can be used to filter out bad contigs from an assembly, leaving you with only the well-assembled ones. Examining the distribution of contig scores can also give more detailed insight into the differences between assemblies.
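
As a rough illustration of filtering, the sketch below keeps only the contigs at or above a cutoff. The function name, the dictionary input, and the 0.5 default are all hypothetical. These notes do not recommend a particular threshold, so in practice you would pick one by looking at the score distribution.

```python
def filter_contigs(scores, threshold=0.5):
    """Return the names of contigs scoring at least `threshold`.

    `scores` maps contig name to contig score. The default cutoff is an
    arbitrary placeholder, not a recommended value.
    """
    return [name for name, score in scores.items() if score >= threshold]
```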

Each contig is assigned a score by measuring how well it is supported by read evidence. The contig score can be thought of as an estimate of the probability that the contig is an accurate, non-redundant representation of a transcript that was present in the sequenced sample.

There are five components to the contig score:

  1. The probability that each base has been called correctly. This is estimated using the mean per-base edit distance, i.e. how many changes would have to be made to a read covering a base before the sequence of the read and the covered region of the contig agreed perfectly.
  2. The probability that each base is truly part of the transcript. This is estimated by determining whether any reads provide agreeing coverage for a base.
  3. The probability that each base is not contained in another contig. This is estimated by considering the root-mean-squared MAPQ score of the reads covering each base.
  4. The probability that the contig is derived from a single transcript (rather than pieces of two or more transcripts). This is estimated by assuming that fragments from different transcripts are likely to be generated at different rates, and that this difference is detectable as a difference in coverage distribution. The probability is then calculated using a Bayesian sequence segmentation algorithm which models the coverage distribution as a Dirichlet distribution over a reduced set of finite coverage states.
  5. The probability that the contig is structurally complete and correct. This is estimated as the proportion of mapped read pairs that agree with the structure and composition of the contig, which in turn is calculated by classifying the read pair alignments.

The contig score is the product of these five components.
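
A minimal Python sketch of how the components combine, assuming each one has already been estimated as a probability between 0 and 1. The function names are hypothetical. The `rms` helper only shows the root-mean-square used in component 3; how transrate turns an RMS MAPQ value into a probability is not described here, so the sketch leaves that step out.

```python
import math

def rms(values):
    """Root-mean-square of a non-empty sequence of numbers,
    for example the MAPQ scores of the reads covering one base."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def contig_score(components):
    """Multiply the component probabilities for one contig.

    Hypothetical helper: expects the component estimates as floats
    in [0, 1] and raises ValueError otherwise.
    """
    score = 1.0
    for p in components:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each component must be a probability in [0, 1]")
        score *= p
    return score
```

Because the score is a product, one weak component is enough to pull the whole contig score down, which is why inspecting the components separately can be informative.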

The score components are useful independently of the contig score, as they can identify contigs that can be treated in different ways to improve the quality of an assembly.

Faster processing

We identified all the major bottlenecks in our code and rewrote large parts of the codebase in C++ to provide an ~20x speedup.

Faster alignment

We have moved to using the SNAP aligner for an ~20x speedup in read alignment.

Probabilistic assignment of multi-mapping reads

We have moved to using eXpress to select the most likely assignment for each multi-mapping read. This has led to a considerable increase in the usefulness of read-mapping metrics.

v0.3.1

25 Jul 14:53

Improvements:

  • add citation to README

Bug fixes:

  • fix bug where contig stats were overwritten by each new assembly if multiple assemblies were analysed

v0.3.0

25 Jul 14:47

Features:

  • metrics on each contig are now calculated and output to a file (by default, transrate_contigs.csv)
  • output files can have a custom prefix with the --outfile argument

Improvements:

  • all inline C methods replaced with a C extension (cleaner code, only compiles once on install)
  • linguistic complexity now calculated in C (400x speedup)
  • support new CRB-BLAST feature that splits BLAST query into one chunk per core (faster than threaded BLAST in most cases)
  • check dependencies at every run and give instructions for installing them if missing

Bug fixes:

  • remove redundant express dependency
  • fix parsing of samtools mpileup output so contig names are now matched up
  • fix bug where null characters in contigs weren't handled (wtf are null characters doing in contigs, Trinity?)
  • handle all bases that aren't ACTG by considering them to be Ns
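
The last fix can be pictured with a short Python sketch. This is an illustration rather than transrate's implementation: the function name is hypothetical, and upper-casing the sequence before substitution is an assumption. Under this sketch a null character would also become an N, which may differ from how transrate actually handles it.

```python
import re

# Matches any character that is not one of the four unambiguous bases.
_NON_ACGT = re.compile(r"[^ACGT]")

def normalise_bases(seq):
    """Upper-case a sequence and replace every non-ACGT character with N."""
    return _NON_ACGT.sub("N", seq.upper())
```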

Transrate v0.2.0

08 Jul 14:01

Transrate v0.2.0 brings a broad overhaul of most features, some new features and many bug fixes (since we are pre-v1.0, bug fixes are not listed in this release).

New features:

  • All dependencies can be installed by transrate itself using the --install-deps option. This works cross-platform.
  • Multiple assemblies can be compared using the --assembly option by providing a comma-separated list of files with no spaces.
  • Base-level coverage analysis
  • Full documentation at http://hibberdlab.com/transrate