
Comparing changes

base repository: soedinglab/transannot
base: 2-c6f1836
head repository: soedinglab/transannot
compare: main

Commits on Jun 9, 2024

  1. Update README.md

    mariia-zelenskaia committed Jun 9, 2024
    382ff09

Commits on Jun 10, 2024

  1. minor revision

    mariia-zelenskaia committed Jun 10, 2024
    cd4986b

Commits on Jun 13, 2024

  1. 1093cf4
  2. .

    mariia-zelenskaia committed Jun 13, 2024
    b7cb672
  3. update mmseqs

    mariia-zelenskaia committed Jun 13, 2024
    e14ed6b
  4. e07c97b
  5. a388d91
  6. .

    mariia-zelenskaia committed Jun 13, 2024
    e0b345e
  7. Update README.md

    vragh authored Jun 13, 2024
    93da1cf
  8. Added wiki link to readme

    vragh authored Jun 13, 2024
    9fa0328

Commits on Jun 14, 2024

  1. .

    mariia-zelenskaia committed Jun 14, 2024
    6ae2161
  2. .

    mariia-zelenskaia committed Jun 14, 2024
    9b8f8b2

Commits on Jun 19, 2024

  1. 8f30dd0
  2. 4ef5e6e
  3. .

    mariia-zelenskaia committed Jun 19, 2024
    093cca2
  4. .

    mariia-zelenskaia committed Jun 19, 2024
    f6e1905
  5. .

    mariia-zelenskaia committed Jun 19, 2024
    c0b45b8
  6. 0551243
  7. .

    mariia-zelenskaia committed Jun 19, 2024
    6c899c6
  8. .

    mariia-zelenskaia committed Jun 19, 2024
    de4046a
  9. e14bf5c
  10. .

    mariia-zelenskaia authored Jun 19, 2024
    1e4e0e1

Commits on Jul 2, 2024

  1. b0f48ce
  2. .

    mariia-zelenskaia committed Jul 2, 2024
    0af0a1a
  3. .

    mariia-zelenskaia committed Jul 2, 2024
    0d5b3f8
  4. .

    mariia-zelenskaia committed Jul 2, 2024
    306f5fe
  5. .

    mariia-zelenskaia committed Jul 2, 2024
    26c40dc

Commits on Jul 3, 2024

  1. .

    mariia-zelenskaia committed Jul 3, 2024
    35b4f61
  2. .

    mariia-zelenskaia committed Jul 3, 2024
    a73a0c9
  3. .

    mariia-zelenskaia committed Jul 3, 2024
    c927ae5
  4. .

    mariia-zelenskaia committed Jul 3, 2024
    b44926d
  5. c54d8b4

Commits on Jul 5, 2024

  1. debug

    mariia-zelenskaia committed Jul 5, 2024
    29bdf1a
  2. debug

    mariia-zelenskaia committed Jul 5, 2024
    6a8b9fe
  3. debug [2]

    mariia-zelenskaia committed Jul 5, 2024
    ee4e6ff
  4. 6396c7e

Commits on Jul 8, 2024

  1. update mmseqs

    mariia-zelenskaia committed Jul 8, 2024
    5e651f5
  2. 69cb5bf
  3. .

    mariia-zelenskaia committed Jul 8, 2024
    e84fa15

Commits on Jul 10, 2024

  1. b3bac40
  2. .

    mariia-zelenskaia committed Jul 10, 2024
    cba09d8
  3. .

    mariia-zelenskaia committed Jul 10, 2024
    ad3f184

Commits on Jul 11, 2024

  1. no download

    wget download might not be possible on some machines due to the fire wall configuration
    mariia-zelenskaia committed Jul 11, 2024
    2e26c42

Commits on Jul 12, 2024

  1. .

    mariia-zelenskaia committed Jul 12, 2024
    3c4a1f6
  2. .

    mariia-zelenskaia committed Jul 12, 2024
    d610129
  3. Update annotate.sh

    mariia-zelenskaia committed Jul 12, 2024
    cd79f44

Commits on Jul 13, 2024

  1. build for macos 12

    mariia-zelenskaia committed Jul 13, 2024
    4d472c0
  2. .

    mariia-zelenskaia committed Jul 13, 2024
    2b4e572
  3. no download

    mariia-zelenskaia committed Jul 13, 2024
    98e3d07

Commits on Jul 15, 2024

  1. Update README.md

    vragh authored Jul 15, 2024
    95e3361
Showing with 149,948 additions and 748 deletions.
  1. +97 −24 README.md
  2. +7 −5 azure-pipelines.yml
  3. +91 −33 data/annotate.sh
  4. +10 −6 data/annotatecustom.sh
  5. +45,803 −0 examples/longreads_result/longreadsDNA_annotation_table
  6. +31,229 −0 examples/longreads_result/longreadsRNA_annotatemodule_table
  7. +70,940 −0 examples/longreads_result/longreadsRNA_easytransannotmodule_table
  8. +33 −27 examples/resDB_regression
  9. +5 −5 lib/mmseqs/.cirrus.yml
  10. +26 −5 lib/mmseqs/azure-pipelines.yml
  11. +9 −5 lib/mmseqs/cmake/MMseqsSetupDerivedTarget.cmake
  12. +41 −5 lib/mmseqs/data/workflow/blastp.sh
  13. +42 −4 lib/mmseqs/data/workflow/blastpgp.sh
  14. +13 −4 lib/mmseqs/data/workflow/createtaxdb.sh
  15. +20 −11 lib/mmseqs/data/workflow/databases.sh
  16. +5 −1 lib/mmseqs/data/workflow/taxpercontig.sh
  17. +3 −3 lib/mmseqs/data/workflow/tsv2exprofiledb.sh
  18. +1 −1 lib/mmseqs/lib/alp/sls_pvalues.hpp
  19. +15 −11 lib/mmseqs/lib/ksw2/kseq.h
  20. +3 −2 lib/mmseqs/src/CMakeLists.txt
  21. +5 −0 lib/mmseqs/src/CommandDeclarations.h
  22. +46 −13 lib/mmseqs/src/MMseqsBase.cpp
  23. +7 −2 lib/mmseqs/src/alignment/BandedNucleotideAligner.cpp
  24. +1 −1 lib/mmseqs/src/alignment/Matcher.cpp
  25. +6 −7 lib/mmseqs/src/alignment/MultipleAlignment.cpp
  26. +7 −6 lib/mmseqs/src/alignment/PSSMCalculator.cpp
  27. +2 −2 lib/mmseqs/src/alignment/PSSMCalculator.h
  28. +14 −6 lib/mmseqs/src/alignment/rescorediagonal.cpp
  29. +44 −38 lib/mmseqs/src/clustering/ClusteringAlgorithms.cpp
  30. +88 −91 lib/mmseqs/src/commons/Application.cpp
  31. +3 −3 lib/mmseqs/src/commons/CSProfile.cpp
  32. +5 −0 lib/mmseqs/src/commons/Command.cpp
  33. +2 −0 lib/mmseqs/src/commons/Command.h
  34. +1 −0 lib/mmseqs/src/commons/Debug.h
  35. +23 −3 lib/mmseqs/src/commons/IndexReader.h
  36. +5 −10 lib/mmseqs/src/commons/KSeqWrapper.cpp
  37. +67 −7 lib/mmseqs/src/commons/Parameters.cpp
  38. +51 −4 lib/mmseqs/src/commons/Parameters.h
  39. +2 −2 lib/mmseqs/src/commons/ProfileStates.cpp
  40. +2 −0 lib/mmseqs/src/commons/Sequence.cpp
  41. +12 −11 lib/mmseqs/src/commons/SubstitutionMatrix.cpp
  42. +1 −1 lib/mmseqs/src/commons/SubstitutionMatrix.h
  43. +1 −0 lib/mmseqs/src/commons/Util.h
  44. +0 −2 lib/mmseqs/src/linclust/LinsearchIndexReader.cpp
  45. +10 −1 lib/mmseqs/src/mmseqs.cpp
  46. +2 −2 lib/mmseqs/src/multihit/combinepvalperset.cpp
  47. +47 −8 lib/mmseqs/src/prefiltering/CacheFriendlyOperations.cpp
  48. +3 −1 lib/mmseqs/src/prefiltering/CacheFriendlyOperations.h
  49. +25 −10 lib/mmseqs/src/prefiltering/IndexBuilder.cpp
  50. +5 −2 lib/mmseqs/src/prefiltering/IndexBuilder.h
  51. +10 −4 lib/mmseqs/src/prefiltering/IndexTable.h
  52. +1 −0 lib/mmseqs/src/prefiltering/Indexer.h
  53. +19 −12 lib/mmseqs/src/prefiltering/Prefiltering.cpp
  54. +1 −0 lib/mmseqs/src/prefiltering/Prefiltering.h
  55. +34 −24 lib/mmseqs/src/prefiltering/PrefilteringIndexReader.cpp
  56. +1 −1 lib/mmseqs/src/prefiltering/PrefilteringIndexReader.h
  57. +24 −20 lib/mmseqs/src/prefiltering/QueryMatcher.cpp
  58. +1 −2 lib/mmseqs/src/prefiltering/QueryMatcher.h
  59. +21 −6 lib/mmseqs/src/prefiltering/QueryMatcherTaxonomyHook.h
  60. +10 −12 lib/mmseqs/src/prefiltering/UngappedAlignment.cpp
  61. +4 −4 lib/mmseqs/src/prefiltering/UngappedAlignment.h
  62. +131 −83 lib/mmseqs/src/prefiltering/ungappedprefilter.cpp
  63. +0 −1 lib/mmseqs/src/test/CMakeLists.txt
  64. +1 −0 lib/mmseqs/src/test/TestAlignment.cpp
  65. +2 −2 lib/mmseqs/src/test/TestAlignmentPerformance.cpp
  66. +5 −4 lib/mmseqs/src/test/TestAlignmentTraceback.cpp
  67. +1 −0 lib/mmseqs/src/test/TestAlp.cpp
  68. +0 −1 lib/mmseqs/src/test/TestBacktraceTranslator.cpp
  69. +0 −1 lib/mmseqs/src/test/TestBestAlphabet.cpp
  70. +1 −0 lib/mmseqs/src/test/TestCompositionBias.cpp
  71. +24 −14 lib/mmseqs/src/test/TestDiagonalScoring.cpp
  72. +8 −7 lib/mmseqs/src/test/TestDiagonalScoringPerformance.cpp
  73. +0 −37 lib/mmseqs/src/test/TestIndexTable.cpp
  74. +1 −0 lib/mmseqs/src/test/TestKmerGenerator.cpp
  75. +1 −0 lib/mmseqs/src/test/TestKmerNucl.cpp
  76. +1 −0 lib/mmseqs/src/test/TestKmerScore.cpp
  77. +2 −2 lib/mmseqs/src/test/TestKsw2.cpp
  78. +0 −1 lib/mmseqs/src/test/TestKwayMerge.cpp
  79. +2 −1 lib/mmseqs/src/test/TestMultipleAlignment.cpp
  80. +2 −1 lib/mmseqs/src/test/TestPSSM.cpp
  81. +2 −1 lib/mmseqs/src/test/TestPSSMPrune.cpp
  82. +2 −1 lib/mmseqs/src/test/TestProfileAlignment.cpp
  83. +1 −0 lib/mmseqs/src/test/TestReduceMatrix.cpp
  84. +1 −0 lib/mmseqs/src/test/TestScoreMatrixSerialization.cpp
  85. +1 −0 lib/mmseqs/src/test/TestSequenceIndex.cpp
  86. +1 −0 lib/mmseqs/src/test/TestTanTan.cpp
  87. +5 −0 lib/mmseqs/src/util/CMakeLists.txt
  88. +73 −7 lib/mmseqs/src/util/convertalignments.cpp
  89. +148 −0 lib/mmseqs/src/util/createclusterdb.cpp
  90. +1 −1 lib/mmseqs/src/util/createdb.cpp
  91. +2 −2 lib/mmseqs/src/util/expandaln.cpp
  92. +2 −2 lib/mmseqs/src/util/extractframes.cpp
  93. +1 −1 lib/mmseqs/src/util/gff2db.cpp
  94. +41 −22 lib/mmseqs/src/util/indexdb.cpp
  95. +54 −0 lib/mmseqs/src/util/makepaddedseqdb.cpp
  96. +22 −6 lib/mmseqs/src/util/mergeresultsbyset.cpp
  97. +1 −1 lib/mmseqs/src/util/msa2profile.cpp
  98. +1 −1 lib/mmseqs/src/util/msa2result.cpp
  99. +45 −10 lib/mmseqs/src/util/pairaln.cpp
  100. +148 −0 lib/mmseqs/src/util/recoverlongestorf.cpp
  101. +73 −42 lib/mmseqs/src/util/result2msa.cpp
  102. +1 −1 lib/mmseqs/src/util/result2profile.cpp
  103. +0 −2 lib/mmseqs/src/util/result2stats.cpp
  104. +21 −0 lib/mmseqs/src/util/setextendeddbtype.cpp
  105. +1 −0 lib/mmseqs/src/util/summarizeresult.cpp
  106. +58 −4 lib/mmseqs/src/util/unpackdb.cpp
  107. +1 −1 lib/mmseqs/src/workflow/Cluster.cpp
  108. +39 −6 lib/mmseqs/src/workflow/Search.cpp
  109. +1 −0 lib/mmseqs/src/workflow/Taxonomy.cpp
  110. +3 −1 src/commons/LocalParameters.h
  111. +9 −11 src/transannot.cpp
  112. +1 −0 src/workflow/annotate.cpp
121 changes: 97 additions & 24 deletions README.md
@@ -5,6 +5,7 @@ Optionally, TransAnnot can use [Plass](https://github.com/soedinglab/plass) for

TransAnnot is a free and open-source (GPLv3), modular toolkit developed in C++.

Now published in Bioinformatics Advances: [https://doi.org/10.1093/bioadv/vbae152](https://doi.org/10.1093/bioadv/vbae152)

<p align="center"><img src="https://github.com/soedinglab/transannot/blob/main/.github/TransAnnot_logo.png" height="400" /></p>

@@ -29,40 +30,84 @@ Other dependencies for the compilation from the source are `zlib` and `bzip`.

## Workflow dependencies

- Plass - should be installed separately in the current working directory, see [corresponding repository](https://github.com/soedinglab/plass), to perform *de novo* assembly.
- Databases - Pfam, EggNOG and UniProtKB/Swiss-Prot
<!-- Since genome assembly is a dynamic field and corresponding software are being constantly updated, we prefer not to integrate external genome assemblers (e.g. Trinity) into the TransAnnot package. One can install them separately on demand. -->
- Plass - should be installed separately, see [corresponding repository](https://github.com/soedinglab/plass). To perform *de novo* assembly, it is required to install Plass to the current working directory. Standard usage is running on the results of a nucleotide assembler such as `Trinity`. PLASS requires read lengths of at least 100 nt, so for shorter reads, a nucleotide assembler has to be used.

## Quick start
- Genome assembly is a dynamic field, and the software is continuously updated. For that reason, no external assemblers (e.g. Trinity) are included in the TransAnnot release package. They can be installed separately on demand. Some tools that might be useful are:

* [transdecoder](https://github.com/TransDecoder/TransDecoder) - identifies coding regions within the transcript
* [mrna-spades]()
* [Trinity]()


- Other dependencies are the UniProtKB/Swiss-Prot, eggNOG and Pfam databases, provided in the MMseqs2 format.

## Before starting

### tmp folder

`tmp` folder keeps temporary files. By default, all the intermediate output files from different modules will be kept in this folder. To clear `tmp` pass `--remove-tmp-files` parameter.

## Quick-ish start
For the fastest results, please consider assembling the data and translating it into amino acid sequences beforehand.

#### STEP 1: Download default databases
(THIS IS A ONE-TIME PROCESS THAT WILL ONLY HAVE TO BE EXECUTED THE FIRST TIME AFTER `TRANSANNOT` HAS BEEN DOWNLOADED.)

Download the default databases using `transannot downloaddb`:

transannot downloaddb eggNOG <path_to_output>/<eggNOGDB_name> <eggNOGDB_tmpdir_name> [options]
transannot downloaddb Pfam <path_to_output>/<PfamDB_name> <PfamDB_tmpdir_name> [options]
transannot downloaddb SwissProt <path_to_output>/<SwissProtDB_name> <SwissProtDB_tmpdir_name> [options]

(The downloads can take a long time depending on the load on the download servers, so it is advisable to execute these commands inside a terminal multiplexer such as `tmux` or `screen`.)
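
As an illustration of the three `downloaddb` calls above, here is a minimal Python sketch that only assembles the command lines (a dry run). The selection names are copied from the commands above; the output and tmp naming scheme (`<name>DB`, `<name>_tmp`) and the helper name `build_downloaddb_commands` are assumptions made for this example, not part of TransAnnot.

```python
import shlex

# Database selections as written in the quick-start commands above.
DEFAULT_SELECTIONS = ("eggNOG", "Pfam", "SwissProt")


def build_downloaddb_commands(out_dir, tmp_root, selections=DEFAULT_SELECTIONS):
    """Return one `transannot downloaddb` argument list per database."""
    commands = []
    for name in selections:
        out_db = f"{out_dir}/{name}DB"      # hypothetical output DB name
        tmp_dir = f"{tmp_root}/{name}_tmp"  # hypothetical tmp directory name
        commands.append(["transannot", "downloaddb", name, out_db, tmp_dir])
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in build_downloaddb_commands("databases", "tmp"):
        print(shlex.join(cmd))
```

To actually run the downloads, each argument list could be passed to `subprocess.run(cmd, check=True)`, ideally from inside a `tmux` or `screen` session as suggested above.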

#### STEP 2: Annotating the data
This can be done in one of the following three ways (we strongly recommend option C).

##### Option A: Starting with sequencing reads
The quickest way to run TransAnnot is by using the `easytransannot` module:

transannot easytransannot <inputReads.fastq> Pfam-A.full eggNOG UniProtKB/Swiss-Prot <resDB> <tmp> [options]

If a target database has already been downloaded in MMseqs2 format, provide its path directly; otherwise, simply specify the database names and they will be downloaded automatically. `easytransannot` uses the Plass assembler; for more details, see the description of the `assemblereads` module below.

## Input
##### Option B: Starting with assembled nucleotide sequences (NOT RECOMMENDED)
Should a nucleotide assembly already be available (e.g., in `<input.fasta>`), it can be annotated as follows:

transannot createquerydb <input.fasta> <input_queryDB_name> <tmp> [options]
transannot annotate <input_queryDB_name> Pfam-A.full eggNOG UniProtKB/Swiss-Prot <resDB> <tmp> [options]

(`<input_queryDB_name>` is just a string giving the name, optionally preceded by the path, of the MMseqs2-formatted database for the input sequences.)

Possible inputs can be one of the following:
We recommend against starting with nucleotide sequences as of the current release because the translated search that `TransAnnot` relies upon is quite slow.

* translated sequences of assembled transcriptomes (obtained e.g. using Trinity followed by TransDecoder)
* raw transcriptome reads in fastq format, which will be *de novo* assembled by `plass` at the protein level
##### Option C: Starting with assembled, _in silico_ translated amino acid sequences (RECOMMENDED)
It is preferable to translate the assembly with a tool such as [TransDecoder](https://github.com/TransDecoder/TransDecoder) prior to annotation with `TransAnnot`, as the searches are very fast in this case. The workflow is identical to the one described above; simply provide the input `FASTA` file containing the translated amino acid sequences to `transannot createquerydb` and then supply the created query DB as input to `transannot annotate`.

TransAnnot accepts input files from single-organism transcriptomes as well as metatranscriptomes.
<!-- In such case, it is possible to check for the contamination with the `contamination` module, which is based on MMseqs2 taxonomy workflow -->
## Inputs

## Running
Possible inputs are assembled on the protein level:

* assembled transcriptomes (obtained e.g. using Trinity) or raw transcriptome reads, which will be *de novo* assembled on the protein level using `plass`
* metatranscriptomes
* single-organism transcriptomes
* TransAnnot can also work with long-read input. To enable Plass assembly, the input should be provided as a single concatenated file, as for single-end reads
<!-- in such case it is possible to check for the contamination with `contamination` module, which is based on MMseqs2 taxonomy workflow -->

## Execution

### Modules

* `assemblereads` *de novo* assembles raw sequencing reads to large genomic fragments (contigs).
* `createquerydb` creates a database in a memory-efficient MMSeqs2 format for the query input sequence.
* `downloaddb` downloads reference databases in MMSeqs2 format on which annotations for query sequences will be searched.
* `annotate` performs clustering on input sequences to reduce redundancy and runs sequence-profile and sequence-sequence searches for the reference query sequences to obtain the closest homologs with the annotated function. In addition, it maps descriptions of orthologous groups and protein families to the query sequences.
* `assemblereads` - *de novo* assembles raw sequencing reads to large genomic fragments (contigs).
* `createquerydb` - creates a database in a memory-efficient MMSeqs2 format for the query input sequence.
* `downloaddb` - downloads reference databases in MMSeqs2 format on which annotations for query sequences will be searched.
* `annotate` - performs clustering on input sequences to reduce redundancy and runs sequence-profile and sequence-sequence searches for the reference query sequences to obtain the closest homologs with the annotated function. In addition, it maps descriptions of orthologous groups and protein families to the query sequences.
* `easytransannot` - an easy one-line command module for the complete transannot workflow, starting from input assembly, downloading reference databases to an output of sequence annotations.
* `annotatecustom` - facilitates annotation against user-supplied databases instead of the default databases used by `TransAnnot`.
<!-- After running the search UniProt IDs will be retrieved to get more detailed information about the provided transcriptome. -->
<!-- (It finds homologs for assembled contigs in the custom-defined protein sequence database (default UniProtKB) using reciprocal-best hits (rbh module) search from MMseqs2 suite if taxonomy ID `--taxid` is provided, or MMseqs2 search if no taxonomy ID is supplied. After running the search Gene Ontology ID will be obtained from UniProt.) -->
<!-- * `contamination` It checks contaminated contigs using _easy-taxonomy_ module from MMseqs2 suite. This approach uses taxonomy assignments of every contig to identify contamination -->
* `easytransannot` an easy one-line command module for the complete transannot workflow, starting from input assembly, downloading reference databases to an output of sequence annotations.


### assemblereads

@@ -95,23 +140,43 @@ and execute the below command to download the databases (Ensure the same keyword

transannot downloaddb <selection> <outDB> <tmp> [options]

By default, `transannot` runs 3 searches in the subsequent `annotate` module against the following databases: (i) `Pfam-A.full` (profile database), (ii) `eggNOG` (profile database) and (iii) `UniProtKB/SwissProt` (sequence database). Hence, use the above command separately for each database to download them, for more information check [MMseqs2 user guide](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases).
By default, `transannot` runs 3 searches in the subsequent `annotate` module against the following databases: (i) `Pfam-A.full` (profile database), (ii) `eggNOG` (profile database) and (iii) `UniProtKB/SwissProt` (sequence database). Hence, use the above command separately for each database to download them; for more information, check the [MMseqs2 user guide](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). These databases are used for the default annotation workflow to ensure a comprehensive set of annotations that includes hand-reviewed homologs (`SwissProt`), fine-grained orthologs (`eggNOG`), and domains (`Pfam-A`). On demand, one can use [`annotatecustom`](##Use-of-custom-database-for-annotation) to perform annotation against a user-defined database.

`downloaddb` allows resuming the download of the DB if it's detected in the provided directory path.

### annotate

This module extracts representative sequences from the query database using clustering (redundancy-free set) and uses them as search input for 3 transcriptome annotation searches (one sequence-sequence and two sequence-profile).
This module extracts representative sequences from the query database using clustering (redundancy-free set) and uses them as search input for 3 transcriptome annotation searches (one sequence-sequence and two sequence-profile). To ensure deep coverage of the transcriptome, TransAnnot only retains non-overlapping hits for each query.
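
To make the idea of retaining only non-overlapping hits concrete, the following Python sketch shows one common way to do it: a greedy selection that keeps the lowest-E-value hits first. This is an illustrative assumption, not TransAnnot's actual implementation; the `Hit` type, the function name and the tie-breaking are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    target_id: str
    qstart: int    # start position on the query (inclusive)
    qend: int      # end position on the query (inclusive)
    evalue: float


def keep_non_overlapping(hits: List[Hit]) -> List[Hit]:
    """Greedily keep the best hits whose query ranges do not overlap."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):  # lowest E-value first
        # A hit is accepted only if it lies entirely before or after
        # every hit that has already been kept.
        if all(hit.qend < k.qstart or hit.qstart > k.qend for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.qstart)  # report in query order
```

With this scheme, a weaker hit that overlaps a stronger one on the same query is dropped, while hits covering different regions of the query are all retained.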

To run the annotate module, execute the following command:

transannot annotate <assembledQueryDB> <path to Pfam profileTargetDB> <path to eggNOG profileTargetDB> <path to SwissProt sequenceTargetDB> <o:resTsvFile> <tmp> [options]

#### Output
### annotatecustom

Annotates against user-specified databases. Run as follows:

transannot annotatecustom <assembledQueryDB> <user-defined DB>

Output is a tab-separated `.tsv` file containing the following columns:
## Output

queryID targetID description E-value sequenceIdentity bitScore typeOfSearch nameOfDatabase
`TransAnnot`'s output is, in general, a tab-separated `.tsv` file containing the following columns:

#### Important options of the annotate module
queryID targetID qstart qend description E-value sequenceIdentity bitScore typeOfSearch nameOfDatabase

Where,
* `queryID` is the sequence identifier for the annotated transcript.
* `targetID` is the identifier for the sequence/profile from which the annotation is sourced.
* `qstart` is the start index of the annotated query.
* `qend` is the end index of the annotated query.
* `description` is the string (e.g., `FASTA` header from the matched `Swiss-Prot` sequence) supplying the human-readable annotation (e.g., "XYZ protein").
* `E-value` is the expectation value for this particular match (combination of query and target sequences) which indicates, loosely, the confidence one can have that the match is real (i.e., due to shared evolutionary history) and not due to chance (the lower the `E-value`, the better).
* `sequenceIdentity` is the number of positions in both sequences that are identical to one another, represented as a fraction of the total (sum) of the sequence length(s). This is a percentage value.
* `bitScore` is the normalized size (in bits) of the hypothetical database one would have to search against to obtain the match at hand purely by chance (the higher the `bitScore`, the better).
* `typeOfSearch` is a field that indicates whether the database searched is a sequence or a profile database.
* `nameOfDatabase` is a field indicating the name of the database searched (e.g., "Pfam").
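
The column layout above can be read with a few lines of Python. This is a minimal sketch that assumes the file has exactly the ten columns listed, in that order; whether a header row is present is not stated here, so non-numeric E-values are simply skipped. The function name `read_annotations` and the default E-value cutoff are hypothetical.

```python
import csv

# Column names in the order listed above.
COLUMNS = [
    "queryID", "targetID", "qstart", "qend", "description",
    "E-value", "sequenceIdentity", "bitScore", "typeOfSearch", "nameOfDatabase",
]


def read_annotations(handle, max_evalue=1e-5):
    """Return rows (as dicts) whose E-value is at most max_evalue."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        try:
            evalue = float(row["E-value"])
        except ValueError:
            continue  # e.g. a header line, if one is present
        if evalue <= max_evalue:
            rows.append(row)
    return rows
```

Typical use would be `with open("<resTsvFile>") as fh: hits = read_annotations(fh)`, after which the rows can be grouped by `queryID` or `nameOfDatabase` as needed.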

## Important options of the annotate module

The `--simple-output` parameter allows users to obtain simplified output for each query sequence, which only includes query and target IDs, a header of the target database and E-value. The standard output contains sequence identity and bit score in addition to the details provided by `--simple-output`. Usage:

@@ -123,13 +188,21 @@ When no tag is used, standard output will be provided.

`--no-run-clust` performs annotation without clustering. All the input query sequences will undergo annotation searches.

#### Use of custom database for annotation
## Use of custom database for annotation
If one is interested in annotation against a user-defined database, `annotatecustom` module provides such an opportunity. To run the custom annotate module execute the following command:

transannot annotatecustom <assembledQueryDB> <user-defined DB>

The user-provided database will be converted to the MMseqs2 format within the module, but it is also possible to initially provide a MMseqs2-formatted database. A limitation is that unless ID descriptors are included in the database, no mapping can be performed and no group descriptors will be retrieved.

#### tmp folder
## tmp folder

`tmp` folder keeps temporary files. By default, all the intermediate output files from different modules will be kept in this folder. To clear `tmp` pass `--remove-tmp-files` parameter [bool], applicable for all modules except `createquerydb` and `downloaddb`.

## TransAnnot wiki

The `TransAnnot` wiki can be found here: [https://github.com/soedinglab/transannot/wiki](https://github.com/soedinglab/transannot/wiki).

## Citing TransAnnot

Please cite our publication ([https://doi.org/10.1093/bioadv/vbae152](https://doi.org/10.1093/bioadv/vbae152)) in Bioinformatics Advances.
12 changes: 7 additions & 5 deletions azure-pipelines.yml
@@ -130,10 +130,10 @@ jobs:
targetPath: $(Build.SourcesDirectory)/build/src/transannot
artifactName: transannot-linux-$(SIMD)

- job: build_macos_11
displayName: macOS 11
- job: build_macos_12
displayName: macOS 12
pool:
vmImage: 'macos-11'
vmImage: 'macos-12'
steps:
- checkout: self
submodules: true
@@ -158,15 +158,17 @@ jobs:
pool:
vmImage: 'ubuntu-latest'
dependsOn:
- build_macos_11
- build_macos_12
- build_ubuntu_2004
- build_ubuntu_cross_2004
steps:
- script: |
cd "${BUILD_SOURCESDIRECTORY}"
mkdir transannot
cp -f README.md LICENCE.md transannot
cp -f README.md LICENCE.md data/Pfam-A.clans.tsv transannot
mkdir transannot/bin
mv -f data/Pfam-A.clans.tsv transannot/bin
rm -f transannot/Pfam-A.clans.tsv
- task: DownloadPipelineArtifact@1
inputs:
artifactName: transannot-darwin-universal