Pieter Verschaffelt edited this page Mar 31, 2025 · 5 revisions

Welcome to the Unipept Index wiki page. This wiki contains guides on how to construct the suffix array used by the Unipept API, and some background information that can aid you in understanding how the Unipept Index is implemented.

Guides

Background information

Commands

sa-builder

Construction of the suffix array is started with the sa-builder command. Once the project has been built in release mode, it can be run from the root of the unipept-index repository as ./target/release/sa-builder.

Usage

sa-builder [OPTIONS] --database-file <DATABASE_FILE> --output <OUTPUT>

Required config values

  • --database-file / -d: An uncompressed TSV file that contains all proteins from the UniProtKB database that should be included in the suffix array. This file must have 4 columns that contain, in this order: 1) UniProt Accession ID, 2) NCBI Taxon ID, 3) protein sequence, 4) list of functional annotations (separated by semicolons).
  • --output / -o: The binary file to which the generated suffix array is written. This file can grow quite large (a few hundred gigabytes), so make sure to store it somewhere with enough free space.
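
Before starting a long construction run, it can be worth checking that the database file really has the 4-column layout described above. The following is a minimal Python sketch of such a check, not part of the Unipept Index itself: the names ProteinRecord and parse_protein_line are hypothetical, and the example line uses made-up values purely for illustration.

```python
from typing import NamedTuple


class ProteinRecord(NamedTuple):
    # Field names are illustrative; the TSV file itself has no header names.
    accession: str          # column 1: UniProt Accession ID
    taxon_id: int           # column 2: NCBI Taxon ID
    sequence: str           # column 3: protein sequence
    annotations: list[str]  # column 4: semicolon-separated annotations


def parse_protein_line(line: str) -> ProteinRecord:
    """Split one line of the database file into its 4 tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 tab-separated columns, got {len(fields)}")
    accession, taxon_id, sequence, annotations = fields
    # An empty fourth column yields an empty annotation list.
    return ProteinRecord(
        accession,
        int(taxon_id),
        sequence,
        [a for a in annotations.split(";") if a],
    )


# Made-up example line, only meant to show the column layout.
example = "ACC0001\t1234\tMKTAYIAK\tGO:0000001;EC:1.1.1.1\n"
record = parse_protein_line(example)
print(record)
```

Running parse_protein_line over every line of the file (for example in a loop over open(path)) raises a ValueError on the first line that does not have exactly 4 columns, or whose taxon ID is not an integer.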

Optional config values

  • --sparseness-factor / -s: Sparseness factor. The default value is 1, which means every value in the SA is kept. Internally, the library call is performed with a sparseness of at most 5 to limit memory usage. If a higher sparseness is requested, the largest divisor of the requested factor that is smaller than or equal to 5 is used for the library call, and the resulting SA is then filtered to reach the requested sparseness.
  • --construction-algorithm / -a: The algorithm used to construct the suffix array (default value lib-sais). Supported values are:
    • lib-sais: Constructs the suffix array directly with the desired sparseness in mind. When building a sparse suffix array, this option uses drastically less memory and time than lib-div-suf-sort, because the libsais-packed algorithm is selected.
    • lib-div-suf-sort: A fallback in case lib-sais does not work for you. This is a different suffix array construction algorithm. It does not support direct sparse suffix array construction and is slower in most cases.
  • --compress-sa / -c: Whether to apply bit packing to the values in the suffix array. Default value is true. This lowers the memory required by the suffix array when it is in use.
  • --help / -h: Print help information.
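
The two-step behaviour of --sparseness-factor can be illustrated with a small Python sketch. It rests on two assumptions that are not spelled out above: that "largest divisor" refers to a divisor of the requested factor, and that a sparse suffix array with factor k keeps only the suffixes starting at positions that are multiples of k. The function names are hypothetical, and the brute-force sort is only a stand-in for the real library call.

```python
def library_sparseness(requested: int, limit: int = 5) -> int:
    """Largest divisor of `requested` that does not exceed `limit`."""
    if requested < 1:
        raise ValueError("sparseness factor must be at least 1")
    return max(d for d in range(1, limit + 1) if requested % d == 0)


def naive_sparse_sa(text: str, factor: int) -> list[int]:
    """Toy stand-in for the library call: sort the sampled suffixes by brute force."""
    return sorted(range(0, len(text), factor), key=lambda i: text[i:])


def build_sparse_sa(text: str, requested: int) -> list[int]:
    """Build with the library-friendly factor, then filter to the requested one."""
    factor = library_sparseness(requested)
    sa = naive_sparse_sa(text, factor)
    # Every multiple of `requested` is also a multiple of `factor`,
    # so filtering loses no entry that the requested sparseness needs.
    return [pos for pos in sa if pos % requested == 0]


for requested in (1, 4, 6, 7, 10):
    print(requested, "->", library_sparseness(requested))
```

Under these assumptions, a requested factor of 6 leads to a library call with sparseness 3, while a requested factor of 7 has no divisor between 2 and 5, so the library call falls back to sparseness 1 and all of the reduction happens in the filtering step.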
