
Building tables for the suffix array

Pieter Verschaffelt edited this page Apr 7, 2025 · 5 revisions

To generate all input tables (.tsv.lz4 files) required to build the Unipept suffix array, use the ./scripts/generate_sa_tables.sh script.

Usage

Usage: ./scripts/generate_sa_tables.sh [OPTIONS]

Required config values

  • --output-dir: Directory to save the output files.

Optional config values

  • --database-sources: Comma-separated list of database sources ('swissprot', 'trembl') (optional, default: 'swissprot,trembl').
  • --temp-dir: Temporary directory for intermediate files (optional, default: '/tmp').
  • --help: Prints this help message.

Examples

./scripts/generate_sa_tables.sh --database-sources swissprot,trembl --output-dir /path/to/output
./scripts/generate_sa_tables.sh --database-sources swissprot --output-dir /path/to/output --temp-dir /custom/tmp

Output

After successful execution, this script generates the following files that can be used to construct a suffix array:

  • uniprot_entries.tsv.lz4
  • taxons.tsv.lz4
  • lineages.tsv.lz4
  • interpro_entries.tsv.lz4
  • go_terms.tsv.lz4
  • ec_numbers.tsv.lz4
  • .version

Running the script

Important

  • Always pull the latest code changes from the main branch of this repository before running any of the commands below.
  • Since many of the commands in this guide can take several hours to complete, it's recommended to run them in a screen session.

Unipept's suffix array, which can be found in the unipept-index repository, currently powers the metaproteomics analysis performed by Unipept (and the Unipept API). Since the suffix array also works for non-tryptic peptides (in contrast to the traditional API that was written in Ruby and is used by the Unipept Desktop app), we no longer need to perform an in-silico tryptic digest of all proteins in UniProtKB. This significantly shortens the runtime for parsing and converting new releases of the UniProtKB database.

First, for convenience, we export the current UniProtKB database version that we're processing as a variable. This makes it easier to create the required directories later on.

export uniprot_version=2025-01

To build the files for the suffix array, run the following command from the root of the unipept-database repository. The paths in this example are configured for the Unipept API servers, but can be replaced with any other locations:

./scripts/generate_sa_tables.sh --database-sources swissprot,trembl --output-dir "/mnt/data/${uniprot_version}/tables" --temp-dir "/mnt/data/unipept-temp"

Note

This command requires the following directories to exist and to be writable by the current user:

  • /mnt/data/unipept-temp
  • /mnt/data/${uniprot_version}/tables
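Both directories can be prepared in one step. The following is a minimal sketch, assuming the example paths from this guide; `prepare_sa_dirs` is a hypothetical helper name and is not part of the repository:

```shell
# Hypothetical helper (not part of the unipept-database repository):
# create the temp and output directories below the base directory "$1"
# for UniProtKB version "$2", and check that the current user can write
# to both of them.
prepare_sa_dirs() {
    for dir in "$1/unipept-temp" "$1/$2/tables"; do
        mkdir -p "$dir" || return 1
        if [ ! -w "$dir" ]; then
            echo "Not writable: $dir" >&2
            return 1
        fi
    done
}

# Example call with the paths used in this guide:
# prepare_sa_dirs /mnt/data "${uniprot_version}"
```

If the directories have to be created by another user (for example with sudo), ownership or permissions may still need to be adjusted afterwards.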

Please note that at least 1 TiB of free space should be available on the filesystem where this data will be stored.
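The available space can be checked before starting a run that may take hours. This is a rough sketch that relies on the POSIX `df -P` output format; `has_free_space` is a hypothetical helper name, not something provided by the repository:

```shell
# Hypothetical helper: succeed only if the filesystem holding "$1" has at
# least "$2" KiB available to the current user.
has_free_space() {
    # -P forces the portable one-line-per-filesystem format and -k reports
    # sizes in 1024-byte blocks; column 4 of the second line is the
    # available space.
    avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge "$2" ]
}

# 1 TiB expressed in KiB (1024 * 1024 * 1024).
required_kib=$((1024 * 1024 * 1024))

# Example call with the output directory used in this guide:
# has_free_space "/mnt/data/${uniprot_version}/tables" "$required_kib" \
#     || echo "Less than 1 TiB available" >&2
```

If the temporary directory and the output directory live on different filesystems, each one needs its own check.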

After running this command, the following files should have been generated:

  • /mnt/data/${uniprot_version}/tables/uniprot_entries.tsv.lz4
  • /mnt/data/${uniprot_version}/tables/taxons.tsv.lz4
  • /mnt/data/${uniprot_version}/tables/lineages.tsv.lz4
  • /mnt/data/${uniprot_version}/tables/interpro_entries.tsv.lz4
  • /mnt/data/${uniprot_version}/tables/go_terms.tsv.lz4
  • /mnt/data/${uniprot_version}/tables/ec_numbers.tsv.lz4
  • /mnt/data/${uniprot_version}/tables/reference_proteomes.tsv.lz4
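The file list above can also be checked with a short loop rather than by hand. This is a minimal sketch; `check_sa_tables` is a hypothetical helper name, and it only tests that each file exists and is not empty, not that its contents are valid:

```shell
# Hypothetical helper: report every expected table that is missing or empty
# in the directory passed as "$1" (the value that was given to --output-dir).
check_sa_tables() {
    missing=0
    for table in uniprot_entries taxons lineages interpro_entries \
                 go_terms ec_numbers reference_proteomes; do
        file="$1/${table}.tsv.lz4"
        # -s is true only for files that exist and are not empty.
        if [ ! -s "$file" ]; then
            echo "Missing or empty: $file" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
# check_sa_tables "/path/to/output" && echo "All tables present"
```

The function returns a non-zero status as soon as at least one table is missing, so it can be used as a guard before starting the suffix array build.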

To continue building the Unipept Index (suffix array) itself, follow the instructions on the Unipept Index wiki.
