Building the suffix array

Pieter Verschaffelt edited this page Apr 7, 2025 · 27 revisions

Preparations

Note

The folders and commands in this guide are configured according to the file system on the Unipept API machines. Please adjust them if you plan to execute these commands on another machine. Since some of these commands can take a very long time to execute, it's recommended to start a screen session before following this guide.

Parse UniProtKB and produce all required table files

Set the correct UniProt version

export uniprot_version=2025-01

Create all output and temporary directories required by the commands in this tutorial

sudo mkdir -p "/mnt/data/uniprot-${uniprot_version}"/{suffix-array,tables}

Set the right permissions

sudo chmod -R 777 "/mnt/data/uniprot-${uniprot_version}"

Save the version number

echo "${uniprot_version}" | tr '-' '.' > "/mnt/data/uniprot-${uniprot_version}/suffix-array/.version"
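To make clear what ends up in the .version file, here is a minimal sketch of the same conversion wrapped in a function. The name to_dotted_version is hypothetical and only used for illustration; it is not part of any Unipept tooling.

```shell
# Hypothetical helper mirroring the `tr` call above: turn a dashed UniProt
# release name into the dotted form that is written to the .version file.
to_dotted_version() {
    printf '%s\n' "$1" | tr '-' '.'
}
```

For the version used in this guide, to_dotted_version "2025-01" prints 2025.01.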

Before we can start constructing a new version of the suffix array, we need to prepare some of its input files. For this, we need to use the unipept-database repository. Clone this repository, and follow the instructions written here to prepare all the files necessary for the suffix array.

After running the build_database.sh script, check that the following files are available. They are required to construct the suffix array.

  • /mnt/data/uniprot-${uniprot_version}/tables/uniprot_entries.tsv.lz4
  • /mnt/data/uniprot-${uniprot_version}/tables/taxons.tsv.lz4
  • /mnt/data/uniprot-${uniprot_version}/tables/lineages.tsv.lz4
  • /mnt/data/uniprot-${uniprot_version}/tables/interpro_entries.tsv.lz4
  • /mnt/data/uniprot-${uniprot_version}/tables/go_terms.tsv.lz4
  • /mnt/data/uniprot-${uniprot_version}/tables/ec_numbers.tsv.lz4
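The check on the files listed above can be scripted. The following is a minimal sketch, assuming the file names are exactly those in the list; the function name check_tables is hypothetical and not part of the unipept-database repository.

```shell
# Hypothetical helper: print every required table file that is missing from
# the given tables directory, and return non-zero if at least one is absent.
check_tables() {
    tables_dir="$1"
    missing=0
    for name in uniprot_entries taxons lineages interpro_entries go_terms ec_numbers; do
        file="${tables_dir}/${name}.tsv.lz4"
        if [ ! -f "${file}" ]; then
            echo "missing: ${file}"
            missing=1
        fi
    done
    return "${missing}"
}
```

It could be called as check_tables "/mnt/data/uniprot-${uniprot_version}/tables" before continuing with the next step.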

Convert the table files to the required format of the suffix array

Execute this command to extract and convert the uniprot_entries table to the correct format:

lz4cat /mnt/data/uniprot-${uniprot_version}/tables/uniprot_entries.tsv.lz4 | cut -f2,4,7,8 > /mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv
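Because cut selects four fields, every line of proteins.tsv should contain exactly four tab-separated columns, assuming each row of the uniprot_entries table has at least eight fields. A quick sanity check could look like the sketch below; the function name check_columns is hypothetical.

```shell
# Hypothetical helper: succeed only if every line of a tab-separated file
# has exactly the expected number of columns. Offending lines are reported.
check_columns() {
    awk -F'\t' -v n="$2" '
        NF != n { printf "line %d has %d columns\n", NR, NF; bad = 1 }
        END { exit bad + 0 }
    ' "$1"
}
```

For example, check_columns "/mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv" 4 exits with status 0 only when all lines have four columns. Note that this reads the whole file, which can take a while for a full UniProtKB release.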

Building the suffix array

Important

Make sure to always pull the latest code changes from the main branch of this repository before running any of the commands below.

In order to start the construction of the suffix array, you need to first compile the most recent version of the code using cargo build --release.

See this page for all available configuration options and usage of the sa-builder command.

Build using the CLI

Build the default suffix array used by the Unipept API

./target/release/sa-builder --database-file "/mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv" --output "/mnt/data/uniprot-${uniprot_version}/suffix-array/sa.bin" -a "lib-sais" -s 2 -c

This command assumes that you're still using the same directory structure that was configured at the start of this document.

Right now, the default configuration values for the suffix array that is running on the Unipept API machines are the following:

  • sparseness: 2
  • compressed: true

Note that this step can take several hours or days to complete.

Build using the HPC

Move the input files to the HPC VO

Set the HPC Virtual Organisation

export HPC_VO_LOCATION="/kyukon/data/gent/vo/000/gvo00038"

Move the files

scp "/mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv" "hpc-tibo:$HPC_VO_LOCATION/suffix-array"
scp "/mnt/data/uniprot-${uniprot_version}/suffix-array/taxons.tsv" "hpc-tibo:$HPC_VO_LOCATION/suffix-array"

Run the PBS job

Warning

Execute the following commands on the HPC login node!

Clone the unipept-index repository

git clone https://github.com/unipept/unipept-index

Go to the root of the repository

cd unipept-index

Swap to the high-memory gallade cluster

module swap cluster/gallade

Submit the PBS script to start the process

VSC_DATA_VO=/kyukon/data/gent/vo/000/gvo00038 qsub sa-builder/build.pbs

VSC_DATA_VO has to contain the path to the virtual organisation.

Troubleshooting

Error: attribute name space is experimental

error[E0658]: `#[diagnostic]` attribute name space is experimental
   --> /user/gent/437/vsc43736/.cargo/registry/src/index.crates.io-6f17d22bba15001f/axum-0.7.5/src/handler/mod.rs:130:5
    |
130 |     diagnostic::on_unimplemented(
    |     ^^^^^^^^^^
    |
    = note: see issue #111996 <https://github.com/rust-lang/rust/issues/111996> for more information
    = help: add `#![feature(diagnostic_namespace)]` to the crate attributes to enable

For more information about this error, try `rustc --explain E0658`.
error: could not compile `axum` (lib) due to previous error

Solution: Downgrade the package that fails to compile (axum in the output above) to a version that builds with the installed Rust toolchain.