Building the suffix array
Note
The folders and commands in this guide are configured according to the file system layout on the Unipept API machines.
Please adjust these if you plan to execute these commands on another machine.
Since some of these commands can take a very long time to execute, it's recommended to start a screen session before attempting to follow this guide.
Set the correct UniProt version
    export uniprot_version=2025-01
Create all output and temporary directories required by the commands in this tutorial
    sudo mkdir -p "/mnt/data/uniprot-${uniprot_version}"/{suffix-array,tables}
Set the right permissions
    sudo chmod -R 777 "/mnt/data/uniprot-${uniprot_version}"
Save the version number
    echo "${uniprot_version}" | tr '-' '.' > "/mnt/data/uniprot-${uniprot_version}/suffix-array/.version"
Before we can start constructing a new version of the suffix array, we need to prepare some of its input files. For this, we need the unipept-database repository. Clone this repository, and follow the instructions written here to prepare all the files necessary for the suffix array.
After running the build_database.sh script, check that the following files are available. They are required to construct the suffix array:
- /mnt/data/uniprot-${uniprot_version}/tables/uniprot_entries.tsv.lz4
- /mnt/data/uniprot-${uniprot_version}/tables/taxons.tsv.lz4
- /mnt/data/uniprot-${uniprot_version}/tables/lineages.tsv.lz4
- /mnt/data/uniprot-${uniprot_version}/tables/interpro_entries.tsv.lz4
- /mnt/data/uniprot-${uniprot_version}/tables/go_terms.tsv.lz4
- /mnt/data/uniprot-${uniprot_version}/tables/ec_numbers.tsv.lz4
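The availability check above can be scripted. The following is a minimal sketch, not part of the repository: the function name `check_tables` is hypothetical, and it only tests that each of the six listed files exists in the given directory.

```shell
# Hypothetical helper: report which of the six expected table files are
# missing from the directory passed as the first argument.
# Returns 0 when all files are present and 1 otherwise.
check_tables() {
    tables_dir=$1
    missing=0
    for name in uniprot_entries taxons lineages interpro_entries go_terms ec_numbers; do
        if [ ! -f "${tables_dir}/${name}.tsv.lz4" ]; then
            echo "missing: ${tables_dir}/${name}.tsv.lz4"
            missing=1
        fi
    done
    return "${missing}"
}
```

With the directory layout of this guide, it would be called as `check_tables "/mnt/data/uniprot-${uniprot_version}/tables"`. No output and a zero exit status mean that all files are in place.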
Execute this command to extract and convert the uniprot_entries table to the correct format:
    lz4cat /mnt/data/uniprot-${uniprot_version}/tables/uniprot_entries.tsv.lz4 | cut -f2,4,7,8 > /mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv
Important
Make sure to always pull the latest code changes from the main branch of this repository before running any of the commands below.
Before you can start the construction of the suffix array, you first need to compile the most recent version of the code using cargo build --release.
See this page for all available configuration options and usage of the sa-builder command.
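Since the build can run for a long time, it may be worth sanity-checking proteins.tsv before starting it. The sketch below assumes that every row of the uniprot_entries table has at least eight columns, so that `cut -f2,4,7,8` yields exactly four tab-separated fields per line. The function name `check_four_columns` is hypothetical and not part of the repository.

```shell
# Hypothetical helper: list every line of a TSV file that does not have
# exactly four tab-separated fields. Empty fields still count, because the
# field separator is a single tab. Exits non-zero if any such line exists.
check_four_columns() {
    awk -F '\t' '
        NF != 4 { printf "line %d has %d fields\n", NR, NF; bad = 1 }
        END     { exit bad + 0 }
    ' "$1"
}
```

It would be called as `check_four_columns "/mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv"`. On a file of this size, the check itself can take a while.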
Build the default suffix array used by the Unipept API
    ./target/release/sa-builder --database-file "/mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv" --output "/mnt/data/uniprot-${uniprot_version}/suffix-array/sa.bin" -a "lib-sais" -s 2 -c
This command assumes that you're still using the same directory structure that was configured at the start of this document.
Right now, the default configuration values for the suffix array that is running on the Unipept API machines are the following:
- sparseness: 2
- compressed: true
Note that this step can take several hours or days to complete.
Set the HPC Virtual Organisation
    export HPC_VO_LOCATION="/kyukon/data/gent/vo/000/gvo00038"
Move the files
    scp "/mnt/data/uniprot-${uniprot_version}/suffix-array/proteins.tsv" "hpc-tibo:$HPC_VO_LOCATION/suffix-array"
    scp "/mnt/data/uniprot-${uniprot_version}/suffix-array/taxons.tsv" "hpc-tibo:$HPC_VO_LOCATION/suffix-array"
Warning
Execute the following commands on the HPC login node!
Clone the unipept-index repository
    git clone https://github.com/unipept/unipept-index
Go to the root of the repository
    cd unipept-index
Swap to the high-memory gallade cluster
    module swap cluster/gallade
Submit the PBS script to start the process
    VSC_DATA_VO=/kyukon/data/gent/vo/000/gvo00038 qsub sa-builder/build.pbs
VSC_DATA_VO has to contain the path to the virtual organisation.
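Before submitting the job, it can help to confirm that the files copied earlier actually arrived in the virtual organisation directory. The sketch below is an assumption-laden example: the function name `check_vo_inputs` is hypothetical, and it presumes that `suffix-array` is an existing directory inside the virtual organisation, so that the two `scp` commands placed proteins.tsv and taxons.tsv inside it. Whether build.pbs reads exactly these paths is not confirmed by this guide.

```shell
# Hypothetical helper: check that the inputs copied with scp are present
# under <vo_dir>/suffix-array. Returns 0 when both files exist, 1 otherwise.
check_vo_inputs() {
    vo_dir=$1
    if [ -z "${vo_dir}" ]; then
        echo "no virtual organisation path given" >&2
        return 1
    fi
    status=0
    for name in proteins.tsv taxons.tsv; do
        if [ ! -f "${vo_dir}/suffix-array/${name}" ]; then
            echo "missing: ${vo_dir}/suffix-array/${name}"
            status=1
        fi
    done
    return "${status}"
}
```

On the login node, this would be called as `check_vo_inputs /kyukon/data/gent/vo/000/gvo00038` before running `qsub`.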
error[E0658]: `#[diagnostic]` attribute name space is experimental
--> /user/gent/437/vsc43736/.cargo/registry/src/index.crates.io-6f17d22bba15001f/axum-0.7.5/src/handler/mod.rs:130:5
|
130 | diagnostic::on_unimplemented(
| ^^^^^^^^^^
|
= note: see issue #111996 <https://github.com/rust-lang/rust/issues/111996> for more information
= help: add `#![feature(diagnostic_namespace)]` to the crate attributes to enable
For more information about this error, try `rustc --explain E0658`.
error: could not compile `axum` (lib) due to previous error
Solution: Downgrade the package that fails to compile (axum in the log above) to a version that builds with the Rust toolchain available on the HPC.
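The E0658 message above is what rustc reports when a crate uses the `#[diagnostic]` attribute namespace on a toolchain that predates its stabilization, which took place in Rust 1.78 according to the Rust release notes. Assuming that this is the cause here, the sketch below compares the installed compiler version against that threshold. The function name `version_at_least` is hypothetical, and it relies on `sort -V` (version sort), which is available in GNU coreutils.

```shell
# Hypothetical helper: succeed when version $1 is greater than or equal to
# version $2. Relies on `sort -V`, so the smaller version is listed first.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use (commented out so that the block only defines the helper):
# installed=$(rustc --version | cut -d ' ' -f 2)
# version_at_least "${installed}" 1.78.0 || echo "rustc ${installed} predates 1.78"
```

If the toolchain turns out to be older, one way to apply the downgrade is `cargo update -p axum --precise <version>`, where `<version>` is a release that still compiles on that toolchain. This guide does not state which version that is.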