This repository provides code for sketching genomic data using random projection to efficiently process and compare large metagenomic datasets.
Follow these steps to set up the necessary environment and build the executables.
Clone the repository and its submodules recursively:
git clone --recursive https://github.com/RolandFaure/metagenome_vector_sketches.git
cd metagenome_vector_sketches
git submodule update --init --recursive
Create a new Conda environment named faiss_env and install the required dependencies, including FAISS for fast similarity search.
conda create -n faiss_env python=3.12
conda activate faiss_env
conda install -c pytorch faiss-cpu
conda install -c conda-forge pybind11 scipy matplotlib pandas hdf5 h5py
Navigate back to the main directory, create a build folder, and compile the C++ code using cmake. This step generates all necessary executables inside the build folder.
cd metagenome_vector_sketches
mkdir build
cd build
cmake -DHDF5_ROOT=$CONDA_PREFIX \
-DPython_EXECUTABLE=$(which python) \
-DPython_ROOT_DIR=$CONDA_PREFIX \
-DPython_FIND_STRATEGY=LOCATION \
..
cmake --build . -j 8
The following examples use data in the test folder. All compiled executables are located inside the build folder. Running any executable without arguments will display its usage instructions.
Use project_everything to create projected vectors from FracMinHash data. The output vectors are stored in the folder given on the command line (toy_db/ in the command below).
cd test/
../build/project_everything toy toy_db/ -t 8 -d 2048 -s 0
After generating vectors, you can create a FAISS index for efficient search using the Python script jaccard.py.
python3 ../src/jaccard.py index toy_db -t 8
The pairwise_comp_optimized executable computes the similarity matrix between all vectors.
To compute the matrix:
../build/pairwise_comp_optimized --db toy_db/ --dimension 2048 --output_folder toy_index/ --max_memory_gb 12 --num_threads 8
Strategy Note: The default strategy is 0=random projections. You can use --strategy 1 for MinHashes.
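To give some intuition for the random projection strategy, here is a minimal sketch of one common way to do it. It is not taken from this repository's C++ code: the function names (sign_vector, project, estimate_jaccard), the +/-1 sign vectors, and the use of known set sizes are all assumptions made for illustration. Each hash is mapped to a pseudo-random sign vector, a sample is sketched as the sum of those vectors, and the dot product of two sketches then approximates the number of shared hashes.

```python
import numpy as np


def sign_vector(hash_value, dim, seed=0):
    """Pseudo-random +/-1 vector that depends only on (seed, hash_value)."""
    rng = np.random.default_rng([seed, hash_value])
    return rng.choice([-1.0, 1.0], size=dim)


def project(hashes, dim, seed=0):
    """Sketch a set of hashes as the sum of its per-hash sign vectors."""
    vec = np.zeros(dim)
    for h in hashes:
        vec += sign_vector(h, dim, seed)
    return vec


def estimate_jaccard(vec_a, vec_b, size_a, size_b):
    """Estimate the Jaccard index from two sketches and the known set sizes."""
    dim = vec_a.shape[0]
    # A shared hash contributes +1 per coordinate; all other pairs average to 0.
    intersection = float(vec_a @ vec_b) / dim
    return intersection / (size_a + size_b - intersection)


# Toy data (hypothetical): two sets of 200 integer "hashes" sharing 100 elements.
set_a = range(0, 200)
set_b = range(100, 300)
sketch_a = project(set_a, dim=2048)
sketch_b = project(set_b, dim=2048)
estimate = estimate_jaccard(sketch_a, sketch_b, len(set_a), len(set_b))
exact = 100 / 300
print(f"estimated={estimate:.3f} exact={exact:.3f}")
```

The estimate is noisy, and its error shrinks roughly with the square root of the dimension, which is why a fairly large dimension such as 2048 is used in the commands above.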
The query_pc_mat executable allows you to query the computed similarity matrix.
Query Pairwise Comparison Matrix
Usage:
../build/query_pc_mat [--matrix <folder>] [--db <folder>] [--query_file <file>] [--top
<int>] [--thread <int>] [--batch_size <int>] [--write_to_file <file>]
[--show_all] [--print] [--help]
../build/query_pc_mat [--matrix <folder>] [--db <folder>] [--query_ids <ids>...] [--top
<int>] [--thread <int>] [--batch_size <int>] [--write_to_file <file>]
[--show_all] [--print] [--help]
../build/query_pc_mat [--matrix <folder>] [--db <folder>] [--row_file <row> [--col_file]
<col>] [--top <int>] [--thread <int>] [--batch_size <int>]
[--write_to_file <file>] [--show_all] [--print] [--help]
Options:
--matrix Folder containing the pairwise matrix files
--db Folder containing the matrix meta data
--query_file File containing query IDs (one per line)
--query_ids Query IDs as command line arguments (numeric indices or identifiers)
--row_file File containing query row IDs (one per line)
--col_file File containing query col IDs (one per line)
--top Number of top jaccard values to show [default 10]
--batch_size Number of queries to process per batch [default 1000]
--thread Number of threads to use [default 1]
--write_to_file Where to save the output (expected format: *.csv/*.tsv/*.npy/*npz/*h5 for row-col query. *.csv/*tsv/*txt for regular query).
--show_all Whether to show all neighbors instead of top N
--print Whether to print the outputs to screen
--help Show this help message
Note: Batches are executed in parallel, up to the configured number of threads. Within each batch, queries are processed sequentially. The write phase for sliced queries is also performed sequentially.
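The scheduling described in this note can be pictured with a short Python sketch. It only mirrors the described behaviour (parallel batches, sequential queries within a batch) and is not the query_pc_mat implementation; run_queries, make_batches, and the handle_query callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def run_batch(batch, handle_query):
    """Process the queries of one batch one after another."""
    return [handle_query(q) for q in batch]


def run_queries(query_ids, handle_query, batch_size=1000, threads=1):
    """Run batches in parallel, with up to `threads` batches at a time."""
    batches = make_batches(query_ids, batch_size)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves batch order, so results line up with query_ids.
        per_batch = list(pool.map(lambda b: run_batch(b, handle_query), batches))
    return [result for batch in per_batch for result in batch]


# Example with a stand-in query function (hypothetical).
doubled = run_queries(list(range(12)), lambda q: q * 2, batch_size=5, threads=2)
```

Under this model, a smaller --batch_size creates more batches to spread across threads, while a batch size larger than the number of queries leaves only one batch and therefore no parallelism.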
To query from all accessions inside the server, use --matrix /scratch/mgs_project/matrix_unzipped/ --db /scratch/mgs_project/db/
Query the matrix for neighbors of specific IDs listed in a file (query_strs.txt):
../build/query_pc_mat --matrix toy_index --db toy_db/ --query_file query_strs.txt --write_to_file toy_neighbors.txt --batch_size 5 --thread 2 --show_all
This command outputs one file per query ID (e.g., DRR000821_toy_neighbors.txt) containing all neighbors, as --show_all is specified.
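When scripting around this command, it can help to know which files to expect. The sketch below assumes the naming pattern suggested by the example above, that is, the query ID, an underscore, then the --write_to_file value. That pattern is inferred from the single example and is not confirmed for all cases; read_ids and expected_outputs are hypothetical helper names.

```python
def read_ids(path):
    """Read query IDs from a file with one ID per line, skipping blank lines."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def expected_outputs(query_ids, write_to_file):
    """Assumed naming: '<query_id>_<write_to_file>' for each query ID."""
    return [f"{query_id}_{write_to_file}" for query_id in query_ids]


# Example with the ID mentioned above.
names = expected_outputs(["DRR000821"], "toy_neighbors.txt")
print(names)
```

In practice, calling expected_outputs(read_ids("query_strs.txt"), "toy_neighbors.txt") gives a list that can be checked against the files actually written.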
Query a slice of the matrix (a sub-matrix) defined by IDs in a row file and a column file:
../build/query_pc_mat --matrix toy_index --db toy_db/ --row_file row_file.txt --col_file col_file.txt --write_to_file row_col.h5 --batch_size 5 --thread 2
Important Output Format Note:
Sliced (Row-Col) Query: Output file must be *.csv, *.tsv, *.npy, *.npz, or *.h5. The *.h5 format gives the most compressed output.
Regular Query: Output file must be *.csv, *.tsv, or *.txt.
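A wrapper script could check the output extension before launching a long query. The helper below is a minimal sketch based only on the two rules above. The name output_ok is hypothetical, and the check is exact and case-sensitive, which is an assumption about how query_pc_mat behaves.

```python
from pathlib import Path

# Extensions listed in the note above for each query type.
SLICED_EXTS = {".csv", ".tsv", ".npy", ".npz", ".h5"}
REGULAR_EXTS = {".csv", ".tsv", ".txt"}


def output_ok(path, sliced):
    """Return True if the file extension is allowed for the query type."""
    allowed = SLICED_EXTS if sliced else REGULAR_EXTS
    return Path(path).suffix in allowed


# The two example commands above use these output names.
sliced_ok = output_ok("row_col.h5", sliced=True)
regular_ok = output_ok("toy_neighbors.txt", sliced=False)
```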
The read_pc_mat.py script provides a Python interface for searching the pairwise comparison matrix.
Usage: read_pc_mat.py [-h] --matrix MATRIX --db DB [--query_file QUERY_FILE] [--row_file ROW_FILE] [--col_file COL_FILE]
Pairwise Comparison Matrix Search
options:
-h, --help show this help message and exit
--matrix MATRIX Folder containing matrix data
--db DB Folder containing auxilary information of the matrix
--query_file QUERY_FILE
File with query IDs (one ID per line)
--row_file ROW_FILE File containing row IDs (one ID per line)
--col_file COL_FILE File containing column IDs (one ID per line)
python3 ../src/read_pc_mat.py --matrix toy_index --db toy_db/ --query_file query_strs.txt
python3 ../src/read_pc_mat.py --matrix toy_index --db toy_db/ --row_file row_file.txt --col_file col_file.txt