🧬 Metagenome Vector Sketches

This repository provides code for sketching genomic data using random projection to efficiently process and compare large metagenomic datasets.

🛠️ Installation Guide

Follow these steps to set up the necessary environment and build the executables.

Clone the Repository

Clone the repository and its submodules recursively:

git clone --recursive https://github.com/RolandFaure/metagenome_vector_sketches.git
cd metagenome_vector_sketches
git submodule update --init --recursive

Set Up the Conda Environment

Create a new Conda environment named faiss_env and install the required dependencies, including FAISS for fast similarity search.

conda create -n faiss_env python=3.12
conda activate faiss_env
conda install -c pytorch faiss-cpu
conda install -c conda-forge pybind11 scipy matplotlib pandas hdf5 h5py

Build the Executables

Navigate back to the main directory, create a build folder, and compile the C++ code using cmake. This step generates all necessary executables inside the build folder.

cd metagenome_vector_sketches
mkdir build
cd build
cmake -DHDF5_ROOT=$CONDA_PREFIX \
      -DPython_EXECUTABLE=$(which python) \
      -DPython_ROOT_DIR=$CONDA_PREFIX \
      -DPython_FIND_STRATEGY=LOCATION \
      ..
cmake --build . -j 8

🚀 Usage Examples

The following examples use data in the test folder. All compiled executables are located inside the build folder. Running any executable without arguments will display its usage instructions.

Create Projected Vectors

Use project_everything to create projected vectors from fracminhash data. The output vectors will be stored in the specified index folder (toy_index/).

cd test/
../build/project_everything toy toy_db/ -t 8 -d 2048 -s 0

Create FAISS Index

After generating vectors, you can create a FAISS index for efficient search using the Python script jaccard.py.

python3 ../src/jaccard.py index toy_db -t 8

Compute Pairwise Comparison Matrix

The pairwise_comp_optimized executable computes the similarity matrix between all vectors.

To compute the matrix:

../build/pairwise_comp_optimized --db toy_db/ --dimension 2048 --output_folder toy_index/ --max_memory_gb 12 --num_threads 8

Strategy Note: The default strategy is 0=random projections. You can use --strategy 1 for MinHashes.

Query the Pairwise Matrix

The query_pc_mat executable allows you to query the computed similarity matrix.

Query Pairwise Comparison Matrix

Usage:
        ../build/query_pc_mat [--matrix <folder>] [--db <folder>] [--query_file <file>] [--top
                              <int>] [--thread <int>] [--batch_size <int>] [--write_to_file <file>]
                              [--show_all] [--print] [--help]

        ../build/query_pc_mat [--matrix <folder>] [--db <folder>] [--query_ids <ids>...] [--top
                              <int>] [--thread <int>] [--batch_size <int>] [--write_to_file <file>]
                              [--show_all] [--print] [--help]

        ../build/query_pc_mat [--matrix <folder>] [--db <folder>] [--row_file <row> [--col_file]
                              <col>] [--top <int>] [--thread <int>] [--batch_size <int>]
                              [--write_to_file <file>] [--show_all] [--print] [--help]

Options:
  --matrix       Folder containing the pairwise matrix files
  --db   Folder containing the matrix meta data
  --query_file   File containing query IDs (one per line)
  --query_ids    Query IDs as command line arguments (numeric indices or identifiers)
  --row_file     File containing query row IDs (one per line)
  --col_file     File containing query col IDs (one per line)
  --top  Number of top jaccard values to show [default 10]
  --batch_size   Number of queries to process per batch [default 1000]
  --thread       Number of threads to use [default 1]
  --write_to_file        Where to save the output (expected format: *.csv/*.tsv/*.npy/*npz/*h5 for row-col query. *.csv/*tsv/*txt for regular query).
  --show_all     Whether to show all neighbors instead of top N
  --print        Whether to print the outputs to screen
  --help         Show this help message

Note: Batches are executed in parallel, up to the configured number of threads. Within each batch, queries are processed sequentially. The write phase for sliced queries is also performed sequentially.

To query from all accessions inside the server, use --matrix /scratch/mgs_project/matrix_unzipped/ --db /scratch/mgs_project/db/

Regular Query (Nearest Neighbors)

Query the matrix for neighbors of specific IDs listed in a file (query_strs.txt):

../build/query_pc_mat --matrix toy_index --db toy_db/ --query_file query_strs.txt --write_to_file toy_neighbors.txt --batch_size 5 --thread 2 --show_all

This command outputs one file per query ID (e.g., DRR000821_toy_neighbors.txt) containing all neighbors, as --show_all is specified.

Sliced Matrix Query (Sub-matrix)

Query a slice of the matrix (a sub-matrix) defined by IDs in a row file and a column file:

../build/query_pc_mat --matrix toy_index --db toy_db/  --row_file row_file.txt --col_file col_file.txt --write_to_file row_col.h5 --batch_size 5 --thread 2

Important Output Format Note:

    Sliced (Row-Col) Query: Output file must be *.csv, *.tsv, *.npy, *npz or *h5. *h5 gives the most compressed output.

    Regular Query: Output file must be *.csv, *.tsv, or *.txt.

Python Interface for Matrix Search

The read_pc_mat.py script provides a Python interface for searching the pairwise comparison matrix.

Usage: read_pc_mat.py [-h] --matrix MATRIX --db DB [--query_file QUERY_FILE] [--row_file ROW_FILE] [--col_file COL_FILE]

Pairwise Comparison Matrix Search

options:
  -h, --help            show this help message and exit
  --matrix MATRIX       Folder containing matrix data
  --db DB               Folder containing auxilary information of the matrix
  --query_file QUERY_FILE
                        File with query IDs (one ID per line)
  --row_file ROW_FILE   File containing row IDs (one ID per line)
  --col_file COL_FILE   File containing column IDs (one ID per line)

Regular Query (Python)

python3 ../src/read_pc_mat.py --matrix toy_index --db toy_db/ --query_file query_strs.txt

Sliced Matrix Query (Python)

python3 ../src/read_pc_mat.py --matrix toy_index --db toy_db/ --row_file row_file.txt --col_file col_file.txt

Name		Name	Last commit message	Last commit date
Latest commit History 70 Commits
bits @ 9be7afa		bits @ 9be7afa
external		external
include		include
src		src
test		test
.gitmodules		.gitmodules
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧬 Metagenome Vector Sketches

🛠️ Installation Guide

Clone the Repository

Set Up the Conda Environment

Build the Executables

🚀 Usage Examples

Create Projected Vectors

Create FAISS Index

Compute Pairwise Comparison Matrix

Query the Pairwise Matrix

Regular Query (Nearest Neighbors)

Sliced Matrix Query (Sub-matrix)

Python Interface for Matrix Search

Regular Query (Python)

Sliced Matrix Query (Python)

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

RolandFaure/metagenome_vector_sketches

Folders and files

Latest commit

History

Repository files navigation

🧬 Metagenome Vector Sketches

🛠️ Installation Guide

Clone the Repository

Set Up the Conda Environment

Build the Executables

🚀 Usage Examples

Create Projected Vectors

Create FAISS Index

Compute Pairwise Comparison Matrix

Query the Pairwise Matrix

Regular Query (Nearest Neighbors)

Sliced Matrix Query (Sub-matrix)

Python Interface for Matrix Search

Regular Query (Python)

Sliced Matrix Query (Python)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages