Follow the steps below to install and configure HYMET.

The easiest way to install HYMET is through Bioconda:

```bash
conda install -c bioconda hymet
```
After installation, you will need to download the reference databases as described in the Reference Sketched Databases section.
Alternatively, you can clone the repository to your local environment:

```bash
git clone https://github.com/ieeta-pt/HYMET.git
cd HYMET
```
If you prefer using Docker, follow these steps:

1. **Build the Docker image:**

   ```bash
   docker build -t hymet .
   ```

2. **Run the container:**

   ```bash
   docker run -it hymet
   ```

3. **Inside the container:**
   - The environment will already be set up with all dependencies installed.
   - Run the tool as needed.
If you cloned the repository, you can create a Conda environment from the included file:

1. **Create the Conda environment:**

   ```bash
   conda env create -f environment.yml
   ```

2. **Activate the environment:**

   ```bash
   conda activate hymet_env
   ```
The tool expects input files in FASTA format (`.fna` or `.fasta`). Each file should contain metagenomic sequences with headers in the following format:

```
>sequence_id additional_info
SEQUENCE_DATA
```

- `sequence_id`: A unique identifier for the sequence.
- `additional_info`: Optional metadata (e.g., source organism, length).
- `SEQUENCE_DATA`: The nucleotide sequence.
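The expected record layout can be illustrated with a short parser. This is a minimal sketch for checking your own files; the `parse_fasta` helper and the sample record are illustrative, not part of HYMET:

```python
# Minimal FASTA parser illustrating the expected header layout.
# parse_fasta and the sample data are illustrative only, not HYMET code.

def parse_fasta(text):
    """Return a list of (sequence_id, additional_info, sequence) triples."""
    records = []
    seq_id, info, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                records.append((seq_id, info, "".join(chunks)))
            header = line[1:].split(maxsplit=1)
            seq_id = header[0]
            info = header[1] if len(header) > 1 else ""
            chunks = []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        records.append((seq_id, info, "".join(chunks)))
    return records

sample = """>seq1 Escherichia coli, 12 bp
ACGTACGT
ACGT
>seq2
TTTT
"""
records = parse_fasta(sample)
print(records[0][0], records[0][2])  # seq1 ACGTACGTACGT
```

Note that `additional_info` is everything after the first whitespace in the header, so it may be empty (as for `seq2` above).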
Place your input files in the directory specified by the `$input_dir` variable in the `main.pl` script.
For example, if your input directory contains the following files:

```
input_dir/
├── sample1.fna
├── sample2.fna
└── sample3.fna
```

Each file (`sample1.fna`, `sample2.fna`, etc.) should follow the FASTA format described above.
Ensure all scripts have execution permissions:

```bash
chmod +x config.pl
chmod +x main.pl
chmod +x scripts/*.sh
chmod +x scripts/*.py
```
After installation and configuration, first run the configuration script to download and prepare the taxonomy files and define the main paths:

```bash
./config.pl
```

Then run the main tool to perform taxonomic identification:

```bash
./main.pl
```

If installed via Conda, you can use the equivalent commands:

```bash
hymet-config
hymet
```
The databases required to run the tool are available for download on Google Drive:
1. **Download the files:**
   - Click on the links above to download the `.gz` files.

2. **Place the files in the `data/` directory:**
   - Move the downloaded files to the `data/` directory of the project.

3. **Unzip the files:**
   - Use the following commands to unzip the `.gz` files:

     ```bash
     gunzip data/sketch1.msh.gz
     gunzip data/sketch2.msh.gz
     gunzip data/sketch3.msh.gz
     ```

   - This will extract the files `sketch1.msh`, `sketch2.msh`, and `sketch3.msh` into the `data/` directory.

4. **Verify the files:**
   - After unzipping, ensure the files are in the correct location:

     ```bash
     ls data/
     ```

   - You should see the files `sketch1.msh`, `sketch2.msh`, and `sketch3.msh`.
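The same check can be scripted; here is a minimal sketch, where `missing_sketches` is an illustrative helper (not part of the HYMET codebase) that reports which expected sketch files are absent:

```python
# Report which expected Mash sketch files are missing from a data directory.
# missing_sketches is an illustrative helper, not part of HYMET.
import os

EXPECTED = ["sketch1.msh", "sketch2.msh", "sketch3.msh"]

def missing_sketches(data_dir, expected=EXPECTED):
    """Return the expected sketch filenames not found in data_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(data_dir, name))]

missing = missing_sketches("data")
if missing:
    print("Missing sketches:", ", ".join(missing))
else:
    print("All sketch files present.")
```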
- `config.pl`: Configuration script that downloads and prepares taxonomy files.
- `main.pl`: Main script that runs the taxonomic identification pipeline.
- `scripts/`: Directory containing helper scripts in Perl, Python, and Bash.
  - `mash.sh`: Script to run Mash.
  - `downloadDB.py`: Script to download genomes.
  - `minimap.sh`: Script to run Minimap2.
  - `classification.py`: Script for taxonomic classification.
- `taxonomy_files/`: Directory containing downloaded taxonomy files.
- `data/`: Directory for storing intermediate data.
  - `sketch1.msh`
  - `sketch2.msh`
  - `sketch3.msh`
  - `taxonomy_hierarchy.tsv`
- `output/`: Directory where final results are saved.
The tool generates a `classified_sequences.tsv` file in the `output/` directory with the following columns:
- Query: Identifier of the queried sequence.
- Lineage: Identified taxonomic lineage.
- Taxonomic Level: Taxonomic level (e.g., species, genus).
- Confidence: Classification confidence (0 to 1).
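For downstream analysis, the output can be read with the standard `csv` module. The column names below come from the table above; the sample rows and the confidence threshold are invented for illustration:

```python
# Filter classified_sequences.tsv rows by the Confidence column.
# The sample TSV content and the 0.8 threshold are illustrative only.
import csv
import io

sample_tsv = (
    "Query\tLineage\tTaxonomic Level\tConfidence\n"
    "read_1\tBacteria;Proteobacteria\tphylum\t0.97\n"
    "read_2\tBacteria;Firmicutes\tphylum\t0.41\n"
)

def confident_rows(handle, min_conf=0.8):
    """Return rows whose Confidence value meets the threshold."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["Confidence"]) >= min_conf]

rows = confident_rows(io.StringIO(sample_tsv))
print(len(rows), rows[0]["Query"])  # 1 read_1
```

In practice you would pass an open file handle for `output/classified_sequences.tsv` instead of the `StringIO` sample.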
This folder includes scripts to install and prepare all necessary data to replicate the work using our dataset.

**Prerequisites:**
- Before running the scripts in this folder, download the assembly files (`assembly_files.txt`) for each domain from the NCBI FTP site.

**Scripts:**
- `create_database.py`: Downloads 10% of the content from each downloaded assembly file and organizes the datasets by domain.
- `extractNC.py`: Maps the content of each Genome Collection File (GCF) to its respective sequence identifiers. It generates a CSV file containing this mapping, with one column for the GCF and another for the sequence identifiers (such as NC, NZ, etc.) present in each GCF.
- `extractTaxonomy.py`: Creates a CSV file containing the GCF and its respective taxonomy, among other information.
- Additional scripts modify the data format and organization, including:
  - Implementing mutations
  - Converting formats (e.g., FASTA to FASTQ)
  - Formatting into paired-end reads
- `GCFtocombinedfasta.py`: Combines all GCFs from each domain into a single FASTA file, separating sequences by identifier. Its output is used as input for most of the tools.
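The combining step performed by `GCFtocombinedfasta.py` can be sketched in a few lines. This is a simplified illustration of the behavior described above, not the actual script; the accession-style identifiers in the sample are invented:

```python
# Concatenate several per-GCF FASTA texts into one combined FASTA,
# keeping each sequence under its own identifier. A simplified
# illustration, not the actual GCFtocombinedfasta.py implementation.

def combine_fastas(fasta_texts):
    """Join FASTA texts, preserving one header line per sequence."""
    out = []
    for text in fasta_texts:
        for line in text.strip().splitlines():
            out.append(line.rstrip())
    return "\n".join(out) + "\n"

gcf_a = ">NC_000001.1 chromosome\nACGT\n"
gcf_b = ">NZ_000002.1 plasmid\nTTTT\n"
combined = combine_fastas([gcf_a, gcf_b])
print(combined.count(">"))  # 2
```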
For questions or issues, please open an issue in the repository.