- Bakta version = 1.9.4
- Bakta database version = 5.1
- PyHmmer version = 0.10.15
- PFamA version = 36
conda create -n bakta -c conda-forge python=3.11 cloudpathlib-s3 pandas notebook fsspec s3fs
- comma delimited
- two columns with headers: genome_id,genome_path
- Example: test_20221220_0.seedfile.csv
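As a quick sanity check before launching a run, the format above can be verified with a few lines of Python. This is a minimal sketch, not part of the pipeline: the function name `read_seedfile` and the example row are hypothetical, and it assumes the header must match `genome_id,genome_path` exactly.

```python
import csv
import io

# Header described above; assumed to be required exactly as written.
EXPECTED_HEADER = ["genome_id", "genome_path"]


def read_seedfile(handle):
    """Return a list of (genome_id, genome_path) tuples from an open seedfile."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"expected header {EXPECTED_HEADER}, got {header}")
    rows = []
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2 or not all(field.strip() for field in row):
            raise ValueError(
                f"line {line_number}: expected 2 non-empty fields, got {row}"
            )
        rows.append((row[0].strip(), row[1].strip()))
    return rows


# Hypothetical seedfile contents, for illustration only.
example = io.StringIO(
    "genome_id,genome_path\n"
    "genome_A,s3://example-bucket/genomes/genome_A.fasta\n"
)
print(read_seedfile(example))
```

For a real seedfile, pass an open file handle instead of the in-memory example, e.g. `open("test_20221220_0.seedfile.csv", newline="")`.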
The helper script, create_seedfile.py, will create the properly formatted seedfile for you if you can point it to an S3 path.
cd nf-bakta
python bin/create_seedfile.py \
-g s3://maf-users/Nathan_Johns/DBs/Segata_Genomes/Fastas/ \
-project UHGG_Annotation \
-prefix 20221221 \
--extension .fasta
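If the helper script is not available, a seedfile with the same two columns can be written by hand. The sketch below is not the code in `bin/create_seedfile.py`; it assumes the genome ID is simply the file name with the extension removed, and the function names and example paths are hypothetical.

```python
import csv
import io


def seedfile_rows(genome_paths, extension):
    """Build (genome_id, genome_path) rows for paths ending in `extension`."""
    if not extension:
        raise ValueError("extension must be a non-empty string such as '.fasta'")
    rows = []
    for path in genome_paths:
        if not path.endswith(extension):
            continue  # ignore files that are not genomes
        file_name = path.rsplit("/", 1)[-1]
        # Assumption: the genome ID is the file name minus the extension.
        genome_id = file_name[: -len(extension)]
        rows.append((genome_id, path))
    return rows


def write_seedfile(rows, handle):
    """Write the header and rows as comma-delimited text."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["genome_id", "genome_path"])
    writer.writerows(rows)


# Hypothetical S3 listing, for illustration only.
paths = [
    "s3://example-bucket/Fastas/genome_A.fasta",
    "s3://example-bucket/Fastas/notes.txt",
]
buffer = io.StringIO()
write_seedfile(seedfile_rows(paths, ".fasta"), buffer)
print(buffer.getvalue())
```

The list of paths has to come from somewhere, for example an S3 listing made with `s3fs` or the AWS CLI; that step is left out here.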
This helper script will also recommend a job submission command that you can use to launch your job using the seedfile that was just created.
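The exact text the helper prints is not reproduced here. As a rough sketch, an `aws batch submit-job` command of the form used in this README could be assembled as follows. The function name is hypothetical, the defaults are copied from the example in this README, and the sketch assumes none of the values contain commas or spaces, since the `command=` shorthand is comma-separated.

```python
import shlex


def build_submit_job_command(
    job_name,
    seedfile,
    project,
    prefix,
    job_queue="priority-maf-pipelines",
    job_definition="nextflow-production",
    pipeline="FischbachLab/nf-bakta",
):
    """Return the argument list for an `aws batch submit-job` call."""
    # Comma-separated container command, as in the AWS CLI shorthand syntax.
    container_command = ",".join(
        [
            pipeline,
            "--seedfile", seedfile,
            "--project", project,
            "--prefix", prefix,
        ]
    )
    return [
        "aws", "batch", "submit-job",
        "--job-name", job_name,
        "--job-queue", job_queue,
        "--job-definition", job_definition,
        "--container-overrides", f"command={container_command}",
    ]


# Hypothetical values, for illustration only. Nothing is submitted here.
argv = build_submit_job_command(
    job_name="nf-bakta-example",
    seedfile="s3://example-bucket/seedfiles/example.seedfile.csv",
    project="00_Test",
    prefix="20241010-pfam",
)
print(shlex.join(argv))
```

The function only builds and prints the command; passing the list to `subprocess.run(argv, check=True)` would submit the job.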
nextflow run main.nf \
--seedfile test/test_20221220_0.seedfile.csv \
--project 00_Test \
--prefix 20241010-pfam
aws batch submit-job \
--job-name nf-bakta-pfam-test-1 \
--job-queue priority-maf-pipelines \
--job-definition nextflow-production \
--container-overrides command=FischbachLab/nf-bakta,\
"--seedfile","s3://genomics-workflow-core/Results/Bakta/00_Test/seedfiles/test_20221220_0.seedfile.csv",\
"--project","00_Test",\
"--prefix","20241010-pfam"
v4.0 = ???
v5.0, type=full, 2023-02-20, DOI: 10.5281/zenodo.7669534
v5.1, type=full, 2024-01-19, DOI: 10.5281/zenodo.10522951
The database for this pipeline is stored on our EFS at /mnt/efs/databases/Bakta/db/v5.0. This path is provided as the bakta_db parameter. Note that this path should not be staged within the pipeline; it is only passed as a value, because every container already has that path mounted and can read from it directly.
This was needed when Bakta moved from db schema v4.0 to v5.0.
cd /mnt/efs/databases/Bakta/db
mkdir v5
docker container run \
--rm \
-u $(id -u):$(id -g) \
-v /mnt/efs/databases/Bakta/db/v5:/db \
458432034220.dkr.ecr.us-west-2.amazonaws.com/bakta:1.9.3 \
bakta_db download --output /db --type full
mkdir -p /mnt/efs/databases/Bakta/db/tmp
cd /mnt/efs/databases/Bakta/db
docker container run \
-it \
--rm \
-v /mnt/efs/databases/Bakta/db/v5:/db \
-v /mnt/efs/databases/Bakta/db/tmp:/bakta_tmp \
458432034220.dkr.ecr.us-west-2.amazonaws.com/bakta:1.9.3 \
bakta_db update --db /db --tmp-dir /bakta_tmp
References:
- hmmscan vs hmmsearch: Use hmmsearch to annotate the proteins from the bakta pipeline.
- pyhmmer