Commit 301bcff

Merge pull request #121 from tkchafin/db_params
Db params
2 parents 8d2b83f + 7cfbcc0 commit 301bcff

42 files changed, +527 / -64 lines changed

.github/workflows/ci.yml

Lines changed: 1 addition & 11 deletions
@@ -35,19 +35,9 @@ jobs:
         with:
           version: "${{ matrix.NXF_VER }}"

-      - name: Download the NCBI taxdump database
-        run: |
-          mkdir ncbi_taxdump
-          curl -L https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -C ncbi_taxdump -xzf -
-
-      - name: Download the BUSCO lineage database
-        run: |
-          mkdir busco_database
-          curl -L https://tolit.cog.sanger.ac.uk/test-data/resources/busco/blobtoolkit.GCA_922984935.2.2023-08-03.lineages.tar.gz | tar -C busco_database -xzf -
-
       - name: Run pipeline with test data
         # You can customise CI pipeline run tests as required
         # For example: adding multiple test runs with different parameters
         # Remember that you can parallelise this by using strategy.matrix
         run: |
-          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --taxdump $PWD/ncbi_taxdump --busco $PWD/busco_database --outdir ./results
+          nextflow run ${GITHUB_WORKSPACE} -profile test,docker --outdir ./results

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -3,7 +3,7 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

-## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-10-02]
+## [[0.7.0](https://github.com/sanger-tol/blobtoolkit/releases/tag/0.7.0)] – Psyduck – [2024-11-20]

 The pipeline is now considered to be a complete and suitable replacement for the Snakemake version.

@@ -13,6 +13,7 @@ The pipeline is now considered to be a complete and suitable replacement for the
   to indicate in the samplesheet whether the reads are paired or single.
 - Updated the Blastn settings to allow 7 days runtime at most, since that
   covers 99.7% of the jobs.
+- Allow database inputs to be optionally compressed (`.tar.gz`)

 ### Software dependencies

Binary files not shown: 24 entries, file names not rendered. Reported size changes, in page order: -13.8 MB, -15.2 MB, -32 KB, -8.72 KB, -824 Bytes, -268 Bytes, -1.32 KB, -716 Bytes, -57.8 MB, -16 KB, -252 Bytes, -13.8 MB, -13.9 MB, -284 Bytes, -274 Bytes, -176 Bytes (8 entries show no size).

conf/test.config

Lines changed: 5 additions & 5 deletions
@@ -30,11 +30,11 @@ params {
     taxon = "Meles meles"

     // Databases
-    taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
-    busco = "/lustre/scratch123/tol/resources/nextflow/busco/blobtoolkit.GCA_922984935.2.2023-08-03"
-    blastp = "${projectDir}/assets/test/mMelMel3.1.buscogenes.dmnd"
-    blastx = "${projectDir}/assets/test/mMelMel3.1.buscoregions.dmnd"
-    blastn = "${projectDir}/assets/test/nt_mMelMel3.1"
+    taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
+    busco = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
+    blastp = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
+    blastx = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
+    blastn = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

     // Need to be set to avoid overfilling /tmp
     use_work_dir_as_temp = true
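
The test profile now points every database parameter at a remote `.tar.gz` and leaves extraction to the pipeline (this commit also installs the nf-core `untar` module). The underlying decision can be sketched in shell. This is only an illustration: `resolve_db` is a hypothetical helper, not part of the pipeline, and it assumes the archive has already been downloaded to a local path. An input ending in `.tar.gz` is extracted into a scratch directory; anything else is used as-is.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper, for illustration only (not part of the pipeline).
# Prints the path to use for a database input:
#   - for a *.tar.gz archive: extracts it under $2 and prints the extraction directory
#   - otherwise: prints the input path unchanged
resolve_db() {
    local input="$1" workdir="$2"
    case "$input" in
        *.tar.gz)
            local name
            name="$(basename "$input" .tar.gz)"
            mkdir -p "$workdir/$name"
            tar -xzf "$input" -C "$workdir/$name"
            printf '%s\n' "$workdir/$name"
            ;;
        *)
            printf '%s\n' "$input"
            ;;
    esac
}
```

Called as `resolve_db /path/to/databases/taxdump_2024_10.tar.gz "$SCRATCH"` (with `SCRATCH` being any writable directory), the helper prints the directory holding the extracted files; called with a plain directory, it echoes the path back unchanged.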

conf/test_full.config

Lines changed: 4 additions & 4 deletions
@@ -25,11 +25,11 @@ params {
     taxon = "Laetiporus sulphureus"

     // Databases
-    taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
+    taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
     busco = "/lustre/scratch123/tol/resources/busco/latest"
-    blastp = "${projectDir}/assets/test_full/gfLaeSulp1.1.buscogenes.dmnd"
-    blastx = "${projectDir}/assets/test_full/gfLaeSulp1.1.buscoregions.dmnd"
-    blastn = "${projectDir}/assets/test_full/nt_gfLaeSulp1.1"
+    blastp = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/gfLaeSulp1.1.buscogenes.dmnd.tar.gz"
+    blastx = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/gfLaeSulp1.1.buscoregions.dmnd.tar.gz"
+    blastn = "https://tolit.cog.sanger.ac.uk/test-data/Laetiporus_sulphureus/resources/nt_gfLaeSulp1.1.tar.gz"

     // Need to be set to avoid overfilling /tmp
     use_work_dir_as_temp = true

conf/test_raw.config

Lines changed: 5 additions & 5 deletions
@@ -31,11 +31,11 @@ params {
     taxon = "Meles meles"

     // Databases
-    taxdump = "/lustre/scratch123/tol/resources/taxonomy/latest/new_taxdump"
-    busco = "/lustre/scratch123/tol/resources/nextflow/busco/blobtoolkit.GCA_922984935.2.2023-08-03"
-    blastp = "${projectDir}/assets/test/mMelMel3.1.buscogenes.dmnd"
-    blastx = "${projectDir}/assets/test/mMelMel3.1.buscoregions.dmnd"
-    blastn = "${projectDir}/assets/test/nt_mMelMel3.1/"
+    taxdump = "https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz"
+    busco = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/blobtoolkit.GCA_922984935.2.2023-08-03.tar.gz"
+    blastp = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscogenes.dmnd.tar.gz"
+    blastx = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/mMelMel3.1.buscoregions.dmnd.tar.gz"
+    blastn = "https://tolit.cog.sanger.ac.uk/test-data/Meles_meles/resources/nt_mMelMel3.1.tar.gz"

     // Need to be set to avoid overfilling /tmp
     use_work_dir_as_temp = true

docs/usage.md

Lines changed: 27 additions & 1 deletion
@@ -78,15 +78,20 @@ The BlobToolKit pipeline can be run in many different ways. The default way requ

 It is a good idea to put a date suffix for each database location so you know at a glance whether you are using the latest version. We are using the `YYYY_MM` format as we do not expect the databases to be updated more frequently than once a month. However, feel free to use `DATE=YYYY_MM_DD` or a different format if you prefer.

+Note that all input databases may be optionally passed directly to the pipeline compressed as `.tar.gz`, and the pipeline will handle decompression.
+The instructions below show how to build each input database in _two_ forms: decompressed _and_ compressed. You may not need to do both. Select the one that is most appropriate for how you want to use the pipeline.
+
 #### 1. NCBI taxdump database

 Create the database directory, retrieve and decompress the NCBI taxonomy:

 ```bash
 DATE=2024_10
 TAXDUMP=/path/to/databases/taxdump_${DATE}
+TAXDUMP_TAR=/path/to/databases/taxdump_${DATE}.tar.gz
 mkdir -p "$TAXDUMP"
-curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz | tar -xzf - -C "$TAXDUMP"
+curl -L ftp://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.gz -o $TAXDUMP_TAR
+tar -xzf $TAXDUMP_TAR -C "$TAXDUMP"
 ```

 #### 2. NCBI nucleotide BLAST database
@@ -96,6 +101,7 @@ Create the database directory and move into the directory:
 ```bash
 DATE=2024_10
 NT=/path/to/databases/nt_${DATE}
+NT_TAR=/path/to/databases/nt_${DATE}.tar.gz
 mkdir -p $NT
 cd $NT
 ```
@@ -113,6 +119,11 @@ done
 wget "https://ftp.ncbi.nlm.nih.gov/blast/db/v5/taxdb.tar.gz" &&
 tar xf taxdb.tar.gz -C $NT &&
 rm taxdb.tar.gz
+
+# Compress and cleanup
+cd ..
+tar -cvzf $NT_TAR $NT
+rm -r $NT
 ```

 #### 3. UniProt reference proteomes database
@@ -126,6 +137,7 @@ Create the database directory and move into the directory:
 ```bash
 DATE=2024_10
 UNIPROT=/path/to/databases/uniprot_${DATE}
+UNIPROT_TAR=/path/to/databases/uniprot_${DATE}.tar.gz
 mkdir -p $UNIPROT
 cd $UNIPROT
 ```
@@ -152,6 +164,12 @@ diamond makedb -p 16 --in reference_proteomes.fasta.gz --taxonmap reference_prot
 # clean up
 mv extract/{README,STATS} .
 rm -r extract
+rm -r $TAXDUMP
+
+# Compress final database and cleanup
+cd ..
+tar -cvzf $UNIPROT_TAR $UNIPROT
+rm -r $UNIPROT
 ```

 #### 4. BUSCO databases
#### 4. BUSCO databases
@@ -161,6 +179,7 @@ Create the database directory and move into the directory:
161179
```bash
162180
DATE=2024_10
163181
BUSCO=/path/to/databases/busco_${DATE}
182+
BUSCO_TAR=/path/to/databases/busco_${DATE}.tar.gz
164183
mkdir -p $BUSCO
165184
cd $BUSCO
166185
```
@@ -181,6 +200,13 @@ If you have [GNU parallel](https://www.gnu.org/software/parallel/) installed, yo
 find v5/data -name "*.tar.gz" | parallel "cd {//}; tar -xzf {/}"
 ```

+Finally re-compress and cleanup the files:
+
+```bash
+tar -cvzf $BUSCO_TAR $BUSCO
+rm -r $BUSCO
+```
+
 ## Changes from Snakemake to Nextflow

 ### Commands
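
One detail of the `tar -cvzf $NT_TAR $NT` pattern added above: because `$NT` is an absolute path, tar stores each member under that full path minus the leading `/` (for example `path/to/databases/nt_2024_10/...`). This diff does not say whether the pipeline depends on a particular archive layout, so the following is only a sketch of an alternative. It uses two hypothetical helpers, `pack_db` and `top_level_entries`, which are not part of the documented instructions. The first archives a directory relative to its parent so the tarball holds a single top-level directory; the second lists the top-level entries so the layout can be checked before the uncompressed copy is removed.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper, for illustration only.
# Archives directory $1 into tarball $2 with member names relative to the
# directory's parent, so the archive holds a single top-level entry.
pack_db() {
    local dir="$1" tarball="$2"
    tar -czf "$tarball" -C "$(dirname "$dir")" "$(basename "$dir")"
}

# Hypothetical helper: prints the distinct top-level entries of a tarball,
# as a quick layout check before deleting the uncompressed directory.
top_level_entries() {
    tar -tzf "$1" | cut -d/ -f1 | sort -u
}
```

With the variables from the instructions above, `pack_db "$NT" "$NT_TAR"` followed by `top_level_entries "$NT_TAR"` would list only the database directory name, and `rm -r "$NT"` can then be run once that check looks right.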

modules.json

Lines changed: 5 additions & 0 deletions
@@ -87,6 +87,11 @@
                         "installed_by": ["modules"],
                         "patch": "modules/nf-core/seqtk/subseq/seqtk-subseq.diff"
                     },
+                    "untar": {
+                        "branch": "master",
+                        "git_sha": "666652151335353eef2fcd58880bcef5bc2928e1",
+                        "installed_by": ["modules"]
+                    },
                     "windowmasker/mkcounts": {
                         "branch": "master",
                         "git_sha": "32cac29d4a92220965dace68a1fb0bb2e3547cac",

modules/local/generate_config.nf

Lines changed: 8 additions & 10 deletions
@@ -10,13 +10,11 @@ process GENERATE_CONFIG {
     val taxon_query
     val busco_lin
     path lineage_tax_ids
-    tuple val(meta2), path(blastn)
     val reads
-    // The following are passed as "val" because we just want to know the full paths. No staging necessary
-    val blastp_path
-    val blastx_path
-    val blastn_path
-    val taxdump_path
+    tuple val(meta2), path(blastp)
+    tuple val(meta3), path(blastx)
+    tuple val(meta4), path(blastn)
+    tuple val(meta5), path(taxdump)

     output:
     tuple val(meta), path("*.yaml") , emit: yaml
@@ -43,10 +41,10 @@ process GENERATE_CONFIG {
         $accession_params \\
         --nt $blastn \\
         $input_reads \\
-        --blastp ${blastp_path} \\
-        --blastx ${blastx_path} \\
-        --blastn ${blastn_path} \\
-        --taxdump ${taxdump_path} \\
+        --blastp ${blastp} \\
+        --blastx ${blastx} \\
+        --blastn ${blastn} \\
+        --taxdump ${taxdump} \\
         --output_prefix ${prefix}

     cat <<-END_VERSIONS > versions.yml

modules/nf-core/untar/environment.yml

Lines changed: 7 additions & 0 deletions
Generated file; diff not rendered by default.

modules/nf-core/untar/main.nf

Lines changed: 84 additions & 0 deletions
Generated file; diff not rendered by default.

modules/nf-core/untar/meta.yml

Lines changed: 49 additions & 0 deletions
Generated file; diff not rendered by default.
