Commit

Merge pull request #87 from SPAAM-community/2024-pipelineupdate

More tweaks fixing aMeta commands

jfy133 authored Jul 19, 2024
2 parents e4a8497 + a127ad3 commit fd2ff09
Showing 3 changed files with 89 additions and 42 deletions.
125 changes: 85 additions & 40 deletions ancient-metagenomic-pipelines.qmd
## What is aMeta?
::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, we will need to create the special aMeta [conda environment](before-you-start.qmd#creating-a-conda-environment) and activate the environment.
:::

While nf-core/eager is a solid pipeline for microbial genomics, and can also perform metagenomic screening via the integrated HOPS pipeline [@Hubler2019-qw] or `Kraken2` [@Wood2019-mf], in some cases we may wish to have a more accurate and resource-efficient pipeline. In this section, we will demonstrate an example of using aMeta, a `Snakemake` workflow proposed by @Pochon2022-hj that aims to minimise resource usage by combining low-resource k-mer-based taxonomic profiling with accurate read alignment ([@fig-ancientmetagenomicpipelines-ametadiagram]).
In this tutorial we will try running the small test data that comes with aMeta.

aMeta has been written in `Snakemake`, which means the pipeline has to be installed in a slightly different manner to the `nextflow pull` command that can be used for nf-core/eager.

Make sure you have followed the instructions in the [Before You Start Chapter](/before-you-start.qmd#ancient-metagenomic-pipelines) for cloning the aMeta GitHub repository to the `ancient-metagenomic-pipelines/` directory. Once this is done, make sure we are in the aMeta directory, if not there already.

```bash
cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta
```

And activate the dedicated aMeta conda environment.

```bash
conda activate aMeta
```
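Before launching anything heavy, it can save time to confirm the right environment is actually active. A minimal sketch: the `CONDA_DEFAULT_ENV` variable is set by `conda activate`, but here we assign it by hand purely for illustration.

```shell
## CONDA_DEFAULT_ENV is normally set by `conda activate`; faked here for illustration
CONDA_DEFAULT_ENV=aMeta

if [ "$CONDA_DEFAULT_ENV" = "aMeta" ]; then
    echo "aMeta environment active"   # prints this when the right env is active
else
    echo "wrong environment ('$CONDA_DEFAULT_ENV') - run 'conda activate aMeta' first"
fi
```

In a real session you would of course skip the assignment and just inspect the variable (or the `(aMeta)` prefix in your prompt).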

As aMeta also includes tools that normally require very large computational resources that cannot fit on a standard laptop, we will instead try to re-use the internal, very small 'fake' data that the aMeta developers use to test the pipeline.

:::{.callout-warning}
The following steps have already been performed for students of the summer schools, as the `set up conda envs` step in particular will take a very long time!

If you are doing this chapter self-guided, it is critical to perform the following set up steps!

We don't have to worry about trying to understand exactly what the following commands are doing; they will not be important for the rest of the chapter.
In general, the commands pull all the relevant software (via conda), make a fake database, download other required files, and then reconstruct the basic directory and file structure required for running the pipeline.
:::
:::{.callout-note title="Self-guided: aMeta set up and configuration" collapse=true}
```bash
## Change into .test/ to set up all the required test resources (databases etc.)
cd .test/
## Set up conda envs
## If we get an error about a 'non-default solver backend', run `conda config --set solver classic` and re-start the command
source $(dirname $(dirname $CONDA_EXE))/etc/profile.d/conda.sh
## Build dummy KrakenUniq database
env=$(grep krakenuniq .snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g")
conda activate $env
krakenuniq-build --db resources/KrakenUniq_DB --kmer-len 21 --minimizer-len 11 --jellyfish-bin $(pwd)/$env/bin/jellyfish
conda deactivate
## Get Krona taxonomy tax dump
conda deactivate
touch .initdb
## Run a quick test and generate the report (you can open this to check it looks like everything was generated)
snakemake --use-conda --show-failed-logs --conda-cleanup-pkgs cache -s ../workflow/Snakefile $@ --conda-frontend conda -j 4
snakemake -s ../workflow/Snakefile --report --report-stylesheet ../workflow/report/custom.css --conda-frontend conda
## Now we move back into the main repository where we can symlink all the database files back to try running our 'own' test
cd ../
cd resources/
ln -s ../.test/resources/* .
mv config.yaml config.yaml.bkp
mv samples.tsv samples.tsv.bkp
cd ../
ln -s .test/data/ .
ln -s .test/.snakemake/ . ## so we can re-use conda environments from the `.test` directory for the summer school run
## Again get the taxonomy tax dump for Krona, but this time for a real run
## Make sure you're now in the root directory of the repository!
env=$(grep krona .test/.snakemake/conda/*yaml | awk '{print $1}' | sed -e "s/.yaml://g" | head -1)
conda activate $env
cd $env/opt/krona
./updateTaxonomy.sh taxonomy
cd -
conda deactivate

## And back to the root of the repo for practising aMeta properly!
cd ../
```
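A quick way to confirm the symlinking above worked is to check that a file reachable through the link actually resolves — a broken symlink is a common cause of confusing "file not found" errors later. A hedged sketch (the paths and file names below are illustrative stand-ins, not the real test data):

```shell
## Stand-ins for the real .test/ resources, created purely for illustration
mkdir -p .test/data
touch .test/data/foo.fq.gz

## Recreate the symlink as done above, then check a file resolves through it
ln -sfn .test/data data
[ -e data/foo.fq.gz ] && echo "symlinks resolve"   # prints "symlinks resolve"
```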
OK, now that aMeta is all set up, we can simulate running a 'real' pipeline job!
:::
### aMeta configuration
In a text editor (e.g. `nano`), write the following sample names and paths in TSV format.
```bash
sample fastq
bar data/bar.fq.gz
foo data/foo.fq.gz
```
:::{.callout-warning}
Make sure when copy-pasting into our text editor that tabs are not replaced with spaces!
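One way to verify this is to inspect the file's invisible characters. The sketch below writes a toy samplesheet so it is self-contained (in a real run, check the file you actually created): `cat -A` renders tabs as `^I`, and the `awk` line fails if any row does not have exactly two tab-separated fields.

```shell
## Toy samplesheet for illustration; in a real run, check your own file instead
printf 'sample\tfastq\nfoo\tdata/foo.fq.gz\nbar\tdata/bar.fq.gz\n' > samples.tsv

## Tabs show up as ^I, so accidental spaces are easy to spot
cat -A samples.tsv

## Every line must have exactly two tab-separated fields
awk -F'\t' 'NF != 2 {print "line " NR " is not tab-separated"; exit 1}' samples.tsv \
  && echo "samplesheet OK"   # prints "samplesheet OK"
```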
Then we need to write a config file.
This tells aMeta where to find things such as database files and other settings.
This includes specifying the location of the main samplesheet, which points to a TSV file containing all the FASTQs of the samples we want to analyse, as well as paths to all the required database files and reference genomes you may need.
These paths and settings go inside a `config.yaml` file that must be placed inside `aMeta/config/`.
Make the configuration file with your text editor of choice (e.g. `nano`); a minimal example can look like this.
```yaml
samplesheet: "config/samples.tsv"
ncbi_db: resources/ncbi
n_unique_kmers: 1000
n_tax_reads: 200
```
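Typos in these paths are a common source of cryptic Snakemake errors, so it can be worth sanity-checking that every path-like value in the config exists before launching. A rough sketch (the toy `config.yaml` and the line-by-line parsing are illustrative only; real YAML can be more complex than this loop handles):

```shell
## Toy config for illustration; in a real run, read config/config.yaml instead
printf 'samplesheet: "config/samples.tsv"\nncbi_db: resources/ncbi\n' > config.yaml
mkdir -p resources/ncbi   # pretend one of the resources is already in place

## Flag any path-like value (containing a slash) that does not exist on disk
while IFS=': ' read -r key value; do
    value=${value//\"/}   # strip surrounding quotes
    case $value in
        */*) [ -e "$value" ] || echo "missing: $value ($key)" ;;
    esac
done < config.yaml
```

Here only the missing paths are printed, so silence means everything referenced by the config is present.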
And make a two-column samplesheet file with the following content in a file called `samples.tsv`, also under `config/`.
```tsv
sample fastq
foo data/foo.fq.gz
bar data/bar.fq.gz
```
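For more than a handful of samples, typing the samplesheet by hand invites typos; it can instead be generated from the FASTQ file names. A sketch, assuming files named `<sample>.fq.gz` under `data/` (the stand-in FASTQs below are created only so the snippet is self-contained):

```shell
mkdir -p data config
touch data/bar.fq.gz data/foo.fq.gz   # stand-ins for real FASTQs

## Header first, then one tab-separated row per FASTQ
printf 'sample\tfastq\n' > config/samples.tsv
for fq in data/*.fq.gz; do
    printf '%s\t%s\n' "$(basename "$fq" .fq.gz)" "$fq" >> config/samples.tsv
done

cat config/samples.tsv
```

With the stand-in files this prints the header plus a `bar` and a `foo` row; adjust the glob and the `basename` suffix to match your own naming scheme.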
:::{.callout-warning}
aMeta (v1.0.0) currently supports single-end or pre-merged data only!
:::
Once this config file is generated, we can start the run.
:::{.callout-note}
As this is only a dummy run (due to the large-ish computational resources required for KrakenUniq), we re-use some of the resource files here.
While this will produce nonsense output, it is used here to demonstrate how we would execute the pipeline.
:::
### Prepare and run aMeta
Make sure we're still in the `aMeta` conda environment, and that we are still in the main aMeta directory, with the following.
```bash
conda activate aMeta
cd /<path>/<to>/ancient-metagenomic-pipelines/aMeta/
```
Finally, we are ready to run aMeta, which will automatically pick up the config and samplesheet files we placed in `config/`!
```bash
#| eval: false
Complete log: .snakemake/log/2023-10-05T155051.524987.snakemake.log
```
All output files of the workflow are located in `aMeta/results` directory.
To get a quick overview of ancient microbes present in our samples we should check a heatmap in `results/overview_heatmap_scores.pdf`.
:::{.callout-warning}
If running during the summer school, you can use the following command to open the PDF file from the command line.
```bash
evince results/overview_heatmap_scores.pdf
```
:::
![Example microbiome profiling summary heatmap from aMeta.
The columns represent different samples, and the rows of different species.
The cells of the heatmap are coloured from blue, to yellow, to red, representing aMeta authentication scores from 0 to 10, with the higher the number the more confident of the hit being both the correct taxonomic assignment and that it is ancient.
From left to right and top to bottom, the panels consist of:
9. A general statistics table including the name of the taxonomic node, number of reads, duplicates, and mean read length etc.
](assets/images/chapters/ancient-metagenomic-pipelines/aMeta_output.png){#fig-ancientmetagenomicpipelines-persampleplot}
::: {.callout-tip title="Question" appearance="simple"}
In our test data, what score does the sample 'foo' get for the hit against _Yersinia pestis_?
Is this a good score?
Inspect the corresponding `results/AUTHENTICATION/xxx/authentic_Sample_foo_*.pdf` file.
What could have contributed to this particular score?
Hint: check Supplementary File 2, section S5 of [@Pochon2022-hj].
:::
::: {.callout-note collapse="true" title="Answer"}
The sample foo gets a score of `4`.
This is a low score, and indicates that aMeta is not very confident that this is a true hit.
The metrics that contribute to this score are:
- Edit distance all reads (+1)
- Deamination plot (+2)
- Reads mapped with identity (+1)
:::
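In other words, the overall authentication score is reached by summing the points each metric contributes (the numbers below are the toy values from this answer; the full scoring scheme is described in Supplementary File 2 of @Pochon2022-hj):

```shell
## Per-metric contributions for sample 'foo' (values from the answer above)
edit_distance=1   # edit distance of all reads
deamination=2     # deamination plot
identity=1        # reads mapped with identity

echo $(( edit_distance + deamination + identity ))   # prints 4
```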
### Clean up
Before continuing on to the next section of this chapter, we will need to remove the output files and deactivate the conda environment.
```bash
rm -r results/ log/
## You can also optionally remove the conda environments if we are running out of space
# rm -r .snakemake/ .test/.snakemake
conda deactivate
```
2 changes: 2 additions & 0 deletions before-you-start.qmd
For some chapters you may need the following software and/or data manually installed:

```bash
cd /<path>/<to>/ancient-metagenomic-pipelines/
git clone https://github.com/NBISweden/aMeta
cd aMeta
## We have to patch the environment to use an old version of Snakemake as aMeta is not compatible with the latest version
sed -i 's/snakemake-minimal>=5.18/snakemake <=6.3.0/' workflow/envs/environment.yaml
conda env create -f workflow/envs/environment.yaml
```
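If the environment build later fails with a Snakemake version conflict, it is worth confirming the `sed` patch actually took. A sketch (a toy `environment.yaml` is written here so the snippet is self-contained; in the real repository the file already exists at `workflow/envs/environment.yaml`):

```shell
## Toy environment.yaml for illustration; the real one ships with aMeta
mkdir -p workflow/envs
printf 'dependencies:\n  - snakemake-minimal>=5.18\n' > workflow/envs/environment.yaml

## Same patch as above, then confirm the version pin is in place
sed -i 's/snakemake-minimal>=5.18/snakemake <=6.3.0/' workflow/envs/environment.yaml
grep 'snakemake' workflow/envs/environment.yaml   # prints "  - snakemake <=6.3.0"
```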
4 changes: 2 additions & 2 deletions git-github.qmd
Once the edit window is opened, add your name and GitHub user name to the list (

![Screenshot of GitHub file edit window, with a name added to a bullet point list at the bottom.](assets/images/chapters/git-github/github-fork-addname.png){#fig-gitgithub-fork-addname}

Make our commit to record the change to Git history (@fig-accessingdata-firstpagefig-gitgithub-fork-commitedit) and double check we've made the change (@fig-gitgithub-fork-confirmedit).

![A commit message being written describing the addition of a new name in the GitHub commit interface.](assets/images/chapters/git-github/github-fork-commitedit.png){#fig-accessingdata-firstpagefig-gitgithub-fork-commitedit}

![The rendered README with the newly added name at the bottom of the list.](assets/images/chapters/git-github/github-fork-confirmedit.png){#fig-gitgithub-fork-confirmedit}
