accessing-ancient-metagenomic-data.qmd

---
title: Accessing Ancient Metagenomic Data
author: James A. Fellows Yates
number-sections: true
crossref:
  chapters: true
bibliography: assets/references/accessing-ancient-metagenomic-data.bib
---

::: {.callout-note collapse="true" title="Self guided: chapter environment setup"}
For this chapter's exercises, if not already performed, you will need to download the chapter's dataset, decompress the archive, and create and activate the conda environment.

Do this, use `wget` or right click and save to download this Zenodo archive: [10.5281/zenodo.13759163](https://doi.org/10.5281/zenodo.13759163), and unpack

```bash
tar xvf accessing-ancient-metagenomic-data.tar.gz 
cd accessing-ancient-metagenomic-data/
```

You can then create the subsequently activate environment with

```bash
conda env create -f accessing-ancient-metagenomic-data.yml
conda activate accessing-ancient-metagenomic-data
```
:::


## Introduction

In most bioinformatic projects, we need to include publicly available comparative data to expand or compare our newly generated data with.

Including public data can benefit ancient metagenomic studies in a variety of ways.
It can help increase our sample sizes (a common problem when dealing with rare archaeological samples) - thus providing stronger statistical power. 
Comparison with a range of previously published data of different preservational levels can allow an estimate on the quality of the new samples. 
When considering solely (re)using public data, we can consider that this can also spawn new ideas, projects, and meta analyses to allow further deeper exploration of ancient metagenomic data (e.g., looking for correlations between various environmental factors and preservation).

Fortunately for us, geneticists and particularly palaeogenomicists have been very good at uploading raw sequencing data to well-established databases [@Anagnostou2015-mz].

In the vast majority of cases you will be able to find publically available sequencing data on the INSDC ([https://www.insdc.org/](https://www.insdc.org/)) association of databases, namely the EBI's European Nucleotide Archive (ENA; [https://www.ebi.ac.uk/ena/](https://www.ebi.ac.uk/ena/)), and NCBI ([https://www.ncbi.nlm.nih.gov/sra](https://www.ncbi.nlm.nih.gov/sra)) or DDBJ's ([https://www.ddbj.nig.ac.jp/dra/index-e.html](https://www.ddbj.nig.ac.jp/dra/index-e.html)) Sequence Read Archives (SRA).
However, you may in some cases find ancient metagenomic data on institutional FTP servers, domain specific databases (e.g. OAGR ([https://oagr.org](https://oagr.org)), Zenodo ([https://zenodo.org](https://zenodo.org)), Figshare ([https://figshare.com](https://figshare.com)), or GitHub ([https://github.com](https://github.com))).

But while the data is publicly available, we need to ask whether it is 'FAIR'.

## Finding Ancient Metagenomic Data

FAIR principles [@Wilkinson2016-ou] were defined by researchers, librarians, and industry in 2016 to improve the quality of data uploads - primarily by making data uploads more 'machine readable'.
FAIR standards for:

- Findable
- Accessible
- Interoperable
- Reusable

When we consider ancient (meta)genomic data, we are pretty close to this.
Sequencing data is in most cases accessible (via the public databases like ENA, SRA), interoperable and reproducible because we use field standard formats such as FASTQ or BAM files.
However _findable_ remains an issue.

This is because the _metadata_ about each data file is dispersed over many places, and very often not with the data files themselves.

In this case I am referring to metadata such as: What is the sample's name? How old is it? Where is it from? Which enzymes were used for library construction? What sequencing machine was this library sequenced on?

To find this information about a given data file, you have to search many places (main text, supplementary information, the database itself), for different types of metadata (as authors report different things), and also in different formats (text, tables, figures).

This very heterogenous landscape makes it difficult for machines to index all this information (if at all), and thus means you cannot search for the data you want to use for your own research in online search engines.

## AncientMetagenomeDir

This is where the SPAAM community project 'AncientMetagenomeDir' ([https://github.com/spaam-community/AncientMetagenomeDir](https://github.com/spaam-community/AncientMetagenomeDir)) comes in [@Fellows_Yates2021-rp].
AncientMetagenomeDir is a resource of lists of metadata of all publishing and publicly available ancient metagenomes and microbial genome-level enriched samples and their associated libraries.

By aggregating and standardising metadata and accession codes of ancient metagenomic samples and libraries, the project aims to make it easier for people to find comparative data for their own projects, appropriately re-analyse libraries, as well as help track the field over time and facilitate meta analyses.

Currently the project is split over three main tables: host-associated metagenomes (e.g. ancient microbiomes), host-associated single-genomes (e.g. ancient pathogens), and environmental metagenomes (e.g. lakebed cores or cave sediment sequences).

The repository already contains more than 2000 samples and 5000 libraries, spanning the entire globe and as far back as hundreds of thousands of years.

To make the lists of samples and their metadata as accessible and interoperable as possible, we utilise simple text (TSV - tab separate value) files - files that can be opened by pretty much all spreadsheet tools (e.g., Microsoft Office excel, LibreOffice Calc) and languages (R, Python etc.) (@fig-accessingdata-exampledirtable).

![Example few columns and rows of an AncientMetagenomeDir table, including project name, publication year, site name, latitude, longitude, country and sample name.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-exampledirtable.png){#fig-accessingdata-exampledirtable}

Critically, by standardising the recorded all metadata across all publications this makes it much easier for researchers to filter for particular time periods, geographical regions, or sample types of their interest - and then use the also recorded accession numbers to efficiently download the data.

At their core all different AncientMetagenomeDir tables must have at 6 minimum metadata sets at the sample level:

- Publication information (doi)
- Sample name(s)
- Geographic location (e.g. country, coordinates)
- Age
- Sample type (e.g. bone, sediment, etc.)
- Data Archive and accessions

Each table then has additional columns depending on the context (e.g. what time of microbiome is expected for host-associated metagenomes, or species name of the genome that was reconstructed).

The AncientMetagenomeDir project already has 12 major releases, and will continued to be regularly updated as the community continues to submit new metadata of samples of new publications as they come out.

::: {.callout-tip title="Question" appearance="simple"}
What is the naming scheme of the AncientMetagenomeDir releases? Try to find the 'release' and 'wiki' sections of the GitHub repository interface and see if you can find the information...
:::

::: {.callout-note collapse="true" title="Answer"}
The 'release' listing of a GitHub repository can be found on the right hand metadata bar.

In many cases you can also find a 'CHANGELOG' file that will list all the changes that have been made on each release.

The release name scheme is of different places listed on the Unesco World Heritage list ([https://whc.unesco.org/en/list/](https://whc.unesco.org/en/list/)). More background you can find here on the Wiki page ([https://github.com/SPAAM-community/AncientMetagenomeDir/wiki/Release-Name-List](https://github.com/SPAAM-community/AncientMetagenomeDir/wiki/Release-Name-List))
:::

## AMDirT

But how does one explore such a large dataset of tables with thousands of rows? You could upload this into a spreadsheet tool or in a programming language like R, but you would still have to do a lot of manual filtering and parsing of the dataset to make it useful for downstream analyses.

In response to this the SPAAM Community have also developed a companion tool 'AMDirT' to facilitate this [@Borry2024-dz].
Amongst other functionality, AMDirT allows you to load different releases of AncientMetagenomeDir, filter and explore to specific samples or libraries of interest, and then generate download scripts, configuration files for pipelines, and reference BibTeX files for you, both via a command-line (CLI) or graphical user interface (GUI)!

### Running AMDirT viewer

We will now demonstrate how to use the AMDirT graphical user interface to load a dataset, filter to samples of interest, and download some configuration input files for downstream ancient DNA pipelines.

![AMDirT logo: a cog with the SPAAM icon in the middle with the word AMDirT to the side](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-amdirtlogo.png){#fig-accessingdata-amdirtlogo fig-align="center"}

::: {.callout-warning}
This tutorial will require a web-browser! Make sure to run on your local laptop/PC or, if on on a server, with X11 forwarding activated.
:::

First, we will need to activate a conda environment, and then install the latest development version of the tool for you.

While in the `accessing-ancient-metagenomic-data` conda environment, run the following command to load the GUI into your web-browser. If the browser doesn't automatically load, copy the IP address and paste it in your browser's URL bar.

```bash
AMDirT viewer
```

:::{.callout-note}
The first time opening AMDirT (a streamlit app), it may ask you to sign up for a newsletter using your email.

Do not type anything when prompted (i.e., just press <kbd>enter</kbd>), this is entirely optional and will not affect the usage of the tool.

You will not be asked again when in the same conda environment.
:::

Your web browser should now load, and you should see a two panel page.

Under _Select a table_ use the dropdown menu to select 'ancientsinglegenome-hostassociated'.

You should then see a table (@fig-accessingdata-firstpage), pretty similar what you are familiar with with spreadsheet tools such as Microsoft Excel or LibreOffice calc.

![Main AMDirT Viewer page on load. A toolbar on the left displays the AMDirT version, and dropdown menus for the AncientMetagenomeDir release, table, and downloading tool to use. The rest of the page shows a tabular window containing rows and columns corresponding to sample metadata and rows of samples with metadata such as project name, publication year and site name.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-firstpage.png){#fig-accessingdata-firstpage fig-align="center"}

To navigate, you can scroll down to see more rows, and press <kbd>shift</kbd> and scroll to see more columns, or use click on a cell and use your arrow keys (<kbd>⬆</kbd>,<kbd>⬇</kbd>,<kbd>⬅</kbd>,<kbd>➡</kbd>) to move around the table.

You can reorder columns by clicking on the column name, and also filter by pressing the little 'burger' icon that appears on the column header when you hover over a given column.

As an exercise, we will try filtering to a particular set of samples, then generate some download scripts, and download the files.

First, filter the _project_name_ column to 'Muhlemann2020' from [@M_hlemann_2020, @fig-accessingdata-projectfilter].

![The AMDirT viewer with a filter menu coming out of the project name column, with an example search of the project 'Muhlemann2020' in the search bar.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-projectfilter.png){#fig-accessingdata-projectfilter fig-align="center"}

Then scroll to the right, and filter the _geo\_loc\_name_ to 'Norway'  (@fig-accessingdata-geofilter).

![The AMDirT viewer with a filter menu coming out of the geo location column, with an example search of the country 'Norway' written in the search bar and the entry ticked in the results.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-geofilter.png){#fig-accessingdata-geofilter fig-align="center"}

You should be left with 2 rows.

Finally, scroll back to the first column and tick the boxes of these two samples (@fig-accessingdata-tickbox).

![The AMDirT viewer with just two rows with samples from Muhlemann2020 that are located in the Norway being displayed, and the tickboxes next to each one selected](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-tickbox.png){#fig-accessingdata-tickbox fig-align="center"}

Once you've selected the samples you want, you can press _Validate selection_ below the table box.
You should then see a series loading-spinner, and then a new table will appear below (@fig-accessingdata-librarytable).

![The AMDirT viewer with the library table loaded, with a range of libraries listed corresponding to the samples selected in the previous table with their AncientMetagenomeDir metadata. The tickboxes are unticked, and a has a red 'validate library selection' button below it.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-librarytable.png){#fig-accessingdata-librarytable fig-align="center"}

This table contains all the AncientMetagenomeDir recorded library metadata for the samples you selected.

You can filter this table in exactly the same way as you did for the samples table, and reorder columns in the same way.
For example, let's select only libraries that underwent a genome _capture_ (@fig-accessingdata-libraryfilter).

![The AMDirT viewer with the library table loaded, and the library_strategy column filter open and only 'Targeted-Capture' selected in the filter.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-libraryfilter.png){#fig-accessingdata-libraryfilter fig-align="center"}

Again, you can select the libraries of interested after the filtering (@fig-accessingdata-libraryselection).

![The AMDirT viewer with the library table loaded, and the four 'Targeted-Capture' libraries row's tickboxes selected.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-libraryselection.png){#fig-accessingdata-libraryselection fig-align="center"}

Once selected, you can press _Validate library selection_.
A lot of buttons should then appear  below the library table (@fig-accessingdata-validated)!

![The AMDirT viewer after pressing the 'validate selection' button. A range of buttons are displayed at the bottom including a warning message, with the buttons offering download of a range of files such as download scripts, AncientMetagenomeDir library tables and input configuration sheets for a range of ancient DNA bioinformatics pipelines. Hovering over the 'sample download script' button produces a tooltip with an estimate of the expected download size of all selected FASTQ files](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-validated.png){#fig-accessingdata-validated fig-align="center"}

You should have four categories of buttons:

- Download AncientMetagenomeDir Library Table
- Download Curl sample download script
- Download _\<tool/pipeline name\>_ input TSV
- Download Citations as BibText

The purposes of the buttons are as follows:

- The first button is to download a table containing all the AncientMetagenomeDir metadata of the selected samples.
- The second is for generating a download script that will allow you to immediately download all sequencing data of the samples you selected.
- The third set of buttons generate (partially!) pre-configured input files for use in dedicated ancient DNA pipeline such as nf-core/eager [@Fellows_Yates2021-jl], PALAEOMIX [@Schubert2014-ps], and/or and aMeta [@Pochon2022-hj].
- Finally, the fourth button generates a text file with (in most cases) all the citations of the data you downloaded, in a format accepted by most reference/citation managers.

It's important to note you are not necessarily restricted to `Curl` ([https://curl.se/](https://curl.se/)) for downloading the data.
AMDirT aims to add support for whatever tools or pipelines requested by the community.
For example, an already supported downloading tool alternative is the nf-core/fetchNGS ([https://nf-co.re/fetchngs](https://nf-co.re/fetchngs)) pipeline.
You can select these using the drop-down menus on the left hand-side.

For the next step of this tutorial, we will press the following buttons:

- Download AncientMetagenomeDir Library Table
- Download Curl sample download script
- Download nf-core/eager input TSV
- Download Citations as BibText

Your browser should then download two `.tsv` files, one `.sh` file and one `.bib` file into it's default location.

Once everything is downloaded, you can close the tab of the web browser, and in the terminal you can press <kbd>ctrl</kbd> + <kbd>c</kbd> to shutdown the tool.


:::{.callout-note}
Do not worry if you get a 'Citation information could not be resolved' warning! This is occasionally expected for some publications due to a CrossRef metadata information problem when requesting reference information about a DOI.
:::

### Inspecting AMDirT viewer Output

Lets look at the files that AMDirT has generated for you.

First we should `mv` our files from the directory that your web browser downloaded the files into into somewhere safer.

To start, we make sure we're still in the chapter's data directory (`/<path>/<to>/accessing-ancient-metagenomic-data/`), and move move the files into it.

```bash
mv ~/Downloads/AncientMetagenomeDir_* .
```

:::{.callout-warning}
The default download directory on most Unix desktops is normally something like `~/Downloads/`, however may vary on your machine!
:::

Then we can look inside the chapter's directory.
We should see at least the following four files.

```bash
ls
```
```{verbatim}
AncientMetagenomeDir_bibliography.bib
AncientMetagenomeDir_curl_download_script.sh
AncientMetagenomeDir_filtered_samples.tsv
AncientMetagenomeDir_nf_core_eager_input_table.tsv
```

We can simply run `cat` of the four files downloaded by AMDirT to look inside (the files starting with `AncientMetagenomeDir_`).
If you run `cat` on the curl download script, you should see a series of `curl` commands with the correct ENA links for you for each of the samples you wish to download.

```bash
cat AncientMetagenomeDir_curl_download_script.sh
```

```{verbatim}
#!/usr/bin/env bash
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/000/ERR4093860/ERR4093860_1.fastq.gz -o ERR4093860_1.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/000/ERR4093860/ERR4093860_2.fastq.gz -o ERR4093860_2.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/006/ERR4093846/ERR4093846.fastq.gz -o ERR4093846.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/005/ERR4093845/ERR4093845.fastq.gz -o ERR4093845.fastq.gz
curl -L ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR409/008/ERR4093838/ERR4093838.fastq.gz -o ERR4093838.fastq.gz
```

By providing this script for you, AMDirT facilitates fast download of files of interest by replacing the one-by-one download commands for each sample with a _single_ command!

```bash
bash AncientMetagenomeDir_curl_download_script.sh
```
```{verbatim}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (28) Timeout was reached
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (28) Timeout was reached
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7749k  100 7749k    0     0  12.5M      0 --:--:-- --:--:-- --:--:-- 12.6M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  836k  100  836k    0     0  1634k      0 --:--:-- --:--:-- --:--:-- 1631k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 4042k  100 4042k    0     0  7117k      0 --:--:-- --:--:-- --:--:-- 7117k

```

Running this command should result in progress logs of the downloading of the data of the four selected samples!

We can check the output by running `ls` to verify we have seven FASTQ files (five single end, and one paired end libraries).

Once the four samples are downloaded, AMDirT then facilitates fast processing of the data, as the _eager_ script can be given directly to nf-core/eager as input.
Importantly by including the library metadata (mentioned above), researchers can leverage the complex automated processing that nf-core/eager can perform when given such relevant metadata.

```bash
cat AncientMetagenomeDir_nf_core_eager_input_table.tsv
```
```{verbatim}
Sample_Name	Library_ID	Lane	Colour_Chemistry	SeqType	Organism	Strandedness	UDG_Treatment	R1	R2	BAM
VK388	ERR4093838	0	4	SE	Homo sapiens	double	none	ERR4093838.fastq.gz	NA	NA
VK515	ERR4093845	0	4	SE	Homo sapiens	double	none	ERR4093845.fastq.gz	NA	NA
VK515	ERR4093846	0	2	SE	Homo sapiens	double	none	ERR4093846.fastq.gz	NA	NA
VK515	ERR4093860	0	2	PE	Homo sapiens	double	none	ERR4093860_1.fastq.gz	ERR4093860_2.fastq.gz	NA
```

Finally, we can look into the BibTeX citations file (`*bib`) which will provide you with the citation information of all the downloaded data and AncientMetagenomeDir itself.

::: {.callout-warning}
The contents of this file is reliant on indexing of publications on CrossRef.
In some cases not all citations will be present (as per the warning), so this should be double checked!
:::

```bash
cat AncientMetagenomeDir_bibliography.bib
```
```{verbatim}
 @article{M_hlemann_2020, title={Diverse variola virus (smallpox) strains were widespread in northern Europe in the Viking Age}, volume={369}, ISSN={1095-9203}, url={http://dx.doi.org/10.1126/science.aaw8977}, DOI={10.1126/science.aaw8977}, number={6502}, journal={Science}, publisher={American Association for the Advancement of Science (AAAS)}, author={Mühlemann, Barbara and Vinner, Lasse and Margaryan, Ashot and Wilhelmson, Helene and de la Fuente Castro, Constanza and Allentoft, Morten E. and de Barros Damgaard, Peter and Hansen, Anders Johannes and Holtsmark Nielsen, Sofie and Strand, Lisa Mariann and Bill, Jan and Buzhilova, Alexandra and Pushkina, Tamara and Falys, Ceri and Khartanovich, Valeri and Moiseyev, Vyacheslav and Jørkov, Marie Louise Schjellerup and Østergaard Sørensen, Palle and Magnusson, Yvonne and Gustin, Ingrid and Schroeder, Hannes and Sutter, Gerd and Smith, Geoffrey L. and Drosten, Christian and Fouchier, Ron A. M. and Smith, Derek J. and Willerslev, Eske and Jones, Terry C. and Sikora, Martin}, year={2020}, month=jul }

 @article{Fellows_Yates_2021, title={Community-curated and standardised metadata of published ancient metagenomic samples with AncientMetagenomeDir}, volume={8}, ISSN={2052-4463}, url={http://dx.doi.org/10.1038/s41597-021-00816-y}, DOI={10.1038/s41597-021-00816-y}, number={1}, journal={Scientific Data}, publisher={Springer Science and Business Media LLC}, author={Fellows Yates, James A. and Andrades Valtueña, Aida and Vågene, Åshild J. and Cribdon, Becky and Velsko, Irina M. and Borry, Maxime and Bravo-Lopez, Miriam J. and Fernandez-Guerra, Antonio and Green, Eleanor J. and Ramachandran, Shreya L. and Heintzman, Peter D. and Spyrou, Maria A. and Hübner, Alexander and Gancz, Abigail S. and Hider, Jessica and Allshouse, Aurora F. and Zaro, Valentina and Warinner, Christina}, year={2021}, month=jan }

```

This file can be easily loaded into most reference managers and then have all the citations quickly added to your manuscripts.

:::{.callout-warning}
As in note above sometimes not all citation information can be retrieved using web APIs, so you should always validate all citations of all data you have downloaded are represented.
:::

### AMDirT download and convert

If you're less of a GUI person and consider yourself a command-line wizard, you can also use the `AMDirT download` and `AMDirT convert` commands instead of the GUI version.

Make a new directory called `cli`, and change into it.

```bash
mkdir cli && cd cli/
```

In this case you must supply your own filtered AncientMetagenomeDir samples table, and use command line options to specify which files to generate.

For this, we will download the samples table, use a bit of bash filtering, and then use the `AMDirT convert` command to generate the same downstream-ready files as we did with the GUI.

First we can download the ancientsinglegenome-hostassociated samples table with the following command.

```bash
AMDirT download -t ancientsinglegenome-hostassociated -y samples
```

This will produce a file called `ancientsinglegenome-hostassociated_samples_v24.03.0.tsv` in your current directory (by default).

```bash
ls
```
```{verbatim}
ancientsinglegenome-hostassociated_samples_v24.03.0.tsv
```

Then we can use a bit of bash to filter the table in the same way as we did in the GUI.
In this command, we tell `awk` that the column separator is a tab, print the row if either it's the first record (line of the file), or if column one matches `Muhlemann2020` and column seven matches `Norway`.

```bash
awk -F "\t" 'NR==1 || $1 == "Muhlemann2020" && $7 == "Norway"' ancientsinglegenome-hostassociated_samples_v24.03.0.tsv > ancientsinglegenome-hostassociated_samples_v24.03.0_filtered.tsv
```

Then, we can pass this filtered table to the `AMDirT convert` command to firstly retrieve the library-level metadata.

```bash
AMDirT convert --librarymetadata ancientsinglegenome-hostassociated_samples_v24.03.0_filtered.tsv ancientsinglegenome-hostassociated
```

This has downloaded a new file called `AncientMetagenomeDir_filtered_libraries.tsv`, which we can then further filter in the same away to match the desired libraries as we picked during the GUI section of the tutorial!

```bash
awk -F "\t" 'NR==1 || $14 == "Targeted-Capture"' AncientMetagenomeDir_filtered_libraries.tsv > AncientMetagenomeDir_filtered_libraries_capturedonly.tsv
```

Then with these two filtered files, one for samples,and one for libraries, we can supply them to the `convert` command to generate the same download scripts, eager input samplesheet, and citation `.bib` file as we did before!

```bash
AMDirT convert -o . --bibliography --curl --eager --libraries AncientMetagenomeDir_filtered_libraries_capturedonly.tsv ancientsinglegenome-hostassociated_samples_v24.03.0_filtered.tsv ancientsinglegenome-hostassociated
```

You should see a few messages saying `Writing <XYZ>`, and then if we run `ls`, you should see the same resulting files starting with `AncientMetagenomeDir_` as before with the GUI!

```bash
ls
```

```{verbatim}
AncientMetagenomeDir_bibliography.bib
AncientMetagenomeDir_curl_download_script.sh
AncientMetagenomeDir_filtered_libraries.tsv
AncientMetagenomeDir_filtered_libraries_capturedonly.tsv
AncientMetagenomeDir_nf_core_eager_input_table.tsv
ancientsinglegenome-hostassociated_samples_v24.03.0.tsv
ancientsinglegenome-hostassociated_samples_v24.03.0_filtered.tsv
```

Therefore the `convert` command route of AMDirT allows you to include AMDirT in more programmatic workflows to make downloading data more efficiently.

### AMDirT Practise

::: {.callout-tip title="Question" appearance="simple"}
Try to use your preferred AMDirT interface (GUI or CLI) to find the number of all single-stranded libraries of dental calculus _metagenomes_, published since 2021
:::

::: {.callout-note collapse="true" title="Answer"}
The answer is 16 libraries across 2 publications: Klapper et al. [-@Klapper2023-nv] and Fotani et al. [-@Fontani2023-io]

You can calculate this with the CLI method as follows.

```bash
## Download the ancient metagenome table
AMDirT download -t ancientmetagenome-hostassociated -y samples

## Filter to just dental calculus since 2021
awk -F "\t" 'NR==1 || $2 >= 2021 && $13 == "dental calculus"' ancientmetagenome-hostassociated_samples_v24.03.0.tsv > ancientmetagenome-hostassociated_samples_v24.03.0_filtered.tsv

## Get the library metadata of the dental calculus (might take a little bit of time)
AMDirT convert --librarymetadata ancientmetagenome-hostassociated_samples_v24.03.0_filtered.tsv ancientmetagenome-hostassociated

## Filter to just single stranded libraries
awk -F "\t" 'NR==1 || $9 == "single"' AncientMetagenomeDir_filtered_libraries.tsv > AncientMetagenomeDir_filtered_libraries_singlestranded.tsv

## Count the number of libraries, starting from row 2 to skip the header
tail -n +2 AncientMetagenomeDir_filtered_libraries_singlestranded.tsv | wc -l
```
:::

## Git Practise

A critical factor of AncientMetagenomeDir is that it is community-based.
The community curates all new submissions to the repository, and this all occurs with Git and GitHub.

The data is hosted and maintained on GitHub - new publications are evaluated on issues, submissions created on branches, made by pull requests, and PRs reviewed by other members of the community.

You can see the workflow in the image below (@fig-accessingdata-thedirworkflow) from the original AncientMetagenomeDir publication [@Fellows_Yates2021-rp], and read more about the workflow on the AncientMetagenomeDir wiki ([https://github.com/SPAAM-community/AncientMetagenomeDir/wiki](https://github.com/SPAAM-community/AncientMetagenomeDir/wiki)).

![Overview of the AncientMetagenomeDir contribution workflow. Potential publications are added as a GitHub issue where they undergo relevance evaluation. Once approved, a contributor makes a branch on the AncientMetagenomeDir GitHub repository, adds the new lines of metadata for the relevant samples and libraries, and once ready opens a pull request against the main repository. The pull request undergoes an automated consistent check against a schema and then undergoes a human-based peer review for accuracy against AncientMetagenomeDir guidelines. Once approved, the pull request is merged, and periodically the dataset is released on GitHub and automatically archived on Zenodo.](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-thedirworkflow.png){#fig-accessingdata-thedirworkflow fig-align="center" width="75%"}

As AncientMetagenomeDir is on GitHub, the means we can also use this repository to try out our Git skills we learnt in the chapter [Introduction to Git(Hub)](git-github.qmd)!

::: {.callout-tip title="Question" appearance="simple"}

Your task is to complete the following steps.
However, we've replaced the correct Git terminology with generic verbs, indicated with words in quotes.
Recreate the following, but also note down what the _correct_ Git terminology is.

1. Make a 'copy' the `jfy133/AncientMetagenomeDir` ([https://github.com/jfy133/AncientMetagenomeDir](https://github.com/jfy133/AncientMetagenomeDir)) GitHub repository to your account

:::{.callout-note}
We are forking a personal fork of the main repository to ensure we don't accidentally edit the main AncientMetagenomeDir repository!
:::

2. 'Download' the copied repo to your local machine
3. 'Change' into a new branch called `dev`
4. Modify the file `ancientsinglegenome-hostassociated_samples.tsv`. Add the following line to the end of the TSV
	
	Make sure your text editor doesn't replace tabs with spaces!

	```tsv
	Long2022	2022	10.1038/s42003-022-03527-1	Basilica of St. Domenico Maggiore	40.848	14.254	Italy	NASD1	Homo sapiens	400	10.1038/s42003-022-03527-1	bacteria	Escherichia coli	calcified nodule	chromosome	SRA	raw	PRJNA810725	SRS12115743
	```

5. 'Save', 'Record', and 'Send' back to Git(Hub)
6. Open a 'merge suggestion' proposing the changes to the original `jfy133/AncientMetagenomeDir` repo
    - Make sure to put 'Summer school' in the title of the 'Request'
:::

::: {.callout-note collapse="true" title="Answer"}
1. **Fork** the `jfy133/AncientMetagenomeDir` ([https://github.com/jfy133/AncientMetagenomeDir](https://github.com/jfy133/AncientMetagenomeDir)) repository to your account (@fig-acessingancientmetagenomicdata-githubclone)

![Screenshot of the top of the jfy133/AncientMetagenomeDir github repository, with the green Code button pressed and the SSH tab open](assets/images/chapters/accessing-ancient-metagenomic-data/fig-acessingdata-githubclone.png){#fig-acessingancientmetagenomicdata-githubclone}

2. **Clone** the copied repo to your local machine

	```bash
	git clone git@github.com:<YOUR_USERNAME>/AncientMetagenomeDir.Git
	cd AncientMetagenomeDir
	```

3. **Switch** to a new branch called `dev`

	```bash

	git switch -c dev
	```

4. Modify `ancientsinglegenome-hostassociated_samples.tsv`

	```bash
	echo "Long2022	2022	10.1038/s42003-022-03527-1	Basilica of St. Domenico Maggiore	40.848	14.254	Italy	NASD1	Homo sapiens	400	10.1038/s42003-022-03527-1	bacteria	Escherichia coli	calcified nodule	chromosome	SRA	raw	PRJNA810725	SRS12115743" >> ancientsinglegenome-hostassociated/samples/ancientsinglegenome-hostassociated_samples.tsv
	```

5. **Add**, **Commit** and **Push** back to your **Fork** on Git(Hub)

	```bash
	git add ancientsinglegenome-hostassociated/samples/ancientsinglegenome-hostassociated_samples.tsv
	git commit -m 'Add Long2022'
	git push
	```

6. Open a **Pull Request** adding changes to the original jfy133/AncientMetagenomeDir repo (@fig-acessingancientmetagenomicdata-prbutton)
    - Make sure to make the pull request against `jfy133/AncientMetagenomeDir` and NOT `SPAAM-community/AncientMetagenomeDir`
    - Make sure to put 'Summer school' in the title of the pull request!

![Screenshot of the top of the jfy133/AncientMetagenomeDir github repository, with the green pull request button displayed](assets/images/chapters/accessing-ancient-metagenomic-data/fig-accessingdata-prbutton.png){#fig-acessingancientmetagenomicdata-prbutton}

:::

## Summary

- Reporting of metadata messy! Consider when publishing your own work!
  - Use AncientMetagenomeDir as a template for supplementary tables!
- Use AncientMetagenomeDir and AMDirT to rapidly find and download public ancient metagenomic data
  - You can use it to generate templates for dowsntream processing pipelines!
- Contribute to AncientMetagenomeDir with git
  - It is community curated: it will be as good as you make it, the more people who contribute, the easier and better it is.

## (Optional) clean-up

Let's clean up your working directory by removing all the data and output from this chapter.

When closing your `jupyter` notebook(s), say no to saving any additional files.

Press <kbd>ctrl</kbd> + <kbd>c</kbd> on your terminal, and type <kbd>y</kbd> when requested. 
Once completed, the command below will remove the `/<PATH>/<TO>/accessing-metagenomic-data` directory **as well as all of its contents**. 

::: {.callout-tip}
## Pro Tip
Always be VERY careful when using `rm -r`.
Check 3x that the path you are specifying is exactly what you want to delete and nothing more before pressing ENTER!
:::

```bash
rm -r /<PATH>/<TO>/accessing-ancient-metagenomic-data*
```

Once deleted you can move elsewhere (e.g. `cd ~`).

We can also get out of the `conda` environment with

```bash
conda deactivate
```

To delete the conda environment

```bash
conda remove --name accessing-ancient-metagenomic-data --all -y
```

## Resources

- [AncientMetagenomeDir repository](https://github.com/SPAAM-community/AncientMetagenomeDir)
- [AMDirT web server](https://www.spaam-community.org/AMDirT/)
- [AMDirT documentation](https://amdirt.readthedocs.io/)
- [AMDirT repository](https://github.com/SPAAM-community/AMDirT)

## References