GenomeDelta is a software designed to discover horizontal transfer events in a species. By comparing a genome (in FASTQ or BAM format) with an assembly (in FASTA format) of the same species, GenomeDelta identifies novel genetic elements present in the assembly, but absent in the short-reads genome.
Its primary focus is on detecting transposable element invasions, while also offering the capability to unveil other HT events and other genomic alterations. An enormous advantage of GenomeDelta when dealing with transposable elements is that it does not require any reference library of transposons to identify the novel invaders.
If you are using a MacOS machine, download the setup.sh
file only
from:
https://github.com/rpianezza/GenomeDelta/blob/main/macOS/setup.sh.
You can make the file executable with the command below, then double clicking on the setup file will install GD in the root folder.
chmod +x Downloads/setup.sh
Alternatively, open a terminal and move to the folder where you want to
install GD, for example the Applications folder. From the Application
folder, call the downloaded setup.sh
file:
cd Applications
bash ../Downloads/setup.sh
For a manual installation, clone the GD repository to your local
folder. Then create the conda environment using the set-env.yml
.
git clone https://github.com/rpianezza/GenomeDelta.git
cd macOS
conda env create -f set-env.yml
Note that all of the dependencies needed to run GD are common bioinformatic tools and might be already installed on your machine, so creating a conda environment might not even be necessary. If you are having trouble in setting the conda environment, check the set-env.yml
file to assess if all the dependencies are already installed on your machine, then manually install the missing ones.
Download the GD repository using git clone, then create the conda
environment using the set-env.yml
.
git clone https://github.com/rpianezza/GenomeDelta.git
cd linux
conda-env create -f linux/set-env-linux.yml
Activate the conda environment:
conda activate GenomeDelta
When being into the GenomeDelta conda environment, to run the tool you can just type GenomeDelta (MacOS) or call the main script defining its path (if you followed a manual installation or you are on a Linux machine).
Example call (MacOS):
GenomeDelta --fq reads.fastq.gz --fa assembly.fa --of folder_path --prefix name --t 20
Example call (Linux or manual installation):
cd GenomeDelta/linux
bash main.sh --fq reads.fastq.gz --fa assembly.fa --of folder_path --prefix name --t 20
The above call is composed of:
bash
-> to specify that we are about to call a bash scriptmain.sh
-> the main script of GenomeDelta.--fq
-> the FASTQ file of the “old” genome.--fa
-> the FASTA file of the “new” genome (the assembly).--of
-> the output folder.--prefix
-> the prefix that all the output files will have.--t
-> the number of threads that will be used to parallelize the slowest steps.
Remember to index the FASTA assembly before the call!
bwa index assembly.fa
GD can also accept sorted BAM files as input instead of the FASTQ file. The BAM file should have been mapped to the same FASTA assembly specified in the call and sorted with samtools.
GenomeDelta --bam mapped.sorted.bam --fa assembly.fa --of folder_path --prefix name --t 20
GenomeDelta also has some other options, that can be used to refine or explore your findings:
--min_len
-> Set the minimum length of a low-coverage region to be included in the output files. Default = 1000--min_cov
-> Set the minimum coverage. Below that, a position is considered to be “low-coverage” and will be included in the next steps. Default = 1--d
-> Set the maximum distance between two low-coverage regions to be merged. If the distance between the two regions is below d, the two regions will be merged. Increasing this distance could create artifacts and chimeric sequences, but could find more fragmented regions (es. re-invading TEs). Default = 100.--min_bitscore
-> to find repetitive clusters, GD is using BLASTn. The output is filtered based on the bitscore value, with default set to 1000 to only consider high quality alignments. If you want to find small sequences, you may want to decrease this parameter.
-
GD-candidates.png
-> Visualization of the repetitive clusters found (candidate TEs). -
GD-candidates.fasta
-> Consensus sequences of the repetitive clusters. Represents a list of candidates of the invading TEs. Each sequence name is composed off: (i) cluster name, (ii) median coverage bias of the sequences in the cluster, (iii) number of sequences in the cluster.
-
GD.fasta
-> FASTA containing all the sequences with coverage lower than--min_cov
(default = 2) and longer than--min_len
(default = 1000 bases), merged together if their distance is lower than--d
(default = 100 bases). -
GD.bed
-> Chromosome, starting and ending positions of the sequences collected inGD.fasta
. -
GD-non_rep.fasta
-> Subset of theGD.fasta
file containing only sequences which were not included in the repetitive clusters, so either non repetitive sequences, huge gaps (>25.000 bp) or sequences which have some similarity with a repetitive clusters but with dimensions bigger the the rest of the cluster. -
GD-non_rep.png
-> Visualization of the non repetitive gaps found (candidate HTs). -
GD.blast
-> Output of the self BLAST ofGD.fasta
against itself. This file is then used by the program to find the repetitive clusters of sequences. -
GD-clusters
-> This folder contains, for each of the repetitive clusters found:- a FASTA file containing all the sequences clustering together.
- the consensus sequence of the cluster, then concatenated with the other consensus into the GD-candidates.fasta file.
Example call:
bash main.sh --bam reads.sorted.bam --fa assembly.fa --of folder_path --prefix name --t 20
To iterate over multiple FASTQ or BAM files in a folder and run GenomeDelta on all of them against a single assembly, you can use this one-liner structure:
For FASTQ:
for fq_file in folder/*.fastq.gz; do base_name=$(basename "$fq_file" .fastq.gz); file="$base_name"; bash main.sh --fq "$fq_file" --fa assembly.fa --of folder_path/"$file" --prefix name --t 20; done
For BAM (sorted):
for bam_file in folder/*.sorted.bam; do base_name=$(basename "$bam_file" .sorted.bam); file="$base_name"; bash main.sh --bam "$bam_file" --fa assembly.fa --of folder_path/"$file" --prefix name --t 20; done
These commands will generate a separate folder for each of the input files, named as the input file basename.
To iterate over multiple FASTA assemblies and run GenomeDelta on all of them against a single FASTQ file, you can use this loop structure:
for fa_file in /path/to/your/fasta/files/*.fa; do
base_name=$(basename "$fa_file" .fa)
file="$base_name"
bash main.sh --fq reads.fastq.gz --fa "$fa_file" --of /path/to/output/"$file" --prefix name --t 20
done
This command will generate a separate folder for each of the assemblies, named as the assembly file basename.
To iterate over multiple FASTA assemblies and run GenomeDelta on all of them against multiple FASTQ files, you can use this double loop structure:
for fq_file in /path/to/your/fastq/files/*.fastq.gz; do
base_name=$(basename "$fq_file" .fastq.gz)
for fa_file in /path/to/your/fasta/assemblies/*.fa; do
base_name_fa=$(basename "$fa_file" .fa)
file="$base_name"
file_fa="$base_name_fa"
bash main.sh --fq "$fq_file" --fa "$fa_file" --of /path/to/output/"$file"_"$file_fa" --prefix name --t 20
done
done
This command will generate a separate folder for each of the assembly-FASTQ combination, named as the FASTQ file basename and the assembly file basename separated by “_“. You can change this loop to make it iterate over BAM files instead of FASTQ. Note that you may need to adjust the extension”fa” to “fasta” based on the assemblies names.
The main output file for repetitive sequences is GD-candidates.fasta
,
which represents a list of candidates invaders in the time range going
from the old genome collection time to the recent assembly collection
time. However, not all the sequences found in the file are probably real
invaders.
Inside the file, you will probably find sequences of telomeric repeats: telomeres often have coverage gaps in sequencing efforts, thus a telomeric repetitive region could remain unsequenced in the FASTQ file and appear as coverage gap in the FASTA assembly, ending in our candidates list. It is thus important to check where are the repetitive sequences of the cluster (the FASTA name of the sequences contains the genomic position).
Also, an invading TE with a consistent sequence similarity to another (old, non recently invading) TE will result in reads mapping to the supposed “coverage gap” of the new TE. Thus, instead of having a clean coverage gap representing the new TE insertion, we will have some reads mapping on at least a part of the gap. In the GenomeDelta output, you may find different parts of the same, new TE in different “clusters”, so in different FASTA entries, each representing a “clean” gap in the coverage divided by a high coverage region (es. a part of the TE very similar to the old TE, where the reads are mapping).
Those problems could be solved by playing with the options provided by the software, but multiple run of GenomeDelta could be necessary to polish the results.
Furthermore, the quality of both the FASTQ file and the FASTA assembly have an impact on the quality of the results.
-
If the FASTQ file has a low coverage or/and a high gaps percentage, many coverage gaps will be identified by GenomeDelta when mapping to the assembly.
-
If the FASTA assembly has a low quality, for example many contigs, there could be contigs where the reads are not mapping at all (es. contaminations from other organisms). The quality of the long read assembly is thus crucial for a smooth GenomeDelta run.
In general, clusters with low coverage bias (close to 0), with a long consensus sequence and composed by many sequences are likely to be more valuable.