Genome Annotation

Karthik Nair edited this page Sep 5, 2019 · 2 revisions

Following genome assembly, it is essential to identify the genes and other genetic elements present in the genome. This process, called annotation, can be done using either an automated pipeline or a manual pipeline. Automated pipelines are easier to run, but are more error-prone. Therefore, a manual approach was adopted for this project.

The Maker2 pipeline, adapted from this repository with a few modifications, was used to annotate the genome. Given that we have both an assembled genome and a transcriptome, it was ideal to do the annotation at this stage.

The following steps were followed for manual annotation using Maker2.

Step 1

Create Maker Control files.

maker -CTL

This creates several control files (such as maker_opts.ctl, maker_bopts.ctl and maker_exe.ctl), of which only maker_opts.ctl needs to be modified. Open maker_opts.ctl in a text editor and make the following modifications.

genome=reference.fasta

organism_type=eukaryotic

est=trinity.fasta

est2genome=1

Setting est2genome=1 makes Maker infer gene models directly from the transcript (EST) evidence given in est=, without using an ab initio gene predictor.
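The same edits can also be made from the command line instead of a text editor. The following is only a minimal sketch: set_ctl_opt is a hypothetical helper (it is not part of Maker), and it assumes that each option sits on its own line as key=value, optionally followed by a #comment, which is how maker -CTL normally writes the file.

```shell
# Hypothetical helper (not part of Maker): set KEY=VALUE in a control file,
# keeping any trailing "#comment" that follows the value on that line.
set_ctl_opt() {
    local key value file tmp
    key="$1"; value="$2"; file="$3"
    tmp="$(mktemp)" || return 1
    # Replace everything between "key=" and the first "#" (or end of line).
    if sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$tmp"; then
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Apply the Step 1 settings only if the control file exists in this folder.
if [ -f maker_opts.ctl ]; then
    set_ctl_opt genome reference.fasta maker_opts.ctl
    set_ctl_opt organism_type eukaryotic maker_opts.ctl
    set_ctl_opt est trinity.fasta maker_opts.ctl
    set_ctl_opt est2genome 1 maker_opts.ctl
fi
```

Because the pattern is anchored at the start of the line, setting est does not touch est2genome. Values containing a | character would need a different sed delimiter.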

Run the following command in the same folder as the control files.

maker

The above command will create a reference.maker.output folder.

Step 2

SNAP Training

Using the results from Step 1, we will train SNAP, a gene predictor. For this, we need to extract the results of the first Maker run. Change directory to the reference.maker.output folder and run the following line.

gff3_merge -d reference_master_datastore_index.log

This creates a gff3 file containing the genes predicted in the first run.
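Before training, it can be useful to check how many gene models the first run actually produced. The sketch below counts rows by feature type, which is column 3 of a GFF3 file. count_gff_features is a hypothetical helper, and the file name reference.all.gff is taken from the maker2zff command used in this step.

```shell
# Hypothetical helper: count the rows of a GFF3 file whose feature type
# (column 3) equals the given type. Comment and FASTA lines contain no
# tab-separated third column, so they are never counted.
count_gff_features() {
    awk -F'\t' -v type="$2" '$3 == type { n++ } END { print n + 0 }' "$1"
}

# Report gene and mRNA counts if the merged GFF3 file is present.
if [ -f reference.all.gff ]; then
    echo "genes: $(count_gff_features reference.all.gff gene)"
    echo "mRNAs: $(count_gff_features reference.all.gff mRNA)"
fi
```

A very low gene count at this point suggests there will be too few models to train SNAP reliably.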

Training SNAP requires the creation of genome.ann and genome.dna files. Run the following line to create them.

maker2zff reference.all.gff

genome.ann holds the gene models (the coordinates of their exons, and hence introns, in ZFF format) and genome.dna holds the corresponding DNA sequences.

After this, the .dna and .ann files have to be checked for possible errors, using the following line of code:

fathom genome.ann genome.dna -validate > snap_validate_output.txt

Following this, the remaining input files for SNAP training need to be created. Run the following lines to create them:

fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus
forge export.ann export.dna

Now, train SNAP with hmm-assembler (part of the SNAP package):

hmm-assembler.pl reference.fasta . > reference.hmm

Maker2 Second Run

Predictions in the second Maker run need to be based on the trained SNAP model. For this, make the following changes to the maker_opts.ctl file:

snaphmm=reference.maker.output/reference.hmm

est2genome=0

Now run Maker2 in the same directory as the control files.

maker

Training SNAP again

Repeat the SNAP Training and Maker2 Second Run steps for increased accuracy. Take care not to overtrain the model; two rounds should be enough.

Step 3

Augustus is our second gene predictor, and it will be trained on the data generated for SNAP. For this, SNAP's ZFF files need to be converted to a GenBank (gbk) file. This script should be used to do the conversion.

This script needs to be run in the same directory as the one containing export.ann and export.dna files.

chmod +x ./zff2augustus_gbk.pl #Provides executable permission for the perl script

cpan Bio::DB::Fasta

./zff2augustus_gbk.pl > augustus.gbk

This will create an augustus.gbk file. This file now needs to be split into a training set and a testing set. My dataset had 26 genes. I split the set in half, assigning 13 genes to each set, using the following line.

perl randomSplit.pl augustus.gbk 13
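The sizes of the two sets can be checked by counting GenBank records, each of which starts with a LOCUS line. This is only a sketch: count_gbk_records is a hypothetical helper, and it assumes randomSplit.pl wrote its output as augustus.gbk.train and augustus.gbk.test, the file names used in the later training commands.

```shell
# Hypothetical helper: count GenBank records by their leading LOCUS line.
count_gbk_records() {
    grep -c '^LOCUS' "$1"
}

# Report the size of the full set and of each split, where present.
for f in augustus.gbk augustus.gbk.train augustus.gbk.test; do
    if [ -f "$f" ]; then
        echo "$f: $(count_gbk_records "$f") records"
    fi
done
```

With 26 genes and a test size of 13, both splits would be expected to report 13 records.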

Following this, create a config folder in the reference.maker.output directory.

mkdir config

Then copy the Augustus configuration files (including all species) from the shared installation on Rackham into this folder. A local copy is needed because students do not have write permission for the shared UPPMAX directories. After copying, enable read and write access to the human species folder inside the new config folder:

cp -r /sw/apps/bioinfo/augustus/3.2.3/rackham/config/* config/
chmod -R 0777 config/species/human/

Export the augustus config path:

export AUGUSTUS_CONFIG_PATH="Path/to/reference.maker.output/config"
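Training fails if Augustus cannot find or write to the species folder, so it may be worth checking the exported path first. The following is a minimal sketch: check_augustus_config is a hypothetical helper, and it assumes the usual layout in which each species has its own folder under species/ in the config directory.

```shell
# Hypothetical helper: verify that AUGUSTUS_CONFIG_PATH is set and that the
# folder of the given species exists under it and is writable.
check_augustus_config() {
    local species_dir
    if [ -z "${AUGUSTUS_CONFIG_PATH:-}" ]; then
        echo "AUGUSTUS_CONFIG_PATH is not set" >&2
        return 1
    fi
    species_dir="${AUGUSTUS_CONFIG_PATH}/species/$1"
    if [ ! -d "$species_dir" ]; then
        echo "missing species folder: $species_dir" >&2
        return 1
    fi
    if [ ! -w "$species_dir" ]; then
        echo "species folder is not writable: $species_dir" >&2
        return 1
    fi
    echo "ok: $species_dir"
}

# Check the species used on this page; only warn if the check fails.
check_augustus_config human || echo "Fix the config copy before training." >&2
```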

Then run the following to optimise the Augustus parameters:

optimize_augustus.pl --species=human augustus.gbk.train

Following this step, retrain Augustus with the optimised parameters and evaluate it on the test set.

etraining --species=human augustus.gbk.train
augustus --species=human augustus.gbk.test | tee second_training.out

Step 4

Final Maker Run

Open the maker_opts.ctl file and make the following changes:

augustus_species=human

keep_preds=1

Now, run Maker one last time:

maker

Once this is done, check for errors using:

less reference_master_datastore_index.log
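For a genome with many contigs, reading the log by eye becomes tedious. The sketch below summarises it instead. It assumes the log is tab-separated with the sequence name in column 1 and a status in column 3, and that a contig which completed has FINISHED as its last status. Both function names are hypothetical, and the format should be checked against your own log before relying on the output.

```shell
# Hypothetical helper: count how often each status (column 3) appears.
maker_status_counts() {
    awk -F'\t' 'NF >= 3 { count[$3]++ }
                END { for (s in count) print s, count[s] }' "$1" | sort
}

# Hypothetical helper: list contigs whose last recorded status is not FINISHED.
maker_unfinished() {
    awk -F'\t' 'NF >= 3 { last[$1] = $3 }
                END { for (c in last) if (last[c] != "FINISHED") print c, last[c] }' "$1" | sort
}

# Summarise the log if it is present in the current folder.
if [ -f reference_master_datastore_index.log ]; then
    maker_status_counts reference_master_datastore_index.log
    maker_unfinished reference_master_datastore_index.log
fi
```

If the second function prints nothing, every contig in the log ended with the status FINISHED.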
