Genome Annotation
Following genome assembly, it is essential to identify the genes and other genetic elements present in the genome. This process, called annotation, can be done using either an automated pipeline or a manual pipeline. Automated pipelines are easier to run but are more error-prone, so a manual approach was adopted for this project.
The Maker2 pipeline, adapted from this repository, was used to annotate the genome, with a few modifications. Given that we have an assembled genome and a transcriptome, it was ideal to do the annotation at this stage.
The following steps were followed for manual annotation using Maker2.
Create Maker Control files.
maker -CTL
This creates a set of control files, of which only maker_opts.ctl needs to be modified. Open maker_opts.ctl in a text editor and make the following modifications.
genome=reference.fasta
organism_type=eukaryotic
est=trinity.fasta
est2genome=1
Setting est2genome=1 makes Maker2 infer gene models directly from the RNA (EST/transcript) evidence.
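The edits above can also be applied non-interactively, which makes reruns easier to reproduce. The following is a minimal sketch and not part of Maker2: `set_maker_opt` is a hypothetical helper name, and it assumes every option sits on its own line as `key=value #comment`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Maker2): set one option in a control file.
# Assumes each option is on its own line as "key=value #comment".
# Usage: set_maker_opt KEY VALUE FILE
set_maker_opt() {
    key=$1
    value=$2
    file=$3
    tmp=$(mktemp) || return 1
    # Replace the text between "KEY=" and the trailing "#comment", keeping the comment.
    # "|" is the sed delimiter, so VALUE must not contain "|" or "&".
    if sed "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$tmp"; then
        # Write back through cat so the original file permissions are kept.
        cat "$tmp" > "$file"
    fi
    rm -f "$tmp"
}

# Example: the first-run settings from this section.
# set_maker_opt genome reference.fasta maker_opts.ctl
# set_maker_opt organism_type eukaryotic maker_opts.ctl
# set_maker_opt est trinity.fasta maker_opts.ctl
# set_maker_opt est2genome 1 maker_opts.ctl
```

Because the pattern is anchored on `KEY=`, setting `est` does not touch the `est2genome` line.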
Run the following command in the same folder as the control files.
maker
The above command will create a reference.maker.output folder.
Using the results from Step 1, we will train SNAP, a gene predictor. For this we need to extract the results from the first maker run. Change directory to the reference.maker.output folder and run the following line.
gff3_merge -d reference_master_datastore_index.log
This creates a gff3 file containing the genes predicted in the first run.
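Before training, it can help to check how many gene features the first run actually produced. The sketch below counts features of a given type in the third column of a GFF3 file; `count_gff_features` is a hypothetical helper name, and the file name reference.all.gff is taken from the next step of this guide.

```shell
#!/bin/sh
# Hypothetical helper: count GFF3 features of one type (column 3).
# Usage: count_gff_features TYPE FILE
count_gff_features() {
    awk -F '\t' -v type="$1" '
        /^##FASTA/ { exit }      # stop at the optional embedded sequence section
        /^#/       { next }      # skip comment and directive lines
        $3 == type { n++ }
        END        { print n + 0 }
    ' "$2"
}

# Example:
# count_gff_features gene reference.all.gff
```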
Training SNAP requires the creation of genome.ann and genome.dna files. Run the following line to create them.
maker2zff reference.all.gff
genome.ann holds the gene models (exon coordinates, from which the introns follow) in ZFF format, and genome.dna holds the corresponding DNA sequences.
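To see how many gene models ended up in the training set, the distinct model names in genome.ann can be counted. This is a sketch under an assumption about the file layout: `>sequence` header lines followed by feature lines whose fourth whitespace-separated field is the model name. `count_zff_models` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: count distinct gene models in a ZFF annotation file.
# Assumed layout: ">sequence" headers, then "Label Begin End Model" lines.
# Usage: count_zff_models FILE
count_zff_models() {
    awk '
        /^>/    { seq = $1; next }               # remember the current sequence
        NF >= 4 { seen[seq SUBSEP $4] = 1 }      # one entry per (sequence, model)
        END     { n = 0; for (k in seen) n++; print n }
    ' "$1"
}

# Example:
# count_zff_models genome.ann
```

If this number is lower than the gene count in the gff3 file, that is not necessarily an error: as far as I understand, maker2zff can filter out models that lack sufficient evidence support.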
After this, the .dna and .ann files have to be checked for possible errors, using the following line of code:
fathom genome.ann genome.dna -validate > snap_validate_output.txt
Following this, the remaining input files for SNAP training need to be created. Run the following lines for the same:
fathom genome.ann genome.dna -categorize 1000
fathom uni.ann uni.dna -export 1000 -plus
forge export.ann export.dna
Now, train SNAP with hmm-assembler (part of the SNAP package):
hmm-assembler.pl reference.fasta . > reference.hmm
Predictions in the second Maker2 run need to be based on the SNAP training. For this, make the following changes to the maker_opts.ctl file:
snaphmm= reference.maker.output/reference.hmm
est2genome=0
Now run Maker2 in the same directory as the control files.
maker
Repeat the SNAP training and the second Maker2 run for increased accuracy, taking care not to overtrain. Two rounds should be enough.
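Since the training round is repeated, the commands from this section can be collected into one function. The sketch below only wraps the commands listed above; `run`, `snap_training_round` and the `DRY_RUN` variable are hypothetical names introduced for illustration, not features of Maker2 or SNAP. With `DRY_RUN=1` the commands are printed instead of executed, which is a cheap way to review the round before starting it.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        eval "$*"
    fi
}

# One SNAP training round; meant to be called inside reference.maker.output.
# Stops at the first command that fails.
snap_training_round() {
    run "gff3_merge -d reference_master_datastore_index.log" &&
    run "maker2zff reference.all.gff" &&
    run "fathom genome.ann genome.dna -validate > snap_validate_output.txt" &&
    run "fathom genome.ann genome.dna -categorize 1000" &&
    run "fathom uni.ann uni.dna -export 1000 -plus" &&
    run "forge export.ann export.dna" &&
    run "hmm-assembler.pl reference.fasta . > reference.hmm"
}

# Example: one round followed by a Maker2 run from the control-file directory.
# (cd reference.maker.output && snap_training_round) && maker
```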
Augustus is our second gene predictor, and it will be trained on the data generated by SNAP. For this, SNAP's ZFF files need to be converted to GenBank (gbk) files. This script should be used to do the conversion.
This script needs to be run in the same directory as the one containing export.ann and export.dna files.
chmod +x ./zff2augustus_gbk.pl #Provides executable permission for the perl script
cpan Bio::DB::Fasta
./zff2augustus_gbk.pl > augustus.gbk
This will create an augustus.gbk file. This file now needs to be split into training and testing sets. My dataset had 26 genes, so I split it in half, assigning 13 genes to each set, using the following line.
perl randomSplit.pl augustus.gbk 13
Following this, create a config folder in the reference.maker.output directory.
mkdir config
Then, copy the Augustus configuration files, including all species, stored on Rackham. This is important as students do not have read and write permissions for shared UPPMAX repositories. Copy the files and then enable read and write access using the following steps:
cp -r /sw/apps/bioinfo/augustus/3.2.3/rackham/config/* config/
chmod -R 0777 species/human/
Export the Augustus config path:
export AUGUSTUS_CONFIG_PATH="Path/to/reference.maker.output/config"
Then run the following to optimise the Augustus parameters:
optimize_augustus.pl --species=human augustus.gbk.train
Following this step, retrain Augustus with the optimised parameters.
etraining --species=human augustus.gbk.train
augustus --species=human augustus.gbk.test | tee second_training.out
Open the maker_opts.ctl file and make the following changes:
augustus_species=human
keep_preds=1
Now, run maker one last time:
maker
Once this is done, check for errors using:
less reference_master_datastore_index.log
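Paging through the log by hand gets tedious for genomes with many scaffolds. The sketch below summarises the log instead, assuming each line has three tab-separated columns (sequence name, output path, status) with statuses such as STARTED and FINISHED. The function names are hypothetical.

```shell
#!/bin/sh
# Tally how often each status (third column) appears in the log.
# Note: a finished sequence normally also has an earlier STARTED line.
datastore_status_summary() {
    awk -F '\t' '
        NF >= 3 { count[$3]++ }
        END     { for (s in count) print s, count[s] }
    ' "$1" | sort
}

# List sequences whose last recorded status is not FINISHED.
datastore_unfinished() {
    awk -F '\t' '
        NF >= 3 { last[$1] = $3 }
        END     { for (id in last) if (last[id] != "FINISHED") print id, last[id] }
    ' "$1" | sort
}

# Example:
# datastore_status_summary reference_master_datastore_index.log
# datastore_unfinished reference_master_datastore_index.log
```

An empty result from `datastore_unfinished` suggests that every sequence in the log completed.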