ensembl-vep

VEP (Variant Effect Predictor) predicts the functional effects of genomic variants.
Haplosaurus uses phased genotype data to predict whole-transcript haplotype sequences.
Variant Recoder translates between different variant encodings.

Installation and requirements

The VEP package requires:

gcc, g++ and make
Perl (>=5.10 recommended, tested on 5.10, 5.14, 5.18, 5.22, 5.26)
Perl libraries Archive::Zip and DBI

The remaining dependencies can be installed using the included INSTALL.pl script. Basic instructions:

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl

The installer may also be used to check for updates to this and co-dependent packages; simply re-run INSTALL.pl.

See documentation for full installation instructions.

Additional CPAN modules

The following modules are optional but most users will benefit from installing them. We recommend using cpanminus to install.

DBD::mysql - required for database access (--database or --cache without --offline)
Set::IntervalTree - required for Haplosaurus, also speeds up VEP
JSON - required for writing JSON output
PerlIO::gzip - faster compressed file parsing
Bio::DB::BigFile - required for reading custom annotation data from BigWig files

Docker

A docker image for VEP is available from DockerHub. See documentation for the Docker installation instructions.

Conda

Conda Ensembl VEP - the Ensembl VEP team did not produce the Conda install and we do not maintain it. You are welcome to use this package, but please be aware that we are not able to offer support for any issues related to the Conda installation you may encounter while using it.

VEP

Usage

./vep -i input.vcf -o out.txt -offline

See documentation for full command line instructions.

Please report any bugs or issues by contacting Ensembl or creating a GitHub issue.

Known Bugs

Below is a list of known Ensembl VEP bugs; where possible, workarounds have been provided. While we prefer to fix all bugs, some involve fundamental code, where any change could have far-reaching impacts and requires extensive testing. These can take considerable time and effort to resolve, so the list below details the issues we are aware of but may not be able to resolve for some time.

Using the options --hgvs and --no_stats with multiallelic variants may result in incorrect CDS and protein positions, or Ensembl VEP may crash. The fix for this is complex and we don’t currently have a timeline for it, but there are workarounds:
- Separating the multiallelic variants on to individual input lines should result in the correct annotation.
- Don’t combine the --hgvs and --no_stats options.
Variants overlapping exon / intron boundaries can occasionally be annotated with incorrect consequences and the HGVS may not be fully correct. We currently do not have an effective workaround for such cases and we will continue to investigate a fix.
Variants overlapping incomplete terminal codons - when annotating these, Ensembl VEP may produce unreliable variant consequences.
- These transcripts have been annotated from protein fragments and thus may have incomplete open reading frames. We currently do not have a foolproof method to annotate variants overlapping these features, so the recommended workaround is to use an alternative, more complete transcript, where possible.
Minor Allele Frequency (MAF) filtering bug. Using the --freq_freq, --freq_gt_lt, and --freq_filter options to filter the output of Ensembl VEP can lead to some variants being incorrectly excluded or included, depending on the parameters used with the options. This only affects insertions / deletions and is caused by an allele matching issue. We are investigating a potential fix, but it will take time to fully assess and implement.
- In the meantime, the simplest workaround is to not filter while running Ensembl VEP and instead use the filter_vep script to filter your output files. In almost all circumstances this is the better approach as you will retain marginal cases in your unfiltered file that may be relevant.
- Please note that this bug also impacts the MAF filtering options on the web version of Ensembl VEP.
Non-coding transcripts do not get their alleles shifted when --shift_3prime is enabled. This affects custom annotation when used with the "type=exact" option, and any other logic that tries to match the variant allele with the annotated feature.
Running --fork with --merged can cause discrepancies in results for HGNC ids due to internal caching. It is recommended to not run --fork when using --merged for obtaining HGNC ids.
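The first workaround for the --hgvs / --no_stats bug above (placing each allele of a multiallelic variant on its own input line) can be sketched in Python as follows. This is a minimal illustration, not part of Ensembl VEP: the function name split_multiallelic is hypothetical, sample columns are dropped, and per-allele INFO values are copied unchanged, so a dedicated VCF normalisation tool is likely the safer choice for real data.

```python
def split_multiallelic(vcf_line):
    """Split one VCF data line with several ALT alleles into one line per allele.

    Assumption: only the eight fixed VCF columns are kept. Sample columns are
    dropped and per-allele INFO values are copied as they are.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    # QUAL, FILTER and INFO default to "." when the input line omits them
    rest = (fields[5:8] + [".", ".", "."])[:3]
    return [
        "\t".join([chrom, pos, var_id, ref, allele] + rest)
        for allele in alt.split(",")
    ]


# Made-up example record with two ALT alleles
line = "1\t1000\trs1\tA\tC,G\t.\tPASS\t."
for out in split_multiallelic(line):
    print(out)
```

Lines with a single ALT allele pass through as one unchanged record, so the function can be applied to every data line of an input file.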

Haplosaurus

haplo is a local tool implementation of the same functionality that powers the Ensembl transcript haplotypes view. It takes phased genotypes from a VCF and constructs a pair of haplotype sequences for each overlapped transcript; these sequences are also translated into predicted protein haplotype sequences. Each variant haplotype sequence is aligned and compared to the reference, and an HGVS-like name is constructed representing its differences to the reference.

This approach offers an advantage over VEP's analysis, which treats each input variant independently. By considering the combined change contributed by all the variant alleles across a transcript, the compound effects the variants may have are correctly accounted for.
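To see why combining alleles matters, consider two single-base substitutions that fall in the same codon. The Python sketch below is purely illustrative and is not how haplo is implemented: the codon, the variants and the helper apply_snvs are made up for the example, and the codon table covers only the four codons involved.

```python
# Minimal codon table covering only the codons used in this example
CODON_TABLE = {"AAG": "Lys", "GAG": "Glu", "AAT": "Asn", "GAT": "Asp"}


def apply_snvs(codon, snvs):
    """Return the codon after applying (zero-based position, alt base) substitutions."""
    bases = list(codon)
    for position, alt in snvs:
        bases[position] = alt
    return "".join(bases)


reference = "AAG"
variant_a = (0, "G")  # first base A>G
variant_b = (2, "T")  # third base G>T

alone_a = apply_snvs(reference, [variant_a])
alone_b = apply_snvs(reference, [variant_b])
combined = apply_snvs(reference, [variant_a, variant_b])

for label, codon in [
    ("reference", reference),
    ("A only", alone_a),
    ("B only", alone_b),
    ("A and B", combined),
]:
    print(label, codon, CODON_TABLE[codon])
```

Analysed independently, the two variants would be reported as Lys to Glu and Lys to Asn; on the same haplotype the codon becomes GAT, which encodes Asp, a change that neither per-variant prediction describes.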

haplo shares much of the same command line functionality with vep, and can use VEP caches, Ensembl databases, GFF and GTF files as sources of transcript data; all vep command line flags relating to this functionality work the same with haplo.

Usage

Input data must be a VCF containing phased genotype data for at least one individual, and the file must be sorted by chromosome and genomic position; no other formats are currently supported.
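A quick pre-flight check of these input requirements could look like the Python sketch below. It is not part of haplo, and it rests on several assumptions: the function names are hypothetical, GT is the first FORMAT key, "sorted" is taken to mean that each chromosome forms one contiguous block with non-decreasing positions, and a genotype counts as phased when it contains no "/" separator.

```python
def is_phased(genotype):
    """Treat a GT value as phased when it contains no '/' separator."""
    return "/" not in genotype


def check_vcf_lines(lines):
    """Return a list of problems found in VCF data lines (header lines are skipped)."""
    problems = []
    seen_chroms = set()
    last_chrom, last_pos = None, 0
    for number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos = fields[0], int(fields[1])
        if chrom != last_chrom:
            if chrom in seen_chroms:
                problems.append(f"line {number}: chromosome {chrom} is not contiguous")
            seen_chroms.add(chrom)
            last_chrom, last_pos = chrom, 0
        if pos < last_pos:
            problems.append(f"line {number}: position {pos} is out of order")
        last_pos = pos
        # Sample columns start at index 9; GT is assumed to be the first value
        for sample in fields[9:]:
            genotype = sample.split(":")[0]
            if not is_phased(genotype):
                problems.append(f"line {number}: unphased genotype {genotype}")
    return problems


# Made-up records: the second is out of order and the first is unphased
example = [
    "1\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1",
    "1\t100\t.\tA\tG\t.\t.\t.\tGT\t0|1",
]
for problem in check_vcf_lines(example):
    print(problem)
```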

When using a VEP cache as the source of transcript annotation, the first time you run haplo with a particular cache it will spend some time scanning transcript locations in the cache.

./haplo -i input.vcf -o out.txt -cache

Output

The default output format is a simple tab-delimited file reporting all observed non-reference haplotypes. It has the following fields:

Transcript stable ID
CDS haplotype name
Comma-separated list of flags for CDS haplotype
Protein haplotype name
Comma-separated list of flags for protein haplotype
Comma-separated list of frequency data for protein haplotype
Comma-separated list of contributing variants
Comma-separated list of sample:count that exhibit this haplotype
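The eight columns above can be read into a structured record with a few lines of Python. The sketch below is an illustration only: the field names in HaploRow are hypothetical labels chosen for this example, the sample line is made up, and the code assumes each data line holds exactly eight tab-separated columns with no header line.

```python
from collections import namedtuple

# Field names are hypothetical labels for the eight columns listed above
HaploRow = namedtuple(
    "HaploRow",
    "transcript cds_name cds_flags protein_name protein_flags "
    "frequencies variants sample_counts",
)


def split_list(value):
    """Split a comma-separated column, returning [] for an empty column."""
    return [item for item in value.split(",") if item]


def parse_haplo_line(line):
    """Parse one line of the default tab-delimited haplo output."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 8:
        raise ValueError(f"expected 8 columns, found {len(columns)}")
    sample_counts = {}
    for item in split_list(columns[7]):
        # Split on the last colon in case a sample name itself contains one
        sample, count = item.rsplit(":", 1)
        sample_counts[sample] = int(count)
    return HaploRow(
        columns[0],
        columns[1],
        split_list(columns[2]),
        columns[3],
        split_list(columns[4]),
        split_list(columns[5]),
        split_list(columns[6]),
        sample_counts,
    )


# Made-up example line, for illustration only
example_line = (
    "ENST00000366667\tENST00000366667:776T>C\t\tENSP00000355627:259M>T\t"
    "deleterious_sift_or_polyphen\t\trs699\tSAMPLE1:2,SAMPLE2:1"
)
row = parse_haplo_line(example_line)
print(row.transcript, row.protein_flags, row.sample_counts)
```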

The altered haplotype sequences can be obtained by switching to JSON output using --json which will display them by default. Each transcript analysed is summarised as a JSON object written to one line of the output file.

The JSON output structure matches the format of the transcript haplotype REST endpoint.
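Because each transcript occupies one line, the JSON output can be processed as a stream rather than loaded whole. The Python sketch below assumes the top-level keys cds_haplotypes and protein_haplotypes, taken from the REST endpoint description in this README; the function names and the sample records are hypothetical.

```python
import io
import json


def iter_transcripts(handle):
    """Yield one parsed JSON object per non-empty line of haplo --json output."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_haplotypes(record):
    """Count entries in the two haplotype arrays, tolerating missing keys."""
    return (
        len(record.get("cds_haplotypes", [])),
        len(record.get("protein_haplotypes", [])),
    )


# Made-up stand-in for an output file: two transcripts, one per line
fake_output = io.StringIO(
    '{"cds_haplotypes": [{}, {}], "protein_haplotypes": [{}]}\n'
    "\n"
    '{"cds_haplotypes": [], "protein_haplotypes": []}\n'
)
counts = [count_haplotypes(record) for record in iter_transcripts(fake_output)]
print(counts)
```

With a real file, the StringIO object would be replaced by an open file handle.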

You may exclude fields in the JSON from being exported with --dont_export field1,field2. This may be used, for example, to exclude the full haplotype sequence and aligned sequences from the output with --dont_export seq,aligned_sequences.

Note JSON output does not currently include side-loaded frequency data.

REST service

The transcript haplotype REST endpoint returns arrays of protein_haplotypes and cds_haplotypes for a given transcript. The default haplotype record includes:

population_counts: the number of times the haplotype is seen in each population
population_frequencies: the frequency of the haplotype in each population
contributing_variants: variants contributing to the haplotype
diffs: differences between the reference and this haplotype
hex: the md5 hex of this haplotype sequence
other_hexes: the md5 hex of other related haplotype sequences (CDSHaplotypes that translate to this ProteinHaplotype, or the ProteinHaplotype representing the translation of this CDSHaplotype)
has_indel: does the haplotype contain insertions or deletions
type: the type of haplotype - cds, protein
name: a human readable name for the haplotype (sequence id + REF or a change description)
flags: flags for the haplotype
frequency: haplotype frequency in full sample set
count: haplotype count in full sample set

The REST service does not return raw sequences, sample-haplotype assignments or the aligned sequences used to generate differences by default.
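Once a response has been decoded, the record fields listed above can be used to narrow down haplotypes of interest. The Python sketch below assumes that frequency is numeric and that flags is a list of strings; the function name select_haplotypes and the sample records are hypothetical and do not come from a real response.

```python
def select_haplotypes(haplotypes, min_frequency=0.0, required_flag=None):
    """Keep haplotype records at or above a frequency that carry a given flag."""
    selected = []
    for hap in haplotypes:
        if hap.get("frequency", 0.0) < min_frequency:
            continue
        if required_flag is not None and required_flag not in hap.get("flags", []):
            continue
        selected.append(hap)
    return selected


# Made-up records using field names from the list above
protein_haplotypes = [
    {"name": "hap1", "frequency": 0.40, "flags": [], "has_indel": False},
    {"name": "hap2", "frequency": 0.02, "flags": ["indel", "frameshift"], "has_indel": True},
    {"name": "hap3", "frequency": 0.10, "flags": ["stop_changed"], "has_indel": False},
]

common = select_haplotypes(protein_haplotypes, min_frequency=0.05)
frameshifted = select_haplotypes(protein_haplotypes, required_flag="frameshift")
print([hap["name"] for hap in common])
print([hap["name"] for hap in frameshifted])
```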

Flags

Haplotypes may be flagged with one or more of the following:

indel: haplotype contains an insertion or deletion (indel) relative to the reference.
frameshift: haplotype contains at least one indel that disrupts the reading frame of the transcript.
resolved_frameshift: haplotype contains two or more indels whose combined effect restores the reading frame of the transcript.
stop_changed: indicates either a STOP codon is gained (protein truncating variant, PTV) or the existing reference STOP codon is lost.
deleterious_sift_or_polyphen: haplotype contains at least one single amino acid substitution event flagged as deleterious (SIFT) or probably damaging (PolyPhen2).
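The relationship between the indel, frameshift and resolved_frameshift flags can be illustrated with simple arithmetic on indel lengths. The Python sketch below is an interpretation of the definitions above, not haplo's implementation: the function name frame_flags is hypothetical, and it assumes that frameshift and resolved_frameshift are mutually exclusive, which this README does not state.

```python
def frame_flags(indel_length_changes):
    """Derive indel-related flags from per-indel length changes in bases.

    Positive values are insertions and negative values are deletions.
    """
    flags = []
    if not indel_length_changes:
        return flags
    flags.append("indel")
    # An indel disrupts the frame when its length is not a multiple of three
    disrupting = [change for change in indel_length_changes if change % 3 != 0]
    if disrupting:
        if sum(indel_length_changes) % 3 == 0:
            # The combined effect of all indels restores the reading frame
            flags.append("resolved_frameshift")
        else:
            flags.append("frameshift")
    return flags


print(frame_flags([3]))      # in-frame insertion
print(frame_flags([1]))      # single-base insertion
print(frame_flags([1, -1]))  # insertion followed by a compensating deletion
```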

bioperl-ext

haplo can make use of a fast compiled alignment algorithm from the bioperl-ext package; this can speed up analysis, particularly in longer transcripts where insertions and/or deletions are introduced. The bioperl-ext package is no longer maintained and requires some tweaking to install. The following instructions install the package in $HOME/perl5; edit PREFIX=[path] to change this. You may also need to edit the export command to point to the path created for the architecture on your machine.

git clone https://github.com/bioperl/bioperl-ext.git
cd bioperl-ext/Bio/Ext/Align/
perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL
perl Makefile.PL PREFIX=~/perl5
make
make install
cd -
export PERL5LIB=${PERL5LIB}:${HOME}/perl5/lib/x86_64-linux-gnu/perl/5.22.1/

If successful the following should print OK:

perl -MBio::Tools::dpAlign -e"print qq{OK\n}"

Variant Recoder

variant_recoder is a tool for translating between different variant encodings. It accepts as input any format supported by VEP (VCF, variant ID, HGVS), with extensions to allow for parsing of potentially ambiguous HGVS notations. For each input variant, variant_recoder reports all possible encodings including variant IDs from all sources imported into the Ensembl database and HGVS (genomic, transcript and protein), reported on Ensembl, RefSeq and LRG sequences.

Usage

variant_recoder depends on database access for identifier lookup and, unlike VEP, cannot be used in offline mode. The output format is JSON and the JSON perl module is required.

./variant_recoder --id [input_data_string]
./variant_recoder -i [input_file] --species [species]

Output

Output is a JSON array of objects, one per input variant, with the following keys:

input: input string
id: variant identifiers
hgvsg: HGVS genomic nomenclature
hgvsc: HGVS transcript nomenclature
hgvsp: HGVS protein nomenclature
spdi: Genomic SPDI notation
vcf_string: VCF format (optional)
var_synonyms: Variation synonyms (optional)
mane_select: MANE Select transcripts (optional)
warnings: Warnings generated e.g. for invalid HGVS

Use --pretty to pre-format and indent JSON output.

Example output:

./variant_recoder --id "AGT:p.Met259Thr" --pretty
[
   {
     "warnings" : [
         "Possible invalid use of gene or protein identifier 'AGT' as HGVS reference; AGT:p.Met259Thr may resolve to multiple genomic locations"
      ],
     "C" : {
        "input" : "AGT:p.Met259Thr",
        "id" : [
           "rs699",
           "CM920010",
           "COSV64184214"
        ],
        "hgvsg" : [
           "NC_000001.11:g.230710048A>G"
        ],
        "hgvsc" : [
           "ENST00000366667.6:c.776T>C",
           "ENST00000679684.1:c.776T>C",
           "ENST00000679738.1:c.776T>C",
           "ENST00000679802.1:c.776T>C",
           "ENST00000679854.1:n.1287T>C",
           "ENST00000679957.1:c.776T>C",
           "ENST00000680041.1:c.776T>C",
           "ENST00000680783.1:c.776T>C",
           "ENST00000681269.1:c.776T>C",
           "ENST00000681347.1:n.1287T>C",
           "ENST00000681514.1:c.776T>C",
           "ENST00000681772.1:c.776T>C",
           "NM_001382817.3:c.776T>C",
           "NM_001384479.1:c.776T>C"
        ],
        "hgvsp" : [
           "ENSP00000355627.5:p.Met259Thr",
           "ENSP00000505981.1:p.Met259Thr",
           "ENSP00000505063.1:p.Met259Thr",
           "ENSP00000505184.1:p.Met259Thr",
           "ENSP00000506646.1:p.Met259Thr",
           "ENSP00000504866.1:p.Met259Thr",
           "ENSP00000506329.1:p.Met259Thr",
           "ENSP00000505985.1:p.Met259Thr",
           "ENSP00000505963.1:p.Met259Thr",
           "ENSP00000505829.1:p.Met259Thr",
           "NP_001369746.2:p.Met259Thr",
           "NP_001371408.1:p.Met259Thr"
        ],
        "spdi" : [
           "NC_000001.11:230710047:A:G"
        ]
     }
   }
]
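Output of this shape can be flattened for downstream use with a short script. The Python sketch below follows the structure of the example above, where each array entry holds a warnings list plus one object per allele; the function name summarise is hypothetical, and the embedded JSON is a trimmed copy of the example rather than fresh tool output.

```python
import json


def summarise(recoder_json):
    """Return one summary dict per allele found in variant_recoder JSON output."""
    summary = []
    for entry in json.loads(recoder_json):
        warnings = entry.get("warnings", [])
        for allele, data in entry.items():
            # Skip non-allele keys such as "warnings", whose value is a list
            if not isinstance(data, dict):
                continue
            summary.append(
                {
                    "input": data.get("input"),
                    "allele": allele,
                    "hgvsg": data.get("hgvsg", []),
                    "warnings": warnings,
                }
            )
    return summary


# Trimmed copy of the example output shown above
example_json = """
[
  {
    "warnings": ["Possible invalid use of gene or protein identifier 'AGT' as HGVS reference; AGT:p.Met259Thr may resolve to multiple genomic locations"],
    "C": {
      "input": "AGT:p.Met259Thr",
      "id": ["rs699", "CM920010", "COSV64184214"],
      "hgvsg": ["NC_000001.11:g.230710048A>G"],
      "spdi": ["NC_000001.11:230710047:A:G"]
    }
  }
]
"""
result = summarise(example_json)
for item in result:
    print(item["input"], item["allele"], item["hgvsg"])
```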

Options

variant_recoder shares many of the same command line flags as VEP. Others are unique to variant_recoder.

-id|--input_data [input_string]: a single variant as a string.
-i|--input_file [input_file]: input file containing one or more variants, one per line. Mixed formats disallowed.
--species: species to use (default: homo_sapiens).
--grch37: use GRCh37 assembly instead of GRCh38.
--genomes: set database parameters for Ensembl Genomes species.
--pretty: write pre-formatted indented JSON.
--fields [field1,field2]: limit output fields. Comma-separated list, one or more of: id, hgvsg, hgvsc, hgvsp, spdi.
--vcf_string : report VCF
--var_synonyms : report variation synonyms
--mane_select : report MANE Select transcripts in HGVS format
--host [db_host]: change database host from default ensembldb.ensembl.org (UK); geographic mirrors are useastdb.ensembl.org (US East Coast) and asiadb.ensembl.org (Asia). --user, --port and --pass may also be set.
--pick, --per_gene, --pick_allele, --pick_allele_gene, --pick_order: set and customise transcript selection process, see VEP documentation
