Minotaor

Minotaor is an amino acid sequence annotator for quickly identifying common protein tags and linkers in an ORF. Additionally, it can flag peptide motifs that are known to cause problems during translation. It uses Biopython.

Background

In the PROSITE nomenclature, a sequence motif is a description of the occurrence of amino acids (signature, fingerprint), and can be either a pattern or a profile. A pattern is a qualitative description of a motif in a regular expression-like syntax. A profile (or weight matrix) is a table of position-specific amino acid weights and gap costs. Many independent binary classifications can be applied to these motifs: is tag? is linker? is epitope? etc..

Install

pip install minotaor

Usage

import minotaor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

protein = Seq("SYYHHHHHHHDYPTEGKSSGSGSESKST*EDINBURGHGENQMEFQUNDRY*")
protein_record = SeqRecord(protein, id="example", annotations={"molecule_type": "protein"})

protein_record = minotaor.annotate_record(protein_record)  # search is case sensitive

Plotting

Plotting requires DNA Features Viewer installed:

graphic_record = minotaor.MinotaorTranslator().translate_record(protein_record)
ax, _ = graphic_record.plot(figure_width=10, strand_in_label_threshold=7)
graphic_record.plot_sequence(ax)

The annotation classes are colored as follows:

Tags
Linkers
Warnings (potential issues, translation pauses etc.)
Errors and stop codons
All other classes (protein domain, epitope, CDR3 region, motif etc.)

Reference datasets

Minotaor can use custom reference pandas dataframes, specified with seq_dataset. The sequence and name columns are used for search and naming of the motifs. This is shown in the below examples of epitope datasets.

Immune Epitope Database

Download and unzip a CSV Metric Export of your choice from the IEDB website, then:

import pandas
iedb_file = 'epitope_full_v3.csv'
iedb = pandas.read_csv(iedb_file, skiprows=1, dtype='str')
iedb.rename(columns={"Epitope IRI": "name", "Description": "sequence"}, inplace=True)
# The dataframe can be used as shown above (note: this is a huge dataset):
protein_record = minotaor.annotate_record(protein_record, seq_dataset=iedb)

The epitopes can then be looked up with the link provided as their names. Alternatively, a more informative epitope name can be constructed from the other columns, as shown in the next section.

VDJdb

Download and unzip the latest VDJdb release, then:

import pandas
vdjdb_file = 'vdjdb-YYYY-MM-DD/vdjdb.slim.txt'
vdjdb = pandas.read_csv(vdjdb_file, sep='\t')
# Create a unique subset of the epitopes:
vdjdb_dataset = vdjdb.copy(deep=True)
vdjdb_dataset.drop_duplicates(subset=['antigen.gene'], inplace=True, ignore_index=True)
vdjdb_dataset['sequence'] = vdjdb_dataset['antigen.epitope']  # or 'cdr3' for antibodies
vdjdb_dataset['name'] = ['VDJdb epitope ' + str(antigen)
                         for antigen in vdjdb_dataset['antigen.gene'].to_list()]

protein_record = minotaor.annotate_record(protein_record, seq_dataset=vdjdb_dataset)

The motifs then can be looked up in VDJdb for more details. A similar approach can be used for the McPAS-TCR database.

iGEM

Sequences can also be annotated with parts in the Registry of the International Genetically Engineered Machine (iGEM) Foundation. A FASTA of all parts can be downloaded and prepared into a reference dataset as shown in examples/igem.py. Alternatively, the exported MySQL database can be prepared in a similar way.

PROSITE

A function is provided for reading ScanProsite results. Query your sequence with Biopython:

from Bio.ExPASy import ScanProsite
my_seq = 'MYHHHHHHYAGDLPGLMDGAAAGGGA'  # cannot contain '*'
scanprosite_handle = ScanProsite.scan(seq=my_seq, mirror='https://prosite.expasy.org/', output='xml')
scanprosite_record = ScanProsite.read(scanprosite_handle)

protein_record = SeqRecord(Seq(my_seq), id="my_seq", annotations={"molecule_type": "protein"})
protein_record = minotaor.add_scanprosite_results(protein_record, scanprosite_record)

Conversion

Convert between PROSITE and regex formats:

regex = minotaor.convert_prosite_to_regex("<A-[GV]-{PR}-[FYW](2)-{P}(4)-x-x(8)>.")
regex
# '^A[GV][^PR][FYW]{2}[^P]{4}[^\\*][^\\*]{8}$'
minotaor.convert_regex_to_prosite(regex)
# '<A-[GV]-{PR}-[FYW](2)-{P}(4)-x-x(8)>.'

InterPro

InterPro provides functional analysis of protein sequences, using several databases. We can annotate our protein with the results, as shown below:

import minotaor
from Bio import SearchIO, SeqIO

interpro = SearchIO.read(handle, 'interproscan-xml')  # handle is an InterProScan xml file
seqrecord = SeqIO.read("protein.fasta", "fasta")

seqrecord_annotated = minotaor.add_interpro(seqrecord, interpro, hit_types=['phobius'])

Short linear motifs (SLiMs)

The Eukaryotic Linear Motif (ELM) database is a manually curated collection of experimentally validated SLiMs. The ELM API returns a tab-separated values (TSV) text file that can be used for annotation:

seqrecord_annotated = minotaor.add_elm_tsv(seqrecord, elm_tsv="elm.tsv")

Alternatively, TSV files of the datasets can be downloaded then edited for use with annotate_record(seqrecord, seq_dataset=). Note that ELM indexing starts from 1.

Other

Given a DNA sequence, return amino acid sequences that may contain it:

bsmbi_site = "CGTCTC"
print(minotaor.convert_dna_to_aa_pattern(bsmbi_site))
# ['RL', '[DATRSICYNLFPHVG]V[S]', '[SPAT]S[RPLQH]', 'ET', '[MAT*RSPKLEVQGW]R[R]', '[G*R]D[DAVEG]']

Returns a regex for each of the 6 translation frames. Note that the search with regex is case sensitive.

Compute % content of selected amino acids:

minotaor.add_aa_content(seqrecord, aa=["D", "E"], window_size=10, cutoff=0.3, name="Acidic sequence"):

Useful for finding proline-rich (P) or acidic sequences (D/E) that may interfere with translation.

Versioning

Minotaor uses the semantic versioning scheme.

License = MIT

Minotaor is free software, which means the users have the freedom to run, copy, distribute, study, change and improve the software.

Minotaor was written at the Edinburgh Genome Foundry by Peter Vegh and is released under the MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 86 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
images		images
minotaor		minotaor
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pypi-readme.rst		pypi-readme.rst
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Minotaor

Background

Install

Usage

Plotting

Reference datasets

Immune Epitope Database

VDJdb

iGEM

PROSITE

Conversion

InterPro

Short linear motifs (SLiMs)

Other

Versioning

License = MIT

About

Uh oh!

Releases 4

Uh oh!

Languages

License

Edinburgh-Genome-Foundry/Minotaor

Folders and files

Latest commit

History

Repository files navigation

Minotaor

Background

Install

Usage

Plotting

Reference datasets

Immune Epitope Database

VDJdb

iGEM

PROSITE

Conversion

InterPro

Short linear motifs (SLiMs)

Other

Versioning

License = MIT

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Uh oh!

Languages