This repository collects contributions related to the "Annotations on Structures" topic of the COVID-19 Biohackathon (April 5-11, 2020).
The context is SWISS-MODEL's involvement in an EU project to combat COVID-19. To accelerate our plan to map relevant annotations onto SARS-CoV-2 protein structures, we are collecting tools and platforms that can automatically generate such annotations from the latest data.
We mainly hope to receive two types of contributions:
- Find or generate relevant sequence data (see the issues list for ideas) to be displayed on structures (see the section on SWISS-MODEL's annotation system). This should be scripted so that the latest data can be fetched automatically.
- Write reusable scripts to map the sequence data onto the frame of reference of the proteins (this may require translating positions on the genome to positions on the SARS-CoV-2 proteins as listed here). These scripts are meant to be used by the scripts from the first point.
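The genome-to-protein translation mentioned in the second point can be sketched as follows. This is a minimal illustration and not part of this repository: the function name `genome_to_protein_pos` is hypothetical, and the CDS coordinates are assumed to match the RefSeq record NC_045512.2, so they should be checked against the current annotation before use.

```python
from typing import List, Optional, Tuple

# Assumed CDS coordinates (1-based, inclusive) on NC_045512.2; verify before use.
# ORF1ab is translated with a -1 ribosomal frameshift, so nucleotide 13468
# is read twice and therefore appears in both segments.
ORF1AB_SEGMENTS = [(266, 13468), (13468, 21555)]
SPIKE_SEGMENTS = [(21563, 25384)]


def genome_to_protein_pos(nt_pos: int,
                          cds_segments: List[Tuple[int, int]]) -> Optional[int]:
    """Return the 1-based codon index for a 1-based genome position.

    Returns None if the position lies outside all segments. The stop codon
    is part of the CDS, so it maps to an index one past the last residue.
    For a position shared by two segments, the first segment wins.
    """
    coding_nts_before = 0  # CDS nucleotides in the segments already passed
    for start, end in cds_segments:
        if start <= nt_pos <= end:
            return (coding_nts_before + nt_pos - start) // 3 + 1
        coding_nts_before += end - start + 1
    return None


if __name__ == "__main__":
    print(genome_to_protein_pos(23403, SPIKE_SEGMENTS))
```

Under these assumed coordinates, genome position 23403 falls in codon 614 of the spike CDS. Positions on ORF1ab obtained this way are already in the frame of reference of the polyprotein P0DTD1.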
Additional topics of interest:
- For visualization experts: alternative ways to visualize the protein structures.
- For RDF/JSON-LD experts: define an RDF ontology and map our JSON data (example) to RDF for use in other knowledge graph efforts. PDBj has made some efforts to map structures to RDF, but they focus on experimental metadata, whereas we consider the structural coverage of the proteins more relevant. SIFTS mappings are probably a better starting point here. With a minimal "@context" section referring to UniProt, we might also be able to turn our existing JSON into valid JSON-LD.
- For protein modelling experts: custom modelling of proteins of interest (e.g. using careful expert-curated target-template alignments or combinations of templates).
- Programming languages used within SWISS-MODEL: Python (3.6), C++
- Dealing with protein structure and sequence data: OpenStructure (example in wiki here)
Follow the biohackathon's code of conduct and this project's contribution guidelines.
NOTE: this is work-in-progress and subject to change.
Users can upload annotations via the SWISS-MODEL beta server: https://beta.swissmodel.expasy.org/repository/covid_annotation_upload (a list of projects for registered users can be found here).
Both the user annotations and the display of the viral polyprotein (R1AB_SARS2) are still work in progress and may have bugs. If you find problems with these prototype SWISS-MODEL features, please add issues to this GitHub project and we will try to address them as soon as possible.
The annotation format is a plain-text format:
- One line per annotation
- Each annotation will consist of 5 or 6 comma- or tab-separated values:
  - ID (UniProtKB AC or MD5 checksum of the sequence)
  - Start position (1-based)
  - End position
  - Color value
  - Reference (optional)
  - Annotation comment
- Example (shown comma-separated):

      P0DTD1,3400,3450,#FF00FF,https://swissmodel.expasy.org/repository/,My Awesome Annotation
      P0DTC2,230,330,#FFA500,A text reference,One more!

The Annotation class available in utils facilitates creation of new annotations:

    from utils.sm_annotations import Annotation

    # generate example annotations
    annotation = Annotation()

    # Annotation of residue range with color red provided as RGB
    annotation.add("P0DTD1", (10, 20), (1.0, 0.0, 0.0), "red anno")

    # Again, annotating a range but this time we're adding a reference
    # and provide the color blue as hex
    annotation.add("P0DTD1", (21, 30), "#0000FF", "blue anno",
                   reference="https://swissmodel.expasy.org/")

    # Outputs plain text which is accepted on the covid annotation upload
    print(annotation)

    # Or directly do a post request (defaults to SWISS-MODEL beta)
    print("Visit the following url to see awesome things:")
    print(annotation.post(title="awesome things"))

The last line directly creates a new annotation project and prints its URL. An example can be viewed here.
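Annotation files produced by other tools can be sanity-checked before upload. The following is a minimal sketch of a parser for the plain-text format described above; `parse_annotation_line` is a hypothetical helper and not part of `utils`. It assumes that a line containing a tab is tab-separated, that comma-separated comments contain no commas, and that the end position is not smaller than the start position. The checks performed by the server itself may differ.

```python
from typing import Dict, Optional, Union


def parse_annotation_line(line: str) -> Dict[str, Union[str, int, None]]:
    """Split one annotation line into named fields and check the range."""
    line = line.rstrip("\r\n")
    # Prefer tabs when present, otherwise fall back to commas (assumption).
    sep = "\t" if "\t" in line else ","
    fields = [field.strip() for field in line.split(sep)]

    reference: Optional[str]
    if len(fields) == 6:
        acc, start, end, color, reference, comment = fields
    elif len(fields) == 5:
        # The optional reference is omitted.
        acc, start, end, color, comment = fields
        reference = None
    else:
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")

    first, last = int(start), int(end)  # raises ValueError if not integers
    if first < 1 or last < first:
        raise ValueError(f"invalid residue range {first}-{last}")

    return {
        "id": acc,
        "start": first,
        "end": last,
        "color": color,
        "reference": reference,
        "comment": comment,
    }


if __name__ == "__main__":
    print(parse_annotation_line("P0DTD1,10,20,#FF0000,red anno"))
```

The color value is passed through unchanged because the set of accepted color notations is not specified here.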
- UniProtKB ACs with links can be found in UniProtKB
- Our SARS-CoV-2 page shows mapping to mature proteins and the correspondence to RefSeq and GenBank.
- We also have a list of all SARS-CoV-2 proteins that shows an overview of the ACs and their structural coverage.
- For cleaved proteins, use the parent protein. For instance, an annotation on nsp3 (Non-structural protein 3) must be reported on P0DTD1 (the "parent" protein) with an offset of 818 (as nsp3 starts at position 819 of P0DTD1).
- ViralZone has a well-described overview of the proteome here.
- We propose to ignore the shorter polyprotein (P0DTC1, R1A_SARS2) as it's cleaved into the same mature proteins as the longer one (P0DTD1, R1AB_SARS2) with the exception of a very short peptide (Non-structural protein 11 (nsp11), YP_009725312.1).
- Two proteins of unknown function (P0DTD2 and P0DTD3) are missing from our SARS-CoV-2 page but can safely be used to map annotations and we will provide structures if possible.
- In addition to the SARS-CoV-2 proteins, it also makes sense to map annotations for Q9BYF1 (ACE2_HUMAN). So far this is the only virus-host interaction for which we have structural information. More interactions have been proposed (e.g. here) but we don't have structures for them (yet).
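The offset rule for cleaved proteins can be sketched in a few lines. The table and the function `to_parent_range` are hypothetical; only the nsp3 entry is taken from the text above, and the start positions of other mature proteins would have to be filled in from UniProtKB.

```python
from typing import Dict, Tuple

# Mature protein -> (parent UniProtKB AC, 1-based start within the parent).
# Only the nsp3 entry comes from the text above; add others as needed.
MATURE_PROTEINS: Dict[str, Tuple[str, int]] = {
    "nsp3": ("P0DTD1", 819),
}


def to_parent_range(name: str, start: int, end: int) -> Tuple[str, int, int]:
    """Shift a 1-based residue range on a mature protein onto its parent."""
    parent_ac, first_residue = MATURE_PROTEINS[name]
    offset = first_residue - 1  # 818 for nsp3
    return parent_ac, start + offset, end + offset


if __name__ == "__main__":
    # Residues 1-10 of nsp3 in the numbering of P0DTD1.
    print(to_parent_range("nsp3", 1, 10))
```

The returned AC and range can be used directly as the ID, start and end values of an annotation line.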
We are also actively working on extending the structural coverage of the SARS-CoV-2 proteome by using protein structure predictions from colleagues participating in CASP.
Protein structure predictions for SARS-CoV-2 have already proven useful in several research projects. A few examples that used our models:
- A potential role for integrins in host cell entry by SARS-CoV-2, Antiviral Research
- Targeting Novel Coronavirus 2019: A Systematic Drug Repurposing Approach to Identify Promising Inhibitors Against 3C-like Proteinase and 2'-O-Ribose Methyltransferase
- Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding, The Lancet
- Insilico Medicine publishes molecular structures for the key protein target of 2019-nCoV
- Targeting 2019-nCoV: GHDDI Info Sharing Portal
Thanks goes to these wonderful people (emoji key):
This project follows the all-contributors specification. Contributions of any kind welcome!