Project Blog
After presenting the proposed work, I discussed UMLS licensing with Lars Juhl-Jensen. Specifically, we discussed options for converting the SemRep annotations into a resource that could be open for public download without violating the UMLS license agreement. Based on this discussion, I have decided to generate two versions of the annotations (see details below). This approach lets me complete the project I initially proposed, producing a linked resource that complies with UMLS licensing, while also developing a version that is open and can be integrated with PubAnnotation.
SemRep Annotation Versions:
- SemRepRDF-UMLS: UMLS-Only version containing UMLS licensed vocabularies.
- SemRepRDF-LOD: A Linked Open Data (LOD) version of the annotations that does not include any licensed vocabularies/terminologies. To create this version, we will leverage the UMLS concepts to map to other resources that are not subject to licensing restrictions.
Today's work focused on figuring out whether or not mapping to UMLS-generated annotations is a violation of the license agreement. Specific discussions/findings on this topic are now documented on a separate Wiki page dedicated to licensing. We then shifted our focus to reviewing the licensing for each of the vocabularies imported by UMLS, with the goal of determining which we can keep for inclusion in the open version of the annotations. The next steps will involve choosing ontologies to use for mapping to the UMLS concept annotations. Specific discussion on this topic will also be documented on a separate Wiki page dedicated to concept mapping.
Today's effort focused on finishing the review of the individual vocabulary license agreements. We completed this work, identifying 15 source vocabularies that we can either use explicitly or use internally as a means of mapping to other open ontologies and resources. To keep this process transparent, we provide two tables. The readme and the licensing section of the Wiki contain a table that summarizes the vocabularies we have decided to include. In addition, we link to a Google sheet that lists all source vocabularies in the UMLS, with documentation of our inclusion decision for each. We also finalized the schema for representing the annotations, ensuring that we can generate output consistent with PubAnnotation. The updated schema can be found on the Home page of the Wiki. Per conversation with Jin-Dong, we have agreed to provide the following to him for inclusion in PubAnnotation:
- Two versions of JSON output using the format specified on PubAnnotation. See Figure 1 below. One version will contain only a single mapping to each annotation concept, the other will contain multiple mappings.
- A JSON file that contains only sentence annotations for each of the PubMed identifiers.
Figure 1. PubAnnotation Annotation Format
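As a rough illustration of the kind of JSON described above, the sketch below assembles one annotation document in Python. The field names (`sourcedb`, `sourceid`, `text`, `denotations`, `relations`) follow our reading of the PubAnnotation format and should be checked against the PubAnnotation documentation. The helper `build_annotation`, the example sentence, the placeholder identifier, and the concept labels are all hypothetical and are not taken from the project.

```python
import json


def build_annotation(pmid, text, denotations, relations):
    """Assemble one PubAnnotation-style annotation document as a dict.

    denotations: iterable of (id, begin, end, obj) tuples
    relations:   iterable of (id, pred, subj_id, obj_id) tuples
    """
    return {
        "sourcedb": "PubMed",
        "sourceid": str(pmid),
        "text": text,
        "denotations": [
            {"id": d_id, "span": {"begin": begin, "end": end}, "obj": obj}
            for d_id, begin, end, obj in denotations
        ],
        "relations": [
            {"id": r_id, "pred": pred, "subj": subj, "obj": obj}
            for r_id, pred, subj, obj in relations
        ],
    }


# Hypothetical example sentence; spans are character offsets into `text`.
text = "Aspirin treats headache."
doc = build_annotation(
    pmid="0000000",  # placeholder, not a real PubMed identifier
    text=text,
    denotations=[
        ("T1", 0, 7, "CONCEPT_A"),    # placeholder concept label
        ("T2", 15, 23, "CONCEPT_B"),  # placeholder concept label
    ],
    relations=[("R1", "TREATS", "T1", "T2")],
)

print(json.dumps(doc, indent=2))
```

In this sketch, the single-mapping and multiple-mapping versions would differ only in how many denotations are emitted for a given span.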
We will use the PubAnnotation built-in functionality to convert the JSON files to RDF. The primary goals for tomorrow will be drafting an email to send to Olivier explaining the decisions we made during the hackathon as well as finalizing mappings from the Semantic Network relations to the Relation and Basic Formal ontologies.
Today we worked on mapping the 68 Semantic Network relations used in the SemRep predications to relations from existing ontologies, focusing on the Relation Ontology (RO) and the Basic Formal Ontology (BFO). We quickly realized that not all of the Semantic Network relations (e.g. different_than, higher_than, lower_than) made sense to map to the RO or BFO. Specific details on the mapping can be found on the Resource Mapping Wiki page. The mapping task is very important and may require several iterations to complete. Before finalizing the mapping, we will meet with an ontologist who will verify the relation mappings we have created.
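One way to keep track of such a mapping is a small lookup table that separates relations with a target, relations deliberately left unmapped, and relations still awaiting review. The sketch below is illustrative only: the single mapped entry (`part_of` to the BFO "part of" relation) is our own example rather than a decision recorded by the project, the names `RELATION_MAP`, `NOT_MAPPABLE`, and `map_predicates` are hypothetical, and the actual mappings are those on the Resource Mapping Wiki page.

```python
# Illustrative mapping table; NOT the project's verified mappings.
RELATION_MAP = {
    "part_of": "http://purl.obolibrary.org/obo/BFO_0000050",  # BFO 'part of'
}

# Relations named above as not making sense to map to RO or BFO.
NOT_MAPPABLE = {"different_than", "higher_than", "lower_than"}


def map_predicates(predicates):
    """Sort predicate names into mapped, skipped, and pending groups.

    Returns (mapped, skipped, pending):
      mapped  - dict of predicate -> ontology relation IRI
      skipped - predicates deliberately left without a mapping
      pending - predicates with no decision recorded yet
    """
    mapped, skipped, pending = {}, [], []
    for predicate in predicates:
        key = predicate.lower()
        if key in NOT_MAPPABLE:
            skipped.append(predicate)
        elif key in RELATION_MAP:
            mapped[predicate] = RELATION_MAP[key]
        else:
            pending.append(predicate)
    return mapped, skipped, pending


mapped, skipped, pending = map_predicates(["PART_OF", "higher_than", "TREATS"])
print(mapped)
print(skipped)
print(pending)
```

Keeping a separate pending group makes it easy to see which relations still need a decision before the ontologist's review.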
Since the hackathon, much thought has been directed towards ensuring that the relations used, and more specifically the resulting triples, are biologically meaningful. To help with this process, Dr. Adrianne Stefanski was contacted and has agreed to help with the project. Dr. Stefanski will act as our biological domain expert, helping to ensure that we have represented everything in a consistent and meaningful way. Her most recent comments from our initial meeting are included below.
- "Improving the ability to identify biological relationships that are not simply DNA->RNA->Protein relationships. E.g. Bcr-Abl kinase. Using NLP it would be helpful to build a tool that could identify key terms that would allow additional/separate investigation of signaling, lipid, or carbohydrate mediated biological processes. Terms to query would include:
- "kinase", "phosphatase", "tyrosine/serine/threonine", "lipid", "steroid", "glycan", "glycoprotein", "glycolipid".
- To identify nontranscriptionally or central dogmatic regulated genes/processes/diseases".
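A very simple starting point for the term query suggested above is a case-insensitive keyword match over sentences. The sketch below is not the project's tool: `KEY_TERMS`, `PATTERN`, and `find_key_terms` are hypothetical names, "tyrosine/serine/threonine" is split into three separate terms as an assumption, and only a trailing plural "s" is handled.

```python
import re

# Terms taken from the list above; the slash-separated entry is split in three.
KEY_TERMS = [
    "kinase", "phosphatase", "tyrosine", "serine", "threonine",
    "lipid", "steroid", "glycan", "glycoprotein", "glycolipid",
]

# Whole-word, case-insensitive match with an optional plural "s".
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in KEY_TERMS) + r")s?\b",
    re.IGNORECASE,
)


def find_key_terms(sentence):
    """Return the sorted, lower-cased key terms found in a sentence."""
    return sorted({match.group(1).lower() for match in PATTERN.finditer(sentence)})


print(find_key_terms("Bcr-Abl is a constitutively active tyrosine kinase."))
print(find_key_terms("Glycolipids and lipids were measured."))
print(find_key_terms("The protein was expressed."))
```

Sentences that return a non-empty list could then be set aside for separate investigation of signaling, lipid, or carbohydrate mediated processes.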