101 add decision record and product roadmap into readthedocs #145

Merged
Changes from 11 commits
421 changes: 388 additions & 33 deletions docs/source/appendices/design_decisions.rst


102 changes: 102 additions & 0 deletions docs/source/appendices/hyperintensional_catvars.rst
@@ -0,0 +1,102 @@
.. _hyperintensiona_catvars:

Treatment of CatVars as ((Hyper)intensional) Set-Theoretic Objects
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


Catvars as Sets
@@@@@@@@@@@@@@@


**Decision:**
The group decided to formally model catvars as sets.


**Rationale:**
In `Patterson, Statz, Yin, and Mockus 2019
<https://www.nature.com/articles/s41698-018-0073-y>`_, it was observed that:

“The JAX-CKB also incorporates higher order variants to accommodate unspecified variants, such as EGFR mutant, EGFR act mut (for EGFR activating mutation), and EGFR exon 19 deletion. These higher order, or “category”, variants enable curation of data where the specific alteration present is not identified.”

This is one of the first concrete definitions of “category” variants, and the origin of the term that later evolved into categorical variants (catvars). Characterizing catvars as unspecified (and in some sense higher-order) variants distinguishes them from variants that describe specific alterations, but only by saying what they are not: they are not specific in some relevant sense. Because the problems of catvar representation arise most often in genomics knowledgebases, it was also frequently noted that catvars are often the subjects (in the grammatical sense) of genomic knowledge statements, and so serve as links to those statements. For example, consider the string “TP53 Loss” in the proposition below:


#. TP53 Loss is associated with increased risk of cancer.
#. [TP53 Loss]\ :sub:`Subject`\ [is associated with]\ :sub:`Predicate`\ [increased risk of cancer]\ :sub:`Object`\


However, this is a description of how catvars are used, not of what they are. One of the first priorities of the then-CatVar study group was therefore to describe what catvars are by collecting examples. It soon became obvious that the most commonly described catvars, such as TP53 Loss variants, EGFR Exon 18-21 Deletions, or BRAF V600E, describe classes of possible assayed variants: variants directly observed from a sequencing assay. Indeed, many example catvars contain hyper-cosmologically large numbers of possible members [#]_, or are even infinitely large sets, such as any set of insertions with no length constraint. Even fairly specific variants in knowledgebases can be understood to represent a set of assayed variants; for example, a specific nucleotide variant may correspond to multiple variant records in the context of different transcripts or genome reference builds. After much discussion, it became clear that all catvars can be described as sets of contextual variants [#]_, where, in contrast to its member variants, the catvar itself is contextualized not to a patient or genome but to a knowledgebase.


Catvars as Intensional Sets
@@@@@@@@@@@@@@@@@@@@@@@@@@@


**Decision:**
The group decided to model CatVars as intensional set objects rather than as extensional sets.


**Rationale:**

There are two ways to model these sets: as extensional set objects or as intensional set objects.

.. figure:: ../images/intensional-vs-extensional-sets.png
   :alt: Graphical description of defining sets extensionally (by their members) vs intensionally (by the common properties of their members).
   :align: center

   Figure 1: Extensional vs intensional sets


**Catvars as Extensional Set Objects**

As seen in Figure 1, you *can* define a set by its members, its *extensions*. In terms of types and matching, this is often the simplest approach, as the members can be of any arbitrary type. Matching a target to the set simply consists of checking members one by one until you either find the target among them, in which case it is a member of the set, or run out of members to check, in which case it is *not* a member.
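
A minimal sketch of extensional matching follows, assuming members are plain strings. The function name ``is_member_extensional`` and the member labels are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Iterable


def is_member_extensional(target: Hashable, members: Iterable[Hashable]) -> bool:
    """Check members one by one until the target is found or none remain."""
    for member in members:
        if member == target:
            return True
    return False


# Hypothetical catvar defined purely by listing its members (illustrative labels).
catvar_members = ["variant-on-transcript-A", "variant-on-GRCh37", "variant-on-GRCh38"]

print(is_member_extensional("variant-on-GRCh38", catvar_members))
print(is_member_extensional("some-other-variant", catvar_members))
```

The cost of a failed match grows with the number of listed members, which is what makes this approach sensitive to the size of the set.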

There are, however, a number of implementation challenges with purely extensional sets in this context. One clear problem, in our view sufficient to rule out this approach, is that, as observed above, the cardinality of many catvars (the number of members in the set) is hyper-cosmologically large, or not finite at all. Looping through and checking every member of such a set is computationally impractical [#]_ in the first case and impossible in the second. An extensional approach to representing catvars is therefore a nonstarter.
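
The scale involved can be checked with a short back-of-the-envelope calculation. The sketch below uses the figures given in the footnotes (10\ :sup:`125` candidate members, 10\ :sup:`80` atoms, one trillion checks per second per atom); the age of the universe, taken here as roughly 13.8 billion years, is an assumption not stated in the text.

```python
# Figures taken from the footnotes of this document.
CANDIDATE_MEMBERS = 10**125   # possible "EGFR exon 19 sequence variants"
ATOMS_IN_UNIVERSE = 10**80    # liberal upper estimate
CHECKS_PER_SECOND = 10**12    # one trillion checks per second, per atom

# Assumption (not from the text): the universe is about 13.8 billion years old.
AGE_OF_UNIVERSE_YEARS = 13.8e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Total seconds needed if every atom checks members in parallel.
seconds_needed = CANDIDATE_MEMBERS // (ATOMS_IN_UNIVERSE * CHECKS_PER_SECOND)

# Express that duration as a multiple of the age of the universe.
ages_of_universe = seconds_needed / (AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR)

print(f"{ages_of_universe:.1e} times the age of the universe")
```

Under these assumptions the result is on the order of 10\ :sup:`15` ages of the universe, in line with the estimate given in the footnote.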

**Catvars as Intensional Set Objects**

An alternative is to represent catvars as intensional set objects, that is, to define the set according to the common properties of its members. This approach is more complex with respect to matching, but has the potential to be much more efficient than the extensional approach. Whereas matching to an extensional set simply means comparing each member of the set against a target, matching to an intensional set requires matching the set’s properties (also called constraints on membership) against the target’s properties. This is more complicated because it requires additional abstract data types over these properties, and more sophisticated techniques to efficiently parse out and match (sets of) properties in non-trivial cases.
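
As a rough illustration of intensional matching, the sketch below treats each membership constraint as a predicate over a variant's properties. The class ``IntensionalSet``, the property keys (``gene``, ``exon``, ``change``), and the example values are all hypothetical; they are not the constraint classes defined by the Cat-VRS specification.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Tuple

# A constraint is a predicate over a variant's properties (hypothetical shape).
Constraint = Callable[[Mapping[str, object]], bool]


@dataclass(frozen=True)
class IntensionalSet:
    """A set defined by membership constraints rather than by listed members."""

    constraints: Tuple[Constraint, ...]

    def matches(self, variant: Mapping[str, object]) -> bool:
        # A variant is a member only if it satisfies every constraint.
        return all(constraint(variant) for constraint in self.constraints)


# Illustrative constraints for a catvar resembling "EGFR exon 19 deletion".
egfr_exon19_deletions = IntensionalSet(
    constraints=(
        lambda v: v.get("gene") == "EGFR",
        lambda v: v.get("exon") == 19,
        lambda v: v.get("change") == "deletion",
    )
)

observed = {"gene": "EGFR", "exon": 19, "change": "deletion", "length": 15}
other = {"gene": "EGFR", "exon": 19, "change": "insertion", "length": 3}

print(egfr_exon19_deletions.matches(observed))
print(egfr_exon19_deletions.matches(other))
```

The cost of a match here depends on the number of constraints (three), not on the number of variants that could satisfy them.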

However, the advantage here is that while many catvars have an infinite number of potential extensions, these same catvars are readily describable by a small (finite!) number of properties. As shown in Figure 2, this approach also proved effective in a previous typological analysis of catvars in the CIViC knowledgebase, where the columns correspond to a coarse-grained set of intensional properties required for catvar membership. Based on this and related considerations, it was proposed to model catvars as intensional set objects.


.. figure:: ../images/Typology-of-CatVars-in-CIViC.png
   :alt: A graph showing how catvars in the CIViC knowledgebase compare or differ based on what sequence is being referenced (systemic, genomic, transcript, and/or protein) and how they differ from the reference (a change in sequence, location, reading frame quantity, or function).
   :align: center

   Figure 2: A coarse-grained typology of catvars in the CIViC knowledgebase


Catvars as Hyperintensional Set Objects
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


**Decision:**
The group decided to adopt a hyperintensional semantic model for catvars (as opposed to a merely intensional model).

**Rationale:**
`Hyperintensionality
<https://plato.stanford.edu/entries/hyperintensionality/>`_ is a property of a model wherein two sets, though identical with respect to their intensions, are treated as distinct sets that can be differentiated.

An intensional model will not suffice for catvars, which reflects a more general trend in the formal modelling of human knowledge and beliefs. The issue is that two catvars can have the same properties and yet be useful to distinguish. To illustrate the intuition with an example from mathematics, consider the following two propositions: ‘2 + 2 = 4’ and ‘Fermat’s Last Theorem is true.’ Both propositions share the same relevant property: each is consistent with the axioms of number theory, and is true. However, between about 1637, when Fermat stated his theorem, and 1994, when it was proven, no human knew whether the second was true. This matters because, in a merely intensional model of human knowledge, both propositions have the same properties, so knowing one (that ‘2 + 2 = 4’ is true) would necessarily imply knowing the other (that ‘Fermat’s Last Theorem is true’ is true). This was obviously not the case for three and a half centuries, so human knowledge cannot be modelled merely intensionally; it requires hyperintensionality.

Another example of hyperintensionality in human belief comes from two logically interchangeable beliefs: ‘The cup is half full’ and ‘the cup is half empty’. In any situation where the first obtains, so does the second. And yet we usefully distinguish people who hold the former belief as optimists from those who espouse the latter as killjoys.


In the realm of genomics, the same holds. Catvars are labels created by humans to reflect certain beliefs about distinctions of clinical and/or methodological importance. Consider the four catvars in example (1) below. All four labels represent the same underlying protein missense variant, in the context of different nomenclature systems.

#. Several catvars:

   #. 7-140753336-A-T
   #. NM_004333.6:c.1799T>A
   #. rs113488022
   #. BRAF p.V600E


While this may at first blush appear to be merely a use case for variation normalization, different names for the same catvar can have vastly different kinds of knowledge associated with them. Suppose we have three labels that, as in the example above, represent the same underlying genomic variation. The first is an HGVS expression denoting a large nucleotide sequence deletion, the second is an ISCN string denoting the loss of a cytoband region, and the third denotes a star allele. Even though our hypothetical variant in the genome is exactly the same in all three cases, these labels imply very different things about it. The HGVS and ISCN labels are clues as to which laboratory techniques (molecular vs cytogenetic) can be used to observe the variant. Similarly, the star allele label specifically implies a link to the pharmacogenomic consequences of the variant, knowledge that may well not be attached to either of the first two labels. A hyperintensional model can represent these three catvars in parallel and still be compatible with future normalization/harmonization. A merely intensional model would not afford us this flexibility.
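
One way to picture the distinction is as objects that compare by identity rather than by content. In the sketch below, the class ``LabelledCatVar``, its fields, and the example strings are hypothetical illustrations and not part of the Cat-VRS data model.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List


@dataclass(eq=False)  # eq=False: instances compare by identity, not by content
class LabelledCatVar:
    label: str
    intension: FrozenSet[str]                 # the defining properties
    knowledge: List[str] = field(default_factory=list)

    def co_intensional_with(self, other: "LabelledCatVar") -> bool:
        """True when both catvars are defined by the same properties."""
        return self.intension == other.intension


# Three hypothetical labels for the same underlying genomic variation.
shared = frozenset({"hypothetical: same underlying deletion"})
hgvs = LabelledCatVar("HGVS deletion expression", shared, ["observable by molecular assay"])
iscn = LabelledCatVar("ISCN cytoband loss", shared, ["observable by cytogenetic assay"])
star = LabelledCatVar("star allele", shared, ["pharmacogenomic consequences"])

print(hgvs.co_intensional_with(iscn))  # same intension
print(hgvs == iscn)                    # yet still distinct objects
print(len({hgvs, iscn, star}))         # all three kept side by side
```

Because the three objects remain distinct, each can carry its own associated knowledge, while ``co_intensional_with`` still leaves room for later normalization or harmonization.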

Therefore, based on the above considerations, it was proposed to model catvars as hyperintensional set objects.

.. rubric:: Footnotes

.. [#] Consider for example the catvar “EGFR exon 19 sequence variants”, of which there are some 10\ :sup:`125` possible members, assuming only length-preserving variants. For a sense of scale, a liberal upper estimate on the number of atoms in the observable universe is on the order of 10\ :sup:`80`.

.. [#] Even a single, specific variant, for example a novel genomic mutation that has only ever been recorded in one transcript and in one patient, can correspond to a categorical variant once it is removed from the context of the individual patient. It is then described as a singleton set, that is, a set with only a single member. Therefore, even singletons can be handled via a set-theoretic modelling approach.

.. [#] Returning to the example of “EGFR exon 19 sequence variants”, if every atom in the universe were a computer, and each of those computers were capable of checking 1 trillion members per second, it would still take approximately 2,500,000,000,000,000 times the current age of the universe to check every possible member of that set.
2 changes: 1 addition & 1 deletion docs/source/getting_involved.rst
@@ -19,5 +19,5 @@ can get involved:

.. image:: images/cat-vrs-transparent-bg.png
:width: 50%
:alt: An irresistably cute kittynaut beckoning you to enter the Cat-VRS.
:alt: An irresistably cute cat-stronaut beckoning you to enter the Cat-VRS.
:align: center
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -34,6 +34,7 @@ Caveat emptor.
concepts/index
impl-guide/index
releases/index
roadmap/index
appendices/index


62 changes: 62 additions & 0 deletions docs/source/roadmap.rst
@@ -0,0 +1,62 @@
Cat-VRS Roadmap
!!!!!!!!!!!!!!!

Cat-VRS is under active development. More information on many of our ongoing, future, and past efforts can be found below.

.. Active.

Active
@@@@@@


* Implement additional constraints, including the `CopyChangeConstraint <https://github.com/ga4gh/cat-vrs/issues/88>`_, the `FeatureContextConstraint <https://github.com/ga4gh/cat-vrs/issues/98>`_, and the `FunctionConstraint <https://github.com/ga4gh/cat-vrs/discussions/54>`_.

* `Incorporate boolean logic into the constraint model to cover complex categorical variants <https://github.com/ga4gh/cat-vrs/issues/92>`_

* `Implement all Trial Use classes in testbed <https://github.com/ga4gh/cat-vrs/issues/132>`_

* `Expand Cat-VRS examples <https://github.com/ga4gh/gks-portal/issues/19>`_ at `GA4GH pre-Connect Hackathon <https://github.com/ga4gh/gks-portal/issues?q=is%3Aissue%20state%3Aopen%20label%3AGA4GH-Connect-2025>`_


.. Planned.

Planned
@@@@@@@

* 2025 Q3 - `Trial Use Review Ballot <https://github.com/ga4gh/cat-vrs/discussions/86>`_ for additional constraint classes (copy change, feature context, and function)

* 2025 Q3 - `“Call to Action” manuscript <https://docs.google.com/document/d/1IRo2JlgIPERZeT35wFAUuldWvhk7LRM7hvAhZ98hRro/edit?tab=t.0#heading=h.llb8raw1flsa>`_: brief introduction and landscape analysis of categorical variants

* Representation of `fusions <https://github.com/ga4gh/cat-vrs/discussions/55>`_ and expression variants

* Explore forward compatibility for integration into Beacon v3 / VLM (e.g. parameterization)

.. Complete.

Complete
@@@@@@@@


* 2025 Q2 (May) - `GA4GH Product Approval <https://github.com/ga4gh/cat-vrs/issues/95>`_

* 2025 Q1 (Jan) - `Solicitation <https://docs.google.com/forms/d/1yNOvJdpp4byx1U4amZBsw72IdJZeSYLFX677HO0I3RY>`_ for `use cases and user projects <https://docs.google.com/spreadsheets/d/1N257x-PCKGZplcMVPE9j6412Ta4UyFVlXimG1l4q-Rw/edit?gid=0#gid=0>`_ for Cat-VRS implementation

* 2025 Q1 (Jan) - v0.2 draft release of reference implementation; `cat-vrs-python <https://github.com/ga4gh/cat-vrs-python/releases>`_

* 2024 Q4 (Nov) - `Trial Use Review Ballot (11/2024) <https://github.com/ga4gh/cat-vrs/releases/tag/1.0.0-ballot.2024-11.1>`_ for core classes (categorical variant, copy count constraint, defining allele constraint, and defining location constraint) and recipes (canonical allele and protein sequence consequence)

* 2024 Q4 (Nov) - v0.1 draft release of reference implementation; `cat-vrs-python <https://github.com/ga4gh/cat-vrs-python/releases>`_

* 2024 Q3 (Sept) - `Pre-release of specification for GA4GH Plenary Connect <https://github.com/ga4gh/cat-vrs/releases/tag/1.0.0-connect.2024-09.1>`_

* 2024 Q2 (May) - Adopted `constraint-based model <https://github.com/ga4gh/cat-vrs/discussions/22>`_ to describe categorical variants

* 2024 Q2 (Apr) - `Pre-release of specification for GA4GH Connect <https://github.com/ga4gh/cat-vrs/releases/tag/1.0.0.connect.2024-04.1>`_

* 2024 Q1 (Jan / Feb) - Defined `initial test set <https://docs.google.com/document/d/1aV-SqxdmuRN_EKvafzTSe0GoGC9yOzPsjrdWE0LXqYc/edit?tab=t.0>`_ of categories of variants that the Cat-VRS specification must describe

* 2023 Q4 (Oct) - Launch of `Categorical Variation study group <https://docs.google.com/document/d/1oI4ir4OzXFvhZNbMVEX-RHGAQ-d2K4lAKP-7lf-uzPc/edit?tab=t.0#heading=h.j3h3vz5k5o6l>`_ meetings

* 2023 Q4 (Oct) - `Defined initial scope of problem <https://docs.google.com/document/d/12LMQu39hRiRATNwEYRlGqU5djRQHGar27szJTXxB3JE/edit?tab=t.0>`_, as well as potential contributing producers and consumers

* 2023 Q3 (Sept) - Categorical Variation Study Group formed at GA4GH 11th Plenary