`ChemicalEntity` use cases, questions, guidance. #2151

kheal · 2024-07-25T16:56:16Z

kheal
Jul 25, 2024
Collaborator

I'm starting this discussion to attempt to compile the different known use cases for the ChemicalEntity class.

By expanding these use cases I hope we can identify (and document) our guidelines and guardrails for populating instances of this class, so that we avoid 1) duplicate records that refer to the same chemical, 2) inaccurate or insufficient records for the chemicals we intend to capture in NMDC's metadata, and 3) extra work that has already been sufficiently done by existing chemical databases.

ChemicalEntity is part of the information captured on the substances_used slot on several MaterialProcessing classes to capture solvents, reagents, proteolytic enzymes, etc. used for sample processing before DataGeneration.
- Use Case 1: Enzymes used for proteomics
  Enzymes used for proteomics will be captured in the ChemicalConversionProcess class through the substances_used slot. These proteolytic enzymes must be interpretable by the proteomics workflow, so the values need to be limited to ChemicalEntity instances that are proteolytic enzymes if the chemical_conversion_category == protease_cleavage.
  - These enzymes are well mapped out in the Proteomics Standards Initiative Ontology. They also (as far as I can tell) have CAS ids.
- Use Case 2: Extraction solvents used for NOM
  This will be captured in the DissolvingProcess class through the substances_used slot. These solvents are crucial for interpreting the NOM results and will be used for filtering and display on the UI (after MaterialProcessing metadata have been loaded).
  - These solvents and reagents are purchased, which means they should all have a CAS id.
ChemicalEntity is the range of the metabolite_identified slot on MetaboliteIdentification class to capture identified metabolites
- Use Case 3: Metabolites identified
  This will be captured in the MetaboliteIdentification class through the metabolite_identified slot. In the future we may wish to enable searching on these values to enable analyses that connect genes/proteins to metabolites (e.g. through a KEGG term). However, not all metabolites map 1:1 to KEGG IDs.
  - These compounds do not necessarily have a CAS id, as they sometimes represent unresolved structures (e.g. lipids with only the summed structure resolved). Nor can we assume they will all have KEGG IDs. RefMet is a potentially good source for these IDs, as RefMet already takes into account the level of structural annotation and is specifically designed for annotating results from mass-spectrometry-based metabolomics analyses (our same use case). Other metabolomics-specific databases are listed here.
- Use Case 4: Reference free proteomics
  Need to expand here with help from proteomics folks
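As a rough illustration of the Use Case 1 restriction, here is a minimal Python sketch of the check. The record shape (a dict with a list of `{"name": ...}` entries), the `check_substances` function, and the hard-coded allow-list are all assumptions made for the example; the real allow-list would more likely come from the Proteomics Standards Initiative ontology, and the actual NMDC structure is richer than this.

```python
# Hypothetical allow-list for illustration only; in practice this would
# likely be drawn from an ontology rather than hard-coded.
PROTEOLYTIC_ENZYMES = {"trypsin", "chymotrypsin", "lys-c"}


def check_substances(process: dict) -> list[str]:
    """Return the names in substances_used that break the protease rule.

    Assumes a simplified record shape: substances_used is a list of dicts,
    each with a 'name' key. This is not the actual NMDC structure.
    """
    if process.get("chemical_conversion_category") != "protease_cleavage":
        return []  # the restriction only applies to protease cleavage
    return [
        substance["name"]
        for substance in process.get("substances_used", [])
        if substance["name"].lower() not in PROTEOLYTIC_ENZYMES
    ]
```

An empty list means the record passes; anything returned is a substance that would need review before the record is loaded.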

Tagging several folks for input:
@lamccue, @mslarae13, @SamuelPurvine, @corilo, @sierra-moxon, @turbomam

kheal · 2024-07-25T18:04:12Z

kheal
Jul 25, 2024
Collaborator Author

I am pretty comfortable using PubChem's API to get a plethora of chemical identifiers from a search, so once we decide which ones we'd like to prioritize, I can help to write functions to start with a chemical name, gather possible IDs, check mongo for matching records, and write new records if needed.
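A minimal Python sketch of that workflow, under some loud assumptions: the lookup URL follows PubChem's PUG REST pattern for a name-to-property query, the in-memory `records` list stands in for the mongo collection, and matching on an `inchikey` field is just one possible choice of deduplication key. The function names are hypothetical.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name: str, properties=("InChIKey", "MolecularFormula")) -> str:
    """Build a PUG REST URL that looks up compound properties by name."""
    # quote(..., safe="") percent-encodes spaces and slashes in the name
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/property/{','.join(properties)}/JSON"


def find_or_add(records: list[dict], candidate: dict) -> dict:
    """Return the stored record sharing the candidate's InChIKey, else add it.

    `records` is a plain list standing in for a mongo collection; a real
    implementation would query the database instead of scanning a list.
    """
    key = candidate["inchikey"]  # assumed dedup key, required on the candidate
    for record in records:
        if record.get("inchikey") == key:
            return record  # already present, so nothing is written
    records.append(candidate)
    return candidate
```

The network call itself is left out on purpose; the URL can be fetched with any HTTP client, and which identifiers to request is exactly the prioritization question raised above.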

0 replies

turbomam · 2024-07-25T18:37:33Z

turbomam
Jul 25, 2024
Maintainer

extremely well written discussion @kheal

2 replies

turbomam Jul 25, 2024
Maintainer

Would your proposed PubChem API search over structural space at some point?

kheal Jul 25, 2024
Collaborator Author

Do you mean searching by SMILES/InChI etc? That's pretty simple with PubChem's API, so we could do that with not much effort. There are already existing python packages that help traverse PubChem, and I've done this with different inputs (formula, InChIKey, name, CAS id, etc).

turbomam · 2024-07-25T18:40:54Z

turbomam
Jul 25, 2024
Maintainer

I will have some ideas to share but I don't think we can finalize a decision until @cmungall has weighed in, possibly towards the middle of next week.

0 replies

corilo · 2024-07-30T21:29:50Z

corilo
Jul 30, 2024
Collaborator

@SamuelPurvine @mslarae13 @pdpiehowski @lamccue @brynnz22 please review and add to this draft.

To enhance the representation of chemical identification, we suggest creating an abstract MolecularData class instead of using ChemicalEntity in use cases 3 and 4. This distinction is crucial because solvents and enzymes are tangible entities used in experiments: real, physical objects such as a bottle of solvent. In contrast, the chemical information produced by the various analysis types (e.g., metabolomics, metaproteomics, lipidomics, and natural organic matter) consists of inferences about chemical entities.

The proposed MolecularData class will have subtypes for different levels of chemical characterization, such as chemical formula, chemical structure, chemical classes, and overall categories. Each subclass, such as metaproteomics, general metabolomics, lipidomics, etc., will include appropriate identifiers that align with the concept of inference, such as ChEBI, CAS, InChI, and InChIKeys.

Use Case 3: Metabolites Identified

To better capture the nature of identified metabolites in the MetaboliteIdentification class, we propose using the MolecularData class in the metabolite_identified slot. This class will include various identifiers, like ChEBI, InChI, and CAS, to account for different levels of structural annotation. Doing so ensures a more accurate representation of the chemical entities identified in metabolomics analyses.

Use Case 4: Reference-Free Proteomics

We recommend using the MolecularData class to represent inferred chemical entities for the proteomics use case. This class should accommodate different levels of chemical characterization relevant to proteomics, such as razor protein and all protein. Each subclass will include identifiers like UniProt IDs, PeptideAtlas IDs, and other proteomics-specific identifiers to reflect the inferred nature of the data.
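To make the proposal a bit more concrete, here is a rough Python sketch of how such a class hierarchy might look. Only `MolecularData`, the four characterization levels, and the identifier types come from the text above; the enum, the field names, and the `MetaboliteData` subclass are hypothetical, and the real modeling would be done in the schema rather than in Python.

```python
from dataclasses import dataclass, field
from enum import Enum


class CharacterizationLevel(Enum):
    """Levels of chemical characterization named in the proposal."""
    CHEMICAL_FORMULA = "chemical formula"
    CHEMICAL_STRUCTURE = "chemical structure"
    CHEMICAL_CLASS = "chemical class"
    OVERALL_CATEGORY = "overall category"


@dataclass
class MolecularData:
    """An inferred molecule, as opposed to a physical substance on a shelf."""
    level: CharacterizationLevel
    # Identifier scheme mapped to value, e.g. {"ChEBI": "CHEBI:15377"} for water
    identifiers: dict[str, str] = field(default_factory=dict)


@dataclass
class MetaboliteData(MolecularData):
    """Hypothetical subclass for metabolomics identifications."""
```

A lipid resolved only to its summed composition could then be recorded at the `CHEMICAL_FORMULA` level with no structure-level identifiers, instead of being forced into a record that implies a fully resolved compound.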

5 replies

brynnz22 Jul 30, 2024
Collaborator

Just speaking about use cases 1 and 2 based on @corilo's suggestion for Use Cases 3 & 4:

It sounds like if we peel out a new class (and associated subclasses) of MolecularData, then use cases 1 and 2 can (maybe) be much simpler. Can known_as in the PortionOfSubstance class (which is the range of substances_used) have a range of ControlledIdentifiedTermValue for use cases 1 and 2, instead of a range of ChemicalEntity, like what we do for the environmental terms? This would allow the enzymes and solvents to be associated with ontologies. Perhaps a CAS ID slot could be added to the PortionOfSubstance class to allow CAS IDs to be included.

SamuelPurvine Jul 30, 2024
Collaborator

I would submit that limiting Use Case 4 to only reference free proteomics would create a dichotomy in the way we treat proteomics data, and if we were to shift to a MolecularData class we would want to work to keep the subclasses as similar as possible. In a very real sense, the two approaches are highly similar: both use a collection of proteins, albeit derived from different sources; both attempt to match peptides, which are the biomolecules being detected in the instrument, to spectra within some false discovery controlled tolerance; both attempt to infer the protein that is best represented by the peptides identified; both have the ability to ascribe the presence of a functionally annotated entity (pfam, KEGG, COG, EC number, etc.) to the sample being studied via the protein inferences; both can have relative-comparison quantitative data derived from the peptide biomarkers (spectral counts or area-under-the-elution-curve); and both can have other statistics provided for the user to help filter or winnow results as per their wont.

The primary difference between the two approaches is the names being used for the proteins whose presence is inferred from the peptides matched to spectra. I 'believe' calling out reference free proteomics was to denote that while we are currently able to use GeneProduct for metagenome-dependent identifications, and thus use the modeling built around that class for free, we will need to populate some or all of that modelling when choosing a metagenome-sequencing-free mode, and ChemicalEntity will be partly or wholly inadequate (or maybe just inappropriate) to that task.

I'd also note that the above set of similarities encompasses all of the "bloat" that persists in mongo vis-à-vis proteomics data... If/when we flesh out the MolecularData class for proteomics, it may end up looking fairly similar to what we currently put into mongo. This is by way of saying that if we do in fact choose to more precisely model the molecules that are derived via detection using instrumentation, as opposed to design via protocol, we may need to re-think what is and isn't bloat, and what we actually want (and reasonably can) do with the data. If all we can/want to do with mongo, and therefore the Portal, is to denote presence of a functionally annotated entity in a given sample, then the modelling is very simple, and de-bloating can continue down the lines already proposed. But if there is vision to do more with the portal, then re-thinking would be advised.

Of course, we can always just add it back in later :) All of that data is in the static results files...

turbomam Aug 1, 2024
Maintainer

OK, there is lots of good input here. I'm looking forward to the hackathon tomorrow.

Please come prepared to work collaboratively in a new branch, in which new modeling changes would be added in baby steps, each with accompanying valid and invalid data example files.

It will also be important to demonstrate how our modelling is similar to or different from other established models. I will try to dig something up but am enthusiastic to see what others consider an established reference.

turbomam Aug 1, 2024
Maintainer

@cmungall will you be able to attend? @sierra-moxon

turbomam Aug 1, 2024
Maintainer

The proposed MolecularData class sure sounds like a DataObject, which would mention a ChemicalEntity.

kheal · 2024-08-02T18:35:17Z

kheal
Aug 2, 2024
Collaborator Author

Some decisions:

Let's try to use the ChemicalEntity class for use cases 1 and 2.
We need to remove ChemicalEntity as the range of the slots associated with use cases 3 and 4. These data need to be remodeled, likely as a slot on a ConclusionsBasedOnData or something like that.
Make nmdc unique identifiers for the ChemicalEntity class.

Let's try to use node normalizer for populating and searching ChemicalEntity instances in mongo. Katherine will work with Sierra to try to move this forward a bit.

0 replies

turbomam · 2024-08-02T18:37:49Z

turbomam
Aug 2, 2024
Maintainer

0 replies

kheal · 2024-08-02T18:38:08Z

kheal
Aug 2, 2024
Collaborator Author

Issues related to this discussion:
#2153

0 replies

kheal · 2024-08-02T19:25:56Z

kheal
Aug 2, 2024
Collaborator Author

NOW:

Use ChemicalEntity as modeled to represent the solvents/reagents/etc in the MaterialProcessing classes. @sierra-moxon and @kheal will work together to come up with a proposed SOP for populating mongo with ChemicalEntity instances, with robust checks for duplicates using existing tools.

NEXT:

Document use cases for the kinds of searches over processed NOM, metabolite, lipid, and proteomics data that users may want on the data portal. See discussion here: (@sierra-moxon please add link when you've got one).

LATER:

Use the use cases above to model summarized outputs of processed NOM, metabolite, lipid, and proteomics data that expose the elements of these data needed for the use cases.

Great discussion all. @corilo @lamccue @SamuelPurvine @turbomam @sierra-moxon please speak up if this doesn't capture the essence of what we discussed and decided today.

I'm closing this discussion "issue" thing, since the remaining discussion will be about something different [not ChemicalEntity]

0 replies
