-
I am pretty comfortable using PubChem's API to get a plethora of chemical identifiers from a search, so once we decide which ones we'd like to prioritize, I can help write functions that start with a chemical name, gather possible IDs, check Mongo for matching records, and write new records if needed.
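To make that workflow concrete, here is a rough sketch of the name → IDs → check → write loop. It assumes PubChem's PUG REST property endpoint and uses a plain dict keyed by InChIKey as a stand-in for the Mongo collection; the function names, the chosen properties, and the choice of InChIKey as the matching key are all placeholders until we decide which identifiers to prioritize.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
# Assumption: which properties to request is still an open decision.
PROPERTIES = "InChIKey,InChI,MolecularFormula"


def pubchem_url(name):
    """Build a PUG REST property-lookup URL for a chemical name."""
    return f"{PUG_REST}/compound/name/{quote(name, safe='')}/property/{PROPERTIES}/JSON"


def fetch_identifiers(name, fetch=None):
    """Return the first PubChem hit for `name` as a dict, or None if no hits."""
    if fetch is None:
        # Default fetcher does a live HTTP GET; urlopen raises HTTPError
        # when PubChem does not recognize the name.
        def fetch(url):
            with urlopen(url, timeout=30) as resp:
                return json.load(resp)
    payload = fetch(pubchem_url(name))
    hits = payload.get("PropertyTable", {}).get("Properties", [])
    return hits[0] if hits else None


def find_or_create(name, records, fetch=None):
    """Look up `name`, reuse a matching record if one exists, else add one.

    `records` is a dict keyed by InChIKey, standing in for a Mongo collection.
    """
    ids = fetch_identifiers(name, fetch)
    if ids is None:
        return None
    key = ids["InChIKey"]
    if key not in records:
        records[key] = {"name": name, **ids}
    return records[key]
```

Because matching happens on the identifier rather than the name, two synonyms that resolve to the same InChIKey end up sharing one record, which is the duplication guardrail we are after. The `fetch` argument is only there so the logic can be exercised without hitting the network.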
-
Extremely well-written discussion, @kheal.
-
I will have some ideas to share, but I don't think we can finalize a decision until @cmungall has weighed in, possibly towards the middle of next week.
-
@SamuelPurvine @mslarae13 @pdpiehowski @lamccue @brynnz22 please review and add to this draft.

To improve how chemical identifications are represented, we suggest creating an abstract MolecularData class instead of using a chemical entity in use cases 3 and 4. The distinction matters because solvents and enzymes are tangible entities used in experiments: real objects, such as a bottle of solvent. In contrast, the chemical information produced by the various analysis types (e.g., metabolomics, metaproteomics, lipidomics, and natural organic matter) consists of inferences about chemical entities. The proposed MolecularData class will have subtypes for different levels of chemical characterization, such as chemical formula, chemical structure, chemical classes, and overall categories. Each subclass (metaproteomics, general metabolomics, lipidomics, etc.) will include identifiers appropriate to the concept of inference, such as ChEBI, CAS, InChI, and InChIKeys.

Use Case 3: Metabolites Identified

To better capture the nature of identified metabolites in the MetaboliteIdentification class, we propose using the MolecularData class in the metabolite_identified slot. This class will include various identifiers, like ChEBI, InChI, and CAS, to account for different levels of structural annotation. Doing so gives a more accurate representation of the chemical entities identified in metabolomics analyses.

Use Case 4: Reference-Free Proteomics

We recommend using the MolecularData class to represent inferred chemical entities for the proteomics use case. This class should accommodate the levels of chemical characterization relevant to proteomics, such as razor protein and all protein. Each subclass will include identifiers like UniProt IDs, PeptideAtlas IDs, and other proteomics-specific identifiers to reflect the inferred nature of the data.
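As a way to picture the proposal, here is a minimal Python sketch of a MolecularData base class with a characterization level and per-subclass identifier types. Everything beyond the MolecularData name is an assumption made for illustration: the enum, the subclass names, and the `allowed_prefixes` mechanism are not part of the schema, and the real thing would presumably be expressed in the schema itself rather than in Python.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import ClassVar, Dict, FrozenSet


class CharacterizationLevel(Enum):
    """Levels of chemical characterization named in the draft."""
    FORMULA = "chemical_formula"
    STRUCTURE = "chemical_structure"
    CHEMICAL_CLASS = "chemical_class"
    CATEGORY = "overall_category"


@dataclass
class MolecularData:
    """An inferred molecule, as opposed to a physical substance on a shelf."""
    level: CharacterizationLevel
    identifiers: Dict[str, str] = field(default_factory=dict)  # id type -> value
    # Hypothetical: each subclass declares which identifier types it accepts.
    allowed_prefixes: ClassVar[FrozenSet[str]] = frozenset()

    def __post_init__(self):
        unknown = set(self.identifiers) - self.allowed_prefixes
        if unknown:
            raise ValueError(f"unsupported identifier types: {sorted(unknown)}")


@dataclass
class MetaboliteData(MolecularData):
    allowed_prefixes: ClassVar[FrozenSet[str]] = frozenset(
        {"CHEBI", "CAS", "InChI", "InChIKey"}
    )


@dataclass
class ProteinData(MolecularData):
    allowed_prefixes: ClassVar[FrozenSet[str]] = frozenset(
        {"UniProt", "PeptideAtlas"}
    )
```

With this shape, `MetaboliteData(CharacterizationLevel.STRUCTURE, {"InChIKey": "..."})` is accepted, while attaching a CAS number to a `ProteinData` raises a `ValueError`, which keeps each analysis type tied to the identifiers that make sense for it.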
-
Some decisions:
Let's try to use Node Normalizer for populating and searching.
-
Issues related to this discussion:
-
NOW:
NEXT:
LATER:
Great discussion, all. @corilo @lamccue @SamuelPurvine @turbomam @sierra-moxon please speak up if this doesn't capture the essence of what we discussed and decided today. I'm closing this discussion "issue" thing, since the remaining discussion will be about something different [not
-
I'm starting this discussion to compile the known use cases for the ChemicalEntity class. By expanding on these use cases, I hope we can identify (and document) our guidelines and guardrails for populating instances of this class, so that we avoid 1) duplicate records that refer to the same chemical, 2) inaccurate or insufficient records for the chemicals we intend to capture in NMDC's metadata, and 3) extra work on information that is already sufficiently captured by existing chemical databases.

ChemicalEntity is part of the information captured on the substances_used slot on several MaterialProcessing classes to capture solvents, reagents, proteolytic enzymes, etc. used for sample processing before DataGeneration.

Enzymes used for proteomics will be captured in the ChemicalConversionProcess class through the substances_used slot. These proteolytic enzymes must be interpretable by the proteomics workflow, so when chemical_conversion_category == protease_cleavage, the values need to be limited to ChemicalEntity instances that are proteolytic enzymes.

Solvents will be captured in the DissolvingProcess class through the substances_used slot. These solvents are crucial for interpreting the NOM results and will be used for filtering and display on the UI (after MaterialProcessing metadata have been loaded).

ChemicalEntity is the range of the metabolite_identified slot on the MetaboliteIdentification class, to capture identified metabolites.

Identified metabolites will be captured in the MetaboliteIdentification class through the metabolite_identified slot. In the future we may wish to enable searching on these values to support analyses that connect genes/proteins to metabolites (i.e., through a KEGG term). However, not all metabolites map 1:1 to KEGG IDs.

Need to expand here with help from proteomics folks.
Tagging several folks for input:
@lamccue, @mslarae13, @SamuelPurvine, @corilo, @sierra-moxon, @turbomam
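
For the protease restriction described in this post (substances_used limited to proteolytic enzymes when chemical_conversion_category == protease_cleavage), a check along these lines might be a starting point. This is only a sketch: it treats a process record as a plain dict and each substance as a bare name, and the enzyme allow-list is hypothetical. The real list would be whatever the proteomics workflow can interpret.

```python
from typing import Dict, List

# Hypothetical allow-list, for illustration only.
PROTEOLYTIC_ENZYMES = {"trypsin", "chymotrypsin", "Lys-C", "Glu-C"}


def check_protease_substances(process: Dict) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    if process.get("chemical_conversion_category") != "protease_cleavage":
        return []  # the restriction only applies to protease cleavage
    return [
        f"{name!r} is not a recognized proteolytic enzyme"
        for name in process.get("substances_used", [])
        if name not in PROTEOLYTIC_ENZYMES
    ]
```

A record with `substances_used` of `["trypsin", "methanol"]` and the protease_cleavage category would come back with one problem, for methanol; the same substances under any other category pass untouched.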