Proposed actions for de-bloat of metaproteomics data in mongo DB #2152

SamuelPurvine · 2024-08-01T17:55:37Z

SamuelPurvine
Aug 1, 2024
Collaborator

Motivation

Currently, the proteomics workflow includes in its analysis activity report a large amount of information that the portal does not use in any way and is not likely to use in the near term. The primary goal is to remove this data and the associated classes, keeping only what is necessary to associate the results from a proteomics instrument datafile with searchable functional annotation in the Portal. Additionally, we would like to harmonize the nomenclature for the parsimoniously chosen single protein mapped to a peptide, moving from best_protein to razor_protein.

Status/ As things are now:

Proteomics produces a large number of pieces of information that are placed into the json that loads data into mongo:
- id which is a NAPA compliant workflow identifier minted by the production API, example "id": "nmdc:wfmp-11-9qjy9903.1"
- name which describes the activity and references the workflow id, example "name": "Metaproteomics Analysis Activity for nmdc:wfmp-11-9qjy9903.1"
- started_at_time and ended_at_time which should be ISO 8601 format timestamps
- was_informed_by which lists the omics processing / data generation that provided the raw instrument file, example "was_informed_by": "nmdc:omprc-11-gqqn9s76"
- execution_resource describing the institution where the workflow was run, example "execution_resource": "EMSL"
- git_url that lists where the code for the workflow resides, example "git_url": "https://github.com/microbiomedata/metaPro/releases/tag/v1.2.1"
- has_input which lists the 6 files that proteomics workflow uses to run:
  - Instrument rawfile
  - Metagenome protein file (.faa)
  - Metagenome functional annotation (.gff)
  - MSGFPlus parameter file (.txt)
  - MASIC parameter file (.xml)
  - Contaminants protein collection (.fasta)
- has_output which lists the 4 files that are created by the workflow and that are loaded to minio as dataobjects
- type which denotes the enum-controlled workflow activity type, example "type": "nmdc:MetaproteomicsAnalysisActivity" (which needs to change for Berkeley to "type": "nmdc:MetaproteomicsAnalysis")
- version which denotes the workflow pipeline version that was run, example "version": "v1.2.1"
- has_peptide_quantifications which lists all of the peptides found in the workflow operation, with the following slots:
  - all_proteins which lists all of the proteins in the search fasta file that map to the reported peptide, example:
    "nmdc:wfmgan-11-b0r8xe54.1_0000001_251240_252433",
    "nmdc:wfmgan-11-b0r8xe54.1_0000001_268350_269543",
    "nmdc:wfmgan-11-b0r8xe54.1_0000117_2_124"
  - best_protein which is the protein associated with the peptide that is chosen due to parsimony rules, example "best_protein": "nmdc:wfmgan-11-b0r8xe54.1_0000004_113773_115989", for which the logic follows:
    • if a peptide maps to only one protein (i.e. is unique), then that is the best_protein
    • if a peptide belongs to more than one protein AND one and only one of those proteins has at least one other uniquely mapped peptide, then that is the best_protein
    • if a peptide maps to more than one protein and none of those proteins has a uniquely mapped peptide, then the protein with the most other mapped peptides is the best_protein
    • if a peptide maps to more than one protein, none of those proteins has a uniquely mapped peptide, and more than one of those proteins has a similar maximum number of mapped peptides, then the first protein in the search fasta (by virtue of indexing the search fasta and taking the lowest index number) is the best_protein
    • only non-unique peptides that map to more than one protein wherein at least two of those proteins have at least one uniquely mapping peptide will not be assigned to a best_protein (and ‘should’ not be being loaded into mongo)
  - min_q_value which is the lowest Q-Value assigned to all of the spectra associated with the reported peptide, example "min_q_value": 0.015646
  - “clean” peptide_sequence which is what was found by MSGFPlus wherein the prefix and suffix amino acids have been removed along with any modification symbols or verbiage, example "peptide_sequence": "IFDPFFTTK"
  - peptide_spectral_count which denotes how many MS/MS spectra (a.k.a. tandem mass spectra or MS2 spectra) were associated with this peptide in the given instrument file, example "peptide_spectral_count": 1
  - peptide_sum_masic_abundance wherein the area-under-the-LC-elution-curve abundance as observed in MS1 spectra and extracted using MASIC’s StatMomentsArea are summed in non-log-transformed form, example "peptide_sum_masic_abundance": 253820000
Almost none of the above information is reported in the portal, and possibly none at all.
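
The parsimony rules listed above for best_protein can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow's actual code: the function name `assign_best_proteins`, its two inputs (a peptide-to-proteins mapping and the protein order of the search fasta), and the reading of "similar maximum number" as an exact tie are all hypothetical.

```python
from collections import defaultdict


def assign_best_proteins(peptide_to_proteins, fasta_order):
    """Pick one best_protein per peptide following the parsimony rules.

    peptide_to_proteins: dict of peptide sequence -> list of protein ids
    fasta_order: list of protein ids in search fasta order (assumed input)
    Returns a dict of peptide sequence -> protein id, or None when unassigned.
    """
    fasta_index = {protein: i for i, protein in enumerate(fasta_order)}

    # Count mapped peptides per protein and note proteins with a unique peptide.
    peptide_counts = defaultdict(int)
    has_unique = set()
    for proteins in peptide_to_proteins.values():
        distinct = set(proteins)
        for protein in distinct:
            peptide_counts[protein] += 1
        if len(distinct) == 1:
            has_unique.update(distinct)

    best = {}
    for peptide, proteins in peptide_to_proteins.items():
        candidates = set(proteins)
        if len(candidates) == 1:
            # Rule 1: a unique peptide belongs to its only protein.
            best[peptide] = next(iter(candidates))
            continue
        with_unique = candidates & has_unique
        if len(with_unique) == 1:
            # Rule 2: exactly one candidate has a uniquely mapped peptide.
            best[peptide] = next(iter(with_unique))
        elif len(with_unique) >= 2:
            # Rule 5: two or more candidates have unique peptides; unassigned.
            best[peptide] = None
        else:
            # Rules 3 and 4: most mapped peptides wins, ties go to the
            # lowest search fasta index.
            best[peptide] = min(
                candidates,
                key=lambda p: (-peptide_counts[p], fasta_index[p]),
            )
    return best
```

Peptides that come back as None correspond to the last rule above and would not be loaded into mongo.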
The has_peptide_quantifications information is distilled by the metaproteomics aggregation table generating tool written by Shane and maintained by Alicia:
- The all_proteins lists for all of the peptides are combined and de-replicated for mapping to GeneProducts that were inserted into mongo by the metagenome annotations
- The number of peptides for each of those proteins are counted
- If any of the all_proteins list is also listed as a best_protein then it is denoted with a Boolean (named best_protein) that is specific to the aggregation table and is in contrast/collision with best_protein listed above which is a string (this may already have been fixed by calling it is_best_protein)
- The connections between the proteins found in a workflow and the annotations from the metagenome effort are imported into the portal database to allow users to search for a functional annotation and have proteomics results that have that annotation entity mapped be returned
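
The distillation steps above can be sketched as follows. This is not the aggregation tool itself; the function name `aggregate_proteins` and the output keys `gene_product`, `peptide_count`, and `is_best_protein` are assumptions chosen for illustration, and the input is assumed to be the has_peptide_quantifications list from one workflow record.

```python
from collections import Counter


def aggregate_proteins(peptide_quantifications):
    """Combine and de-replicate all_proteins, count peptides per protein,
    and flag proteins that appear as a best_protein anywhere in the run."""
    peptide_counts = Counter()
    best_proteins = set()
    for pq in peptide_quantifications:
        # set() guards against a protein being listed twice for one peptide.
        for protein in set(pq.get("all_proteins", [])):
            peptide_counts[protein] += 1
        if pq.get("best_protein"):
            best_proteins.add(pq["best_protein"])
    return [
        {
            "gene_product": protein,
            "peptide_count": count,
            "is_best_protein": protein in best_proteins,
        }
        for protein, count in sorted(peptide_counts.items())
    ]
```

Using a separate Boolean key name here avoids the collision with the string-valued best_protein slot described above.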
There is a class named PeptideQuantification which has slots (Core.yaml, line 290-298):
- all_proteins
- best_protein
- min_q_value
- peptide_sequence
- peptide_spectral_count
- peptide_sum_masic_abundance
There is a class named ProteinQuantification which has slots (Core.yaml, line 300-312):
- all_proteins
- best_protein
- peptide_sequence_count
- protein_spectral_count
- protein_sum_masic_abundance
MetaproteomicsAnalysisActivity also has a slot called has_peptide_quantifications with a range of the PeptideQuantification class (Workflow_execution_activity.yaml, lines 363-366)

Proposed changes:

Remove the slot all_proteins and all associated data from mongo
- Core.yaml, line 293
- Core.yaml, line 303
Remove the slot min_q_value and all associated data from mongo
- Core.yaml, line 295
Remove the slot peptide_sequence and all associated data from mongo
- Core.yaml, line 296
Remove the slot peptide_spectral_count and all associated data from mongo
- Core.yaml, line 297
Remove the slot peptide_sum_masic_abundance and all associated data from mongo
- Core.yaml, line 298
Remove the slot peptide_sequence_count and all associated data from mongo
- Core.yaml, line 305
Remove the slot protein_spectral_count and all associated data from mongo
- Core.yaml, line 306
Remove the slot protein_sum_masic_abundance and all associated data from mongo
- Core.yaml, line 307
Remove the slot has_peptide_quantifications
- Workflow_execution_activity.yaml, line 331
- Workflow_execution_activity.yaml, lines 363-366
Remove the class PeptideQuantification (any associated data should be removed as per above)
- Core.yaml, line 290-298
Remove the class ProteinQuantification (any associated data should be removed as per above)
- Core.yaml, line 300-312
Create a slot named razor_protein. It is not yet clear where to associate it once ProteinQuantification and PeptideQuantification are removed, possibly a MolecularData class as being proposed elsewhere
Change the workflow Peptide_Report.tsv to report razor_protein everywhere the best_protein is mentioned
Change the workflow Protein_Report.tsv to report razor_protein everywhere the best_protein is mentioned
Change the workflow QC_metrics.tsv value of BestProtein_count to RazorProtein_count
Change the analysis_activity.json report that is used to import results into mongo
- Remove all_proteins and related entries
- Remove min_q_value
- Remove peptide_spectral_count
- Remove peptide_sum_masic_abundance
- Add a dereplicated list of all razor_proteins found in the workflow run
- Looks like:
  {
  "id": "nmdc:wfmp-11-q5yxmy05.1",
  "name": "Metaproteomics Analysis Activity for nmdc:wfmp-11-q5yxmy05.1",
  "started_at_time": "2024-03-13T13:47:33-07:00",
  "ended_at_time": "2024-03-13T21:49:55+00:00",
  "was_informed_by": "nmdc:omprc-11-tcapc615",
  "execution_resource": "EMSL",
  "git_url": "https://github.com/microbiomedata/metaPro/releases/tag/2.0.0",
  "has_input": [
  "nmdc:dobj-11-xvmb4058",
  "nmdc:dobj-11-mmqbd689",
  "nmdc:dobj-11-1r9da293",
  "nmdc:dobj-11-h9637w90",
  "nmdc:dobj-11-hfx93f93",
  "nmdc:dobj-11-sprrem27"
  ],
  "has_output": [
  "nmdc:dobj-11-zywq6931",
  "nmdc:dobj-11-hkhn5580",
  "nmdc:dobj-11-zv03kg59",
  "nmdc:dobj-11-1zwp8m61"
  ],
  "type": "nmdc:MetaproteomicsAnalysis",
  "razor_protein": [
  "nmdc:wfmgas-11-anseqn83.1_5230_c1_102_1628",
  "nmdc:wfmgas-11-anseqn83.1_79553_c1_1_540",
  "nmdc:wfmgas-11-anseqn83.1_22938_c1_1_1035"
  ]
  }
- Change the metaproteomics aggregation table to only include records necessary to map functional annotations:
  - Workflow id
  - Razor_protein
  - Possibly explicitly list/upload the functional annotations if/when metagenome-free proteomics (a.k.a. Kaiko, version 2) is implemented?
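
A rough sketch of how an existing record could be trimmed to the proposed analysis_activity.json shape is given below. It assumes the current record carries best_protein inside each has_peptide_quantifications entry, as described under Status; the function name `debloat_activity` is hypothetical, and this is not the migrator or the workflow code.

```python
def debloat_activity(record):
    """Return a copy of a metaproteomics analysis record without
    has_peptide_quantifications, plus a de-replicated razor_protein list."""
    slim = {
        key: value
        for key, value in record.items()
        if key != "has_peptide_quantifications"
    }
    # dict.fromkeys de-replicates while keeping first-seen order.
    slim["razor_protein"] = list(
        dict.fromkeys(
            pq["best_protein"]
            for pq in record.get("has_peptide_quantifications", [])
            if pq.get("best_protein")
        )
    )
    return slim
```

The input record is left untouched, so the full peptide-level detail can still be written to the static tsv files from the same in-memory data.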

Considerations

The “Now/Next/Later” Now of the above de-bloats the mongo database and retains the current functionality of finding proteomics results that map to functional annotations (if indeed that is working…)
There is energy around populating the proposed new class of MolecularData with some of the information listed above such as peptide sequences, protein groups, and inferred post-translational modifications (`ChemicalEntity` use cases, questions, guidance. #2151 (comment)) and possibly more, although Paul Piehowski is not in favor of this
Essentially all of the mongo data proposed for removal is mirrored in the static tsv files that are produced by the workflow, so the results ARE still available to users
It is likely we can re-populate mongo with some or all of this information on an as-needed basis as new capabilities are brought online
One proposal by Paul Piehowski is to include some form of the all_proteins list to allow users to look for specific proteins of interest, either via the API or portal or both.

SamuelPurvine · 2024-08-01T18:04:21Z

SamuelPurvine
Aug 1, 2024
Collaborator Author

@lamccue @pdpiehowski @aclum @mslarae13 @kheal @brynnz22 @turbomam @corilo @shreddd @picowatt please feel free to edit the above for mistakes, misunderstandings, misstatements, or of course add comments as per your wont!

2 replies

kheal Aug 2, 2024
Collaborator

Hi @SamuelPurvine and @pdpiehowski I have a couple questions so I can catch up.

Are ProteinQuantification instances currently populated in mongo?
How does the ProteinQuantification class connect with other classes?
What is the proposed "razor proteins" slot's range? Maybe a GeneProduct? That is the current range of the existing all_proteins and best_protein slots (as well as some slots on the FunctionalAnnotation and GenomeFeature classes).

SamuelPurvine Aug 2, 2024
Collaborator Author

I don't believe there are ProteinQuantification instances populated in mongo. It was proposed back when we made the PeptideQuantification class, more or less as a placeholder for future possibilities, but we only ever populate the PeptideQuantification class with the analysis activity as outlined above
I believe the only connection for ProteinQuantification is the best_protein and all_proteins connections to GeneProduct
razor_protein's range would be GeneProduct if it were limited to only proteins coming from the metagenome analyses, so it would be an exact replacement (simple rename) of best_protein. However: as we move into metagenome free peptide identification (which uses de-novo sequencing of spectra, BLASTing against a very large collection of organisms' proteins like Uniprot, determination of organisms in the sample, and assembly of a protein collection from those organisms) we will no longer be tied directly to the GeneProduct that comes from metagenome annotation efforts. The range will still be GeneProduct -esque, but that specific class may not fit unless it is also untied from metagenome annotations.

kheal · 2024-08-02T16:05:24Z

kheal
Aug 2, 2024
Collaborator

Below is a schematic of my understanding of the current berkeley schema for the three classes you mention (MetaproteomicsAnalysis, PeptideQuantification and ProteinQuantification). Please let me know if this is inaccurate and I'll update accordingly. Hopefully we can use this as a starting place for re-designing this corner of the schema.

1 reply

SamuelPurvine Aug 2, 2024
Collaborator Author

This is accurate for both the current production schema and Berkeley schema, as we chose not to break this model up until after roll out of the Berkeley schema. As mentioned above, the "just de-bloat the darn thing" thinking is to delete the Quantification classes, find a home for the best_protein-renamed-to-razor_protein (MolecularData class specific to proteomics activities?), and only populate mongo with those razor_protein entries for a given MetaproteomicsAnalysis. This is being re-thought, however, wherein we would populate mongo with both razor_protein and all_proteins to allow users to have access to all of the proteins mapped to a given protein collection.

aclum · 2024-08-07T18:40:09Z

aclum
Aug 7, 2024
Collaborator

@kheal we have some results post-re-IDing in mongo prod. Relevant collections are metaproteomics_analysis_activity_set and metaproteomics_analysis_activity_set

0 replies

SamuelPurvine · 2024-08-13T22:07:45Z

SamuelPurvine
Aug 13, 2024
Collaborator Author

Point to capture: we are considering getting rid of all_proteins entirely from what we load into MongoDB, and if so we may as well drop it from the protein report .tsv file (keep it in the peptide report .tsv) as well.

0 replies

SamuelPurvine · 2024-08-13T22:19:13Z

SamuelPurvine
Aug 13, 2024
Collaborator Author

Additionally, this would be a good time to re-factor the "ficus analysis" script to move away from using SQL to wrangle things into more native python language.

0 replies

kheal · 2024-08-13T22:35:18Z

kheal
Aug 13, 2024
Collaborator

Notes from meeting. Remove PeptideQuantification and ProteinQuantification classes altogether, migrate information from PeptideQuantification.best_protein slot to new slot on MetaproteomicsAnalysis class

2 replies

kheal Sep 3, 2024
Collaborator

@aclum encourages team protein to keep some sort of quantitative information in mongo for visualization purposes (see discussion here: microbiomedata#242 (comment))

After discussion, we've decided to keep spectra counts and number of unique peptides mapped to each protein, which are on the existing ProteinQuantification class, so we will not deprecate these slots.

Here is the updated modeling associated with maintaining that information in mongo. @SamuelPurvine and @pdpiehowski, I welcome feedback on this modeling. I can still implement this through a migrator for the existing data in mongo, so our plan to make schema changes + modify existing mongo data can still go forward.

pdpiehowski Sep 3, 2024
Collaborator

For clarity I think it would be good to include unique in the name since that is the terminology in the field, so unique_peptide_count instead of peptide_sequence_count.

kheal · 2024-08-13T23:12:41Z

kheal
Aug 13, 2024
Collaborator

I filed some issues:

schema
#2157

aggregator
microbiomedata/nmdc-aggregator#13

0 replies

kheal · 2024-08-14T18:19:42Z

kheal
Aug 14, 2024
Collaborator

After discussing with @aclum, here's my current understanding of the tasks to deal with data currently in mongo

1. Change the schema for the MetaproteomicsAnalysis refactoring described above and implement a migrator for the Berkeley fork
   - Remove PeptideQuantification class, migrate information from PeptideQuantification best_protein slot to new razor_protein slot on MetaproteomicsAnalysis class #2157
2. Add a class for MetaproteomicsAggregation, similar to the current FunctionalAnnotationAggMember class, and implement a migrator for the Berkeley fork to make the data in mongo compliant with the schema. These data exist in mongo, but there is no corresponding representation in the schema!
   - Edit FunctionalAnnotationAggMember to work with MetaP data #1253
3. Change the aggregation script to comply with the schema changes implemented in 1) and 2)
   - Update generate_metap_agg.py script to source ids from new slot nmdc-aggregator#13
4. Coordinate with @naglepuff regarding the use of the MetaproteomicsAggregation data for the data portal.
5. After Berkeley becomes production: purge existing MetaproteomicsAggregation records
6. After Berkeley becomes production: check that the cron job from aggregation repopulates the MetaproteomicsAggregation records and that those updated records are working with the data portal

2 replies

kheal Aug 14, 2024
Collaborator

1-4 can happen on "berkeley" branches/forks of the existing repos, but 5 and 6 only make sense after berkeley is production (though we could test in berkeley branches/forks)

kheal Aug 28, 2024
Collaborator

PR in for #1 here:
microbiomedata#236

PR in for #2 here:
microbiomedata#242

kheal · 2024-11-13T19:20:43Z

kheal
Nov 13, 2024
Collaborator

We have decided to take a different path after discussing with both the proteomics and infrastructure teams.

ADR can be found here: https://github.com/microbiomedata/issues/blob/main/decisions/0017-metaP-mongodata.md

0 replies
