Proposed actions for de-bloat of metaproteomics data in mongo DB #2152
-
@lamccue @pdpiehowski @aclum @mslarae13 @kheal @brynnz22 @turbomam @corilo @shreddd @picowatt please feel free to edit the above to correct mistakes, misunderstandings, or misstatements, and of course add comments, as is your wont!
-
Below is a schematic of my understanding of the current Berkeley schema for the three classes you mention (
-
@kheal we have some results post-re-IDing in mongo prod. Relevant collections are metaproteomics_analysis_activity_set and metaproteomics_analysis_activity_set
-
Point to capture: we are considering dropping all_proteins entirely from what we load into MongoDB. If we do, we may as well also drop it from the protein report .tsv file (while keeping it in the peptide report .tsv).
-
Additionally, this would be a good time to refactor the "ficus analysis" script so that it wrangles the data in native Python rather than SQL.
-
I filed some issues: schema, aggregator
-
After discussing with @aclum, here's my current understanding of the tasks to deal with data currently in mongo
-
We have decided to take a different path after discussing with both the proteomics and infrastructure teams. The ADR can be found here: https://github.com/microbiomedata/issues/blob/main/decisions/0017-metaP-mongodata.md
-
Motivation
Currently the proteomics workflow includes, as part of its analysis activity report, a large amount of information that the Portal does not use in any way and is unlikely to use in the near term. The primary goal is to remove this data and its associated classes, keeping only what is necessary to associate the results from a proteomics instrument data file with searchable functional annotation in the Portal. Additionally, we would like to harmonize the name for the single protein that is parsimoniously chosen for a peptide, moving from best_protein to razor_protein.
Status / as things are now:
• id, which is a NAPA compliant workflow identifier minted by the production API, example "id": "nmdc:wfmp-11-9qjy9903.1"
• name, which describes the activity and references the workflow id, example "name": "Metaproteomics Analysis Activity for nmdc:wfmp-11-9qjy9903.1"
• started_at_time and ended_at_time, which should be ISO 8601 format timestamps
• was_informed_by, which lists the omics processing / data generation that provided the raw instrument file, example "was_informed_by": "nmdc:omprc-11-gqqn9s76"
• execution_resource, describing the institution where the workflow was run, example "execution_resource": "EMSL"
• git_url, which lists where the code for the workflow resides, example "git_url": "https://github.com/microbiomedata/metaPro/releases/tag/v1.2.1"
• has_input, which lists the 6 files that the proteomics workflow uses to run
• has_output, which lists the 4 files that are created by the workflow and loaded to minio as data objects
• type, which denotes the enum-controlled workflow activity type, example "type": "nmdc:MetaproteomicsAnalysisActivity" (which needs to change for Berkeley to "type": "nmdc:MetaproteomicsAnalysis")
• version, which denotes the workflow pipeline version that was run, example "version": "v1.2.1"
• has_peptide_quantifications, which lists all of the peptides found in the workflow operation, with the following slots:
  • all_proteins, which lists all of the proteins in the search fasta file that map to the reported peptide, example:
    "nmdc:wfmgan-11-b0r8xe54.1_0000001_251240_252433",
    "nmdc:wfmgan-11-b0r8xe54.1_0000001_268350_269543",
    "nmdc:wfmgan-11-b0r8xe54.1_0000117_2_124"
  • best_protein, which is the protein associated with the peptide that is chosen by parsimony rules, example "best_protein": "nmdc:wfmgan-11-b0r8xe54.1_0000004_113773_115989", for which the logic follows:
    • if a peptide maps to only one protein (i.e. is unique), then that is the best_protein
    • if a peptide maps to more than one protein AND one and only one of those proteins has at least one other uniquely mapped peptide, then that is the best_protein
    • if a peptide maps to more than one protein and none of those proteins has a uniquely mapped peptide, then the protein with the most other mapped peptides is the best_protein
    • if a peptide maps to more than one protein, none of those proteins has a uniquely mapped peptide, and more than one of those proteins shares the same maximum number of mapped peptides, then the first protein in the search fasta (by virtue of indexing the search fasta and taking the lowest index number) is the best_protein
    • only non-unique peptides that map to more than one protein, wherein at least two of those proteins have at least one uniquely mapping peptide, will not be assigned a best_protein (and 'should' not be loaded into mongo)
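The parsimony rules above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the workflow's actual code: the function name assign_razor_proteins, the peptide_to_proteins mapping, and the fasta_order list are all hypothetical, and "most mapped peptides" is counted over every peptide that maps to a protein.

```python
from collections import defaultdict


def assign_razor_proteins(peptide_to_proteins, fasta_order):
    """Map each peptide to its best/razor protein, or None if it stays unassigned.

    peptide_to_proteins: dict of peptide sequence -> list of protein ids (hypothetical input).
    fasta_order: protein ids in the order they appear in the search fasta (hypothetical input).
    """
    mapping = {pep: set(prots) for pep, prots in peptide_to_proteins.items()}

    # Proteins that have at least one uniquely mapped peptide.
    has_unique = {next(iter(prots)) for prots in mapping.values() if len(prots) == 1}

    # Number of peptides mapped to each protein.
    peptide_count = defaultdict(int)
    for prots in mapping.values():
        for prot in prots:
            peptide_count[prot] += 1

    fasta_index = {prot: i for i, prot in enumerate(fasta_order)}

    razor = {}
    for pep, prots in mapping.items():
        # Sort candidates by fasta position so ties resolve to the lowest index.
        candidates = sorted(prots, key=fasta_index.__getitem__)
        anchored = [p for p in candidates if p in has_unique]
        if len(candidates) == 1:
            razor[pep] = candidates[0]   # unique peptide
        elif len(anchored) == 1:
            razor[pep] = anchored[0]     # exactly one candidate has a unique peptide
        elif not anchored:
            # max() returns the first maximal item, i.e. the lowest fasta index on ties.
            razor[pep] = max(candidates, key=lambda p: peptide_count[p])
        else:
            razor[pep] = None            # two or more candidates have unique peptides
    return razor


# Made-up peptides and proteins, one group per rule.
example = assign_razor_proteins(
    {
        "PEPA": ["P1"],
        "PEPB": ["P1", "P2"],
        "PEPC": ["P3", "P4"],
        "PEPD": ["P3", "P5"],
        "PEPE": ["P6", "P7"],
        "PEPF": ["P8"],
        "PEPG": ["P9"],
        "PEPH": ["P8", "P9"],
    },
    fasta_order=["P1", "P2", "P3", "P4", "P5", "P7", "P6", "P8", "P9"],
)
```

Sorting the candidates by fasta position before calling max() is what implements the "lowest index number" tie-break without a separate branch.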
  • min_q_value, which is the lowest Q-value assigned to any of the spectra associated with the reported peptide, example "min_q_value": 0.015646
  • peptide_sequence, which is what was found by MSGFPlus, with the prefix and suffix amino acids removed along with any modification symbols or verbiage, example "peptide_sequence": "IFDPFFTTK"
  • peptide_spectral_count, which denotes how many MS/MS spectra (a.k.a. tandem mass spectra or MS2 spectra) were associated with this peptide in the given instrument file, example "peptide_spectral_count": 1
  • peptide_sum_masic_abundance, wherein the area-under-the-LC-elution-curve abundances, as observed in MS1 spectra and extracted using MASIC's StatMomentsArea, are summed in non-log-transformed form, example "peptide_sum_masic_abundance": 253820000
• has_peptide_quantifications information is distilled by the metaproteomics aggregation table generating tool written by Shane and maintained by Alicia:
  • all_proteins lists for all of the peptides are combined and de-replicated for mapping to GeneProducts that were inserted into mongo by the metagenome annotations
  • if a protein in the all_proteins list is also listed as a best_protein, then it is denoted with a Boolean (named best_protein) that is specific to the aggregation table and collides with the best_protein listed above, which is a string (this may already have been fixed by calling it is_best_protein)
• PeptideQuantification, which has slots (Core.yaml, lines 290-298):
  • all_proteins
  • best_protein
  • min_q_value
  • peptide_sequence
  • peptide_spectral_count
  • peptide_sum_masic_abundance
• ProteinQuantification, which has slots (Core.yaml, lines 300-312)
• MetaproteomicsAnalysisActivity also has a slot called has_peptide_quantifications with a range of the PeptideQuantification class (Workflow_execution_activity.yaml, lines 363-366)
Proposed changes:
• Remove all_proteins and all associated data from mongo
• Remove min_q_value and all associated data from mongo
• Remove peptide_sequence and all associated data from mongo
• Remove peptide_spectral_count and all associated data from mongo
• Remove peptide_sum_masic_abundance and all associated data from mongo
• Remove peptide_sequence_count and all associated data from mongo
• Remove protein_spectral_count and all associated data from mongo
• Remove peptide_sum_masic_abundance and all associated data from mongo
• Remove has_peptide_quantifications
• Remove PeptideQuantification (any associated data should be removed as per above)
• Remove ProteinQuantifications (any associated data should be removed as per above)
• razor_protein, not sure where to associate it once ProteinQuantifications and PeptideQuantifications are removed, possibly a MolecularData class as being proposed elsewhere
• all_protein and related entries
• min_q_value
• peptide_spectral_count
• peptide_sum_masic_abundance
• razor_proteins found in the workflow run
{
"id": "nmdc:wfmp-11-q5yxmy05.1",
"name": "Metaproteomics Analysis Activity for nmdc:wfmp-11-q5yxmy05.1",
"started_at_time": "2024-03-13T13:47:33-07:00",
"ended_at_time": "2024-03-13T21:49:55+00:00",
"was_informed_by": "nmdc:omprc-11-tcapc615",
"execution_resource": "EMSL",
"git_url": "https://github.com/microbiomedata/metaPro/releases/tag/2.0.0",
"has_input": [
"nmdc:dobj-11-xvmb4058",
"nmdc:dobj-11-mmqbd689",
"nmdc:dobj-11-1r9da293",
"nmdc:dobj-11-h9637w90",
"nmdc:dobj-11-hfx93f93",
"nmdc:dobj-11-sprrem27"
],
"has_output": [
"nmdc:dobj-11-zywq6931",
"nmdc:dobj-11-hkhn5580",
"nmdc:dobj-11-zv03kg59",
"nmdc:dobj-11-1zwp8m61"
],
"type": "nmdc:MetaproteomicsAnalysis",
"razor_protein": [
"nmdc:wfmgas-11-anseqn83.1_5230_c1_102_1628",
"nmdc:wfmgas-11-anseqn83.1_79553_c1_1_540",
"nmdc:wfmgas-11-anseqn83.1_22938_c1_1_1035"
]
}
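As a rough sketch of what the record shape above would mean for an existing record, the following Python drops the peptide-level detail and keeps one entry per distinct best_protein. It rests on assumptions that the post itself leaves open: slim_record is a hypothetical helper, razor_protein is assumed to be the de-duplicated list of best_protein values, and it is assumed to sit directly on the activity record as in the example (where to attach it is noted above as undecided).

```python
import copy


def slim_record(record):
    """Return a slimmed copy of a metaproteomics activity record (hypothetical helper).

    Assumes razor_protein is the list of distinct best_protein values, in first-seen order.
    """
    quantifications = record.get("has_peptide_quantifications", [])

    # dict.fromkeys de-duplicates while preserving first-seen order.
    razor = list(
        dict.fromkeys(
            q["best_protein"] for q in quantifications if q.get("best_protein")
        )
    )

    # Copy everything except the peptide-level slot, leaving the input untouched.
    slim = {
        key: copy.deepcopy(value)
        for key, value in record.items()
        if key != "has_peptide_quantifications"
    }
    slim["type"] = "nmdc:MetaproteomicsAnalysis"  # Berkeley type name from the post
    slim["razor_protein"] = razor
    return slim


# Illustrative input only; the protein ids are made up.
old_record = {
    "id": "nmdc:wfmp-11-9qjy9903.1",
    "type": "nmdc:MetaproteomicsAnalysisActivity",
    "has_peptide_quantifications": [
        {"peptide_sequence": "IFDPFFTTK", "best_protein": "protein_a"},
        {"peptide_sequence": "AAAK", "best_protein": "protein_b"},
        {"peptide_sequence": "CCCK", "best_protein": "protein_a"},
    ],
}
new_record = slim_record(old_record)
```

Returning a copy rather than editing in place keeps the original record available for comparison before anything is written back to mongo.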
Considerations