How to capture the lifecycle of a predicted and then later curated mapping? #437

Open
cthoyt opened this issue Apr 25, 2025 · 1 comment
Labels
question Further information is requested

Comments

@cthoyt
Member

cthoyt commented Apr 25, 2025

Let's say I generate an exact match using a lexical mapping. My mapping tool gives a confidence of 0.7. So I get SSSOM like

| subject_id | subject_label | predicate_id | object_id | object_label | mapping_justification | confidence | mapping_tool |
|---|---|---|---|---|---|---|---|
| CHEBI:134180 | leucomethylene blue | skos:exactMatch | mesh:C011010 | hydromethylthionine | semapv:LexicalMatching | 0.7 | generate_chebi_mesh_mappings.py |

Then, I review this mapping. I say that it's correct with 0.95 confidence. How do I represent this? Here are some options I thought of:

  1. Add an author_id column with my ORCID, swap the mapping justification to semapv:ManualMappingCuration, and overwrite the confidence from 0.7 to 0.95.
  2. Add a reviewer_id column with my ORCID. But then, how do I represent that I have a confidence as a reviewer? Do I throw away the mapping tool's confidence? What if I want to keep track of this?
  3. Some other way? Please also let me know if I've misunderstood how to use author_id/creator_id/reviewer_id.
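For concreteness, option 2 could be sketched like this in Python. Note that `reviewer_confidence` is a hypothetical extension column I'm making up for illustration, not a slot defined by the SSSOM specification, and the ORCID is a placeholder:

```python
# Sketch of option 2: keep the mapping tool's predicted confidence and
# record the reviewer's judgement separately.
# NOTE: "reviewer_confidence" is a hypothetical column, not a standard
# SSSOM slot; the ORCID below is a placeholder.
predicted = {
    "subject_id": "CHEBI:134180",
    "subject_label": "leucomethylene blue",
    "predicate_id": "skos:exactMatch",
    "object_id": "mesh:C011010",
    "object_label": "hydromethylthionine",
    "mapping_justification": "semapv:LexicalMatching",
    "confidence": 0.7,
    "mapping_tool": "generate_chebi_mesh_mappings.py",
}


def review(mapping: dict, orcid: str, reviewer_confidence: float) -> dict:
    """Return a reviewed copy of a predicted mapping, keeping the tool's
    original confidence and adding the reviewer's under an extension column."""
    reviewed = dict(mapping)
    reviewed["reviewer_id"] = orcid
    reviewed["reviewer_confidence"] = reviewer_confidence  # hypothetical slot
    return reviewed


reviewed = review(predicted, "orcid:0000-0000-0000-0000", 0.95)
```

This keeps both scores, at the cost of a non-standard column that downstream SSSOM tooling would ignore.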

The use case for this question is Biomappings: we make lexical predictions, curate them, and want to keep track of this provenance.

Given the answer to this question, it should also be possible to generalize the Biomappings curation interface into a generic SSSOM curation interface.

@cthoyt cthoyt added the question Further information is requested label Apr 25, 2025
@cthoyt cthoyt transferred this issue from mapping-commons/sssom-py Apr 26, 2025
@matentzn
Collaborator

This issue is a bit debated; last time we tried to do this we didn't reach a definite conclusion: #345

In a nutshell:

  1. Separating mapping processes during the curation life cycle was not a primary concern of the design of SSSOM, so it was all mushed together into one record
  2. The idea is that one single score tells a downstream user how sure they can be
  3. If you absolutely want to represent the life cycle, you will have to create intermediate mapping sets, so you say:
    1. Mapping set 1 derived from lexical matching (semapv:LexicalMatching)
    2. Mapping set 2 reviews mapping set 1 (semapv:MappingReview) and sets mapping_set_source to mapping set 1
    3. Mapping set 3 is derived from sets 1 and 2, referring to both, generating a composite score and using semapv:CompositeMatching or some such as a justification. This last set is the only one you publish to the world.
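The three-set workflow above could be sketched as follows. The mapping set IDs are placeholders, and the composite-scoring rule (here, simply the product of the tool and reviewer confidences) is an illustrative assumption; SSSOM does not prescribe how scores should be combined:

```python
# Sketch of the three-mapping-set lifecycle. Set IDs are placeholders, and
# the composite score (product of tool and reviewer confidence) is an
# assumed rule, not something SSSOM prescribes.

set1 = {  # Mapping set 1: lexical predictions
    "mapping_set_id": "https://example.org/sets/1",
    "mappings": [{
        "subject_id": "CHEBI:134180",
        "predicate_id": "skos:exactMatch",
        "object_id": "mesh:C011010",
        "mapping_justification": "semapv:LexicalMatching",
        "confidence": 0.7,
    }],
}

set2 = {  # Mapping set 2: review of set 1
    "mapping_set_id": "https://example.org/sets/2",
    "mapping_set_source": set1["mapping_set_id"],
    "mappings": [{
        **set1["mappings"][0],
        "mapping_justification": "semapv:MappingReview",
        "confidence": 0.95,
    }],
}


def compose(predicted: dict, reviewed: dict, set_id: str) -> dict:
    """Build the published composite set from a predicted set and its review."""
    mappings = []
    for p, r in zip(predicted["mappings"], reviewed["mappings"]):
        m = dict(r)
        m["mapping_justification"] = "semapv:CompositeMatching"  # or some such
        m["confidence"] = p["confidence"] * r["confidence"]  # assumed rule
        mappings.append(m)
    return {
        "mapping_set_id": set_id,
        "mapping_set_source": [
            predicted["mapping_set_id"],
            reviewed["mapping_set_id"],
        ],
        "mappings": mappings,
    }


set3 = compose(set1, set2, "https://example.org/sets/3")  # the published set
```

Only set 3 would be published; sets 1 and 2 survive as its provenance via mapping_set_source.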

None of this is super awesome. Another option that would make this a bit cleaner would be to push for #359 and then add a new slot source_mapping that you can use to point specifically to the mappings used to derive a particular new mapping.

None of this is normative, just spitballing.
