Adding detail to RetrievalSource provenance #1624

mbrush · 2025-10-26T17:27:34Z

Exploring some modeling that would support capturing a couple of additional retrieval source provenance details on a per-edge basis. To discuss on the upcoming MUTT/DINGO call:

added an ingest_source permissible value - to help capture which source the data was actually ingested from (and made the RetrievalSource.resource_role slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source')
also tested an alternative pattern to capture this info - that defines a separate slot to capture the ingest_source - ingest_source: boolean
- this pattern may make it easier to parse out this important data, and allow us to make capturing this info required if we want
- if we decide to capture this type of metadata, choose one of the two implemented patterns
added an ingest_files slot to RetrievalSource - for use in the RetrievalSource object for the ingest_source, to report the file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:
- developer debugging (lets us better trace edges back to the source data)
- manual QA efforts (help reviewers organize edge types by file source - e.g. very helpful for CTD)
- more precise provenance for end users to understand where the edge came from
- identifying edges that may need to be updated/reviewed if a source updates its data/files
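To make the trade-off between the two ingest_source patterns concrete, here is a minimal Python sketch of how a consumer might find the ingest source under each one. It is an illustration only, not the schema change in this PR: the plain-dict representation, the helper names `ingest_source_multivalued_role` and `ingest_source_boolean_slot`, and the `infores:example-aggregator` identifier are all assumptions made for the example.

```python
from typing import Optional

# Pattern A (assumed shape): resource_role is multivalued, so one source can be
# both a primary knowledge source and the ingest source.
sources_a = [
    {"resource_id": "infores:ctd",
     "resource_role": ["primary_knowledge_source", "ingest_source"]},
    {"resource_id": "infores:example-aggregator",  # hypothetical identifier
     "resource_role": ["aggregator_knowledge_source"]},
]

# Pattern B (assumed shape): resource_role stays single-valued and a separate
# boolean ingest_source slot flags the source the data was ingested from.
sources_b = [
    {"resource_id": "infores:ctd",
     "resource_role": "primary_knowledge_source",
     "ingest_source": True},
    {"resource_id": "infores:example-aggregator",  # hypothetical identifier
     "resource_role": "aggregator_knowledge_source",
     "ingest_source": False},
]


def ingest_source_multivalued_role(sources: list[dict]) -> Optional[str]:
    """Pattern A: scan each source's list of roles for 'ingest_source'."""
    for source in sources:
        if "ingest_source" in source.get("resource_role", []):
            return source["resource_id"]
    return None


def ingest_source_boolean_slot(sources: list[dict]) -> Optional[str]:
    """Pattern B: read the dedicated boolean flag directly."""
    for source in sources:
        if source.get("ingest_source") is True:
            return source["resource_id"]
    return None


print(ingest_source_multivalued_role(sources_a))
print(ingest_source_boolean_slot(sources_b))
```

Under pattern B the flag is a single field that could be declared required, whereas pattern A needs a membership check across the role values of every source.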

. . . If we don't capture this at the edge level in the data, perhaps we could make it standard to put this info in the RIG for each 'EdgeType' object?

Finally, I added definitions to the ResourceRoleEnum values, which I think we should keep even if we don't adopt the other proposals above.

- added defs to `ResourceRoleEnum` values
- added an `ingest_source` permissible value - to help capture which source the data was actually ingested from
- made the `RetrievalSource.resource_role` slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source'
- also implemented an alternative pattern to capture this info - that defines a separate slot to capture the ingest_source - "ingest_source: boolean"
  - this pattern may make it easier to parse out this important data, and allow us to make capturing this info required if we want
  - if we decide to capture this type of metadata, choose one of the two implemented patterns
- added an ingest_files slot to RetrievalSource, to capture the file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:
  - manual QA efforts (help reviewers organize edge types by file source)
  - internal developer debugging
  - identifying edges that may need to be updated/reviewed if a source updates its data/files
  - more precise provenance for end users to understand where the edge came from
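As a rough illustration of the ingest_files use cases listed above, the sketch below indexes edges by the file they were ingested from, so that edges can be grouped for QA or flagged for review when a source file changes. The edge structure, the `sources` key, the edge ids, the file names, and the helper `edges_by_ingest_file` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical edges; each carries RetrievalSource-like dicts with an
# ingest_files list on the source the data was ingested from.
edges = [
    {"id": "edge-1", "sources": [
        {"resource_id": "infores:ctd", "ingest_files": ["file_a.tsv"]}]},
    {"id": "edge-2", "sources": [
        {"resource_id": "infores:ctd", "ingest_files": ["file_b.tsv"]}]},
    {"id": "edge-3", "sources": [
        {"resource_id": "infores:ctd", "ingest_files": ["file_a.tsv", "file_b.tsv"]}]},
]


def edges_by_ingest_file(edges: list[dict]) -> dict[str, list[str]]:
    """Map each ingest file name to the ids of the edges derived from it."""
    index: dict[str, list[str]] = defaultdict(list)
    for edge in edges:
        for source in edge["sources"]:
            # Sources without ingest_files simply contribute nothing.
            for file_name in source.get("ingest_files", []):
                index[file_name].append(edge["id"])
    return dict(index)


index = edges_by_ingest_file(edges)

# If the source republishes file_a.tsv, these are the edges to re-check.
print(index["file_a.tsv"])
```

The same index could also be used to organize edges by file source during manual review.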

mbrush added 4 commits October 26, 2025 10:25

Update biolink-model.yaml

846d43e

Update biolink-model.yaml

b2ba312

Update biolink-model.yaml

1cb432e

2 participants
