

@mbrush mbrush commented Oct 26, 2025

Exploring some modeling that would support capturing a couple of additional retrieval source provenance details on a per-edge basis. To discuss on the upcoming MUTT/DINGO call:

  • added an ingest_source permissible value - to help capture which source the data was actually ingested from (and made the RetrievalSource.resource_role slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source')

  • also tested an alternative pattern to capture this info - that defines a separate slot to capture the ingest_source - ingest_source: boolean

    • this pattern may make it easier to parse out this important data, and allow us to make capturing this info required if we want
    • if we decide to capture this type of metadata, choose one of the two implemented patterns
  • added an ingest_files slot to RetrievalSource - for use in the RetrievalSource object for the ingest_source, to report the file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:

    • developer debugging (lets us better trace edges back to the source data)
    • manual QA efforts (help reviewers organize edge types by file source - e.g. very helpful for CTD)
    • more precise provenance for end users to understand where the edge came from
    • identifying edges that may need to be updated/reviewed if a source updates its data/files

. . . If not captured at the edge level in the data, perhaps we could make it standard to put this info in the RIG for each 'EdgeType' object?

  • finally, I added defs to ResourceRoleEnum values - which I think we should keep even if we don't adopt the other proposals above
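To make the trade-off between the two patterns concrete, here is a rough Python sketch of how a consumer might pick out the ingest source under each one. This is only an illustration under stated assumptions, not the schema itself: the class names `RetrievalSourceA` / `RetrievalSourceB`, the helper functions, the `resource_id` field, the enum value spellings, and the example identifiers are all assumed for the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ResourceRole(Enum):
    # Value spellings are assumptions for illustration.
    PRIMARY_KNOWLEDGE_SOURCE = "primary_knowledge_source"
    AGGREGATOR_KNOWLEDGE_SOURCE = "aggregator_knowledge_source"
    INGEST_SOURCE = "ingest_source"  # the proposed new permissible value


@dataclass
class RetrievalSourceA:
    """Pattern 1: multivalued resource_role that may include ingest_source."""
    resource_id: str
    resource_role: List[ResourceRole]


@dataclass
class RetrievalSourceB:
    """Pattern 2: single-valued resource_role plus a separate boolean slot."""
    resource_id: str
    resource_role: ResourceRole
    ingest_source: bool = False


def ingest_ids_a(sources: List[RetrievalSourceA]) -> List[str]:
    # Pattern 1 requires scanning each source's list of roles.
    return [s.resource_id for s in sources
            if ResourceRole.INGEST_SOURCE in s.resource_role]


def ingest_ids_b(sources: List[RetrievalSourceB]) -> List[str]:
    # Pattern 2 is a direct check on one dedicated field.
    return [s.resource_id for s in sources if s.ingest_source]


# Hypothetical example: data ingested directly from the primary source.
sources_a = [
    RetrievalSourceA("infores:ctd",
                     [ResourceRole.PRIMARY_KNOWLEDGE_SOURCE,
                      ResourceRole.INGEST_SOURCE]),
    RetrievalSourceA("infores:example-aggregator",
                     [ResourceRole.AGGREGATOR_KNOWLEDGE_SOURCE]),
]
sources_b = [
    RetrievalSourceB("infores:ctd",
                     ResourceRole.PRIMARY_KNOWLEDGE_SOURCE,
                     ingest_source=True),
    RetrievalSourceB("infores:example-aggregator",
                     ResourceRole.AGGREGATOR_KNOWLEDGE_SOURCE),
]
```

Both helpers return the same answer here; the difference is that the boolean slot gives a single field to check (and to mark as required), while the multivalued role keeps everything in one existing slot.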

- added defs to `ResourceRoleEnum` values
- added an `ingest_source` permissible value - to help capture which source the data was actually ingested from
- made the `RetrievalSource.resource_role` slot multivalued - to allow indicating that a particular primary or aggregator source was also the 'ingest_source'
- also implemented an alternative pattern to capture this info - that defines a separate slot to capture the ingest_source - "ingest_source: boolean"
    - this pattern may make it easier to parse out this important data, and allow us to make capturing this info required if we want
    - if we decide to capture this type of metadata, choose one of the two implemented patterns

- added an ingest_files slot to RetrievalSource, to capture the file(s) from which the data used to create the edge were retrieved. This provides more complete provenance, and supports various downstream activities:
    - manual QA efforts (help reviewers organize edge types by file source)
    - internal developer debugging
    - identifying edges that may need to be updated/reviewed if a source updates its data/files
    - more precise provenance for end users to understand where the edge came from
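
As a rough illustration of the QA and update-tracking use cases for `ingest_files`, the sketch below indexes edges by the file(s) they came from and looks up which edges may need review when a file changes. The edge layout (an `id` plus a `sources` list of dicts), the function names, and the file names are all assumptions made up for this example, not the actual data model.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def edges_by_ingest_file(edges: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each ingest file name to the ids of the edges built from it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for edge in edges:
        for source in edge.get("sources", []):
            for file_name in source.get("ingest_files", []):
                index[file_name].add(edge["id"])
    return dict(index)


def edges_needing_review(index: Dict[str, Set[str]],
                         changed_files: Iterable[str]) -> List[str]:
    """Return sorted ids of edges derived from any of the changed files."""
    stale: Set[str] = set()
    for file_name in changed_files:
        stale |= index.get(file_name, set())
    return sorted(stale)


# Hypothetical edges; ids and file names are placeholders.
edges = [
    {"id": "edge:1", "sources": [
        {"resource_id": "infores:ctd", "ingest_files": ["file_a.tsv"]}]},
    {"id": "edge:2", "sources": [
        {"resource_id": "infores:ctd",
         "ingest_files": ["file_a.tsv", "file_b.tsv"]}]},
    {"id": "edge:3", "sources": [
        {"resource_id": "infores:ctd", "ingest_files": ["file_b.tsv"]}]},
]

index = edges_by_ingest_file(edges)
to_review = edges_needing_review(index, ["file_a.tsv"])
```

The same index could also be used to group edge types by file for manual review.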