Attempt to reuse previously materialized datasets #20718

Open. mvdbeek wants to merge 3 commits into dev from deduplicate_materialization

@mvdbeek mvdbeek commented Jul 31, 2025

A previously materialized dataset is reused only when all of these conditions hold:

  • the replacement dataset exists in the same object store
  • the replacement dataset exists in a history that the user owns
  • the replacement dataset has an HDA of the same datatype
  • the dataset hash or dataset source hash matches
  • the replacement dataset is not purged or deleted and is in the OK state
  • the same transform is applied
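
To make the conditions concrete, here is a minimal sketch of them as a single predicate. It is not the code in this PR: `MaterializationRequest`, `Candidate`, `can_reuse` and all field names are hypothetical stand-ins for illustration, and the real check presumably happens in a database query rather than on in-memory objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MaterializationRequest:
    """Hypothetical description of the dataset a user asks to materialize."""
    user_id: int
    object_store_id: str
    extension: str               # requested datatype
    hash_value: str              # e.g. "sha-256:<hex digest>"
    transform: Optional[str]     # transform to apply, if any


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of an existing dataset and the HDA pointing at it."""
    object_store_id: str
    history_owner_id: int
    extension: str
    hash_value: Optional[str]         # hash of the dataset itself
    source_hash_value: Optional[str]  # hash recorded for the dataset source
    state: str
    deleted: bool
    purged: bool
    transform: Optional[str]


def can_reuse(request: MaterializationRequest, candidate: Candidate) -> bool:
    """Return True only if every condition from the list above holds."""
    hash_matches = request.hash_value in (
        candidate.hash_value,
        candidate.source_hash_value,
    )
    return (
        candidate.object_store_id == request.object_store_id
        and candidate.history_owner_id == request.user_id
        and candidate.extension == request.extension
        and hash_matches
        and not candidate.deleted
        and not candidate.purged
        and candidate.state == "ok"
        and candidate.transform == request.transform
    )
```

Any single failing condition rejects the candidate, in which case the dataset would be materialized as usual.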

Providing hashes is currently somewhat niche, but we do use them
in BRC analytics (in particular, all fastq files come with hashes).
This should help a lot when demoing things.

We should also be able to make use of this in planemo, where this would be a realistic path to enable "invocation resume" functionality.

We might eventually allow this for public datasets as well, but that should probably be more explicit. We could, for instance, include cache hints in the dataset request syntax (maybe something like cache_strategy: own, cache_strategy: public, cache_strategy: never), along with a top-level setting for the workflow request.
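
As a rough illustration of how such hints could resolve, the sketch below lets a per-dataset `cache_strategy` override a workflow-level default. Nothing here exists in Galaxy: `CacheStrategy`, `resolve_cache_strategy` and the fallback to `own` (which mirrors the behavior described in this PR) are assumptions made for the example.

```python
from enum import Enum
from typing import Optional


class CacheStrategy(str, Enum):
    """Hypothetical cache hints, named after the values floated above."""
    OWN = "own"        # reuse only datasets in histories the user owns
    PUBLIC = "public"  # additionally consider public datasets
    NEVER = "never"    # always materialize a fresh copy


def resolve_cache_strategy(
    dataset_request: dict,
    workflow_default: Optional[str] = None,
) -> CacheStrategy:
    """Pick the per-dataset hint, then the workflow-level one, then "own"."""
    raw = (
        dataset_request.get("cache_strategy")
        or workflow_default
        or CacheStrategy.OWN.value
    )
    # Unknown values raise ValueError instead of silently falling back.
    return CacheStrategy(raw)
```

With this shape, a request could set `cache_strategy: never` on a single dataset while the rest of the workflow request inherits the top-level setting.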

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@github-actions github-actions bot added this to the 25.1 milestone Jul 31, 2025
@mvdbeek mvdbeek force-pushed the deduplicate_materialization branch from b2b8b5a to 64443d5 Compare July 31, 2025 16:59
mvdbeek added 3 commits July 31, 2025 19:30
if url and hash are provided.
@mvdbeek mvdbeek force-pushed the deduplicate_materialization branch from 64443d5 to cf21104 Compare July 31, 2025 17:30