
[Sharing my use-case] Remote (and secure) object-store [No Pulsar] #20748

@vladvisan

Description


In our case, we don't store datasets on Galaxy; instead, we store them directly on the computing cluster. Why?

  • Privacy/ownership of data
  • Size: our Galaxy machine has little space, and files are commonly large (>100 MB, even multi-GB) and numerous

We also cannot use Pulsar since:

  • the cluster admin does not want us to install any services, especially not any that would require root access.
  • we want to stay as "plug and play" as possible for any future cluster

Furthermore, we access this data (when needed) over a REST protocol that requires a per-user token, which we store in user_extra_preferences in our HashiCorp Vault. This REST protocol also supports HTTP Range requests.
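To make the token + HTTP-Range combination concrete, here is a minimal sketch of building such an authenticated ranged request; the URL, the Bearer scheme, and the function name are illustrative assumptions, not our actual protocol:

```python
import urllib.request

def build_range_request(url, token, start, end=None):
    """Build an authenticated HTTP request for a byte range.

    The Bearer auth scheme is an assumption; adapt the header to
    whatever the cluster's REST protocol actually expects.
    """
    # "bytes=start-" when no end is given, "bytes=start-end" otherwise
    range_value = f"bytes={start}-" if end is None else f"bytes={start}-{end}"
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",
            "Range": range_value,
        },
    )

# Hypothetical endpoint and token, for illustration only:
req = build_range_request("https://cluster.example.org/files/out.txt", "user-token", 0, 1023)
```

Because the server honours Range, dataset previews only ever transfer the bytes they need, which matters for the multi-GB files mentioned above.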

This implies several customisations:

  1. Users need to specify the output_path (or the folder/regex) for each output => we have a tool creation interface that automatically creates a mandatory field per output
  2. A custom object store
    • whose get_data method uses the REST protocol instead of os.open(). We did not inherit from the caching object store, because we don't want to store anything on Galaxy, even temporarily
    • which uses the output_path (see above) for each output
  3. A REST Job runner that submits the job asynchronously, and modifies the job script to make the references valid on the cluster
  4. Synchronising the job_workdir between Galaxy and the cluster with a job wrapper (layered on top of Galaxy's existing job wrapper), using rsync to import and then export
  5. Making all dataset accesses go through object_store.get_data() instead of open()
    • The "eye" button (dataset view/preview)
    • The count_lines() method in post-processing for .txt files (and potentially for other extensions with set_metadata)
  6. Accessing the tokens: We need them at several points:
    • the job runner, the "eye" button (=> objectstore), the "count_lines" method (=>objectstore), etc.
    • we need access to app.vault()
    • some of these have access to app, others don't => our get_token_of_user method cannot require an app input
    • We tried webapps.galaxy.api.get_app() but it gives None => we manually instantiate a Vault instead via the VaultFactory.get_vault_from_app() method, passing a fake app built from SimpleNamespace objects
  7. A fake "upload" tool that registers an input dataset in Galaxy from a cluster file path (the output datasets already exist in Galaxy thanks to our custom object store).
    • We might also be able to use the Data Library "upload as symlink" combined with some kind of mount, but
      1. in our initial tests it was slow
      2. we can no longer use sshfs => we would need an alternate "mount" built on our existing REST protocol
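Point 2 above (a cache-free object store whose get_data goes over REST) can be sketched roughly as follows. The class shape and the injected fetch_range callable are illustrative only, not Galaxy's actual ObjectStore interface:

```python
class RemoteObjectStore:
    """Sketch of a cache-free object store: serve dataset bytes straight
    from the cluster's REST endpoint, never touching local disk.

    `fetch_range(path, token, start, end)` stands in for the real REST
    client (e.g. a ranged, token-authenticated HTTP GET).
    """

    def __init__(self, fetch_range):
        self._fetch_range = fetch_range

    def get_data(self, output_path, token, start=0, count=-1):
        # Translate a (start, count) read into an HTTP Range;
        # count == -1 means "read to end of file" (end=None).
        end = None if count < 0 else start + count - 1
        return self._fetch_range(output_path, token, start, end)

# Usage with a fake fetcher standing in for the HTTP layer:
data = b"hello cluster"

def fake_fetch(path, token, start, end):
    return data[start: None if end is None else end + 1]

store = RemoteObjectStore(fake_fetch)
chunk = store.get_data("/scratch/out.txt", "tok", start=6, count=7)
```

Keeping the HTTP layer injected like this also makes the store testable without a live cluster.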
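Point 4's rsync import/export could look something like this sketch; the flags, the remote spec format, and the sync direction are assumptions to adapt:

```python
import subprocess

def rsync_cmd(src, dest):
    # -a preserves permissions/times; --delete keeps the two work dirs
    # in exact sync; the trailing "/" copies directory contents.
    return ["rsync", "-a", "--delete", src.rstrip("/") + "/", dest.rstrip("/") + "/"]

def sync_workdir(local_dir, remote_spec, run=subprocess.run):
    """Stage the job_workdir to the cluster before the job, then pull
    results back after it. `remote_spec` like "user@cluster:/scratch/jobdir"
    is illustrative; the real wrapper would use the REST-side paths.
    """
    run(rsync_cmd(local_dir, remote_spec), check=True)   # stage out to cluster
    # ... the job runs on the cluster here ...
    run(rsync_cmd(remote_spec, local_dir), check=True)   # pull results back
```

The `run` parameter is only there so the command construction can be exercised without shelling out.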
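For point 5, a count_lines that streams through object_store.get_data instead of open() might look like this; the get_data signature (returning b"" at end of file) is an assumption:

```python
def count_lines(get_data, path, token, chunk_size=1 << 20):
    """Count lines by streaming ranged reads through the object store,
    never opening a local file. `get_data(path, token, start, count)`
    returning b"" past end-of-file is an assumed contract.
    """
    lines, offset, last = 0, 0, b""
    while True:
        chunk = get_data(path, token, offset, chunk_size)
        if not chunk:
            break
        lines += chunk.count(b"\n")
        offset += len(chunk)
        last = chunk
    # A trailing partial line (no final newline) still counts as a line.
    if last and not last.endswith(b"\n"):
        lines += 1
    return lines

# Usage with a fake get_data backed by an in-memory blob:
blob = b"a\nbb\nccc"

def fake_get(path, token, start, count):
    return blob[start:start + count]

n = count_lines(fake_get, "/scratch/out.txt", "tok", chunk_size=4)
```

Reading in fixed-size ranges keeps memory flat even for the multi-GB files described above.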
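Point 6's workaround (instantiating a Vault without a live app by feeding the factory a SimpleNamespace fake) can be sketched like this. The attribute names on the fake app, the key layout, and the stub factory are all illustrative; the real code would pass Galaxy's VaultFactory.get_vault_from_app() mentioned above:

```python
from types import SimpleNamespace

def make_vault(vault_factory, vault_config_file):
    """Build a Vault without a live `app`: hand the factory a
    SimpleNamespace mimicking the attributes it reads. Which attributes
    the fake app needs depends on your Galaxy version's factory.
    """
    fake_app = SimpleNamespace(
        config=SimpleNamespace(vault_config_file=vault_config_file),
    )
    return vault_factory(fake_app)

def get_token_of_user(vault, user_id):
    # Per-user key mirroring user_extra_preferences; layout illustrative.
    return vault.read_secret(f"user/{user_id}/rest_token")

# Usage with a stub standing in for Galaxy's vault machinery:
class StubVault:
    def __init__(self, app):
        self.config_file = app.config.vault_config_file
    def read_secret(self, key):
        return {"user/42/rest_token": "s3cr3t"}.get(key)

vault = make_vault(StubVault, "vault_conf.yml")
token = get_token_of_user(vault, 42)
```

Because get_token_of_user takes the vault rather than app, it works the same from the job runner, the object store, and count_lines, whether or not app is reachable.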

Hopefully some of this can be useful to other people!

Thanks for your time
