Preemption support #504

@rhc54

As we start to work on defining preemption support, we have to consider several aspects of the problem:

  • how to specify the local preemption policies. I suggest that this is out of scope for this organization: it really is a problem for the local host environments, each of which may choose to handle it differently.

  • how to query what the local preemption policy is, what options are definable by the app/tool, and how it is implemented. This is largely a question of attribute definition for query support. The answer should indicate whether preemption is a “ctrl-z” (i.e., the job is paused but remains in memory), a complete shutdown and removal, or both (perhaps selectable by the app, perhaps as required by the replacement job).

  • how to communicate an app’s preemption support to the host environment. For example, if I can support preemption, what handshake do I understand, and what constraints exist on my support?

  • the preemption handshake itself. How does the host alert the app to a proposed preemption? Can the app respond with a counterproposal (e.g., take part of my allocation but leave some part of me running, I need N seconds to prepare, …)? What is the desired/required restart mechanism (e.g., restore from checkpoint)? And so on.

We currently have the following relevant definitions in pmix_common.h:

Session control attributes:

PMIX_SESSION_PREEMPT   (bool) preempt indicated jobs (given in accompanying pmix_info_t
                       via the PMIX_NSPACE attribute) in the specified session and recover
                       all their resources. If no PMIX_NSPACE is specified, then preempt
                       all jobs in the session.

Attributes relating to allocation requests:

PMIX_ALLOC_PREEMPTIBLE   (bool) by default, all jobs in the resulting allocation are
                         to be considered preemptible (overridable at per-job level)

Attributes relating to spawn requests:

PMIX_JOB_CTRL_PREEMPTIBLE    (bool) job can be pre-empted
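One way to read the "overridable at per-job level" wording is that a job-level setting, when present, wins over the allocation default. The sketch below encodes that reading. The tri-state enum and the function `job_is_preemptible` are hypothetical names used for illustration, not PMIx definitions, and the fallback rule is an assumption rather than documented behavior.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tri-state: a job may omit the attribute entirely. */
typedef enum {
    JOB_PREEMPT_UNSET,  /* PMIX_JOB_CTRL_PREEMPTIBLE not given for the job */
    JOB_PREEMPT_NO,     /* given as false */
    JOB_PREEMPT_YES     /* given as true */
} job_preempt_t;

/*
 * Resolve the effective setting for one job.
 * alloc_default stands for the PMIX_ALLOC_PREEMPTIBLE value of the allocation.
 * Assumption: an explicit per-job value overrides the allocation default.
 */
bool job_is_preemptible(bool alloc_default, job_preempt_t job_setting)
{
    assert(job_setting <= JOB_PREEMPT_YES);
    switch (job_setting) {
    case JOB_PREEMPT_YES:
        return true;
    case JOB_PREEMPT_NO:
        return false;
    default:
        return alloc_default;  /* no override: inherit from the allocation */
    }
}
```

Under this reading, a job spawned with the attribute set to false stays protected even inside a preemptible allocation, and a job that says nothing inherits whatever the allocation declared.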

Events:

PMIX_JCTRL_PREEMPT_ALERT    monitored by client to detect RM intends to preempt

PMIX_JCTRL_CHECKPOINT_COMPLETE   sent by client and monitored by server to
                        notify that a checkpoint operation has completed
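The two events suggest a possible client-side sequence: the client sees the preempt alert, takes a checkpoint, and then reports completion to the server. The state machine below sketches that sequence. It is an assumption that a checkpoint is the client's response to the alert, since the definitions above do not say so, and all enum and function names here are hypothetical stand-ins rather than PMIx symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical client-side states, events, and actions (not PMIx symbols). */
typedef enum { ST_RUNNING, ST_CHECKPOINTING, ST_READY } client_state_t;

typedef enum {
    EV_PREEMPT_ALERT,    /* stands in for receiving PMIX_JCTRL_PREEMPT_ALERT */
    EV_CHECKPOINT_DONE   /* local notice that the checkpoint finished */
} client_event_t;

typedef enum {
    ACT_NONE,
    ACT_START_CHECKPOINT,
    ACT_SEND_CHECKPOINT_COMPLETE  /* stands in for PMIX_JCTRL_CHECKPOINT_COMPLETE */
} client_action_t;

/* Advance the state machine by one event and report what the client should do. */
client_action_t client_step(client_state_t *st, client_event_t ev)
{
    assert(st != NULL);
    if (*st == ST_RUNNING && ev == EV_PREEMPT_ALERT) {
        *st = ST_CHECKPOINTING;
        return ACT_START_CHECKPOINT;
    }
    if (*st == ST_CHECKPOINTING && ev == EV_CHECKPOINT_DONE) {
        *st = ST_READY;
        return ACT_SEND_CHECKPOINT_COMPLETE;
    }
    /* Any other combination (e.g., a repeated alert) is ignored in this sketch. */
    return ACT_NONE;
}
```

A real client would drive these transitions from its event handler. What the server does once it sees the completion event is part of the handshake still to be defined.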
