dws and port_distributor jobtap plugins may crash broker when unloaded #337

@jameshcorbett

In a comment on #336, @grondo pointed out that any dynamically loaded plugin that does asynchronous work may crash the broker if it is unloaded while that work is still pending:

If the plugin is unloaded while the watcher is still active, this could cause a broker crash since the epilog_timeout_cb symbol won't exist anymore.

Unfortunately, in these situations, the plugin needs to keep a list of outstanding watchers so it can stop them and/or destroy them on exit. i.e. you can't rely on using the job aux item for this purpose.

I asked:

How could I trigger cleanup on plugin removal? By using flux_plugin_set_aux or something, with the destructor set to my cleanup function?

and @grondo said:

Yeah, I think the approach would be to have a global context stored in the plugin aux cache and add a list of objects to that ctx that would need to be freed on plugin removal.

Actually looking at existing plugins in core, many of them suffer from this same issue 🤦. Feel free to add an issue to address this at some point in the future. For now we'll have to be careful reloading most plugins that do asynchronous work. Maybe we can add something to the API to make this more manageable and obvious.

Both the dws-jobtap and cray_pals_port_distributor plugins have this vulnerability.
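
To make the suggested approach concrete, here is a minimal sketch of a plugin-wide context that tracks outstanding watchers and tears them all down from a destructor. It is only an illustration, not code from either plugin: `struct watcher`, `struct plugin_ctx`, and every `ctx_*` function are hypothetical names, and the `watcher` struct is a stand-in for a real reactor watcher so the example compiles without flux-core. In an actual plugin I would expect the context to be stored in the plugin aux cache with `ctx_destroy` as the destructor, and the teardown loop to stop and destroy real watchers instead of freeing a stub.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an asynchronous watcher (e.g. a timer).
 * A real plugin would hold a reactor watcher handle here instead. */
struct watcher {
    int active;                  /* 1 while its callback could still fire */
    struct watcher *prev;
    struct watcher *next;
};

/* Plugin-wide context: owns every watcher that is still outstanding. */
struct plugin_ctx {
    struct watcher *head;        /* doubly linked list of live watchers */
    int outstanding;             /* number of entries in the list */
};

struct plugin_ctx *ctx_create (void)
{
    return calloc (1, sizeof (struct plugin_ctx));
}

/* Start a watcher and record it in the context so it can be found later. */
struct watcher *ctx_watcher_start (struct plugin_ctx *ctx)
{
    struct watcher *w = calloc (1, sizeof (*w));
    if (!w)
        return NULL;
    w->active = 1;
    w->next = ctx->head;
    if (ctx->head)
        ctx->head->prev = w;
    ctx->head = w;
    ctx->outstanding++;
    return w;
}

/* Normal completion path: unlink the watcher, then release it. */
void ctx_watcher_finish (struct plugin_ctx *ctx, struct watcher *w)
{
    if (w->prev)
        w->prev->next = w->next;
    else
        ctx->head = w->next;
    if (w->next)
        w->next->prev = w->prev;
    w->active = 0;               /* "stop" before destroying */
    free (w);
    ctx->outstanding--;
}

/* Stop every watcher that is still outstanding; returns how many. */
int ctx_stop_all (struct plugin_ctx *ctx)
{
    int count = 0;
    while (ctx->head) {
        ctx_watcher_finish (ctx, ctx->head);
        count++;
    }
    return count;
}

/* Destructor shaped like an aux free function: void (*)(void *).
 * Running this on plugin removal means no callback can fire afterwards. */
void ctx_destroy (void *arg)
{
    struct plugin_ctx *ctx = arg;
    if (!ctx)
        return;
    ctx_stop_all (ctx);
    free (ctx);
}
```

The point of the design is that watcher lifetime is tied to the plugin rather than to a job: a watcher that finishes normally removes itself from the list, and anything left over is cleaned up when the context is destroyed, so no callback symbol is referenced after the plugin is gone.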
