Skip to content

feat(ocap-kernel): Support multiple subclusters #530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

sirtimid
Copy link
Contributor

@sirtimid sirtimid commented Jun 2, 2025

Closes #391 #401

Currently the kernel only supports a single subcluster at a time.
With this PR we enable users to create, retrieve, update and delete multiple subclusters, each with their own configuration. Also updated the control panel to support viewing and managing these subclusters through the interface.

Screen.Recording.2025-06-04.at.12.34.47.mov

@sirtimid sirtimid requested a review from a team as a code owner June 2, 2025 10:49
grypez
grypez previously requested changes Jun 3, 2025
Copy link
Contributor

@grypez grypez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design LGTM. I summarize the key decisions below.

  • enforced one-to-many subcluster-to-vat relation
  • reloadSubcluster reloads all of the configured vats, even if they were terminated
  • special rogue vat status for vats not in subclusters

Code looks good overall. I make some small recommendations.

@rekmarks rekmarks linked an issue Jun 3, 2025 that may be closed by this pull request
@sirtimid
Copy link
Contributor Author

sirtimid commented Jun 3, 2025

Design LGTM. I summarize the key decisions below.

  • enforced one-to-many subcluster-to-vat relation
  • reloadSubcluster reloads all of the configured vats, even if they were terminated
  • special rogue vat status for vats not in subclusters

Code looks good overall. I make some small recommendations.

@grypez I just want to point out that in the Kernel we have two distinct operations for handling vats/subclusters: restart and reload. The key difference is that restart is a "pause and resume" operation, it stops the vat but preserves its state and identity, then immediately starts it again with the same configuration. On the other hand, reload is a "clean slate" operation, it fully terminates the vat/subcluster, creates a new instance with a fresh state and assigns a new ID. When you restart a vat it's like putting it to sleep and waking it up, while reload is like creating a brand new instance from scratch. This is why reloadSubcluster loads all of the configured vats, even if they were terminated. I didn't create a restartSubcluster since we can run restartVat individually, but it is easy to implement if needed

@sirtimid sirtimid force-pushed the sirtimid/subclusters branch from 59eabea to af76f27 Compare June 3, 2025 20:41
@sirtimid sirtimid requested review from grypez and FUDCo June 3, 2025 20:42
@FUDCo
Copy link
Contributor

FUDCo commented Jun 3, 2025

special rogue vat status for vats not in subclusters

If I'm following this correctly (caveat: not entirely sure I am) I don't see the purpose at all in supporting a vat not in a subcluster. A lone vat is just a subcluster with a single vat in it, which I always expected to be an entirely normal thing. As I suggested above, the main reason for introducing the subcluster abstraction was to allow a vat's configuration to be coordinated with the configurations of other vats (e.g., because they are to be born in some kind of cooperative relationship with each other). Basically, it's a another layer of nested braces in the config file. Being able to do things like terminate a subcluster seems like a really useful leveraging of that concept (i.e., these things are together, so it's nice to provide a safe and convenient affordance for doing something to all of them collectively instead of having to futz with them individually), but it's not fundamental.

@grypez
Copy link
Contributor

grypez commented Jun 3, 2025

@grypez I just want to point out that in the Kernel we have two distinct operations for handling vats/subclusters: ... restart is a "pause and resume" operation, it stops the vat but preserves its state and identity, then immediately starts it again with the same configuration. On the other hand, reload is a "clean slate" operation, it fully terminates the vat/subcluster, creates a new instance with a fresh state and assigns a new ID. ... I didn't create a restartSubcluster since we can run restartVat individually, but it is easy to implement if needed

That makes sense, and explains why reloadSubcluster depends only on the configuration, not the state.

@grypez grypez dismissed their stale review June 3, 2025 22:43

Addressed

@grypez
Copy link
Contributor

grypez commented Jun 3, 2025

If I'm following this correctly (caveat: not entirely sure I am) I don't see the purpose at all in supporting a vat not in a subcluster. A lone vat is just a subcluster with a single vat in it, which I always expected to be an entirely normal thing. ...

I support single vat subclusters pulling their own bootstraps. This is already possible.

grypez
grypez previously approved these changes Jun 4, 2025
Copy link
Contributor

@grypez grypez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sirtimid
Copy link
Contributor Author

sirtimid commented Jun 4, 2025

special rogue vat status for vats not in subclusters

If I'm following this correctly (caveat: not entirely sure I am) I don't see the purpose at all in supporting a vat not in a subcluster. A lone vat is just a subcluster with a single vat in it, which I always expected to be an entirely normal thing. As I suggested above, the main reason for introducing the subcluster abstraction was to allow a vat's configuration to be coordinated with the configurations of other vats (e.g., because they are to be born in some kind of cooperative relationship with each other). Basically, it's a another layer of nested braces in the config file. Being able to do things like terminate a subcluster seems like a really useful leveraging of that concept (i.e., these things are together, so it's nice to provide a safe and convenient affordance for doing something to all of them collectively instead of having to futz with them individually), but it's not fundamental.

You're absolutely right, treating a lone vat as simply a subcluster with one vat makes a lot of sense, and I agree it would simplify the model overall. What you're suggesting would mean removing support for launching vats outside of a subcluster entirely (i.e., removing kernel.launchVat), which I think is the right direction long-term.

That said, since my current task was scoped specifically to adding support for multiple subclusters, I’ve kept the existing behavior for now to avoid widening the scope too much, especially given that a lot of our integration tests in @ocap/kernel-test rely on launching standalone vats.

To make sure we follow up on this properly, I’ve created a new task to migrate the kernel to fully enforce subcluster-based vat launching. I'll pick that up next.

Copy link
Contributor

@FUDCo FUDCo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Extend kernel to support multiple subclusters Add Support for Multiple Sub-clusters
3 participants