Skip to content

Carcosa Interface Components#2632

Open
nab880 wants to merge 2 commits intosstsimulator:develfrom
nab880:devel
Open

Carcosa Interface Components#2632
nab880 wants to merge 2 commits intosstsimulator:develfrom
nab880:devel

Conversation

@nab880
Copy link
Contributor

@nab880 nab880 commented Feb 12, 2026

This PR adds the Carcosa interface layer and supporting components so Hali can sit between CPUs/sensors and the memory hierarchy, coordinate fault injection via a shared registry, and support both sensor/ring workloads and Vanadis MMIO coordination (e.g. ping-pong). It also adds the MemHierarchy PortModule faultInjectorMemH, the PMDataRegistry for manager–injector communication, and tests that verify Hali, the manager logic, and fault injection behavior.

Hali – Main interface component.
Sits on the memHierarchy (highlink = CPU side, lowlink = cache/memory side). Forwards MemEvents
Uses a FaultInjManager to attach data that PortModules can read to modify behavior.
Example of this working exists in the added faultInjector faultInjectorMemH. Ideally this would be abstracted out into the general "FaultInjector API" but I leave that to future work.
Working Features:
MMIO mode when control_addr_base/control_addr_size are set: intercepts loads/stores to a control region for Vanadis coordination (command/status registers), with optional ring-based “done” sync (see ping-ping example in Carcosa tests).

FaultInjManagerAPI / FaultInjManager – Subcomponent used by Hali. Queues highlink/lowlink PortModule requests. Uses event ID to coordinate behavior through PMDataRegistry.

PMDataRegistry – Shared state between manager and injectors.

CarcosaMemCtrl – MemHierarchy-style memory controller with optional iflLinks_N. Used for backing manipulation.

hyades.h – Single-header API for Vanadis to talk to Hali’s MMIO control region.
Applications that use hyades should have a jump table Upon accessing the MMIO control region the Hali interface intercepts the memory access and returns a jump table index. See pingpong.c in carcosa/tests for example. This will likely need to be expanded for more complex use cases.

testCarcosaPingPong.py – Two Vanadis cores, one VanadisNodeOS, two processes (same binary, roles 0 and 1). Puts Hali in each core’s data path (CPU → Hali → dTLB → L1D), MMIO params

@nab880 nab880 assigned nab880 and unassigned nab880 Feb 12, 2026
@nab880 nab880 requested a review from mgoldstein322 February 12, 2026 09:05
@sst-autotester
Copy link
Contributor

Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection is Not Necessary for this Pull Request.

@sst-autotester
Copy link
Contributor

Status Flag 'Pull Request AutoTester' - Failure: Timed out waiting for job SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements to start: Total Wait = 303

  • Other jobs have been previously started - We must stop them...

@sst-autotester
Copy link
Contributor

Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects:

Pull Request Auto Testing STARTING (click to expand)

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements

  • Build Num: 2518
  • Status: STARTED

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_Make-Dist

  • Build Num: 1411
  • Status: STARTED

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MT-2

  • Build Num: 2461
  • Status: STARTED

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MR-2

  • Build Num: 2463
  • Status: STARTED

Build Information

Test Name: SST__AutotestGen2_NewFW_OSX-15-XC15-ARM2_OMPI-4.1.6_PY3.10_sst-elements

  • Build Num: 1011
  • Status: STARTED

Using Repos:

Repo: ELEMENTS (nab880/sst-elements)
  • Branch: devel
  • SHA: f7c627f
  • Mode: TEST_REPO
Repo: SQE (sstsimulator/sst-sqe)
  • Branch: devel
  • SHA: f30c768973a466c3a0a22652d1eead26f647f836
  • Mode: SUPPORT_REPO
Repo: CORE (sstsimulator/sst-core)
  • Branch: devel
  • SHA: 8e483403e9843ec18c03e2d7576d98d3c35cb0fc
  • Mode: SUPPORT_REPO
Repo: MACRO (sstsimulator/sst-macro)
  • Branch: devel
  • SHA: 489f035111339bbd94e9db661a68c1b2cd7faf77
  • Mode: SUPPORT_REPO

Pull Request Author: nab880

@nab880 nab880 self-assigned this Feb 12, 2026
@sst-autotester
Copy link
Contributor

Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED

Pull Request Auto Testing has PASSED (click to expand)

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements

  • Build Num: 2518
  • Status: PASSED

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_Make-Dist

  • Build Num: 1411
  • Status: PASSED

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MT-2

  • Build Num: 2461
  • Status: PASSED

Build Information

Test Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MR-2

  • Build Num: 2463
  • Status: PASSED

Build Information

Test Name: SST__AutotestGen2_NewFW_OSX-15-XC15-ARM2_OMPI-4.1.6_PY3.10_sst-elements

  • Build Num: 1011
  • Status: PASSED

@sst-autotester
Copy link
Contributor

Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
NO REVIEWS HAVE BEEN PERFORMED ON THIS PULL REQUEST!

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

18 similar comments
@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

3 similar comments
@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

Copy link
Contributor

@mgoldstein322 mgoldstein322 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments:

  • I'm not sure I really understand the PMDataRegistry. I can see the registry resolver maps a registry id to the actual registry; then the registry maps an event id to a port module command, but what are these commands? Is this how you instruct a PM to change its fault/swap out its injector?
  • I know you said a decent amount of this was hardcoded to make the ping pong work, but I'm still pretty confused on the Hali code. My understanding is this: each Hali instance is a node in a ring of Hali instances. These nodes pass messages around the ring, but I'm not sure what the ring's purpose is. I've left various comments at my points of confusion in the Hali.cc file for more specific things, but I don't think I understand much beyond the MMIO coordination checking for specific commands at the magic address to see what to do.
  • CarcosaMemCtrl: is this where most of Hali actually works? I see the custom event handlers and that they're open subcomponent slots, so I assume these are for the FaultInjEvents, HaliEvents, and CpuEvents. How does this differentiate between these events? Likewise, would CarcosaMemCtrl be loaded as the backend for a memory controller in SST?
  • FaultInjectorMemH should be rewritten as a subclass of FaultInjectorBase. It contains a lot of old code from when I was just starting the Injector implementation that's either been completely changed or outright removed. I marked most of it, but probably not all of it.

bool found;
eventsToSend_ = params.find<unsigned>("eventsToSend", 0, found);
if (!found) {
eventsToSend_ = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why check if it's found if it's going to be set to 1 by default? Couldn't you just make the default 1 above?

} else {
// Event from another component - record and forward
neighbors_.insert(event->getStr());
eventsToSend_ += 5;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is registering a neighbor increasing this by 5?

// Calculate events to send based on neighbor count
eventsToSend_ /= (neighbors_.size() + 1);
out_->output(" %s will send %u events to each other component.\n", getName().c_str(), eventsToSend_);
eventsToSend_ *= neighbors_.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this relates to my previous comment, but I'm confused on what it's doing.

primaryComponentOKToEndSim();
} else {
if (cpuLink_ && verbose_) out_->output("HaliSensorEvent: sending CPUEvent\n");
if (cpuLink_) cpuLink_->send(new CpuEvent("data"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placeholder?

if (eventsSent_ != eventsToSend_) {
leftHaliLink_->send(new HaliEvent(*iter_));
eventsSent_++;
iter_++;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just sending neighbor names around the ring?


SST_ELI_DOCUMENT_PARAMS(
{"installDirection", "Flag which direction the injector should read from on a port. Valid optins are \'Send\', \'Receive\', and \'Both\'. Default is \'Receive\'."},
{"injectionProbability", "The probability with which an injection should occur. Valid inputs range from 0 to 1. Default = 0.5."},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can remove 54-56

}
ImplementSerializable(SST::Carcosa::FaultInjectorMemH)

/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove 128-144

SST::PortModule::serialize_order(ser);
// serialize parameters like `SST_SER(<param_member>)`
SST_SER(installDirection_);
SST_SER(injectionProbability_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove 123-124

@@ -0,0 +1,158 @@
/*
* Hyades - Vanadis control abstraction
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the commitment to the naming :)

// information, see the LICENSE file in the top level directory of the
// distribution.

#ifndef CARCOSA_PMDATAREGISTRY_H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate file?

@sst-autotester
Copy link
Contributor

Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging
THE LAST COMMIT TO THIS PULL REQUEST HAS BEEN REVIEWED, BUT NOT ACCEPTED OR REQUIRES CHANGES!

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@nab880
Copy link
Contributor Author

nab880 commented Feb 17, 2026

Hi Micheal,

These are good suggestions and I update the pull request in a minute.
In the meantime:

  • I'm not sure I really understand the PMDataRegistry. I can see the registry resolver maps a registry id to the actual registry; then the registry maps an event id to a port module command, but what are these commands? Is this how you instruct a PM to change its fault/swap out its injector?
    **PMDataRegistry contains "commands" that the PortModule has to interpret. The PortModule 'injector' interprets the commands and then modifies its behavior. Upon receiving a memory request the Injector will check the PMDataRegistry to see if a the FaultInjectionManager has a message corresponding to this MemEvent. **
  • I know you said a decent amount of this was hardcoded to make the ping pong work, but I'm still pretty confused on the Hali code. My understanding is this: each Hali instance is a node in a ring of Hali instances. These nodes pass messages around the ring, but I'm not sure what the ring's purpose is. I've left various comments at my points of confusion in the Hali.cc file for more specific things, but I don't think I understand much beyond the MMIO coordination checking for specific commands at the magic address to see what to do.
    Hali is connected to Cores, so for every core there should be a corresponding Hali compnent. The Hali components are connected in a ring in the event that they need to broadcast information to each other. While we don't really have a use-case for this yet, I imagine it will be useful later on. i.e., we want to turn off all fault injection in a specific case.
  • CarcosaMemCtrl: is this where most of Hali actually works? I see the custom event handlers and that they're open subcomponent slots, so I assume these are for the FaultInjEvents, HaliEvents, and CpuEvents. How does this differentiate between these events? Likewise, would CarcosaMemCtrl be loaded as the backend for a memory controller in SST?
    **CarcosaMemCtrl is largely just an extension of the MemHierarchy MemoryController. I added additional functionality where we can exchange how the backend data is stored. The motivation for this was that if we want to reduce the overall memory requirements for SST but still want to have different pages of memory depending on the application. A Vanadis node might perform kernel A and then communicate with Hali it's done. Then Hali can change the contents of the memory backed before issuing the next kernel to be computed. **
  • FaultInjectorMemH should be rewritten as a subclass of FaultInjectorBase. It contains a lot of old code from when I was just starting the Injector implementation that's either been completely changed or outright removed. I marked most of it, but probably not all of it.
    **Yes, you are right this needs to be updated. **

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

10 similar comments
@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur...

@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However PR is now STALE, and must be retested. Set the AT: RETEST Label to force retest....

1 similar comment
@sst-autotester
Copy link
Contributor

All Jobs Finished; status = PASSED, However PR is now STALE, and must be retested. Set the AT: RETEST Label to force retest....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments