Conversation
|
Status Flag 'Pre-Test Inspection' - Auto Inspected - Inspection is Not Necessary for this Pull Request. |
|
Status Flag 'Pull Request AutoTester' - Failure: Timed out waiting for job SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements to start: Total Wait = 303
|
|
Status Flag 'Pull Request AutoTester' - Testing Jenkins Projects: Pull Request Auto Testing STARTING (click to expand)Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements
Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_Make-Dist
Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MT-2
Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MR-2
Build InformationTest Name: SST__AutotestGen2_NewFW_OSX-15-XC15-ARM2_OMPI-4.1.6_PY3.10_sst-elements
Using Repos:
Pull Request Author: nab880 |
|
Status Flag 'Pull Request AutoTester' - Jenkins Testing: all Jobs PASSED Pull Request Auto Testing has PASSED (click to expand)Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements
Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_Make-Dist
Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MT-2
Build InformationTest Name: SST__AutotestGen2_NewFW_sst-test_OMPI-4.1.4_PY3.9_sst-elements_MR-2
Build InformationTest Name: SST__AutotestGen2_NewFW_OSX-15-XC15-ARM2_OMPI-4.1.6_PY3.10_sst-elements
|
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
18 similar comments
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
3 similar comments
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
mgoldstein322
left a comment
There was a problem hiding this comment.
General comments:
- I'm not sure I really understand the PMDataRegistry. I can see the registry resolver maps a registry id to the actual registry; then the registry maps an event id to a port module command, but what are these commands? Is this how you instruct a PM to change its fault/swap out its injector?
- I know you said a decent amount of this was hardcoded to make the ping pong work, but I'm still pretty confused on the Hali code. My understanding is this: each
Haliinstance is a node in a ring ofHaliinstances. These nodes pass messages around the ring, but I'm not sure what the ring's purpose is. I've left various comments at my points of confusion in theHali.ccfile for more specific things, but I don't think I understand much beyond the MMIO coordination checking for specific commands at the magic address to see what to do. - CarcosaMemCtrl: is this where most of Hali actually works? I see the custom event handlers and that they're open subcomponent slots, so I assume these are for the FaultInjEvents, HaliEvents, and CpuEvents. How does this differentiate between these events? Likewise, would CarcosaMemCtrl be loaded as the backend for a memory controller in SST?
- FaultInjectorMemH should be rewritten as a subclass of FaultInjectorBase. It contains a lot of old code from when I was just starting the Injector implementation that's either been completely changed or outright removed. I marked most of it, but probably not all of it.
| bool found; | ||
| eventsToSend_ = params.find<unsigned>("eventsToSend", 0, found); | ||
| if (!found) { | ||
| eventsToSend_ = 1; |
There was a problem hiding this comment.
Why check if it's found if it's going to be set to 1 by default? Couldn't you just make the default 1 above?
| } else { | ||
| // Event from another component - record and forward | ||
| neighbors_.insert(event->getStr()); | ||
| eventsToSend_ += 5; |
There was a problem hiding this comment.
Why is registering a neighbor increasing this by 5?
| // Calculate events to send based on neighbor count | ||
| eventsToSend_ /= (neighbors_.size() + 1); | ||
| out_->output(" %s will send %u events to each other component.\n", getName().c_str(), eventsToSend_); | ||
| eventsToSend_ *= neighbors_.size(); |
There was a problem hiding this comment.
I assume this relates to my previous comment, but I'm confused on what it's doing.
| primaryComponentOKToEndSim(); | ||
| } else { | ||
| if (cpuLink_ && verbose_) out_->output("HaliSensorEvent: sending CPUEvent\n"); | ||
| if (cpuLink_) cpuLink_->send(new CpuEvent("data")); |
| if (eventsSent_ != eventsToSend_) { | ||
| leftHaliLink_->send(new HaliEvent(*iter_)); | ||
| eventsSent_++; | ||
| iter_++; |
There was a problem hiding this comment.
Just sending neighbor names around the ring?
|
|
||
| SST_ELI_DOCUMENT_PARAMS( | ||
| {"installDirection", "Flag which direction the injector should read from on a port. Valid optins are \'Send\', \'Receive\', and \'Both\'. Default is \'Receive\'."}, | ||
| {"injectionProbability", "The probability with which an injection should occur. Valid inputs range from 0 to 1. Default = 0.5."}, |
| } | ||
| ImplementSerializable(SST::Carcosa::FaultInjectorMemH) | ||
|
|
||
| /** |
| SST::PortModule::serialize_order(ser); | ||
| // serialize parameters like `SST_SER(<param_member>)` | ||
| SST_SER(installDirection_); | ||
| SST_SER(injectionProbability_); |
| @@ -0,0 +1,158 @@ | |||
| /* | |||
| * Hyades - Vanadis control abstraction | |||
There was a problem hiding this comment.
I love the commitment to the naming :)
| // information, see the LICENSE file in the top level directory of the | ||
| // distribution. | ||
|
|
||
| #ifndef CARCOSA_PMDATAREGISTRY_H |
|
Status Flag 'Pre-Merge Inspection' - - This Pull Request Requires Inspection... The code must be inspected by a member of the Team before Testing/Merging |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
Hi Micheal, These are good suggestions and I update the pull request in a minute.
|
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
10 similar comments
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However Inspection must be performed before merge can occur... |
|
All Jobs Finished; status = PASSED, However PR is now STALE, and must be retested. Set the AT: RETEST Label to force retest.... |
1 similar comment
|
All Jobs Finished; status = PASSED, However PR is now STALE, and must be retested. Set the AT: RETEST Label to force retest.... |
This PR adds the Carcosa interface layer and supporting components so Hali can sit between CPUs/sensors and the memory hierarchy, coordinate fault injection via a shared registry, and support both sensor/ring workloads and Vanadis MMIO coordination (e.g. ping-pong). It also adds the MemHierarchy PortModule faultInjectorMemH, the PMDataRegistry for manager–injector communication, and tests that verify Hali, the manager logic, and fault injection behavior.
Hali – Main interface component.
Sits on the memHierarchy (highlink = CPU side, lowlink = cache/memory side). Forwards MemEvents
Uses a FaultInjManager to attach data that PortModules can read to modify behavior.
Example of this working exists in the added faultInjector faultInjectorMemH. Ideally this would be abstracted out into the general "FaultInjector API" but I leave that to future work.
Working Features:
MMIO mode when control_addr_base/control_addr_size are set: intercepts loads/stores to a control region for Vanadis coordination (command/status registers), with optional ring-based “done” sync (see ping-ping example in Carcosa tests).
FaultInjManagerAPI / FaultInjManager – Subcomponent used by Hali. Queues highlink/lowlink PortModule requests. Uses event ID to coordinate behavior through PMDataRegistry.
PMDataRegistry – Shared state between manager and injectors.
CarcosaMemCtrl – MemHierarchy-style memory controller with optional iflLinks_N. Used for backing manipulation.
hyades.h – Single-header API for Vanadis to talk to Hali’s MMIO control region.
Applications that use hyades should have a jump table Upon accessing the MMIO control region the Hali interface intercepts the memory access and returns a jump table index. See pingpong.c in carcosa/tests for example. This will likely need to be expanded for more complex use cases.
testCarcosaPingPong.py – Two Vanadis cores, one VanadisNodeOS, two processes (same binary, roles 0 and 1). Puts Hali in each core’s data path (CPU → Hali → dTLB → L1D), MMIO params