In the p9 error injection trial, it took 20 minutes to generate a system checkstop #223

Amay258 · 2023-05-10T10:08:49Z

In the p9 error injection trial, we are seeing an intermittent problem. After executing an error injection command such as 'putscom - c 0x8 0x07010A0D 0x0000000003AF0000', it took 20 minutes for the system to generate a checkstop and successfully deconfigure the DIMM that had the error injected.

A preliminary analysis shows that after the input control register (000000 07010A0D) was written, the value of the corresponding fault isolation register (000000 07010A00) did not change, so no checkstop occurred.

Writing directly to the corresponding FIR does trigger a checkstop and successfully deconfigures the DIMM.

Why does the FIR value only change after more than 20 minutes?

dcrowell77 · 2023-05-10T13:49:45Z

What interface are you using for the putscom? I don't recognize the syntax above.

My understanding of the way 0x07010A0D works is that it places errors into the hardware, but those errors are not surfaced until memory behind that memory controller is actually accessed. Therefore, unless you are explicitly forcing all of mainstore to be accessed (e.g. by running an exerciser of some kind), the results will be somewhat non-deterministic.

I also think you might be missing some bits that have to be set to control the injection.
Bits 0:36 : EICR_ADDRESS: Error is injected when read address matches the EICR address, up to fields masked by the EICR region.
0 = dimm select
1:2 = mrank(0:1)
3:5 = srank(0:2)
6:7 = bank_group(0:1)
8:10 = bank(0:2)
11:28 = row(0:17)
29:36 = col(2:9)
Without those bits set there will never be a match to trigger the inject.
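As a rough illustration of the layout above, here is a minimal Python sketch that packs and unpacks the EICR_ADDRESS fields. It does not touch hardware. The names `EICR_ADDRESS_FIELDS`, `place_field`, `eicr_address` and `decode_eicr_address` are hypothetical helpers, not part of any tool. It assumes a 64-bit register with IBM bit numbering (bit 0 is the most significant bit) and that each field value is stored as a plain integer in its bit range; the `col` field is treated as a raw 8-bit value for column bits 2:9.

```python
# Field layout copied from the list above: name -> (first_bit, last_bit).
# Assumption: IBM bit numbering, where bit 0 is the MSB of a 64-bit register.
EICR_ADDRESS_FIELDS = {
    "dimm": (0, 0),
    "mrank": (1, 2),
    "srank": (3, 5),
    "bank_group": (6, 7),
    "bank": (8, 10),
    "row": (11, 28),
    "col": (29, 36),  # assumed to hold column bits 2:9 as a raw 8-bit value
}


def place_field(value, first_bit, last_bit):
    """Shift a field value into its position within the 64-bit register."""
    width = last_bit - first_bit + 1
    if not 0 <= value < (1 << width):
        raise ValueError(f"value {value:#x} does not fit in {width} bits")
    return value << (63 - last_bit)


def eicr_address(**fields):
    """Build the EICR_ADDRESS portion (bits 0:36) from named field values."""
    reg = 0
    for name, value in fields.items():
        first_bit, last_bit = EICR_ADDRESS_FIELDS[name]
        reg |= place_field(value, first_bit, last_bit)
    return reg


def decode_eicr_address(reg):
    """Extract each EICR_ADDRESS field from a 64-bit register value."""
    decoded = {}
    for name, (first_bit, last_bit) in EICR_ADDRESS_FIELDS.items():
        width = last_bit - first_bit + 1
        decoded[name] = (reg >> (63 - last_bit)) & ((1 << width) - 1)
    return decoded


if __name__ == "__main__":
    # Decode the value quoted earlier in this thread.
    print(decode_eicr_address(0x0000000003AF0000))
    # Hypothetical target address, for illustration only.
    print(f"{eicr_address(dimm=0, mrank=1, row=0x10, col=0x4):#018x}")
```

Under these assumptions, the value 0x0000000003AF0000 quoted earlier in the thread decodes to zero in every address field, because all of its set bits fall below bit 36.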

Amay258 · 2023-05-11T06:34:33Z

Putscom - c 0x0 0x07010A0D 0x00000000003AF0000 injects an error on CPU0_C0D0.
Putscom - c 0x8 0x07010A0D 0x00000000003AF0000 injects an error on CPU1_C0D0.

Normally, executing the error injection command triggers a checkstop immediately and the injection succeeds. In the current situation, however, the checkstop sometimes occurs only 20 minutes after the command is executed. The injection still succeeds, but why do we need to wait 20 minutes?

dcrowell77 · 2023-05-11T18:56:53Z

What do you mean by "the normal situation"? Have you seen other behavior with this specific injection? I still believe that it won't fail until the memory is physically accessed, and when that happens is non-deterministic.
