In the p9 error injection trial, it took 20 minutes to generate a system checkstop #223

Open
Amay258 opened this issue May 10, 2023 · 3 comments

Comments

@Amay258

Amay258 commented May 10, 2023

In the P9 error injection trial, we see an intermittent problem. After executing an error injection command such as 'putscom - c 0x8 0x07010A0D 0x0000000003AF0000', it sometimes takes 20 minutes before a system checkstop is generated and the DIMM that received the injected error is deconfigured.

Our preliminary analysis shows that after the error inject control register (000000 07010A0D) was written, the value of the corresponding fault isolation register (000000 07010A00) did not change, so no checkstop occurred.

Writing directly to the corresponding FIR does trigger a checkstop and successfully deconfigures the DIMM.

May I ask why the FIR value only changed after more than 20 minutes?
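
One way to quantify the delay described above is to poll the FIR after the control register write and record how long it takes to change. The sketch below is only an illustration: `read_fir` is a hypothetical callable that returns the current value of register 0x07010A00 through whatever SCOM interface is in use, and it is not part of any tool named in this thread.

```python
import time
from typing import Callable, Optional, Tuple


def wait_for_fir_change(
    read_fir: Callable[[], int],   # hypothetical: returns the current FIR value
    baseline: int,                 # FIR value sampled before the injection
    timeout_s: float,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[int, float]]:
    """Poll until the FIR differs from baseline.

    Returns (new_value, elapsed_seconds), or None if timeout_s elapses first.
    """
    start = clock()
    while True:
        value = read_fir()
        if value != baseline:
            return value, clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(interval_s)
```

Logging the elapsed time across several runs would show whether the delay is really fixed at about 20 minutes or varies from run to run.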

@dcrowell77
Collaborator

What interface are you using for the putscom? I don't recognize the syntax above.

My understanding of the way 0x07010A0D works is that it places errors into the hardware, but those errors are not surfaced until memory behind that memory controller is actually accessed. Therefore, unless you are explicitly forcing all of mainstore to be accessed (e.g. by running an exerciser of some kind), the results will be non-deterministic.
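
As a rough illustration of what "forcing memory to be accessed" means, the sketch below allocates a buffer and then writes and reads one byte in every page. It is an assumption-laden toy, not a real exerciser: the function name and page size are made up for this example, the operating system decides which physical memory backs the buffer, and CPU caches may satisfy the reads without reaching the DIMM at all.

```python
def touch_buffer(size_bytes: int, page_size: int = 4096) -> int:
    """Write, then read back, one byte per page of a freshly allocated buffer.

    Returns the sum of the bytes read so the reads cannot be skipped.
    """
    buf = bytearray(size_bytes)

    # Write one byte per page so each page is backed by real memory.
    for offset in range(0, size_bytes, page_size):
        buf[offset] = (offset // page_size) & 0xFF

    # Read every page back.
    checksum = 0
    for offset in range(0, size_bytes, page_size):
        checksum += buf[offset]
    return checksum


if __name__ == "__main__":
    # Touch 256 MiB; a real exerciser would cover far more of mainstore.
    touch_buffer(256 * 1024 * 1024)
```

A purpose-built memory exerciser that covers the address range behind the targeted memory controller would be needed to make the injection fire predictably.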

I also think you might be missing some bits that have to be set to control the injection.
Bits 0:36: EICR_ADDRESS. The error is injected when the read address matches the EICR address, up to fields masked by the EICR region.
    0 = dimm select
    1:2 = mrank(0:1)
    3:5 = srank(0:2)
    6:7 = bank_group(0:1)
    8:10 = bank(0:2)
    11:28 = row(0:17)
    29:36 = col(2:9)
Without those bits set there will never be a match to trigger the inject.
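
The field layout above can be sketched as a small encoder/decoder. This assumes the register is 64 bits wide and uses IBM bit numbering, where bit 0 is the most significant bit; the function and table names are invented for this example and do not come from any existing tool.

```python
# (name, first_bit, last_bit) taken from the list above.
# Assumption: IBM numbering, bit 0 = most significant bit of a 64-bit register.
EICR_ADDRESS_FIELDS = [
    ("dimm", 0, 0),
    ("mrank", 1, 2),
    ("srank", 3, 5),
    ("bank_group", 6, 7),
    ("bank", 8, 10),
    ("row", 11, 28),
    ("col", 29, 36),
]


def decode_eicr_address(value: int) -> dict:
    """Split bits 0:36 of a 64-bit register value into the named fields."""
    fields = {}
    for name, first, last in EICR_ADDRESS_FIELDS:
        width = last - first + 1
        shift = 63 - last  # distance of the field's last bit from the LSB
        fields[name] = (value >> shift) & ((1 << width) - 1)
    return fields


def encode_eicr_address(**fields: int) -> int:
    """Pack named fields into bits 0:36; unspecified fields default to 0."""
    value = 0
    for name, first, last in EICR_ADDRESS_FIELDS:
        width = last - first + 1
        field = fields.get(name, 0)
        if not 0 <= field < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        value |= field << (63 - last)
    return value


if __name__ == "__main__":
    print(decode_eicr_address(0x0000000003AF0000))
```

Under this numbering, every address field of the value posted in this issue, 0x0000000003AF0000, decodes to 0, because the bits that are set all fall within bits 38:47.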

@Amay258
Author

Amay258 commented May 11, 2023

putscom - c 0x0 0x07010A0D 0x00000000003AF0000: this command injects an error for CPU0_C0D0.
putscom - c 0x8 0x07010A0D 0x00000000003AF0000: this command injects an error for CPU1_C0D0.

After executing the error injection command, the normal behavior is that a checkstop is triggered immediately and the injection succeeds. Currently, however, it sometimes takes 20 minutes after the command is executed before the checkstop is triggered and the injection succeeds. Why do we need to wait 20 minutes?

@dcrowell77
Collaborator

What do you mean by "the normal situation"? Have you seen other behavior with this specific injection? I still believe that it won't fail until the memory is physically accessed, which is non-deterministic.
