Integrate the latest merged changes #1

AvaNaik · 2023-11-21T19:41:23Z

Integrate the latest merged changes

When kbuf data is broken, kbuffer_next_event() may move kbuf->index back to the current kbuf->index position, causing an infinite loop. In this situation, rasdaemon repeatedly parses an invalid event and prints warnings like "ug! negative record size -8!", pushing CPU utilization to 100%. To fix this, discard the current page when kbuf data is broken and continue reading the next page. Signed-off-by: hubin <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

When building rasdaemon with autoreconf, certain distros report the following error message: "Makefile.am: error: required file './README' not found". Autoreconf looks for a README file instead of README.md. Fix this by passing 'foreign' to AM_INIT_AUTOMAKE. Signed-off-by: Ayush Jain <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

…istd.h The return type of the read/write functions from unistd.h is ssize_t. It is a signed type, and the functions return -1 on error. Fix the incorrect use in the function read_ras_event_all_cpus(). Also, move the setting of buffer_percent into a separate function. Fixes: 94750bcf9309 ("rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely") Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
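The ssize_t point can be illustrated with a minimal sketch. The helper `read_checked` and its calling convention are hypothetical and not taken from rasdaemon; the sketch only shows why the result of read() should be held in a signed variable and tested for a negative value before it is treated as a byte count.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not rasdaemon code): read up to len bytes from fd.
 * Returns 0 and stores the byte count in *out on success, -1 on error.
 */
int read_checked(int fd, void *buf, size_t len, size_t *out)
{
    ssize_t n;  /* signed, so the -1 error return is representable */

    assert(buf != NULL && out != NULL);

    /* Retry when a signal interrupts the call. */
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    if (n < 0)
        return -1;

    /* Only convert to an unsigned count after the error check. */
    *out = (size_t)n;
    return 0;
}
```

Storing the result in an unsigned variable instead would turn -1 into a very large count, and a check such as `n < 0` could never be true.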

…move redundant header file 1. The return value of ARRAY_SIZE() is an unsigned integer, so it is not correct to compare it with a signed integer. This patch fixes those comparisons. 2. Remove a redundant header file and adjust the order of the header files. Signed-off-by: Xiaofei Tan <[email protected]> Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
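A short sketch of the signed/unsigned issue, under stated assumptions: the table `event_names` and the function `index_of_name` are made up for illustration, and `ARRAY_SIZE` is written in its common sizeof-based form, which yields a size_t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Common sizeof-based form; the result has type size_t (unsigned). */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical table, for illustration only. */
static const char * const event_names[] = { "generic", "dram", "memory_module" };

/* Return the index of name in event_names, or -1 if it is not present. */
int index_of_name(const char *name)
{
    size_t i;  /* unsigned index, matching the type of ARRAY_SIZE() */

    assert(name != NULL);
    for (i = 0; i < ARRAY_SIZE(event_names); i++)
        if (strcmp(event_names[i], name) == 0)
            return (int)i;
    return -1;
}
```

With an `int` index, the comparison `i < ARRAY_SIZE(...)` would convert `i` to an unsigned type first, so a negative index would compare as a huge positive value.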

1. Add support for creating/opening the vendor error tables at rasdaemon startup. 2. Update the HiSilicon error handling code accordingly. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

…records to the broken-down time format Add a common function to convert the timestamp in the CXL event records, given in nanoseconds, to the broken-down time format. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
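As a rough illustration of such a conversion, the sketch below turns a nanosecond timestamp into a `struct tm`. It assumes the timestamp counts from the Unix epoch and that UTC is wanted; the commit message specifies neither, and the helper name `ns_to_broken_down` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* for gmtime_r() under strict -std modes */

#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Hypothetical helper: convert nanoseconds since the Unix epoch (assumed)
 * to broken-down UTC time. Returns 0 on success, -1 on failure.
 */
int ns_to_broken_down(uint64_t ts_ns, struct tm *out)
{
    time_t secs = (time_t)(ts_ns / NSEC_PER_SEC);  /* drop the sub-second part */

    assert(out != NULL);
    return gmtime_r(&secs, out) ? 0 : -1;
}
```

Swapping `gmtime_r()` for `localtime_r()` would give local time instead, if that is what the log output is expected to show.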

Add common function to get the timestamp for the event reported. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add support to log and record the CXL overflow events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add support to log and record the CXL generic events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add support to log and record the CXL general media events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add support to log and record the CXL dram events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add support to log and record the CXL memory module events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

When the number of CPUs detected is greater than the number of CPUs in the system, rasdaemon will crash when it receives some events. Looking deeper, we also fail to use the poll method for similar reasons in this case. All of this can be prevented by checking to see how many CPUs are currently online (sysconf(_SC_NPROCESSORS_ONLN)) instead of how many CPUs the current kernel was configured to support (sysconf(_SC_NPROCESSORS_CONF)). For the kernel side of the discussion, see https://lore.kernel.org/lkml/CAM6Wdxft33zLeeXHhmNX5jyJtfGTLiwkQSApc=10fqf+rQh9DA@mail.gmail.com/T/ Signed-off-by: Mauro Carvalho Chehab <[email protected]>
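The difference between the two sysconf() queries can be sketched as follows. The helpers `usable_cpu_count` and `cpu_is_usable` are hypothetical, and falling back to a single CPU when sysconf() fails is a choice made for this sketch only, not something the commit describes.

```c
#include <assert.h>
#include <unistd.h>

/*
 * Hypothetical helper: number of CPUs to set up per-CPU readers for.
 * _SC_NPROCESSORS_ONLN counts CPUs currently online, while
 * _SC_NPROCESSORS_CONF counts CPUs the kernel is configured for,
 * which can be larger.
 */
long usable_cpu_count(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);

    /* sysconf() returns -1 on error; fall back to one CPU (sketch-only choice). */
    if (online < 1)
        online = 1;

    assert(online >= 1);
    return online;
}

/* Hypothetical bounds check for a CPU index taken from an event. */
int cpu_is_usable(long cpu)
{
    return cpu >= 0 && cpu < usable_cpu_count();
}
```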

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

All prints except disk are preceded by a colon Signed-off-by: weidong <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Update and reword some existing SMCA bank type error descriptions to extend SMCA error decoding functionality for modern AMD processors. Additionally, add new error descriptions for missing SMCA bank types. Signed-off-by: Avadhut Naik <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Currently, on AMD systems with Scalable MCA (SMCA), each machine check error of an SMCA bank type has an associated bit position in the bank's control (CTL) register, used for enabling or disabling reporting of that error. An error's bit position in the CTL register is also used during error decoding as an offset into the corresponding bank's error description structure. As new errors are added in newer AMD systems for existing SMCA bank types, the underlying SMCA architecture guarantees that the bit positions of existing errors are not altered. However, on some AMD systems, viz. Genoa, some of the existing bit definitions in the CTL register of the Coherent Slave (CS) SMCA bank type are reassigned without defining a new HWID and McaType. Consequently, the errors whose bit definitions have been reassigned in the CTL register are decoded erroneously. As a solution, create a new software-defined SMCA bank type by utilizing one of the hardware-reserved values for HWID. The new SMCA bank type will only be employed for CS error decoding on affected CPU models. Additionally, since the existing error description structure for the CS SMCA bank type is still valid, add a new error description structure to compensate for the reassigned bit definitions. Signed-off-by: Avadhut Naik <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Currently, rasdaemon performs detailed error decoding of MCA errors received on the system only while it is running, either as a daemon or in the foreground. As such, error decoding cannot be undertaken for any MCA errors received while rasdaemon was not running. Additionally, if error decoding modules like edac_mce_amd have not been loaded either, error records in the dmesg buffer might correspond to raw values in the associated MSRs, compelling users to undertake decoding manually. The scenario seems more plausible on AMD systems with Scalable MCA (SMCA), with plans in place to remove SMCA Extended Error Descriptions from the edac_mce_amd module in an effort to offload SMCA error decoding to rasdaemon. As such, add support to post-process and decode MCA errors received on AMD SMCA systems from raw MSR values. Support for post-processing and decoding of MCA errors received on CPUs of other vendors can be added in the future, as needed. Suggested-by: Yazen Ghannam <[email protected]> Signed-off-by: Avadhut Naik <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add HWID and McaType values for new SMCA bank types and error decoding for those new SMCA banks. Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

On some AMD systems, some of the existing bit definitions in the CTL register of an SMCA bank type are reassigned without defining a new HWID and McaType. Consequently, the errors whose bit definitions have been reassigned in the CTL register are decoded erroneously. Add a new error description structure to compensate for the reassigned bit definitions by creating a new software-defined SMCA bank type that uses the hardware-reserved values for HWID. The new SMCA bank type will only be employed for UMC error decoding on affected models; the existing error description structure for the UMC bank type remains valid. Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Some AMD systems have 4 dies in each socket, and the Die ID indicates whether the error occurred on a CPU die or a GPU die. The respective die is also used for FRU identification. Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

On AMD systems with Scalable MCA (SMCA), the (HWID, MCATYPE) tuple from the MCA_IPID MSR, bits 43:32 and 63:48 respectively, is used for SMCA bank type decoding. On occurrence of an SMCA error, the cached tuples are compared against the tuple read from the MCA_IPID MSR to determine the SMCA bank type. Currently, however, all high 32 bits of the MCA_IPID register are cached in rasdaemon for all SMCA bank types, with bits 47:44, which play no part in bank type decoding, zeroed out. Likewise, when an SMCA error occurs, all high 32 bits of the MCA_IPID register are read and compared against the cached values in the smca_hwid_mcatypes array. This can lead to erroneous bank type decoding since bits 47:44 are not guaranteed to be zero. They are either reserved or, on some modern AMD systems, viz. Genoa, denote the InstanceIdHi value. These bits, therefore, should not be associated with SMCA bank type decoding. Import the HWID_MCATYPE macro from the kernel to ensure that only the relevant fields, i.e. the (HWID, MCATYPE) tuple, are used for SMCA bank type decoding on occurrence of an SMCA error. Signed-off-by: Avadhut Naik <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>
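The field extraction described above can be sketched as follows. The bit ranges come from the commit message; the macro body is assumed to match the kernel's definition, and `ipid_to_key`, `struct bank_entry`, and `lookup_bank` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror the kernel macro named in the commit message. */
#define HWID_MCATYPE(hwid, mcatype) (((hwid) << 16) | (mcatype))

/* Build a lookup key from a full 64-bit MCA_IPID value. */
uint32_t ipid_to_key(uint64_t ipid)
{
    uint32_t hwid = (uint32_t)((ipid >> 32) & 0xFFF);      /* bits 43:32 */
    uint32_t mcatype = (uint32_t)((ipid >> 48) & 0xFFFF);  /* bits 63:48 */

    /* Bits 47:44 (reserved or InstanceIdHi) never reach the key. */
    return HWID_MCATYPE(hwid, mcatype);
}

/* Hypothetical table entry: a bank type name and its cached key. */
struct bank_entry {
    const char *name;
    uint32_t key;
};

/* Return the matching bank type name, or NULL if none matches. */
const char *lookup_bank(const struct bank_entry *tbl, size_t n, uint64_t ipid)
{
    uint32_t key = ipid_to_key(ipid);
    size_t i;

    assert(tbl != NULL);
    for (i = 0; i < n; i++)
        if (tbl[i].key == key)
            return tbl[i].name;
    return NULL;
}
```

Comparing the raw high 32 bits instead would fail to match whenever bits 47:44 are nonzero, which is the failure mode the commit describes.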

It is more reasonable to log non_standard_event in one line, excluding the error dump. That way, you can easily get the decoded non_standard_event log in one line if you implement a decoder like those for other events. Signed-off-by: Ruidong Tian <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add a new non-standard error decoder to decode the THead YiTian error section. Put all related code in a new source file. Signed-off-by: Ruidong Tian <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add support for the THead YiTian DDRC register dump event. Signed-off-by: Ruidong Tian <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Signed-off-by: weidongkl <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Signed-off-by: Delgado Vargas, Daniel <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

…elds Modify the check for valid HiSilicon KunPeng9xx error fields. Fixes an issue where error data is not printed when its value is 0. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Overflows may happen in the `threshold_string` and `cycle_string` arrays. If the PAGE_CE_THRESHOLD value in page isolation is set to 50 bits, there is a risk of array overflow. Because sprintf is an insecure function, use snprintf instead. An error is reported when the AddressSanitizer is used.
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
=================================================================
==221920==ERROR: AddressSanitizer: stack-buffer-overflow on address 0xffffdd91d932 at pc 0xffffa24071c4 bp 0xffffdd91d720 sp 0xffffdd91ced8
WRITE of size 55 at 0xffffdd91d932 thread T0
    #0 0xffffa24071c0 in vsprintf (/usr/lib64/libasan.so.6+0x5c1c0)
    #1 0xffffa24073cc in sprintf (/usr/lib64/libasan.so.6+0x5c3cc)
    #2 0x459558 in parse_env_string /home/rasdaemon/ras-page-isolation.c:185
    #3 0x4596f4 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:202
    #4 0x459934 in ras_page_account_init /home/rasdaemon/ras-page-isolation.c:211
    #5 0x40f700 in handle_ras_events /home/rasdaemon/ras-events.c:902
    #6 0x405b8c in main /home/rasdaemon/rasdaemon.c:211
    #7 0xffffa20b6f38 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    #8 0xffffa20b7004 in __libc_start_main_impl ../csu/libc-start.c:409
    #9 0x4038ec in _start (/home/rasdaemon/rasdaemon+0x4038ec)
Address 0xffffdd91d932 is located in stack of thread T0 at offset 82 in frame
    #0 0x459574 in page_isolation_init /home/rasdaemon/ras-page-isolation.c:190
  This frame has 2 object(s):
    [32, 82) 'threshold_string' (line 191)
    [128, 178) 'cycle_string' (line 192) <== Memory access at offset 82 partially underflows this variable
HINT: this may be a false positive if your program uses some custom stack unwind mechanism, swapcontext or vfork (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/lib64/libasan.so.6+0x5c1c0) in vsprintf
Shadow bytes around the buggy address:
  0x200ffbb23ad0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23ae0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23af0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b10: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
=>0x200ffbb23b20: 00 00 00 00 00 00[02]f2 f2 f2 f2 f2 00 00 00 00
  0x200ffbb23b30: 00 00 02 f3 f3 f3 f3 f3 00 00 00 00 00 00 00 00
  0x200ffbb23b40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x200ffbb23b50: f1 f1 f1 f1 f1 f1 04 f2 00 00 f2 f2 00 00 00 00
  0x200ffbb23b60: 00 00 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 f2
  0x200ffbb23b70: f2 f2 f2 f2 00 00 00 00 00 00 00 00 f2 f2 f2 f2
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07
  Heap left redzone:       fa
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
  Shadow gap:              cc
==221920==ABORTING
Signed-off-by: Mauro Carvalho Chehab <[email protected]>
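The sprintf-to-snprintf change can be illustrated with a minimal sketch. The helper `copy_bounded` is hypothetical and does not reproduce the code in ras-page-isolation.c; it only shows how snprintf() bounds the write and how its return value reveals truncation.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: copy src into dst without writing past dst_len bytes.
 * Returns 0 if the whole string fit, -1 if it was truncated or on error.
 */
int copy_bounded(char *dst, size_t dst_len, const char *src)
{
    int n;

    assert(dst != NULL && src != NULL && dst_len > 0);

    /* snprintf() writes at most dst_len bytes, including the terminating NUL. */
    n = snprintf(dst, dst_len, "%s", src);

    /* The return value is the length that would have been written untruncated. */
    return (n < 0 || (size_t)n >= dst_len) ? -1 : 0;
}
```

For a 50-byte destination like the arrays in the report, a 55-byte write would be cut off at the buffer boundary and reported as -1 instead of overrunning the stack.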

hubin and others added 29 commits October 23, 2023 10:43

rasdaemon: Add common function to get timestamp for the event

7be2edb

Add common function to get the timestamp for the event reported. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add support for the CXL overflow events

f73ed45

Add support to log and record the CXL overflow events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add support for the CXL generic events

e0cde0e

Add support to log and record the CXL generic events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add support for the CXL general media events

53c682f

Add support to log and record the CXL general media events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add support for the CXL dram events

9a2f618

Add support to log and record the CXL dram events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add support for the CXL memory module events

f63b4c9

Add support to log and record the CXL memory module events. Signed-off-by: Shiju Jose <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add label for mainboard: GIGABYTE model MZ62-HD0-00

acf74cd

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add label for mainboard: ASUSTeK COMPUTER INC. Model: Z9PH-D16 Series

4d66a6a

Signed-off-by: Mauro Carvalho Chehab <[email protected]>

add ':' before error output

9bd84ae

All prints except disk are preceded by a colon Signed-off-by: weidong <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types.

1f74a59

Add HWID and McaType values for new SMCA bank types and error decoding for those new SMCA banks. Signed-off-by: Muralidhara M K <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: ras-mc-ctl: Add support to display the THead vendor errors

160adcf

Add support for the THead YiTian DDRC register dump event. Signed-off-by: Ruidong Tian <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

Add a space between "diskerror_event" and "store"

885e546

Signed-off-by: weidongkl <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

rasdaemon: Add Emerald Rapids support

a996299

Signed-off-by: Delgado Vargas, Daniel <[email protected]> Signed-off-by: Mauro Carvalho Chehab <[email protected]>

AvaNaik closed this Nov 21, 2023
