Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x86: Respect disableAVX2/512 options and check XCR0 for OS support #20691

Merged
merged 2 commits into from
Feb 20, 2025

Conversation

BradleyWood
Copy link
Member

@BradleyWood BradleyWood commented Nov 27, 2024

@BradleyWood BradleyWood added depends:omr Pull request is dependent on a corresponding change in OMR comp:jit arch:x86 labels Nov 27, 2024
@BradleyWood BradleyWood force-pushed the disablecpufeature branch 3 times, most recently from ba7cf9a to 7271349 Compare December 6, 2024 17:38
@BradleyWood BradleyWood changed the title x86: use OMR option to disable CPU features x86: Respect disableAVX2/512 options and check XCR0 for OS support Dec 6, 2024
@BradleyWood BradleyWood marked this pull request as ready for review December 11, 2024 01:31
@BradleyWood BradleyWood requested a review from dsouzai as a code owner December 11, 2024 01:31
@BradleyWood
Copy link
Member Author

@0xdaryl See the force-push

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 23, 2025

Jenkins test sanity xlinux,win,osx jdk21

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 24, 2025

Windows nodes are offline. As this PR does have a Windows-specific component to it, can you confirm it's been built and tested on Windows?

@BradleyWood
Copy link
Member Author

@0xdaryl can you relaunch?

@0xdaryl
Copy link
Contributor

0xdaryl commented Jan 28, 2025

Jenkins test sanity.functional,sanity.openjdk win jdk21

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 4, 2025

Just a ping on triaging the Windows CI failures.

@BradleyWood
Copy link
Member Author

I don't think any of the windows tests ran. The only obvious error I see is this:

[2025-01-28T19:51:24.425Z] /bin/sh: -c: line 3: syntax error: unexpected end of file
[2025-01-28T19:51:24.425Z] make[2]: *** [settings.mk:383: setup_testList] Error 1

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 4, 2025

Jenkins test sanity.functional,sanity.openjdk win jdk21

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 5, 2025

I sniffed at the CI failures and see at least one JVM crash with an illegal instruction exception:

https://openj9-jenkins.osuosl.org/job/Test_openjdk21_j9_sanity.openjdk_x86-64_windows_Personal_testList_1/65/consoleFull

@BradleyWood
Copy link
Member Author

Probably issue #21000? All the crashes happen in byte array hashcode with -XX:-UseCompressedOops. I will try to confirm soon...

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 7, 2025

I will try to confirm soon...
Have you confirmed that all the failures are known issues?

@BradleyWood
Copy link
Member Author

@0xdaryl The other issue did not manifest as an illegal instruction. That makes me a little concerned because the issue in #21000 shouldn't cause that. Can you confirm what microarchitecture that test machine has?

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 10, 2025

Can you confirm what microarchitecture that test machine has?

Not off the top of my head. I suggest you request access to the machine to check. Ask @AdamBrousseau how to do that.

@BradleyWood
Copy link
Member Author

@0xdaryl I got a little concerned because because the crash in jenkins would indicate to me that this does not fix the issue. However, the client had indicated that the fix works. I created a linux vm on an AVX-512 machine and modified the kernel to always disable OPMASK, Hi16_ZMM, and ZMM_Hi256 bits in XCR0 register. I was able to reproduce the SIGILL crash easily without this fix, and was unable make it crash with this fix.

As for windows, I didn't get access to that machine yet, but can't think of a reason this PR wouldn't fix the issue on windows server 2012. I looked through the build history of that machine, and I see the classic manifestation of #21000, wrong result, no SIGILL. That makes no sense as windows server 2012 does not support AVX-512 which is required to reproduce that crash. On the other hand, I found this build, which reproduced the SIGILL (which is what I would expect). So at the very least, I don't think this PR actually makes the issue worse...

Can you manually launch a grinder test on win2012x64-openj9-1? I just need to see jitdump.

On another note, I really wish the crash info would dump 16 bytes from [RIP] to the console to make it easier to see what instruction was generated.

@BradleyWood
Copy link
Member Author

@0xdaryl Can you launch tests again?

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 18, 2025

Jenkins test sanity.functional,sanity.openjdk win jdk21

@BradleyWood
Copy link
Member Author

Failure is #21000

@0xdaryl 0xdaryl merged commit 04d2901 into eclipse-openj9:master Feb 20, 2025
10 of 12 checks passed
@mpirvu
Copy link
Contributor

mpirvu commented Feb 20, 2025

As far as I can tell, _xgetbv(unsigned int ecx) is defined only for Windows, but its usage is not protected by an ifdef.

@BradleyWood
Copy link
Member Author

As far as I can tell, _xgetbv(unsigned int ecx) is defined only for Windows, but its usage is not protected by an ifdef.

It is protected by ifdef? With msvc we use intrinsic, otherwise define this function.

#if defined(OMR_OS_WINDOWS) && defined(TR_TARGET_X86)
#include <intrin.h>
#elif defined(TR_TARGET_X86)
inline unsigned long long _xgetbv(unsigned int ecx)
   {
   unsigned int eax, edx;
   __asm__ __volatile__("xgetbv" : "=a"(eax), "=d"(edx) : "c"(ecx));
   return ((unsigned long long)edx << 32) | eax;
   }
#endif

@0xdaryl
Copy link
Contributor

0xdaryl commented Feb 20, 2025

@BradleyWood : see @mpirvu comment above

@dsouzai
Copy link
Contributor

dsouzai commented Feb 20, 2025

As far as I can tell, _xgetbv(unsigned int ecx) is defined only for Windows, but its usage is not protected by an ifdef.

It is protected by ifdef? With msvc we use intrinsic, otherwise define this function.

#if defined(OMR_OS_WINDOWS) && defined(TR_TARGET_X86)
#include <intrin.h>
#elif defined(TR_TARGET_X86)
inline unsigned long long _xgetbv(unsigned int ecx)
   {
   unsigned int eax, edx;
   __asm__ __volatile__("xgetbv" : "=a"(eax), "=d"(edx) : "c"(ecx));
   return ((unsigned long long)edx << 32) | eax;
   }
#endif

That means that _xgetbv is only defined for TR_TARGET_X86, so the uses of _xgetbv also need to be appropriately guarded. Currently there are build failures (eg, see #21158).

@BradleyWood
Copy link
Member Author

I'm going to assume there is a good reason that perform specific code in this file is guarded by if statements and not #ifdef so I will just guard the small block of code. See #21160

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch:x86 comp:jit depends:omr Pull request is dependent on a corresponding change in OMR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants