
[LoongArch64] Handle LASX/LSX/FP context depending on ISA availability. #118007


Open
wants to merge 3 commits into
base: main

Conversation

LuckyXu-HF
Contributor

@LuckyXu-HF LuckyXu-HF commented Jul 24, 2025

@LuckyXu-HF
Contributor Author

Please NOTE:

  • Newer versions of clang for LA64 enable LSX by default (LASX is not enabled by default for now).
  • Therefore, when building the relevant ELF and .a files in the SDK to run on 2K embedded platforms that do not support LSX/LASX, the clang compilation flags -mno-lsx -mno-lasx must also be added to disable LSX/LASX support.

@LuckyXu-HF
Contributor Author

Hi @driver1998, does this solve your problem?
cc @shushanhf

@dotnet-policy-service dotnet-policy-service bot added the community-contribution label Jul 24, 2025
@driver1998

driver1998 commented Jul 24, 2025

cc @xen0n for review.
I have a similar patch locally driver1998/dotnet-nolsx@e608ccb and that works. This should just work I guess.

You may also need to test the 2K1000/2000/3000 series, those have LSX but no LASX.

(I didn't post my patch upstream because I am still waiting for my 2K3000 board to arrive lol)

@driver1998

Although IIRC HWCAP should be preferred over CPUCFG, since there might be situations where the CPU supports LSX/LASX but the kernel disabled them.

see https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567 for example.

@LuckyXu-HF
Contributor Author

cc @xen0n for review. I have a similar patch locally driver1998/dotnet-nolsx@e608ccb and that works. This should just work I guess.

You may also need to test the 2K1000/2000/3000 series, those have LSX but no LASX.

(I didn't post my patch upstream because I am still waiting for my 2K3000 board to arrive lol)

Yes, on the coreclr side this patch handles all three cases: no LSX/LASX, LSX only, and LSX plus LASX.

On the clang side, AFAIK LSX has been enabled by default since clang 19, so with clang >= 19 we need to add the compilation option -mno-lsx for platforms that do not support LSX/LASX.

@LuckyXu-HF
Contributor Author

Although IIRC HWCAP should be preferred over CPUCFG, since there might be situations where the CPU supports LSX/LASX but the kernel disabled them.

see https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567 for example.

Thanks, then we should consider a similar detection approach.
Does .NET have demand in such environments?

@driver1998

Does .NET have demand in such environments?

Not sure, but AFAIK a VM running a non-LASX kernel on something like a 3A6000 is used to simulate a 2K3000 in distro testing.

@am11
Member

am11 commented Jul 24, 2025

Although IIRC HWCAP should be preferred over CPUCFG, since there might be situations where the CPU supports LSX/LASX but the kernel disabled them.

see https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567 for example.

Good point. We can use an approach similar to the one taken for riscv64 in #113676 (see the src/native/minipal/cpufeatures.c changes). You can add both the old and new instructions in separate asm macros and use cpufeatures detection at the call site.

@LuckyXu-HF
Contributor Author

Although IIRC HWCAP should be preferred over CPUCFG, since there might be situations where the CPU supports LSX/LASX but the kernel disabled them.
see https://chromium-review.googlesource.com/c/libyuv/libyuv/+/6772567 for example.

Good point. We can use an approach similar to the one taken for riscv64 in #113676 (see the src/native/minipal/cpufeatures.c changes). You can add both the old and new instructions in separate asm macros and use cpufeatures detection at the call site.

Thanks very much. This approach is a good way to detect CPU features when adding atomic/hwintrinsic/SIMD ISAs, instead of using inline assembly. I will add this in cpufeatures.c.

@jkotas
Member

jkotas commented Jul 24, 2025

The current version of the change looks good to me. Is it ok to merge it?

Thanks very much. This approach is a good way to detect CPU features when adding atomic/hwintrinsic/SIMD ISAs, instead of using inline assembly. I will add this in cpufeatures.c.

This can be done in a follow up.


@xen0n xen0n left a comment


Please properly separate logical changes, and say what you did, e.g. "Handle LASX/LSX/FP context depending on ISA availability" in the commit title. "Fix crash on 2K series SoC" should appear in the commit body because it's providing helpful additional context.

@@ -71,6 +83,76 @@ LOCAL_LABEL(Restore_CONTEXT_FLOATING_POINT):
xvld $xr30, $a0, CONTEXT_FPU_OFFSET + 32*30
xvld $xr31, $a0, CONTEXT_FPU_OFFSET + 32*31

LOCAL_LABEL(Restore_CONTEXT_LSX):

@xen0n xen0n Jul 24, 2025


I believe by not skipping over the following blocks, work is duplicated and important information (upper halves of XRs) is lost, because as it currently stands, the context layout doesn't guarantee a 256-bit XR is stored separately irrespective of LASX/LSX availability.

Example:

  • If LASX is present:
    • CONTEXT_FPU_OFFSET + [0, 15]: xr0 lower half
    • CONTEXT_FPU_OFFSET + [16, 31]: xr0 upper half
  • If only LSX is present:
    • CONTEXT_FPU_OFFSET + [0, 15]: vr0
    • CONTEXT_FPU_OFFSET + [16, 31]: vr1

On LoongArch cores with LASX (LSX is implied by LASX so it's always present in this case), xr0's upper half is NOT xr1, so with this context layout, information is lost by re-loading in the LSX way. Same for vr0's upper half vs f0.

Contributor Author


Yes thanks, I also noticed this yesterday, will update.

Add HWCAP get in minipal_getcpufeatures().

Co-authored-by: driver1998 <[email protected]>
@LuckyXu-HF LuckyXu-HF changed the title from "[LoongArch64] Fix SDK running on 2K series SoC." to "[LoongArch64] Handle LASX/LSX/FP context depending on ISA availability." Jul 25, 2025
@LuckyXu-HF
Contributor Author

LuckyXu-HF commented Jul 25, 2025

If we have to run in the hypothesized environment where the CPU supports LSX/LASX but the kernel has disabled them (#118007 (comment)), then we should call minipal_getcpufeatures to check instead of using the cpucfg instruction. This may affect performance, although getauxval() does not involve a syscall.

diff
diff --git a/src/coreclr/pal/src/arch/loongarch64/context2.S b/src/coreclr/pal/src/arch/loongarch64/context2.S
index 4e747c767d1..64403508453 100644
--- a/src/coreclr/pal/src/arch/loongarch64/context2.S
+++ b/src/coreclr/pal/src/arch/loongarch64/context2.S
@@ -37,13 +37,16 @@ LOCAL_LABEL(Restore_CONTEXT_FLOATING_POINT):
     andi $t1, $r21, (1 << CONTEXT_FLOATING_POINT_BIT)
     beqz $t1, LOCAL_LABEL(No_Restore_CONTEXT_FLOATING_POINT)
 
-    ori $r21, $zero, 2
-    cpucfg $r21, $r21
-    // CPUCFG.2.LASX[bit7]
-    andi $t1, $r21, 0x80
+    PROLOG_SAVE_REG_PAIR_INDEXED  22, 1, 24
+    st.d  $a0, $sp, 16
+    bl  C_FUNC(minipal_getcpufeatures)
+    ori  $r21, $a0, 0
+    ld.d  $a0, $sp, 16
+    EPILOG_RESTORE_REG_PAIR_INDEXED  22, 1, 24
+
+    andi $t1, $r21, LoongArch64IntrinsicConstants_LASX
     bnez $t1, LOCAL_LABEL(Restore_CONTEXT_LASX)
-    // CPUCFG.2.LSX[bit6]
-    andi $t1, $r21, 0x40
+    andi $t1, $r21, LoongArch64IntrinsicConstants_LSX
     bnez $t1, LOCAL_LABEL(Restore_CONTEXT_LSX)
 
     // 64-bits FPU. Not Support LSX/LASX
@@ -300,13 +303,16 @@ LOCAL_LABEL(Done_CONTEXT_INTEGER):
     andi  $t3, $t1, (1 << CONTEXT_FLOATING_POINT_BIT)
     beqz  $t3, LOCAL_LABEL(Done_CONTEXT_FLOATING_POINT)
 
-    ori $t1, $zero, 2
-    cpucfg $t1, $t1
-    // CPUCFG.2.LASX[bit7]
-    andi $t3, $t1, 0x80
+    PROLOG_SAVE_REG_PAIR_INDEXED  22, 1, 24
+    st.d  $a0, $sp, 16
+    bl  C_FUNC(minipal_getcpufeatures)
+    ori  $t1, $a0, 0
+    ld.d  $a0, $sp, 16
+    EPILOG_RESTORE_REG_PAIR_INDEXED  22, 1, 24
+
+    andi $t3, $t1, LoongArch64IntrinsicConstants_LASX
     bnez $t3, LOCAL_LABEL(Store_CONTEXT_LASX)
-    // CPUCFG.2.LSX[bit6]
-    andi $t3, $t1, 0x40
+    andi $t3, $t1, LoongArch64IntrinsicConstants_LSX
     bnez $t3, LOCAL_LABEL(Store_CONTEXT_LSX)
 
     // 64-bits FPU. Not Support LSX/LASX

@driver1998

driver1998 commented Jul 25, 2025

Maybe the value can be cached somewhere? I don't expect it will be changed at runtime.
Or maybe the performance penalty is not actually that high.

Also this can be delayed for later as well, no need to block this PR.

@LuckyXu-HF
Contributor Author

LuckyXu-HF commented Jul 25, 2025

Or maybe the performance penalty is not actually that high.

I just debugged getauxval() and it does not involve a syscall, so I think calling minipal_getcpufeatures here is feasible.

Also this can be delayed for later as well, no need to block this PR.

Agreed. I will also try to find an environment where the hardware supports LSX/LASX but the kernel has disabled them.

@risc-vv

risc-vv commented Jul 25, 2025

@dotnet/samsung Could you please take a look? These changes may be related to riscv64.

@am11
Member

am11 commented Jul 25, 2025

@LuckyXu-HF, I agree with @jkotas that we can keep the cpufeatures changes out until they are actually used in C/C++ code. I see that you have CPU feature detection duplicated in asm, and the new code in cpufeatures.c is never exercised (i.e. minipal_getcpufeatures() is not called and LoongArch64IntrinsicConstants_LSX and _LASX are unused). If you are not planning to use them, let's revert src/native/minipal for now.

Labels
arch-loongarch64 area-PAL-coreclr community-contribution

6 participants