Description
Various PeleLMeX cases fail when compiled with SYCL on Aurora at ALCF. The 3D Hot Bubble, 2D and 3D FlameSheet, and Taylor-Green cases all fail (and most others likely do as well). The only case I've tested that works is 2D Hot Bubble.
Reproducing the error: I’m running on an interactive compute node with the only change to the default environment being:
module load cmake
Clone PeleLMeX and its dependencies:
git clone --recursive https://github.com/AMReX-Combustion/PeleLMeX.git
Go to the HotBubble case:
cd PeleLMeX/Exec/RegTests/HotBubble
Build third party libraries (SUNDIALS):
make TPL COMP=intel-llvm USE_SYCL=TRUE
Build PeleLMeX:
make -j DIM=3 COMP=intel-llvm USE_SYCL=TRUE
Try to run:
mpiexec -n 2 ./PeleLMeX3d.sycl.MPI.ex input.3d-regt
I get failures that look like:
x4219c5s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 died from signal 11
The error can also be reproduced by running the code in evaluate mode, which just initializes the data and then runs one of the affected kernels (saving the results to a plot file if it gets that far):
mpiexec -n 2 ./PeleLMeX3d.sycl.MPI.ex input.3d-regt peleLM.run_mode=evaluate peleLM.evaluate_vars="velForce density"
Debugging Progress
The first error has been isolated to this line in the makeVelForces kernel:
amrex::Real rho_lcl = 0.0;
if (is_incomp != 0) {
  rho_lcl = rho_incomp;
} else {
  rho_lcl = rho(i, j, k); // this line fails
}
In the cases being run, is_incomp is always 0, but its value is determined at runtime during initialization. The code works as expected if is_incomp is hard-coded to 0/false. Removing the conditional and just setting rho_lcl = rho(i, j, k); also works, as does leaving the conditional in place but making a trivial modification like rho_lcl = rho(i, j, k) + 1e-15;. Defining is_incomp as something that is always false but is still evaluated at runtime, like (i < -10000), does not work.
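For reference, the four variants described above can be written side by side. This is only a host-side C++ sketch, not the PeleLMeX kernel: Field3D is a hypothetical stand-in for the AMReX array type, the function names are made up for illustration, and nothing here runs on a SYCL device, so it is not expected to reproduce the crash. It may still be useful as a starting point for a reduced test case, since on the host all four variants should agree when is_incomp is 0.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Real = double; // stand-in for amrex::Real (assumption)

// Hypothetical stand-in for the array accessed as rho(i, j, k) in the kernel.
struct Field3D {
    int nx, ny, nz;
    std::vector<Real> data;
    Field3D(int nx_, int ny_, int nz_, Real v)
        : nx(nx_), ny(ny_), nz(nz_),
          data(static_cast<std::size_t>(nx_) * ny_ * nz_, v) {}
    Real operator()(int i, int j, int k) const {
        return data[(static_cast<std::size_t>(k) * ny + j) * nx + i];
    }
};

// Variant A: the original pattern (reported above to fail on Aurora).
inline Real rho_original(int is_incomp, Real rho_incomp,
                         const Field3D& rho, int i, int j, int k) {
    Real rho_lcl = 0.0;
    if (is_incomp != 0) {
        rho_lcl = rho_incomp;
    } else {
        rho_lcl = rho(i, j, k);
    }
    return rho_lcl;
}

// Variant B: conditional removed (reported above to work).
inline Real rho_unconditional(const Field3D& rho, int i, int j, int k) {
    return rho(i, j, k);
}

// Variant C: conditional kept, load trivially perturbed (reported above to work).
inline Real rho_perturbed(int is_incomp, Real rho_incomp,
                          const Field3D& rho, int i, int j, int k) {
    Real rho_lcl = 0.0;
    if (is_incomp != 0) {
        rho_lcl = rho_incomp;
    } else {
        rho_lcl = rho(i, j, k) + 1e-15;
    }
    return rho_lcl;
}

// Variant D: flag replaced by an always-false predicate that is still
// evaluated at runtime (reported above to fail).
inline Real rho_runtime_false(Real rho_incomp,
                              const Field3D& rho, int i, int j, int k) {
    const bool is_incomp = (i < -10000);
    Real rho_lcl = 0.0;
    if (is_incomp) {
        rho_lcl = rho_incomp;
    } else {
        rho_lcl = rho(i, j, k);
    }
    return rho_lcl;
}

// Host-side sanity check: with is_incomp == 0 every variant should return
// the stored density (variant C only up to the 1e-15 perturbation).
inline void check_variants_on_host() {
    const Field3D rho(4, 4, 4, 1.25);
    const int is_incomp = 0;
    const Real rho_incomp = 2.0;
    for (int k = 0; k < rho.nz; ++k) {
        for (int j = 0; j < rho.ny; ++j) {
            for (int i = 0; i < rho.nx; ++i) {
                const Real ref = rho(i, j, k);
                assert(rho_original(is_incomp, rho_incomp, rho, i, j, k) == ref);
                assert(rho_unconditional(rho, i, j, k) == ref);
                assert(std::fabs(rho_perturbed(is_incomp, rho_incomp, rho, i, j, k) - ref) < 1e-12);
                assert(rho_runtime_false(rho_incomp, rho, i, j, k) == ref);
            }
        }
    }
}
```

Calling check_variants_on_host() from a small main() should pass silently on the host. Moving the four function bodies into a device kernel would be the next step toward a standalone reproducer, but that step is not attempted here.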
The same general code pattern appears in several places, and if this one is modified so it doesn't fail, the code fails at later instances, such as here:
Real soverrho =
  (incompressible) != 0 ? a_dt / rho : a_dt / rho_arr(i, j, k);