Code failures with SYCL on Aurora #552

@baperry2

Various PeleLMeX cases fail when compiled with SYCL on Aurora at ALCF: 3D Hot Bubble, 2D and 3D FlameSheet, and Taylor-Green all fail (and most others likely do too). The only case I've tested that works is 2D Hot Bubble.

Reproducing the error: I'm running on an interactive compute node, and the only change to the default environment is:
module load cmake
Clone PeleLMeX and its dependencies:
git clone --recursive https://github.com/AMReX-Combustion/PeleLMeX.git
Go to the HotBubble case:
cd PeleLMeX/Exec/RegTests/HotBubble
Build third party libraries (SUNDIALS):
make TPL COMP=intel-llvm USE_SYCL=TRUE
Build PeleLMeX:
make -j DIM=3 COMP=intel-llvm USE_SYCL=TRUE
Try to run:
mpiexec -n 2 ./PeleLMeX3d.sycl.MPI.ex input.3d-regt
I get failures that look like:
x4219c5s3b0n0.hsn.cm.aurora.alcf.anl.gov: rank 0 died from signal 11
The error can also be reproduced by running the code in evaluate mode, which just initializes the data and then runs one of the affected kernels (and saves the results to a plot file if it gets that far):
mpiexec -n 2 ./PeleLMeX3d.sycl.MPI.ex input.3d-regt peleLM.run_mode=evaluate peleLM.evaluate_vars="velForce density"

Debugging Progress
The first error has been isolated to this line in the makeVelForces kernel:

  amrex::Real rho_lcl = 0.0;
  if (is_incomp != 0) {
    rho_lcl = rho_incomp;
  } else {
    rho_lcl = rho(i, j, k); // this line fails
  }

In the cases being run, is_incomp is always 0, but its value is determined at runtime during initialization. The code works as expected if is_incomp is hard-coded to 0/false. Removing the conditional and just setting rho_lcl = rho(i, j, k); also works, as does leaving the conditional in place but making a trivial modification like rho_lcl = rho(i, j, k) + 1e-15;. Replacing is_incomp with a condition that is always false but is evaluated at runtime, like (i < -10000), still fails.
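
To make the four variants easier to compare, here is a host-only sketch in plain C++ (no SYCL or AMReX). It does not reproduce the device failure; it only shows that the variants are semantically equivalent when is_incomp is 0, which is why the differing behavior on Aurora is surprising. The function names, the `Real` alias, and the flat `std::vector` standing in for the `rho(i, j, k)` array are all assumptions made for illustration, not PeleLMeX code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Real = double;  // stand-in for amrex::Real (assumed double precision)

// Variant A: the pattern from the issue, reported to fail on device.
Real rho_branch(int is_incomp, Real rho_incomp,
                const std::vector<Real>& rho, std::size_t idx) {
  Real rho_lcl = 0.0;
  if (is_incomp != 0) {
    rho_lcl = rho_incomp;
  } else {
    rho_lcl = rho[idx];
  }
  return rho_lcl;
}

// Variant B: conditional removed, reported to work.
Real rho_no_branch(const std::vector<Real>& rho, std::size_t idx) {
  return rho[idx];
}

// Variant C: conditional kept, load trivially perturbed, reported to work.
Real rho_perturbed(int is_incomp, Real rho_incomp,
                   const std::vector<Real>& rho, std::size_t idx) {
  Real rho_lcl = 0.0;
  if (is_incomp != 0) {
    rho_lcl = rho_incomp;
  } else {
    rho_lcl = rho[idx] + 1e-15;
  }
  return rho_lcl;
}

// Variant D: always-false condition evaluated at runtime, reported to fail.
Real rho_runtime_false(int i, Real rho_incomp,
                       const std::vector<Real>& rho, std::size_t idx) {
  Real rho_lcl = 0.0;
  if (i < -10000) {
    rho_lcl = rho_incomp;
  } else {
    rho_lcl = rho[idx];
  }
  return rho_lcl;
}

// Host-side check: with is_incomp == 0, every variant returns the array value
// (variant C only up to its deliberate 1e-15 perturbation).
bool variants_agree_on_host() {
  const std::vector<Real> rho = {1.0, 1.17, 0.25};  // made-up densities
  const Real rho_incomp = 2.0;                      // made-up constant density
  for (std::size_t idx = 0; idx < rho.size(); ++idx) {
    const Real ref = rho_no_branch(rho, idx);
    if (rho_branch(0, rho_incomp, rho, idx) != ref) return false;
    if (rho_runtime_false(static_cast<int>(idx), rho_incomp, rho, idx) != ref)
      return false;
    if (std::abs(rho_perturbed(0, rho_incomp, rho, idx) - ref) > 1e-12)
      return false;
  }
  return true;
}
```

On the host, `variants_agree_on_host()` returns true, so any difference between the variants on the device would have to come from how the branch is compiled or executed there rather than from the source-level logic.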

The same general code pattern appears in several places, and if this instance is modified so that it doesn't fail, the code fails at later instances, such as here:

  Real soverrho =
    (incompressible) != 0 ? a_dt / rho : a_dt / rho_arr(i, j, k);
