Running a parallel GPU ClimaOcean/Oceananigans simulation on Australia's HPC - Gadi #74
taimoorsohail started this conversation in Show and tell
-
@taimoorsohail I think the adding-packages section is missing the activation of the environment in which to add the packages. Should it be:

```
julia> ]                # press ] to enter package-manager mode
(@v1.x) pkg> activate . # create a project in the current directory
(project-name) pkg> add CUDA, MPI, ClimaOcean, Oceananigans, Dates, CFTime, Printf
```

Otherwise there is this error:

```
julia> st
ERROR: UndefVarError: `st` not defined in `Main`
Suggestion: check for spelling errors or missing imports.
```

The only reason I put all the packages on one line above (I assume this still works) is that it would be easier to copy and paste!
-
Question: what's the …
-
This discussion thread provides step-by-step instructions on running a parallel GPU simulation of the ClimaOcean model on Australia's Gadi HPC.
GPU nodes are available on the `gpuvolta` queue on Gadi (there is currently no express/normal split). These instructions specify whether each command should be run on the `login` node or on an `interactive` GPU node. The GPUs do not have internet access, so package management, for example, needs to be done on the `login` node.

On the `login` node, load the following modules (these can be added to your `.bashrc` if you want them loaded at every login). A sketch follows; the module versions are assumptions, so check `module avail` for what is installed:
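```
# load the system MPI and CUDA toolchain; openmpi/4.1.7 matches the
# GNU install referenced below
module load openmpi/4.1.7
module load cuda     # pick a specific version if the default does not suit
```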
We then want to ensure that the MPI library Julia calls is the system default. Navigate to the folder where your parallel simulation code will sit, and run in the command line:
```
julia --project -e 'using Pkg; Pkg.add("MPIPreferences"); using MPIPreferences; MPIPreferences.use_system_binary()'
```
Unfortunately, `NCDatasets.jl` tries to load the default Julia MPI, which causes issues on Gadi. To fix this, navigate to the `~/.julia/artifacts/<hash>/lib` folder (this lives under whatever you specified as your `$JULIA_DEPOT_PATH`). Here, we remove the `libmpi_mpifh.so`, `libmpi_mpifh.so.40`, and `libmpi_mpifh.so.40.40.1` files and symbolically link those names to the default files in the `openmpi/4.1.7` GNU install. A sketch, assuming the system libraries live under `/apps/openmpi/4.1.7/lib` (verify the path and the artifact hash on your system):
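```
# run inside ~/.julia/artifacts/<hash>/lib; the /apps/openmpi/4.1.7/lib
# path is an assumption -- point it at wherever the GNU openmpi install lives
SYSLIB=/apps/openmpi/4.1.7/lib/libmpi_mpifh.so
for lib in libmpi_mpifh.so libmpi_mpifh.so.40 libmpi_mpifh.so.40.40.1; do
    rm "$lib" && ln -s "$SYSLIB" "$lib"
done
```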
The above is a workaround; ideally, `NCDatasets.jl` would allow us to specify the path of our system MPI. This may happen in the future, rendering the workaround unnecessary.

Now, still on the `login` node, navigate to the folder where your parallel simulation code will sit and add the Julia packages needed to run the simulation. For a simple one-degree ClimaOcean simulation, use:
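```
# add the packages used by the example script below (one-off, on the login node)
julia --project -e 'using Pkg; Pkg.add(["CUDA", "MPI", "ClimaOcean", "Oceananigans", "Dates", "CFTime", "Printf"])'
```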
The above should load without a hitch. Next, add the relevant lines to your `.bashrc` file to ensure that you are using a single thread, among other things. A sketch of plausible settings (the exact variables you need may differ):
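```
# run Julia and BLAS single-threaded (the variable choices here are assumptions)
export JULIA_NUM_THREADS=1
export OMP_NUM_THREADS=1
# optionally relocate the Julia depot, e.g. to scratch:
# export JULIA_DEPOT_PATH=/scratch/<project>/<user>/.julia
```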
Next, save a new Julia script, say `aquaplanet.jl`, in your simulation folder; it should run a simple, one-degree aquaplanet forced by a JRA55-do RYF atmosphere. A minimal sketch follows: the constructor names track ClimaOcean's documented interface, but the grid size, time step, and stop time are placeholders, so check the ClimaOcean examples against the version you installed:
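```julia
# aquaplanet.jl -- a minimal sketch, not a verbatim ClimaOcean example;
# verify constructor names against your installed ClimaOcean version
using MPI
MPI.Init()

using CUDA
using Oceananigans
using Oceananigans.Units
using ClimaOcean
using Dates, CFTime, Printf

# one GPU per MPI rank
arch = Distributed(GPU())

# one-degree aquaplanet grid (sizes and extents are placeholders)
grid = LatitudeLongitudeGrid(arch;
                             size = (360, 160, 40),
                             longitude = (0, 360),
                             latitude = (-80, 80),
                             z = (-4000, 0),
                             halo = (7, 7, 7))

ocean = ocean_simulation(grid)

# JRA55-do repeat-year forcing and radiation
atmosphere = JRA55PrescribedAtmosphere(arch)
radiation = Radiation(arch)

coupled_model = OceanSeaIceModel(ocean; atmosphere, radiation)
simulation = Simulation(coupled_model; Δt=10minutes, stop_time=2days)

run!(simulation)
```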
Finally, start an interactive, multi-node GPU job on Gadi:
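The PBS request below is a sketch: the project code, resource sizes, and any `storage` flags are assumptions to adjust for your allocation (on `gpuvolta`, `ncpus` must be 12 times `ngpus`, so two nodes means `ngpus=8`, `ncpus=96`):

```
qsub -I -q gpuvolta -P <project> -l walltime=01:00:00,ncpus=96,ngpus=8,mem=760GB,jobfs=100GB,wd
```

Then, navigate to the folder where you want to run your simulation, and type (one MPI rank per GPU):

```
mpirun -np 8 julia --project aquaplanet.jl
```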
This code should now run a parallel aquaplanet on Gadi! Note that this is just a testing example; we expect it to NaN within a few time steps, but if it runs at all, your MPI configuration is working.