Hi all,

Similar to #748, this is also a question.

I have an HPC system with the following configuration: an AMD Ryzen Threadripper PRO 5975WX (32 cores) and two NVIDIA RTX A5500 GPUs.

In issue #748 I mentioned that I am working on a coupled SPH-DEM solver and have so far implemented the SPH solver (still has bugs) and the DEM solver (done) independently; they are not coupled yet. I ran settling_of_bodies_in_tank.cpp on both parallel CPU cores and on the GPU. Here are the run times.
I used the following command:

```shell
time ./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200
```

This runs 1000 rigid bodies for a total time of $0.1$ seconds, i.e. $1800$ steps. The 1000 rigid bodies result in $127436$ particles.
The total time taken is:

| OpenMP | CUDA |
| --- | --- |
| 9.17 seconds | 9.8 seconds |
However, the ExaMPM code developed by the Cabana developers is much faster on the GPU than on parallel CPU cores. For comparison, I ran the DamBreak example with the following commands:

```shell
# CPU (OpenMP)
time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 OpenMP
# GPU (CUDA)
time ./examples/DamBreak 0.05 2 0 0.001 1.0 10 CUDA
```
I get the following run times:

| OpenMP | CUDA |
| --- | --- |
| 33 seconds | 0.98 seconds |
The CUDA run is almost 30 times faster than the CPU run. Unfortunately, I cannot get similar numbers for my own code.

I believe I followed best practices, and I am not sure why I am missing this performance boost. Could it be that I am using two AoSoAs, or is it something else? Is there a way to debug this performance issue? I am almost done with both the SPH and DEM codes; only the coupling remains to be added. Can you please help me with this?

Thank you so much. I will provide any additional information as needed.
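Not part of the original thread, but one concrete way to debug a CPU-vs-GPU performance gap in a Kokkos/Cabana application is the Kokkos Tools simple kernel timer, which reports time spent per kernel without changing the source. The install paths below are placeholders for wherever kokkos-tools was built, and the exact library/reader names may differ slightly between kokkos-tools versions:

```shell
# Point Kokkos at the simple kernel timer plugin (path is a placeholder;
# build it from the kokkos/kokkos-tools repository first).
export KOKKOS_TOOLS_LIBS=$HOME/kokkos-tools/install/lib64/libkp_kernel_timer.so

# Run the example as usual; the plugin writes a <hostname>-<pid>.dat file.
./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200

# Summarize per-kernel times with the reader shipped alongside the plugin.
$HOME/kokkos-tools/install/bin/kp_reader *.dat

# For the CUDA build, NVIDIA Nsight Systems additionally shows the
# kernel/memcpy timeline, which exposes hidden host-device transfers.
nsys profile -o settling ./examples/03RBBodiesSettling 0.1 1.0 1.0 1000 0.1 200
```

Comparing the per-kernel breakdown from the OpenMP and CUDA builds usually points directly at the section (neighbor search, force kernel, communication, allocation) that fails to speed up on the GPU.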
The text was updated successfully, but these errors were encountered:
100k particles is often the break-even point for performance comparisons between a CPU node and a single GPU. It's generally (but not necessarily) enough work to fully utilize one GPU, let alone two.

You can certainly still get performance improvements, particularly from avoiding memory allocation (as in the previous issue), communication, etc. The important question is: what is the timing breakdown on the CPU and on the GPU? It's very likely the code spends very different amounts of time in different sections on each architecture.
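The per-section breakdown suggested above can also be collected by hand with plain wall-clock timers around each solver phase. The sketch below is a minimal, hypothetical helper (not part of Cabana); when timing actual Kokkos kernels, a `Kokkos::fence()` inside the timed region is needed so asynchronous GPU kernels have finished before the clock stops:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Minimal wall-clock timer for one solver phase (hypothetical helper).
// Wrap each phase -- neighbor search, force kernel, integration -- and
// compare the resulting breakdown between the OpenMP and CUDA builds.
template <class F>
double time_ms(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();  // For Kokkos kernels, call Kokkos::fence() at the end of f.
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Example: time a stand-in "kernel" and report its cost.
double demo_breakdown() {
    std::vector<double> x(1 << 20, 1.0);
    double sum = 0.0;
    double ms = time_ms([&] {
        for (double v : x) sum += v;
    });
    std::printf("dummy kernel: %.3f ms (sum = %.0f)\n", ms, sum);
    return ms;
}
```

If one phase dominates on the GPU but not on the CPU (or vice versa), that phase is where to focus optimization effort.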
Thanks for the input, Sam. I will leave this open and continue with the rest of the code development. I will post an update once I complete profiling of the code on both architectures.
streeve changed the title from "Code takes same time to run on both GPU and on parallel CPU cores (No performance increase)." to "Cabana-based code takes same time to run on both GPU and on parallel CPU cores" on Apr 30, 2024.