Skip to content

Commit a7afab6

Browse files
Merge pull request #122 from SciML/cuda
[WIP] Setup DiffEqGPU integration
2 parents 9bcc75e + d0c5ab1 commit a7afab6

File tree

7 files changed

+291
-2
lines changed

7 files changed

+291
-2
lines changed

README.md

Lines changed: 264 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -456,6 +456,270 @@ Notice that the solver accurately is able to simulate the kink (discontinuity)
456456
at `t=20` due to the discontinuity of the derivative at the initial time point!
457457
This is why declaring discontinuities can enhance the solver accuracy.
458458

459+
## GPU-Accelerated ODE Solving of Ensembles
460+
461+
In many cases one is interested in solving the same ODE many times over many
462+
different initial conditions and parameters. In diffeqpy parlance this is called
463+
an ensemble solve. diffeqpy inherits the parallelism tools of the
464+
[SciML ecosystem](https://sciml.ai/) that are used for things like
465+
[automated equation discovery and acceleration](https://arxiv.org/abs/2001.04385).
466+
Here we will demonstrate using these parallel tools to accelerate the solving
467+
of an ensemble.
468+
469+
First, let's define the JIT-accelerated Lorenz equation like before:
470+
471+
```py
472+
from diffeqpy import de
473+
474+
def f(u,p,t):
475+
x, y, z = u
476+
sigma, rho, beta = p
477+
return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]
478+
479+
u0 = [1.0,0.0,0.0]
480+
tspan = (0., 100.)
481+
p = [10.0,28.0,8/3]
482+
prob = de.ODEProblem(f, u0, tspan, p)
483+
fast_prob = de.jit32(prob)
484+
sol = de.solve(fast_prob,saveat=0.01)
485+
```
486+
487+
Note that here we used `de.jit32` to JIT-compile the problem into a `Float32` form in order to make it more
488+
efficient on most GPUs.
489+
490+
Now we use the `EnsembleProblem` as defined on the
491+
[ensemble parallelism page of the documentation](https://diffeq.sciml.ai/stable/features/ensemble/):
492+
Let's build an ensemble by utilizing uniform random numbers to randomize the
493+
initial conditions and parameters:
494+
495+
```py
496+
import random
497+
def prob_func(prob,i,rep):
498+
de.remake(prob,u0=[random.uniform(0, 1)*u0[i] for i in range(0,3)],
499+
p=[random.uniform(0, 1)*p[i] for i in range(0,3)])
500+
501+
ensembleprob = de.EnsembleProblem(fast_prob, safetycopy=False)
502+
```
503+
504+
Now we solve the ensemble in serial:
505+
506+
```py
507+
sol = de.solve(ensembleprob,de.Tsit5(),de.EnsembleSerial(),trajectories=10000,saveat=0.01)
508+
```
509+
510+
To add GPUs to the mix, we need to bring in [DiffEqGPU](https://github.com/SciML/DiffEqGPU.jl).
511+
The `cuda` submodule will install CUDA for you and bring all of the bindings into the returned object:
512+
513+
```py
514+
from diffeqpy import cuda
515+
```
516+
517+
#### Note: `from diffeqpy import cuda` can take awhile to run the first time as it installs the drivers!
518+
519+
Now we simply use `EnsembleGPUKernel(cuda.CUDABackend())` with a
520+
GPU-specialized ODE solver `cuda.GPUTsit5()` to solve 10,000 ODEs on the GPU in
521+
parallel:
522+
523+
```py
524+
sol = de.solve(ensembleprob,cuda.GPUTsit5(),cuda.EnsembleGPUKernel(CUDABackend()),trajectories=10000,saveat=0.01)
525+
```
526+
527+
For the full list of choices for specialized GPU solvers, see
528+
[the DiffEqGPU.jl documentation](https://docs.sciml.ai/DiffEqGPU/stable/manual/ensemblegpukernel/).
529+
530+
Note that `EnsembleGPUArray` can be used as well, like:
531+
532+
```py
533+
sol = de.solve(ensembleprob,de.Tsit5(),cuda.EnsembleGPUArray(CUDABackend()),trajectories=10000,saveat=0.01)
534+
```
535+
536+
though we highly recommend the `EnsembleGPUKernel` methods for more speed. Given
537+
the way the JIT compilation performed will also ensure that the faster kernel
538+
generation methods work, `EnsembleGPUKernel` is almost certainly the
539+
better choice in most applications.
540+
541+
### Benchmark
542+
543+
To see how much of an effect the parallelism has, let's test this against R's
544+
deSolve package. This is exactly the same problem as the documentation example
545+
for deSolve, so let's copy that verbatim and then add a function to do the
546+
ensemble generation:
547+
548+
```py
549+
import numpy as np
550+
from scipy.integrate import odeint
551+
552+
def lorenz(state, t, sigma, beta, rho):
553+
x, y, z = state
554+
555+
dx = sigma * (y - x)
556+
dy = x * (rho - z) - y
557+
dz = x * y - beta * z
558+
559+
return [dx, dy, dz]
560+
561+
sigma = 10.0
562+
beta = 8.0 / 3.0
563+
rho = 28.0
564+
p = (sigma, beta, rho)
565+
y0 = [1.0, 1.0, 1.0]
566+
567+
t = np.arange(0.0, 100.0, 0.01)
568+
result = odeint(lorenz, y0, t, p)
569+
```
570+
571+
Using `lapply` to generate the ensemble we get:
572+
573+
```py
574+
import timeit
575+
def time_func():
576+
for itr in range(1, 1001):
577+
result = odeint(lorenz, y0, t, p)
578+
579+
timeit.Timer(time_func).timeit(number=1)
580+
581+
# 38.08861699999761 seconds
582+
```
583+
584+
Now let's see how the JIT-accelerated serial Julia version stacks up against that:
585+
586+
```py
587+
def time_func():
588+
sol = de.solve(ensembleprob,de.Tsit5(),de.EnsembleSerial(),trajectories=1000,saveat=0.01)
589+
590+
timeit.Timer(time_func).timeit(number=1)
591+
592+
# 3.1903300999983912
593+
```
594+
595+
Julia is already about 12x faster than the pure Python solvers here! Now let's add
596+
GPU-acceleration to the mix:
597+
598+
```py
599+
def time_func():
600+
sol = de.solve(ensembleprob,cuda.GPUTsit5(),cuda.EnsembleGPUKernel(CUDABackend()),trajectories=1000,saveat=0.01)
601+
602+
timeit.Timer(time_func).timeit(number=1)
603+
604+
# 0.013322799997695256
605+
```
606+
607+
Already 2900x faster than SciPy! But the GPU acceleration is made for massively
608+
parallel problems, so let's up the trajectories a bit. We will not use more
609+
trajectories from R because that would take too much computing power, so let's
610+
see what happens to the Julia serial and GPU at 10,000 trajectories:
611+
612+
```py
613+
def time_func():
614+
sol = de.solve(ensembleprob,de.Tsit5(),de.EnsembleSerial(),trajectories=10000,saveat=0.01)
615+
616+
timeit.Timer(time_func).timeit(number=1)
617+
618+
# 68.80795999999827
619+
```
620+
621+
```py
622+
def time_func():
623+
sol = de.solve(ensembleprob,cuda.GPUTsit5(),cuda.EnsembleGPUKernel(CUDABackend()),trajectories=10000,saveat=0.01)
624+
625+
timeit.Timer(time_func).timeit(number=1)
626+
627+
# 0.10774460000175168
628+
```
629+
630+
To compare this to the pure Julia code:
631+
632+
```julia
633+
using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays
634+
function lorenz(u, p, t)
635+
σ = p[1]
636+
ρ = p[2]
637+
β = p[3]
638+
du1 = σ * (u[2] - u[1])
639+
du2 = u[1] *- u[3]) - u[2]
640+
du3 = u[1] * u[2] - β * u[3]
641+
return SVector{3}(du1, du2, du3)
642+
end
643+
644+
u0 = SA[1.0f0; 0.0f0; 0.0f0]
645+
tspan = (0.0f0, 10.0f0)
646+
p = SA[10.0f0, 28.0f0, 8 / 3.0f0]
647+
prob = ODEProblem{false}(lorenz, u0, tspan, p)
648+
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
649+
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
650+
@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
651+
trajectories = 10_000,
652+
saveat = 0.01);
653+
654+
# 0.014481 seconds (257.64 k allocations: 13.130 MiB)
655+
```
656+
657+
which is about an order of magnitude faster for computing 10,000 trajectories,
658+
note that the major factors are that we cannot define 32-bit floating point values
659+
from Python and the `prob_func` for generating the initial conditions and parameters
660+
is a major bottleneck since this function is written in Python.
661+
662+
To see how this scales in Julia, let's take it to insane heights. First, let's
663+
reduce the amount we're saving:
664+
665+
```julia
666+
@time sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),trajectories=10_000,saveat=1.0f0)
667+
0.015040 seconds (257.64 k allocations: 13.130 MiB)
668+
```
669+
670+
This highlights that controlling memory pressure is key with GPU usage: you will
671+
get much better performance when requiring less saved points on the GPU.
672+
673+
```julia
674+
@time sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),trajectories=100_000,saveat=1.0f0)
675+
# 0.150901 seconds (2.60 M allocations: 131.576 MiB)
676+
```
677+
678+
compared to serial:
679+
680+
```julia
681+
@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,saveat=1.0f0)
682+
# 22.136743 seconds (16.40 M allocations: 1.628 GiB, 42.98% gc time)
683+
```
684+
685+
And now we start to see that scaling power! Let's solve 1 million trajectories:
686+
687+
```julia
688+
@time sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),trajectories=1_000_000,saveat=1.0f0)
689+
# 1.031295 seconds (3.40 M allocations: 241.075 MiB)
690+
```
691+
692+
For reference, let's look at deSolve with the change to only save that much:
693+
694+
```py
695+
t = np.arange(0.0, 100.0, 1.0)
696+
def time_func():
697+
for itr in range(1, 1001):
698+
result = odeint(lorenz, y0, t, p)
699+
700+
timeit.Timer(time_func).timeit(number=1)
701+
702+
# 37.42609280000033
703+
```
704+
705+
The GPU version is solving 1000x as many trajectories, 37x as fast! So conclusion,
706+
if you need the most speed, you may want to move to the Julia version to get the
707+
most out of your GPU due to Float32's, and when using GPUs make sure it's a problem
708+
with a relatively average or low memory pressure, and these methods will give
709+
orders of magnitude acceleration compared to what you might be used to.
710+
711+
## GPU Backend Choices
712+
713+
Just like DiffEqGPU.jl, diffeqpy supports many different GPU venders. `from diffeqpy import cuda`
714+
is just for cuda, but the following others are supported:
715+
716+
- `from diffeqpy import cuda` with `cuda.CUDABackend` for NVIDIA GPUs via CUDA
717+
- `from diffeqpy import amdgpu` with `amdgpu.AMDGPUBackend` for AMD GPUs
718+
- `from diffeqpy import oneapi` with `oneapi.oneAPIBackend` for Intel's oneAPI GPUs
719+
- `from diffeqpy import metal` with `metal.MetalBackend` for Apple's Metal GPUs (on M-series processors)
720+
721+
For more information, see [the DiffEqGPU.jl documentation](https://docs.sciml.ai/DiffEqGPU/stable/manual/backends/).
722+
459723
## Known Limitations
460724

461725
- Autodiff does not work on Python functions. When applicable, either define the derivative function

diffeqpy/amdgpu.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
import sys
2+
from . import load_julia_packages
3+
amdgpu, _ = load_julia_packages("DiffEqGPU, AMDGPU")
4+
from juliacall import Main
5+
de.AMDGPUBackend = Main.seval("AMDGPU.AMDGPUBackend") # kinda hacky
6+
sys.modules[__name__] = amdgpu # mutate myself

diffeqpy/cuda.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
import sys
2+
from . import load_julia_packages
3+
cuda, _ = load_julia_packages("DiffEqGPU, CUDA")
4+
from juliacall import Main
5+
de.CUDABackend = Main.seval("CUDA.CUDABackend") # kinda hacky
6+
sys.modules[__name__] = cuda # mutate myself

diffeqpy/de.py

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -3,4 +3,5 @@
33
de, _ = load_julia_packages("DifferentialEquations, ModelingToolkit")
44
from juliacall import Main
55
de.jit = Main.seval("jit(x) = typeof(x).name.wrapper(ModelingToolkit.modelingtoolkitize(x), x.u0, x.tspan, x.p)") # kinda hackey
6+
de.jit32 = Main.seval("jit(x) = typeof(x).name.wrapper(ModelingToolkit.modelingtoolkitize(x), Float32.(x.u0), Float32.(x.tspan), Float32.(x.p))") # kinda hackey
67
sys.modules[__name__] = de # mutate myself

diffeqpy/metal.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
import sys
2+
from . import load_julia_packages
3+
metal, _ = load_julia_packages("DiffEqGPU, Metal")
4+
from juliacall import Main
5+
de.MetalBackend = Main.seval("Metal.MetalBackend") # kinda hacky
6+
sys.modules[__name__] = metal # mutate myself

diffeqpy/oneapi.py

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
import sys
2+
from . import load_julia_packages
3+
oneapi, _ = load_julia_packages("DiffEqGPU, oneAPI")
4+
from juliacall import Main
5+
de.oneAPIBackend = Main.seval("oneAPI.oneAPIBackend") # kinda hacky
6+
sys.modules[__name__] = oneapi # mutate myself

setup.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ def readme():
55
return f.read()
66

77
setup(name='diffeqpy',
8-
version='2.1.0',
8+
version='2.2.0',
99
description='Solving Differential Equations in Python',
1010
long_description=readme(),
1111
long_description_content_type="text/markdown",
@@ -19,7 +19,7 @@ def readme():
1919
'Topic :: Scientific/Engineering :: Physics'
2020
],
2121
url='http://github.com/SciML/diffeqpy',
22-
keywords='differential equations stochastic ordinary delay differential-algebraic dae ode sde dde',
22+
keywords='differential equations stochastic ordinary delay differential-algebraic dae ode sde dde oneapi AMD ROCm CUDA Metal GPU',
2323
author='Chris Rackauckas and Takafumi Arakaki',
2424
author_email='[email protected]',
2525
license='MIT',

0 commit comments

Comments
 (0)