@@ -456,6 +456,270 @@ Notice that the solver is accurately able to simulate the kink (discontinuity)
at `t=20` due to the discontinuity of the derivative at the initial time point!
This is why declaring discontinuities can enhance the solver accuracy.
+ ## GPU-Accelerated ODE Solving of Ensembles
+
+ In many cases one is interested in solving the same ODE many times over many
+ different initial conditions and parameters. In diffeqpy parlance this is called
+ an ensemble solve. diffeqpy inherits the parallelism tools of the
+ [SciML ecosystem](https://sciml.ai/) that are used for things like
+ [automated equation discovery and acceleration](https://arxiv.org/abs/2001.04385).
+ Here we will demonstrate using these parallel tools to accelerate the solving
+ of an ensemble.
+
+ First, let's define the JIT-accelerated Lorenz equation like before:
+
+ ```py
+ from diffeqpy import de
+
+ def f(u, p, t):
+     x, y, z = u
+     sigma, rho, beta = p
+     return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]
+
+ u0 = [1.0, 0.0, 0.0]
+ tspan = (0., 100.)
+ p = [10.0, 28.0, 8 / 3]
+ prob = de.ODEProblem(f, u0, tspan, p)
+ fast_prob = de.jit32(prob)
+ sol = de.solve(fast_prob, saveat=0.01)
+ ```
+
+ Note that here we used `de.jit32` to JIT-compile the problem into a `Float32` form in order to make it more
+ efficient on most GPUs.
+
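+ For comparison, the 64-bit JIT form used earlier in this README (`de.jit`) works the same
+ way; only the compiled precision differs. A minimal sketch:
+
+ ```py
+ # 64-bit variant: de.jit keeps Float64 precision, which is usually
+ # preferable on CPUs, while de.jit32 suits most consumer GPUs.
+ fast_prob64 = de.jit(prob)
+ sol64 = de.solve(fast_prob64, saveat=0.01)
+ ```
+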
+ Now we use the `EnsembleProblem` as defined on the
+ [ensemble parallelism page of the documentation](https://diffeq.sciml.ai/stable/features/ensemble/).
+ Let's build an ensemble by using uniform random numbers to randomize the
+ initial conditions and parameters:
+
+ ```py
+ import random
+ def prob_func(prob, i, rep):
+     # remake returns a new problem, so prob_func must return its result
+     # rather than discard it.
+     return de.remake(prob,
+                      u0=[random.uniform(0, 1) * u0[j] for j in range(0, 3)],
+                      p=[random.uniform(0, 1) * p[j] for j in range(0, 3)])
+
+ ensembleprob = de.EnsembleProblem(fast_prob, prob_func=prob_func, safetycopy=False)
+ ```
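+
+ Here `prob_func` receives the problem, the trajectory index, and a repeat counter, and
+ returns the (remade) problem to integrate for that trajectory. Since `safetycopy=False`
+ shares the underlying problem across trajectories, `prob_func` should build a new problem
+ via `de.remake` rather than mutate `prob`. A quick check of the signature (hypothetical
+ call, just for illustration):
+
+ ```py
+ newprob = prob_func(fast_prob, 1, False)  # a fresh problem with randomized u0 and p
+ ```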
+
+ Now we solve the ensemble in serial:
+
+ ```py
+ sol = de.solve(ensembleprob, de.Tsit5(), de.EnsembleSerial(), trajectories=10000, saveat=0.01)
+ ```
+
+ To add GPUs to the mix, we need to bring in [DiffEqGPU](https://github.com/SciML/DiffEqGPU.jl).
+ The `cuda` submodule will install CUDA for you and bring all of the bindings into the returned object:
+
+ ```py
+ from diffeqpy import cuda
+ ```
+
+ #### Note: `from diffeqpy import cuda` can take a while to run the first time, as it installs the drivers!
+
+ Now we simply use `EnsembleGPUKernel(cuda.CUDABackend())` with a
+ GPU-specialized ODE solver `cuda.GPUTsit5()` to solve 10,000 ODEs on the GPU in
+ parallel:
+
+ ```py
+ sol = de.solve(ensembleprob, cuda.GPUTsit5(), cuda.EnsembleGPUKernel(cuda.CUDABackend()), trajectories=10000, saveat=0.01)
+ ```
+
+ For the full list of choices for specialized GPU solvers, see
+ [the DiffEqGPU.jl documentation](https://docs.sciml.ai/DiffEqGPU/stable/manual/ensemblegpukernel/).
+
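+ For example, a higher-order explicit kernel method can be swapped in the same way. This
+ is a sketch: it assumes `GPUVern7`, one of the methods from that list, is exposed on the
+ `cuda` module the same way `GPUTsit5` is:
+
+ ```py
+ # Same ensemble, but integrated with the higher-order GPUVern7 kernel method.
+ sol = de.solve(ensembleprob, cuda.GPUVern7(), cuda.EnsembleGPUKernel(cuda.CUDABackend()),
+                trajectories=10000, saveat=0.01)
+ ```
+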
+ Note that `EnsembleGPUArray` can be used as well, like:
+
+ ```py
+ sol = de.solve(ensembleprob, de.Tsit5(), cuda.EnsembleGPUArray(cuda.CUDABackend()), trajectories=10000, saveat=0.01)
+ ```
+
+ though we highly recommend the `EnsembleGPUKernel` methods for more speed. Since the JIT
+ compilation performed here also ensures that the faster kernel-generation methods work,
+ `EnsembleGPUKernel` is almost certainly the better choice in most applications.
+
+ ### Benchmark
+
+ To see how much of an effect the parallelism has, let's test this against SciPy's
+ `odeint`. This is the same Lorenz problem written with `odeint`, so let's set that up
+ and then add a function to do the ensemble generation:
+
+ ```py
+ import numpy as np
+ from scipy.integrate import odeint
+
+ def lorenz(state, t, sigma, beta, rho):
+     x, y, z = state
+
+     dx = sigma * (y - x)
+     dy = x * (rho - z) - y
+     dz = x * y - beta * z
+
+     return [dx, dy, dz]
+
+ sigma = 10.0
+ beta = 8.0 / 3.0
+ rho = 28.0
+ p = (sigma, beta, rho)
+ y0 = [1.0, 1.0, 1.0]
+
+ t = np.arange(0.0, 100.0, 0.01)
+ result = odeint(lorenz, y0, t, p)
+ ```
+
+ Using a simple loop to generate the ensemble we get:
+
+ ```py
+ import timeit
+ def time_func():
+     for itr in range(1, 1001):
+         result = odeint(lorenz, y0, t, p)
+
+ timeit.Timer(time_func).timeit(number=1)
+
+ # 38.08861699999761 seconds
+ ```
+
+ Now let's see how the JIT-accelerated serial Julia version stacks up against that:
+
+ ```py
+ def time_func():
+     sol = de.solve(ensembleprob, de.Tsit5(), de.EnsembleSerial(), trajectories=1000, saveat=0.01)
+
+ timeit.Timer(time_func).timeit(number=1)
+
+ # 3.1903300999983912
+ ```
+
+ Julia is already about 12x faster than the pure Python solver here! Now let's add
+ GPU acceleration to the mix:
+
+ ```py
+ def time_func():
+     sol = de.solve(ensembleprob, cuda.GPUTsit5(), cuda.EnsembleGPUKernel(cuda.CUDABackend()), trajectories=1000, saveat=0.01)
+
+ timeit.Timer(time_func).timeit(number=1)
+
+ # 0.013322799997695256
+ ```
+
+ Already about 2900x faster than SciPy! But the GPU acceleration is made for massively
+ parallel problems, so let's up the trajectories a bit. We will not run more
+ trajectories with SciPy because that would take too much computing time, so let's
+ see what happens to the Julia serial and GPU versions at 10,000 trajectories:
+
+ ```py
+ def time_func():
+     sol = de.solve(ensembleprob, de.Tsit5(), de.EnsembleSerial(), trajectories=10000, saveat=0.01)
+
+ timeit.Timer(time_func).timeit(number=1)
+
+ # 68.80795999999827
+ ```
+
+ ```py
+ def time_func():
+     sol = de.solve(ensembleprob, cuda.GPUTsit5(), cuda.EnsembleGPUKernel(cuda.CUDABackend()), trajectories=10000, saveat=0.01)
+
+ timeit.Timer(time_func).timeit(number=1)
+
+ # 0.10774460000175168
+ ```
+
+ To compare this to the pure Julia code:
+
+ ```julia
+ using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays
+ function lorenz(u, p, t)
+     σ = p[1]
+     ρ = p[2]
+     β = p[3]
+     du1 = σ * (u[2] - u[1])
+     du2 = u[1] * (ρ - u[3]) - u[2]
+     du3 = u[1] * u[2] - β * u[3]
+     return SVector{3}(du1, du2, du3)
+ end
+
+ u0 = SA[1.0f0; 0.0f0; 0.0f0]
+ tspan = (0.0f0, 10.0f0)
+ p = SA[10.0f0, 28.0f0, 8 / 3.0f0]
+ prob = ODEProblem{false}(lorenz, u0, tspan, p)
+ prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
+ monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
+ @time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
+                   trajectories = 10_000,
+                   saveat = 0.01);
+
+ # 0.014481 seconds (257.64 k allocations: 13.130 MiB)
+ ```
+
+ which is about an order of magnitude faster for computing 10,000 trajectories.
+ Note the two major factors: 32-bit floating point values cannot be constructed
+ directly from Python, and the `prob_func` that generates the initial conditions
+ and parameters is a major bottleneck since that function is written in Python.
+
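+ One way to shrink that bottleneck from the Python side is to define the `prob_func` in
+ Julia itself. This is only a sketch: it assumes `de.seval` (the Julia-code evaluator used
+ elsewhere in this README) and that `remake` and `rand` resolve in that evaluation scope:
+
+ ```py
+ # Hypothetical: trajectory setup runs entirely in Julia, avoiding a
+ # Python round-trip per trajectory.
+ jul_prob_func = de.seval("""
+ (prob, i, rep) -> remake(prob,
+     u0 = rand(Float32, 3) .* prob.u0,
+     p  = rand(Float32, 3) .* prob.p)
+ """)
+ ensembleprob = de.EnsembleProblem(fast_prob, prob_func=jul_prob_func, safetycopy=False)
+ ```
+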
+ To see how this scales in Julia, let's take it to insane heights. First, let's
+ reduce the amount we're saving:
+
+ ```julia
+ @time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()), trajectories = 10_000, saveat = 1.0f0)
+ # 0.015040 seconds (257.64 k allocations: 13.130 MiB)
+ ```
+
+ This highlights that controlling memory pressure is key with GPU usage: you will
+ get much better performance when requiring fewer saved points on the GPU.
+
+ ```julia
+ @time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()), trajectories = 100_000, saveat = 1.0f0)
+ # 0.150901 seconds (2.60 M allocations: 131.576 MiB)
+ ```
+
+ compared to serial:
+
+ ```julia
+ @time sol = solve(monteprob, Tsit5(), EnsembleSerial(), trajectories = 100_000, saveat = 1.0f0)
+ # 22.136743 seconds (16.40 M allocations: 1.628 GiB, 42.98% gc time)
+ ```
+
+ And now we start to see that scaling power! Let's solve 1 million trajectories:
+
+ ```julia
+ @time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()), trajectories = 1_000_000, saveat = 1.0f0)
+ # 1.031295 seconds (3.40 M allocations: 241.075 MiB)
+ ```
+
+ For reference, let's look at the SciPy `odeint` version with the change to only save that much:
+
+ ```py
+ t = np.arange(0.0, 100.0, 1.0)
+ def time_func():
+     for itr in range(1, 1001):
+         result = odeint(lorenz, y0, t, p)
+
+ timeit.Timer(time_func).timeit(number=1)
+
+ # 37.42609280000033
+ ```
+
+ The GPU version is solving 1000x as many trajectories about 37x as fast, i.e. roughly a
+ 36,000x throughput advantage! In conclusion: if you need the most speed, you may want to
+ move to the Julia version to get the most out of your GPU via Float32s, and when using
+ GPUs, make sure your problem has relatively average or low memory pressure. These methods
+ will then give orders of magnitude acceleration compared to what you might be used to.
+
+ ## GPU Backend Choices
+
+ Just like DiffEqGPU.jl, diffeqpy supports many different GPU vendors. `from diffeqpy import cuda`
+ is just for CUDA, but the following others are supported:
+
+ - `from diffeqpy import cuda` with `cuda.CUDABackend` for NVIDIA GPUs via CUDA
+ - `from diffeqpy import amdgpu` with `amdgpu.AMDGPUBackend` for AMD GPUs
+ - `from diffeqpy import oneapi` with `oneapi.oneAPIBackend` for Intel's oneAPI GPUs
+ - `from diffeqpy import metal` with `metal.MetalBackend` for Apple's Metal GPUs (on M-series processors)
+
+ For more information, see [the DiffEqGPU.jl documentation](https://docs.sciml.ai/DiffEqGPU/stable/manual/backends/).
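+
+ Each submodule is used the same way as `cuda` above. For example, on an M-series Mac the
+ ensemble solve would look like the following (a sketch; we assume the `metal` submodule
+ exposes the same solver names, such as `GPUTsit5`, as the `cuda` module does):
+
+ ```py
+ from diffeqpy import metal
+
+ # Same ensemble as before, dispatched to the Apple GPU via the Metal backend.
+ sol = de.solve(ensembleprob, metal.GPUTsit5(), metal.EnsembleGPUKernel(metal.MetalBackend()),
+                trajectories=10000, saveat=0.01)
+ ```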
+
## Known Limitations
- Autodiff does not work on Python functions. When applicable, either define the derivative function