Open
Description
At the moment only one dependent variable b
is allowed to solve A @ x = b
for x
.
In other algorithms, multiple dependent variables could be passed, e.g.:
https://numpy.org/devdocs/reference/generated/numpy.linalg.solve.html
Since the solver is implemented in cython, this could be optimized by using prange.
Metadata
Metadata
Assignees
Labels
Type
Projects
Milestone
Relationships
Development
No branches or pull requests
Activity
MothNik commentedon May 26, 2024
Hello,
Since this still seems to be of interest, I wanted to give some input here. I'm a co-developer of the package
chemotools
where I lately integratedpentapy
to speed up the solution of pentadiagonal matrices which are encountered during spline smoothing (PR).pentapy
gave quite some speedup, but I also stumbled over the multiple right-hand sides problem.Therefore, I forked
pentapy
to play around a little and I've found that there are still improvements that can be made in terms of runtime and especially memory.I wanted to optimize those first before going into multiple right-hand sides and even parallelization.
I've found the two things:
some of the vectors are used to store data that are never used/don't have to be used:
ze
is a freshly allocated vector for holding the transformed right-hand side. It's common practice to overwrite this vector during backward/forward substitutions. It should be noted that usingze
to holdresult
is saving ~1/6 of the additonal memory required by the algorithm (not considering the left-hand side matrix and the right-hand side vector themselves).ga
andmu
are used to store elements that are never truly acessed again:One might now say that they are referenced right after, but that's why I added the truly in there (I will come to that in a moment). Again, not initialising them will lead to savings of 2/6 of the total memory. Together with
result
, we can get rid off ~50% of the memory requirements 🤯 This means that running the algorithms two times in parallel will have roughly the same memory requirements as the current version. On top of that, the initialisation has a Python interaction and removing them will give a speedup for free.all variables are written to and read again from the internally allocated vectors. While Cython has optimized indexing when compared to Python, a memory access to an Array in the RAM is never really or at least hard to get fast. This is where caching comes in and the pentadiagonal solvers have a bit of a random access order by design (that's naturally given from the original publication and nothing one can fix here unless there is a better algorithm).
In the
pentapy
-implementation there are a lot of intermediate results being computed, stored in vectors, and read again right after. Since Array-accessing is quite expensive, it comes as no surprise that the algorithm spends a significant fraction of time in read-write-processes while it should mostly be doing numerics.How can this be fixed?
First of all,
result
can be replaced byze
in a very straightforward way.Then
ga
andmu
can be dropped completely in favour for the intermediate variablesga_i
andmu_i
, so basically 2xn * 64
-bit vectors are replaced by 2x64
-bit floats that can be accessed directly.Now comes the heavy part of using intermediate results efficiently. All values stored in
al
,be
, andze
are frequently reused. Usually,
x[i-2]
andx[i-1]
are involved in computing all the variables for thei
-th step wherex
is a placeholder for the respective vectors. Therefore, the variables:al_i
,al_i_minus_1
,al_i_plus_1
be_i
,be_i_minus_1
,be_i_plus_1
ze_i
,ze_i_minus_1
,ze_i_plus_1
are introduced (so in total
9 * 64
bit floats more). With their help, the vector accesses can be reduced to 1 write in the factorization and 1 read in the backward substitution.In that go, we can also read from
mat_flat[4, i]
once and use it over and over again (it was read from many times).Doing so, I ended up with the following code for Solver 1 (I didn't yet go over Solver 2):
Is all of this not only micro-optimization?
My first attempts ended up making everything slower and I almost dropped it. But after applying the mentioned optimizations, I found a consistent 20 ... 25% speedup with
perfplot
(applied to the solver only, nothing around):This is admittedly not a quantum leap when it comes to solving pentadiagonal systems, but it's also not nothing. These optimizations can easily save a few minutes in the applications encountered in
chemotools
. Also, that means that 5 right hand sides can be solved in the same time that 4 take with the current version. Besides, this only considers the runtime, but not the memory savings.How to proceed
Would you be interested in a collaboration and me submitting a PR on this from the fork I made?
Thanks for Your time!
I wish a great start in the week!
Kind regards
Appendix - What did not do any good
Changing from C-contiguous to F-contiguous Arrays that would fit better into the solvers access pattern has only a minor effect, i.e.,
did not really improve anything:
MuellerSeb commentedon May 27, 2024
Hey there!
This is awesome 🎉
I have to admit that I just straight forward implemented the algorithms from the referenced paper and didn't care to much about memory efficiency.
I am happy to review a PR from you! Could you update the second algorithm as well?
Cheers,
Sebastian
MothNik commentedon May 29, 2024
Of course I can also update the second algorithm.
Do you have a timeline on this? Just for me to plan.
MothNik commentedon May 29, 2024
Would you actually be interested in porting the low-level implementation to Rust? That would make it more accessible and the Python interface is also relatively straightforward. With your CI to build wheels already in place, distribution would be managable.
I think this repository is great, but the use of Cython unnecessarily limits its use to Python only 🤔
MuellerSeb commentedon Jun 2, 2024
We did something similar with https://github.com/GeoStat-Framework/GSTools-Core/ in the past.
Since I recently created
solver.pxd
file to be able to cimport the solver module in other cython code, I wouldn't drop the cython implementation.There is already a mechanism to select the desired solving backend (like solve_banded, scipy sparse, umf_pack ...), so it could be an idea to create a rust implementation with python bindings and add it to the list of available solvers.
Whats your opinion here?
MuellerSeb commentedon Jun 2, 2024
There is no timeline from my side with this package. ATM just doing simple maintenance.
MothNik commentedon Jun 3, 2024
Ah yeah, that really makes sense!
I'll see what I can do this week with respect to improving algorithm II and also the multiple right-hand sides.
MothNik commentedon Jun 5, 2024
I wanted to provide a tiny update on what I did for Algorithm 1 to make it capable of handling multiple right-hand sides.
After I looked at the algorithm for long enough, I figured that the factorization of the left-hand side matrix is independent of the transformation and solve of the right-hand side.
This allows for splitting them up and saving the factorization results to apply it for multiple right-hand sides afterwards.
An implementation can look like:
What did that do to the runtime?
The algorithm is now as fast as the current implementation in
pentapy
.What does it matter? right-hand sides in a (potentially parallelized) loop would take
The straightforward approach of just running a full solve (factorization + transformation + triangular solve) for
On the other hand, the split approach that saves the factorization and then simply applies it for the tranformations and triangular solves would require
I timed the factorization and the transformations and triangular solves individually and found the following:
So, both steps require roughly the same amount of time, i.e., . That means while the straightforward approach will take as approaches . The split approach on the other hand will only take approximately while already outperforming the former for tiniest values of .
In other words, the split approach will to a good approximation only require the time it takes for a standard unit tridiagonal transformation and backward substitution per 1 right-hand side while the factorization becomes negligible.
Extending the upper code to multiple right-hand sides is quite simple by iterating over the columns of
rhs
after the factorization is done. I'll for now probably just do a plain loop and then we can see whether parallelization is an issue or not given that the solves are very very fast and even 1000 right-hand sides with 10000 variables should only take 100 ms when I extrapolate what we see here. Something that fast is hardly worth parallelizing given the parallelization overhead.@MuellerSeb I would highly appreciate your input on the parallelization (and all the other points of course 😉). What are your/the users' typical usecases? For me, the worst case is solving roughly 3000 right-hand sides with 4000 variables each, so this will easily be done in less than a second. Of course, I can still try what
prange
brings in terms of performance.If we go in that direction, would you like to have an argument
n_workers
in the Python interface as it's typical for parallelized SciPy-functions? Then, the user would have full control of how many cores are used.n=3
#24n=3
-edge case, made tests elaborate #26MothNik commentedon Jul 10, 2024
@MuellerSeb
I hope you are doing well!
I wanted to ask how your timeline is on this issue and if you need me to provide further input. I don't want to build up any pressure, please take your time! I just need to make a design decision that could involve a multithreaded
pentapy
or another solution and to make the decision an estimate would be very helpful. Thanks!MuellerSeb commentedon Jul 23, 2024
Hey there, sorry for the long delay. I am currently preparing my parental leave and there are a lot of issues on my table to solve before leaving.
Since this package only has 14 stars ATM, there are not that much users and you seem to be the first power user besides me. I mainly wrote this package for Anaflow where I solve a PDE in laplace space and since the matrix is changing for every time-step, I only solve for one RHS at a time.
I would really love to see the usage of prange since I made good experience with it. It mostly only means to exchange
range
withprange
for the outer most or inner most loop. The nice thing about prange is, that it acts like a normal range when we are not linking against openmp.OpenMP support should be disabled by default, since cibuildwheel would include the openmp runtime in the wheel which is bad practice since it could mess with other packages using openmp (on conda this is fine).
To do so, I would us an env-var that controls this. Have a look at the GSTools setup, where we implemented this and there we also used the https://github.com/astropy/extension-helpers/ package from astropy to find the correct compile flags to link against openmp.
prange comes with an argument
num_threads
which isNone
by default to use all available cores. This argument could be forwarded from thesolve
routine with None as default. The high-level name is debatable.Thanks for all the great work. The performance improvements look really good.
Cheers,
Sebastian
MothNik commentedon Jul 23, 2024
Thanks for checking back!
I already wish all the best for you and your family then 🎉🥳
Thanks for pointing out the OpenMP issues. I wasn't aware of that since I never packaged a Python package 🙏
I'll come up with some updates then and already incorporate the feedback on the PR.
MothNik commentedon Jul 23, 2024
I think you really saved my a** with the OpenMP issue that would have kicked me quite hard for another package I'm writing.
I incorporated all the changes where you provided input so far 👍
Have a great time! 🙃