Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LUT based implicit triangulation #963

Draft
wants to merge 6 commits into
base: dev
Choose a base branch
from

Conversation

JonasLukasczyk
Copy link
Contributor

Hi Julien,
this PR is NOT supposed to be merged in! Here we can discuss the performance gains of a new lookup-table-based implicit triangulation (called NewImplicitTraingulation in the source code). Note, this PR comes with the test.pvsm statefile in the root directory, which just creates a 1024^3 wavelet and then runs the ttkHelloWorld filter. Here are some timings on my machine:

NewImplicitTriangulation: ~1sec
OldImplicitTriangulation: ~9sec
ModifiedOldImplicitTriangulation: ~4sec

The timings already tell the story. The key point here is that replacing the functions in the old implicit triangulation with the LUT-based functions is not enough to get the performance of the "pure" NewImplicitTriangulation. I did some experiments by making the NewImplicitTriangulation also inherit from a base class, but that had no impact on the performance (as long as overwritten functions are declared final). I guess the CRTP architecture used in the OldImplicitTriangulation is to blame, but I can not tell for sure.

If I could dream, we should move to a completely separate inheritance structure, lets call it CellComplex, and incrementally move triangulations over to there until we can do a complete replace. Since the base functions are already templatized and therefore only check if the object that was passed has the necessary functions, this effort can be a completely separate affair.

I also added @petersteneteg to the discussion, he might have some insight on this this.

@petersteneteg
Copy link
Contributor

Question, in the new implementation you have all the implementation in the header, so it can be inlined. Where as the ModifiedOldImplicitTriangulation/OldImplicitTriangulation it seems most of the implementations are in the cpp file which means that unless you are compiling with link time optimization it can't be inlined, Hence the comparison will be quite misleading. Could you make all the function inline in the headers and then redo the tests?

@JonasLukasczyk
Copy link
Contributor Author

Hi Peter,
good call, but because of the CRTP I gave up moving it to the header. Instead I forced the NewTriangulation not to use inlining and now I end up with the same timings of the ModifiedOldTriangulation. So the inlining is the last ingredient to make this fast.

The question is what we do from here. Modifying the original code is cumbersome. I really suggest to build that separate inheritance structure. Your thoughts on this @julien-tierny?

@julien-tierny
Copy link
Collaborator

Thanks a lot to the both of you.
That's very interesting that the inlining is responsible for so much performance difference. As far as I remember the experiments back then, the CRTP pattern did not degrade performance significantly.

@pierre-guillou
what do you think about this?
it looks like non-negligible performance gains could be obtained by inlining your CRTP pattern.
do you think this is something that'd be easy to do?

@pierre-guillou
Copy link
Contributor

Hello everyone,
I made some quick tests by inlining the get methods of the ImplicitTriangulation. It is quite easy to do, you just have to copy-paste the methods from the .cpp to the .h (and fix all the missing namespace prefixes and add some inline keywords).

I saw a maybe 5% performance improvement with the DiscreteGradient algorithm (which relies on a lot of triangulation calls). It seems that, with the current implementation, inlining does not deliver much (the compiler may choose not to inline the methods if it find them too big).

I played a bit with Jonas' new implementation of ImplicitTriangulationCRTP::getVertexNeighbors and ImplicitTriangulationCRTP::getVertexNeighborsNumber, with and without inlining. This time, inlining plays a much bigger role (up to x4). My results are coherent with Jonas' timings between ModifiedOldImplicitTriangulation and NewImplicitTriangulation (just move the two modified ImplicitTriangulationCRTP methods in the header file). I think this is due to Jonas' methods being much smaller, which makes them more easily inlinable by the compiler.

Note that inlining all these methods may increase the compilation times (and the required memory). We might want to ensure that TTK is still buildable on a small laptop without a lot of RAM.

Great work, Jonas!
Pierre

@julien-tierny
Copy link
Collaborator

alright, thanks a lot for your input @pierre-guillou.
it seems that we can start to have a look at inlining all of the calls of the ImplicitTriangulation.
then jonas' LUT improvement could be integrated without performance loss.

@julien-tierny
Copy link
Collaborator

also, @petersteneteg you mentioned compiler options for "link time optimization".
could that be another possibility?
is that efficient? (or will pure inlining always be significantly more efficient in terms of runtime?)

@petersteneteg
Copy link
Contributor

Link Time Optimization or LTO can basically look at both cpp files and headers and select what to inline, I can also potentially be quite effective at de-virtualizing i.e. getting rid of the extra indirection that has to happen for virtual calls. So in principle it is great. In practice it is quite hard to do. Since the optimizer has to handle the whole program at once not just one cpp file. The complexity becomes huge, and with that compilation time and memory usage. There are approaches to make it faster and consume less memory, but that also looses some of the potential optimizations... Anyway I would not want to depend on LTO for performance. But one could investigate using it for release builds where one can use a big server to build and wait a while maybe.

@julien-tierny julien-tierny marked this pull request as draft August 29, 2023 15:19
@petersteneteg
Copy link
Contributor

petersteneteg commented Aug 29, 2023

@JonasLukasczyk not really related to any perf, but some suggestions for potential refactorings https://godbolt.org/z/81cYbKW5M

This is quite similar to the version I talked to you about last week, except that I also calculate all those offsets at compile time
https://godbolt.org/z/Mfzqdoq4f
It is quite hard to follow, but that one also generates lookup tables for periodic systems and with diagonals either all in the same direction or alternating :)

Alternating diagonals and periodic in 2 directions
image

It also employs strong_types, for all the indices to make it very hard to use an index in the wrong place. i.e. a point index for an edge or some such.

@JonasLukasczyk
Copy link
Contributor Author

Thanks Peter, that looks very interesting! Here are some comments:

  1. The refactoring is nice, so we switch to that.
  2. I try to use constexpr wherever possible, but for the implicit grid in PV we do not know the dimensions of the domain at compile time.
  3. I was not really able to follow your second code base, but when I understood you correctly that is basically also a Freundenthal triangulation with additional subroutines for vertex links and stars, and on top of that even with periodicity and choices on how the grid is tetrahedralized.
  4. I think at least for regular grids we will now move towards a drop-in replacement. My original plan was to use the LUT based triangulation, but if your code performs as well than we could use it instead. Before we continue with this I want to clarify the requirements for the replacement:
  • The methods GetVertexNeighborNumber and GetVertexNeighbors must have similar performance than the LUT based approach.
  • In the end the drop in replacement needs to implement all public methods of the ttk::AbstractTriangulation class.
  • The new super class CellComplex should be abstract enough to also support cubical complexes. Therefore some names will have to change, such as GetTriangleStar.

What are your thoughts everyone?

@julien-tierny
Copy link
Collaborator

hi guys,

thanks a lot for your input to the discussion!

my thoughts are that there are two distinct topics here.

  1. inlining the Implicit Triangulation to enable Jonas' LUT full performance.
    for this topic, I guess the first action would be to move all the calls to the header as discussed by @petersteneteg and @pierre-guillou (that does not seem too difficult).
    ideally, the LUT approach could be extended to all types of query of the implicit triangulation, not only the getVertexNeighbor one (this task requires significant human resources)

  2. extending the current API towards a more generic cell complex.
    for me, this is an orthogonal discussion. there, I'd say the first action would be to stick to the current class hierarchy, and to change all function names including the words triangle and tetrahedron (into face and cell). then the cubical cell complex would need to be implemented based on this API (this task requires significant human resources) and then fix all TTK modules were implicit assumptions were made regarding the number of vertices per cell (this task also requires significant human resources).

while inlining or renaming functions is easy, the other tasks (tagged with "significant human resources") could easily take somewhere between a few months and a year to complete for an engineer (in my view).

best,

@JonasLukasczyk
Copy link
Contributor Author

I agree with what you are saying Julien. The good news here is that due to the base code templatization the cell complex can be worked on separately from the existing triangulation. But as a first result lets integrate the LUT-based approach in the header of the ImplicitTriangulation. I just tried to do that but then I run into the following compile error:

Triangulation.cpp:86:34: error: use of deleted function ‘ttk::ImplicitNoPreconditions& ttk::ImplicitNoPreconditions::operator=(const ttk::ImplicitNoPreconditions&)’
   86 |     implicitTriangulation_ = rhs.implicitTriangulation_;
      |                                  ^~~~~~~~~~~~~~~~~~~~~~

@pierre-guillou I added you as a maintainer to my fork. Could you please push the inlined version here?

@pierre-guillou
Copy link
Contributor

pierre-guillou commented Aug 30, 2023

@JonasLukasczyk your error is linked to the const std::array member variables. Removing the const keyword in front of these variables should do the trick. (edit: Using static const also works).
I've pushed my (quick&dirty) inlined version on your fork, branch implicit_inlining. Note that only the two modified methods getVertexNeighbor and getVertexNeighborNumber have been inlined.

@JonasLukasczyk
Copy link
Contributor Author

Thank you! I will run some tests.

@julien-tierny julien-tierny added the WIP work in progress label Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
WIP work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants