Skip to content

Implement omp backend for basic math operators #103

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

pbartholomew08
Copy link
Member

No description provided.

@pbartholomew08 pbartholomew08 added the omp Related to openMP backend label Jun 24, 2024
@pbartholomew08 pbartholomew08 self-assigned this Jun 24, 2024
@pbartholomew08 pbartholomew08 marked this pull request as ready for review June 26, 2024 09:44
@pbartholomew08 pbartholomew08 marked this pull request as draft June 26, 2024 09:44
@pbartholomew08 pbartholomew08 force-pushed the 93-implement-omp-backend-for-basic-math-operators branch 3 times, most recently from 37395ab to c007d31 Compare June 26, 2024 20:05
@pbartholomew08 pbartholomew08 marked this pull request as ready for review June 26, 2024 20:10
@pbartholomew08
Copy link
Member Author

I've added tests (OMP only, however it should be relatively easy to adapt to CUDA) for the scalar produce and sum into x implementations (vecadd appears to be tested indirectly through the timestepping tests).

@pbartholomew08 pbartholomew08 requested review from Nanoseb and semi-h June 27, 2024 08:35
error stop "Called scalar product with incompatible fields"
end if

dims = self%mesh%get_field_dims(x)
Copy link
Member

@semi-h semi-h Jun 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might give us a bit of headache later on.

scalar_product is the only function in the backends as of now that we have to be careful with padding. With all the rest we can simply carry out whatever operation we're doing in the padded regions and it'll be alright, because the results go into the padded bits anyways, but with scalar product we're summing the meaningful entries in the domain into a single scalar, and need to exclude padded ones obviously.

get_field_dims in the mesh class is a bit misleading I believe. It returns the actual dimensions for a DIR_C array, but for DIR_X/Y/Z arrays its a bit different. It returns the actual number of entries along its own DIR, but the first and last dims are always SZ and the number of groups.

The problem is that we potentially have paddings in all dimensions, therefore some groups in the domain will have some of their lines as padded entries. We have to find a way to exclude them in the sum.

My plan to fix this on the CUDA backend is simply having an if clause in the kernel (its like inside the do i = 1, SZ loop here) to skip the padded ones. Its easy to figure out the padding from the i, j, k indices anyways. But this won't be a good solution on the OpenMP backend for performance reasons.

What we can do instead is we can have two set of i, j, k loops, where the first one only deals with the groups with no padding in the whole domain in a vectorised way (with no cleanup bit inside), and then another i, j, k loop region, where we only focus on the groups with padding. Because there are relatively few such groups, not having vectorisation won't be a big problem. (I kept saying i, j, k loops, but in practice we can instead need to have i, j, m, n loop, where m, n determines the value of k.)

TL;DR the present scalar_product implementation can give us wrong results if the domain size requries padding. The current CUDA implementation would also give wrong results if domain size requires padding, and requires a fix. If you're happy to stick to nice domain sizes for the moment, it will work just fine. Just keep in mind that if you have padding the results might be wrong.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_field_dims in the mesh class is a bit misleading I believe. It returns the actual dimensions for a DIR_C array, but for DIR_X/Y/Z arrays its a bit different. It returns the actual number of entries along its own DIR, but the first and last dims are always SZ and the number of groups.

I'm not sure I understand the use case for get_field_dims then, based on

 pure function get_padded_dims_dir(self, dir) result(dims_padded)
  !! Getter for padded dimensions with structure in `dir` direction

I would have thought the former gets the non-padded dimensions, and the latter the padded dimensions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a use case for get_field_dims to be honest. If one needs non-padded dimensions along the present DIR and also the number of groups its better to get these with two separate calls instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is we can't really have non-padded dimensions in DIR_X/Y/Z arrangements. Certain combinations of the first index and last index result in padding in that case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it not be better to have it return the non-padded dimensions? Are the padded dimensions not just size(phi%data)? How do we get the non-padded dimensions then? I'm sorry this is unclear to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very tricky and not documented at all, my bad.

   2 4 6 8 10
   o o o o o
   x x x x x
   x x x x x
___x x x x x___
   x x x x x
   x x x x x
   x x x x x
   x x x x x
   1 3 5 7 9

group numbers are either below or above. Each group have 4 lines in total. x indicates a real entry, o indicates a padding. _is just a visual guide for separating two groups. Domain size is (nx=n, ny=7, nz=5) Lets say x-direction is towards inside the screen, y-dir is up, and z-dir is to the right. The dims of our DIR_X array shown here is (SZ=4, nx, 10). The groups numbered 2, 4, 6, 8, 10 have some padding in them, but 1, 3, 5, 7, and 9 don't. So we can't really have a non-padded k index. What we can say is that only group numbers divisible by 2 include some amount of padding. To make this easy we can divide a k loop into two loops, so that one loop goes from 1 to 2(because we have two groups in y-direction, ny=7, SZ=4). and then the other loop goes from 1 to 5 (because nz=5).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

padded dimension of an array is equal to size(phi%data).

By the way, the easiest solution here would be calling a reorder function and reorder whatever DIR we have happen to have in scalar_product into DIR_C. Then the 3 dimensions and their paddings are very easy to deal with. non-padded dimensions can be obtained by get_dims(phi%data_loc), and then loop bounds will make sure we stay within the non-padded entries. The only issue with this is it does a reorder unnecessarily. In the solver we use scalar_product only once in a while, for the enstrophy, so no problem if it doens't perform well. But I think the iterative solver uses it a bit more often, but not sure how much weight it has overall.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll go with the reordering approach. The iterative solver shouldn't have to do this, we only need to provide a routine for lapl(p), any scalar products in the solver are handled by PETSc

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we not have a function that returns the 'non padded' SZ value given the group index? It won't be perfect for performance because vectorisation might be harder, but that should be better than reordering and be more 'clear'

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, scalar product is now implemented using reordering to DIR_C :)

pbartholomew08 and others added 11 commits June 28, 2024 10:41
Except when using a Cartesian field we can exploit vectorisation on
the inner loop

Note the scalar product is collective on MPI_COMM_WORLD
This generalises the implementation of sum_Iintox
Implementing the inner loop as a vector + remainder loop operation
allows a single implementation of both vs previous (non-)vectorised
approach and exploits vectorisation whenever possible
@pbartholomew08 pbartholomew08 force-pushed the 93-implement-omp-backend-for-basic-math-operators branch from adeccda to bd07574 Compare June 28, 2024 15:00
@pbartholomew08 pbartholomew08 requested a review from semi-h June 28, 2024 15:01
@pbartholomew08
Copy link
Member Author

I think this branch is ready to go

@Nanoseb
Copy link
Collaborator

Nanoseb commented Jun 28, 2024

Just a handful of very minor comments, otherwise lgtm

@pbartholomew08
Copy link
Member Author

I think all resolved - I've left the larger conversation about using reordering unresolved in case there's anything there I've missed.

@pbartholomew08 pbartholomew08 merged commit 3a75900 into xcompact3d:main Jun 28, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
omp Related to openMP backend
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants