Implement omp backend for basic math operators #103
Conversation
37395ab to c007d31 (force-push)
I've added tests (OMP only; however, it should be relatively easy to adapt them to CUDA) for the scalar product and sum-into-x implementations (vecadd appears to be tested indirectly through the timestepping tests).
src/omp/backend.f90 (Outdated)

  error stop "Called scalar product with incompatible fields"
  end if

  dims = self%mesh%get_field_dims(x)
This might give us a bit of a headache later on. scalar_product is the only function in the backends, as of now, where we have to be careful with padding. With all the rest we can simply carry out whatever operation we're doing in the padded regions and it'll be alright, because the results go into the padded bits anyway; but with scalar product we're summing the meaningful entries in the domain into a single scalar, and obviously need to exclude the padded ones.

get_field_dims in the mesh class is a bit misleading, I believe. It returns the actual dimensions for a DIR_C array, but for DIR_X/Y/Z arrays it's a bit different: it returns the actual number of entries along its own DIR, but the first and last dims are always SZ and the number of groups.

The problem is that we potentially have padding in all dimensions, so some groups in the domain will have some of their lines as padded entries. We have to find a way to exclude them from the sum.

My plan to fix this on the CUDA backend is simply to have an if clause in the kernel (the equivalent of inside the do i = 1, SZ loop here) to skip the padded entries; it's easy to figure out the padding from the i, j, k indices anyway. But this won't be a good solution on the OpenMP backend, for performance reasons.

What we can do instead is have two sets of i, j, k loops: the first deals only with the groups that have no padding anywhere in the domain, in a vectorised way (with no cleanup bit inside), and the second loop region focuses only on the groups with padding. Because there are relatively few such groups, not having vectorisation there won't be a big problem. (I kept saying i, j, k loops, but in practice we instead need an i, j, m, n loop, where m and n determine the value of k.)
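For illustration, here is a minimal self-contained sketch of that two-pass idea; all names are hypothetical, and the real backend works through the field and mesh types, which are not reproduced here:

  program scalar_product_split
     implicit none
     integer, parameter :: dp = kind(0.0d0)
     integer, parameter :: SZ = 4, nx = 8, ny = 7, nz = 5
     integer, parameter :: ngy = (ny + SZ - 1)/SZ   ! y-groups per z-plane (2)
     integer, parameter :: ngroups = ngy*nz         ! 10 groups in total
     real(dp) :: u(SZ, nx, ngroups), v(SZ, nx, ngroups), s
     integer :: i, j, k, m, n, nfull, nrem

     u = 1.0_dp; v = 1.0_dp      ! padded entries set too, but never summed

     nfull = ny/SZ               ! y-groups with no padding at all (1)
     nrem = ny - nfull*SZ        ! real lines in the last, padded y-group (3)

     s = 0.0_dp
     ! pass 1: unpadded groups only; inner loop is always full SZ, vectorises
     do n = 1, nz
        do m = 1, nfull
           k = (n - 1)*ngy + m
           do j = 1, nx
              do i = 1, SZ
                 s = s + u(i, j, k)*v(i, j, k)
              end do
           end do
        end do
     end do
     ! pass 2: the one padded y-group per z-plane; short inner loop, and
     ! losing vectorisation here is cheap because there are few such groups
     if (nrem > 0) then
        do n = 1, nz
           k = (n - 1)*ngy + nfull + 1
           do j = 1, nx
              do i = 1, nrem
                 s = s + u(i, j, k)*v(i, j, k)
              end do
           end do
        end do
     end if

     print *, s, '(expect', real(nx*ny*nz, dp), ')'
  end program scalar_product_split

With every entry set to 1 the result should equal nx*ny*nz = 280, since the padded lines are never touched by either pass.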
TL;DR: the present scalar_product implementation (and the current CUDA one) can give us wrong results if the domain size requires padding, and needs a fix. If you're happy to stick to nice domain sizes for the moment, it will work just fine; just keep in mind that with padding the results might be wrong.
> get_field_dims in the mesh class is a bit misleading, I believe. It returns the actual dimensions for a DIR_C array, but for DIR_X/Y/Z arrays it's a bit different: it returns the actual number of entries along its own DIR, but the first and last dims are always SZ and the number of groups.

I'm not sure I understand the use case for get_field_dims then; based on

  pure function get_padded_dims_dir(self, dir) result(dims_padded)
    !! Getter for padded dimensions with structure in `dir` direction

I would have thought the former gets the non-padded dimensions, and the latter the padded dimensions?
I can't think of a use case for get_field_dims, to be honest. If one needs the non-padded dimensions along the present DIR and also the number of groups, it's better to get these with two separate calls instead.
The problem is we can't really have non-padded dimensions in DIR_X/Y/Z arrangements. Certain combinations of the first index and last index result in padding in that case.
Would it not be better to have it return the non-padded dimensions? Are the padded dimensions not just size(phi%data)? How do we get the non-padded dimensions then? I'm sorry, this is unclear to me.
This is very tricky and not documented at all, my bad.
  2 4 6 8 10
  o o o o o
  x x x x x
  x x x x x
  ___x x x x x___
  x x x x x
  x x x x x
  x x x x x
  x x x x x
  1 3 5 7 9
Group numbers are either below or above. Each group has 4 lines in total. x indicates a real entry, o indicates padding; _ is just a visual guide separating two groups. The domain size is (nx=n, ny=7, nz=5). Let's say the x-direction points into the screen, the y-direction is up, and the z-direction is to the right. The dims of the DIR_X array shown here are (SZ=4, nx, 10). The groups numbered 2, 4, 6, 8, 10 have some padding in them, but 1, 3, 5, 7, and 9 don't. So we can't really have a non-padded k index. What we can say is that only group numbers divisible by 2 include some amount of padding. To make this easy we can divide a k loop into two loops, so that one loop goes from 1 to 2 (because we have two groups in the y-direction; ny=7, SZ=4), and the other loop goes from 1 to 5 (because nz=5).
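To make the numbering concrete, here is a tiny standalone program (hypothetical names, matching the ny=7, nz=5, SZ=4 example above) that recovers the (m, n) pair behind each k and counts the real lines per group:

  program group_padding_map
     implicit none
     integer, parameter :: SZ = 4, ny = 7, nz = 5
     integer, parameter :: ngy = (ny + SZ - 1)/SZ   ! 2 y-groups per z-plane
     integer :: m, n, k, nlines

     do n = 1, nz               ! z index, 1..5
        do m = 1, ngy           ! y-group index, 1..2
           k = (n - 1)*ngy + m  ! odd k below, even k above, as in the diagram
           nlines = min(SZ, ny - (m - 1)*SZ)   ! real (non-padded) lines
           print '(a,i3,a,i2,a,i2)', 'group', k, ' has', nlines, &
              ' real lines of', SZ
        end do
     end do
  end program group_padding_map

It reports 3 real lines for groups 2, 4, 6, 8, 10 and the full 4 for the odd-numbered groups, matching the o row in the diagram.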
The padded dimension of an array is equal to size(phi%data).

By the way, the easiest solution here would be to call a reorder function and reorder whatever DIR we happen to have in scalar_product into DIR_C. Then the 3 dimensions and their paddings are very easy to deal with: the non-padded dimensions can be obtained via get_dims(phi%data_loc), and the loop bounds will then make sure we stay within the non-padded entries. The only issue is that this does a reorder unnecessarily. In the solver we use scalar_product only once in a while, for the enstrophy, so it's no problem if it doesn't perform well. I think the iterative solver uses it a bit more often, but I'm not sure how much weight it has overall.
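As a rough sketch of what the reordering route boils down to: the snippet below assumes the reorder into DIR_C has already happened (the actual reorder call and get_dims(phi%data_loc) are not reproduced), so it simply starts from a padded Cartesian array with illustrative sizes:

  program scalar_product_cartesian
     implicit none
     integer, parameter :: dp = kind(0.0d0)
     integer, parameter :: nx = 8, ny = 7, nz = 5   ! real dims
     integer, parameter :: px = 8, py = 8, pz = 8   ! padded dims (illustrative)
     real(dp) :: u(px, py, pz), v(px, py, pz), s
     integer :: i, j, k

     ! in the backend u and v would be the result of reordering the
     ! DIR_X/Y/Z fields into DIR_C; here we just assume Cartesian data
     u = 1.0_dp; v = 1.0_dp

     s = 0.0_dp
     do k = 1, nz
        do j = 1, ny
           do i = 1, nx      ! loop bounds keep us inside the real entries
              s = s + u(i, j, k)*v(i, j, k)
           end do
        end do
     end do
     ! the real implementation would finish with an MPI reduction across ranks
     print *, s, '(expect', real(nx*ny*nz, dp), ')'
  end program scalar_product_cartesian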
OK, I'll go with the reordering approach. The iterative solver shouldn't have to do this: we only need to provide a routine for lapl(p), and any scalar products in the solver are handled by PETSc.
Could we not have a function that returns the 'non-padded' SZ value given the group index? It won't be perfect for performance because vectorisation might be harder, but it should be better than reordering, and be 'clearer'.
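A minimal sketch of such a function; the name real_sz and the argument list are made up, and it assumes the group layout of the earlier example, with y-groups varying fastest within each z-plane:

  pure integer function real_sz(k, ngy, ny, sz_len)
     !! Number of real (non-padded) entries along the first dimension
     !! of group k, given ngy y-groups per z-plane and SZ = sz_len.
     integer, intent(in) :: k, ngy, ny, sz_len
     integer :: m
     m = mod(k - 1, ngy) + 1                  ! y-group index recovered from k
     real_sz = min(sz_len, ny - (m - 1)*sz_len)
  end function real_sz

The inner loop of the scalar product would then become do i = 1, real_sz(k, ngy, ny, SZ), at the cost of a variable trip count.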
OK, scalar product is now implemented using reordering to DIR_C :)

- Except when using a Cartesian field, where we can exploit vectorisation on the inner loop. Note the scalar product is collective on MPI_COMM_WORLD.
- This generalises the implementation of sum_Iintox.
- Implementing the inner loop as a vector + remainder operation allows a single implementation of both, vs the previous (non-)vectorised approach, and exploits vectorisation whenever possible.
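Judging from the commit messages above, the vector + remainder version presumably looks something like this single-path loop (a standalone guess with made-up names, not the actual backend code): the inner extent equals the full SZ for unpadded groups, so the compiler's vector path covers almost everything, and the padded groups just run a shorter trip count:

  program scalar_product_vec_rem
     implicit none
     integer, parameter :: dp = kind(0.0d0)
     integer, parameter :: SZ = 4, nx = 8, ny = 7, nz = 5
     integer, parameter :: ngy = (ny + SZ - 1)/SZ, ngroups = ngy*nz
     real(dp) :: u(SZ, nx, ngroups), v(SZ, nx, ngroups), s
     integer :: i, j, k, ni

     u = 1.0_dp; v = 1.0_dp

     s = 0.0_dp
     !$omp parallel do private(i, j, ni) reduction(+:s)
     do k = 1, ngroups
        ! real entries along the first dim: SZ except in padded groups
        ni = min(SZ, ny - mod(k - 1, ngy)*SZ)
        do j = 1, nx
           do i = 1, ni   ! full vector width almost everywhere, remainder otherwise
              s = s + u(i, j, k)*v(i, j, k)
           end do
        end do
     end do
     !$omp end parallel do

     print *, s, '(expect', real(nx*ny*nz, dp), ')'
  end program scalar_product_vec_rem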
adeccda to bd07574 (force-push)
I think this branch is ready to go.

Just a handful of very minor comments, otherwise lgtm.

I think all resolved. I've left the larger conversation about using reordering unresolved in case there's anything there I've missed.