Implement omp backend for basic math operators #103
Conversation
37395ab to c007d31 (force-push)
I've added tests (OMP only; however, it should be relatively easy to adapt them to CUDA) for the scalar product and sum-into-x implementations (vecadd appears to be tested indirectly through the timestepping tests).
src/omp/backend.f90 (Outdated)

  error stop "Called scalar product with incompatible fields"
  end if

  dims = self%mesh%get_field_dims(x)
This might give us a bit of a headache later on. scalar_product is the only function in the backends, as of now, where we have to be careful with padding. With all the rest we can simply carry out whatever operation we're doing in the padded regions and it'll be alright, because the results go into the padded bits anyway; but with scalar product we're summing the meaningful entries in the domain into a single scalar, and obviously need to exclude the padded ones.

get_field_dims in the mesh class is a bit misleading, I believe. It returns the actual dimensions for a DIR_C array, but for DIR_X/Y/Z arrays it's a bit different: it returns the actual number of entries along its own DIR, but the first and last dims are always SZ and the number of groups.

The problem is that we potentially have padding in all dimensions, so some groups in the domain will have some of their lines as padded entries. We have to find a way to exclude them from the sum.

My plan to fix this on the CUDA backend is simply to have an if clause in the kernel (the equivalent of inside the do i = 1, SZ loop here) to skip the padded entries; it's easy to figure out the padding from the i, j, k indices anyway. But this won't be a good solution on the OpenMP backend, for performance reasons.

What we can do instead is have two sets of i, j, k loops: the first deals only with the groups that have no padding anywhere in the domain, in a vectorised way (with no cleanup bit inside), and the second loop region focuses only on the groups with padding. Because there are relatively few such groups, not having vectorisation there won't be a big problem. (I kept saying i, j, k loops, but in practice we instead need an i, j, m, n loop, where m and n determine the value of k.)
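For illustration, here is a minimal self-contained sketch of that two-pass idea; all names are hypothetical, and the real backend works through the field and mesh types, which are not reproduced here:

  program scalar_product_split
     implicit none
     integer, parameter :: dp = kind(0.0d0)
     integer, parameter :: SZ = 4, nx = 8, ny = 7, nz = 5
     integer, parameter :: ngy = (ny + SZ - 1)/SZ   ! y-groups per z-plane (2)
     integer, parameter :: ngroups = ngy*nz         ! 10 groups in total
     real(dp) :: u(SZ, nx, ngroups), v(SZ, nx, ngroups), s
     integer :: i, j, k, m, n, nfull, nrem

     u = 1.0_dp; v = 1.0_dp      ! padded entries set too, but never summed

     nfull = ny/SZ               ! y-groups with no padding at all (1)
     nrem = ny - nfull*SZ        ! real lines in the last, padded y-group (3)

     s = 0.0_dp
     ! pass 1: unpadded groups only; inner loop is always full SZ, vectorises
     do n = 1, nz
        do m = 1, nfull
           k = (n - 1)*ngy + m
           do j = 1, nx
              do i = 1, SZ
                 s = s + u(i, j, k)*v(i, j, k)
              end do
           end do
        end do
     end do
     ! pass 2: the one padded y-group per z-plane; short inner loop, and
     ! losing vectorisation here is cheap because there are few such groups
     if (nrem > 0) then
        do n = 1, nz
           k = (n - 1)*ngy + nfull + 1
           do j = 1, nx
              do i = 1, nrem
                 s = s + u(i, j, k)*v(i, j, k)
              end do
           end do
        end do
     end if

     print *, s, '(expect', real(nx*ny*nz, dp), ')'
  end program scalar_product_split

With every entry set to 1 the result should equal nx*ny*nz = 280, since the padded lines are never touched by either pass.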
TL;DR: the present scalar_product implementation (and the current CUDA one) can give us wrong results if the domain size requires padding, and needs a fix. If you're happy to stick to nice domain sizes for the moment, it will work just fine; just keep in mind that with padding the results might be wrong.
> get_field_dims in the mesh class is a bit misleading, I believe. It returns the actual dimensions for a DIR_C array, but for DIR_X/Y/Z arrays it's a bit different: it returns the actual number of entries along its own DIR, but the first and last dims are always SZ and the number of groups.

I'm not sure I understand the use case for get_field_dims then; based on

  pure function get_padded_dims_dir(self, dir) result(dims_padded)
    !! Getter for padded dimensions with structure in `dir` direction

I would have thought the former gets the non-padded dimensions, and the latter the padded dimensions?
I can't think of a use case for get_field_dims, to be honest. If one needs the non-padded dimensions along the present DIR and also the number of groups, it's better to get these with two separate calls instead.
The problem is we can't really have non-padded dimensions in DIR_X/Y/Z arrangements. Certain combinations of the first index and last index result in padding in that case.
Would it not be better to have it return the non-padded dimensions? Are the padded dimensions not just size(phi%data)? How do we get the non-padded dimensions then? I'm sorry, this is unclear to me.
This is very tricky and not documented at all, my bad.
  2 4 6 8 10
  o o o o o
  x x x x x
  x x x x x
  ___x x x x x___
  x x x x x
  x x x x x
  x x x x x
  x x x x x
  1 3 5 7 9
Group numbers are either below or above. Each group has 4 lines in total. x indicates a real entry, o indicates padding; _ is just a visual guide separating two groups. The domain size is (nx=n, ny=7, nz=5). Let's say the x-direction points into the screen, the y-direction is up, and the z-direction is to the right. The dims of the DIR_X array shown here are (SZ=4, nx, 10). The groups numbered 2, 4, 6, 8, 10 have some padding in them, but 1, 3, 5, 7, and 9 don't. So we can't really have a non-padded k index. What we can say is that only group numbers divisible by 2 include some amount of padding. To make this easy we can divide a k loop into two loops, so that one loop goes from 1 to 2 (because we have two groups in the y-direction; ny=7, SZ=4), and the other loop goes from 1 to 5 (because nz=5).
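To make the numbering concrete, here is a tiny standalone program (hypothetical names, matching the ny=7, nz=5, SZ=4 example above) that recovers the (m, n) pair behind each k and counts the real lines per group:

  program group_padding_map
     implicit none
     integer, parameter :: SZ = 4, ny = 7, nz = 5
     integer, parameter :: ngy = (ny + SZ - 1)/SZ   ! 2 y-groups per z-plane
     integer :: m, n, k, nlines

     do n = 1, nz               ! z index, 1..5
        do m = 1, ngy           ! y-group index, 1..2
           k = (n - 1)*ngy + m  ! odd k below, even k above, as in the diagram
           nlines = min(SZ, ny - (m - 1)*SZ)   ! real (non-padded) lines
           print '(a,i3,a,i2,a,i2)', 'group', k, ' has', nlines, &
              ' real lines of', SZ
        end do
     end do
  end program group_padding_map

It reports 3 real lines for groups 2, 4, 6, 8, 10 and the full 4 for the odd-numbered groups, matching the o row in the diagram.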
The padded dimension of an array is equal to size(phi%data).

By the way, the easiest solution here would be to call a reorder function and reorder whatever DIR we happen to have in scalar_product into DIR_C. Then the 3 dimensions and their paddings are very easy to deal with: the non-padded dimensions can be obtained via get_dims(phi%data_loc), and the loop bounds will then make sure we stay within the non-padded entries. The only issue is that this does a reorder unnecessarily. In the solver we use scalar_product only once in a while, for the enstrophy, so it's no problem if it doesn't perform well. I think the iterative solver uses it a bit more often, but I'm not sure how much weight it has overall.
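As a rough sketch of what the reordering route boils down to: the snippet below assumes the reorder into DIR_C has already happened (the actual reorder call and get_dims(phi%data_loc) are not reproduced), so it simply starts from a padded Cartesian array with illustrative sizes:

  program scalar_product_cartesian
     implicit none
     integer, parameter :: dp = kind(0.0d0)
     integer, parameter :: nx = 8, ny = 7, nz = 5   ! real dims
     integer, parameter :: px = 8, py = 8, pz = 8   ! padded dims (illustrative)
     real(dp) :: u(px, py, pz), v(px, py, pz), s
     integer :: i, j, k

     ! in the backend u and v would be the result of reordering the
     ! DIR_X/Y/Z fields into DIR_C; here we just assume Cartesian data
     u = 1.0_dp; v = 1.0_dp

     s = 0.0_dp
     do k = 1, nz
        do j = 1, ny
           do i = 1, nx      ! loop bounds keep us inside the real entries
              s = s + u(i, j, k)*v(i, j, k)
           end do
        end do
     end do
     ! the real implementation would finish with an MPI reduction across ranks
     print *, s, '(expect', real(nx*ny*nz, dp), ')'
  end program scalar_product_cartesian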
OK, I'll go with the reordering approach. The iterative solver shouldn't have to do this: we only need to provide a routine for lapl(p), and any scalar products in the solver are handled by PETSc.
Could we not have a function that returns the 'non-padded' SZ value given the group index? It won't be perfect for performance because vectorisation might be harder, but it should be better than reordering, and be 'clearer'.
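A minimal sketch of such a function; the name real_sz and the argument list are made up, and it assumes the group layout of the earlier example, with y-groups varying fastest within each z-plane:

  pure integer function real_sz(k, ngy, ny, sz_len)
     !! Number of real (non-padded) entries along the first dimension
     !! of group k, given ngy y-groups per z-plane and SZ = sz_len.
     integer, intent(in) :: k, ngy, ny, sz_len
     integer :: m
     m = mod(k - 1, ngy) + 1                  ! y-group index recovered from k
     real_sz = min(sz_len, ny - (m - 1)*sz_len)
  end function real_sz

The inner loop of the scalar product would then become do i = 1, real_sz(k, ngy, ny, SZ), at the cost of a variable trip count.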
OK, scalar product is now implemented using reordering to DIR_C :)

- Except when using a Cartesian field, where we can exploit vectorisation on the inner loop. Note the scalar product is collective on MPI_COMM_WORLD.
- This generalises the implementation of sum_Iintox.
- Implementing the inner loop as a vector + remainder operation allows a single implementation of both, vs the previous (non-)vectorised approach, and exploits vectorisation whenever possible.
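Judging from the commit messages above, the vector + remainder version presumably looks something like this single-path loop (a standalone guess with made-up names, not the actual backend code): the inner extent equals the full SZ for unpadded groups, so the compiler's vector path covers almost everything, and the padded groups just run a shorter trip count:

  program scalar_product_vec_rem
     implicit none
     integer, parameter :: dp = kind(0.0d0)
     integer, parameter :: SZ = 4, nx = 8, ny = 7, nz = 5
     integer, parameter :: ngy = (ny + SZ - 1)/SZ, ngroups = ngy*nz
     real(dp) :: u(SZ, nx, ngroups), v(SZ, nx, ngroups), s
     integer :: i, j, k, ni

     u = 1.0_dp; v = 1.0_dp

     s = 0.0_dp
     !$omp parallel do private(i, j, ni) reduction(+:s)
     do k = 1, ngroups
        ! real entries along the first dim: SZ except in padded groups
        ni = min(SZ, ny - mod(k - 1, ngy)*SZ)
        do j = 1, nx
           do i = 1, ni   ! full vector width almost everywhere, remainder otherwise
              s = s + u(i, j, k)*v(i, j, k)
           end do
        end do
     end do
     !$omp end parallel do

     print *, s, '(expect', real(nx*ny*nz, dp), ')'
  end program scalar_product_vec_rem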
adeccda to bd07574 (force-push)
I think this branch is ready to go.

Just a handful of very minor comments, otherwise lgtm.

I think all resolved. I've left the larger conversation about using reordering unresolved in case there's anything there I've missed.