It looks like cupla does not swap the number of threads per block and the number of elements per thread when using the OpenMP 4.0 backend.
Using alpaka directly, with the swap explicitly in place:
Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 532.66 us
Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 283.06 us
Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 211.79 us
Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 632.7 us
Using cupla:
Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us
Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us
Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us
Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us
The much larger time observed with the OpenMP 4.0 backend under cupla (128157 us, versus 632.7 us with alpaka and the swap in place) is consistent with what I was seeing with alpaka before introducing the swap between threads and elements.
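For reference, the swap I am referring to can be sketched roughly as follows. This is only an illustration of the idea, not alpaka or cupla code: `WorkDiv` and `makeWorkDiv` are hypothetical names, and the `singleThreadPerBlock` flag stands in for whatever backend trait would select the swapped layout.

```cpp
#include <cstddef>

// Hypothetical 1D work division; NOT the actual alpaka or cupla type.
struct WorkDiv {
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Hypothetical helper (an assumption for illustration only).
// For backends that should run a single thread per block, the CUDA-style
// "threads per block" value is moved into the element layer instead.
constexpr WorkDiv makeWorkDiv(std::size_t blocks,
                              std::size_t threadsOrElements,
                              bool singleThreadPerBlock) {
    return singleThreadPerBlock
        ? WorkDiv{blocks, 1, threadsOrElements}   // swapped: 1 thread, N elements
        : WorkDiv{blocks, threadsOrElements, 1};  // CUDA-like: N threads, 1 element
}
```

With the values from the logs above, `makeWorkDiv(71, 512, true)` gives 71 blocks, 1 thread per block and 512 elements per thread, which matches the alpaka runs; the cupla runs report 512 threads per block for every backend, including OpenMP 4.0.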