threads vs elements when using the OpenMP 4.0 backend #156

@fwyzard

It looks like cupla does not swap the number of threads and elements when using the OpenMP 4.0 backend.

Using alpaka directly, with the swap explicitly in place:

Running with the blocking serial CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 532.66 us

Running with the non-blocking TBB CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 283.06 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 211.79 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: (71), threads per block: (1), elements per thread: (512)
Output: 1699 modules in 632.7 us

Using cupla:

Running with the blocking serial CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 471.79 us

Running with the non-blocking TBB CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 240.64 us

Running with the non-blocking OpenMP 2.0 blocks CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 186.92 us

Running with the non-blocking OpenMP 4.0 CPU backend...
blocks per grid: 71, threads per block: 512
Output: 1699 modules in 128157 us

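With cupla, the OpenMP 4.0 backend takes 128157 us, about 200 times longer than the 632.7 us measured with alpaka and the swap in place. This much larger time is consistent with what I was seeing with alpaka before introducing the swap between threads and elements.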
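
To make explicit what "swap" means here, the following is a minimal sketch of the mapping, not the actual alpaka or cupla code. The `WorkDiv` struct, `makeWorkDiv`, `itemsPerBlock`, and the `swapThreadsAndElements` flag are hypothetical names introduced only for illustration. It assumes a CUDA-style launch of 71 blocks of 512 threads, as in the output above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical work-division triple; not the real alpaka or cupla type.
struct WorkDiv {
    std::size_t blocksPerGrid;
    std::size_t threadsPerBlock;
    std::size_t elementsPerThread;
};

// Map a CUDA-style launch (blocks, threads) onto a work division.
// With the swap, each block runs a single thread that loops over
// `threads` elements; without it, each block runs `threads` threads
// that handle one element each.
inline WorkDiv makeWorkDiv(std::size_t blocks,
                           std::size_t threads,
                           bool swapThreadsAndElements) {
    assert(blocks > 0 && threads > 0);
    if (swapThreadsAndElements) {
        return WorkDiv{blocks, 1, threads};
    }
    return WorkDiv{blocks, threads, 1};
}

// The amount of work per block is the same in both cases.
inline std::size_t itemsPerBlock(const WorkDiv& wd) {
    return wd.threadsPerBlock * wd.elementsPerThread;
}
```

In this sketch, `makeWorkDiv(71, 512, true)` gives the (71), (1), (512) division reported in the alpaka runs, while `makeWorkDiv(71, 512, false)` gives 71 blocks of 512 threads, which is what the cupla output reports for every backend, including OpenMP 4.0.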
