Device api v2 perftest #11125

Open
amastbaum wants to merge 3 commits into openucx:master from amastbaum:device_api_v2_perftest

@amastbaum
Contributor

What?

Added a perftest for device API v2.

@amastbaum requested a review from ofirfarjun7 on January 20, 2026 09:19
_func(UCX_PERF_CMD_PUT_SINGLE, __VA_ARGS__); \
break; \
case UCX_PERF_CMD_PUT_SINGLE_V2: \
_func(UCX_PERF_CMD_PUT_SINGLE_V2, __VA_ARGS__); \
Contributor

I have a general suggestion here:
Could we use the switch statement's fall-through behavior to avoid the repetition?
Something like:

#define UCX_PERF_SWITCH_CMD(_cmd, _func, ...) \
    switch (_cmd) { \
    case UCX_PERF_CMD_PUT_SINGLE: \
    case UCX_PERF_CMD_PUT_SINGLE_V2: \
    case UCX_PERF_CMD_PUT_MULTI: \
    case UCX_PERF_CMD_PUT_PARTIAL: \
        _func(_cmd, __VA_ARGS__); \
        break; \
    default: \
        ucs_error("Unsupported cmd: %d", _cmd); \
        break; \
    }

The same could be applied to the UCX_PERF_SWITCH_LEVEL macro.

Contributor Author

The problem is that the value of _cmd is eventually used as a template argument in the kernel launch:
_kernel<_level, _cmd><<<_blocks, _threads, _shared_size>>>(__VA_ARGS__);
nvcc requires template arguments to be compile-time constants, but with fall-through _cmd remains a runtime variable. Each case therefore has to pass its own enumerator to _func.
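
To illustrate the constraint, here is a minimal host-only C++ sketch. The names perf_cmd_t, run_kernel, PERF_SWITCH_CMD, dispatch, and example_usage are hypothetical stand-ins rather than UCX code, and an ordinary function template takes the place of the CUDA kernel launch. It shows that each case names its enumerator explicitly so that the template argument is a constant expression.

```cpp
#include <cassert>

// Hypothetical stand-in for the perftest command enum.
enum perf_cmd_t : int { CMD_PUT_SINGLE, CMD_PUT_SINGLE_V2, CMD_PUT_MULTI };

// Stand-in for the templated kernel: the command is a template parameter,
// so the argument supplied at each call site must be a constant expression.
template <perf_cmd_t cmd>
int run_kernel(int arg)
{
    return static_cast<int>(cmd) * 100 + arg;
}

// Each case passes the enumerator itself. Writing _func<_cmd>(...) with the
// runtime variable would be rejected by the compiler.
#define PERF_SWITCH_CMD(_cmd, _result, _func, ...) \
    switch (_cmd) { \
    case CMD_PUT_SINGLE: \
        _result = _func<CMD_PUT_SINGLE>(__VA_ARGS__); \
        break; \
    case CMD_PUT_SINGLE_V2: \
        _result = _func<CMD_PUT_SINGLE_V2>(__VA_ARGS__); \
        break; \
    case CMD_PUT_MULTI: \
        _result = _func<CMD_PUT_MULTI>(__VA_ARGS__); \
        break; \
    default: \
        _result = -1; /* unsupported command */ \
        break; \
    }

// Maps a runtime command value to the matching template instantiation.
inline int dispatch(perf_cmd_t cmd, int arg)
{
    int result = -1;
    PERF_SWITCH_CMD(cmd, result, run_kernel, arg);
    return result;
}

// Small usage check: each runtime value reaches its own instantiation.
inline void example_usage()
{
    assert(dispatch(CMD_PUT_SINGLE, 7) == 7);
    assert(dispatch(CMD_PUT_SINGLE_V2, 7) == 107);
}
```

If the repetition becomes a maintenance concern, an X-macro that lists the enumerators once and expands to the cases is one possible way to shorten it while keeping every template argument constant.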

@amastbaum force-pushed the device_api_v2_perftest branch from 8271a71 to 9f60574 on January 25, 2026 11:11
@amastbaum force-pushed the device_api_v2_perftest branch from 6b2e563 to f4e9803 on January 27, 2026 14:14
elems[i].length = perf.params.msg_size_list[i];
offset += elems[i].length;

/* local elements - API v2 */
Contributor

Maybe initialize the mem list elements on demand, according to the test, instead of always initializing both.

size_t count = data_count + (has_counter(perf) ? 1 : 0);
size_t offset = 0;
ucp_device_mem_list_elem_t elems[count];
ucp_device_mem_list_elem_t local_elems[count];
Contributor

With the latest API we can use the existing ucp_device_mem_list_elem_t elems[count] to create all handle types.

throw std::runtime_error("Failed to create memory list");
}

ucp_device_mem_list_params_t local_params;
Contributor

Maybe create the mem list handle on demand, according to the test, instead of always creating both.
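
One way to read this suggestion is to branch on the command before building anything. The sketch below is host-only C++ with hypothetical stand-ins: perf_cmd, uses_api_v2, mem_lists, and create_mem_lists are not UCX names, and plain vectors take the place of the real element arrays and handles. Which lists each API version actually needs is an assumption here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical command enum; only the two single-put variants are modeled.
enum class perf_cmd { put_single, put_single_v2 };

// Assumption: the V2 command is the only one that needs the V2 lists.
inline bool uses_api_v2(perf_cmd cmd)
{
    return cmd == perf_cmd::put_single_v2;
}

// Stand-in for the two kinds of memory lists; an empty vector means
// "not created".
struct mem_lists {
    std::vector<std::size_t> v1_elems;
    std::vector<std::size_t> v2_local_elems;
};

// Builds only the list that the selected command will use.
inline mem_lists create_mem_lists(perf_cmd cmd, std::size_t count)
{
    assert(count > 0);

    mem_lists lists;
    if (uses_api_v2(cmd)) {
        lists.v2_local_elems.assign(count, 0);
    } else {
        lists.v1_elems.assign(count, 0);
    }
    return lists;
}
```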

UCX_PERF_CMD_AM,
UCX_PERF_CMD_PUT,
UCX_PERF_CMD_PUT_SINGLE,
UCX_PERF_CMD_PUT_SINGLE_V2,
Contributor

Can we maybe use the existing UCX_PERF_CMD_PUT instead of UCX_PERF_CMD_PUT_SINGLE_V2?
UCX_PERF_CMD_PUT is used for host put tests, but maybe we can also use it for the device put test if we can distinguish between them by the -a option.

remote_params.elements = remote_elems;

deadline = ucs_get_time() + ucs_time_from_sec(60.0);
do {
Contributor

@ofirfarjun7 commented on Jan 27, 2026

Maybe use a helper function to improve code reuse and separation?
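
As a rough illustration of what such a helper could look like, the sketch below factors a retry-until-deadline loop into one function. It is an assumption-laden example: retry_until is a hypothetical name, std::chrono stands in for ucs_get_time() and ucs_time_from_sec(), and the body of the original loop is not visible in the quoted context, so the attempt callback is a placeholder for it.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: calls `attempt` until it returns true or the timeout
// elapses. The callback is always invoked at least once.
// Returns true on success and false if the deadline passed first.
inline bool retry_until(const std::function<bool()> &attempt,
                        std::chrono::steady_clock::duration timeout)
{
    assert(attempt != nullptr);

    const auto deadline = std::chrono::steady_clock::now() + timeout;
    do {
        if (attempt()) {
            return true;
        }
    } while (std::chrono::steady_clock::now() < deadline);

    return false;
}
```

Each call site would then pass its own creation step as the callback, for example a lambda that tries to create one memory list and reports whether it succeeded, so the local and remote paths could share a single loop.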

3 participants