-
Notifications
You must be signed in to change notification settings - Fork 570
Description
Problem Description
For a ~4KiB buffer, we’re seeing up to 50 microseconds spent inside hipMemcpy for host to device transfers. This seems extremely long given that PCIe bandwidth/latency characteristics should make this transfer in the single digit microsecond range. As an additional data point, rocm-bandwidth-test reports 6 microseconds latency for a similar sized copy, so it seems that most of the time is spent inside the HIP/CLR software stack.
Following the HIP developer guide, we’re using pinned memory from hipHostMalloc, so any additional copies on the host side should not be happening. We’ve also ensured that the CPU core(s) being used are those that are closest in NUMA distance to the selected GPU.


Operating System
Rocky Linux 9.5
CPU
AMD EPYC 9534
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.3.3
ROCm Component
HIP
Steps to Reproduce
The following toy example averages 36-40 microseconds for the copy. While not as long as what we're seeing in our full application, this is still a significant overhead.
For reference, we've also attached profiler output from our full application, which we are unable to share.
taskset --cpu-list 0-15 rocprofv3 --runtime-trace --output-format pftrace -- {compiled binary name}
#include <hip/hip_runtime_api.h>
#include <rocblas/rocblas.h>
#include <rocsparse/rocsparse.h>
#include <iostream>
#include <cassert>
#include <random>
#define CHECK_HIP_ERROR(error) \
if (error != hipSuccess) \
{ \
std::cerr << "hip error: " << hipGetErrorString(error) << "(" << error << ")" << " at " \
<< __FILE__ << ":" << __LINE__ << std::endl; \
exit(error); \
}
int main()
{
static constexpr size_t BUFFER_COUNT = 1024;
static constexpr size_t BUFFER_SIZE = sizeof(float) * BUFFER_COUNT;
float* host_buffer;
float* device_buffer;
CHECK_HIP_ERROR(hipHostMalloc(&host_buffer, BUFFER_SIZE));
CHECK_HIP_ERROR(hipMalloc(&device_buffer, BUFFER_SIZE));
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_real_distribution<> dis(0.f, 1.f);
std::vector<float> gold;
for (size_t i = 0; i < BUFFER_COUNT; i++)
{
gold.push_back(dis(mt));
host_buffer[i] = gold.back();
}
for (size_t i = 0; i < 100; i++)
{
CHECK_HIP_ERROR(hipMemcpy(device_buffer, host_buffer, BUFFER_SIZE, hipMemcpyHostToDevice));
CHECK_HIP_ERROR(hipMemcpy(host_buffer, device_buffer, BUFFER_SIZE, hipMemcpyDeviceToHost));
}
for (size_t i = 0; i < BUFFER_COUNT; i++)
{
assert(gold[i] == host_buffer[i]);
}
}
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response