Skip to content

[Issue]: High Memory Transfer Latency with hipMemcpy #3809

@5591996036

Description

@5591996036

Problem Description

For a ~4KiB buffer, we’re seeing up to 50 microseconds spent inside hipMemcpy for host to device transfers. This seems extremely long given that PCIe bandwidth/latency characteristics should make this transfer in the single digit microsecond range. As an additional data point, rocm-bandwidth-test reports 6 microseconds latency for a similar sized copy, so it seems that most of the time is spent inside the HIP/CLR software stack.

Following the HIP developer guide, we’re using pinned memory from hipHostMalloc, so any additional copies on the host side should not be happening. We’ve also ensured that the CPU core(s) being used are those that are closest in NUMA distance to the selected GPU.

Image Image

Operating System

Rocky Linux 9.5

CPU

AMD EPYC 9534

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.3.3

ROCm Component

HIP

Steps to Reproduce

The following toy example averages 36-40 microseconds for the copy. While not as long as what we're seeing in our full application, this is still a significant overhead.
For reference, we've also attached profiler output from our full application, which we are unable to share.

taskset --cpu-list 0-15 rocprofv3 --runtime-trace --output-format pftrace -- {compiled binary name}
#include <hip/hip_runtime_api.h>
#include <rocblas/rocblas.h>
#include <rocsparse/rocsparse.h>
#include <iostream>
#include <cassert>
#include <random>

#define CHECK_HIP_ERROR(error)                                                                          \
    if (error != hipSuccess)                                                                            \
    {                                                                                                   \
        std::cerr << "hip error: " << hipGetErrorString(error) << "(" << error << ")" << " at "         \
                  << __FILE__ << ":" << __LINE__ << std::endl;                                          \
        exit(error);                                                                                    \
    }

int main()
{
    static constexpr size_t BUFFER_COUNT = 1024;
    static constexpr size_t BUFFER_SIZE = sizeof(float) * BUFFER_COUNT;

    float* host_buffer;
    float* device_buffer;
    CHECK_HIP_ERROR(hipHostMalloc(&host_buffer, BUFFER_SIZE));
    CHECK_HIP_ERROR(hipMalloc(&device_buffer, BUFFER_SIZE));

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<> dis(0.f, 1.f);
    std::vector<float> gold;
    for (size_t i = 0; i < BUFFER_COUNT; i++)
    {
        gold.push_back(dis(mt));
        host_buffer[i] = gold.back();
    }

    for (size_t i = 0; i < 100; i++)
    {
        CHECK_HIP_ERROR(hipMemcpy(device_buffer, host_buffer, BUFFER_SIZE, hipMemcpyHostToDevice));
        CHECK_HIP_ERROR(hipMemcpy(host_buffer, device_buffer, BUFFER_SIZE, hipMemcpyDeviceToHost));
    }

    for (size_t i = 0; i < BUFFER_COUNT; i++)
    {
        assert(gold[i] == host_buffer[i]);
    }
}

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

rocminfo 1.txt

Additional Information

No response

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions