|
| 1 | +Getting Started with Intel(R) MKL-DNN with GPU support {#getting_started_gpu} |
| 2 | +============================================================================= |
| 3 | + |
| 4 | +This is an introduction to Intel MKL-DNN with GPU support. |
| 5 | +We are going to walk through a simple example to demonstrate OpenCL\* extensions API in Intel MKL-DNN. |
| 6 | + |
| 7 | +## Intel MKL-DNN basic workflow |
| 8 | + |
| 9 | +A very simple workflow in Intel MKL-DNN includes the following steps: |
| 10 | + |
| 11 | +- Engine creation |
| 12 | +- Input/output memory objects creation |
| 13 | + - Memory descriptors creation |
| 14 | + - Memory objects creation |
| 15 | +- Operation primitive creation |
| 16 | + - Operation descriptor creation |
| 17 | + - Operation primitive descriptor creation |
| 18 | + - Primitive creation |
| 19 | +- Stream object creation |
| 20 | +- Primitive submission for execution to a stream |
| 21 | + |
| 22 | +## Create engine and memory object |
| 23 | + |
| 24 | +Let's create a GPU engine object. The second parameter specifies the index of the requested engine. |
| 25 | + |
| 26 | +~~~cpp |
| 27 | +auto eng = engine(engine::kind::gpu, 0); |
| 28 | +~~~ |
| 29 | + |
| 30 | +Then, we create a memory object. We need to specify dimensions of our memory by passing `memory::dims` object. |
| 31 | +Then we create a memory descriptor with these dimensions, with `f32` data type and `nchw` memory format. |
| 32 | +Finally, we construct a memory object and pass the memory descriptor. The library allocates memory internally. |
| 33 | + |
| 34 | +~~~cpp |
| 35 | +auto tz_dims = memory::dims{2, 3, 4, 5}; |
| 36 | +memory::desc mem_d(tz_dims, memory::data_type::f32, memory::format_tag::nchw); |
| 37 | +memory mem(mem_d, eng); |
| 38 | +~~~ |
| 39 | +
|
| 40 | +## Initialize the data executing a custom OpenCL kernel |
| 41 | +
|
| 42 | +We are going to create an OpenCL kernel that will initialize our data. |
| 43 | +It requries writing a bit of C code to create an OpenCL program from a string literal source, build it and extract the kernel. |
| 44 | +The kernel initializes the data by the `0, -1, 2, -3, ...` sequence: `data[i] = (-1)^i * i`. |
| 45 | +
|
| 46 | +~~~cpp |
| 47 | +const char *ocl_code |
| 48 | + = "__kernel void init(__global float *data) {" |
| 49 | + " int id = get_global_id(0);" |
| 50 | + " data[id] = (id % 2) ? -id : id;" |
| 51 | + "}"; |
| 52 | +const char *kernel_name = "init"; |
| 53 | +cl_kernel ocl_init_kernel = create_init_opencl_kernel( |
| 54 | + eng.get_ocl_context(), kernel_name, ocl_code); |
| 55 | +~~~ |
| 56 | + |
| 57 | +Refer to the full code example for the code of `create_init_opencl_kernel()` function. |
| 58 | +The next step is to execute our OpenCL kernel: set its arguments and enqueue to an OpenCL queue. |
| 59 | +The underlying OpenCL buffer can be extracted from the memory object using |
| 60 | +the interoperability interface: `memory::get_ocl_mem_object()`. |
| 61 | +For simplicity we can just construct a stream, extract the underlying OpenCL queue and enqueue the kernel to this queue: |
| 62 | + |
| 63 | +~~~cpp |
| 64 | +cl_mem ocl_buf = mem.get_ocl_mem_object(); |
| 65 | +clSetKernelArg(ocl_init_kernel, 0, sizeof(ocl_buf), &ocl_buf); |
| 66 | + |
| 67 | +mkldnn::stream strm(eng); |
| 68 | +cl_command_queue ocl_queue = strm.get_ocl_command_queue(); |
| 69 | +clEnqueueNDRangeKernel(ocl_queue, ocl_init_kernel, 1, nullptr, &N, nullptr, 0, |
| 70 | + nullptr, nullptr); |
| 71 | +~~~ |
| 72 | +
|
| 73 | +## Create and execute a primitive |
| 74 | +
|
| 75 | +There are 3 steps to create an operation primitive in Intel MKL-DNN: |
| 76 | +
|
| 77 | +- Create an operation descriptor |
| 78 | +- Create a primitive descriptor |
| 79 | +- Create a primitive |
| 80 | +
|
| 81 | +Let's create the primitive to perform ReLU (recitified linear unit) operation: `x = max(0, x)`. |
| 82 | +
|
| 83 | +~~~cpp |
| 84 | +auto relu_d = eltwise_forward::desc(prop_kind::forward, algorithm::eltwise_relu, |
| 85 | + mem_d, 0.0f); |
| 86 | +auto relu_pd = eltwise_forward::primitive_desc(relu_d, eng); |
| 87 | +auto relu = eltwise_forward(relu_pd); |
| 88 | +~~~ |
| 89 | + |
| 90 | +From the code above we see that an operation descriptor has no dependency on a specific engine - it just describes some operation. |
| 91 | +On the contrary, primitive descriptors are attached to a specific engine and represent some implementation for this engine. |
| 92 | +A primitive object is realization of a primitive descriptor and its construction is usually much "heavier". |
| 93 | + |
| 94 | +Note that for our primitive `mem` serves as both input and output parameter. |
| 95 | + |
| 96 | +Next, execute the primitive: |
| 97 | + |
| 98 | +~~~cpp |
| 99 | +relu.execute(strm, { { MKLDNN_ARG_SRC, mem }, { MKLDNN_ARG_DST, mem } }); |
| 100 | +~~~ |
| 101 | +
|
| 102 | +Note, primitive submission on GPU is asynchronous. |
| 103 | +But user can call `stream::wait()` to synchronize the stream and ensure that all previously submitted primitives are completed. |
| 104 | +
|
| 105 | +## Validating the results |
| 106 | +
|
| 107 | +The simplest way to access the OpenCL memory is to map it to the host using `memory::map_data()` and `memory::unmap_data()` APIs. |
| 108 | +After mapping this data is directly accessible (reading/writing) on the host. Whlie the data is mapped, any GPU-side operations on this data are not allowed. |
| 109 | +The data should be unmapped to release all resources associated with mapping. |
| 110 | +
|
| 111 | +~~~cpp |
| 112 | +float *mapped_data = mem.map_data<float>(); |
| 113 | +for (size_t i = 0; i < N; i++) { |
| 114 | + float expected = (i % 2) ? 0.0f : (float)i; |
| 115 | + assert(mapped_data[i] == expected); |
| 116 | +} |
| 117 | +mem.unmap_data(mapped_data); |
| 118 | +~~~ |
| 119 | + |
| 120 | +--- |
| 121 | + |
| 122 | +The full code example is listed below: |
| 123 | + |
| 124 | +~~~cpp |
| 125 | +#include <CL/cl.h> |
| 126 | +#include <mkldnn.hpp> |
| 127 | + |
| 128 | +#include <cassert> |
| 129 | +#include <iostream> |
| 130 | +#include <numeric> |
| 131 | + |
| 132 | +using namespace mkldnn; |
| 133 | + |
| 134 | +#define OCL_CHECK(x) \ |
| 135 | + do { \ |
| 136 | + cl_int s = (x); \ |
| 137 | + if (s != CL_SUCCESS) { \ |
| 138 | + printf("OpenCL error: %d at %s:%d\n", s, __FILE__, __LINE__); \ |
| 139 | + exit(1); \ |
| 140 | + } \ |
| 141 | + } while (0) |
| 142 | + |
| 143 | +cl_kernel create_init_opencl_kernel( |
| 144 | + cl_context ocl_ctx, const char *kernel_name, const char *ocl_code) { |
| 145 | + cl_int err; |
| 146 | + const char *sources[] = { ocl_code }; |
| 147 | + cl_program ocl_program |
| 148 | + = clCreateProgramWithSource(ocl_ctx, 1, sources, nullptr, &err); |
| 149 | + OCL_CHECK(err); |
| 150 | + |
| 151 | + OCL_CHECK( |
| 152 | + clBuildProgram(ocl_program, 0, nullptr, nullptr, nullptr, nullptr)); |
| 153 | + |
| 154 | + cl_kernel ocl_kernel = clCreateKernel(ocl_program, kernel_name, &err); |
| 155 | + OCL_CHECK(err); |
| 156 | + |
| 157 | + OCL_CHECK(clReleaseProgram(ocl_program)); |
| 158 | + return ocl_kernel; |
| 159 | +} |
| 160 | + |
| 161 | +int main() { |
| 162 | + memory::dims tz_dims = { 2, 3, 4, 5 }; |
| 163 | + const size_t N = std::accumulate(tz_dims.begin(), tz_dims.end(), (size_t)1, |
| 164 | + std::multiplies<size_t>()); |
| 165 | + |
| 166 | + memory::desc mem_d(tz_dims, memory::data_type::f32, |
| 167 | + memory::format_tag::nchw); |
| 168 | + |
| 169 | + engine eng(engine::kind::gpu, 0); |
| 170 | + memory mem(mem_d, eng); |
| 171 | + |
| 172 | + // Extract OpenCL buffer from memory object |
| 173 | + cl_mem ocl_buf = mem.get_ocl_mem_object(); |
| 174 | + |
| 175 | + // Create stream |
| 176 | + mkldnn::stream strm(eng); |
| 177 | + |
| 178 | + // Create custom OpenCL kernel to initialize the data |
| 179 | + const char *ocl_code |
| 180 | + = "__kernel void init(__global float *data) {" |
| 181 | + " int id = get_global_id(0);" |
| 182 | + " data[id] = (id % 2) ? -id : id;" |
| 183 | + "}"; |
| 184 | + const char *kernel_name = "init"; |
| 185 | + cl_kernel ocl_init_kernel = create_init_opencl_kernel( |
| 186 | + eng.get_ocl_context(), kernel_name, ocl_code); |
| 187 | + |
| 188 | + // Execute the custom OpenCL kernel |
| 189 | + OCL_CHECK(clSetKernelArg(ocl_init_kernel, 0, sizeof(ocl_buf), &ocl_buf)); |
| 190 | + |
| 191 | + cl_command_queue ocl_queue = strm.get_ocl_command_queue(); |
| 192 | + OCL_CHECK(clEnqueueNDRangeKernel(ocl_queue, ocl_init_kernel, 1, nullptr, &N, |
| 193 | + nullptr, 0, nullptr, nullptr)); |
| 194 | + |
| 195 | + // Perform ReLU operation by executing the primitive |
| 196 | + auto relu_d = eltwise_forward::desc(prop_kind::forward, |
| 197 | + algorithm::eltwise_relu, mem_d, 0.0f); |
| 198 | + auto relu_pd = eltwise_forward::primitive_desc(relu_d, eng); |
| 199 | + auto relu = eltwise_forward(relu_pd); |
| 200 | + relu.execute(strm, { { MKLDNN_ARG_SRC, mem }, { MKLDNN_ARG_DST, mem } }); |
| 201 | + strm.wait(); |
| 202 | + |
| 203 | + // Map the data to the host to validate the results |
| 204 | + float *mapped_data = mem.map_data<float>(); |
| 205 | + for (size_t i = 0; i < N; i++) { |
| 206 | + float expected = (i % 2) ? 0.0f : (float)i; |
| 207 | + assert(mapped_data[i] == expected); |
| 208 | + } |
| 209 | + mem.unmap_data(mapped_data); |
| 210 | + |
| 211 | + OCL_CHECK(clReleaseKernel(ocl_init_kernel)); |
| 212 | + |
| 213 | + std::cout << "PASSED" << std::endl; |
| 214 | + return 0; |
| 215 | +} |
| 216 | +~~~ |
| 217 | + |
| 218 | +--- |
| 219 | + |
| 220 | +[Legal information](@ref legal_information) |
0 commit comments