
Commit 31df9d3

doc: update for gpu support

1 parent 80f633b commit 31df9d3

File tree

3 files changed: +298 -2 lines changed


README.md

Lines changed: 32 additions & 1 deletion
~~~diff
@@ -13,7 +13,8 @@
 Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) is
 an open-source performance library for deep-learning applications. The library
-accelerates deep-learning applications and frameworks on Intel architecture.
+accelerates deep-learning applications and frameworks on Intel(R) architecture and
+Intel(R) Processor Graphics Architecture.
 Intel MKL-DNN contains vectorized and threaded building blocks that you can
 use to implement deep neural networks (DNN) with C and C++ interfaces.
~~~

~~~diff
@@ -80,6 +81,7 @@ Please submit your questions, feature requests, and bug reports on the
 **WARNING** The following functionality has preview status and might change
 without prior notification in future releases:
 * Threading Building Blocks (TBB) support
+* Intel(R) Processor Graphics support
 
 ## How to Contribute
 We welcome community contributions to Intel MKL-DNN. If you have an idea on how to improve the library:
~~~
~~~diff
@@ -106,6 +108,10 @@ Ivy Bridge, Haswell, and Broadwell)
 and compatible processors.
 
+Intel MKL-DNN supports Intel(R) Processor Graphics.
+The library is optimized for the systems based on
+* Intel(R) Iris(R) Pro Graphics.
+
 The software dependencies are:
 * [Cmake](https://cmake.org/download/) 2.8.0 or later
 * [Doxygen](http://www.stack.nl/~dimitri/doxygen/download.html#srcbin) 1.8.5 or later
~~~
~~~diff
@@ -115,6 +121,9 @@ The software dependencies are:
 * Threading Building Blocks (TBB) 2017 or later
 * Intel MKL 2017 Update 1 or Intel MKL small libraries
 
+The additional software dependencies for Intel(R) Processor Graphics support:
+* [Intel(R) SDK for OpenCL\* applications](https://software.intel.com/en-us/intel-opencl)
+
 > **Note**
 > Building Intel MKL-DNN with optional dependencies may introduce additional
 > runtime dependencies for the library. For details, refer to the corresponding
~~~
~~~diff
@@ -136,6 +145,13 @@ on macOS\* 10.13 (High Sierra) with
 * [Intel C/C++ Compiler](https://software.intel.com/en-us/intel-parallel-studio-xe)
   18.0 and 19.0
 
+Intel(R) Processor Graphics support was validated on Ubuntu\* 18.04 with
+* GNU Compiler Collection 5.4 and 8.1
+* Clang\* 3.8.0
+* [Intel C/C++ Compiler](https://software.intel.com/en-us/intel-parallel-studio-xe)
+  19.0
+* Intel(R) SDK for OpenCL\* applications version 18.1
+
 The implementation uses OpenMP 4.0 SIMD extensions. We recommend using the
 Intel C++ Compiler for the best performance results.
~~~

~~~diff
@@ -265,10 +281,22 @@ supported version.
 
 Configure CMake and create a makefile:
 
+The library can be built in the following configurations:
+
+Configuration with **CPU support (Native backend)**:
+
 ```
 mkdir -p build && cd build && cmake $CMAKE_OPTIONS ..
 ```
 
+Configuration with **CPU support (Native backend) and GPU support (OpenCL backend)**:
+
+```
+mkdir -p build && cd build && cmake -DMKLDNN_GPU_BACKEND=OPENCL $CMAKE_OPTIONS ..
+```
+
+You can use the option `-DOPENCLROOT=...` to specify the path to the OpenCL installation manually.
+
 Build the application:
 
 ```
~~~
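As a quick illustration of the GPU-enabled configure step above, the snippet below composes the full `cmake` command line. The `OPENCL_ROOT` path is a hypothetical example, not part of the commit; adjust it to wherever the OpenCL SDK is installed on your system, and drop `-DOPENCLROOT` entirely if CMake finds OpenCL on its own.

```shell
set -e

# Hypothetical OpenCL install location; adjust to your system.
OPENCL_ROOT=/opt/intel/opencl

# Compose the configure command for the CPU+GPU (OpenCL backend) build.
# -DOPENCLROOT is only needed when CMake cannot locate OpenCL by itself.
CONFIGURE="cmake -DMKLDNN_GPU_BACKEND=OPENCL -DOPENCLROOT=${OPENCL_ROOT} .."
echo "${CONFIGURE}"
```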
~~~diff
@@ -353,6 +381,7 @@ Intel MKL-DNN was built.
 |:--- |:---
 |include/mkldnn.h | C header
 |include/mkldnn.hpp | C++ header
+|include/mkldnn_config.h| Library configuration file
 |include/mkldnn_types.h | Auxiliary C header
 |lib/libmkldnn.so | Intel MKL-DNN dynamic library
 |lib/libmkldnn.a | Intel MKL-DNN static library (if built with `MKLDNN_LIBRARY_TYPE=STATIC`)
~~~
~~~diff
@@ -366,6 +395,7 @@ Intel MKL-DNN was built.
 |:--- |:---
 |include/mkldnn.h | C header
 |include/mkldnn.hpp | C++ header
+|include/mkldnn_config.h | Library configuration file
 |include/mkldnn_types.h | Auxiliary C header
 |lib/libmkldnn.dylib | Intel MKL-DNN dynamic library
 |lib/libmkldnn.a | Intel MKL-DNN static library (if built with `MKLDNN_LIBRARY_TYPE=STATIC`)
~~~
~~~diff
@@ -407,6 +437,7 @@ Intel MKL-DNN was built.
 |bin\libmklml.dll | Intel MKL small library (if built with `MKLDNN_USE_MKL=ML`)
 |include\mkldnn.h | C header
 |include\mkldnn.hpp | C++ header
+|include\mkldnn_config.h| Library configuration file
 |include\mkldnn_types.h | Auxiliary C header
 |lib\libmkldnn.lib | Intel MKL-DNN import library
 |lib\libiomp5.lib | Intel OpenMP\* runtime import library (if built with `MKLDNN_USE_MKL=ML`)
~~~

doc/getting_started_gpu.md

Lines changed: 220 additions & 0 deletions
@@ -0,0 +1,220 @@

Getting Started with Intel(R) MKL-DNN with GPU support {#getting_started_gpu}
=============================================================================

This is an introduction to Intel MKL-DNN with GPU support.
We are going to walk through a simple example that demonstrates the OpenCL\* extensions API in Intel MKL-DNN.

## Intel MKL-DNN basic workflow

A very simple workflow in Intel MKL-DNN includes the following steps:

- Engine creation
- Input/output memory objects creation
  - Memory descriptors creation
  - Memory objects creation
- Operation primitive creation
  - Operation descriptor creation
  - Operation primitive descriptor creation
  - Primitive creation
- Stream object creation
- Primitive submission for execution to a stream
## Create engine and memory object

Let's create a GPU engine object. The second parameter specifies the index of the requested engine.

~~~cpp
auto eng = engine(engine::kind::gpu, 0);
~~~

Next, we create a memory object. We need to specify the dimensions of our memory by passing a `memory::dims` object.
Then we create a memory descriptor with these dimensions, with the `f32` data type, and with the `nchw` memory format.
Finally, we construct the memory object and pass the memory descriptor to it. The library allocates the underlying buffer internally.

~~~cpp
auto tz_dims = memory::dims{2, 3, 4, 5};
memory::desc mem_d(tz_dims, memory::data_type::f32, memory::format_tag::nchw);
memory mem(mem_d, eng);
~~~
## Initialize the data by executing a custom OpenCL kernel

We are going to create an OpenCL kernel that initializes our data.
This requires writing a bit of C code to create an OpenCL program from a string literal source, build it, and extract the kernel.
The kernel fills the data with the sequence `0, -1, 2, -3, ...`, that is, `data[i] = (-1)^i * i`.

~~~cpp
const char *ocl_code
        = "__kernel void init(__global float *data) {"
          "    int id = get_global_id(0);"
          "    data[id] = (id % 2) ? -id : id;"
          "}";
const char *kernel_name = "init";
cl_kernel ocl_init_kernel = create_init_opencl_kernel(
        eng.get_ocl_context(), kernel_name, ocl_code);
~~~
Refer to the full code example below for the implementation of the `create_init_opencl_kernel()` function.
The next step is to execute the OpenCL kernel: set its arguments and enqueue it to an OpenCL queue.
The underlying OpenCL buffer can be extracted from the memory object using
the interoperability interface: `memory::get_ocl_mem_object()`.
For simplicity, we can just construct a stream, extract the underlying OpenCL queue, and enqueue the kernel to this queue:

~~~cpp
cl_mem ocl_buf = mem.get_ocl_mem_object();
clSetKernelArg(ocl_init_kernel, 0, sizeof(ocl_buf), &ocl_buf);

mkldnn::stream strm(eng);
cl_command_queue ocl_queue = strm.get_ocl_command_queue();
clEnqueueNDRangeKernel(ocl_queue, ocl_init_kernel, 1, nullptr, &N, nullptr, 0,
        nullptr, nullptr);
~~~
## Create and execute a primitive

There are three steps to create an operation primitive in Intel MKL-DNN:

- Create an operation descriptor
- Create a primitive descriptor
- Create a primitive

Let's create a primitive to perform the ReLU (rectified linear unit) operation: `x = max(0, x)`.

~~~cpp
auto relu_d = eltwise_forward::desc(prop_kind::forward, algorithm::eltwise_relu,
        mem_d, 0.0f);
auto relu_pd = eltwise_forward::primitive_desc(relu_d, eng);
auto relu = eltwise_forward(relu_pd);
~~~

From the code above we see that an operation descriptor has no dependency on a specific engine: it just describes the operation.
On the contrary, primitive descriptors are attached to a specific engine and represent an implementation for this engine.
A primitive object is a realization of a primitive descriptor, and its construction is usually much "heavier".

Note that for our primitive, `mem` serves as both the input and the output parameter.

Next, execute the primitive:

~~~cpp
relu.execute(strm, { { MKLDNN_ARG_SRC, mem }, { MKLDNN_ARG_DST, mem } });
~~~

Note that primitive submission on GPU is asynchronous; however, the user can call `stream::wait()` to synchronize the stream and ensure that all previously submitted primitives are completed.
## Validating the results

The simplest way to access the OpenCL memory is to map it to the host using the `memory::map_data()` and `memory::unmap_data()` APIs.
After mapping, the data is directly accessible for reading and writing on the host. While the data is mapped, any GPU-side operations on this data are not allowed.
The data should be unmapped to release all resources associated with the mapping.

~~~cpp
float *mapped_data = mem.map_data<float>();
for (size_t i = 0; i < N; i++) {
    float expected = (i % 2) ? 0.0f : (float)i;
    assert(mapped_data[i] == expected);
}
mem.unmap_data(mapped_data);
~~~
---

The full code example is listed below:
~~~cpp
#include <CL/cl.h>
#include <mkldnn.hpp>

#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <numeric>

using namespace mkldnn;

#define OCL_CHECK(x) \
    do { \
        cl_int s = (x); \
        if (s != CL_SUCCESS) { \
            printf("OpenCL error: %d at %s:%d\n", s, __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)

cl_kernel create_init_opencl_kernel(
        cl_context ocl_ctx, const char *kernel_name, const char *ocl_code) {
    cl_int err;
    const char *sources[] = { ocl_code };
    cl_program ocl_program
            = clCreateProgramWithSource(ocl_ctx, 1, sources, nullptr, &err);
    OCL_CHECK(err);

    OCL_CHECK(
            clBuildProgram(ocl_program, 0, nullptr, nullptr, nullptr, nullptr));

    cl_kernel ocl_kernel = clCreateKernel(ocl_program, kernel_name, &err);
    OCL_CHECK(err);

    OCL_CHECK(clReleaseProgram(ocl_program));
    return ocl_kernel;
}

int main() {
    memory::dims tz_dims = { 2, 3, 4, 5 };
    const size_t N = std::accumulate(tz_dims.begin(), tz_dims.end(), (size_t)1,
            std::multiplies<size_t>());

    memory::desc mem_d(tz_dims, memory::data_type::f32,
            memory::format_tag::nchw);

    engine eng(engine::kind::gpu, 0);
    memory mem(mem_d, eng);

    // Extract OpenCL buffer from memory object
    cl_mem ocl_buf = mem.get_ocl_mem_object();

    // Create stream
    mkldnn::stream strm(eng);

    // Create custom OpenCL kernel to initialize the data
    const char *ocl_code
            = "__kernel void init(__global float *data) {"
              "    int id = get_global_id(0);"
              "    data[id] = (id % 2) ? -id : id;"
              "}";
    const char *kernel_name = "init";
    cl_kernel ocl_init_kernel = create_init_opencl_kernel(
            eng.get_ocl_context(), kernel_name, ocl_code);

    // Execute the custom OpenCL kernel
    OCL_CHECK(clSetKernelArg(ocl_init_kernel, 0, sizeof(ocl_buf), &ocl_buf));

    cl_command_queue ocl_queue = strm.get_ocl_command_queue();
    OCL_CHECK(clEnqueueNDRangeKernel(ocl_queue, ocl_init_kernel, 1, nullptr, &N,
            nullptr, 0, nullptr, nullptr));

    // Perform ReLU operation by executing the primitive
    auto relu_d = eltwise_forward::desc(prop_kind::forward,
            algorithm::eltwise_relu, mem_d, 0.0f);
    auto relu_pd = eltwise_forward::primitive_desc(relu_d, eng);
    auto relu = eltwise_forward(relu_pd);
    relu.execute(strm, { { MKLDNN_ARG_SRC, mem }, { MKLDNN_ARG_DST, mem } });
    strm.wait();

    // Map the data to the host to validate the results
    float *mapped_data = mem.map_data<float>();
    for (size_t i = 0; i < N; i++) {
        float expected = (i % 2) ? 0.0f : (float)i;
        assert(mapped_data[i] == expected);
    }
    mem.unmap_data(mapped_data);

    OCL_CHECK(clReleaseKernel(ocl_init_kernel));

    std::cout << "PASSED" << std::endl;
    return 0;
}
~~~
---

[Legal information](@ref legal_information)

doc/mainpage.md

Lines changed: 46 additions & 1 deletion
~~~diff
@@ -14,7 +14,9 @@ computational functions (also called primitives) used in deep neural
 networks covering a wide range of applications, including image recognition,
 object detection, semantic segmentation, neural machine translation,
 and speech recognition.
-The table below summarizes the list of supported functions and their variants.
+The tables below summarize the list of supported functions and their variants for CPU and GPU.
+
+## CPU support
 
 | Primitive class   | Primitive                | fp32 training | fp32 inference | int8 inference |
 | :---------------- | :----------------------- | :-----------: | :------------: | :------------: |
~~~
~~~diff
@@ -53,6 +55,45 @@ The table below summarizes the list of supported functions and their variants.
 |                   | Concat                   | x             | x              | x              |
 |                   | Shuffle                  | x             | x              | x              |
 
+## GPU support
+
+| Primitive class   | Primitive                | fp32 training | fp32 inference | fp16 inference | int8 inference |
+| :---------------- | :----------------------- | :-----------: | :------------: | :------------: | :------------: |
+| Convolution       | 1D direct convolution    |               |                |                |                |
+|                   | 2D direct convolution    | x             | x              | x              |                |
+|                   | 2D direct deconvolution  |               |                |                |                |
+|                   | 2D winograd convolution  |               |                |                |                |
+|                   | 3D direct convolution    | x             | x              | x              |                |
+|                   | 3D direct deconvolution  |               |                |                |                |
+| Inner Product     | 2D inner product         | x             | x              | x              |                |
+|                   | 3D inner product         | x             | x              | x              |                |
+| RNN               | Vanilla RNN              |               | x              | x              |                |
+|                   | LSTM                     |               | x              | x              |                |
+|                   | GRU                      |               |                |                |                |
+| Pooling           | 2D maximum pooling       | x             | x              | x              |                |
+|                   | 2D average pooling       | x             | x              | x              |                |
+|                   | 3D maximum pooling       | x             | x              | x              |                |
+|                   | 3D average pooling       | x             | x              | x              |                |
+| Normalization     | 2D LRN (within channel)  |               | x              |                |                |
+|                   | 2D LRN (across channels) |               | x              |                |                |
+|                   | 2D batch normalization   | x             | x              |                |                |
+|                   | 3D batch normalization   | x             | x              |                |                |
+| Activation and    | ReLU                     | x             | x              | x              |                |
+| elementwise       | Tanh                     |               |                |                |                |
+| functions         | ELU                      |               |                |                |                |
+|                   | Square                   |               |                |                |                |
+|                   | Sqrt                     |               |                |                |                |
+|                   | Abs                      |               |                |                |                |
+|                   | Linear                   | x             | x              | x              |                |
+|                   | Bounded ReLU             | x             | x              | x              |                |
+|                   | Soft ReLU                | x             | x              | x              |                |
+|                   | Logistic                 | x             | x              | x              |                |
+|                   | Softmax                  |               | x              |                |                |
+| Data manipulation | Reorder                  | x             | x              | x              | x              |
+|                   | Sum                      | x             | x              |                |                |
+|                   | Concat                   | x             | x              | x              | x              |
+|                   | Shuffle                  |               |                |                |                |
+
 ## Programming Model
 
 Intel MKL-DNN models memory as a primitive similar to an operation
~~~
~~~diff
@@ -134,6 +175,10 @@ An introductory example to low-precision 8-bit computations:
 
 * [Int8 SimpleNet Example](@ref ex_int8_simplenet)
 
+Getting started with GPU support:
+
+* [Getting Started with GPU Support](@ref getting_started_gpu)
+
 The following examples are available in the /examples directory and provide more details about the API.
 * Creation of forward primitives
   - C: simple_net.c
~~~
