doc: update for gpu support

echeresh · echeresh · commit 31df9d38288a · 2019-04-19T19:01:11.000-07:00
diff --git a/README.md b/README.md
@@ -13,7 +13,8 @@
 
 Intel(R) Math Kernel Library for Deep Neural Networks (Intel(R) MKL-DNN) is
 an open-source performance library for deep-learning applications. The library
-accelerates deep-learning applications and frameworks on Intel architecture.
+accelerates deep-learning applications and frameworks on Intel(R) architecture and
+Intel(R) Processor Graphics Architecture.
 Intel MKL-DNN contains vectorized and threaded building blocks that you can
 use to implement deep neural networks (DNN) with C and C++ interfaces.
 
@@ -80,6 +81,7 @@ Please submit your questions, feature requests, and bug reports on the
 **WARNING** The following functionality has preview status and might change
 without prior notification in future releases:
 * Threading Building Blocks (TBB) support
+* Intel(R) Processor Graphics support
 
 ## How to Contribute
 We welcome community contributions to Intel MKL-DNN. If you have an idea on how to improve the library:
@@ -106,6 +108,10 @@ Ivy Bridge, Haswell, and Broadwell)
 
 and compatible processors.
 
+Intel MKL-DNN supports Intel(R) Processor Graphics.
+The library is optimized for systems based on
+* Intel(R) Iris(R) Pro Graphics
+
 The software dependencies are:
 * [Cmake](https://cmake.org/download/) 2.8.0 or later
 * [Doxygen](http://www.stack.nl/~dimitri/doxygen/download.html#srcbin) 1.8.5 or later
@@ -115,6 +121,9 @@ The software dependencies are:
   * Threading Building Blocks (TBB) 2017 or later
   * Intel MKL 2017 Update 1 or Intel MKL small libraries
 
+The additional software dependencies for Intel(R) Processor Graphics support are:
+* [Intel(R) SDK for OpenCL\* applications](https://software.intel.com/en-us/intel-opencl)
+
 > **Note**
 > Building Intel MKL-DNN with optional dependencies may introduce additional
 > runtime dependencies for the library. For details, refer to the corresponding
@@ -136,6 +145,13 @@ on macOS\* 10.13 (High Sierra) with
 * [Intel C/C++ Compiler](https://software.intel.com/en-us/intel-parallel-studio-xe)
   18.0 and 19.0
 
+Intel(R) Processor Graphics support was validated on Ubuntu\* 18.04 with
+* GNU Compiler Collection 5.4 and 8.1
+* Clang\* 3.8.0
+* [Intel C/C++ Compiler](https://software.intel.com/en-us/intel-parallel-studio-xe)
+  19.0
+* Intel(R) SDK for OpenCL\* applications version 18.1
+
 The implementation uses OpenMP 4.0 SIMD extensions. We recommend using the
 Intel C++ Compiler for the best performance results.
 
@@ -265,10 +281,22 @@ supported version.
 
 Configure CMake and create a makefile:
 
+The library can be built in the following configurations:
+
+Configuration with **CPU support (Native backend)**:
+
 ```
 mkdir -p build && cd build && cmake $CMAKE_OPTIONS ..
 ```
 
+Configuration with **CPU support (Native backend) and GPU support (OpenCL backend)**:
+
+```
+mkdir -p build && cd build && cmake -DMKLDNN_GPU_BACKEND=OPENCL $CMAKE_OPTIONS ..
+```
+
+You can use the option `-DOPENCLROOT=...` to specify the path to the OpenCL installation manually.
+
 Build the application:
 
 ```
@@ -353,6 +381,7 @@ Intel MKL-DNN was built.
 |:---                   |:---
 |include/mkldnn.h       | C header
 |include/mkldnn.hpp     | C++ header
+|include/mkldnn_config.h| Library configuration file
 |include/mkldnn_types.h | Auxiliary C header
 |lib/libmkldnn.so       | Intel MKL-DNN dynamic library
 |lib/libmkldnn.a        | Intel MKL-DNN static library (if built with `MKLDNN_LIBRARY_TYPE=STATIC`)
@@ -366,6 +395,7 @@ Intel MKL-DNN was built.
 |:---                     |:---
 |include/mkldnn.h         | C header
 |include/mkldnn.hpp       | C++ header
+|include/mkldnn_config.h  | Library configuration file
 |include/mkldnn_types.h   | Auxiliary C header
 |lib/libmkldnn.dylib      | Intel MKL-DNN dynamic library
 |lib/libmkldnn.a          | Intel MKL-DNN static library (if built with `MKLDNN_LIBRARY_TYPE=STATIC`)
@@ -407,6 +437,7 @@ Intel MKL-DNN was built.
 |bin\libmklml.dll       | Intel MKL small library (if built with `MKLDNN_USE_MKL=ML`)
 |include\mkldnn.h       | C header
 |include\mkldnn.hpp     | C++ header
+|include\mkldnn_config.h| Library configuration file
 |include\mkldnn_types.h | Auxiliary C header
 |lib\libmkldnn.lib      | Intel MKL-DNN import library
 |lib\libiomp5.lib       | Intel OpenMP\* runtime import library (if built with `MKLDNN_USE_MKL=ML`)
diff --git a/doc/getting_started_gpu.md b/doc/getting_started_gpu.md
@@ -0,0 +1,220 @@
+Getting Started with Intel(R) MKL-DNN with GPU support {#getting_started_gpu}
+=============================================================================
+
+This is an introduction to Intel MKL-DNN with GPU support.
+We are going to walk through a simple example that demonstrates the OpenCL\* extensions API in Intel MKL-DNN.
+
+## Intel MKL-DNN basic workflow
+
+A very simple workflow in Intel MKL-DNN includes the following steps:
+
+- Engine creation
+- Input/output memory objects creation
+    - Memory descriptors creation
+    - Memory objects creation
+- Operation primitive creation
+    - Operation descriptor creation
+    - Operation primitive descriptor creation
+    - Primitive creation
+- Stream object creation
+- Primitive submission for execution to a stream
+
+## Create engine and memory object
+
+Let's create a GPU engine object. The second parameter specifies the index of the requested engine.
+
+~~~cpp
+auto eng = engine(engine::kind::gpu, 0);
+~~~
+
+Then, we create a memory object. We specify the dimensions of our memory by passing a `memory::dims` object.
+Next, we create a memory descriptor with these dimensions, the `f32` data type, and the `nchw` memory format.
+Finally, we construct a memory object from the memory descriptor and the engine. The library allocates memory internally.
+
+~~~cpp
+auto tz_dims = memory::dims{2, 3, 4, 5};
+memory::desc mem_d(tz_dims, memory::data_type::f32, memory::format_tag::nchw);
+memory mem(mem_d, eng);
+~~~
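+
+As a host-only illustration of what these dimensions imply, the sketch below computes the total element count and the linear offset of a logical `(n, c, h, w)` index in a dense `nchw` layout. It does not use Intel MKL-DNN: `std::vector<std::int64_t>` stands in for `memory::dims` (an assumption), and the helper names `num_elements()` and `nchw_offset()` are hypothetical.
+
+~~~cpp
+#include <cstddef>
+#include <cstdint>
+#include <functional>
+#include <numeric>
+#include <vector>
+
+// Stand-in for memory::dims (assumption: a vector of 64-bit integers).
+using dims_t = std::vector<std::int64_t>;
+
+// Total number of elements: the product of all dimensions.
+inline std::size_t num_elements(const dims_t &dims) {
+    return std::accumulate(dims.begin(), dims.end(), (std::size_t)1,
+            std::multiplies<std::size_t>());
+}
+
+// Linear offset of (n, c, h, w) in a dense nchw layout, where w is the
+// fastest-varying dimension. dims is expected to be {N, C, H, W}.
+inline std::size_t nchw_offset(const dims_t &dims, std::int64_t n,
+        std::int64_t c, std::int64_t h, std::int64_t w) {
+    return (std::size_t)(((n * dims[1] + c) * dims[2] + h) * dims[3] + w);
+}
+~~~
+
+For `{2, 3, 4, 5}` the element count is 120, and the last element `(1, 2, 3, 4)` sits at offset 119.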
+
+## Initialize the data by executing a custom OpenCL kernel
+
+We are going to create an OpenCL kernel that will initialize our data.
+It requires writing a bit of C code to create an OpenCL program from a string literal source, build it, and extract the kernel.
+The kernel initializes the data with the `0, -1, 2, -3, ...` sequence: `data[i] = (-1)^i * i`.
+
+~~~cpp
+const char *ocl_code
+        = "__kernel void init(__global float *data) {"
+          "    int id = get_global_id(0);"
+          "    data[id] = (id % 2) ? -id : id;"
+          "}";
+const char *kernel_name = "init";
+cl_kernel ocl_init_kernel = create_init_opencl_kernel(
+        eng.get_ocl_context(), kernel_name, ocl_code);
+~~~
+
+Refer to the full code example for the code of the `create_init_opencl_kernel()` function.
+The next step is to execute our OpenCL kernel: set its arguments and enqueue it to an OpenCL queue.
+The underlying OpenCL buffer can be extracted from the memory object using
+the interoperability interface: `memory::get_ocl_mem_object()`.
+For simplicity, we can just construct a stream, extract the underlying OpenCL queue, and enqueue the kernel to this queue.
+Here `N` is the total number of elements in the memory object, that is, the product of its dimensions (`2 * 3 * 4 * 5 = 120`):
+
+~~~cpp
+cl_mem ocl_buf = mem.get_ocl_mem_object();
+clSetKernelArg(ocl_init_kernel, 0, sizeof(ocl_buf), &ocl_buf);
+
+mkldnn::stream strm(eng);
+cl_command_queue ocl_queue = strm.get_ocl_command_queue();
+clEnqueueNDRangeKernel(ocl_queue, ocl_init_kernel, 1, nullptr, &N, nullptr, 0,
+                       nullptr, nullptr);
+~~~
+
+## Create and execute a primitive
+
+There are 3 steps to create an operation primitive in Intel MKL-DNN:
+
+- Create an operation descriptor
+- Create a primitive descriptor
+- Create a primitive
+
+Let's create a primitive to perform the ReLU (rectified linear unit) operation: `x = max(0, x)`.
+
+~~~cpp
+auto relu_d = eltwise_forward::desc(prop_kind::forward, algorithm::eltwise_relu,
+                                    mem_d, 0.0f);
+auto relu_pd = eltwise_forward::primitive_desc(relu_d, eng);
+auto relu = eltwise_forward(relu_pd);
+~~~
+
+From the code above we see that an operation descriptor has no dependency on a specific engine; it just describes an operation.
+In contrast, a primitive descriptor is attached to a specific engine and represents an implementation for this engine.
+A primitive object is a realization of a primitive descriptor, and its construction is usually much "heavier".
+
+Note that for our primitive, `mem` serves as both the input and the output parameter.
+
+Next, execute the primitive:
+
+~~~cpp
+relu.execute(strm, { { MKLDNN_ARG_SRC, mem }, { MKLDNN_ARG_DST, mem } });
+~~~
+
+Note that primitive submission on GPU is asynchronous.
+However, you can call `stream::wait()` to synchronize the stream and ensure that all previously submitted primitives have completed.
+
+## Validating the results
+
+The simplest way to access the OpenCL memory is to map it to the host using the `memory::map_data()` and `memory::unmap_data()` APIs.
+After mapping, the data is directly accessible for reading and writing on the host. While the data is mapped, GPU-side operations on it are not allowed.
+The data should be unmapped to release all resources associated with the mapping.
+
+~~~cpp
+float *mapped_data = mem.map_data<float>();
+for (size_t i = 0; i < N; i++) {
+    float expected = (i % 2) ? 0.0f : (float)i;
+    assert(mapped_data[i] == expected);
+}
+mem.unmap_data(mapped_data);
+~~~
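+
+The expected values in the loop above follow from applying ReLU to the initialized sequence. The host-only sketch below models both steps without OpenCL or Intel MKL-DNN; `init_value()`, `relu_ref()`, and `expected_after_relu()` are hypothetical helper names that mirror the `init` kernel and `x = max(0, x)`.
+
+~~~cpp
+#include <algorithm>
+#include <cstddef>
+#include <vector>
+
+// Mirrors the init kernel: odd indices are negated, even indices are kept.
+inline float init_value(std::size_t i) {
+    return (i % 2) ? -(float)i : (float)i;
+}
+
+// Reference ReLU with a zero negative slope: max(0, x).
+inline float relu_ref(float x) {
+    return std::max(0.0f, x);
+}
+
+// Builds the values that the mapped buffer is expected to hold.
+inline std::vector<float> expected_after_relu(std::size_t n) {
+    std::vector<float> out(n);
+    for (std::size_t i = 0; i < n; i++)
+        out[i] = relu_ref(init_value(i));
+    return out;
+}
+~~~
+
+Odd indices hold negative values after initialization, so ReLU maps them to zero, while even indices pass through unchanged.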
+
+---
+
+The full code example is listed below:
+
+~~~cpp
+#include <CL/cl.h>
+#include <mkldnn.hpp>
+
+#include <cassert>
+#include <cstdio>
+#include <cstdlib>
+#include <functional>
+#include <iostream>
+#include <numeric>
+
+using namespace mkldnn;
+
+#define OCL_CHECK(x)                                                      \
+    do {                                                                  \
+        cl_int s = (x);                                                   \
+        if (s != CL_SUCCESS) {                                            \
+            printf("OpenCL error: %d at %s:%d\n", s, __FILE__, __LINE__); \
+            exit(1);                                                      \
+        }                                                                 \
+    } while (0)
+
+cl_kernel create_init_opencl_kernel(
+        cl_context ocl_ctx, const char *kernel_name, const char *ocl_code) {
+    cl_int err;
+    const char *sources[] = { ocl_code };
+    cl_program ocl_program
+            = clCreateProgramWithSource(ocl_ctx, 1, sources, nullptr, &err);
+    OCL_CHECK(err);
+
+    OCL_CHECK(
+            clBuildProgram(ocl_program, 0, nullptr, nullptr, nullptr, nullptr));
+
+    cl_kernel ocl_kernel = clCreateKernel(ocl_program, kernel_name, &err);
+    OCL_CHECK(err);
+
+    OCL_CHECK(clReleaseProgram(ocl_program));
+    return ocl_kernel;
+}
+
+int main() {
+    memory::dims tz_dims = { 2, 3, 4, 5 };
+    const size_t N = std::accumulate(tz_dims.begin(), tz_dims.end(), (size_t)1,
+            std::multiplies<size_t>());
+
+    memory::desc mem_d(tz_dims, memory::data_type::f32,
+            memory::format_tag::nchw);
+
+    engine eng(engine::kind::gpu, 0);
+    memory mem(mem_d, eng);
+
+    // Extract OpenCL buffer from memory object
+    cl_mem ocl_buf = mem.get_ocl_mem_object();
+
+    // Create stream
+    mkldnn::stream strm(eng);
+
+    // Create custom OpenCL kernel to initialize the data
+    const char *ocl_code
+            = "__kernel void init(__global float *data) {"
+              "    int id = get_global_id(0);"
+              "    data[id] = (id % 2) ? -id : id;"
+              "}";
+    const char *kernel_name = "init";
+    cl_kernel ocl_init_kernel = create_init_opencl_kernel(
+            eng.get_ocl_context(), kernel_name, ocl_code);
+
+    // Execute the custom OpenCL kernel
+    OCL_CHECK(clSetKernelArg(ocl_init_kernel, 0, sizeof(ocl_buf), &ocl_buf));
+
+    cl_command_queue ocl_queue = strm.get_ocl_command_queue();
+    OCL_CHECK(clEnqueueNDRangeKernel(ocl_queue, ocl_init_kernel, 1, nullptr, &N,
+            nullptr, 0, nullptr, nullptr));
+
+    // Perform ReLU operation by executing the primitive
+    auto relu_d = eltwise_forward::desc(prop_kind::forward,
+            algorithm::eltwise_relu, mem_d, 0.0f);
+    auto relu_pd = eltwise_forward::primitive_desc(relu_d, eng);
+    auto relu = eltwise_forward(relu_pd);
+    relu.execute(strm, { { MKLDNN_ARG_SRC, mem }, { MKLDNN_ARG_DST, mem } });
+    strm.wait();
+
+    // Map the data to the host to validate the results
+    float *mapped_data = mem.map_data<float>();
+    for (size_t i = 0; i < N; i++) {
+        float expected = (i % 2) ? 0.0f : (float)i;
+        assert(mapped_data[i] == expected);
+    }
+    mem.unmap_data(mapped_data);
+
+    OCL_CHECK(clReleaseKernel(ocl_init_kernel));
+
+    std::cout << "PASSED" << std::endl;
+    return 0;
+}
+~~~
+
+---
+
+[Legal information](@ref legal_information)
diff --git a/doc/mainpage.md b/doc/mainpage.md
@@ -14,7 +14,9 @@ computational functions (also called primitives) used in deep neural
 networks covering a wide range of applications, including image recognition,
 object detection, semantic segmentation, neural machine translation,
 and speech recognition.
-The table below summarizes the list of supported functions and their variants.
+The tables below summarize the list of supported functions and their variants for CPU and GPU.
+
+## CPU support
 
 | Primitive class   | Primitive                | fp32 training | fp32 inference | int8 inference |
 | :---------------- | :----------------------- | :-----------: | :------------: | :------------: |
@@ -53,6 +55,45 @@ The table below summarizes the list of supported functions and their variants.
 |                   | Concat                   | x             | x              | x              |
 |                   | Shuffle                  | x             | x              | x              |
 
+## GPU support
+
+| Primitive class   | Primitive                | fp32 training | fp32 inference | fp16 inference | int8 inference |
+| :---------------- | :----------------------- | :-----------: | :------------: | :------------: | :------------: |
+| Convolution       | 1D direct convolution    |               |                |                |                |
+|                   | 2D direct convolution    | x             | x              | x              |                |
+|                   | 2D direct deconvolution  |               |                |                |                |
+|                   | 2D winograd convolution  |               |                |                |                |
+|                   | 3D direct convolution    | x             | x              | x              |                |
+|                   | 3D direct deconvolution  |               |                |                |                |
+| Inner Product     | 2D inner product         | x             | x              | x              |                |
+|                   | 3D inner product         | x             | x              | x              |                |
+| RNN               | Vanilla RNN              |               | x              | x              |                |
+|                   | LSTM                     |               | x              | x              |                |
+|                   | GRU                      |               |                |                |                |
+| Pooling           | 2D maximum pooling       | x             | x              | x              |                |
+|                   | 2D average pooling       | x             | x              | x              |                |
+|                   | 3D maximum pooling       | x             | x              | x              |                |
+|                   | 3D average pooling       | x             | x              | x              |                |
+| Normalization     | 2D LRN (within channel)  |               | x              |                |                |
+|                   | 2D LRN (across channels) |               | x              |                |                |
+|                   | 2D batch normalization   | x             | x              |                |                |
+|                   | 3D batch normalization   | x             | x              |                |                |
+| Activation and    | ReLU                     | x             | x              | x              |                |
+| elementwise       | Tanh                     |               |                |                |                |
+| functions         | ELU                      |               |                |                |                |
+|                   | Square                   |               |                |                |                |
+|                   | Sqrt                     |               |                |                |                |
+|                   | Abs                      |               |                |                |                |
+|                   | Linear                   | x             | x              | x              |                |
+|                   | Bounded ReLU             | x             | x              | x              |                |
+|                   | Soft ReLU                | x             | x              | x              |                |
+|                   | Logistic                 | x             | x              | x              |                |
+|                   | Softmax                  |               | x              |                |                |
+| Data manipulation | Reorder                  | x             | x              | x              | x              |
+|                   | Sum                      | x             | x              |                |                |
+|                   | Concat                   | x             | x              | x              | x              |
+|                   | Shuffle                  |               |                |                |                |
+
 ## Programming Model
 
 Intel MKL-DNN models memory as a primitive similar to an operation
@@ -134,6 +175,10 @@ An introductory example to low-precision 8-bit computations:
 
 * [Int8 SimpleNet Example](@ref ex_int8_simplenet)
 
+Getting started with GPU support:
+
+* [Getting Started with GPU Support](@ref getting_started_gpu)
+
 The following examples are available in the /examples directory and provide more details about the API.
 * Creation of forward primitives
     - C: simple_net.c