Introduction
- "In a few years, only a fraction of the computing power will be used" (see streamcomputing)
- Trend until 2003: f_clk increases => faster program, same code.
- Since 2003: f_clk stagnates; N_cores increases => the code must be adapted.
- No more "just wait 18 months and my program will be twice as fast" (see "The Free Lunch Is Over")
GPU architecture
Basic concepts
---------------
Context, Queue, instructions. (a)synchronous.
This encapsulation is OOP-friendly
Host, device, kernel. On-the-fly compilation.
Software: Grid, blocks, threads
Hardware: Streaming Multiprocessors, Warps, cores
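The grid/block/thread hierarchy can be illustrated without a device. The sketch below emulates a 1D grid in plain Python: each thread derives a global index from its block index, the block size and its index within the block (the usual CUDA convention `blockIdx.x * blockDim.x + threadIdx.x`). The function names and sizes are illustrative assumptions, not part of any API.

```python
def global_indices(n_blocks, block_size):
    """Enumerate the global index computed by each (block, thread) pair in a 1D grid."""
    indices = []
    for block_idx in range(n_blocks):          # blocks of the grid
        for thread_idx in range(block_size):   # threads within one block
            indices.append(block_idx * block_size + thread_idx)
    return indices


def blocks_needed(n_items, block_size):
    """Number of blocks required to cover n_items (rounded up)."""
    return (n_items + block_size - 1) // block_size


n_items, block_size = 10, 4
n_blocks = blocks_needed(n_items, block_size)
# Threads whose global index falls beyond the data must do nothing (bounds check).
active = [i for i in global_indices(n_blocks, block_size) if i < n_items]
```

Here 3 blocks of 4 threads are launched for 10 items, so the last 2 threads are idle. This is why kernels usually start with a bounds check.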
Differences from other programming paradigms
w.r.t. OpenMP: fine thread control, explicit communications, host/device distinction, ...
Memory
--------
GPU: no virtual memory => no "electric fence" => no segfault. Memory can be silently corrupted!
Most of the time, performance is memory-bound. Memory access patterns are therefore crucial!
Warm-up: which is faster, y*Nc + x or x*Nr + y? (on CPU and GPU)
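A small sketch for the warm-up question, assuming x is the column index, y the row index, Nr the number of rows and Nc the number of columns (the source does not spell these out). It computes how far apart in memory two elements with consecutive x land for each formula. A stride of 1 is what lets adjacent GPU threads read adjacent addresses, and what lets a CPU inner loop over x stay within a cache line.

```python
def stride_in_x(index, n_rows, n_cols):
    """Distance in memory between elements (x, y) and (x + 1, y) for an index formula."""
    return index(1, 0, n_rows, n_cols) - index(0, 0, n_rows, n_cols)


def row_major(x, y, n_rows, n_cols):
    return y * n_cols + x   # y*Nc + x


def col_major(x, y, n_rows, n_cols):
    return x * n_rows + y   # x*Nr + y


n_rows, n_cols = 1024, 2048
print(stride_in_x(row_major, n_rows, n_cols))  # 1
print(stride_in_x(col_major, n_rows, n_cols))  # 1024
```

Which formula is actually faster depends on which index varies fastest across threads (GPU) or in the inner loop (CPU), so it is worth timing both.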
Latency and bandwidth
Give bandwidth figures for a recent GPU (device-to-host, host-to-device, device-to-device).
time = latency + nbits / bandwidth
Kernels should make the memory transfers worthwhile: no speed-up if most of the time is spent transferring the data!
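A rough worked example of why transfers matter, modelling a copy as a fixed latency plus the data size divided by the bandwidth. All figures below (latency, bandwidth, CPU and kernel times) are hypothetical placeholders chosen for illustration, not measurements.

```python
def transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Time for one copy: fixed latency plus size divided by bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def effective_speedup(t_cpu, t_kernel, t_h2d, t_d2h):
    """Speed-up once host-to-device and device-to-host copies are counted."""
    return t_cpu / (t_h2d + t_kernel + t_d2h)


# Hypothetical figures, for illustration only.
n_bytes = 64 * 1024 * 1024   # 64 MiB of data
latency = 10e-6              # 10 microseconds per transfer
bandwidth = 8e9              # 8 GB/s between host and device
t_copy = transfer_time(n_bytes, latency, bandwidth)

t_cpu = 0.100                # CPU implementation: 100 ms
t_kernel = 0.002             # GPU kernel alone: 2 ms
kernel_only = t_cpu / t_kernel
with_transfers = effective_speedup(t_cpu, t_kernel, t_copy, t_copy)
```

With these numbers the kernel alone is about 50 times faster than the CPU, but the end-to-end speed-up drops to roughly 5 once both copies are included. Keeping data on the device across several kernels amortizes that cost.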
NVIDIA Visual Profiler, OpenCL events and profiling mode
Element-wise
-------------
add
flat-field
# Block size is free to tune
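A host-side sketch of the element-wise pattern, using flat-field correction as the example. The formula `(raw - dark) / (flat - dark)` is one common convention and is an assumption here; the function names are illustrative. The point is that each output element depends only on the inputs at the same index, so threads never need to communicate.

```python
def flat_field_kernel(i, raw, dark, flat, out):
    """Work done by one thread: element i only, no neighbours involved."""
    denom = flat[i] - dark[i]
    # Guard against division by zero on dead pixels (assumed policy: output 0).
    out[i] = (raw[i] - dark[i]) / denom if denom != 0 else 0.0


def flat_field(raw, dark, flat):
    """Emulate the kernel launch: one 'thread' per element."""
    out = [0.0] * len(raw)
    for i in range(len(raw)):   # on a GPU, each iteration would be one thread
        flat_field_kernel(i, raw, dark, flat, out)
    return out


corrected = flat_field(raw=[110.0, 60.0, 10.0],
                       dark=[10.0, 10.0, 10.0],
                       flat=[210.0, 110.0, 10.0])
```

Because no thread depends on another, the block size only affects occupancy and can be tuned freely.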
Gather
-------
2x2 binning
convolution/gradient/Haar transform
# block size is often imposed
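A host-side sketch of the gather pattern with 2x2 binning: each output pixel reads a fixed neighbourhood of input pixels. The image is stored as a flat row-major list; summing the four pixels (rather than averaging) and the function name are assumptions made for this example.

```python
def bin2x2(image, n_rows, n_cols):
    """Sum each 2x2 block of a flat row-major image; n_rows and n_cols must be even."""
    out_rows, out_cols = n_rows // 2, n_cols // 2
    out = [0] * (out_rows * out_cols)
    for oy in range(out_rows):          # on a GPU: one thread per output pixel
        for ox in range(out_cols):
            top = (2 * oy) * n_cols + 2 * ox   # top-left pixel of the 2x2 block
            bottom = top + n_cols              # pixel directly below it
            out[oy * out_cols + ox] = (image[top] + image[top + 1]
                                       + image[bottom] + image[bottom + 1])
    return out


image = list(range(16))  # 4x4 image holding 0..15
binned = bin2x2(image, 4, 4)
print(binned)  # [10, 18, 42, 50]
```

Neighbouring output pixels read from nearby input rows, which is why the block shape tends to follow the footprint of the gather.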
Reduction
----------
min/max/sum
histogram
# block size is application-dependent
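A host-side sketch of the reduction pattern for a sum, emulating the usual two-stage scheme: a tree reduction inside each block, then a final pass over the per-block partial sums. The function names are illustrative, and the in-block step assumes a power-of-two block size.

```python
def block_reduce_sum(values):
    """Emulate a tree reduction inside one block; len(values) must be a power of two."""
    shared = list(values)          # stands in for the block's shared/local memory
    stride = len(shared) // 2
    while stride > 0:
        # On a GPU, the threads with index < stride run this step in parallel,
        # followed by a barrier before the next step.
        for tid in range(stride):
            shared[tid] += shared[tid + stride]
        stride //= 2
    return shared[0]


def reduce_sum(data, block_size):
    """Two-stage reduction: one partial sum per block, then a final pass."""
    partials = []
    for start in range(0, len(data), block_size):
        chunk = data[start:start + block_size]
        chunk += [0] * (block_size - len(chunk))   # pad the last block with the neutral element
        partials.append(block_reduce_sum(chunk))
    return sum(partials)


total = reduce_sum(list(range(100)), block_size=32)
print(total)  # 4950
```

The same structure works for min and max by swapping the operator and the neutral element. A histogram needs a different approach, since many threads update the same bins.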
Linear Algebra and Fourier Transforms
--------------------------------------
cuBLAS, cuFFT => easy to call from CUDA C/C++; Python wrappers (e.g. pyfft)
clBLAS, clFFT => ?
CUDA vs OpenCL
---------------
Performance portability
Open issues regarding OpenCL
- Will OpenCL remain a serious competitor to CUDA? (AMD is going CUDA, Nvidia does not maintain OpenCL anymore, bugs in Apple's implementation, ...)
- OpenCL 2.x on Linux (e.g. Intel's Beignet driver)