## Course Concept
This course is designed to help someone think holistically about parallelism. Starting with "scale up" or "scale out" overlooks the fundamentals: first, we should "scale down." It is quite shocking how much parallelism is already available in a single core.
Too often, we jump to multiple processors (cores or nodes) without learning how to optimise the first one. Then, _to our horror_, parallel computing makes code slower! 1 + 1 = 0.5; it defies reason! This can't be an encouraging initiation for the budding parallel programmer.
This course trains a parallel programmer before they ever touch a multicore (never mind distributed or co-processing!) device. If one can _expose parallelism_ within a single core, adding more cores is (relatively) easy, regardless of whether they are connected via a network, the PCIe bus, or L3 cache.
This patient approach spends considerable time tackling the memory wall, data access patterns, and cache optimisation. We learn to enable super-scalar execution by exposing instruction-level parallelism, and we activate SIMD. We try to get so much out of an individual core that it is worth asking, "what's the need for another one?"
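
To make the single-core claim concrete, here is a minimal illustrative sketch (not course code; the function name and the four-way split are our own choices) of exposing instruction-level parallelism: breaking one long dependency chain into four independent accumulators gives a super-scalar core several additions to keep in flight at once.

```cpp
#include <cstddef>

// Illustrative sketch: a single accumulator forms one loop-carried dependency
// chain, so each add must wait for the previous one. Four independent
// accumulators give the super-scalar core four chains to execute in parallel.
// (Note: this reassociates the floating-point sum, so the result may differ
// in the last bits from the naive loop.)
double sum_ilp(const double* x, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i) s0 += x[i];  // handle the remainder
    return (s0 + s1) + (s2 + s3);
}
```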
The second half of the course focuses on "algorithmic primitives" for breaking dependencies, thereby exposing data-level parallelism. These primitives are then applied to multicore and GPU architectures to get the most out of each.
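
As one assumed example of such a primitive (the course's own catalogue of primitives may differ), a reduction is the canonical dependency breaker: the loop-carried dependency on `sum` is removed by giving each thread a private partial sum that is combined at the end. A sketch with OpenMP:

```cpp
#include <cstddef>

// Sketch of the reduction primitive with OpenMP. Serially, every iteration
// depends on the previous value of `sum`; reduction(+ : sum) gives each
// thread a private copy and combines the copies afterwards, exposing the
// data-level parallelism that the dependency was hiding.
double dot(const double* a, const double* b, std::ptrdiff_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (std::ptrdiff_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Compiled with OpenMP enabled (e.g., `g++ -fopenmp`), the same loop body runs serially or in parallel depending only on the pragma.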
At the end of the course, one should have a substantial grasp of how to optimise the performance of a program, and of how to solve problems from a data-level-parallel perspective.
## Course Structure

The course is structured in four modules:
* Single-core parallelism (breaking the memory wall, latency hiding, cache-friendly data structures, super-scalar OOE, ILP, and SIMD; see the sketch after this list)
* Advanced topics (e.g., processor-in-memory), if there is enough time
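
As a hedged illustration of the cache-friendliness item above (a minimal sketch, not course material; the names are our own): the same sum over a row-major n-by-n matrix can be fast or slow depending only on traversal order.

```cpp
#include <cstddef>
#include <vector>

// Sketch: both functions compute the same sum over a row-major n x n matrix
// stored in m (of size n * n). The row-major walk touches memory
// contiguously, so each cache line is used fully; the column-major walk
// strides n doubles per step and, for large n, can miss on nearly every access.
double sum_rows(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += m[i * n + j];  // contiguous: stride of one double
    return s;
}

double sum_cols(const std::vector<double>& m, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += m[i * n + j];  // strided: jumps n doubles each step
    return s;
}
```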
## Lecture Structure
The lectures mostly follow two formats: either a class discussion of a paper (which the students should have read in advance) or a live-coding session to illustrate a particular effect. For the paper discussions, I have included "minutes" of our class discussion, which usually match the notes I prepared in advance to direct the discussion. The live-coding sessions are more involved, with a sub-directory structure that includes source code (and header) files as well as a README describing the overall lecture plan.
## Assessment
Two exams separate out individual performance, but the class is primarily based on one semester-long group project. Each group develops an efficient GPU algorithm to process a complex computing task of their choice. The project is set up in stages to facilitate this: groups first optimise a single-threaded implementation, then port it to multi-core, and finally implement it for a GPU. Some pre-screening is necessary to ensure that each group chooses a problem early on that has the potential to fit the GPU architecture well.
This reflects the course concept that achieving a high degree of GPU parallelism requires first understanding how to optimise a single thread.
## Prerequisites
This course is designed for upper-level or graduate Computer Science students. Familiarity with asymptotic analysis, algorithm design, and data structures is assumed, preferably at the level of a systems course (e.g., Intro to Operating Systems). One should also be capable of reading generic C++11 code quickly enough to follow a live-coding session.