|
33 | 33 | }
|
34 | 34 |
|
35 | 35 | \bigskip
|
36 | | -%\printanswers
 | 36 | +\printanswers
37 | 37 | \begin{questions}
|
38 | 38 | \question Please explain the following concepts in 1-3 sentences and/or code snippets and/or a small illustration. An excellent answer does not have to be long, just precise.\\{\em Guide: 2 minutes each}
|
39 | 39 |
|
40 | 40 | \medskip
|
41 | 41 |
|
42 | 42 | \begin{parts}
|
43 | | - \part[1] Cache Coherency
 | 43 | + \part[1] Cache Coherence
44 | 44 | \begin{solution}[6em]
|
| 45 | +The maintenance of consistency between cache lines in separate caches that reference the same underlying memory address. |
45 | 46 | \end{solution}
|
46 | 47 |
|
47 | 48 | \part[1] Data-level Parallelism
|
48 | 49 | \begin{solution}[6em]
|
| 50 | +Parallelisation across the elements of a data set, i.e., applying the same operation to many elements at once (as opposed to, say, task-level parallelism).
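+For instance (a minimal sketch; the parallel algorithms library is one of many ways to express it):
+ \begin{verbatim}
+#include <algorithm>
+#include <execution>
+#include <vector>
+
+// The same operation is applied to every element independently,
+// so the elements can be processed in parallel (and vectorised).
+void scale( std::vector< double >& v )
+{
+    std::transform( std::execution::par_unseq,
+                    v.begin(), v.end(), v.begin(),
+                    []( double x ){ return 2.0 * x; } );
+}
+ \end{verbatim}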
49 | 51 | \end{solution}
|
50 | 52 |
|
51 | 53 | \part[1] False Sharing
|
52 | 54 | \begin{solution}[6em]
|
| 55 | +When distinct threads access distinct, logically independent data elements, but those elements reside on the same cache line. The coherence protocol then bounces the line between cores, creating a dependency/contention that logically should not exist.
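+For instance (a minimal sketch; the struct and thread functions are illustrative):
+ \begin{verbatim}
+struct Counters
+{
+    long a;  // written by thread 1
+    long b;  // written by thread 2, but on the same cache line as a;
+             // alignas(64) on each member would remove the false sharing
+};
+
+Counters c;
+
+// Each thread touches only its own counter, yet every write invalidates
+// the other core's copy of the shared cache line.
+void thread1() { for( int i = 0; i < 1000000; ++i ) ++c.a; }
+void thread2() { for( int i = 0; i < 1000000; ++i ) ++c.b; }
+ \end{verbatim}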
53 | 56 | \end{solution}
|
54 | 57 |
|
55 | 58 | \part[1] Synchronisation
|
56 | 59 | \begin{solution}[6em]
|
| 60 | +Causing threads to join up and wait for each other, e.g., in order to coordinate a computation or access to shared data structures.
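+For instance (a mutex is just one of many synchronisation primitives; the names here are illustrative):
+ \begin{verbatim}
+#include <mutex>
+
+std::mutex m;
+long total = 0;
+
+// Threads wait for each other here: only one may update the
+// shared total at a time, so concurrent calls are serialised.
+void add( long x )
+{
+    std::lock_guard< std::mutex > guard( m );
+    total += x;
+}
+ \end{verbatim}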
57 | 61 | \end{solution}
|
58 | 62 |
|
59 | 63 | \part[1] Vectorisation
|
60 | 64 | \begin{solution}[6em]
|
| 65 | +The application of vector (i.e., SIMD) registers to accelerate code or the process of exposing data-level parallelism that can be exploited with SIMD. |
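+For instance (a minimal sketch with AVX intrinsics; assumes 32-byte-aligned arrays and $n$ divisible by 4):
+ \begin{verbatim}
+#include <immintrin.h>
+#include <cstddef>
+
+// Scalar form: one addition per iteration.
+void add_scalar( double const* a, double const* b,
+                 double* c, std::size_t n )
+{
+    for( std::size_t i = 0; i < n; ++i ) c[i] = a[i] + b[i];
+}
+
+// Vectorised form: four doubles per iteration.
+void add_avx( double const* a, double const* b,
+              double* c, std::size_t n )
+{
+    for( std::size_t i = 0; i < n; i += 4 )
+        _mm256_store_pd( c + i,
+            _mm256_add_pd( _mm256_load_pd( a + i ),
+                           _mm256_load_pd( b + i ) ) );
+}
+ \end{verbatim}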
61 | 66 | \end{solution}
|
62 | 67 | \end{parts}
|
63 | 68 |
|
|
74 | 79 | \begin{parts}
|
75 | 80 | \part[2] Which parallel algorithm is more work-efficient? Explain how you arrived at this conclusion.
|
76 | 81 | \begin{solution}[8em]
|
| 82 | +{\em My Algorithm} is more work-efficient. We can compare the performance of the two algorithms on a single thread, relative to the sequential baseline, and observe that {\em State of the Art} incurs a lot more overhead/additional work. |
77 | 83 | \end{solution}
|
78 | 84 |
|
79 | 85 | \part[2] Which algorithm exhibits the best parallel scalability? Justify your response.
|
80 | 86 | \begin{solution}[8em]
|
| 87 | +This answer is nuanced. {\em State of the Art} gains the most speed-up from additional threads, up to $t=8$. Beyond this point, its performance degrades, whereas {\em My Algorithm} continues to see (diminishing) returns. This suggests either that we have 8 physical cores and the last 8 threads are logical cores/hyperthreads, or that the final 8 cores are on a separate socket and {\em My Algorithm} has better NUMA performance than {\em State of the Art}.
81 | 88 | \end{solution}
|
82 | 89 |
|
83 | 90 | \part[2] What is the overall message (a.k.a., purpose, or significance) of the plot?
|
84 | 91 | \begin{solution}[8em]
|
| 92 | +This plot shows performance relative to thread count. It emphasises raw execution time, but parallel scalability and work-efficiency can both be read from it. The primary message is that {\em My Algorithm} is the fastest or near-fastest algorithm on systems with at least 4 cores, but that it takes 4 cores to amortise the overhead of either parallel algorithm.
85 | 93 | \end{solution}
|
86 | 94 | \end{parts}
|
87 | 95 |
|
|
127 | 135 |
|
128 | 136 |
|
129 | 137 | \begin{solution}[25em]
|
| 138 | +To take full advantage of the SIMD-width, we need to rewrite the vector of points as three separate aligned vectors, one for each coordinate. Then we can rewrite the loops to compare multiple $x$'s to each other, multiple $y$'s to each other, and multiple $z$'s to each other for equality. Thereafter, we can take the bitwise conjunction of the three result mask vectors and then count the number of set bits. Bonus mark for handling alignment around the edge cases of $j$ correctly. E.g.: |
| 139 | + |
| 140 | + \begin{verbatim} |
+#include <immintrin.h>  // AVX intrinsics
+#include <cstddef>
+#include <vector>
+
+// Structure-of-arrays layout: one aligned vector per coordinate.
+// MyAlloc is assumed to be a 32-byte-aligned allocator.
+std::vector< double, MyAlloc > xvals;
+std::vector< double, MyAlloc > yvals;
+std::vector< double, MyAlloc > zvals;
+
+unsigned long long num_matches()
+{
+    auto const n = xvals.size();
+    auto matches = 0ull;
+
+    // A __m256d holds 4 doubles, so advance 4 points per iteration.
+    // (Remainder iterations when n % 4 != 0 are omitted for brevity.)
+    for( std::size_t i = 0; i + 4 <= n; i += 4 )
+    {
+        __m256d const xi = _mm256_load_pd( xvals.data() + i );
+        __m256d const yi = _mm256_load_pd( yvals.data() + i );
+        __m256d const zi = _mm256_load_pd( zvals.data() + i );
+
+        // j = i + 1 is not 32-byte aligned, so use unaligned loads;
+        // handling the alignment of these edge cases is the bonus.
+        for( std::size_t j = i + 1; j + 4 <= n; j += 4 )
+        {
+            __m256d const result = _mm256_and_pd(
+                _mm256_cmp_pd( xi,
+                    _mm256_loadu_pd( xvals.data() + j ), _CMP_EQ_OQ ),
+                _mm256_and_pd(
+                    _mm256_cmp_pd( yi,
+                        _mm256_loadu_pd( yvals.data() + j ), _CMP_EQ_OQ ),
+                    _mm256_cmp_pd( zi,
+                        _mm256_loadu_pd( zvals.data() + j ), _CMP_EQ_OQ ) ) );
+
+            // One mask bit per lane; count lanes where x, y, and z all match.
+            matches += __builtin_popcount( _mm256_movemask_pd( result ) );
+        }
+    }
+    return matches;
+}
| 178 | + \end{verbatim} |
130 | 179 | \end{solution}
|
131 | 180 |
|
132 | 181 | \newpage
|
|
138 | 187 | {\em Guide: 20 min}
|
139 | 188 |
|
140 | 189 | \begin{solution}[8em]
|
| 190 | +This is similar to the example in class for finding isolated points. There are many possible solutions. One of the simplest is to first sort the data by L1/Manhattan norm. In the second step, we still use a doubly nested loop, but no point $i$ needs to be compared to any point $j$ whose Manhattan norm is more than 2 units larger than that of $i$, as in the sketch below.
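+A sketch of this approach (assuming a simple array-of-structs layout; the Point type and function names are illustrative):
+ \begin{verbatim}
+#include <algorithm>
+#include <cmath>
+#include <cstddef>
+#include <vector>
+
+struct Point { double x, y, z; };
+
+double l1( Point const& p )  // Manhattan norm
+{
+    return std::fabs( p.x ) + std::fabs( p.y ) + std::fabs( p.z );
+}
+
+void prune_and_compare( std::vector< Point >& pts )
+{
+    // Step 1: sort by Manhattan norm.
+    std::sort( pts.begin(), pts.end(),
+               []( Point const& a, Point const& b )
+               { return l1( a ) < l1( b ); } );
+
+    // Step 2: the inner loop stops as soon as point j's norm exceeds
+    // point i's by more than 2 units, instead of scanning all n points.
+    for( std::size_t i = 0; i < pts.size(); ++i )
+        for( std::size_t j = i + 1;
+             j < pts.size() && l1( pts[j] ) <= l1( pts[i] ) + 2.0; ++j )
+        {
+            // compare point i with point j here
+        }
+}
+ \end{verbatim}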
141 | 191 | \end{solution}
|
142 | 192 | \end{questions}
|
143 | 193 |
|
|