
Commit a9f8c08

✅ Add solutions to second practice exam
1 parent 7127756 commit a9f8c08

File tree

1 file changed: +52 -2 lines changed


exams/practice2/practice2.tex

@@ -33,31 +33,36 @@
 }

 \bigskip
-%\printanswers
+\printanswers
 \begin{questions}
 \question Please explain the following concepts in 1-3 sentences and/or code snippets and/or a small illustration. An excellent answer does not have to be long, just precise.\\{\em Guide: 2 minutes each}

 \medskip

 \begin{parts}
-\part[1] Cache Coherency
+\part[1] Cache Coherence
 \begin{solution}[6em]
+The maintenance of consistency between cache lines in separate caches that reference the same underlying memory address.
 \end{solution}

 \part[1] Data-level Parallelism
 \begin{solution}[6em]
+Parallelisation across elements of data (as opposed to, say, across threads or tasks).
 \end{solution}

 \part[1] False Sharing
 \begin{solution}[6em]
+When distinct threads reference distinct, independent data elements, but those elements reside on the same cache line. This creates a dependency/contention that logically should not exist.
 \end{solution}

 \part[1] Synchronisation
 \begin{solution}[6em]
+Causing threads to join up and wait for each other (e.g., in order to coordinate computation or data structures).
 \end{solution}

 \part[1] Vectorisation
 \begin{solution}[6em]
+The application of vector (i.e., SIMD) registers to accelerate code, or the process of exposing data-level parallelism that can be exploited with SIMD.
 \end{solution}
 \end{parts}

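The false-sharing answer above is easy to demonstrate in code. The sketch below is illustrative only (names such as `Padded` and `run` are hypothetical, and the 64-byte figure assumes a typical x86 cache-line size): each thread increments only its own counter, and the `alignas(64)` padding places the counters on separate cache lines so the coherence protocol never has to bounce a line between cores.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch: padding each counter to a full (assumed 64-byte)
// cache line so that logically independent counters do not share a line.
struct Padded
{
    alignas( 64 ) std::uint64_t value = 0;
};

// Each thread increments only its own counter; without the padding the
// counters would sit on one cache line and contend despite being
// logically independent.
std::uint64_t run( std::size_t num_threads, std::size_t iters )
{
    std::vector< Padded > counters( num_threads );
    std::vector< std::thread > workers;
    for( std::size_t t = 0; t < num_threads; ++t )
        workers.emplace_back( [ &counters, t, iters ]
        {
            for( std::size_t k = 0; k < iters; ++k )
                ++counters[ t ].value;
        } );
    for( auto& w : workers ) w.join();

    std::uint64_t total = 0;
    for( auto const& c : counters ) total += c.value;
    return total;
}
```

Timing the same loop with and without `alignas(64)` is the usual way to make the contention visible.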

@@ -74,14 +79,17 @@
 \begin{parts}
 \part[2] Which parallel algorithm is more work-efficient? Explain how you arrived at this conclusion.
 \begin{solution}[8em]
+{\em My Algorithm} is more work-efficient. We can compare the performance of the two algorithms on a single thread, relative to the sequential baseline, and observe that {\em State of the Art} incurs much more overhead/additional work.
 \end{solution}

 \part[2] Which algorithm exhibits the best parallel scalability? Justify your response.
 \begin{solution}[8em]
+This answer is nuanced. {\em State of the Art} gains the most speed-up from additional threads, up to $t=8$. After this point, its performance degrades, whereas {\em My Algorithm} continues to see (diminishing) returns. This suggests either that we have 8 physical cores and the last 8 are logical cores/hyperthreads, or that the final 8 cores are on a separate socket and {\em My Algorithm} has better NUMA performance than {\em State of the Art}.
 \end{solution}

 \part[2] What is the overall message (a.k.a., purpose, or significance) of the plot?
 \begin{solution}[8em]
+This plot shows performance relative to thread count. It emphasises raw execution time, but parallel scalability and work-efficiency can both be read from it. The primary message is that {\em My Algorithm} is the fastest or near-fastest algorithm on systems with at least 4 cores, but that it takes 4 cores to amortise the overhead of both parallel algorithms.
 \end{solution}
 \end{parts}


@@ -127,6 +135,47 @@


 \begin{solution}[25em]
+To take full advantage of the SIMD width, we need to rewrite the vector of points as three separate aligned vectors, one for each coordinate. Then we can rewrite the loops to compare multiple $x$'s to each other, multiple $y$'s to each other, and multiple $z$'s to each other for equality. Thereafter, we can take the bitwise conjunction of the three result mask vectors and then count the number of set bits. Bonus mark for handling alignment around the edge cases of $j$ correctly. E.g.:
+
+\begin{verbatim}
+// Assumes MyAlloc is a 32-byte-aligned allocator.
+std::vector< double, MyAlloc > xvals;
+std::vector< double, MyAlloc > yvals;
+std::vector< double, MyAlloc > zvals;
+
+std::size_t num_matches()
+{
+  auto const n = xvals.size();
+  auto matches = 0llu;
+  // An __m256d holds 4 doubles, so advance 4 lanes at a time.
+  for( auto i = 0u; i < n; i += 4 )
+  {
+    __m256d const xi = _mm256_load_pd( xvals.data() + i );
+    __m256d const yi = _mm256_load_pd( yvals.data() + i );
+    __m256d const zi = _mm256_load_pd( zvals.data() + i );
+
+    // j = i + 1 is not 32-byte aligned, hence the unaligned loads.
+    for( auto j = i + 1; j < n; j += 4 )
+    {
+      __m256d const result = _mm256_and_pd(
+        _mm256_cmp_pd(
+          xi, _mm256_loadu_pd( xvals.data() + j ), _CMP_EQ_OQ
+        ),
+        _mm256_and_pd(
+          _mm256_cmp_pd(
+            yi, _mm256_loadu_pd( yvals.data() + j ), _CMP_EQ_OQ
+          ),
+          _mm256_cmp_pd(
+            zi, _mm256_loadu_pd( zvals.data() + j ), _CMP_EQ_OQ
+          )
+        )
+      );
+      matches += __builtin_popcount( _mm256_movemask_pd( result ) );
+    }
+  }
+  return matches;
+}
+\end{verbatim}
 \end{solution}

 \newpage
@@ -138,6 +187,7 @@
 {\em Guide: 20 min}

 \begin{solution}[8em]
+This is similar to the example in class for finding isolated points. There are many possible solutions. One of the simplest is to first sort the data by L1/Manhattan norm. In the second step, we still use a double-nested loop, but no point $i$ needs to be compared to any point $j$ whose Manhattan norm is more than 2 units larger than that of $i$.
 \end{solution}
 \end{questions}

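The sort-then-prune idea in that last solution can be sketched in scalar C++. This is a hedged illustration, not the exam's reference solution: the names `Point` and `count_close_pairs` are hypothetical, and it assumes the task is to count pairs of 3-D points within L1 distance 2. The pruning relies on the reverse triangle inequality: if two L1 norms differ by more than 2, the L1 distance between the points also exceeds 2.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::array< double, 3 >;

double l1_norm( Point const& p )
{
    return std::abs( p[ 0 ] ) + std::abs( p[ 1 ] ) + std::abs( p[ 2 ] );
}

double l1_dist( Point const& a, Point const& b )
{
    return std::abs( a[ 0 ] - b[ 0 ] ) + std::abs( a[ 1 ] - b[ 1 ] )
         + std::abs( a[ 2 ] - b[ 2 ] );
}

// Hypothetical task: count pairs of points with L1 distance <= 2.
std::size_t count_close_pairs( std::vector< Point > points )
{
    // Step 1: sort by L1 norm so candidate partners are contiguous.
    std::sort( points.begin(), points.end(),
               []( Point const& a, Point const& b )
               { return l1_norm( a ) < l1_norm( b ); } );

    // Step 2: still a double-nested loop, but stop scanning j as soon
    // as its norm is more than 2 units larger than that of i.
    std::size_t count = 0;
    for( std::size_t i = 0; i < points.size(); ++i )
        for( std::size_t j = i + 1;
             j < points.size()
             && l1_norm( points[ j ] ) - l1_norm( points[ i ] ) <= 2.0;
             ++j )
            if( l1_dist( points[ i ], points[ j ] ) <= 2.0 ) ++count;
    return count;
}
```

On data whose norms are spread out, the inner loop terminates after a handful of candidates instead of scanning all remaining points, which is the entire point of the pre-sort.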
