index.html

<!doctype html>
<html lang="en">

    <head>
        <meta charset="utf-8">

        <title>The Math of Reliability</title>

        <meta name="description" content="">
        <meta name="author" content="Avishai Ish-Shalom">

        <meta name="apple-mobile-web-app-capable" content="yes" />
        <meta name="apple-mobile-web-app-status-bar-style" content="black-translucent" />

        <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">

        <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.3.0/css/reveal.min.css">
        <link rel="stylesheet" href="https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/css/theme/league.css" id="theme">

        <!-- For syntax highlighting -->
        <link rel="stylesheet" href="https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/lib/css/zenburn.css">

    <!-- Printing and PDF exports -->
    <script>
        var link = document.createElement( 'link' );
        link.rel = 'stylesheet';
        link.type = 'text/css';
        link.href = window.location.search.match( /print-pdf/gi ) ? 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/css/print/pdf.css' : 'https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.3.0/css/print/paper.css';
        document.getElementsByTagName( 'head' )[0].appendChild( link );

        var m = window.location.search.match(/theme=([^&]+)/);
        if( m ) {
            theme = m[1];
            var link = document.getElementById('theme');
            link.href = 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/css/theme/' + theme + '.css';
        }
    </script>

        <!--[if lt IE 9]>
        <script src="lib/js/html5shiv.js"></script>
        <![endif]-->
    </head>

    <body>

        <div class="reveal">

            <!-- Any section element inside of this container is displayed as a slide -->
            <div class="slides">
              <section>
                <img src="images/guinness.jpg" style="height: 50%; width: 50%;" alt="">
                <aside class="notes">
                  <p>t-test, introduced 1908 by william gosset working for guinness. published under the pen name "A. Student" because Claude Guinness treaeted his use of math in the brewry as trade secret</p>
                </aside>
              </section>
                <section>
                    <h1 style="font-size: 120px;">The Math of Reliability</h1>
                    <p>Avishai Ish-Shalom (@nukemberg)</p>
                    <p style="display: inline-block; vertical-align: bottom;">
                        <b>repo:</b> <a href="http://github.com/avishai-ish-shalom/the-math-of-reliability">github.com/avishai-ish-shalom/the-math-of-reliability</a>
                        <br>Companion IPython notebook: <a href="https://github.com/avishai-ish-shalom/the-math-of-reliability/blob/master/reliability.ipynb">reliability.ipynb</a>
                    </p>
                    <aside class="notes">
                        <ul>
                            <li>Reliability is initimately linked to culture, but that's a different talk</li>
                            <li>You can't "bolt on" reliability</li>
                            <li>Purpose of this talk: get people to think about reliability analytically</li>
                            <li>Who's using math daily?</li>
                        </ul>
                    </aside>
                </section>
                <section>
                    <h2>Math!?</h2>
                    <img src="images/run-away.gif" style="height: 55%; width: 55%;" alt="">
                    <aside class="notes">
                        <ul>
                            <li>People are scared of math, but they shouldn't be</li>
                            <li>It's not about the formulas or numbers, it's about models, formalizing and proving</li>
                            <li>Story of the professor - "it's trivial that..."</li>
                        </ul>
                    </aside>
                </section>
                <section>
                    <h2>Example: Nagios-like alerts</h2>
                    <div>
                        <p>Nagios service with <i>max_check_attempts=4</i>, <i>check_interval</i>=15sec</p>
                        <p>Service experiencing 40% error rate</p>
                    </div>
                    <div style="margin-top: 2em;">
                        <p>Chance of hard CRITICAL: <span style="color: red;">2.6%</span></p>
                        <p>Chance of NOT GETTING ANY ALERT:</p>
                        <table>
                            <tr><td>0.5 hour</td><td>45.9%</td></tr>
                            <tr>
                                <td>1 hour</td>
                                <td>21.1%</td>
                            </tr>
                            <tr>
                                <td>1.5 hours</td>
                                <td>9.9%</td>
                            </tr>
                        </table>
                    </div>
                    <aside class="notes">
                        <ul>
                            <li>Nagios was not designed for statistical failures (false negatives)</li>
                        </ul>
                    </aside>
                </section>
                <section>
                    <section>
                        <h2>Define "Reliable"</h2>
                        <div class="fragment">
                            <ul>
                                <li>"4 nines"</li>
                                <li>MTBF</li>
                                <li>Failures per Year</li>
                                <li>QoS</li>
                                <li>SLA</li>
                            </ul>
                            <aside class="notes">
                                <ul>
                                    <li>Lots of terms, all insufficient</li>
                                    <li>"uptime" means "no failure"</li>
                                    <li>Need to define what is failure</li>
                                </ul>
                            </aside>
                        </div>
                    </section>
                    <section>
                        <h2>Define "Failure"</h2>
                        <h3 class="fragment">System operating outside specified parameters</h3>
                        <h3 class="fragment">In reality: users are complaining!</h3>
                    </section>
                    <section>
                        <h2>"Failure" is subjective!</h2>
                        <aside class="notes"><p>We have to understand the business to define failure</p></aside>
                    </section>
                    <section>
                        <h2>Possible states</h2>
                        <ul class="fragment">
                            <li>Working OK</li>
                            <li>Failure</li>
                            <li class="fragment">Unknown</li>
                            <li class="fragment">Fuzzy</li>
                        </ul>

                        <aside class="notes">
                            <ul>
                                <li>Failure - operating outside parameters</li>
                                <li>Failure isn't allways obvious</li>
                                <li>which clock is "correct"? which value?</li>
                                <li>We don't always know what "correct" is</li>
                                <li>We don't always know what the system state is. e.g. our telemetry can be wrong</li>
                            </ul>
                        </aside>
                    </section>
                    <section data-transition="fade-out" data-autoslide="3000">
                        <h3>The absence of evidence <br>is not the evidence of absence</h3>
                    </section>
                    <section data-transition="fade-in">
                        <h3>The absence of alerts<br>is not the evidence of proper operation</h3>
                    </section>
                </section>
                <section>
                    <section>
                        <h2>Let's talk about failure</h2>
                        <img src="images/cat-failure.gif" alt="">
                    </section>
                    <section>
                        <h2>Reliability measures</h2>
                        <ul>
                            <li>MTBF = mean time between failures (years per failure)</li>
                            <li>&lambda; = failures per year</li>
                            <li>F = failure rate or probability of failure in one year</li>
                            <li>R = reliability rate (probability of working in one year)</li>
                        </ul>
                        <p>$$\lambda = T / MTFB$$</p>
                        <p>$$F = \lambda / T = 1 / MTBF$$</p>
                        <p>$$R = 1 - F$$</p>
                        <aside class="notes">
                            <p>typical hdd MTBF - 0.3-1M hours (about 35-120 years); MTBF computed in a lab and extrapolated over time. With enough disks you will see failures all the time</p>
                        </aside>
                    </section>
                    <section>
                        <h2>Statistical independence</h2>
                        <aside class="notes">
                            <ul>
                                <li>Dominant mode in hardware</li>
                                <li>Also applies to some software failures</li>
                            </ul>
                        </aside>
                    </section>
                    <section>
                        <h2>The Hot Hand fallacy</h2>
                        <div class="fragment">
                            <hr>
                            <h2>The Gambler's fallacy</h2>
                        </div>
                    </section>
                    <section>
                        <h2>Past performance does not predict future performance*</h2>
                        <aside class="notes">
                          <ul>
                            <li>This is what statistical independence means. future events are independent of past events</li>
                            <li>In general, we can only make statistical prediction over a large number of similar systems</li>
                            <li>Assumed reliability: not enought trust -> underusage; too much trust -> no preparation for failure</li>
                          </ul>
                        </aside>
                    </section>
                </section>
                <section>
                    <section>
                        <h2>Serial reliability</h2>
                        <h4>$$R_{total} = \prod_{i=0}^{n} R_{i}$$</h4>
                        <!-- <p>$$\lambda_{total} = \sum_{i=0}^{n} \lambda_{i}$$</p> -->
                    </section>
                    <section>
                        <h2>Serial reliability</h2>
                        <table>
                            <thead>
                                <th>R1</th><th>R2</th><th>R3</th><th>R system</th><th>Improvement (MTBF)</th>
                            </thead>
                            <tr>
                                <td>0.995</td><td>0.99</td><td>0.95</td><td>0.936</td><td>-</td>
                            </tr>
                            <tr class="fragment">
                                <td style="font-size: 120%;"><b>0.9995</b></td>
                                <td>0.99</td>
                                <td>0.95</td>
                                <td>0.94</td>
                                <td>X 1.07</td>
                            </tr>
                            <tr class="fragment">
                                <td>0.995</td>
                                <td style="font-size: 120%;"><b>0.999</b></td>
                                <td>0.95</td>
                                <td>0.944</td>
                                <td>X 1.15</td>
                            </tr>
                            <tr class="fragment">
                                <td>0.995</td>
                                <td>0.99</td>
                                <td style="font-size: 120%;"><b>0.995</b></td>
                                <td>0.98</td>
                                <td>X 3.21</td>
                            </tr>
                        </table>
                        <div class="fragment">
                            <p style="align: center;">$$R_{total} \lt min(R_{i})$$</p>
                            <p>Best ROI - improve the <b>worst</b> component</p>
                            <p>Improvement is expensive</p>
                        </div>
                        <aside class="notes">
                          <p>Total reliability always lower than worse component. No point using disproportionally reliable components</p>
                          <p>Cheaper way: clusters!</p>
                        </aside>
                    </section>
                    <section>
                        <h2>Parallel reliability (redundancy)</h2>
                        <p>Reliability of redundant system, up to $k$ failures</p>
                        <p>$$R_{total}(n, k) = \sum_{i=0}^{k} {n \choose i} F^{i} R^{n-i}$$</p>
                    </section>
                    <section data-markdown>
                        <script type="text/template">
                            ## Reliability improvement

                            Redundant system, 10 components, up-to 2 failures

                            |R|MTBF|Total R|Total MTBF|improvement factor|
                            |-|-|-|-|-|
                            |0.95|19|0.9885|86|4.5x|
                            |0.99|99|0.9999|8783|88.7x|

                            Higher ROI with more reliable components
                        </script>
                        <aside class="notes">
                          <p>Is an argument for using enterprise-grade hardware???</p>
                        </aside>
                    </section>
                    <section>
                        <p>Redundant system, R=0.95</p>
                          <p style="font-size: 80%;">n - cluster size; k - failures tolerated</p>
                        <table>
                            <thead>
                                <th>n</th>
                                <th>k</th>
                                <th>Overhead</th>
                                <th>R total</th>
                            </thead>
                            <tr class="fragment">
                                <td>10</td>
                                <td>1</td>
                                <td>10%</td>
                                <td><span style="color: red;">0.914</span></td>
                            </tr>
                            <tr class="fragment">
                                <td>10</td>
                                <td>2</td>
                                <td>20%</td>
                                <td>0.989</td>
                            </tr>
                            <tr class="fragment">
                                <td>100</td>
                                <td>5</td>
                                <td>5%</td>
                                <td><span style="color: red;">0.616</span></td>
                            </tr>
                            <tr class="fragment">
                                <td>100</td>
                                <td>9</td>
                                <td>9%</td>
                                <td>0.972</td>
                            </tr>
                            <tr class="fragment">
                                <td>100</td>
                                <td>11</td>
                                <td>11%</td>
                                <td>0.996</td>
                            </tr>
                        </table>
                        <div class="fragment">
                            <ul>
                                <li>Not enough redundancy will REDUCE your reliability</li>
                                <li>N+1 rule only true for small clusters</li>
                                <li>Large clusters more cost effective</li>
                            </ul>
                        </div>
                        <aside class="notes">
                          <p>Use many small/cheap identical components</p>
                        </aside>
                    </section>
                </section>
                <section>
                    <section>
                        <h2>Statistically dependent / <br/>Correlated failures</h2>
                        <ul>
                            <li>Shared workload</li>
                            <li>Shared code</li>
                            <li>Shared infrastracture</li>
                        </ul>
                        <aside class="notes">
                            <ul>
                                <li>Dominant failure mode in software</li>
                                <li>Still gain added reliability from redudancy, but not as much</li>
                                <li>How do you deal with it?</li>
                                <li>Segmentations, failure domains</li>
                                <li>Intentional variation</li>
                            </ul>
                        </aside>
                    </section>
                    <section>
                        <h3>Backup and operational sub-systems should avoid coupling with primaries</h3>
                        <aside class="notes">
                            <ul>
                                <li>Seems obvious but people get it wrong</li>
                                <li>Especially when it comes to monitoring and ops tools</li>
                            </ul>
                        </aside>
                    </section>
                    <section>
                        <ul>
                            <li>1 out of 1000 drivers is drunk</li>
                            <li>Breathalyzer detects all drunks but has 5% false positives</li>
                            <li>Drivers stopped at random</li>
                        </ul>
                        <h3 style="margin-top: 1em;">A driver was stopped and breathalyzer shows he's drunk<br/>What's the probability he's really drunk?</h3>
                    </section>
                    <section>
                        <p>If you answered 0.95, you have fallen for the</p>
                        <h2>The Base rate fallacy</h2>
                        <p>Correct answer: ~ 0.02</p>
                    </section>
                    <section>
                        <h3>Explanation</h3>
                        <p>In a 1000 drivers sample, 1 would be drunk and 49.95 (999 x 0.05) would falsely test as drunk</p>
                        <p>Base rate of being falsely detected as drunk (P(D)=50.95/1000) &gt;&gt; rate of drunk drivers (P(drunk)=1/1000)</p>
                        <p style="margin-top: 2em;">Bayes theorem: $P(drunk|D) = P(D|drunk) P(drunk)/P(D)$</p>
                        <p style="font-size: 80%">$P(D|drunk) = 1, P(drunk)=1/1000, P(D) = 50.95/1000$</p>
                    </section>
                    <section>
                        <h2>Active/Standby failover</h2>
                        <ul>
                            <li>Failed master always detected</li>
                            <li>2% probability of false positive (working master detected as failed)</li>
                            <li>~ 95% of failovers are erroneous</li>
                            <li>Erroneous failovers can cause severe issues</li>
                        </ul>
                        <h3 class="fragment" style="margin-top: 1em;">Disable auto failover, greatly reduce false positives or use active/active</h3>
                        <aside class="notes">
                            <ul>
                                <li>Database failover dillema, github 2012 outage</li>
                                <li>You may be tempted to say quorum decision can solve this, but..</li>
                                <li>Either reduce false positives drastically or reduce failover issues</li>
                            </ul>
                        </aside>
                    </section>
                </section>
                <section>
                    <section>
                        <h2>Multiple dependencies</h2>
                        <div data-svg-fragment="images/layers.svg#[*|label=Layer_1]">
                            <div class="fragment" title="[*|label=Layer_2]"></div>
                            <div class="fragment" title="[*|label=Layer_3]"></div>
                        </div>
                        <h3 class="fragment">Circuit breakers!!</h3>
                        <aside class="notes">
                          <p>microservices FTW</p>
                        </aside>
                    </section>
                    <section>
                        <h2>Queuing delay</h2>
                        <p>$delay \propto \frac {\rho} {1 - \rho}$</p>
                        <p style="font-size: 70%;">&rho; - system utilization</p>
                        <img src="images/queue-latency.svg" style="border: none; background: none; box-shadow: none;" height="50%" width="50%" alt="">
                        <h3 class="fragment">Throttle your system!</h3>
                        <aside class="notes">
                            <ul>
                                <li>if you go over ~ 80% utilization latency will start rising fast</li>
                            </ul>
                        </aside>
                    </section>
                    <section>
                        <div data-svg-fragment="images/queue.svg#[*|label=Layer_1]">
                            <div class="fragment" title="[*|label=Layer_2]"></div>
                            <div class="fragment" title="[*|label=Layer_3]"></div>
                        </div>
                        <h3 class="fragment">Backpressure</h3>
                        <aside class="notes">
                            <ul>
                                <li>backend server utilization too high</li>
                                <li>load will queue inside your system</li>
                                <li>limit internal queues and apply backpressure</li>
                            </ul>
                        </aside>
                    </section>
                    <section>
                        <h2>Little's Law</h2>
                        <p>$L = \lambda W$</p>
                        <p style="font-size: 70%">L - clients in the system, &lambda; - arrival rate, W - wait time (latency)</p>
                        <div data-svg-fragment="images/cluster-little-law.svg#[*|label=Layer_1]">
                            <div class="fragment" title="[*|label=Layer_2]"></div>
                        </div>
                        <p>$L_i = L_j \rightarrow \frac {\lambda_i} {\lambda_j} = \frac {W_j} {W_i}$</p>
                        <aside class="notes">
                            <ul>
                                <li>What happens when 1 process failes and returs errors with 1/100 latency?</li>
                                <li>How do you deal with this?</li>
                                <li>throttling according to "normal" throughput</li>
                            </ul>
                        </aside>
                    </section>
                    <section>
                        <h2>Feedback loops</h2>
                        <p>$\frac {df} {dt} = \alpha f \rightarrow f(t) = A e^{\alpha t}$</p>
                        <h3 class="fragment">Backoffs, cooldowns</h3>
                    </section>
                </section>
                <section>
                    <section data-background="images/final-words.svg">
                        <h3>Reliability is everyone's responsibility</h3>
                    </section>
                    <section>
                        <h2>Thank you</h2>
                        <img src="images/holy-grail-taunt.gif" alt="">
                    </section>
                </section>
                <section>
                    <section>
                        <h1>Complex Adaptive Systems</h1>
                    </section>
                    <section>
                        <h2>Phase changes</h2>
                        <img src="images/instant-freezing-water.gif" alt="">
                    </section>
                    <section>
                        <h2>Chain Reaction</h2>
                        <img src="images/chain-reaction-ping-pong.gif" alt="">
                    </section>
                    <section>
                        <h2>System memory</h2>
                    </section>
                    <section>
                        <h2>Transient -> permanent</h2>
                    </section>
                </section>
            </div>
        </div>

        <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.3.0/lib/js/head.min.js"></script>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/reveal.js/3.3.0/js/reveal.min.js"></script>
        <script src="https://cdnjs.cloudflare.com/ajax/libs/d3/3.5.6/d3.min.js"></script>
        <script>

            // Full list of configuration options available here:
            // https://github.com/hakimel/reveal.js#configuration
            Reveal.initialize({
                controls: true,
                progress: true,
                history: true,
                center: true,

                transition: Reveal.getQueryHash().transition || 'slide', // none/fade/slide/convex/concave/zoom
                math: {
                    mathjax: 'https://cdn.mathjax.org/mathjax/latest/MathJax.js',
                    config: 'TeX-AMS_HTML-full'  // See http://docs.mathjax.org/en/latest/config-files.html
                },
                // Parallax scrolling
                // parallaxBackgroundImage: 'https://s3.amazonaws.com/hakim-static/reveal-js/reveal-parallax-1.jpg',
                // parallaxBackgroundSize: '2100px 900px',

                // Optional libraries used to extend on reveal.js
                dependencies: [
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/js/classList.js', condition: function() { return !document.body.classList; } },
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/plugin/markdown/marked.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/plugin/markdown/markdown.js', condition: function() { return !!document.querySelector( '[data-markdown]' ); } },
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/plugin/highlight/highlight.js', async: true, condition: function() { return !!document.querySelector( 'pre code' ); }, callback: function() { hljs.initHighlightingOnLoad(); } },
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/plugin/zoom-js/zoom.js', async: true },
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/plugin/notes/notes.js', async: true },
                    { src: 'https://cdn.rawgit.com/hakimel/reveal.js/3.3.0/plugin/math/math.js', async: true},
                    { src: 'js/reveal-svg-fragment.js', condition: function() { return !!document.querySelector( '[data-svg-fragment]' ); }}
                ]
            });

        </script>
    </body>
</html>