
Commit c56d5b7

Author: Mu Li
Commit message: Uploaded by d2lbook
1 parent 704e12c · commit c56d5b7

File tree: 153 files changed (+64,990, -210,097 lines)


chapter_appendix-mathematics-for-deep-learning/distributions.ipynb

+1,871-1,836
Large diffs are not rendered by default.

chapter_appendix-mathematics-for-deep-learning/eigendecomposition.ipynb

+144-130
Large diffs are not rendered by default.

chapter_appendix-mathematics-for-deep-learning/geometry-linear-algebraic-ops.ipynb

+30,336-63
Large diffs are not rendered by default.
@@ -1,60 +0,0 @@
-{
-"cells": [
-{
-"cell_type": "markdown",
-"metadata": {
-"origin_pos": 0
-},
-"source": [
-"# Appendix: Mathematics for Deep Learning\n",
-":label:`chap_appendix_math`\n",
-"\n",
-"**Brent Werness** (*Amazon*), **Rachel Hu** (*Amazon*), and authors of this book\n",
-"\n",
-"\n",
-"One of the wonderful parts of modern deep learning is the fact that much of it can be understood and used without a full understanding of the mathematics below it. This is a sign that the field is maturing. Just as most software developers no longer need to worry about the theory of computable functions, neither should deep learning practitioners need to worry about the theoretical foundations of maximum likelihood learning.\n",
-"\n",
-"But, we are not quite there yet.\n",
-"\n",
-"In practice, you will sometimes need to understand how architectural choices influence gradient flow, or the implicit assumptions you make by training with a certain loss function. You might need to know what in the world entropy measures, and how it can help you understand exactly what bits-per-character means in your model. These all require deeper mathematical understanding.\n",
-"\n",
-"This appendix aims to provide you the mathematical background you need to understand the core theory of modern deep learning, but it is not exhaustive. We will begin with examining linear algebra in greater depth. We develop a geometric understanding of all the common linear algebraic objects and operations that will enable us to visualize the effects of various transformations on our data. A key element is the development of the basics of eigen-decompositions.\n",
-"\n",
-"We next develop the theory of differential calculus to the point that we can fully understand why the gradient is the direction of steepest descent, and why back-propagation takes the form it does. Integral calculus is then discussed to the degree needed to support our next topic, probability theory.\n",
-"\n",
-"Problems encountered in practice frequently are not certain, and thus we need a language to speak about uncertain things. We review the theory of random variables and the most commonly encountered distributions so we may discuss models probabilistically. This provides the foundation for the naive Bayes classifier, a probabilistic classification technique.\n",
-"\n",
-"Closely related to probability theory is the study of statistics. While statistics is far too large a field to do justice in a short section, we will introduce fundamental concepts that all machine learning practitioners should be aware of, in particular: evaluating and comparing estimators, conducting hypothesis tests, and constructing confidence intervals.\n",
-"\n",
-"Last, we turn to the topic of information theory, which is the mathematical study of information storage and transmission. This provides the core language by which we may discuss quantitatively how much information a model holds on a domain of discourse.\n",
-"\n",
-"Taken together, these form the core of the mathematical concepts needed to begin down the path towards a deep understanding of deep learning.\n",
-"\n",
-":begin_tab:toc\n",
-" - [geometry-linear-algebraic-ops](geometry-linear-algebraic-ops.ipynb)\n",
-" - [eigendecomposition](eigendecomposition.ipynb)\n",
-" - [single-variable-calculus](single-variable-calculus.ipynb)\n",
-" - [multivariable-calculus](multivariable-calculus.ipynb)\n",
-" - [integral-calculus](integral-calculus.ipynb)\n",
-" - [random-variables](random-variables.ipynb)\n",
-" - [maximum-likelihood](maximum-likelihood.ipynb)\n",
-" - [distributions](distributions.ipynb)\n",
-" - [naive-bayes](naive-bayes.ipynb)\n",
-" - [statistics](statistics.ipynb)\n",
-" - [information-theory](information-theory.ipynb)\n",
-":end_tab:\n"
-]
-}
-],
-"metadata": {
-"kernelspec": {
-"display_name": "Python 3",
-"name": "python3"
-},
-"language_info": {
-"name": "python"
-}
-},
-"nbformat": 4,
-"nbformat_minor": 4
-}

chapter_appendix-mathematics-for-deep-learning/information-theory.ipynb

+46-27
@@ -55,7 +55,9 @@
 "execution_count": 1,
 "metadata": {
 "origin_pos": 2,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -80,7 +82,7 @@
 "def self_information(p):\n",
 "    return -torch.log2(torch.tensor(p)).item()\n",
 "\n",
-"self_information(1/64)"
+"self_information(1 / 64)"
 ]
 },
 {
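The cell touched above computes self-information, $I(x) = -\log_2 p(x)$. As a quick standalone check (a minimal sketch assuming only that `torch` is installed; the function body mirrors the diff):

```python
import torch

def self_information(p):
    # Self-information in bits: I(x) = -log2 p(x)
    return -torch.log2(torch.tensor(p)).item()

# An event of probability 1/64 = 2**-6 carries about 6 bits of information
print(self_information(1 / 64))  # ~6.0
```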
@@ -125,7 +127,9 @@
 "execution_count": 2,
 "metadata": {
 "origin_pos": 5,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -142,7 +146,7 @@
 "source": [
 "def entropy(p):\n",
 "    entropy = - p * torch.log2(p)\n",
-"    # nansum will sum up the non-nan number\n",
+"    # Operator nansum will sum up the non-nan number\n",
 "    out = nansum(entropy)\n",
 "    return out\n",
 "\n",
@@ -216,7 +220,9 @@
 "execution_count": 3,
 "metadata": {
 "origin_pos": 8,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -281,7 +287,9 @@
 "execution_count": 4,
 "metadata": {
 "origin_pos": 11,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -349,7 +357,9 @@
 "execution_count": 5,
 "metadata": {
 "origin_pos": 14,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -367,13 +377,12 @@
 "def mutual_information(p_xy, p_x, p_y):\n",
 "    p = p_xy / (p_x * p_y)\n",
 "    mutual = p_xy * torch.log2(p)\n",
-"    # nansum will sum up the non-nan number\n",
+"    # Operator nansum will sum up the non-nan number\n",
 "    out = nansum(mutual)\n",
 "    return out\n",
 "\n",
 "mutual_information(torch.tensor([[0.1, 0.5], [0.1, 0.3]]),\n",
-" torch.tensor([0.2, 0.8]),\n",
-" torch.tensor([[0.75, 0.25]]))"
+" torch.tensor([0.2, 0.8]), torch.tensor([[0.75, 0.25]]))"
 ]
 },
 {
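For reference, the `mutual_information` cell evaluates

$$I(X, Y) = \sum_{x, y} p(x, y)\, \log_2 \frac{p(x, y)}{p(x)\, p(y)},$$

again relying on `nansum` so that zero-probability cells (which produce nan) are ignored; with the tensors in the example call this should work out to roughly $0.72$ bits.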
@@ -431,7 +440,9 @@
 "execution_count": 6,
 "metadata": {
 "origin_pos": 17,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [],
 "source": [
@@ -474,7 +485,9 @@
 "execution_count": 7,
 "metadata": {
 "origin_pos": 20,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [],
 "source": [
@@ -504,7 +517,9 @@
 "execution_count": 8,
 "metadata": {
 "origin_pos": 23,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -540,7 +555,9 @@
 "execution_count": 9,
 "metadata": {
 "origin_pos": 26,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -606,7 +623,9 @@
 "execution_count": 10,
 "metadata": {
 "origin_pos": 29,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [],
 "source": [
@@ -629,7 +648,9 @@
 "execution_count": 11,
 "metadata": {
 "origin_pos": 32,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -716,7 +737,9 @@
 "execution_count": 12,
 "metadata": {
 "origin_pos": 35,
-"tab": "pytorch"
+"tab": [
+"pytorch"
+]
 },
 "outputs": [
 {
@@ -731,8 +754,8 @@
 }
 ],
 "source": [
-"# Implementation of CrossEntropy loss in pytorch \n",
-"# combines nn.LogSoftmax() and nn.NLLLoss().\n",
+"# Implementation of CrossEntropy loss in pytorch combines nn.LogSoftmax() and\n",
+"# nn.NLLLoss()\n",
 "nll_loss = NLLLoss()\n",
 "loss = nll_loss(torch.log(preds), labels)\n",
 "loss"
@@ -755,17 +778,13 @@
 "## Exercises\n",
 "\n",
 "1. Verify that the card examples from the first section indeed have the claimed entropy.\n",
-"2. Let us compute the entropy from a few data sources:\n",
+"1. Show that the KL divergence $D(p\\|q)$ is nonnegative for all distributions $p$ and $q$. Hint: use Jensen's inequality, i.e., use the fact that $-\\log x$ is a convex function.\n",
+"1. Let us compute the entropy from a few data sources:\n",
 " * Assume that you are watching the output generated by a monkey at a typewriter. The monkey presses any of the $44$ keys of the typewriter at random (you can assume that it has not discovered any special keys or the shift key yet). How many bits of randomness per character do you observe?\n",
 " * Being unhappy with the monkey, you replaced it by a drunk typesetter. It is able to generate words, albeit not coherently. Instead, it picks a random word out of a vocabulary of $2,000$ words. Moreover, assume that the average length of a word is $4.5$ letters in English. How many bits of randomness do you observe now?\n",
 " * Still being unhappy with the result, you replace the typesetter by a high quality language model. These can currently obtain perplexity numbers as low as $15$ points per character. The perplexity is defined as a length normalized probability, i.e., $$PPL(x) = \\left[p(x)\\right]^{1 / \\text{length(x)} }.$$ How many bits of randomness do you observe now?\n",
-"3. Explain intuitively why $I(X, Y) = H(X) - H(X|Y)$. Then, show this is true by expressing both sides as an expectation with respect to the joint distribution.\n",
-"4. What is the KL Divergence between the two Gaussian distributions $\\mathcal{N}(\\mu_1, \\sigma_1^2)$ and $\\mathcal{N}(\\mu_2, \\sigma_2^2)$?\n",
-"\n",
-"\n",
-"## [Discussions](https://discuss.mxnet.io/t/5157)\n",
-"\n",
-"![](../img/qr_information-theory.svg)\n"
+"1. Explain intuitively why $I(X, Y) = H(X) - H(X|Y)$. Then, show this is true by expressing both sides as an expectation with respect to the joint distribution.\n",
+"1. What is the KL Divergence between the two Gaussian distributions $\\mathcal{N}(\\mu_1, \\sigma_1^2)$ and $\\mathcal{N}(\\mu_2, \\sigma_2^2)$?\n"
 ]
 }
 ],
