## Activation Functions

* Sigmoid / Logistic Function
  * $\frac{1}{1 + \exp(-\alpha)}$
* Tanh
  * Like logistic function but shifted to range $[-1, +1]$
* ReLU often used in vision tasks
  * rectified linear unit
  * Linear with cutoff at zero
  * $\max(0, wx+b)$
  * Soft version ("softplus"): $\log(\exp(x) + 1)$
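
A minimal NumPy sketch of the activations above (the function names and the use of NumPy are my own choices, not from the notes):

```python
import numpy as np

def sigmoid(a):
    # logistic function: 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    # like the logistic function but shifted to the range [-1, +1]
    return np.tanh(a)

def relu(a):
    # rectified linear unit: linear with a cutoff at zero, max(0, a) for a = wx + b
    return np.maximum(0.0, a)

def softplus(a):
    # soft version of ReLU: log(exp(a) + 1)
    return np.log(np.exp(a) + 1.0)
```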

## Objective Function

* this requires probabilities, so we add an additional "softmax" layer at the end of our network
* steeper: cross-entropy has larger gradients than the quadratic loss when the prediction is far from the target, so learning is faster

|               | Forward                                 | Backward                                              |
| ------------- | --------------------------------------- | ----------------------------------------------------- |
| Quadratic     | $J = \frac{1}{2} (y - y^*)^2$           | $\frac{dJ}{dy} = y - y^*$                             |
| Cross Entropy | $J = y^*\log(y) + (1-y^*)\log(1-y)$     | $\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1-y^*}{y-1}$   |
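
A small sketch of the forward/backward pairs in the table, for a scalar prediction `y` and target `y_star`; it keeps the sign convention used above (no leading minus on the cross-entropy):

```python
import numpy as np

def quadratic(y, y_star):
    # Forward: J = 1/2 (y - y*)^2    Backward: dJ/dy = y - y*
    return 0.5 * (y - y_star) ** 2, y - y_star

def cross_entropy(y, y_star):
    # Forward: J = y* log(y) + (1 - y*) log(1 - y)
    # Backward: dJ/dy = y*/y + (1 - y*)/(y - 1)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    return J, y_star / y + (1 - y_star) / (y - 1)
```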

## Multi-class Output

* Softmax: $y_k = \frac{\exp(b_k)}{\sum_{l=1}^{K} \exp(b_l)}$
* Loss: $J = \sum_{k=1}^K y_k^* \log(y_k)$
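
A short sketch of the softmax layer and this multi-class objective (names assumed; the max-subtraction is a standard numerical-stability trick, not something the notes mention):

```python
import numpy as np

def softmax(b):
    # y_k = exp(b_k) / sum_l exp(b_l)
    e = np.exp(b - np.max(b))  # subtract max(b) for numerical stability
    return e / np.sum(e)

def multiclass_objective(y, y_star):
    # J = sum_k y*_k log(y_k), following the sign convention of these notes
    return np.sum(y_star * np.log(y))
```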

## Chain Rule

* Def #1 Chain Rule
  * $y = f(u)$
  * $u = g(x)$
  * $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$
* Def #2 Chain Rule
  * $y = f(u_1, u_2)$
  * $u_2 = g_2(x)$
  * $u_1 = g_1(x)$
  * $\frac{dy}{dx} = \frac{dy}{du_1} \cdot \frac{du_1}{dx} + \frac{dy}{du_2} \cdot \frac{du_2}{dx}$
* Def #3 Chain Rule
  * $\mathbf{y} = f(\mathbf{u})$
  * $\mathbf{u} = g(\mathbf{x})$
  * $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \cdot \frac{du_j}{dx_k}, \quad \forall i, k$
  * Holds for any intermediate quantities
  * Backpropagation is just repeated application of the chain rule (a quick numeric check follows the list below)
* Computation Graphs
  * not a Neural Network diagram
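
A quick numeric check of the chain rule with two intermediate quantities (Def #2, and Def #3 with $J = 2$); the function and values are illustrative only:

```python
import numpy as np

# y = f(u1, u2) with u1 = g1(x) = x^2 and u2 = g2(x) = sin(x); here f(u1, u2) = u1 * u2
def y_of_x(x):
    return (x ** 2) * np.sin(x)

x = 1.3
u1, u2 = x ** 2, np.sin(x)

# chain rule: dy/dx = dy/du1 * du1/dx + dy/du2 * du2/dx
dy_dx_chain = u2 * (2 * x) + u1 * np.cos(x)

# finite-difference check of the same derivative
eps = 1e-6
dy_dx_fd = (y_of_x(x + eps) - y_of_x(x - eps)) / (2 * eps)
print(dy_dx_chain, dy_dx_fd)  # the two values should agree closely
```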

## Backpropagation

* Backprop Ex #1 (worked numerically in the sketch after this list)
  * $y = f(x,z) = \exp(xz) + \frac{xz}{\log(x)} + \frac{\sin(\log(x))}{xz}$
* Forward Computation
  * Given $x = 2, z = 3$
  * $a = xz,\ b = \log(x),\ c = \sin(b),\ d = \exp(a),\ e = a / b,\ f = c / a$
  * $y = d + e + f$
* Backward Computation
  * $gy = dy/dy = 1$
  * $gf = dy/df = 1,\ ge = dy/de = 1,\ gd = dy/dd = 1$
  * $gc = dy/dc = dy/df \cdot df/dc = (gf)(1/a)$
  * $gb = dy/db = dy/de \cdot de/db + dy/dc \cdot dc/db = (ge)(-a/b^2) + (gc)(\cos(b))$
  * $ga = dy/da = dy/de \cdot de/da + dy/dd \cdot dd/da + dy/df \cdot df/da = (ge)(1/b) + (gd)(\exp(a)) + (gf)(-c/a^2)$
  * $gx = (ga)(z) + (gb)(1/x)$
  * $gz = (ga)(x)$
* Updates for Backprop
  * $gx = \frac{dy}{dx} = \sum_{k=1}^K \frac{dy}{du_k} \cdot \frac{du_k}{dx} = \sum_{k=1}^K (gu_k)\left(\frac{du_k}{dx}\right)$
  * Reuse forward computation in backward computation
  * Reuse backward computation within itself
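
A numeric sketch of Ex #1 at $x = 2, z = 3$: it runs the forward computation, applies the hand-derived backward updates above (reusing the forward quantities), and checks $gx$ and $gz$ against finite differences. The code itself is mine; only the formulas come from the notes.

```python
import numpy as np

def forward(x, z):
    # a = xz, b = log(x), c = sin(b), d = exp(a), e = a/b, f = c/a, y = d + e + f
    a, b = x * z, np.log(x)
    c, d = np.sin(b), np.exp(a)
    return d + a / b + c / a

x, z = 2.0, 3.0
a, b = x * z, np.log(x)
c, d, e, f = np.sin(b), np.exp(x * z), (x * z) / np.log(x), np.sin(b) / (x * z)

# backward computation, reusing a, b, c from the forward pass
gy = 1.0
gd = ge = gf = 1.0                          # y = d + e + f
gc = gf * (1.0 / a)                         # f = c / a
gb = ge * (-a / b ** 2) + gc * np.cos(b)    # e = a / b and c = sin(b)
ga = ge * (1.0 / b) + gd * np.exp(a) + gf * (-c / a ** 2)
gx = ga * z + gb * (1.0 / x)                # a = xz and b = log(x)
gz = ga * x

# finite-difference check of both gradients
eps = 1e-6
gx_fd = (forward(x + eps, z) - forward(x - eps, z)) / (2 * eps)
gz_fd = (forward(x, z + eps) - forward(x, z - eps)) / (2 * eps)
print(gx, gx_fd)  # should agree closely
print(gz, gz_fd)
```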

## Neural Network Training

* Consider a 2-hidden-layer neural network
* parameters are $\theta = [\alpha^{(1)}, \alpha^{(2)}, \beta]$
* SGD training (a minimal training-step sketch appears at the end of this section)
  * Iterate until convergence:
    * Sample $i \in \{1, \dots, N\}$
    * Compute gradient by backprop
      * $g\alpha^{(1)} = \nabla_{\alpha^{(1)}} J^{(i)}(\theta)$
      * $g\alpha^{(2)} = \nabla_{\alpha^{(2)}} J^{(i)}(\theta)$
      * $g\beta = \nabla_{\beta} J^{(i)}(\theta)$
      * where $J^{(i)}(\theta) = \ell(h_\theta(x^{(i)}), y^{(i)})$
    * Step opposite the gradient
      * $\alpha^{(1)} \leftarrow \alpha^{(1)} - \gamma\, g\alpha^{(1)}$
      * $\alpha^{(2)} \leftarrow \alpha^{(2)} - \gamma\, g\alpha^{(2)}$
      * $\beta \leftarrow \beta - \gamma\, g\beta$
* Backprop Ex #2: for a neural network
  * Given: decision function $\hat{y} = h_\theta(x) = \sigma((\alpha^{(3)})^T \cdot \sigma((\alpha^{(2)})^T \cdot \sigma((\alpha^{(1)})^T \cdot x)))$
  * loss function $J = \ell(\hat{y}, y^*) = y^*\log(\hat{y}) + (1-y^*)\log(1-\hat{y})$
* Forward
  * Given $x, \alpha^{(1)}, \alpha^{(2)}, \alpha^{(3)}, y^*$
  * $z^{(0)} = x$
  * for $i = 1, 2, 3$
    * $u^{(i)} = (\alpha^{(i)})^T \cdot z^{(i-1)}$
    * $z^{(i)} = \sigma(u^{(i)})$
  * $\hat{y} = z^{(3)}$
  * $J = \ell(\hat{y}, y^*)$
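
A minimal end-to-end sketch for Ex #2: a forward pass through the three sigmoid layers, a backward pass by repeated chain rule, and one SGD step. The layer sizes, initialization, learning rate $\gamma$, and the leading minus on the loss (so that stepping opposite the gradient decreases it) are my assumptions, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# assumed sizes: 4 inputs, two hidden layers of 3 units, 1 output
sizes = [4, 3, 3, 1]
alphas = [rng.normal(scale=0.1, size=(sizes[i], sizes[i + 1])) for i in range(3)]

def forward(x, alphas):
    # z^(0) = x; u^(i) = (alpha^(i))^T z^(i-1); z^(i) = sigmoid(u^(i))
    zs = [x]
    for A in alphas:
        zs.append(sigmoid(A.T @ zs[-1]))
    return zs  # zs[-1][0] is yhat

x, y_star, gamma = rng.normal(size=4), 1.0, 0.1
zs = forward(x, alphas)
yhat = zs[-1][0]
J = -(y_star * np.log(yhat) + (1 - y_star) * np.log(1 - yhat))  # negative cross-entropy

# backward pass: gz holds dJ/dz^(i), propagated layer by layer with the chain rule
gz = np.array([-(y_star / yhat) + (1 - y_star) / (1 - yhat)])   # dJ/d z^(3)
grads = [None] * 3
for i in reversed(range(3)):
    gu = gz * zs[i + 1] * (1 - zs[i + 1])   # dJ/du^(i+1): sigmoid'(u) = z (1 - z)
    grads[i] = np.outer(zs[i], gu)          # dJ/d alpha^(i+1)
    gz = alphas[i] @ gu                     # dJ/d z^(i)

# step opposite the gradient
alphas = [A - gamma * g for A, g in zip(alphas, grads)]
```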