## Lecture 12: Neural Networks

## Background

* Neural Network Model
  * Independent variables
  * Weights
  * Hidden layer
  * Weights
  * Dependent variable (prediction)
* Artificial Model
  * Neuron: a node in a directed acyclic graph (DAG)
  * Weight: a multiplier on each edge
  * Activation function: a nonlinear thresholding function, which allows a neuron to "fire" when its input value is sufficiently high
  * Artificial neural network: a collection of neurons arranged into a DAG, which together define a differentiable function

## Example #1: Neural Network with 1 Hidden Layer and 2 Hidden Units

* Let σ be the activation function
* If σ is the sigmoid: σ(a) = 1 / (1 + exp(-a))
* x_i ∈ R
* z_i ∈ (0, 1) if σ is the sigmoid
* z_i ∈ R more generally
* z_1 = σ(α_11 x_1 + α_12 x_2 + α_10)
* z_2 = σ(α_21 x_1 + α_22 x_2 + α_20)
* y = σ(β_1 z_1 + β_2 z_2 + β_0) = σ(β_1 σ(α_11 x_1 + α_12 x_2 + α_10) + β_2 σ(α_21 x_1 + α_22 x_2 + α_20) + β_0)
* (Each unit is a logistic regression model)
* (Don't forget the intercept terms)
* y = Pr[Y = 1 | x, α, β]; predict with the Bayes optimal classifier: ŷ = h_{α,β}(x) = 1 if y > 0.5, 0 otherwise (see the code sketch below)
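
A minimal NumPy sketch of this forward pass. The parameter values below are made up for illustration; only the shapes and the two equations (hidden layer, then output) come from the notes.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, alpha, alpha0, beta, beta0):
    """Forward pass for the 1-hidden-layer, 2-hidden-unit network above.

    x      : input vector, shape (2,)
    alpha  : hidden-layer weights, shape (2, 2); alpha[j] = (alpha_j1, alpha_j2)
    alpha0 : hidden-layer intercepts, shape (2,)
    beta   : output weights (beta_1, beta_2)
    beta0  : output intercept
    """
    z = sigmoid(alpha @ x + alpha0)   # z_1, z_2
    y = sigmoid(beta @ z + beta0)     # y = Pr[Y = 1 | x]
    return y

# Illustrative (made-up) parameter values:
alpha = np.array([[1.0, -2.0],
                  [0.5,  3.0]])
alpha0 = np.array([0.1, -0.3])
beta, beta0 = np.array([2.0, -1.0]), 0.5

y = forward(np.array([1.0, 2.0]), alpha, alpha0, beta, beta0)
y_hat = 1 if y > 0.5 else 0           # Bayes optimal classifier threshold
print(y, y_hat)
```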

## Example #2: 1D Face Recognition

* D = {(1+μ, 0), (3+μ, 1)}
* Is D for classification or regression? Both!
* Which line is learned by linear regression on this data set? z_B(x)
  * z_A(x) = w_A x + b_A
  * z_B(x) = w_B x + b_B
  * z_C(x) = w_C x + b_C
* Which sigmoid is learned by logistic regression?
  * h_A(x) = σ(z_A(x))
  * h_B(x) = σ(z_B(x))
  * h_C(x) = σ(z_C(x))
* What happens if we increase the intercept b?
  * To z(x)? It shifts up (equivalently, shifts left)
  * To h(x)? It shifts left
* What changes in h_A(x) if we increase w_A? The sigmoid gets steeper
* What is the decision boundary for h_C(x)? The point x = 2
* What is h_E(x) = σ((h_C(x) + h_D(x))/2)?
  * Note: not σ((z_C(x) + z_D(x))/2)
  * h_E is our first neural network (see the sketch after this list)
  * Its decision boundary is a nonlinear function of x
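
A small numerical sketch of these questions, under assumed weights: the w and b values below (including the hypothetical h_D, which the lecture defines in a figure) are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# z(x) = w*x + b is a scoring line; h(x) = sigmoid(z(x)) squashes it into (0, 1).
def z(x, w, b):
    return w * x + b

def h(x, w, b):
    return sigmoid(z(x, w, b))

# Increasing the intercept b shifts z(x) up, so h(x)'s 0.5 crossing shifts left:
print(h(1.0, w=1.0, b=0.0))   # ~0.73 (crossing at x = 0)
print(h(1.0, w=1.0, b=2.0))   # ~0.95 (crossing moved to x = -2)

# Increasing w makes the sigmoid steeper around its crossing:
print(h(0.5, w=1.0, b=0.0), h(0.5, w=4.0, b=0.0))   # ~0.62 vs ~0.88

# h_C with decision boundary at x = 2 (one choice: w=1, b=-2) plus a hypothetical
# h_D; h_E composes them -- a tiny neural network with one hidden layer.
h_C = lambda x: h(x, w=1.0, b=-2.0)
h_D = lambda x: h(x, w=-1.0, b=4.0)
h_E = lambda x: sigmoid((h_C(x) + h_D(x)) / 2.0)
xs = np.linspace(-2.0, 6.0, 9)
print([round(float(h_E(x)), 3) for x in xs])  # rises then falls: not monotone in x
```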

## Neural Network Parameters

* The training objective is nonconvex in the parameters
* There is no unique optimal set of parameters (e.g., permuting the hidden units along with their weights gives the same function)

## Architectures

* Number of hidden layers (depth)
* Number of units per hidden layer (width)
* Type of activation function (nonlinearity)
* Form of objective function
* How to initialize parameters

## Example #3: Arbitrary Feedforward Neural Network (Matrix Form)

* Parameters
  * Inputs x_1, ..., x_M
  * Hidden layer sizes D_1 and D_2
  * α^(1) ∈ R^(M×D1), α^(2) ∈ R^(D1×D2)
  * β ∈ R^(D2)
* Computation (see the code sketch below)
  * z^(1) = σ((α^(1))^T x + b^(1))
    * σ is applied elementwise to the vector (α^(1))^T x + b^(1)
  * z^(2) = σ((α^(2))^T z^(1) + b^(2))
  * y = σ(β^T z^(2) + β_0)
* Fold in the intercept terms?
  * Assume x_1 = 1, z_1^(1) = 1, z_1^(2) = 1
  * Then drop β_0, b^(1), b^(2)
  * Caution: tricky to implement
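
A sketch of the matrix-form forward pass above, with illustrative dimensions M, D1, D2 and random placeholder weights (the values mean nothing; only the shapes match the notes).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Dimensions (illustrative choices): M inputs, D1 and D2 hidden units.
M, D1, D2 = 4, 3, 2
rng = np.random.default_rng(0)

# Parameters with the matrix shapes above (values are random placeholders).
alpha1 = rng.normal(size=(M, D1))   # alpha^(1) in R^(M x D1)
b1     = rng.normal(size=D1)        # b^(1)
alpha2 = rng.normal(size=(D1, D2))  # alpha^(2) in R^(D1 x D2)
b2     = rng.normal(size=D2)        # b^(2)
beta   = rng.normal(size=D2)        # beta in R^(D2)
beta0  = rng.normal()               # beta_0

def forward(x):
    z1 = sigmoid(alpha1.T @ x + b1)    # z^(1) = sigma((alpha^(1))^T x + b^(1))
    z2 = sigmoid(alpha2.T @ z1 + b2)   # z^(2) = sigma((alpha^(2))^T z^(1) + b^(2))
    y  = sigmoid(beta @ z2 + beta0)    # y = sigma(beta^T z^(2) + beta_0)
    return y

print(forward(rng.normal(size=M)))
```

Folding in the intercepts, as the notes suggest, would replace b1, b2, and beta0 with a constant-1 unit prepended to x, z1, and z2 and an extra row in each weight matrix.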

## Building a Neural Net

* How many hidden units D should a layer have, relative to the number of inputs M?
  * D = M
  * D < M
  * D > M => feature engineering
* Theoretical answer:
  * A neural network with 1 hidden layer is a universal function approximator
  * For any continuous function g(x) and any ε > 0, there exists a 1-hidden-layer neural net h_θ(x) such that |h_θ(x) - g(x)| < ε for all x, assuming sigmoid activation (a constructive sketch follows below)
* Empirical answer:
  * After 2006, deep networks became easier to train than shallow networks for many problems
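
A constructive sketch of the universal-approximation idea: pairs of steep sigmoid hidden units form "bumps", and a linear output layer weights each bump by the target function's value at the bump's center. The target g(x), the interval, the number of units, and the steepness k are all illustrative choices; this conveys the proof intuition, it is not a trained network.

```python
import numpy as np

def sigmoid(a):
    # clip to avoid overflow warnings from very steep units
    return 1.0 / (1.0 + np.exp(-np.clip(a, -50.0, 50.0)))

# Target continuous function to approximate (an arbitrary illustrative choice).
def g(x):
    return np.sin(x) + 0.3 * x

lo, hi, n_bumps, k = 0.0, 6.0, 300, 500.0    # k = steepness of each hidden unit
edges = np.linspace(lo, hi, n_bumps + 1)
centers = 0.5 * (edges[:-1] + edges[1:])

def h(x):
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]           # shape (n, 1)
    bumps = sigmoid(k * (x - edges[:-1])) - sigmoid(k * (x - edges[1:]))
    return bumps @ g(centers)                # output layer: weighted sum of bumps

xs = np.linspace(lo + 0.1, hi - 0.1, 1000)
print(np.max(np.abs(h(xs) - g(xs))))         # max error; shrinks as n_bumps and k grow
```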