Lecture 13 Neural Network + Backpropagation

Activation Functions

  • Sigmoid / Logistic Function
    • $\frac{1}{1 + \exp(-a)}$
  • Tanh
    • Like the logistic function but shifted to range $[-1, +1]$
  • ReLU, often used in vision tasks
    • rectified linear unit
    • Linear with cutoff at zero
    • $\max(0, wx + b)$
    • Soft version: $\log(\exp(x) + 1)$
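A minimal NumPy sketch of the four activations above (the function names are my own):

```python
import numpy as np

def sigmoid(a):
    """Sigmoid / logistic function: squashes inputs into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    """Like the logistic function but shifted/scaled to (-1, +1)."""
    return np.tanh(a)

def relu(a):
    """Rectified linear unit: linear with a cutoff at zero."""
    return np.maximum(0.0, a)

def softplus(a):
    """Soft version of ReLU: log(exp(a) + 1)."""
    return np.log(np.exp(a) + 1.0)
```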

Objective Function

  • Quadratic Loss
    • the same objective as Linear Regression
    • i.e. MSE
  • Cross Entropy
    • the same objective as Logistic Regression
    • i.e. negative log likelihood
    • this requires probabilities, so we add an additional "softmax" layer at the end of our network
    • steeper gradient than the quadratic loss when the prediction is far from the label, which tends to speed up learning
| Loss | Forward | Backward |
| --- | --- | --- |
| Quadratic | $J = \frac{1}{2}(y - y^*)^2$ | $\frac{dJ}{dy} = y - y^*$ |
| Cross Entropy | $J = y^* \log(y) + (1 - y^*) \log(1 - y)$ | $\frac{dJ}{dy} = \frac{y^*}{y} + \frac{1 - y^*}{y - 1}$ |
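Both rows of the table can be checked numerically. The sketch below follows the table's sign convention (the usual negative log-likelihood flips the cross-entropy signs) and verifies each backward column against a centered finite difference:

```python
import numpy as np

def quadratic(y, y_star):
    # Forward: J = 1/2 (y - y*)^2 ; Backward: dJ/dy = y - y*
    return 0.5 * (y - y_star) ** 2, y - y_star

def cross_entropy(y, y_star):
    # Forward/backward exactly as in the table's sign convention
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    return J, dJ_dy

# Check each backward column against a centered finite difference
eps = 1e-6
y, y_star = 0.7, 1.0
for loss in (quadratic, cross_entropy):
    _, grad = loss(y, y_star)
    numeric = (loss(y + eps, y_star)[0] - loss(y - eps, y_star)[0]) / (2 * eps)
    assert abs(grad - numeric) < 1e-6
```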

Multi-class Output

  • Softmax: $y_k = \frac{\exp(b_k)}{\sum_{l=1}^{K} \exp(b_l)}$
  • Loss: $J = \sum_{k=1}^{K} y_k^* \log(y_k)$
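A small sketch of softmax and the multi-class loss. The max-shift is not in the notes; it is a standard numerical-stability trick and leaves the output unchanged since softmax is shift-invariant:

```python
import numpy as np

def softmax(b):
    # y_k = exp(b_k) / sum_l exp(b_l); shift by max(b) for numerical stability
    e = np.exp(b - np.max(b))
    return e / e.sum()

def multiclass_loss(y, y_star):
    # J = sum_k y*_k log(y_k)  (sign convention of these notes;
    # the usual negative log-likelihood is -J)
    return np.sum(y_star * np.log(y))
```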

Chain Rule

  • Def #1 Chain Rule
    • $y = f(u)$
    • $u = g(x)$
    • $\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$
  • Def #2 Chain Rule
    • $y = f(u_1, u_2)$
    • $u_2 = g_2(x)$
    • $u_1 = g_1(x)$
    • $\frac{dy}{dx} = \frac{dy}{du_1} \cdot \frac{du_1}{dx} + \frac{dy}{du_2} \cdot \frac{du_2}{dx}$
  • Def #3 Chain Rule
    • $y = f(u)$
    • $u = g(x)$
    • $\frac{dy_i}{dx_k} = \sum_{j=1}^{J} \frac{dy_i}{du_j} \cdot \frac{du_j}{dx_k}, \quad \forall i, k$
    • Backpropagation is just repeated application of the chain rule
  • Computation Graphs
    • not a Neural Network diagram
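Def #2 can be sanity-checked on a concrete (hypothetical) choice of intermediates, $u_1 = x^2$ and $u_2 = \sin(x)$ with $y = u_1 u_2$, by comparing the two-path sum against a finite difference:

```python
import math

# Def #2: y = f(u1, u2) with u1 = g1(x), u2 = g2(x)
# Hypothetical choices: u1 = x^2, u2 = sin(x), y = u1 * u2
def total_derivative(x):
    u1, u2 = x**2, math.sin(x)
    du1_dx, du2_dx = 2 * x, math.cos(x)
    dy_du1, dy_du2 = u2, u1          # partials of y = u1 * u2
    return dy_du1 * du1_dx + dy_du2 * du2_dx

# Compare against a centered finite difference of y(x) = x^2 sin(x)
x, eps = 1.3, 1e-6
y = lambda t: t**2 * math.sin(t)
numeric = (y(x + eps) - y(x - eps)) / (2 * eps)
assert abs(total_derivative(x) - numeric) < 1e-6
```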

Backpropagation

  • Backprop Ex #1
    • $y = f(x, z) = \exp(xz) + \frac{xz}{\log(x)} + \frac{\sin(\log(x))}{xz}$
    • Forward Computation
      • Given $x = 2$, $z = 3$
      • $a = xz$, $b = \log(x)$, $c = \sin(b)$, $d = \exp(a)$, $e = a/b$, $f = c/a$
      • $y = d + e + f$
    • Backward Computation
      • $g_y = dy/dy = 1$
      • $g_f = dy/df = 1$, $g_e = dy/de = 1$, $g_d = dy/dd = 1$
      • $g_c = dy/dc = (dy/df) \cdot (df/dc) = g_f \cdot (1/a)$
      • $g_b = dy/db = (dy/de) \cdot (de/db) + (dy/dc) \cdot (dc/db) = g_e \cdot (-a/b^2) + g_c \cdot \cos(b)$
      • $g_a = dy/da = (dy/de) \cdot (de/da) + (dy/dd) \cdot (dd/da) + (dy/df) \cdot (df/da) = g_e \cdot (1/b) + g_d \cdot \exp(a) + g_f \cdot (-c/a^2)$
      • $g_x = g_a \cdot z + g_b \cdot (1/x)$
      • $g_z = g_a \cdot x$
    • Updates for Backprop
      • $g_x = \frac{dy}{dx} = \sum_{k=1}^{K} \frac{dy}{du_k} \cdot \frac{du_k}{dx} = \sum_{k=1}^{K} g_{u_k} \cdot \frac{du_k}{dx}$
      • Reuse forward computation in backward computation
      • Reuse backward computation within itself
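Example #1 end-to-end, following the forward and backward steps above; note how the backward pass reuses the forward values $a$ through $f$:

```python
import math

def forward_backward(x, z):
    # Forward pass: intermediate quantities from the example
    a = x * z
    b = math.log(x)
    c = math.sin(b)
    d = math.exp(a)
    e = a / b
    f = c / a
    y = d + e + f

    # Backward pass: g_u denotes dy/du, reusing forward values
    gd = ge = gf = 1.0                           # y = d + e + f
    gc = gf * (1.0 / a)                          # f = c/a
    gb = ge * (-a / b**2) + gc * math.cos(b)     # e = a/b, c = sin(b)
    ga = ge * (1.0 / b) + gd * math.exp(a) + gf * (-c / a**2)
    gx = ga * z + gb * (1.0 / x)                 # a = xz, b = log(x)
    gz = ga * x
    return y, gx, gz
```

At the given point $x = 2$, $z = 3$, the analytic gradients agree with finite differences.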

Neural Network Training

  • Consider a 2-hidden-layer neural network
  • parameters are $\theta = [\alpha^{(1)}, \alpha^{(2)}, \beta]$
  • SGD training
    • Iterate until convergence:
      • Sample $i \in \{1, \ldots, N\}$
      • Compute gradient by backprop
        • $g_{\alpha^{(1)}} = \nabla_{\alpha^{(1)}} J^{(i)}(\theta)$
        • $g_{\alpha^{(2)}} = \nabla_{\alpha^{(2)}} J^{(i)}(\theta)$
        • $g_{\beta} = \nabla_{\beta} J^{(i)}(\theta)$
        • where $J^{(i)}(\theta) = \ell(h_\theta(x^{(i)}), y^{(i)})$
      • Step opposite the gradient
        • $\alpha^{(1)} \leftarrow \alpha^{(1)} - \gamma \, g_{\alpha^{(1)}}$
        • $\alpha^{(2)} \leftarrow \alpha^{(2)} - \gamma \, g_{\alpha^{(2)}}$
        • $\beta \leftarrow \beta - \gamma \, g_{\beta}$
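A sketch of this SGD loop on a stand-in problem: a linear model with squared loss, so the per-example gradient is available in closed form instead of via backprop. All names, shapes, and constants here are illustrative, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noiseless regression problem
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
Y = X @ theta_true

theta = np.zeros(3)   # parameters (the notes' alphas/beta collected into one vector)
gamma = 0.05          # learning rate

for step in range(1000):
    i = rng.integers(len(X))              # sample i from {1, ..., N}
    y_hat = X[i] @ theta                  # forward
    g = (y_hat - Y[i]) * X[i]             # gradient of J_i = 1/2 (y_hat - y_i)^2
    theta -= gamma * g                    # step opposite the gradient
```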
  • Backprop Ex #2: for neural network
    • Given: decision function $\hat{y} = h_\theta(x) = \sigma((\alpha^{(3)})^T \cdot \sigma((\alpha^{(2)})^T \cdot \sigma((\alpha^{(1)})^T \cdot x)))$
    • loss function $J = \ell(\hat{y}, y^*) = y^* \log(\hat{y}) + (1 - y^*) \log(1 - \hat{y})$
    • Forward
      • Given $x, \alpha^{(1)}, \alpha^{(2)}, \alpha^{(3)}, y^*$
      • $z^{(0)} = x$
      • for $i = 1, 2, 3$:
        • $u^{(i)} = (\alpha^{(i)})^T \cdot z^{(i-1)}$
        • $z^{(i)} = \sigma(u^{(i)})$
    • $\hat{y} = z^{(3)}$
    • $J = \ell(\hat{y}, y^*)$
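The forward pass of Example #2 as code, assuming each $\alpha^{(i)}$ is a weight matrix of shape (in_dim, out_dim) and the last layer has a single output:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alphas, y_star):
    # alphas = [alpha1, alpha2, alpha3], each shaped (in_dim, out_dim)
    z = x                        # z^(0) = x
    for alpha in alphas:         # for i = 1, 2, 3
        u = alpha.T @ z          # u^(i) = (alpha^(i))^T z^(i-1)
        z = sigmoid(u)           # z^(i) = sigma(u^(i))
    y_hat = z                    # y_hat = z^(3)
    # Loss in the notes' sign convention
    J = y_star * np.log(y_hat) + (1 - y_star) * np.log(1 - y_hat)
    return y_hat, J
```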