CS489/698: Intro to ML Lecture 03: Multi-layer Perceptron
Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator
The XOR problem
X = { (0,0), (0,1), (1,0), (1,1) }, y = { -1, 1, 1, -1 }
f*(x) = sign[ f_xor(x) - 1/2 ], where f_xor(x) = (x_1 ∧ ¬x_2) ∨ (¬x_1 ∧ x_2)
No separating hyperplane
x_1 = (0,0), y_1 = -1  ⇒  b < 0
x_2 = (0,1), y_2 = 1   ⇒  w_2 + b > 0
x_3 = (1,0), y_3 = 1   ⇒  w_1 + b > 0
x_4 = (1,1), y_4 = -1  ⇒  w_1 + w_2 + b < 0
Contradiction! Adding the middle two inequalities gives (w_1 + w_2 + b) + b > 0, while adding the first and last gives (w_1 + w_2 + b) + b < 0
[at least one of the inequalities has to be strict]
Exercise: what happens if we run perceptron/winnow on this example? (see the sketch below)
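For the exercise, here is a minimal perceptron run on the XOR data (the bias-feature encoding and the 50-epoch cap are choices for this sketch, not from the slides). Since no separating hyperplane exists, the mistake count never reaches zero:

```python
import numpy as np

# XOR data with labels in {-1, +1}; a bias feature is appended so the
# hyperplane w . [x, 1] = 0 can have an offset.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

w = np.zeros(3)
mistakes_per_pass = []
for epoch in range(50):
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:      # misclassified (or on the boundary)
            w += yi * xi            # perceptron update
            mistakes += 1
    mistakes_per_pass.append(mistakes)

print(mistakes_per_pass)  # never reaches 0: the weights keep cycling
```

On this data the weights return to zero after every pass, so the perceptron cycles forever instead of converging.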
Fixing the problem
Our model (hyperplanes) underfits the data (the xor function)
Fix the representation, use a richer model; or fix the model, use a richer representation
NN: still uses a hyperplane, but learns the representation simultaneously
Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator
Two-layer perceptron
[diagram] input layer (x_1, x_2) -> 1st linear layer (weights u_11, u_12, u_21, u_22, offsets c_1, c_2) -> z_1, z_2 -> nonlinear activation function -> hidden layer h_1, h_2 (learned representation) -> 2nd linear layer (weights w_1, w_2, offset b) -> output layer ŷ
The nonlinear transform between the two linear layers makes all the difference!
Does it work? Rectified Linear Unit (ReLU)
x_1 = (0,0), y_1 = -1:  z_1 = (0,-1), h_1 = (0,0), ŷ_1 = -1
x_2 = (0,1), y_2 = 1:   z_2 = (1,0),  h_2 = (1,0), ŷ_2 = 1
x_3 = (1,0), y_3 = 1:   z_3 = (1,0),  h_3 = (1,0), ŷ_3 = 1
x_4 = (1,1), y_4 = -1:  z_4 = (2,1),  h_4 = (2,1), ŷ_4 = -1
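A quick NumPy check of this worked example. The first-layer weights U and offsets c below are the ones implied by the z values in the table; the output-layer weights w = (1, -2) and b = -0.5 are one choice that reproduces the stated predictions (an assumption; the slide's actual values are not in the extracted text):

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# First linear layer reproduces the z values above (z = U x + c);
# second-layer weights are one choice that yields the stated signs.
U = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = -0.5

Z = X @ U.T + c             # pre-activations
H = np.maximum(Z, 0.0)      # ReLU hidden layer (the learned representation)
y_hat = np.sign(H @ w + b)  # second linear layer + sign

print(Z)      # [[0,-1],[1,0],[1,0],[2,1]]
print(H)      # [[0,0],[1,0],[1,0],[2,1]]
print(y_hat)  # [-1, 1, 1, -1] -- matches the targets
```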
Multi-layer perceptron
[diagram] input layer x_1, x_2, ..., x_d in R^d -> fully connected weights -> hidden layer z_1, ..., z_k / h_1, ..., h_k in R^k -> fully connected weights -> output layer ŷ_1, ..., ŷ_m in R^m
Multi-layer perceptron (stacked)
[diagram] input x_1, ..., x_d feeds several stacked hidden layers (each z -> h) and finally the outputs ŷ_1, ..., ŷ_m; depth = number of layers, width = units per layer
Learned hierarchical nonlinear feature representation, followed by a linear predictor; the network is feed-forward
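A minimal sketch of the stacked feed-forward pass, assuming ReLU hidden layers; the layer widths and random weights below are placeholders for illustration, not values from the slides:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp_forward(x, weights, biases):
    """Feed-forward pass through stacked fully connected layers.
    Every layer but the last applies a nonlinearity; the last is linear."""
    h = x
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ h + b
        h = z if l == len(weights) - 1 else relu(z)  # linear predictor on top
    return h

# Example shapes: d=2 inputs, two hidden layers of width 4, m=1 output
rng = np.random.default_rng(0)
sizes = [2, 4, 4, 1]                       # depth = 3 weight layers
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(3)]
biases = [np.zeros(sizes[i + 1]) for i in range(3)]
print(mlp_forward(np.array([0.0, 1.0]), weights, biases))
```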
Activation functions
Sigmoid: saturates
Tanh: smooth
Rectified Linear (ReLU): nonsmooth
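For concreteness, the three activations as NumPy one-liners (standard definitions, not taken from the slide's plots):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^{-z}); outputs in (0, 1), saturates for large |z|
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # smooth, outputs in (-1, 1); also saturates for large |z|
    return np.tanh(z)

def relu(z):
    # max(0, z); nonsmooth at 0, cheap, does not saturate for z > 0
    return np.maximum(z, 0.0)
```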
Underfitting vs Overfitting
Linear predictors (perceptron / winnow / linear regression) underfit
NNs learn hierarchical nonlinear features jointly with a linear predictor; they may overfit; tons of heuristics exist to control this (some later)
Size of NN: # of weights/connections ~ 1 billion; human brain ~ 10^6 billion (estimated)
Weights training
ŷ = q(x; Θ), mapping x in R^d to ŷ in R^m through the network
Need a loss to measure the difference between the prediction ŷ and the truth y, e.g., (ŷ - y)^2; more later
Need a training set {(x_1, y_1), ..., (x_n, y_n)} to train the weights Θ
Gradient Descent
Step along the negative (generalized) gradient of the training objective; each step costs O(n), since the gradient sums over all n training examples! (a sketch of the update is below)
Step size (learning rate): constant if L is smooth, diminishing otherwise
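A minimal statement of the update in LaTeX. The averaged per-example form of L is an assumption for this sketch; the slide only names L and the O(n) cost:

```latex
% Gradient descent on the training objective L (averaged per-example loss is assumed)
\[
  L(\Theta) \;=\; \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(q(x_i;\Theta),\, y_i\bigr),
  \qquad
  \Theta_{t+1} \;=\; \Theta_t \;-\; \eta_t\, \nabla L(\Theta_t)
\]
% Each step touches all n examples, hence the O(n) cost per iteration;
% \eta_t is the step size: constant if L is smooth, diminishing otherwise.
```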
Stochastic Gradient Descent (SGD)
A single random sample suffices in place of the average over all n samples: its gradient is correct in expectation
Diminishing step size, e.g., 1/sqrt(t) or 1/t
Refinements: averaging, momentum, variance-reduction, etc.
In practice: sample w/o replacement; cycle; permute in each pass (a sketch follows)
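A minimal SGD sketch using the squared loss from the earlier slide on a linear model; the synthetic data, the linear model, and the 20-pass budget are assumptions for illustration only:

```python
import numpy as np

# Synthetic regression data (assumed for illustration): y = x . w_true + noise
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
t = 0
for epoch in range(20):
    for i in rng.permutation(n):                 # permute in each pass
        t += 1
        eta = 1.0 / np.sqrt(t)                   # diminishing step size
        grad_i = 2 * (X[i] @ w - y[i]) * X[i]    # gradient of (x_i . w - y_i)^2
        w -= eta * grad_i                        # one cheap update per sample

print(np.mean((X @ w - y) ** 2))  # training loss after SGD
```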
A little history on optimization
Gradient descent mentioned first in (Cauchy, 1847)
First rigorous convergence proof (Curry, 1944)
SGD proposed and analyzed (Robbins & Monro, 1951)
Herbert Robbins (1915-2001)
Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator
Backpropagation
A fancy name for the chain rule for derivatives: if f(x) = g[ h(x) ], then f'(x) = g'[ h(x) ] * h'(x)
Efficiently computes the derivative in an NN
Two passes; complexity = O(size(NN))
forward pass: compute function values sequentially
backward pass: compute derivatives sequentially
Algorithm
[pseudocode for the forward pass and the backward pass; f: activation function, J = L + λΩ: training objective] (a sketch follows)
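The slide's pseudocode did not survive extraction; the following is a minimal sketch of the two passes for the two-layer ReLU network with squared loss (not the slide's exact algorithm):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_backward(x, y, U, c, w, b):
    """One forward and one backward pass for a two-layer ReLU network
    with squared loss L = (y_hat - y)^2. Returns the loss and all gradients."""
    # Forward pass: compute and cache intermediate values
    z = U @ x + c               # first linear layer
    h = relu(z)                 # hidden activations
    y_hat = w @ h + b           # second linear layer (scalar output)
    loss = (y_hat - y) ** 2

    # Backward pass: apply the chain rule from the output back to the weights
    d_yhat = 2 * (y_hat - y)    # dL/dy_hat
    d_w = d_yhat * h            # dL/dw
    d_b = d_yhat                # dL/db
    d_h = d_yhat * w            # dL/dh
    d_z = d_h * (z > 0)         # ReLU derivative: 1 where z > 0, else 0
    d_U = np.outer(d_z, x)      # dL/dU (z = U x + c)
    d_c = d_z                   # dL/dc
    return loss, d_U, d_c, d_w, d_b

# Tiny usage example with assumed shapes (2 inputs, 2 hidden units)
rng = np.random.default_rng(0)
U, c = rng.standard_normal((2, 2)), np.zeros(2)
w, b = rng.standard_normal(2), 0.0
print(forward_backward(np.array([1.0, 0.0]), 1.0, U, c, w, b)[0])
```

Both passes touch each weight a constant number of times, which is where the O(size(NN)) complexity comes from.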
History (perfect for a course project!)
Outline: Failure of Perceptron, Neural Network, Backpropagation, Universal Approximator
Rationals are dense in R
Any real number can be approximated by some rational number arbitrarily well
Or, in fancy mathematical language: R is the domain of interest, Q is the subset "in our pocket", and the absolute difference is the metric for approximation (a formal statement follows)
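A formalization of the density statement; attaching the slide's informal labels to these symbols is my reading of the fragment:

```latex
% Density of Q in R, with the informal labels attached:
%   R -- domain of interest,  Q -- subset "in our pocket",  |x - q| -- metric for approximation
\[
  \forall\, x \in \mathbb{R},\ \forall\, \varepsilon > 0,\ \exists\, q \in \mathbb{Q}
  \quad\text{such that}\quad |x - q| < \varepsilon .
\]
```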
Kolmogorov-Arnold Theorem
Theorem (Kolmogorov, Arnold, Lorentz, Sprecher, ...). Any continuous function g: [0,1]^d -> R can be written (exactly!) as a composition of univariate functions and addition (stated below)
Binary addition is the only multivariate function needed!
Solves Hilbert's 13th problem!
Caveat: the outer univariate functions must be constructed from g, which is unknown
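The representation itself did not survive extraction; below is the standard Kolmogorov-Arnold-Sprecher form, presumably what the slide displayed:

```latex
% Kolmogorov--Arnold representation (standard form; the slide's exact notation is not recoverable):
% the inner functions \phi_{q,p} are universal, while the outer functions \Phi_q depend on g.
\[
  g(x_1, \dots, x_d) \;=\; \sum_{q=0}^{2d} \Phi_q\!\Bigl( \sum_{p=1}^{d} \phi_{q,p}(x_p) \Bigr),
\]
% with all \Phi_q and \phi_{q,p} continuous univariate functions.
```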
Andrey Kolmogorov (1903-1987)
Foundations of the Theory of Probability
"Every mathematician believes that he is ahead of the others. The reason none state this belief in public is because they are intelligent people."
Universal Approximator
Theorem (Cybenko, Hornik et al., Leshno et al., ...). Any continuous function g: [0,1]^d -> R can be uniformly approximated to arbitrary precision by a two-layer NN whose activation function f (1) is locally bounded, (2) has a negligible closure of its set of discontinuity points, and (3) is not a polynomial
The conditions are necessary in some sense, and they include (almost) all activation functions used in practice (a numerical illustration follows)
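An illustration (not a proof) of universality in one dimension: fit only the second linear layer of a wide two-layer ReLU network, on top of random first-layer features, by least squares. The target function, the width, and the random-feature construction are assumptions for this sketch:

```python
import numpy as np

# Target: a continuous function on [0, 1] (chosen for illustration)
def g(x):
    return np.sin(2 * np.pi * x) + 0.5 * np.cos(5 * np.pi * x)

rng = np.random.default_rng(0)
k = 200                                      # number of hidden ReLU units
a = rng.uniform(-10, 10, size=k)             # random first-layer weights
c = rng.uniform(-10, 10, size=k)             # random first-layer offsets

x = np.linspace(0, 1, 500)
H = np.maximum(a * x[:, None] + c, 0.0)      # hidden layer: ReLU(a_j x + c_j)
H = np.column_stack([H, np.ones_like(x)])    # output bias term

# Fit only the second (linear) layer by least squares
w, *_ = np.linalg.lstsq(H, g(x), rcond=None)
err = np.max(np.abs(H @ w - g(x)))
print(f"max approximation error with {k} hidden units: {err:.4f}")
```

Increasing the width k drives the maximum error down, which is the behavior the theorem guarantees for non-polynomial activations like ReLU.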
Caveat and Remedy
NNs were praised for being universal, but we shall see later that many kernels are universal as well; universality is desirable, but perhaps not THE explanation
Caveat: may need exponentially many hidden units
Remedy: increasing depth may reduce network size, exponentially! (can be a course project; ask for references)
Questions?