Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edition CHAPTER 11: Multilayer Perceptrons ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml2e
Overview Neural networks, brains, and computers Perceptrons Training Classification and regression Linear separability Multilayer perceptrons Universal approximation Backpropagation
Neural Networks Networks of processing units (neurons) with connections (synapses) between them Large number of neurons: ~10^10 Large connectivity: ~10^5 connections per neuron Parallel processing Distributed computation/memory Robust to noise, failures
Understanding the Brain Levels of analysis (Marr, 1982) 1. Computational theory 2. Representation and algorithm 3. Hardware implementation Reverse engineering: From hardware to theory Parallel processing: SIMD vs MIMD Neural net: SIMD with modifiable local memory Learning: Update by training/experience
Perceptron (Rosenblatt, 1962)
What a Perceptron Does Regression: y = wx + w0 (linear fit) Classification: y = 1(wx + w0 > 0) (linear discrimination) [Figure: perceptron diagrams with input x, bias unit x0 = +1, weights w and w0, and output y, optionally passed through a threshold/sigmoid s]
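A minimal sketch (not from the slides) of these two uses of a perceptron in NumPy; the function names and sample values are illustrative assumptions:

```python
import numpy as np

def perceptron_regression(x, w, w0):
    """Linear fit: y = w . x + w0."""
    return np.dot(w, x) + w0

def perceptron_classification(x, w, w0):
    """Linear discriminant: y = 1 if w . x + w0 > 0, else 0."""
    return 1 if np.dot(w, x) + w0 > 0 else 0

x = np.array([0.5, -1.0])
w = np.array([2.0, 1.0])
print(perceptron_regression(x, w, 0.5))      # 0.5
print(perceptron_classification(x, w, 0.5))  # 1
```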
Regression: K Outputs Regression: y_i = w_i^T x, i.e., y = Wx Classification: softmax outputs y_i = exp(w_i^T x) / Σ_k exp(w_k^T x); choose C_i if y_i = max_k y_k
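A hedged sketch of the K-output case, stacking the K weight vectors into a matrix W so that y = Wx; all names and shapes are illustrative assumptions:

```python
import numpy as np

def k_output_regression(x, W, w0):
    """K linear outputs: y = Wx + w0, with W of shape (K, d)."""
    return W @ x + w0

def k_class_posteriors(x, W, w0):
    """Softmax over the K linear outputs: y_i = exp(o_i) / sum_k exp(o_k)."""
    o = W @ x + w0
    e = np.exp(o - o.max())   # subtract max for numerical stability
    return e / e.sum()
```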
Training Online (instances seen one by one) vs batch (whole sample) learning: No need to store the whole sample Problem may change in time Wear and degradation in system components Stochastic gradient descent: Update after a single pattern Generic update rule (LMS rule): Δw_j = η (r^t - y^t) x_j^t, i.e., update = learning factor · (desired output - actual output) · input
Training a Perceptron: Regression Regression (linear output): E^t(w | x^t, r^t) = (1/2)(r^t - y^t)^2 with y^t = w^T x^t Update: Δw_j^t = η (r^t - y^t) x_j^t
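A small sketch of this online training loop, assuming a NumPy data matrix X (one pattern per row) and targets r; eta and the epoch count are arbitrary illustrative choices:

```python
import numpy as np

def train_regression_perceptron(X, r, eta=0.01, epochs=100):
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for t in range(n):            # online: one instance at a time
            y = w @ X[t] + w0
            err = r[t] - y
            w += eta * err * X[t]     # LMS update
            w0 += eta * err
    return w, w0
```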
Classification Single sigmoid output: y^t = sigmoid(w^T x^t), trained with cross-entropy error E^t = -r^t log y^t - (1 - r^t) log(1 - y^t) K>2: softmax outputs Same as for linear discriminants from Chapter 10, except we update after each instance
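The sigmoid case differs only in the forward computation; with cross-entropy error the update keeps the same (desired - actual) · input form. A minimal illustrative sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_sigmoid_perceptron(X, r, eta=0.1, epochs=100):
    n, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for t in range(n):
            y = sigmoid(w @ X[t] + w0)   # estimates P(C1 | x)
            w += eta * (r[t] - y) * X[t]
            w0 += eta * (r[t] - y)
    return w, w0
```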
Learning Boolean AND
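AND is linearly separable, so a single perceptron suffices. One hand-picked solution (illustrative, not necessarily the weights a learner would find) is w1 = w2 = 1, w0 = -1.5:

```python
def and_perceptron(x1, x2):
    # Fires only when both inputs are 1: 1 + 1 - 1.5 > 0.
    return 1 if x1 + x2 - 1.5 > 0 else 0

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_perceptron(a, b))  # matches Boolean AND
```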
XOR No w0, w1, w2 satisfy: w0 ≤ 0, w2 + w0 > 0, w1 + w0 > 0, w1 + w2 + w0 ≤ 0; the middle two sum to w1 + w2 + 2w0 > 0, which together with w0 ≤ 0 contradicts the last (Minsky and Papert, 1969)
Multilayer Perceptrons (Rumelhart et al., 1986)
MLP as Universal Approximator x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
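A sketch of an MLP computing XOR by wiring up exactly this decomposition; the weights are hand-set for illustration rather than learned:

```python
def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    z1 = step(x1 - x2 - 0.5)    # hidden unit: x1 AND ~x2
    z2 = step(x2 - x1 - 0.5)    # hidden unit: ~x1 AND x2
    return step(z1 + z2 - 0.5)  # output unit: z1 OR z2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_mlp(a, b))  # 0, 1, 1, 0
```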
Backpropagation z_h = sigmoid(w_h^T x), y = Σ_h v_h z_h + v_0; the error gradient is propagated backward through the chain rule: ∂E/∂w_hj = (∂E/∂y)(∂y/∂z_h)(∂z_h/∂w_hj)
Regression Forward: z_h^t = sigmoid(w_h^T x^t), y^t = Σ_h v_h z_h^t + v_0 Backward: E(W, v | X) = (1/2) Σ_t (r^t - y^t)^2 Updates: Δv_h = η Σ_t (r^t - y^t) z_h^t and Δw_hj = η Σ_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
Regression with Multiple Outputs E(W, V | X) = (1/2) Σ_t Σ_i (r_i^t - y_i^t)^2 with y_i^t = Σ_h v_ih z_h^t + v_i0 Updates: Δv_ih = η Σ_t (r_i^t - y_i^t) z_h^t and Δw_hj = η Σ_t [Σ_i (r_i^t - y_i^t) v_ih] z_h^t (1 - z_h^t) x_j^t [Figure: network with inputs x_j, hidden units z_h, weights w_hj and v_ih, outputs y_i]
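A compact sketch of the whole backpropagation loop for single-output regression, following the update rules above but applied per pattern (stochastic rather than batch); H, eta, the epoch count, and the initialization range are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H=5, eta=0.01, epochs=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.uniform(-0.01, 0.01, (H, d + 1))  # hidden weights (with bias)
    V = rng.uniform(-0.01, 0.01, H + 1)       # output weights (with bias)
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend bias input x0 = +1
    for _ in range(epochs):
        for t in rng.permutation(n):
            z = sigmoid(W @ Xb[t])            # forward: hidden activations
            zb = np.concatenate(([1.0], z))   # bias unit z0 = +1
            y = V @ zb                        # forward: linear output
            err = r[t] - y
            V += eta * err * zb               # backward: output layer
            # backward: hidden layer, using err * v_h * z_h * (1 - z_h)
            W += eta * np.outer(err * V[1:] * z * (1 - z), Xb[t])
    return W, V
```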
[Figure: universal approximation with sigmoid hidden units: each z_h = sigmoid(w_h x + w_0) is a soft step, and the weighted sum Σ_h v_h z_h pieces these steps together to fit the target function]
Two-Class Discrimination One sigmoid output: y^t estimates P(C1 | x^t), and 1 - y^t estimates P(C2 | x^t)
K>2 Classes One softmax output per class: y_i^t = exp(o_i^t) / Σ_k exp(o_k^t) with o_i^t = Σ_h v_ih z_h^t + v_i0, approximating P(C_i | x^t); train with cross-entropy error
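A minimal sketch of this K-class output layer; only the forward computation is shown, and the shapes of z and V are illustrative assumptions:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())   # subtract max for numerical stability
    return e / e.sum()

def class_posteriors(z, V):
    """z: hidden activations (H,), V: output weights (K, H+1)."""
    zb = np.concatenate(([1.0], z))  # bias unit z0 = +1
    return softmax(V @ zb)           # y_i approximates P(C_i | x)
```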
Multiple Hidden Layers MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks
Improving Convergence Momentum: Δw_i^t = -η ∂E^t/∂w_i + α Δw_i^{t-1}, which averages past gradients to smooth the trajectory Adaptive learning rate: increase η by a constant amount while the error keeps decreasing, decrease it multiplicatively otherwise
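An illustrative sketch of both tricks; the constants alpha, a, and b are arbitrary placeholder values, not recommendations from the slides:

```python
def momentum_update(w, grad, prev_delta, eta=0.01, alpha=0.9):
    """Blend the current gradient step with the previous step."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

def adapt_learning_rate(eta, err, prev_err, a=1e-4, b=0.5):
    """Grow eta additively if error fell, shrink it multiplicatively otherwise."""
    return eta + a if err < prev_err else eta * (1 - b)
```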
Overfitting/Overtraining Number of weights: H(d + 1) + K(H + 1) for d inputs, H hidden units, and K outputs; as H (or training time) grows, training error keeps falling but validation error eventually rises
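A quick helper making the weight count concrete; the +1 terms are the bias connections:

```python
def mlp_weight_count(d, H, K):
    """Weights in a one-hidden-layer MLP: H*(d+1) hidden + K*(H+1) output."""
    return H * (d + 1) + K * (H + 1)

print(mlp_weight_count(d=10, H=5, K=3))  # 73
```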
Conclusion Perceptrons handle linearly separable problems Multilayer perceptrons are universal approximators: one hidden layer suffices to approximate any reasonable function Differentiable (e.g., logistic) activation functions enable gradient descent-based backpropagation Backpropagation solves the structural credit assignment problem Susceptible to local optima Susceptible to overfitting
Structured MLP (Le Cun et al., 1989)
Weight Sharing
Hints (Abu-Mostafa, 1995) Invariance to translation, rotation, size Virtual examples: add transformed copies of training instances with the same label Augmented error: E' = E + λ_h E_h If x and x' should produce the same output: E_h = [g(x | θ) - g(x' | θ)]^2 Approximation hint: if g(x) is known to lie in an interval [a_x, b_x], penalize only outputs that fall outside it
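A minimal sketch of the virtual-examples idea, assuming translation invariance and a 2-D NumPy data matrix; the roll-based shift is an illustrative stand-in for the real transformation:

```python
import numpy as np

def virtual_examples(X, r, shifts=(-1, 1)):
    """Augment (X, r) with shifted copies of each row, labels unchanged."""
    Xs, rs = [X], [r]
    for s in shifts:
        Xs.append(np.roll(X, s, axis=1))  # translated copies
        rs.append(r)                      # invariance: same labels
    return np.vstack(Xs), np.concatenate(rs)
```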
Tuning the Network Size Destructive: weight decay, i.e., penalize large weights with E' = E + (λ/2) Σ_i w_i^2, giving Δw_i = -η ∂E/∂w_i - ηλ w_i Constructive: growing networks, e.g., dynamic node creation (Ash, 1989) and cascade correlation (Fahlman and Lebiere, 1989)
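A one-line sketch of the weight-decay update implied by the penalized error above; eta and lam are illustrative values:

```python
def weight_decay_update(w, grad, eta=0.01, lam=1e-4):
    """Gradient step plus a pull toward zero, so unused weights can be pruned."""
    return w - eta * grad - eta * lam * w
```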
Bayesian Learning Consider weights w_i as random variables with prior p(w_i) Maximizing the posterior p(w | X) ∝ p(X | w) p(w) is equivalent to weight decay / ridge regression / regularization: cost = data-misfit + λ · complexity More about Bayesian methods in Chapter 14
Dimensionality Reduction
Learning Time Applications: Sequence recognition: speech recognition Sequence reproduction: time-series prediction Sequence association Network architectures: Time-delay networks (Waibel et al., 1989) Recurrent networks (Rumelhart et al., 1986)
Time-Delay Neural Networks
Recurrent Networks
Unfolding in Time