Multilayer Perceptrons


1 Lecture Slides for INTRODUCTION TO Machine Learning 2nd Edition CHAPTER 11: Multilayer Perceptrons ETHEM ALPAYDIN The MIT Press, 2010 Edited and expanded for CS 4641 by Chris Simpkins http://

2 Overview Neural networks, brains, and computers Perceptrons Training Classification and regression Linear separability Multilayer perceptrons Universal approximation Backpropagation 2

3 Neural Networks Networks of processing units (neurons) with connections (synapses) between them Large number of neurons; large connectivity: ~10^5 connections per neuron Parallel processing Distributed computation/memory Robust to noise, failures 3

4 Understanding the Brain Levels of analysis (Marr, 1982): 1. Computational theory 2. Representation and algorithm 3. Hardware implementation Reverse engineering: from hardware to theory Parallel processing: SIMD vs MIMD Neural net: SIMD with modifiable local memory Learning: update by training/experience 4

5 Perceptron (Rosenblatt, 1962) 5

6 What a Perceptron Does Regression: y = wx + w_0 Classification: y = 1(wx + w_0 > 0) [Figures: linear fit (regression) and linear discrimination (classification); the bias is handled via an extra input x_0 = +1 with weight w_0] 6
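A minimal sketch of both uses of a perceptron, assuming NumPy; the function names are my own, not from the slides:

```python
import numpy as np

def perceptron_regression(x, w, w0):
    # Linear fit: y = w.x + w0
    return np.dot(w, x) + w0

def perceptron_classification(x, w, w0):
    # Linear discriminant: choose C1 if w.x + w0 > 0
    return 1 if np.dot(w, x) + w0 > 0 else 0
```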

7 K Outputs Regression: one linear output y_i = w_i^T x per response Classification: K softmax outputs; choose C_i if y_i is the maximum 7

8 Training Online (instances seen one by one) vs batch (whole sample) learning: No need to store the whole sample Problem may change in time Wear and degradation in system components Stochastic gradient descent: update after a single pattern Generic update rule (LMS rule): Δweight = learning factor × (desired output − actual output) × input 8

9 Training a Perceptron: Regression Regression (linear output): per-instance squared error E^t(w | x^t, r^t) = (1/2)(r^t − y^t)^2 with y^t = w^T x^t, giving the online update Δw_j = η (r^t − y^t) x_j^t 9
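A minimal sketch of this online training loop (NumPy assumed; eta and epochs are illustrative defaults):

```python
import numpy as np

def train_perceptron_regression(X, r, eta=0.01, epochs=50):
    """Online (per-instance) gradient descent for a linear-output perceptron."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for t in np.random.permutation(N):   # one pattern at a time
            y = w @ X[t] + w0                # forward: y^t = w.x^t + w0
            err = r[t] - y                   # (r^t - y^t)
            w += eta * err * X[t]            # delta_w_j = eta * (r^t - y^t) * x_j^t
            w0 += eta * err                  # bias treated as weight for x_0 = +1
    return w, w0
```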

10 Classification Single sigmoid output K>2: softmax outputs Same as for linear discriminants from chapter 10, except we update after each instance 10
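An analogous sketch for the single-sigmoid-output classifier; the per-instance update has the same (r − y)·x form as the LMS rule above (names and defaults are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_perceptron_classification(X, r, eta=0.1, epochs=50):
    """Online training of a single sigmoid-output perceptron (labels r in {0, 1})."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(epochs):
        for t in np.random.permutation(N):
            y = sigmoid(w @ X[t] + w0)       # y^t estimates P(C1 | x^t)
            err = r[t] - y                   # gradient of cross-entropy w.r.t. the net input
            w += eta * err * X[t]            # same (r - y) * x form as the LMS rule
            w0 += eta * err
    return w, w0
```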

11 Learning Boolean AND 11
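One set of weights that implements AND is w_1 = w_2 = 1 with w_0 = −1.5; the specific values below are just an illustration of what training can converge to:

```python
def and_perceptron(x1, x2):
    # y = 1(x1 + x2 - 1.5 > 0): fires only when both inputs are 1
    return int(x1 + x2 - 1.5 > 0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_perceptron(a, b))   # outputs 0, 0, 0, 1
```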

12 XOR No w_0, w_1, w_2 satisfy: w_0 ≤ 0, w_2 + w_0 > 0, w_1 + w_0 > 0, w_1 + w_2 + w_0 ≤ 0 (Minsky and Papert, 1969); summing the middle two and using w_0 ≤ 0 contradicts the last, so XOR is not linearly separable 12

13 Multilayer Perceptrons (Rumelhart et al., 1986) 13

14 MLP as Universal Approximator x_1 XOR x_2 = (x_1 AND ~x_2) OR (~x_1 AND x_2) 14
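A sketch of this decomposition with threshold units; the particular weights below are one possible choice, not the ones drawn on the slide:

```python
def step(a):
    return int(a > 0)

def xor_mlp(x1, x2):
    # Hidden layer: z1 = x1 AND NOT x2, z2 = NOT x1 AND x2
    z1 = step(1.0 * x1 - 1.0 * x2 - 0.5)
    z2 = step(-1.0 * x1 + 1.0 * x2 - 0.5)
    # Output layer: z1 OR z2
    return step(z1 + z2 - 0.5)

print([xor_mlp(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```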

15 Backpropagation 15

16 Regression [Forward pass computes the output from input x; backward pass propagates the error derivatives back to update the weights] 16
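A minimal sketch of the forward and backward passes for an MLP with one sigmoid hidden layer and a linear output (regression), trained with stochastic updates; H, eta, epochs, and the initialization range are illustrative choices, not values from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp_regression(X, r, H=10, eta=0.01, epochs=200):
    """Stochastic backpropagation: one sigmoid hidden layer, one linear output.
    X: (N, d) inputs, r: (N,) targets."""
    N, d = X.shape
    rng = np.random.default_rng(0)
    W = rng.uniform(-0.01, 0.01, (H, d + 1))   # first-layer weights w_hj (with bias)
    v = rng.uniform(-0.01, 0.01, H + 1)        # second-layer weights v_h (with bias)
    for _ in range(epochs):
        for t in rng.permutation(N):
            x = np.append(X[t], 1.0)               # augment input with x_0 = +1
            z = np.append(sigmoid(W @ x), 1.0)     # forward: hidden activations z_h
            y = v @ z                              # forward: linear output y
            err = r[t] - y                         # (r^t - y^t)
            # backward: compute both layers' updates, then apply them
            delta_v = eta * err * z
            delta_W = eta * err * np.outer(v[:H] * z[:H] * (1.0 - z[:H]), x)
            v += delta_v
            W += delta_W
    return W, v
```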

17 Regression with Multiple Outputs [Network diagram: inputs x_j, first-layer weights w_hj, hidden units z_h, second-layer weights v_ih, outputs y_i] 17
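For reference, the batch update rules for this multi-output case, written out in the standard notation of the textbook (treat the exact indexing as an assumption on my part): y_i = Σ_h v_ih z_h + v_i0; Δv_ih = η Σ_t (r_i^t − y_i^t) z_h^t; Δw_hj = η Σ_t [ Σ_i (r_i^t − y_i^t) v_ih ] z_h^t (1 − z_h^t) x_j^t.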

18 [Figure-only slide] 18

19 [Figure-only slide] 19

20 [Figure: hidden-unit responses; panels labeled w_h x + w_0, z_h, and v_h z_h] 20

21 Two-Class Discrimination One sigmoid output y^t estimates P(C_1 | x^t); then P(C_2 | x^t) = 1 − y^t 21
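The error minimized for this single sigmoid output is the cross-entropy, E = −Σ_t [ r^t log y^t + (1 − r^t) log(1 − y^t) ]. A small sketch (the function name and the eps clipping for numerical safety are my own choices):

```python
import numpy as np

def cross_entropy(r, y, eps=1e-12):
    """r: 0/1 labels, y: sigmoid outputs interpreted as P(C1 | x^t)."""
    y = np.clip(y, eps, 1 - eps)
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))
```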

22 K>2 Classes 22
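With K > 2 classes the outputs are softmax, y_i = exp(o_i) / Σ_k exp(o_k). A minimal sketch (subtracting max(o) is a common numerical-stability trick, not something the slide specifies):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))   # shift for numerical stability
    return e / e.sum()          # y_i = exp(o_i) / sum_k exp(o_k)
```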

23 Multiple Hidden Layers MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks 23

24 Improving Convergence Momentum Adaptive learning rate 24
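The momentum idea is Δw^t = −η ∂E^t/∂w + α Δw^(t−1). A minimal sketch of one such step; the function name and the default η and α are illustrative, not from the slides:

```python
def momentum_step(w, grad, prev_delta, eta=0.01, alpha=0.9):
    """One gradient step with momentum; returns updated weights and the new delta."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta
```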

25 Overfitting/Overtraining Number of weights: H(d+1) + (H+1)K 25
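For example (illustrative numbers): with d = 10 inputs, H = 20 hidden units, and K = 3 outputs, the network has 20·(10+1) + (20+1)·3 = 220 + 63 = 283 weights, so model capacity, and with it the risk of overfitting, grows quickly with H.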

26 Conclusion Perceptrons handle linearly separable problems Multilayer perceptrons can approximate any continuous function (universal approximation) Logistic discrimination functions enable gradient-descent-based backpropagation Backpropagation solves the structural credit assignment problem Susceptible to local optima Susceptible to overfitting 26

27 [Figure-only slide] 27

28 Structured MLP (Le Cun et al., 1989) 28

29 Weight Sharing 29

30 Hints (Abu-Mostafa, 1995) Invariance to translation, rotation, size Virtual examples Augmented error: E' = E + λ_h E_h If x and x' are the same: E_h = [g(x|θ) − g(x'|θ)]^2 Approximation hint: 30

31 Tuning the Network Size Destructive: weight decay Constructive: growing networks (Ash, 1989) (Fahlman and Lebiere, 1989) 31
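A minimal sketch of the destructive approach via weight decay, assuming a gradient `grad` has already been computed; the function name and the constants eta and lam are illustrative:

```python
def weight_decay_step(w, grad, eta=0.01, lam=1e-4):
    """delta_w_i = -eta * dE/dw_i - lam * w_i: every step also shrinks the weight
    toward zero, which corresponds to an added penalty on sum_i w_i**2."""
    return w - eta * grad - lam * w
```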

32 Bayesian Learning Consider weights w_i as random variables with prior p(w_i) Weight decay, ridge regression, regularization: cost = data-misfit + λ · complexity More about Bayesian methods in chapter 14 32

33 Dimensionality Reduction 33

34 [Figure-only slide] 34

35 Learning Time Applications: Sequence recognition: speech recognition Sequence reproduction: time-series prediction Sequence association Network architectures: Time-delay networks (Waibel et al., 1989) Recurrent networks (Rumelhart et al., 1986) 35

36 Time-Delay Neural Networks 36

37 Recurrent Networks 37

38 Unfolding in Time 38