Artificial Neural Networks

Size: px

Start display at page:

Download "Artificial Neural Networks"

Berenice Houston
5 years ago
Views:

1 Artificial Neural Networks Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics 1

2 Connectionist Models Consider humans: Neuron switching time.001 second Number of neurons Connections per neuron Scene recognition time.1 second 100 inference steps doesn t seem like enough much parallel computation 2

3 Connectionist Models Properties of artificial neural nets (ANN s): Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed process Emphasis on tuning weights automatically 3

4 When to Consider Neural Networks Input is high-dimensional discrete or real-valued (e.g. raw sensor input) Output is discrete or real valued Output is a vector of values Possibly noisy data Form of target function is unknown Human readability of result is unimportant 4

5 When to Consider Neural Networks Examples: Speech phoneme recognition [Waibel] Image classification [Kanade, Baluja, Rowley] Financial prediction 5

6 ALVINN drives 70 mph on highways Sharp Left Straight Ahead Sharp Right 30 Output Units 4 Hidden Units 30x32 Sensor Input Retina 6

7 Perceptron x 1 w 1 x 0 =1 x 2 x n... w 2 w n w 0 Σ n Σ w i x i i=0 { n 1 if Σ w > 0 o = i x i=0 i -1 otherwise { 1 if w0 + w o(x 1,..., x n ) = 1 x w n x n > 0 1 otherwise. Sometimes we ll use simpler vector notation: o( x) = { 1 if w x > 0 1 otherwise. 7

8 Decision Surface of a Perceptron + + x x x 1 x (a) Represents some useful functions What weights represent g(x 1, x 2 ) = AND(x 1, x 2 )? But some functions not representable e.g., not linearly separable Will need more complicated models to represent these. (b) 8

9 Perceptron training rule w i w i + w i where w i = η(t o)x i Where: t = c( x) is target value o is perceptron output η is small constant (e.g.,.1) called learning rate 9

10 Perceptron training rule Can prove it will converge If training data is linearly separable and η sufficiently small 10

11 Gradient Descent To understand, consider simpler linear unit, where o = w 0 + w 1 x w n x n Let s learn w i s that minimize the squared error E[ w] 1 2 (t d o d ) 2 d D Where D is set of training examples This defines the Loss function that you want to minimize! 11

12 Gradient Descent E[w] w0 Gradient w1 E[ w] -1-2 [ E, E, E ] w 0 w 1 w n 12

13 Training rule: i.e., Update: w = η E[ w] w i = η E w i w i w i + w i 13

14 Gradient Descent Derivation: E = 1 (t w i w i 2 d o d ) 2 = 1 2 = 1 2 = d E = w i d d d d w i (t d o d ) 2 2(t d o d ) w i (t d o d ) (t d o d ) w i (t d w x d ) (t d o d )( x i,d ) 14

15 Gradient Descent GRADIENT-DESCENT(training examples, η) Each training example is a pair of the form x, t, where x is the vector of input values, and t is the target output value. η is the learning rate (e.g.,.05). Initialize each w i to some small random value Until the termination condition is met, Do Initialize each w i to zero. For each x, t in training examples, Do Input the instance x to the unit and compute the output o 15

16 For each linear unit weight w i, Do w i w i + η(t o)x i For each linear unit weight w i, Do w i w i + w i 16

17 Summary Perceptron training rule guaranteed to succeed if Training examples are linearly separable Sufficiently small learning rate η Linear unit training rule uses gradient descent Guaranteed to converge to hypothesis with minimum squared error Given sufficiently small learning rate η Even when training data contains noise Even when training data not separable by H 17

18 Incremental (Stochastic) Gradient Descent Batch mode Gradient Descent: Do until satisfied 1. Compute the gradient E D [ w] 2. w w η E D [ w] Incremental mode Gradient Descent: Do until satisfied For each training example d in D 1. Compute the gradient E d [ w] 2. w w η E d [ w] 18

19 E D [ w] 1 2 (t d o d ) 2 d D E d [ w] 1 2 (t d o d ) 2 Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough 19

20 Multilayer Networks of Sigmoid Units head hid who d hood F1 F2 20

21 21

22 Sigmoid Unit x 1 w 1 x 0 = 1 x 2 x n... w 2 w n w 0 Σ n net = Σ w i x i=0 i σ(x) is the sigmoid function 1 1+e x Nice property: dσ(x) dx = σ(x)(1 σ(x)) o = σ(net) = We can derive gradient decent rules to train One sigmoid unit ē net Multilayer networks of sigmoids Backpropagation 22

23 Error Gradient for a Sigmoid Unit E = 1 (t w i w i 2 d o d ) 2 d D = 1 (t 2 w d o d ) 2 d i = 1 2(t 2 d o d ) (t w d o d ) d i = ( (t d o d ) o ) d w d i = d (t d o d ) o d net d net d w i 23

24 But we know: So: o d net d = σ(net d) net d = o d (1 o d ) net d w i = ( w x d) w i = x i,d E = (t w d o d )o d (1 o d )x i,d i d D 24

25 Backpropagation Algorithm Initialize all weights to small random numbers. Until satisfied, Do For each training example, Do 1. Input the training example to the network and compute the network outputs 2. For each output unit k 3. For each hidden unit h δ k o k (1 o k )(t k o k ) 25

26 δ h o h (1 o h ) k outputs w h,k δ k 4. Update each network weight w i,j where w i,j w i,j + w i,j w i,j = ηδ j x i,j 26

27 More on Backpropagation Gradient descent over entire network weight vector Easily generalized to arbitrary directed graphs Will find a local, not necessarily global error minimum In practice, often works well (can run multiple times) Often include weight momentum α w i,j (n) = ηδ j x i,j + α w i,j (n 1) Minimizes error over training examples Will it generalize well to subsequent examples? 27

28 Training can take thousands of iterations slow! Using network after training is very fast 28

29 Learning Hidden Layer Representations Inputs Outputs A target function: Input Output Can this be learned?? 29

30 Learning Hidden Layer Representations A network: Learned hidden layer: Inputs Outputs Input Hidden Output Values

31 Training Sum of squared errors for each output unit

32 Training Hidden unit encoding for input

33 Training Weights from inputs to one hidden unit

34 Convergence of Backpropagation Gradient descent to some local minimum Perhaps not global minimum... Add momentum Stochastic gradient descent Train multiple nets with different inital weights Nature of convergence Initialize weights near zero Therefore, initial networks near-linear Increasing non-linearity possible as training progresses 34

35 Expressive Capabilities of ANNs Boolean functions: Every boolean function can be represented by network with single hidden layer but might require exponential (in number of inputs) hidden units Continuous functions: Every bounded continuous function can be approximated with arbitrarily small error, by network with one hidden layer [Cybenko 1989; Hornik et al. 1989] 35

36 36 Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

37 Overfitting in ANNs Error Error versus weight updates (example 1) Training set error Validation set error Number of weight updates 37

38 Error Error versus weight updates (example 2) Training set error Validation set error Number of weight updates 38

39 Neural Nets for Face Recognition left strt rght up x32 inputs 39

40 Typical input images 90% accurate learning head pose, and recognizing 1-of-20 faces 40

41 Learned Hidden Unit Weights left strt rght up Learned Weights x32 inputs 41

42 Alternative Error Functions Penalize large weights: E( w) 1 2 d D k outputs(t kd o kd ) 2 + γ i,j Train on target slopes as well as values: w 2 ji [ (tkd o kd ) 2 E( w) 1 2 d D k outputs +µ j inputs ( t kd x j d o kd x j d ) 2 ] 42

43 Tie together weights: e.g., in phoneme recognition network 43

44 Recurrent Networks y(t + 1) y(t + 1) b x(t) x(t) c(t) (a) Feedforward network (b) Recurrent network y(t + 1) x(t) c(t) y(t) x(t 1) c(t 1) y(t 1) (c) Recurrent network unfolded in time x(t 2) c(t 2) 44

100 inference steps doesn't seem like enough. Many neuron-like threshold switching units. Many weighted interconnections among units

Connectionist Models Consider humans: Neuron switching time ~ :001 second Number of neurons ~ 10 10 Connections per neuron ~ 10 4 5 Scene recognition time ~ :1 second 100 inference steps doesn't seem like