Artificial Neural Networks

Based on Machine Learning, T. Mitchell, McGraw-Hill, 1997, ch. 4.
Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell.

PLAN
1. Introduction
   Connectionist models
   NN examples: ALVINN driving system, face recognition
2. The perceptron; the linear unit; the sigmoid unit
   The gradient descent learning rule
3. Multilayer networks of sigmoid units
   The Backpropagation algorithm
   Hidden layer representations
   Overfitting in NNs
4. Advanced topics
   Alternative error functions
   Predicting a probability function
   [Recurrent networks]
   [Dynamically modifying the network structure]
   [Alternative error minimization procedures]
5. Expressive capabilities of NNs

Connectionist Models
Consider humans:
- Neuron switching time: 0.001 sec
- Number of neurons: $10^{10}$
- Connections per neuron: $10^{4}$ to $10^{5}$
- Scene recognition time: 0.1 sec
At 0.001 sec per switch, 0.1 sec leaves room for only about 100 sequential inference steps. That doesn't seem like enough, so there must be much parallel computation.
Properties of artificial neural nets:
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically

A First NN Example: ALVINN drives at 70 mph on highways [Pomerleau, 1993]
[Figure: the ALVINN network. A 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]

A Second NN Example: Neural Nets for Face Recognition
[Figure: a network with 30x32 inputs and four outputs (left, strt, rght, up), shown with typical input images.]
Typical input images: http://www.cs.cmu.edu/~tom/faces.html
Results: 90% accurate learning head pose, and recognizing 1-of-20 faces.

Learned Weights
[Figure: learned weights of the face recognition network (outputs left, strt, rght, up; 30x32 inputs), shown after 1 epoch and after 100 epochs.]

Design Issues for these two NN Examples
See Tom Mitchell's Machine Learning book, pp. 82-83 and 114 for ALVINN, and pp. 112-117 for face recognition:
- input encoding
- output encoding
- network graph structure
- learning parameters: η (learning rate), α (momentum), number of iterations

When to Consider Neural Networks
- Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant

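2. The Perceptron [Rosenblatt, 1962]
[Figure: a perceptron unit. The inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, plus a constant input $x_0 = 1$ with weight $w_0$, are summed as $\sum_{i=0}^{n} w_i x_i$; the output is $o = 1$ if $\sum_{i=0}^{n} w_i x_i > 0$ and $-1$ otherwise.]
$$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \dots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$
Sometimes we'll use simpler vector notation:
$$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$$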

Decision Surface of a Perceptron
[Figure: two sets of + and - examples in the $(x_1, x_2)$ plane: (a) a set that a straight line can separate, (b) a set that no straight line can separate.]
- Represents some useful functions. What weights represent $g(x_1, x_2) = AND(x_1, x_2)$?
- But certain examples may not be linearly separable.
- Therefore, we'll want networks of these.
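One possible answer to the AND question can be checked directly. The sketch below assumes inputs encoded as 0/1 and outputs as +1/-1; the weights $w_0 = -0.8$, $w_1 = w_2 = 0.5$ are just one choice that works, and the names `perceptron_output` and `AND_WEIGHTS` are illustrative.

```python
def perceptron_output(weights, x):
    """Thresholded perceptron: weights[0] is w_0, paired with the constant input x_0 = 1."""
    net = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if net > 0 else -1


# Assumed weights for AND over 0/1 inputs: only (1, 1) pushes the sum above zero.
AND_WEIGHTS = [-0.8, 0.5, 0.5]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_output(AND_WEIGHTS, x))
```

Any weights with $w_0 < 0$, $w_0 + w_1 \le 0$, $w_0 + w_2 \le 0$ and $w_0 + w_1 + w_2 > 0$ would serve equally well under this encoding.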

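The Perceptron Training Rule
$$w_i \leftarrow w_i + \Delta w_i \quad \text{with} \quad \Delta w_i = \eta (t - o) x_i$$
or, in vectorial notation:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w} \quad \text{with} \quad \Delta \vec{w} = \eta (t - o) \vec{x}$$
where:
- $t = c(\vec{x})$ is the target value
- $o$ is the perceptron output
- $\eta$ is a small positive constant (e.g., 0.1) called the learning rate
It will converge (proven by [Minsky & Papert, 1969]) if the training data is linearly separable and $\eta$ is sufficiently small.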
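The training rule can be sketched in a few lines. This is a minimal illustration, assuming 0/1 inputs, +1/-1 targets, zero initial weights and a cap on the number of passes; `train_perceptron` and `AND_EXAMPLES` are hypothetical names, not part of the slides.

```python
def perceptron_output(w, x):
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if net > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Apply w_i <- w_i + eta * (t - o) * x_i until one pass makes no mistakes."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)          # w[0] is the weight of the constant input x_0 = 1
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:
                mistakes += 1
                w[0] += eta * (t - o)
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if mistakes == 0:
            break
    return w


# AND is linearly separable, so the rule is expected to converge on it.
AND_EXAMPLES = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
learned = train_perceptron(AND_EXAMPLES)
print(all(perceptron_output(learned, x) == t for x, t in AND_EXAMPLES))
```

When $o = t$ the factor $(t - o)$ is zero, so skipping correctly classified examples is only a shortcut, not a change to the rule.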

2. The Linear Unit
To understand the perceptron's training rule, consider a (simpler) linear unit, where
$$o = w_0 + w_1 x_1 + \dots + w_n x_n = \sum_{i=0}^{n} w_i x_i \quad (x_0 = 1)$$
Let's learn the $w_i$'s that minimize the squared error
$$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$$
where $D$ is the set of training examples.
[Figure: a linear unit with inputs $x_0 = 1, x_1, \dots, x_n$, weights $w_0, \dots, w_n$ and output $o = \sum_{i=0}^{n} w_i x_i$.]
The linear unit uses the gradient descent training rule, presented on the next slides.
Remark: Ch. 6 (Bayesian Learning) shows that the hypothesis $h = (w_0, w_1, \dots, w_n)$ that minimises $E$ is the most probable hypothesis given the training data.

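The Gradient Descent Rule
[Figure: the error surface $E[\vec{w}]$ plotted over the weights $w_0$ and $w_1$.]
Gradient:
$$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$$
Training rule:
$$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}, \quad \text{with} \quad \Delta \vec{w} = -\eta \nabla E[\vec{w}]$$
Therefore,
$$w_i \leftarrow w_i + \Delta w_i, \quad \text{with} \quad \Delta w_i = -\eta \frac{\partial E}{\partial w_i}$$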

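The Gradient Descent Rule for the Linear Unit: Computation
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d} (t_d - o_d)^2 = \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2$$
$$= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_{d} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)$$
$$= \sum_{d} (t_d - o_d)(-x_{i,d})$$
Therefore
$$\Delta w_i = \eta \sum_{d} (t_d - o_d) x_{i,d}$$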

The Gradient Descent Algorithm for the Linear Unit
Gradient-Descent(training examples, η)
Each training example is a pair of the form ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
- Initialize each $w_i$ to some small random value
- Until the termination condition is met
  - Initialize each $\Delta w_i$ to zero
  - For each ⟨x, t⟩ in training examples
    - Input the instance x to the unit and compute the output $o$
    - For each linear unit weight $w_i$: $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  - For each linear unit weight $w_i$: $w_i \leftarrow w_i + \Delta w_i$
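The algorithm above translates almost line for line into code. The sketch below is one possible rendering, assuming a fixed number of passes as the termination condition and a small noise-free data set generated from $t = 1 + 2x$; `gradient_descent` and `EXAMPLES` are illustrative names.

```python
import random


def linear_output(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def gradient_descent(examples, eta=0.05, epochs=1000, seed=0):
    """Batch gradient descent for a linear unit (termination: fixed number of epochs)."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]   # small random weights
    for _ in range(epochs):
        delta = [0.0] * (n + 1)                            # each delta_w_i starts at zero
        for x, t in examples:
            o = linear_output(w, x)
            delta[0] += eta * (t - o)                      # x_0 = 1
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * (t - o) * xi
        w = [wi + di for wi, di in zip(w, delta)]          # one update per full pass
    return w


# Assumed toy data: targets follow t = 1 + 2x exactly.
EXAMPLES = [([x], 1.0 + 2.0 * x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(gradient_descent(EXAMPLES))
```

On this data the learned weights should approach $w_0 = 1$, $w_1 = 2$; a larger `eta` would eventually make the updates overshoot and diverge.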

Convergence [Hertz et al., 1991]
The gradient descent training rule used by the linear unit is guaranteed to converge to a hypothesis with minimum squared error
- given a sufficiently small learning rate η
- even when the training data contains noise
- even when the training data is not separable by H
Note: If η is too large, the gradient descent search runs the risk of overstepping the minimum in the error surface rather than settling into it. For this reason, one common modification of the algorithm is to gradually reduce the value of η as the number of gradient descent steps grows.

Remark
Gradient descent (and similarly, gradient ascent: $\vec{w} \leftarrow \vec{w} + \eta \nabla E$) is an important general paradigm for learning. It is a strategy for searching through a large or infinite hypothesis space that can be applied whenever
- the hypothesis space contains continuously parametrized hypotheses
- the error can be differentiated with respect to these hypothesis parameters
Practical difficulties in applying the gradient method:
- if there are multiple local optima in the error surface, then there is no guarantee that the procedure will find the global optimum
- converging to a local optimum can sometimes be quite slow
To alleviate these difficulties, a variation called the incremental (or stochastic) gradient method was proposed.

Incremental (Stochastic) Gradient Descent
Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$
Incremental mode Gradient Descent:
Do until satisfied
For each training example $d$ in $D$
1. Compute the gradient $\nabla E_d[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$
where
$$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \qquad E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$$
Convergence: The Incremental Gradient Descent can approximate the Batch Gradient Descent arbitrarily closely if η is made small enough.
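For comparison with the batch procedure, here is a hedged sketch of the incremental mode for a linear unit. It assumes zero initial weights, a fixed number of passes, and the same kind of noise-free toy data ($t = 1 + 2x$); the names are illustrative.

```python
def linear_output(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def incremental_gradient_descent(examples, eta=0.05, epochs=1000):
    """Update the weights after every single example, using the gradient of E_d only."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, t in examples:
            o = linear_output(w, x)
            # -eta * dE_d/dw_i = eta * (t - o) * x_i, applied immediately
            w[0] += eta * (t - o)
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w


# Assumed toy data: targets follow t = 1 + 2x exactly.
EXAMPLES = [([x], 1.0 + 2.0 * x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(incremental_gradient_descent(EXAMPLES))
```

The only structural difference from batch mode is where the weight update sits: inside the loop over examples rather than after it.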

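2. The Sigmoid Unit
[Figure: a sigmoid unit with inputs $x_0 = 1, x_1, \dots, x_n$ and weights $w_0, \dots, w_n$.]
$$net = \sum_{i=0}^{n} w_i x_i \qquad o = \sigma(net) = \frac{1}{1 + e^{-net}}$$
$\sigma(x)$ is the sigmoid function
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
Nice property:
$$\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$$
We can derive gradient descent rules to train
- One sigmoid unit
- Multilayer networks of sigmoid units → Backpropagation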
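The "nice property" is easy to check numerically. The sketch below compares $\sigma(x)(1 - \sigma(x))$ with a central finite difference; the step size `h` and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative computed via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

The derivative peaks at $x = 0$, where it equals 0.25, and shrinks toward zero for large $|x|$.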

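Error Gradient for the Sigmoid Unit
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2$$
$$= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d) = \sum_{d} (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)$$
$$= -\sum_{d} (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}$$
But
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d) \qquad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$$
So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} o_d (1 - o_d)(t_d - o_d) x_{i,d}$$
where $net_d = \sum_{i=0}^{n} w_i x_{i,d}$.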

Remark
Instead of gradient descent, one could use linear programming to find a hypothesis consistent with separable data. [Duda & Hart, 1973] have shown that linear programming can be extended to the non-linearly separable case. However, linear programming does not scale to multilayer networks, as gradient descent does (see next section).

3. Multilayer Networks of Sigmoid Units
An example
This network was trained to recognize 1 of 10 vowel sounds occurring in the context "h_d" (e.g. "head", "hid"). The inputs have been obtained from a spectral analysis of sound. The 10 network outputs correspond to the 10 possible vowel sounds. The network prediction is the output whose value is the highest.
[Figure: a network with inputs F1 and F2 and outputs labelled head, hid, who'd, hood, ...]

This plot illustrates the highly non-linear decision surface represented by the learned network. Points shown on the plot are test examples distinct from the examples used to train the network. From [Huang & Lippmann, 1988].

3.1 The Backpropagation Algorithm
for a feed-forward 2-layer network of sigmoid units, the stochastic version
Idea: Gradient descent over the entire vector of network weights.
Initialize all weights to small random numbers.
Until satisfied (the stopping criterion is defined later), for each training example:
1. input the training example to the network, and compute the network outputs
2. for each output unit $k$: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
3. for each hidden unit $h$: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$
4. update each network weight $w_{ji}$: $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \delta_j x_{ji}$ and $x_{ji}$ is the $i$th input to unit $j$
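The four steps can be sketched as follows for a network with one hidden layer. This is an illustrative implementation, not the slides' own code: the bias handling (a constant input 1 at index 0), the 8-3-8 identity task, the learning rate 0.3 and the fixed number of epochs are all assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_weights(n_units, n_inputs, rng):
    # One row per unit; column 0 is the bias weight (for the constant input 1).
    return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
            for _ in range(n_units)]


def forward(x, w_hid, w_out):
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, o


def backprop_step(x, t, w_hid, w_out, eta):
    h, o = forward(x, w_hid, w_out)                       # step 1
    xb, hb = [1.0] + list(x), [1.0] + h
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]          # step 2
    d_hid = [hj * (1 - hj) * sum(w_out[k][j + 1] * d_out[k]              # step 3
                                 for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(w_out):                       # step 4, output layer
        for i in range(len(row)):
            row[i] += eta * d_out[k] * hb[i]
    for j, row in enumerate(w_hid):                       # step 4, hidden layer
        for i in range(len(row)):
            row[i] += eta * d_hid[j] * xb[i]


def total_error(examples, w_hid, w_out):
    return sum(0.5 * sum((tk - ok) ** 2
                         for tk, ok in zip(t, forward(x, w_hid, w_out)[1]))
               for x, t in examples)


def train_identity(epochs=1000, eta=0.3, seed=0):
    """Train an 8-3-8 network on the identity function; return error before and after."""
    rng = random.Random(seed)
    patterns = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
    examples = [(p, p) for p in patterns]
    w_hid, w_out = init_weights(3, 8, rng), init_weights(8, 3, rng)
    before = total_error(examples, w_hid, w_out)
    for _ in range(epochs):
        for x, t in examples:
            backprop_step(x, t, w_hid, w_out, eta)
    return before, total_error(examples, w_hid, w_out)
```

Calling `train_identity()` returns the summed squared error before and after training. Note that the hidden deltas are computed from the output weights as they were before the update, which is what step 3 requires.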

Derivation of the Backpropagation rule (following [Tom Mitchell, 1997], pp. 101-103)
Notations:
- $x_{ji}$: the $i$th input to unit $j$ ($j$ could be either a hidden or an output unit)
- $w_{ji}$: the weight associated with the $i$th input to unit $j$
- $net_j = \sum_i w_{ji} x_{ji}$
- $\sigma$: the sigmoid function
- $o_j$: the output computed by unit $j$ ($o_j = \sigma(net_j)$)
- outputs: the set of units in the final layer of the network
- Downstream($j$): the set of units whose immediate inputs include the output of unit $j$
- $E_d$: the training error on the example $d$ (summing over all of the network output units)
[Figure: unit $j$ with its $i$th input and weight $w_{ji}$. Legend: the units belonging to Downstream($j$) are shown in magenta.]

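Preliminaries
$$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{1}{2} \sum_{k \in outputs} (t_k - \sigma(net_k))^2$$
[Figure: unit $j$ computes $net_j$ from its inputs $x_{ji}$ and weights $w_{ji}$, then $o_j = \sigma(net_j)$; $o_j$ feeds the downstream units $k$ through the weights $w_{kj}$.]
Common stuff for both hidden and output units:
$$net_j = \sum_i w_{ji} x_{ji}$$
$$\frac{\partial E_d}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = \frac{\partial E_d}{\partial net_j} x_{ji}$$
$$\Delta w_{ji} \stackrel{def}{=} -\eta \frac{\partial E_d}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j} x_{ji}$$
Note: In the sequel we will use the notation $\delta_j = -\frac{\partial E_d}{\partial net_j}$, so that $\Delta w_{ji} = \eta \delta_j x_{ji}$.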

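Stage/Case 1: Computing the increments ($\Delta$) for output unit weights
$$\frac{\partial E_d}{\partial net_j} = \frac{\partial E_d}{\partial o_j} \frac{\partial o_j}{\partial net_j}$$
$$\frac{\partial o_j}{\partial net_j} = \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j)$$
$$\frac{\partial E_d}{\partial o_j} = \frac{\partial}{\partial o_j} \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2 = \frac{\partial}{\partial o_j} \frac{1}{2} (t_j - o_j)^2 = \frac{1}{2} \, 2 (t_j - o_j) \frac{\partial (t_j - o_j)}{\partial o_j} = -(t_j - o_j)$$
$$\Rightarrow \frac{\partial E_d}{\partial net_j} = -(t_j - o_j) \, o_j (1 - o_j) = -o_j (1 - o_j)(t_j - o_j)$$
$$\delta_j \stackrel{not}{=} -\frac{\partial E_d}{\partial net_j} = o_j (1 - o_j)(t_j - o_j)$$
$$\Delta w_{ji} = \eta \delta_j x_{ji} = \eta \, o_j (1 - o_j)(t_j - o_j) \, x_{ji}$$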

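Stage/Case 2: Computing the increments ($\Delta$) for hidden unit weights
$$\frac{\partial E_d}{\partial net_j} = \sum_{k \in Downstream(j)} \frac{\partial E_d}{\partial net_k} \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k \frac{\partial net_k}{\partial o_j} \frac{\partial o_j}{\partial net_j}$$
$$= \sum_{k \in Downstream(j)} -\delta_k w_{kj} \frac{\partial o_j}{\partial net_j} = \sum_{k \in Downstream(j)} -\delta_k w_{kj} \, o_j (1 - o_j)$$
Therefore:
$$\delta_j \stackrel{not}{=} -\frac{\partial E_d}{\partial net_j} = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}$$
$$\Delta w_{ji} \stackrel{def}{=} -\eta \frac{\partial E_d}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j} \frac{\partial net_j}{\partial w_{ji}} = -\eta \frac{\partial E_d}{\partial net_j} x_{ji} = \eta \delta_j x_{ji} = \eta \left[ o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj} \right] x_{ji}$$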
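A quick way to gain confidence in the two cases is to compare $-\delta_j x_{ji}$ with a finite-difference estimate of $\partial E_d / \partial w_{ji}$. The sketch below does this for a tiny network with one hidden layer; the weights, the example and the step size `eps` are arbitrary assumptions, and index 0 of each weight row is taken to be the bias.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w_hid, w_out):
    xb = [1.0] + list(x)
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    hb = [1.0] + h
    o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
    return h, o


def error(x, t, w_hid, w_out):
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, forward(x, w_hid, w_out)[1]))


def delta_gradients(x, t, w_hid, w_out):
    """dE_d/dw_ji = -delta_j * x_ji, with delta_j from Case 1 (outputs) and Case 2 (hidden)."""
    h, o = forward(x, w_hid, w_out)
    xb, hb = [1.0] + list(x), [1.0] + h
    d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    d_hid = [hj * (1 - hj) * sum(w_out[k][j + 1] * d_out[k] for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    return ([[-d * v for v in xb] for d in d_hid],
            [[-d * v for v in hb] for d in d_out])


def numeric_gradients(x, t, w_hid, w_out, eps=1e-6):
    """Central finite differences of E_d with respect to every weight."""
    grads = []
    for layer in (w_hid, w_out):
        g_layer = []
        for row in layer:
            g_row = []
            for i in range(len(row)):
                old = row[i]
                row[i] = old + eps
                e_plus = error(x, t, w_hid, w_out)
                row[i] = old - eps
                e_minus = error(x, t, w_hid, w_out)
                row[i] = old                      # restore the weight
                g_row.append((e_plus - e_minus) / (2 * eps))
            g_layer.append(g_row)
        grads.append(g_layer)
    return grads[0], grads[1]


def max_gradient_gap(x, t, w_hid, w_out):
    analytic = delta_gradients(x, t, w_hid, w_out)
    numeric = numeric_gradients(x, t, w_hid, w_out)
    return max(abs(p - q)
               for la, ln in zip(analytic, numeric)
               for ra, rn in zip(la, ln)
               for p, q in zip(ra, rn))


W_HID = [[0.1, -0.2, 0.3], [-0.4, 0.5, 0.6]]   # arbitrary example weights
W_OUT = [[0.2, -0.7, 0.8]]
print(max_gradient_gap([1.0, 0.5], [1.0], W_HID, W_OUT))
```

A gap many orders of magnitude smaller than the gradients themselves indicates that the delta formulas and the numerical estimate agree.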

Convergence of Backpropagation for NNs of Sigmoid units
Nature of convergence:
- The weights are initialized near zero; therefore, initial decision surfaces are near-linear. Explanation: $o_j$ is of the form $\sigma(\vec{w} \cdot \vec{x})$ with $w_{ji} \approx 0$ for all $i, j$, and the graph of $\sigma$ is approximately linear in the vicinity of 0.
- Increasingly non-linear functions are possible as training progresses.
- Will find a local, not necessarily global, error minimum.
- In practice, often works well (can run multiple times).

More on Backpropagation
- Easily generalized to arbitrary directed graphs
- Training can take thousands of iterations: slow!
- Often include weight momentum α:
$$\Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji}(n - 1)$$
  Effect: speed up convergence (increase the step size in regions where the gradient is unchanging); "keep the ball rolling" through local minima (or along flat regions) in the error surface
- Using network after training is very fast
- Minimizes error over training examples; will it generalize well to subsequent examples?
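The effect of momentum on the step size can be seen with a small experiment. In the sketch below, `base_step` stands for the term $\eta \delta_j x_{ji}$ and is held constant to mimic a region where the gradient is unchanging; both function names and the chosen values are assumptions.

```python
def momentum_step(base_step, previous_delta, alpha=0.9):
    """delta_w(n) = base_step + alpha * delta_w(n - 1)."""
    return base_step + alpha * previous_delta


def run_constant_gradient(base_step=1.0, alpha=0.9, steps=200):
    """Repeat the momentum update while the gradient term stays the same."""
    delta = 0.0
    for _ in range(steps):
        delta = momentum_step(base_step, delta, alpha)
    return delta


print(run_constant_gradient(alpha=0.0))   # no momentum: the step stays at base_step
print(run_constant_gradient(alpha=0.9))   # with momentum: the step grows toward a limit
```

With a constant gradient term the step size approaches `base_step / (1 - alpha)`, i.e. about ten times larger for α = 0.9.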

3.2 Stopping Criteria when Training ANNs and Overfitting
(see Tom Mitchell's Machine Learning book, pp. 108-112)
[Figure: two plots of error versus number of weight updates (example 1 and example 2), each showing the training set error and the validation set error.]
Plots of the error E, as a function of the number of weight updates, for two different robot perception tasks.
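One common stopping criterion based on such curves is to keep the weights from the point of lowest validation error and give up once that error has not improved for a while. The sketch below illustrates the idea on a made-up list of validation errors; `early_stopping_index` and the `patience` parameter are hypothetical, and the numbers are not taken from the plots.

```python
def early_stopping_index(validation_errors, patience=3):
    """Return the index of the best validation error seen before patience runs out."""
    best_index, best_error = None, float("inf")
    for i, err in enumerate(validation_errors):
        if err < best_error:
            best_index, best_error = i, err      # new best: remember this point
        elif i - best_index >= patience:
            break                                # no improvement for `patience` checks
    return best_index


# Made-up validation errors, one per evaluation point.
errors = [0.9, 0.5, 0.4, 0.45, 0.5, 0.6, 0.3]
print(early_stopping_index(errors, patience=3))
print(early_stopping_index(errors, patience=10))
```

The two calls show the trade-off: a small patience stops early and can miss a later, lower minimum, while a large patience costs more training.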

3.3 Learning Hidden Layer Representations
An example: learning the identity function $f(\vec{x}) = \vec{x}$
[Figure: network diagram with Inputs and Outputs.]

Input       Output
10000000 →  10000000
01000000 →  01000000
00100000 →  00100000
00010000 →  00010000
00001000 →  00001000
00000100 →  00000100
00000010 →  00000010
00000001 →  00000001

Learned hidden layer representation:

Input       Hidden Values     Output
10000000 →  .89  .04  .08  →  10000000
01000000 →  .15  .99  .99  →  01000000
00100000 →  .01  .97  .27  →  00100000
00010000 →  .99  .97  .71  →  00010000
00001000 →  .03  .05  .02  →  00001000
00000100 →  .01  .11  .88  →  00000100
00000010 →  .80  .01  .98  →  00000010
00000001 →  .60  .94  .01  →  00000001

After 8000 training epochs, the 3 hidden unit values encode the 8 distinct inputs. Note that if the encoded values are rounded to 0 or 1, the result is a standard binary encoding for 8 distinct values (though not the usual assignment, i.e. 1 → 001, 2 → 010, etc.).
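The rounding claim can be checked mechanically from the table. The sketch below copies the hidden values listed above and rounds each to 0 or 1; `HIDDEN_VALUES` and `rounded_codes` are illustrative names.

```python
# Hidden unit values copied from the table above, keyed by input pattern.
HIDDEN_VALUES = {
    "10000000": (0.89, 0.04, 0.08),
    "01000000": (0.15, 0.99, 0.99),
    "00100000": (0.01, 0.97, 0.27),
    "00010000": (0.99, 0.97, 0.71),
    "00001000": (0.03, 0.05, 0.02),
    "00000100": (0.01, 0.11, 0.88),
    "00000010": (0.80, 0.01, 0.98),
    "00000001": (0.60, 0.94, 0.01),
}


def rounded_codes(hidden_values):
    """Round every hidden value to 0 or 1 and join the bits into a 3-bit code."""
    return {inp: "".join(str(round(v)) for v in values)
            for inp, values in hidden_values.items()}


codes = rounded_codes(HIDDEN_VALUES)
for inp, code in codes.items():
    print(inp, "->", code)
print(len(set(codes.values())))   # number of distinct 3-bit codes
```

Eight distinct codes from three bits means every possible 3-bit pattern is used exactly once.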

Training (I)
[Figure: sum of squared errors for each output unit, plotted over 0 to 2500 on the horizontal axis.]

Training (II)
[Figure: hidden unit encoding for input 01000000, plotted over 0 to 2500 on the horizontal axis.]

Training (III)
[Figure: weights from inputs to one hidden unit, plotted over 0 to 2500 on the horizontal axis.]

4. Advanced Topics
4.1 Alternative Error Functions (see ML book, pp. 117-118):
- Penalize large weights:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$$
- Train on target slopes as well as values:
$$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$$
- Tie together weights: e.g., in a phoneme recognition network
- Minimizing the cross entropy (see next 3 slides):
$$-\sum_{d \in D} \left[ t_d \log o_d + (1 - t_d) \log(1 - o_d) \right]$$
  where $o_d$, the output of the network, represents the estimated probability that the training instance $x_d$ is associated with the label (target value) 1.

4.2 Predicting a probability function: Learning the ML hypothesis using a NN
(see Tom Mitchell's Machine Learning book, pp. 118, 167-171)
Let us consider a non-deterministic function (LC: one-to-many relation) $f : X \rightarrow \{0, 1\}$.
Given a set of independently drawn examples $D = \{\langle x_1, d_1 \rangle, \dots, \langle x_m, d_m \rangle\}$ where $d_i = f(x_i) \in \{0, 1\}$, we would like to learn the probability function $g(x) \stackrel{def}{=} P(f(x) = 1)$.
The ML hypothesis $h_{ML} = \arg\max_{h \in H} P(D \mid h)$ in such a setting is $h_{ML} = \arg\max_{h \in H} G(h, D)$, where
$$G(h, D) = \sum_{i=1}^{m} \left[ d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right]$$
We will use a NN for this task. For simplicity, a single layer with sigmoidal units is considered. The training will be done by gradient ascent:
$$\vec{w} \leftarrow \vec{w} + \eta \nabla G(h, D)$$

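The partial derivative of $G(h, D)$ with respect to $w_{jk}$, which is the weight for the $k$th input to unit $j$, is:
$$\frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} \frac{\partial G(h, D)}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}}$$
$$= \sum_{i=1}^{m} \frac{\partial \left( d_i \ln h(x_i) + (1 - d_i) \ln(1 - h(x_i)) \right)}{\partial h(x_i)} \frac{\partial h(x_i)}{\partial w_{jk}}$$
$$= \sum_{i=1}^{m} \frac{d_i - h(x_i)}{h(x_i)(1 - h(x_i))} \frac{\partial h(x_i)}{\partial w_{jk}}$$
and because
$$\frac{\partial h(x_i)}{\partial w_{jk}} = \sigma'(x_i) \, x_{i,jk} = h(x_i)(1 - h(x_i)) \, x_{i,jk}$$
it follows that
$$\frac{\partial G(h, D)}{\partial w_{jk}} = \sum_{i=1}^{m} (d_i - h(x_i)) \, x_{i,jk}$$
Note: Above, $x_{i,jk}$ denotes the $k$th input to unit $j$ for the $i$th training example, and $\sigma'$ the derivative of the sigmoid function.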

Therefore
$$w_{jk} \leftarrow w_{jk} + \Delta w_{jk}$$
with
$$\Delta w_{jk} = \eta \frac{\partial G(h, D)}{\partial w_{jk}} = \eta \sum_{i=1}^{m} (d_i - h(x_i)) \, x_{i,jk}$$
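The resulting update rule can be sketched for a single sigmoid unit as follows. The six labelled points, the learning rate and the number of steps are made-up assumptions chosen only to exercise the rule; `gradient_ascent`, `log_likelihood` and `DATA` are illustrative names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def h(w, x):
    """Output of one sigmoid unit; w[0] is the weight of the constant input 1."""
    return sigmoid(w[0] + sum(wk * xk for wk, xk in zip(w[1:], x)))


def log_likelihood(w, data):
    """G(h, D) = sum_i [ d_i ln h(x_i) + (1 - d_i) ln(1 - h(x_i)) ]."""
    return sum(d * math.log(h(w, x)) + (1 - d) * math.log(1 - h(w, x))
               for x, d in data)


def gradient_ascent(data, eta=0.01, steps=2000):
    n = len(data[0][0])
    w = [0.0] * (n + 1)
    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for x, d in data:
            err = d - h(w, x)                 # (d_i - h(x_i))
            grad[0] += err                    # constant input 1
            for k, xk in enumerate(x, start=1):
                grad[k] += err * xk
        w = [wk + eta * gk for wk, gk in zip(w, grad)]   # ascent: add the gradient
    return w


# Made-up examples <x, d>; the labels overlap, so they are not perfectly separable.
DATA = [([0.0], 0), ([1.0], 0), ([2.0], 1), ([3.0], 0), ([4.0], 1), ([5.0], 1)]
w = gradient_ascent(DATA)
print(log_likelihood([0.0, 0.0], DATA), log_likelihood(w, DATA))
```

Unlike the squared-error rule for the sigmoid unit, this update has no $o(1 - o)$ factor: it cancels against the derivative of the logarithm.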

4.3 Recurrent Networks
- applied to time series data
- can be trained using a version of the Backpropagation algorithm
- see [Mozer, 1995]
An example:
[Figure: an example of a recurrent network.]

4.4 Dynamically Modifying the Network Structure
Two ideas:
- Begin with a network with no hidden units, then grow the network until the training error is reduced to some acceptable level. Example: the Cascade-Correlation algorithm [Fahlman & Lebiere, 1990].
- Begin with a complex network and prune it as you find that certain connections $w$ are inessential. E.g. see whether $w \approx 0$, or analyze $\frac{\partial E}{\partial w}$, i.e. the effect that a small variation in $w$ has on the error $E$. Example: [LeCun et al., 1990].

4.5 Alternative Optimisation Methods for Training ANNs
See Tom Mitchell's Machine Learning book, p. 119:
- line search
- conjugate gradient

4.6 Other Advanced Issues
- Ch. 6: A Bayesian justification for choosing to minimize the sum of squared errors
- Ch. 7: The estimation of the number of training examples needed to reliably learn boolean functions; the Vapnik-Chervonenkis dimension of certain types of ANNs
- Ch. 12: How to use prior knowledge to improve the generalization accuracy of ANNs

5. Expressive Capabilities of [Feed-forward] ANNs
Boolean functions:
- Every boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.
Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989].
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
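As a small instance of the boolean-function claim, XOR (which no single perceptron can represent) can be written as a network with one hidden layer. The sketch below uses threshold units with 0/1 outputs rather than sigmoid units, and the hand-picked weights are just one assumed solution.

```python
def step(z):
    """Threshold unit with 0/1 output."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    # Hidden layer: one unit computes OR, the other computes NAND.
    h_or = step(-0.5 + x1 + x2)
    h_nand = step(1.5 - x1 - x2)
    # Output unit: AND of the two hidden units.
    return step(-1.5 + h_or + h_nand)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_network(x1, x2))
```

Two hidden units suffice here; the exponential blow-up mentioned above is a worst case, not the typical cost for every function.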

Summary / What you should know
- The gradient descent optimisation method
- The thresholded perceptron; the training rule, the test rule; convergence result
- The linear unit and the sigmoid unit; the gradient descent rule (including the proofs); convergence result
- Multilayer networks of sigmoid units; the Backpropagation algorithm (including the proof for the stochastic version); convergence result
- Batch/online vs stochastic/incremental gradient descent for artificial neurons and neural networks; convergence result
- Overfitting in neural networks; solutions