Lecture 4: Perceptrons and Multilayer Perceptrons

Cognitive Systems II - Machine Learning, SS 2005
Part I: Basic Approaches of Concept Learning
Perceptrons, Artificial Neural Networks

Biological Motivation

Biological learning systems are built of complex webs of interconnected neurons.

Motivation: capture this kind of highly parallel computation based on a distributed representation.
Goal: obtain highly effective machine learning algorithms, independent of whether these algorithms fit biological processes (no cognitive modeling!).

Biological Motivation

                    Computer                   Brain
computation units   1 CPU (> 10^7 gates)       10^11 neurons
memory units        512 MB RAM                 10^11 neurons
                    500 GB HDD                 10^14 synapses
clock / cycle time  10^-8 sec                  10^-3 sec
transmission        > 10^9 bits/sec            > 10^14 bits/sec

Computer: serial, fast. Brain: parallel, slow, robust to noisy data.

Appropriate Problems

The BACKPROPAGATION algorithm is the most commonly used ANN learning technique. It is appropriate for problems with the following characteristics:
- instances are represented as many attribute-value pairs; input values can be any real values
- the target function output may be discrete-, real-, or vector-valued
- the training examples may contain errors
- long training times are acceptable
- fast evaluation of the learned target function may be required
- many iterations may be necessary to converge to a good approximation
- the ability of humans to understand the learned target function is not important (the learned weights are not intuitively understandable)

Perceptrons

A perceptron takes a vector of real-valued inputs (x_1, ..., x_n) weighted with (w_1, ..., w_n) and calculates the linear combination of these inputs:

    \sum_{i=0}^{n} w_i x_i = w_0 x_0 + w_1 x_1 + ... + w_n x_n

where w_0 denotes a threshold value and x_0 is always 1. The unit outputs 1 if this sum is greater than 0, otherwise -1:

    o(x_1, ..., x_n) = 1 if \sum_{i=0}^{n} w_i x_i > 0, and -1 otherwise
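A minimal sketch of such a perceptron unit in Python (NumPy), assuming the weight vector already includes the threshold weight w_0:

```python
import numpy as np

def perceptron_output(w, x):
    """Thresholded perceptron unit.

    w: weight vector (w_0, ..., w_n), where w_0 is the threshold weight
    x: input vector (x_1, ..., x_n); x_0 = 1 is prepended internally
    Returns 1 if the weighted sum is positive, otherwise -1.
    """
    x = np.concatenate(([1.0], x))        # x_0 = 1
    return 1 if np.dot(w, x) > 0 else -1
```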

Representational Power

- a perceptron represents a hyperplane decision surface in the n-dimensional space of instances
- some sets of examples cannot be separated by any hyperplane; those that can be separated are called linearly separable
- many boolean functions can be represented by a perceptron: AND, OR, NAND, NOR

(Figure: (a) a linearly separable set of examples; (b) a set that is not linearly separable.)
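As a concrete illustration, the weight choice w_0 = -1.5, w_1 = w_2 = 1 (a common textbook choice, not taken from the slides) realizes two-input AND over inputs in {0, 1}, reusing perceptron_output and numpy from the sketch above:

```python
# AND as a perceptron: an illustrative weight choice, not from the slides.
w_and = np.array([-1.5, 1.0, 1.0])   # (w_0, w_1, w_2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), perceptron_output(w_and, np.array([x1, x2])))
# Only (1, 1) yields +1; all other inputs yield -1.
```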

Perceptron Training Rule

Problem: determine a weight vector \vec{w} that causes the perceptron to produce the correct output for each training example.

Perceptron training rule: w_i \leftarrow w_i + \Delta w_i where \Delta w_i = \eta (t - o) x_i
- t: target output
- o: perceptron output
- \eta: learning rate (usually some small value, e.g. 0.1)

Algorithm (a runnable sketch follows below):
1. initialize \vec{w} to random weights
2. repeat, until each training example is classified correctly:
   (a) apply the perceptron training rule to each training example

Convergence is guaranteed provided the training examples are linearly separable and \eta is sufficiently small.
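A minimal sketch of this training loop in Python (NumPy); the AND data, the learning rate of 0.1, and the epoch cap are illustrative assumptions, not part of the slides:

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.

    X: (m, n) array of training inputs
    t: (m,) array of targets in {-1, +1}
    Returns the learned weight vector (w_0, ..., w_n).
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])     # prepend x_0 = 1
    w = np.random.uniform(-0.05, 0.05, X.shape[1])   # small random initial weights
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X, t):
            o = 1 if np.dot(w, x) > 0 else -1
            if o != target:
                w += eta * (target - o) * x           # update only on misclassified examples
                errors += 1
        if errors == 0:                               # every example classified correctly
            break
    return w

# Example: learning two-input AND (targets in {-1, +1}).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, -1, -1, 1])
w = train_perceptron(X, t)
```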

Delta Rule

- the perceptron rule fails if the data is not linearly separable
- the delta rule converges toward a best-fit approximation
- it uses gradient descent to search the hypothesis space
- the thresholded perceptron output cannot be used, because it is not differentiable; hence an unthresholded linear unit o = \vec{w} \cdot \vec{x} is appropriate
- error measure: E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
- to understand gradient descent, it is helpful to visualize the entire hypothesis space with all possible weight vectors and their associated E values

Error Surface

The axes w_0, w_1 represent possible values for the two weights of a simple linear unit.

(Figure: error surface E[\vec{w}] plotted over the w_0-w_1 plane.)

For a linear unit, the error surface must be parabolic with a single global minimum.

Derivation of Gradient Descent

Problem: how can we calculate the steepest descent along the error surface?

Differentiate E with respect to each component of \vec{w}; this vector of derivatives is called the gradient of E, written \nabla E(\vec{w}):

    \nabla E(\vec{w}) \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, ..., \frac{\partial E}{\partial w_n} \right]

\nabla E(\vec{w}) specifies the direction of steepest ascent, so -\nabla E(\vec{w}) specifies the steepest descent.

Training rule: w_i \leftarrow w_i + \Delta w_i with \Delta w_i = -\eta \frac{\partial E}{\partial w_i}

Since \frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id}), this gives \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}.
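The intermediate steps of this derivative, filled in here for completeness (they are the standard chain-rule steps, not spelled out on the slide):

```latex
\begin{aligned}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
  &= \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d)\, \frac{\partial}{\partial w_i}(t_d - o_d) \\
  &= \sum_{d \in D} (t_d - o_d)\, \frac{\partial}{\partial w_i}\bigl(t_d - \vec{w} \cdot \vec{x}_d\bigr) \\
  &= \sum_{d \in D} (t_d - o_d)\,(-x_{id})
\end{aligned}
```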

Incremental Gradient Descent

Application difficulties of gradient descent:
- convergence may be quite slow
- in case of many local minima, the global minimum may not be found

Idea: approximate the gradient descent search by updating the weights incrementally, following the calculation of the error for each individual example:

    \Delta w_i = \eta (t - o) x_i   where   E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2

Key differences:
- the error is not summed up over all examples before updating the weights
- requires less computation per update step
- better for avoiding local minima

Gradient Descent Algorithm

GRADIENT-DESCENT(training_examples, \eta)
  Each training example is a pair \langle \vec{x}, t \rangle, where \vec{x} is the vector of input values and t is the target output value; \eta is the learning rate.
  - initialize each w_i to some small random value
  - until the termination condition is met, do:
    - initialize each \Delta w_i to zero
    - for each \langle \vec{x}, t \rangle in training_examples, do:
      - input the instance \vec{x} to the unit and compute the output o
      - for each linear unit weight w_i: \Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i
    - for each linear unit weight w_i: w_i \leftarrow w_i + \Delta w_i

To implement the incremental approximation, the accumulation step is deleted and the final update is replaced by w_i \leftarrow w_i + \eta (t - o) x_i, applied after each example.
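A runnable sketch of this procedure for a single linear unit in Python (NumPy); the fixed epoch count used as the termination condition is an illustrative assumption:

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=200, incremental=False):
    """Gradient descent for a linear unit o = w . x (delta rule).

    training_examples: list of (x, t) pairs, x a 1-D array, t a float target
    incremental=True switches to the stochastic/incremental variant,
    updating w after every example instead of accumulating Delta w.
    """
    n = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, n + 1)          # includes threshold weight w_0
    for _ in range(epochs):                            # termination: fixed number of epochs
        delta_w = np.zeros_like(w)
        for x, t in training_examples:
            x = np.concatenate(([1.0], x))             # x_0 = 1
            o = np.dot(w, x)                           # unthresholded linear output
            if incremental:
                w += eta * (t - o) * x                 # update immediately
            else:
                delta_w += eta * (t - o) * x           # accumulate over all examples
        if not incremental:
            w += delta_w
    return w
```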

Perceptron vs. Delta Rule

Perceptron training rule:
- uses a thresholded unit
- converges after a finite number of iterations
- the output hypothesis classifies the training data perfectly
- linear separability is necessary

Delta rule:
- uses an unthresholded linear unit
- converges asymptotically toward a minimum-error hypothesis
- termination is not guaranteed
- linear separability is not necessary

Multilayer Networks (ANNs)

- capable of learning nonlinear decision surfaces
- normally directed and acyclic (feed-forward network)
- based on the sigmoid unit: much like a perceptron, but with a smoothed, differentiable threshold function

    net = \sum_{i=0}^{n} w_i x_i,   o = \sigma(net) = \frac{1}{1 + e^{-net}}

    \lim_{net \to +\infty} \sigma(net) = 1,   \lim_{net \to -\infty} \sigma(net) = 0
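A minimal sketch of a sigmoid unit in Python (NumPy):

```python
import numpy as np

def sigmoid(net):
    """Smoothed, differentiable threshold function sigma(net) = 1 / (1 + e^-net)."""
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_unit(w, x):
    """Sigmoid unit: o = sigma(w . x), with x_0 = 1 prepended for the threshold weight."""
    x = np.concatenate(([1.0], x))
    return sigmoid(np.dot(w, x))

# A useful property exploited by backpropagation:
# d sigma / d net = sigma(net) * (1 - sigma(net))
```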

BACKPROPAGATION

- learns the weights for a feed-forward multilayer network with a fixed set of neurons and interconnections
- employs gradient descent to minimize the error
- E has to be redefined to sum the errors over all output units:

    E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2

- problem: search through a large hypothesis space H defined over all possible weight values for all units in the network

BACKPROPAGATION Algorithm

BACKPROPAGATION(training_examples, \eta, n_in, n_out, n_hidden)
  The input from unit i to unit j is denoted x_{ji}, and the weight from unit i to unit j is denoted w_{ji}.
  - create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units
  - initialize all network weights to small random numbers
  - until the termination condition is met, do:
    - for each \langle \vec{x}, \vec{t} \rangle in training_examples, do:
      Propagate the input forward through the network:
      1. input \vec{x} to the network and compute the output o_u of every unit u
      Propagate the errors backward through the network:
      2. for each network output unit k, calculate its error term \delta_k:
         \delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)
      3. for each hidden unit h, calculate its error term \delta_h:
         \delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k
      4. update each weight: w_{ji} \leftarrow w_{ji} + \Delta w_{ji} where \Delta w_{ji} = \eta \delta_j x_{ji}
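A compact runnable sketch of this algorithm for a single hidden layer in Python (NumPy); the XOR data, network sizes, and learning rate in the usage lines are illustrative assumptions:

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backpropagation(training_examples, eta, n_in, n_out, n_hidden, epochs=1000):
    """Stochastic backpropagation for a sigmoid network with one hidden layer.

    training_examples: list of (x, t) pairs of 1-D arrays with len(x) == n_in
    and len(t) == n_out. Returns the weight matrices (W_h, W_o), each with an
    extra first column for the bias weight (x_0 = 1).
    """
    rng = np.random.default_rng(0)
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))    # hidden-layer weights w_{ji}
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))   # output-layer weights w_{kh}
    for _ in range(epochs):                                  # termination: fixed iterations
        for x, t in training_examples:
            # 1. propagate the input forward
            x_b = np.concatenate(([1.0], x))
            o_h = sigmoid(W_h @ x_b)
            o_h_b = np.concatenate(([1.0], o_h))
            o_k = sigmoid(W_o @ o_h_b)
            # 2. error terms of the output units
            delta_k = o_k * (1 - o_k) * (t - o_k)
            # 3. error terms of the hidden units (bias column excluded from backflow)
            delta_h = o_h * (1 - o_h) * (W_o[:, 1:].T @ delta_k)
            # 4. weight updates: Delta w_{ji} = eta * delta_j * x_{ji}
            W_o += eta * np.outer(delta_k, o_h_b)
            W_h += eta * np.outer(delta_h, x_b)
    return W_h, W_o

# Illustrative usage: learn XOR with 2 inputs, 2 hidden units, 1 output.
data = [(np.array(x, dtype=float), np.array([t], dtype=float)) for x, t in
        [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]]
W_h, W_o = backpropagation(data, eta=0.5, n_in=2, n_out=1, n_hidden=2, epochs=5000)
```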

Termination Conditions

- after a fixed number of iterations
- when the error falls below some threshold
- when the error on a separate validation set falls below some threshold

Important: too few iterations reduce the error insufficiently; too many iterations can lead to overfitting the data.

Adding Momentum

One way to avoid getting stuck in local minima or flat regions of the error surface is to make the weight update in the n-th iteration depend on the update in the (n-1)-th iteration:

    \Delta w_{ji}(n) = \eta \delta_j x_{ji} + \alpha \Delta w_{ji}(n-1),   0 \le \alpha < 1
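A minimal helper sketching this update rule in Python; the default \eta and \alpha values are illustrative, and wiring it into the backpropagation sketch above (keeping the previous update per weight matrix) is left to the caller:

```python
def momentum_update(grad_term, prev_update, eta=0.5, alpha=0.9):
    """Weight update with momentum:
    Delta w(n) = eta * (delta_j * x_ji) + alpha * Delta w(n-1).

    grad_term: array of delta_j * x_ji values for one layer
    prev_update: the Delta w array from the previous iteration
    Returns the new update, which the caller adds to the weights and
    remembers as prev_update for the next iteration.
    """
    return eta * grad_term + alpha * prev_update
```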

Representational Power

- boolean functions: every boolean function can be represented by a two-layer network
- continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a two-layer network (sigmoid units at the hidden layer and linear units at the output layer)
- arbitrary functions: any arbitrary function can be approximated to arbitrary accuracy by a three-layer network

Inductive Bias

- every possible assignment of network weights represents a syntactically different hypothesis: H = \{ \vec{w} \mid \vec{w} \in \mathbb{R}^{(n+1)} \}
- inductive bias: smooth interpolation between data points

Illustrative Example - Face Recognition

- task: classifying camera images of the faces of various people
- images of 20 people were taken, with approximately 32 different images per person
- image resolution 120 x 128, with each pixel described by a greyscale intensity between 0 and 255
- target: identifying the direction in which the person is looking (i.e., left, right, up, ahead)

Illustrative Example - Design Choices

Input encoding (a small sketch follows below):
- the image is encoded as a set of 30 x 32 pixel intensity values, ranging from 0 to 255, linearly scaled to the range 0 to 1
- this reduces the number of inputs and network weights, and thereby the computational demands

Output encoding:
- the network must output one of four values indicating the face direction
- 1-of-n output encoding: one output unit for each direction
- gives more degrees of freedom
- the difference between the highest and second-highest output can be used as a measure of classification confidence
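A small sketch of these two encodings in Python (NumPy); the direction ordering and the 0.1/0.9 target values are illustrative assumptions rather than details taken from the slides:

```python
import numpy as np

def encode_image(pixels):
    """Input encoding: 30x32 greyscale intensities (0..255) linearly scaled to 0..1,
    flattened into the network's input vector."""
    return (np.asarray(pixels, dtype=float) / 255.0).ravel()

DIRECTIONS = ["left", "right", "up", "ahead"]      # illustrative ordering

def encode_target(direction):
    """1-of-n output encoding: one output unit per face direction."""
    t = np.full(len(DIRECTIONS), 0.1)              # 0.1/0.9 targets are a common choice
    t[DIRECTIONS.index(direction)] = 0.9
    return t
```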

Illustrative Example - Design Choices

Network graph structure:
- BACKPROPAGATION works with any DAG of sigmoid units; the question is how many units to use and how to interconnect them
- standard design used here: one hidden layer and one output layer, where every unit in the hidden layer is connected with every unit in the output layer
- with 30 hidden units, a test accuracy of 90% was achieved

Advanced Topics

- hidden layer representations
- alternative error functions
- recurrent networks
- dynamically modifying the network structure

Summary

- ANNs are able to learn discrete-, real-, and vector-valued target functions
- noise in the data is allowed
- perceptrons learn hyperplane decision surfaces (linear separability required)
- multilayer networks can even learn nonlinear decision surfaces
- BACKPROPAGATION works on arbitrary feed-forward networks and uses gradient descent to minimize the squared error over the set of training examples
- an arbitrary function can be approximated to arbitrary accuracy by a three-layer network
- inductive bias: smooth interpolation between data points