L20: MLPs, RBFs and SPR
Outline: Bayes discriminants and MLPs; The role of MLP hidden units; Bayes discriminants and RBFs; Comparison between MLPs and RBFs

CSCE 666 Pattern Analysis, Ricardo Gutierrez-Osuna, CSE@TAMU

Bayes discriminants and MLPs

As we have seen throughout the course, the classifier that minimizes P[error] is defined by the MAP rule

$$\omega^* = \arg\max_{\omega_i} g_i(x), \qquad g_i(x) = P(\omega_i|x)$$

How does the MLP relate to this optimal classifier? Assume an MLP with a one-of-C target encoding

$$t_i(x) = \begin{cases} 1 & x \in \omega_i \\ 0 & \text{o.w.} \end{cases}$$

The contribution to the error of the i-th output is

$$J_i(W) = \sum_x \left[g_i(x;W) - t_i(x)\right]^2 = \sum_{x \in \omega_i}\left[g_i(x;W) - 1\right]^2 + \sum_{x \notin \omega_i}\left[g_i(x;W) - 0\right]^2$$

$$= N\left[\frac{N_i}{N}\,\frac{1}{N_i}\sum_{x \in \omega_i}\left[g_i(x;W) - 1\right]^2 + \frac{N - N_i}{N}\,\frac{1}{N - N_i}\sum_{x \notin \omega_i}\left[g_i(x;W) - 0\right]^2\right]$$

where $g_i(x;W)$ is the discriminant function computed by the MLP for the i-th class and the set of weights W.
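As a quick numerical sanity check of this decomposition (not part of the original slides), the following Python/NumPy sketch computes the i-th output's error contribution directly and via the in-class / out-of-class split; the array names g_i, in_class, and t_i are illustrative placeholders.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 200                                   # total number of examples
    g_i = rng.uniform(0.0, 1.0, size=N)       # stand-in for the MLP output g_i(x; W)
    in_class = rng.random(N) < 0.3            # True where the example belongs to omega_i
    t_i = in_class.astype(float)              # 1-of-C target for output i

    # Direct sum of squared errors for output i
    J_direct = np.sum((g_i - t_i) ** 2)

    # The same quantity via the omega_i / not-omega_i split used in the derivation
    J_split = np.sum((g_i[in_class] - 1.0) ** 2) + np.sum((g_i[~in_class] - 0.0) ** 2)

    assert np.isclose(J_direct, J_split)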

Assuming an infinite number of examples, the previous function becomes

$$\lim_{N\to\infty}\frac{1}{N}J_i(W) = \lim_{N\to\infty}\frac{N_i}{N}\,\frac{1}{N_i}\sum_{x\in\omega_i}\left[g_i(x;W)-1\right]^2 + \lim_{N\to\infty}\frac{N-N_i}{N}\,\frac{1}{N-N_i}\sum_{x\notin\omega_i}\left[g_i(x;W)-0\right]^2$$

$$= P(\omega_i)\int\left[g_i(x;W)-1\right]^2 p(x|\omega_i)\,dx + P(\omega_{k\neq i})\int\left[g_i(x;W)-0\right]^2 p(x|\omega_{k\neq i})\,dx$$

Expanding the quadratic terms and rearranging

$$\lim_{N\to\infty}\frac{1}{N}J_i(W) = \int g_i^2(x;W)\,p(x)\,dx - 2\int g_i(x;W)\,p(x,\omega_i)\,dx + \int p(x,\omega_i)\,dx$$

$$= \int\left[g_i(x;W) - P(\omega_i|x)\right]^2 p(x)\,dx + \int P(\omega_i|x)\,p(x)\,dx - \int P^2(\omega_i|x)\,p(x)\,dx$$

$$= \int\left[g_i(x;W) - P(\omega_i|x)\right]^2 p(x)\,dx + \underbrace{\int P(\omega_i|x)\left[1 - P(\omega_i|x)\right]p(x)\,dx}_{\text{independent of } W}$$

Therefore, by minimizing J(W) we also minimize

$$\int\left[g_i(x;W) - P(\omega_i|x)\right]^2 p(x)\,dx$$

Summing over all classes (output neurons), we can then conclude that back-prop attempts to minimize the following expression

$$\sum_{i=1}^{C}\int\left[g_i(x;W) - P(\omega_i|x)\right]^2 p(x)\,dx$$

Therefore, in the large-sample limit, the outputs of the MLP will approximate (in a least-squares sense) the posteriors

$$g_i(x;W) \simeq P(\omega_i|x)$$

Notice that nothing said here is specific to MLPs: any discriminant function with adaptive parameters trained to minimize the sum squared error at the output of a 1-of-C encoding will approximate the a posteriori probabilities.
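To illustrate this result without an actual MLP (the slide notes it is not specific to MLPs), the hedged sketch below fits a generic adaptive discriminant, a least-squares expansion over Gaussian basis functions, to 1-of-C targets on synthetic 1-D data and compares its second output with the analytic posterior. The class parameters, basis settings, and all names are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    # Synthetic two-class problem: P(omega_1)=0.6, x|omega_1 ~ N(0,1); P(omega_2)=0.4, x|omega_2 ~ N(2,1)
    N = 20000
    labels = (rng.random(N) < 0.4).astype(int)
    x = np.where(labels == 1, rng.normal(2.0, 1.0, N), rng.normal(0.0, 1.0, N))
    T = np.eye(2)[labels]                                  # 1-of-C targets

    # Generic adaptive discriminant: least-squares fit over Gaussian basis functions plus a bias
    centers = np.linspace(-4.0, 6.0, 25)
    def design(v):
        Phi = np.exp(-0.5 * (v[:, None] - centers) ** 2)
        return np.hstack([Phi, np.ones((len(v), 1))])
    W, *_ = np.linalg.lstsq(design(x), T, rcond=None)      # minimizes the sum squared error

    # Compare the fitted output for omega_2 with the true posterior P(omega_2|x)
    xs = np.linspace(-2.0, 4.0, 7)
    g2 = design(xs) @ W[:, 1]
    p1 = 0.6 * np.exp(-0.5 * xs ** 2)
    p2 = 0.4 * np.exp(-0.5 * (xs - 2.0) ** 2)
    print(np.round(g2, 3))                 # approximately equals...
    print(np.round(p2 / (p1 + p2), 3))     # ...the true posterior, up to sampling noise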

This result will be true if and only if
- we use a 1-of-C encoding with [0,1] outputs, and
- the MLP has enough hidden units to represent the posteriors, and
- we have an infinite number of examples, and
- the MLP does not get trapped in a local minimum.

In practice we will have a limited number of examples, so the outputs will not always represent probabilities; for instance, there is no guarantee that they will sum up to 1. We can use this result to determine whether the network has trained properly: if the sum of the outputs differs significantly from 1, it is an indication that the MLP is not modeling the posteriors properly and that we may have to change the MLP (topology, number of hidden units, etc.).

If we still want to interpret MLP outputs as probabilities, we must enforce that they add up to 1. This can be achieved by choosing the output non-linearity to be exponential and normalizing across the output layer

$$y_k = \frac{\exp(net_k)}{\sum_{k'=1}^{N_O}\exp(net_{k'})}$$

This is known as the soft-max method, which receives its name from the fact that it can be interpreted as a smoothed version of the winner-take-all rule.
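A minimal soft-max helper, as a sketch of the normalization just described (Python/NumPy; the max-subtraction is a standard numerical-stability trick, not something discussed on the slide):

    import numpy as np

    def softmax(net):
        """Exponential output non-linearity normalized across the output layer."""
        e = np.exp(net - np.max(net))      # subtracting the max does not change the result
        return e / e.sum()

    y = softmax(np.array([1.2, -0.3, 0.5]))
    print(y, y.sum())                       # the outputs are positive and add up to 1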

The role of MLP hidden units

Assume an MLP with non-linear activation functions for the hidden layer(s) and a linear activation function for the output layer. Let us also hold constant the set of hidden weights $w_{ij}^{(m)},\; m = 1 \dots M-1$. In this case, minimizing J(W) w.r.t. the output weights $w_{ij}^{(M)}$ is a linear problem, and its solution is the pseudo-inverse

$$W^M = \arg\min_{W^M}\sum_{n=1}^{N}\sum_{k=1}^{N_O}\left(y_k^M(x^{(n)}) - t_k^{(n)}\right)^2 = \arg\min_{W^M}\left\|Y^M W^M - T\right\|^2$$

$$W^M = \left(Y^{M\,T} Y^M\right)^{-1} Y^{M\,T}\, T = Y^{M\,\dagger}\, T$$

where $W^M$ denotes the weights of the (output) linear layer, $Y^M$ denotes the outputs of the last hidden layer, and $T$ denotes the targets [Bishop, 1995].
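A hedged sketch of this pseudo-inverse solution, with invented dimensions and randomly frozen hidden weights (Python/NumPy); it only illustrates that, once Y^M is fixed, the output weights follow from a single linear least-squares solve.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 8))                     # N = 100 examples, 8 inputs (illustrative)
    T = np.eye(3)[rng.integers(0, 3, size=100)]       # 1-of-C targets, C = 3

    # Hidden layer with logistic activations, weights held constant
    W_hid = rng.normal(size=(8, 4))
    b_hid = rng.normal(size=4)
    Y_M = 1.0 / (1.0 + np.exp(-(X @ W_hid + b_hid)))

    # Linear output layer: the SSE-minimizing weights are given by the pseudo-inverse of Y^M
    Y_aug = np.hstack([Y_M, np.ones((len(Y_M), 1))])  # extra column of ones for the output biases
    W_out = np.linalg.pinv(Y_aug) @ T                 # W^M = (Y^T Y)^{-1} Y^T T
    outputs = Y_aug @ W_out                           # linear output activations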

It can be shown that the role of the output biases is to compensate for the difference between the average of the targets and that of the hidden-unit outputs

$$w_{0k}^M = E\left[t_k\right] - \sum_{j=1}^{N_{H_M}} w_{jk}^M\, E\left[y_j^M\right]$$

Knowing that the output biases can be computed in this manner, we will assume for convenience that both $T$ and $Y^M$ are zero-mean. As we will see, this allows us to interpret certain terms as scatter matrices.

Plugging the pseudo-inverse into the objective function, we obtain

$$J(W) = \left\|Y^M W^M - T\right\|^2 = \left\|Y^M Y^{M\,\dagger}\, T - T\right\|^2$$

Realizing that the SSE is the sum of only the diagonal terms,

$$J(W) = \mathrm{Tr}\left\{\left(Y^M Y^{M\,\dagger}\, T - T\right)\left(Y^M Y^{M\,\dagger}\, T - T\right)^T\right\}$$

which, after some manipulation [Bishop, 1995], can be expressed as

$$J(W) = \mathrm{Tr}\left\{T^T T\right\} - \mathrm{Tr}\left\{S_B\, S_{TOT}^{-1}\right\} \qquad \text{where} \quad S_{TOT} = Y^{M\,T} Y^M, \quad S_B = Y^{M\,T}\, T\, T^T\, Y^M$$
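The identity J(W) = Tr{T^T T} - Tr{S_B S_TOT^{-1}} can be checked numerically; the sketch below does so on random zero-mean matrices (Python/NumPy, invented sizes), which is only a consistency check of the algebra, not a model of any particular network.

    import numpy as np

    rng = np.random.default_rng(3)
    Y = rng.normal(size=(200, 4))                        # last-hidden-layer outputs, N x H
    T = np.eye(3)[rng.integers(0, 3, size=200)].astype(float)

    # Remove the means, as assumed once the output biases absorb them
    Y -= Y.mean(axis=0)
    T -= T.mean(axis=0)

    # SSE at the pseudo-inverse solution
    W = np.linalg.pinv(Y) @ T
    J = np.sum((Y @ W - T) ** 2)

    # The same value from the scatter matrices
    S_tot = Y.T @ Y
    S_b = Y.T @ T @ T.T @ Y
    J_scatter = np.trace(T.T @ T) - np.trace(S_b @ np.linalg.inv(S_tot))

    assert np.isclose(J, J_scatter)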

Since $\mathrm{Tr}\{T^T T\}$ is independent of W, minimizing J(W) is equivalent to maximizing

$$J'(W) = \mathrm{Tr}\left\{S_B\, S_{TOT}^{-1}\right\}$$

Since we are using 1-of-C encoding in the output layer, it can be shown that $S_B$ becomes

$$S_B = \sum_{k=1}^{C} N_k^2\left(\bar{y}_k^M - \bar{y}^M\right)\left(\bar{y}_k^M - \bar{y}^M\right)^T \qquad \text{with} \quad \bar{y}_k^M = E\left[y^M|\omega_k\right], \quad \bar{y}^M = E\left[y^M\right]$$

where $E[y^M|\omega_k]$ is the mean activation vector at the last hidden layer for all examples of class $\omega_k$, and $E[y^M]$ is the mean activation vector regardless of class.

Notice that this $S_B$ differs from the conventional between-class covariance matrix only by having $N_k^2$ instead of $N_k$. This means that this scatter matrix has a strong bias in favor of classes that have a large number of examples. $S_{TOT}$ is the total covariance matrix at the output of the final hidden layer.

Conclusion

Minimization of the sum squared error between the desired targets and the outputs of an MLP with linear output neurons forces the hidden layer(s) to perform a non-linear transformation of the inputs that maximizes the discriminant function $\mathrm{Tr}\{S_B S_{TOT}^{-1}\}$ measured at the output of the (last) hidden layer. In other words, the hidden layers of an MLP perform a non-linear version of Fisher's discriminant analysis (L10). This is precisely why MLPs have been demonstrated to perform so well in pattern classification tasks.

An example

Train an MLP to classify five odors with a 60-sensor array:
- the MLP has 60 inputs, one per sensor
- there are 5 outputs, one per odor
- output neurons use 1-of-C encoding
- the output layer has a linear activation function
- four hidden neurons are used (as many as LDA projections)
- the hidden layer uses the logistic sigmoid activation function

Training (a sketch of this hybrid scheme follows below):
- hidden weights and biases are trained with the steepest descent rule
- output weights and biases are trained with the pseudo-inverse rule

[Figure: sensor responses over time for the five odors: orange, apple, cherry, fruit-punch, tropical-punch]
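The hybrid scheme referenced above can be sketched as follows (Python/NumPy). The data are a synthetic stand-in, not the odor dataset, and all sizes, the learning rate, and the epoch count are arbitrary choices; the point is only the structure: a logistic hidden layer updated by steepest descent on the SSE, and a linear output layer refit by the pseudo-inverse.

    import numpy as np

    rng = np.random.default_rng(4)

    # Synthetic stand-in for the odor data: 5 Gaussian classes in a 60-dimensional input space
    C, D, H, n_per = 5, 60, 4, 40
    means = rng.normal(scale=2.0, size=(C, D))
    X = np.vstack([means[c] + rng.normal(size=(n_per, D)) for c in range(C)])
    labels = np.repeat(np.arange(C), n_per)
    T = np.eye(C)[labels]                                   # 1-of-C targets

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-np.clip(a, -50, 50)))   # logistic activation

    W1 = rng.normal(scale=0.1, size=(D, H))                 # hidden weights
    b1 = np.zeros(H)                                        # hidden biases
    lr = 1e-3

    for epoch in range(200):
        Y_M = sigmoid(X @ W1 + b1)                          # hidden activations
        Y_aug = np.hstack([Y_M, np.ones((len(X), 1))])
        W2 = np.linalg.pinv(Y_aug) @ T                      # output weights/biases by pseudo-inverse
        out = Y_aug @ W2                                    # linear output layer

        # Steepest-descent update of the hidden layer on the SSE (W2 held fixed at its current value)
        delta_out = 2.0 * (out - T)
        delta_hid = (delta_out @ W2[:-1].T) * Y_M * (1.0 - Y_M)
        W1 -= lr * (X.T @ delta_hid)
        b1 -= lr * delta_hid.sum(axis=0)

    print("training accuracy:", np.mean(out.argmax(axis=1) == labels))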

[Figures: sensor responses for the five odors (orange, apple, cherry, fruit-punch, tropical-punch), scatter plots of the hidden-neuron activations (hidden neuron axes), and the output-neuron activations plotted against example index]

Bayes discriminants and RBFs

Recall from previous lectures that the posterior can be expressed as

$$P(\omega_k|x) = \frac{p(x|\omega_k)\,P(\omega_k)}{p(x)} = \frac{p(x|\omega_k)\,P(\omega_k)}{\sum_{k'=1}^{C} p(x|\omega_{k'})\,P(\omega_{k'})}$$

If we model each class with a single kernel $p(x|\omega_k)$, this can be viewed as a simple network with normalized basis functions given by

$$\phi_k(x) = \frac{p(x|\omega_k)}{\sum_{k'=1}^{C} p(x|\omega_{k'})\,P(\omega_{k'})}$$

and hidden-to-output weights defined by

$$w_k = P(\omega_k)$$

This provides a very simple interpretation of RBFs as Bayesian classifiers, and vice versa.
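A hedged sketch of this reading of an RBF network, with one isotropic Gaussian kernel per class and made-up means, width, and priors (Python/NumPy): the normalized basis activations times the priors reproduce the Bayes posteriors.

    import numpy as np

    def gaussian_kernel(x, mu, sigma):
        """Isotropic Gaussian density used as the class kernel p(x | omega_k)."""
        d = len(mu)
        diff = x - mu
        return np.exp(-0.5 * diff @ diff / sigma**2) / (2.0 * np.pi * sigma**2) ** (d / 2)

    # One kernel per class: class means, a shared width, and class priors P(omega_k)
    means = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
    sigma = 1.0
    priors = np.array([0.5, 0.3, 0.2])

    def rbf_outputs(x):
        likelihoods = np.array([gaussian_kernel(x, m, sigma) for m in means])
        phi = likelihoods / np.sum(likelihoods * priors)   # normalized basis functions phi_k(x)
        return priors * phi                                # hidden-to-output weights w_k = P(omega_k)

    posteriors = rbf_outputs(np.array([2.0, 1.0]))
    print(posteriors, posteriors.sum())                    # equals P(omega_k | x); sums to 1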

In some situations, however, a single kernel per class may not be sufficient to model the input-space density. Let us then use M different basis functions, labeled by index j. The class-conditional density is then given by

$$p(x|\omega_k) = \sum_{j=1}^{M} p(x|j)\,P(j|\omega_k)$$

and the unconditional density is given by

$$p(x) = \sum_{k=1}^{C} p(x|\omega_k)\,P(\omega_k) = \sum_{j=1}^{M} p(x|j)\,P(j)$$

where the priors P(j) can be defined by

$$P(j) = \sum_{k=1}^{C} P(j|\omega_k)\,P(\omega_k)$$

With these expressions in mind, the posterior probabilities become

$$P(\omega_k|x) = \frac{p(x|\omega_k)\,P(\omega_k)}{p(x)} = \frac{\sum_{j=1}^{M} p(x|j)\,P(j|\omega_k)\,P(\omega_k)}{\sum_{j'=1}^{M} p(x|j')\,P(j')}$$

Moving the denominator inside the sum, and inserting the term P(j)/P(j) = 1,

$$P(\omega_k|x) = \sum_{j=1}^{M}\frac{p(x|j)\,P(j|\omega_k)\,P(\omega_k)}{\sum_{j'=1}^{M} p(x|j')\,P(j')}\cdot\frac{P(j)}{P(j)}$$

This operation allows us to regroup the expression as

$$P(\omega_k|x) = \sum_{j=1}^{M} w_{jk}\,\phi_j(x)$$

where

$$\phi_j(x) = \frac{p(x|j)\,P(j)}{\sum_{j'=1}^{M} p(x|j')\,P(j')} = P(j|x) \qquad \text{and} \qquad w_{jk} = \frac{P(j|\omega_k)\,P(\omega_k)}{P(j)} = P(\omega_k|j)$$

INTERPRETATION
- The activation of a basis function can be interpreted as the posterior probability of the presence of its prototype pattern in the input space,
- the hidden-to-output weights can be interpreted as the posterior probabilities of the classes given the prototype pattern of the basis function, and
- the output of the RBF can also be interpreted as the posterior probability of class membership.
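The same bookkeeping with M shared basis functions can be sketched as follows (Python/NumPy; the centers, width, P(j) table, and P(omega_k|j) table are invented): the basis activations equal P(j|x), and the outputs sum_j w_jk phi_j(x) are valid class posteriors.

    import numpy as np

    rng = np.random.default_rng(5)

    M, C, d = 6, 2, 2                                   # basis functions, classes, input dimension
    centers = rng.normal(scale=2.0, size=(M, d))        # Gaussian kernels p(x|j) with a shared width
    sigma = 1.0
    P_j = np.full(M, 1.0 / M)                           # priors P(j) over the basis functions
    P_k_given_j = rng.dirichlet(np.ones(C), size=M)     # w_jk = P(omega_k | j); each row sums to 1

    def phi(x):
        """phi_j(x) = p(x|j) P(j) / sum_j' p(x|j') P(j') = P(j|x); the shared Gaussian constant cancels."""
        lik = np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1) / sigma**2)
        unnorm = lik * P_j
        return unnorm / unnorm.sum()

    def class_posteriors(x):
        """P(omega_k|x) = sum_j w_jk phi_j(x)."""
        return phi(x) @ P_k_given_j

    p = class_posteriors(np.array([0.5, -1.0]))
    print(p, p.sum())                                    # non-negative and sums to 1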

Comparison between MLPs and RBFs

Similarities
- Either model can function as a universal approximator: MLPs and RBFs can approximate any continuous functional mapping with arbitrary accuracy, provided that the number of hidden units is sufficiently large.

Differences
- MLPs perform a global and distributed approximation of the target function, whereas RBFs perform a local approximation.
- MLPs partition feature space with hyper-planes; RBF decision boundaries are hyper-ellipsoids.
- The distributed representation of MLPs causes the error surface to have multiple local minima and nearly flat regions with very slow convergence. As a result, training times for MLPs are usually longer than those for RBFs.
- MLPs generalize better than RBFs in regions of feature space outside the local neighborhoods defined by the training set. On the other hand, extrapolation far from the training data is often unjustified and dangerous.
- MLPs typically require fewer parameters than RBFs to approximate a non-linear function with the same accuracy.

Differences (cont.)
- All the parameters in an MLP are trained simultaneously; the parameters in the hidden and output layers of an RBF network are typically trained separately using an efficient, faster hybrid algorithm.
- MLPs may have multiple hidden layers with complex connectivity, whereas RBFs typically have only one hidden layer and full connectivity.
- The hidden neurons of an MLP compute the inner product between an input vector and their weight vector; RBF hidden neurons compute the Euclidean distance between an input vector and the radial basis centers (see the sketch below).
- The hidden layer of an RBF is non-linear, whereas its output layer is linear. In an MLP classifier, all layers are typically non-linear; only when an MLP is used for non-linear regression is the output layer typically linear.
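A two-line contrast of the hidden-unit computations mentioned in the list above, with made-up weights, center, and input (Python/NumPy):

    import numpy as np

    x = np.array([0.4, -1.2, 0.7])

    # MLP hidden unit: inner product with a weight vector, passed through a sigmoid
    w, b = np.array([0.3, 0.8, -0.5]), 0.1
    mlp_hidden = 1.0 / (1.0 + np.exp(-(w @ x + b)))

    # RBF hidden unit: Euclidean distance to a center, passed through a Gaussian
    c, sigma = np.array([0.0, -1.0, 1.0]), 1.0
    rbf_hidden = np.exp(-np.sum((x - c) ** 2) / (2.0 * sigma**2))

    print(mlp_hidden, rbf_hidden)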

[Figure: decision regions over Feature 1 vs. Feature 2 produced by four classifiers: quadratic, KNN, MLP, and RBF]