MULTILAYER PERCEPTRONS


Last updated: Nov 26, 2012

Outline
Combining Linear Classifiers
Learning Parameters

Implementing Logical Relations
The AND and OR operations are linearly separable problems.

The XOR Problem
XOR is not linearly separable.

x1  x2  XOR  Class
 0   0    0      B
 0   1    1      A
 1   0    1      A
 1   1    0      B

How can we use linear classifiers to solve this problem?

Combining two linear classifiers
Idea: use a logical combination of two linear classifiers:
g_1(x) = x_1 + x_2 - 1/2
g_2(x) = x_1 + x_2 - 3/2

Combining two linear classifiers
Let f(x) be the unit step activation function:
f(x) = 0 for x < 0
f(x) = 1 for x >= 0
Observe that the classification problem is then solved by f(y_1 - y_2 - 1/2), where y_1 = f(g_1(x)) and y_2 = f(g_2(x)), with g_1(x) = x_1 + x_2 - 1/2 and g_2(x) = x_1 + x_2 - 3/2.

Combining two linear classifiers
This calculation can be implemented sequentially:
1. Compute y_1 and y_2 from x_1 and x_2.
2. Compute the decision f(y_1 - y_2 - 1/2) from y_1 and y_2.
Each layer in the sequence consists of one or more linear classifications. This is therefore a two-layer perceptron.

The Two-Layer Perceptron

        Layer 1        Layer 2
x1  x2  y1     y2      Output
 0   0  0 (-)  0 (-)   B (0)
 0   1  1 (+)  0 (-)   A (1)
 1   0  1 (+)  0 (-)   A (1)
 1   1  1 (+)  1 (+)   B (0)

where y_1 = f(g_1(x)), y_2 = f(g_2(x)), and the output is f(y_1 - y_2 - 1/2), with g_1(x) = x_1 + x_2 - 1/2 and g_2(x) = x_1 + x_2 - 3/2.
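This construction is small enough to check directly. Below is a minimal Python sketch (my own illustration, not from the lecture) that pushes all four XOR inputs through the two hidden units and the output unit using the weights above.

```python
def step(a):
    """Unit step activation: 1 if a >= 0, else 0."""
    return 1 if a >= 0 else 0

def two_layer_xor(x1, x2):
    # Hidden layer: y1 = f(g1(x)), y2 = f(g2(x))
    y1 = step(x1 + x2 - 0.5)   # g1(x) = x1 + x2 - 1/2
    y2 = step(x1 + x2 - 1.5)   # g2(x) = x1 + x2 - 3/2
    # Output layer: f(y1 - y2 - 1/2); 1 = Class A, 0 = Class B
    return step(y1 - y2 - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, two_layer_xor(x1, x2))   # prints 0, 1, 1, 0 as in the table above
```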

The Two-Layer Perceptron
The first layer performs a nonlinear mapping that makes the data linearly separable:
y_1 = f(g_1(x)) and y_2 = f(g_2(x)).

The Two-Layer Perceptron Architecture
[Network diagram: an input layer (x_1, x_2), a hidden layer computing y_1 = f(g_1(x)) and y_2 = f(g_2(x)) with g_1(x) = x_1 + x_2 - 1/2 and g_2(x) = x_1 + x_2 - 3/2, and an output layer computing f(y_1 - y_2 - 1/2); a constant bias input feeds each unit.]

The Two-Layer Perceptron
Note that the hidden layer maps the plane onto the vertices of the unit square, via y_1 = f(g_1(x)) and y_2 = f(g_2(x)).

Higher Dimensions
Each hidden unit realizes a hyperplane discriminant function. The output of each hidden unit is 0 or 1, depending upon the location of the input vector relative to the hyperplane:
x ∈ R^l  →  y = [y_1, ..., y_p]^T,  y_i ∈ {0, 1},  i = 1, 2, ..., p

Higher Dimensions
Together, the hidden units map the input x ∈ R^l onto a vertex y = [y_1, ..., y_p]^T, y_i ∈ {0, 1}, of a p-dimensional unit hypercube.

Two-Layer Perceptron
These p hyperplanes partition the l-dimensional input space into polyhedral regions. Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.

Two-Layer Perceptron
In this example, the vertex (0, 0, 1) corresponds to the region of the input space where:
g_1(x) < 0
g_2(x) < 0
g_3(x) > 0
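As a concrete illustration of this mapping, the short sketch below assigns an input to its hypercube vertex given three example hyperplanes; the coefficients are hypothetical, chosen only so that the printed vertex matches the (0, 0, 1) case above.

```python
def hidden_vertex(x, hyperplanes):
    """Map input x to its hypercube vertex (y_1, ..., y_p),
    where y_i = 1 if g_i(x) >= 0 and 0 otherwise."""
    return tuple(1 if sum(w * xi for w, xi in zip(ws, x)) + b >= 0 else 0
                 for ws, b in hyperplanes)

# Three hypothetical hyperplanes g_i(x) = w_i . x + b_i over a 2-D input space
hyperplanes = [((1.0, 1.0), -0.5),    # g_1
               ((1.0, -1.0), -1.0),   # g_2
               ((0.0, 1.0), 0.5)]     # g_3

print(hidden_vertex((0.1, 0.2), hyperplanes))   # (0, 0, 1): g_1 < 0, g_2 < 0, g_3 > 0
```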

Limitations of a Two-Layer Perceptron
The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets. Thus, the two-layer perceptron has the capability to classify vectors into classes that consist of unions of polyhedral regions. But NOT ANY union: it depends on the relative position of the corresponding vertices. How can we solve this problem?

The Three-Layer Perceptron
Suppose that Class A consists of the union of K polyhedra in the input space. Use K neurons in the 2nd hidden layer. Train each to classify one Class A vertex as positive, the rest negative. Now use an output neuron that implements the OR function.
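A single threshold unit suffices for the OR: for example, it can fire whenever the sum of its binary inputs exceeds 1/2. A minimal sketch with illustrative weights (not from the lecture):

```python
def step(a):
    return 1 if a >= 0 else 0

def or_neuron(z):
    """OR of binary inputs z_1..z_K as a threshold unit: fires if any z_k = 1."""
    return step(sum(z) - 0.5)

print(or_neuron([0, 0, 0]), or_neuron([0, 1, 0]), or_neuron([1, 1, 1]))   # 0 1 1
```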

The Three-Layer Perceptron
Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.

The Three-Layer Perceptron
The first layer of the network forms the hyperplanes in the input space.
The second layer of the network forms the polyhedral regions of the input space.
The third layer forms the appropriate unions of these regions and maps each to the appropriate class.

Outline
Combining Linear Classifiers
Learning Parameters

Training Data
The training data consist of N input-output pairs (y(i), x(i)), i = 1, ..., N, where
y(i) = [y_1(i), ..., y_{k_L}(i)]^t and x(i) = [x_1(i), ..., x_{k_0}(i)]^t.

Choosing an Activation Function
The unit step activation function means that the error rate of the network is a discontinuous function of the weights. This makes it difficult to learn optimal weights by minimizing the error. To fix this problem, we need to use a smooth activation function. A popular choice is the sigmoid function we used for logistic regression:

Smooth Activation Function
f(a) = 1 / (1 + exp(-a))
[Plot: the sigmoid rising smoothly from 0 to 1 as a function of a = w^t φ(x).]
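A small Python sketch of the sigmoid and its derivative (my own illustration; the identity f'(a) = f(a)(1 - f(a)) is the one used later in the backpropagation derivation):

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid f(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative f'(a) = f(a) * (1 - f(a))."""
    fa = sigmoid(a)
    return fa * (1.0 - fa)

print(sigmoid(0.0), sigmoid_prime(0.0))   # 0.5 0.25
```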

Output: Two Classes
For a binary classification problem, there is a single output node with activation function given by
f(a) = 1 / (1 + exp(-a)).
Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.

Output: K > 2 Classes
For a K-class problem, we use K outputs and the softmax function
y_k = exp(a_k) / Σ_j exp(a_j).
Since the outputs are constrained to lie between 0 and 1 and sum to 1, y_k can be interpreted as the probability that the input vector belongs to class k.
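A sketch of the softmax (illustrative; subtracting the maximum is a standard numerical-stability guard, not part of the slide's formula):

```python
import numpy as np

def softmax(a):
    """Softmax y_k = exp(a_k) / sum_j exp(a_j)."""
    e = np.exp(a - np.max(a))   # subtract max(a) to avoid overflow
    return e / e.sum()

y = softmax(np.array([2.0, 1.0, 0.1]))
print(y, y.sum())   # components in (0, 1), summing to 1
```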

Non-Convex
Now each layer of our multi-layer perceptron is a logistic regressor. Recall that optimizing the weights in logistic regression results in a convex optimization problem. Unfortunately, the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex. This makes it difficult to determine an exact solution. Instead, we typically use gradient descent to find a locally optimal solution for the weights. The specific learning algorithm is called the backpropagation algorithm.

Nonlinear Classification and Regression: Outline
Multi-Layer Perceptrons
The Back-Propagation Learning Algorithm
Generalized Linear Models
Radial Basis Function Networks
Sparse Kernel Machines
  Nonlinear SVMs and the Kernel Trick
  Relevance Vector Machines

The Backpropagation Algorithm
Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533-536.

Notation
Assume a network with L layers:
k_0 nodes in the input layer
k_r nodes in the r-th layer

Notation
Let y_k^{r-1} be the output of the k-th neuron of layer r-1.
Let w_{jk}^r be the weight of the synapse on the j-th neuron of layer r from the k-th neuron of layer r-1.

Input
y_k^0(i) = x_k(i), k = 1, ..., k_0

Notation
Let v_j^r(i) be the total input to the j-th neuron of layer r:
v_j^r(i) = (w_j^r)^t y^{r-1}(i) = Σ_{k=0}^{k_{r-1}} w_{jk}^r y_k^{r-1}(i)
Then
y_j^r(i) = f(v_j^r(i)) = f( Σ_{k=0}^{k_{r-1}} w_{jk}^r y_k^{r-1}(i) ),
where we define y_0^{r-1}(i) = +1 to incorporate the bias term.
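A sketch of one layer's forward computation under this notation (my own illustration, assuming the sigmoid f from above; the weight matrix W has w_j^r as its j-th row, with the bias weight in column 0 and the example numbers chosen arbitrarily):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer_forward(W, y_prev):
    """One layer: v = W [1; y^{r-1}] and y^r = f(v), with y_0 = +1 carrying the bias."""
    y_aug = np.concatenate(([1.0], y_prev))
    v = W @ y_aug
    return sigmoid(v), v

y, v = layer_forward(np.array([[0.1, 0.5, -0.3],
                               [-0.2, 0.4, 0.8]]), np.array([1.0, 2.0]))
print(v, y)   # total inputs v_j^r and outputs y_j^r of a 2-unit layer
```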

Cost Function
A common cost function is the squared error:
J = Σ_{i=1}^N ε(i)
where
ε(i) = (1/2) Σ_{m=1}^{k_L} (e_m(i))^2 = (1/2) Σ_{m=1}^{k_L} (y_m(i) - ŷ_m(i))^2
and ŷ_m(i) = y_m^L(i) is the output of the network.

Cost Function
To summarize, the error for input i is given by
ε(i) = (1/2) Σ_{m=1}^{k_L} (e_m(i))^2 = (1/2) Σ_{m=1}^{k_L} (ŷ_m(i) - y_m(i))^2
where ŷ_m(i) = y_m^L(i) is the output of the output layer, and each layer is related to the previous layer through
y_j^r(i) = f(v_j^r(i))  and  v_j^r(i) = (w_j^r)^t y^{r-1}(i).

Gradient Descent
Gradient descent starts with an initial guess at the weights over all layers of the network. We then use these weights to compute the network output ŷ(i) for each input vector x(i) in the training data. This allows us to calculate the error
ε(i) = (1/2) Σ_{m=1}^{k_L} (e_m(i))^2 = (1/2) Σ_{m=1}^{k_L} (ŷ_m(i) - y_m(i))^2
for each of these inputs. Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:
w_j^r(new) = w_j^r(old) - μ ∂J/∂w_j^r = w_j^r(old) - μ Σ_{i=1}^N ∂ε(i)/∂w_j^r

Gradient Descent
Since v_j^r(i) = (w_j^r)^t y^{r-1}(i), the influence of the j-th weight vector of the r-th layer on the error can be expressed as:
∂ε(i)/∂w_j^r = (∂ε(i)/∂v_j^r(i)) (∂v_j^r(i)/∂w_j^r) = δ_j^r(i) y^{r-1}(i)
where δ_j^r(i) ≡ ∂ε(i)/∂v_j^r(i).

Gradient Descent
∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i), where δ_j^r(i) ≡ ∂ε(i)/∂v_j^r(i).
For an intermediate layer, we cannot compute δ_j^r(i) directly. However, δ_j^r(i) can be computed inductively, starting from the output layer.

Backpropagation: The Output Layer
∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i), where δ_j^r(i) ≡ ∂ε(i)/∂v_j^r(i) and
ε(i) = (1/2) Σ_{m=1}^{k_L} (e_m(i))^2 = (1/2) Σ_{m=1}^{k_L} (ŷ_m(i) - y_m(i))^2.
Recall that ŷ_j(i) = y_j^L(i) = f(v_j^L(i)). Thus at the output layer we have
δ_j^L(i) = ∂ε(i)/∂v_j^L(i) = (∂ε(i)/∂e_j^L(i)) (∂e_j^L(i)/∂v_j^L(i)) = e_j^L(i) f'(v_j^L(i)).
With the sigmoid f(a) = 1 / (1 + exp(-a)), we have f'(a) = f(a)(1 - f(a)), so
δ_j^L(i) = e_j^L(i) f(v_j^L(i)) (1 - f(v_j^L(i))).

Backpropagation: Hidden Layers
Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of the dependence on the total inputs of neurons in the following layer:
δ_j^{r-1}(i) = ∂ε(i)/∂v_j^{r-1}(i) = Σ_{k=1}^{k_r} (∂ε(i)/∂v_k^r(i)) (∂v_k^r(i)/∂v_j^{r-1}(i)) = Σ_{k=1}^{k_r} δ_k^r(i) (∂v_k^r(i)/∂v_j^{r-1}(i))
where
v_k^r(i) = Σ_{m=0}^{k_{r-1}} w_{km}^r y_m^{r-1}(i) = Σ_{m=0}^{k_{r-1}} w_{km}^r f(v_m^{r-1}(i)).
Thus ∂v_k^r(i)/∂v_j^{r-1}(i) = w_{kj}^r f'(v_j^{r-1}(i)), and so
δ_j^{r-1}(i) = f'(v_j^{r-1}(i)) Σ_{k=1}^{k_r} δ_k^r(i) w_{kj}^r = f(v_j^{r-1}(i)) (1 - f(v_j^{r-1}(i))) Σ_{k=1}^{k_r} δ_k^r(i) w_{kj}^r.
Thus once the δ_k^r(i) are determined, they can be propagated backward to calculate the δ_j^{r-1}(i) using this inductive formula.

Backpropagation: Summary of Algorithm
1. Initialization: Initialize all weights with small random values.
Then repeat until convergence:
2. Forward Pass: For each input vector, run the network in the forward direction, calculating
   v_j^r(i) = (w_j^r)^t y^{r-1}(i),   y_j^r(i) = f(v_j^r(i)),
   and finally
   ε(i) = (1/2) Σ_{m=1}^{k_L} (e_m(i))^2 = (1/2) Σ_{m=1}^{k_L} (ŷ_m(i) - y_m(i))^2.
3. Backward Pass: Starting with the output layer, use the inductive formula to compute the δ_j^{r-1}(i):
   Output layer (base case): δ_j^L(i) = e_j^L(i) f'(v_j^L(i))
   Hidden layers (inductive case): δ_j^{r-1}(i) = f'(v_j^{r-1}(i)) Σ_{k=1}^{k_r} δ_k^r(i) w_{kj}^r
4. Update Weights:
   w_j^r(new) = w_j^r(old) - μ Σ_{i=1}^N ∂ε(i)/∂w_j^r,   where ∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i).
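The following is a compact, illustrative Python sketch of the batch algorithm above for a fully connected network with sigmoid units and squared error. It follows the slide's notation loosely (one weight matrix per layer, with the bias folded into column 0); it is my own sketch rather than code from the lecture, and the hyperparameters (learning rate, epoch count, seed, weight scale) are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_weights(layer_sizes, scale=0.5, seed=0):
    """Step 1: random weights; W[r] has shape (k_r, k_{r-1} + 1), column 0 is the bias."""
    rng = np.random.default_rng(seed)
    return [scale * rng.standard_normal((k, k_prev + 1))
            for k_prev, k in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(W, x):
    """Step 2: forward pass; returns outputs y^r (with y^0 = x) and total inputs v^r."""
    ys, vs = [x], []
    for Wr in W:
        y_aug = np.concatenate(([1.0], ys[-1]))       # y_0 = +1 carries the bias
        v = Wr @ y_aug
        vs.append(v)
        ys.append(sigmoid(v))
    return ys, vs

def backprop_epoch(W, X, Y, mu=0.5):
    """Steps 2-4: forward, backward, and weight update for one batch gradient-descent step."""
    grads = [np.zeros_like(Wr) for Wr in W]
    for x, y in zip(X, Y):
        ys, _ = forward(W, x)
        fprime = [yr * (1.0 - yr) for yr in ys[1:]]    # f'(v) = f(v)(1 - f(v))
        delta = (ys[-1] - y) * fprime[-1]              # output layer: e^L * f'(v^L)
        for r in reversed(range(len(W))):
            y_aug = np.concatenate(([1.0], ys[r]))
            grads[r] += np.outer(delta, y_aug)         # d eps / d w_j^r = delta_j^r y^{r-1}
            if r > 0:                                  # inductive step to the previous layer
                delta = (W[r][:, 1:].T @ delta) * fprime[r - 1]
    return [Wr - mu * g for Wr, g in zip(W, grads)]

# Example: a 2-2-1 network on the XOR data from earlier in the lecture
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
W = init_weights([2, 2, 1])
for _ in range(10000):
    W = backprop_epoch(W, X, Y)
print([round(float(forward(W, x)[0][-1][0]), 2) for x in X])
# typically approaches [0, 1, 1, 0]; a poor seed can land in a local minimum (see Remarks)
```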

Batch vs Online Learning
As described, on each iteration backprop updates the weights based upon all of the training data:
w_j^r(new) = w_j^r(old) - μ Σ_{i=1}^N ∂ε(i)/∂w_j^r,   where ∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i).
This is called batch learning.
An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input:
w_j^r(new) = w_j^r(old) - μ ∂ε(i)/∂w_j^r,   where ∂ε(i)/∂w_j^r = δ_j^r(i) y^{r-1}(i).
This is called online learning.
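The scheduling difference can be illustrated with a toy stand-in for the backpropagated gradient (here a single linear unit with squared error; an illustration of the update schedules only, not of a multi-layer network):

```python
import numpy as np

def grad_eps(w, x, y):
    """Gradient of eps(i) = 0.5 * (w.x - y)^2, standing in for delta_j^r y^{r-1}."""
    return (w @ x - y) * x

X = np.array([[1., 0.], [1., 1.], [1., 2.]])   # column 0 plays the role of the +1 bias input
Y = np.array([1., 2., 3.])
mu = 0.1

w_batch = np.zeros(2)                          # batch: one update per pass over the data
for epoch in range(200):
    w_batch -= mu * sum(grad_eps(w_batch, x, y) for x, y in zip(X, Y))

w_online = np.zeros(2)                         # online: one update per training example
for epoch in range(200):
    for x, y in zip(X, Y):
        w_online -= mu * grad_eps(w_online, x, y)

print(np.round(w_batch, 3), np.round(w_online, 3))   # both approach [1, 1]
```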

Batch vs Online Learning
One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence. On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum. Changing the order of presentation of the inputs from epoch to epoch may also improve results.

Remarks
Local Minima: The objective function is in general non-convex, and so the solution may not be globally optimal.
Stopping Criterion: Typically we stop when the change in weights or the change in the error function falls below a threshold.
Learning Rate: The speed and reliability of convergence depend on the learning rate μ.