EEE 241: Linear Systems


EEE 241: Linear Systems, Summary: Backpropagation

BACKPROPAGATION

The perceptron rule as well as the Widrow-Hoff learning rule were designed to train single-layer networks. They suffer from the same disadvantage: they are only able to solve linearly separable classification problems. This implies that some simple classification problems, such as XOR, cannot be solved using single-layer neural networks. The obvious solution is to use multi-layer networks. Backpropagation is a learning algorithm that can be used to train multi-layer networks.

MULTILAYER PERCEPTRON

Multilayer networks are shown in figure 1. The network consists of three cascaded networks; the output of the first network is the second network's input. Superscripts are used to indicate the layer. In general, the following notation is used for the number of neurons in a network of M layers:

    N - L^1 - L^2 - L^3 - ... - L^M    (1)

where N refers to the number of inputs.

PATTERN CLASSIFICATION

The exclusive-or (XOR) problem cannot be solved by a single-layer network. This example was used by Minsky and Papert (1969) to demonstrate the limitations of single-layer networks. To illustrate this, consider the input/output pairs of the XOR gate:

    {x_1 = [0; 0], t_1 = 0}    (2)
    {x_2 = [0; 1], t_2 = 1}    (3)
    {x_3 = [1; 0], t_3 = 1}    (4)
    {x_4 = [1; 1], t_4 = 0}    (5)

A representation is shown in figure 2. A two-layer network can solve the problem easily. One possibility is to use a two-layer network in which the first layer creates two linear decision boundaries and the second layer combines these decision boundaries, as shown in figure 1 (top).

Example

Let us take the input x = [1; 0], i.e. x_1 = 1, x_2 = 0. Propagating this input through the two-layer network, the first layer produces its two outputs (equations (6) and (7)) and the second layer combines them to give y = 1 (equation (8)), which confirms the XOR operation.

FUNCTION APPROXIMATION

The other important application of neural networks is function approximation (particularly of nonlinear functions). Function approximation can be accomplished using multiple layers. Consider the 1-2-1 network shown in figure 1 (bottom), with weights and biases W^1, b^1 for the hidden layer and W^2, b^2 for the output layer (equations (9)-(12)). This network produces the nonlinear function shown in figure 3. By changing the weights we can approximate other functions of similar complexity. It has been shown that a multilayer network with a sigmoid transfer function in the hidden layer and a sufficiently large number of hidden neurons can approximate any function of interest. The question now is how to derive a learning algorithm for multiple-layer networks. This is discussed in the next section.

BACKPROPAGATION

Backpropagation is a learning algorithm for multiple-layer networks. It uses an optimization technique called steepest (or gradient) descent. We continue to use the notation in which the superscript indicates the layer. The output of layer m+1 is given by

    y^{m+1} = f^{m+1}(W^{m+1} y^m + b^{m+1})    (13)

for m = 0, 1, ..., M-1, where M is the total number of layers and y^0 = x. In a similar way to the Widrow-Hoff algorithm, backpropagation uses the mean square error as the performance index. The algorithm is provided with pairs of training data

    (x_1, t_1), (x_2, t_2), ..., (x_K, t_K)

where x_k is the input to the network and t_k is the target output corresponding to x_k. The algorithm adjusts the weights and biases to minimize the expectation E of the square error, that is

    F(W) = E[e^2] = E[(t - y)^2]    (14)

where W in equation (14) refers to the augmented weight vector (weights and biases). The expectation of the square error is replaced by the squared error at iteration k as follows:

    F(W) = (t - y)^T (t - y)    (15)
         = e^T e    (16)
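The layer recursion in equation (13) maps directly to code. Below is a minimal NumPy sketch of the forward pass, added here for illustration; it is not part of the original summary. The layer sizes, weight values and the choice of a logistic-sigmoid hidden layer followed by a linear output layer are assumptions made only to mirror the 1-2-1 architecture discussed above.

```python
import numpy as np

def logsig(n):
    # Logistic sigmoid transfer function, 1 / (1 + exp(-n)).
    return 1.0 / (1.0 + np.exp(-n))

def purelin(n):
    # Linear transfer function.
    return n

def forward(x, weights, biases, transfer_fns):
    # Implements y^{m+1} = f^{m+1}(W^{m+1} y^m + b^{m+1}) with y^0 = x.
    y = x
    for W, b, f in zip(weights, biases, transfer_fns):
        y = f(W @ y + b)
    return y

# Hypothetical 1-2-1 network; all numerical values are placeholders.
W1 = np.array([[3.0], [3.0]]); b1 = np.array([-1.0, 1.0])
W2 = np.array([[1.0, 1.0]]);   b2 = np.array([0.0])

y = forward(np.array([0.5]), [W1, W2], [b1, b2], [logsig, purelin])
print(y)
```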

Fig. 1. Multi-layer networks. Top: general case; bottom: network for function approximation.
Fig. 2. XOR implementation.
Fig. 3. Function approximation.

The steepest descent algorithm consists of the following update equations:

    w_{i,j}^m(k+1) = w_{i,j}^m(k) - α ∂F/∂w_{i,j}^m
    b_i^m(k+1) = b_i^m(k) - α ∂F/∂b_i^m    (17)

where α is the learning rate. The calculation of the partial derivatives was straightforward in the case of single-layer networks; it is not in the case of multiple layers, where the chain rule must be used.
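As a small illustration of the update rule (17), the sketch below applies one steepest-descent step to the weight and bias of a single linear neuron with squared-error cost, i.e. the single-layer case where the derivatives are straightforward. This is an added example, not part of the original summary; the function names and all numeric values are placeholders.

```python
def grad_w(w, b, x, t):
    # dF/dw for F = (t - y)^2 with a linear neuron y = w * x + b.
    return -2.0 * (t - (w * x + b)) * x

def grad_b(w, b, x, t):
    # dF/db for the same cost.
    return -2.0 * (t - (w * x + b))

alpha, w, b = 0.1, 0.5, 0.0   # learning rate and initial parameters (placeholders)
x, t = 1.0, 2.0               # one training pair (placeholders)

# One steepest-descent step, as in equation (17).
gw, gb = grad_w(w, b, x, t), grad_b(w, b, x, t)
w, b = w - alpha * gw, b - alpha * gb
print(w, b)
```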

Because the error is a function of the weights, the cost function also depends on the weights. The chain rule can be summarized as follows: assume f is a function of x and x is a function of t; the derivative of f with respect to t is then

    df/dt = (df/dx)(dx/dt)    (18)

Using the chain rule, the partial derivatives in (17) become

    ∂F/∂w_{i,j}^m = (∂F/∂n_i^m)(∂n_i^m/∂w_{i,j}^m)    (19)
    ∂F/∂b_i^m = (∂F/∂n_i^m)(∂n_i^m/∂b_i^m)    (20)

For each layer, we know that n_i^m is a weighted sum of the inputs plus a bias, that is

    n_i^m = Σ_{j=1}^{L^{m-1}} w_{i,j}^m y_j^{m-1} + b_i^m    (21)

where L^{m-1} is the number of neurons in layer m-1. Therefore, the partial derivatives are

    ∂n_i^m/∂w_{i,j}^m = y_j^{m-1}    (22)
    ∂n_i^m/∂b_i^m = 1    (23)

We define the sensitivity of F as

    s_i^m = ∂F/∂n_i^m    (24)

In this case, we can write

    ∂F/∂w_{i,j}^m = s_i^m y_j^{m-1}    (25)
    ∂F/∂b_i^m = s_i^m    (26)

Finally, the gradient descent algorithm becomes

    w_{i,j}^m(k+1) = w_{i,j}^m(k) - α s_i^m y_j^{m-1}    (27)
    b_i^m(k+1) = b_i^m(k) - α s_i^m    (28)

or, in matrix form,

    W^m(k+1) = W^m(k) - α s^m (y^{m-1})^T    (29)
    b^m(k+1) = b^m(k) - α s^m    (30)

where the sensitivity vector is

    s^m = ∂F/∂n^m    (31)

To compute the new weights and biases with equations (29) and (30), we need the previous weights and biases as well as the sensitivities. Therefore, the sensitivities must be calculated first in order to obtain the next values of the weights and biases. In other words, to implement gradient descent we need to estimate the sensitivities.

Fig. 4. Illustration of the relationship between n^{m+1} and n^m.

BACKPROPAGATING THE SENSITIVITIES

The chain rule is used for this purpose: the sensitivity at layer m is computed from the sensitivity at layer m+1. We first form the Jacobian matrix

    ∂n^{m+1}/∂n^m = [ ∂n_i^{m+1}/∂n_j^m ],   i = 1, ..., L^{m+1},  j = 1, ..., L^m    (32)

We want to find an expression for each element of this matrix. Element (i, j) of the Jacobian matrix is

    ∂n_i^{m+1}/∂n_j^m = ∂( Σ_{l=1}^{L^m} w_{i,l}^{m+1} y_l^m + b_i^{m+1} )/∂n_j^m    (33)

Note that, in general, it is possible to write

    ∂n_i^{m+1}/∂n_j^m = w_{i,j}^{m+1} (∂y_j^m/∂n_j^m)    (34)

Thus we get, for each element of the Jacobian matrix,

    ∂n_i^{m+1}/∂n_j^m = w_{i,j}^{m+1} (∂f^m(n_j^m)/∂n_j^m)    (35)
                      = w_{i,j}^{m+1} g^m(n_j^m)    (36)-(37)

where

    g^m(n_j^m) = ∂f^m(n_j^m)/∂n_j^m    (38)
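The derivative g^m of a layer's transfer function, defined in (38), is all that is needed from each layer during the backward pass. The short sketch below, added here for illustration and not part of the original summary, evaluates it for a logistic-sigmoid layer and for a linear layer; the sample net-input values are placeholders.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def g_logsig(n):
    # g(n) = d/dn logsig(n) = logsig(n) * (1 - logsig(n))
    y = logsig(n)
    return y * (1.0 - y)

def g_purelin(n):
    # g(n) = d/dn n = 1
    return np.ones_like(n)

n = np.array([-0.75, 0.54])  # placeholder net-input values n^m
print(g_logsig(n))   # element-wise derivatives g^m(n_j^m) for a logsig layer
print(g_purelin(n))  # element-wise derivatives for a linear layer
```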

Therefore, the Jacobian matrix can be written in matrix form as

    ∂n^{m+1}/∂n^m = W^{m+1} G^m(n^m)    (39)

with the diagonal matrix

    G^m(n^m) = diag( g^m(n_1^m), g^m(n_2^m), g^m(n_3^m), ..., g^m(n_{L^m}^m) )    (40)

Now we can write a recurrence relationship for the sensitivities:

    s^m = ( ∂n^{m+1}/∂n^m )^T ∂F/∂n^{m+1}    (41)

thus

    s^m = G^m(n^m) [W^{m+1}]^T ∂F/∂n^{m+1}    (42)
        = G^m(n^m) [W^{m+1}]^T s^{m+1}    (43)

Equation (43) shows that the sensitivity is propagated backwards through the network, from the last layer to the first layer:

    s^M → s^{M-1} → ... → s^2 → s^1    (44)

Let us take a look at the final layer:

    s_i^M = ∂F/∂n_i^M = ∂[(t - y)^T (t - y)]/∂n_i^M = ∂[ Σ_{j=1}^{L^M} (t_j - y_j)^2 ]/∂n_i^M    (45)
          = -2 (t_i - y_i) ∂y_i/∂n_i^M    (46)

For the last layer,

    ∂y_i/∂n_i^M = ∂f^M(n_i^M)/∂n_i^M = g^M(n_i^M)    (47)

Therefore

    s_i^M = -2 (t_i - y_i) g^M(n_i^M)    (48)

or, in matrix form,

    s^M = -2 G^M(n^M)(t - y)    (49)

The algorithm consists of three steps:

1) Forward propagation of the input through the network:

    y^{m+1} = f^{m+1}(W^{m+1} y^m + b^{m+1}),   m = 0, 1, 2, ..., M-1    (50)
    y^0 = x    (51)

2) Backward propagation of the sensitivities, starting with the output layer:

    s^M = -2 G^M(n^M)(t - y^M)    (52)
    s^m = G^m(n^m)(W^{m+1})^T s^{m+1},   m = M-1, M-2, ..., 2, 1    (53)-(54)

3) Update of the weights and biases:

    W^m(k+1) = W^m(k) - α s^m (y^{m-1})^T    (55)
    b^m(k+1) = b^m(k) - α s^m    (56)

Example: Function approximation

We propose to use a 1-2-1 network to approximate the nonlinear function

    g(x) = 1 + sin(πx/4)    (57)

in the interval x ∈ [-2, 2]. The backpropagation algorithm converges to a set of weights and biases W^1, b^1, W^2, b^2 (equations (58)-(61)). The actual function and the neural network approximation are compared in figure 5.

Fig. 5. Neural network approximation of the function given by (57).
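The three steps above translate directly into code. The following sketch, added for illustration and not part of the original summary, trains a 1-2-1 network with a logistic-sigmoid hidden layer and a linear output layer to approximate g(x) = 1 + sin(πx/4) on [-2, 2] using the updates (50)-(56). The random initialization, the learning rate and the number of iterations are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def g(x):
    # Target function of equation (57).
    return 1.0 + np.sin(np.pi * x / 4.0)

# 1-2-1 network: 1 input, 2 logsig hidden neurons, 1 linear output neuron.
W1 = rng.uniform(-0.5, 0.5, size=(2, 1)); b1 = rng.uniform(-0.5, 0.5, size=(2, 1))
W2 = rng.uniform(-0.5, 0.5, size=(1, 2)); b2 = rng.uniform(-0.5, 0.5, size=(1, 1))
alpha = 0.1  # learning rate (assumed value)

for k in range(20000):
    x = rng.uniform(-2.0, 2.0, size=(1, 1))   # one training input in [-2, 2]
    t = g(x)                                  # its target

    # Step 1: forward propagation, equations (50)-(51).
    n1 = W1 @ x + b1; y1 = logsig(n1)
    n2 = W2 @ y1 + b2; y2 = n2                # linear output layer

    # Step 2: backpropagation of the sensitivities, equations (52)-(54).
    s2 = -2.0 * (t - y2)                      # G^2 = 1 for the linear layer
    s1 = ((1.0 - y1) * y1) * (W2.T @ s2)      # G^1 = diag((1 - y1) y1) for logsig

    # Step 3: update of the weights and biases, equations (55)-(56).
    W2 -= alpha * s2 @ y1.T; b2 -= alpha * s2
    W1 -= alpha * s1 @ x.T;  b1 -= alpha * s1

# Compare the trained network with g(x) on a few test points.
xs = np.linspace(-2.0, 2.0, 5).reshape(1, -1)
approx = W2 @ logsig(W1 @ xs + b1) + b2
print(np.round(approx, 2))
print(np.round(g(xs), 2))
```

Each pass uses a single randomly drawn training point, i.e. pattern-by-pattern updating, which matches the one-pair update of equations (55)-(56).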

Example

This example illustrates one iteration of the backpropagation algorithm. The network is the one shown in figure 1 (bottom): layer 1 has a logistic-sigmoid transfer function and layer 2 has a linear transfer function. The initial guess for the weights and biases is

    W^1 = [0.700; 0.00]    (62)
    b^1 = [0.800; 0.300]    (63)
    W^2 = [0.0900  0.700]    (64)
    b^2 = 0.8    (65)

Forward propagation

The initial input is x = 1 and the corresponding target is t = 1 + sin(π/4). The net input and output of the first layer are

    n^1 = W^1 x + b^1    (66)-(67)
    y^1 = [1/(1 + e^{-n_1^1}); 1/(1 + e^{-n_2^1})] = [0.308; 0.638]    (68)-(69)

The net input and output of the second layer are

    n^2 = W^2 y^1 + b^2    (70)-(71)
    y^2 = n^2 = 0.69    (72)

The error is then

    e = t - y^2 = 1 + sin(π/4) - y^2    (73)

Propagation of the sensitivities

The goal is to obtain the matrix G^m for each of the layers. For layer 1 (logistic sigmoid),

    g^1(n) = d/dn [ 1/(1 + e^{-n}) ]    (74)
           = e^{-n}/(1 + e^{-n})^2    (75)
           = (1 - y)(y)    (76)

For the second layer (linear transfer function),

    g^2(n) = 1    (77)

Now it is possible to compute the sensitivities. For the output layer,

    s^2 = -2 G^2(n^2)(t - y^2) = -2(t - y^2)    (78)-(79)

The first-layer sensitivity is

    s^1 = G^1(n^1)[W^2]^T s^2 = [ (1 - y_1^1) y_1^1 , 0 ; 0 , (1 - y_2^1) y_2^1 ] [W^2]^T s^2    (80)-(82)

The final step consists of updating the weights and biases (with a small learning rate α):

    W^2(1) = W^2(0) - α s^2 (y^1)^T    (83)
    b^2(1) = b^2(0) - α s^2    (84)
    W^1(1) = W^1(0) - α s^1 (x)^T    (85)
    b^1(1) = b^1(0) - α s^1    (86)

Using these equations we obtain the numerical values below:

    W^2(1) = [0.090  0.66]    (87)-(88)
    b^2(1) = 0.86    (89)
    W^1(1) = [0.699; 0.098]    (90)
    b^1(1) = [0.799; 0.98]    (91)

The difference from the initial guess is small because the learning rate is small.
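As a final illustration, added here and not part of the original summary, the sensitivities computed in this example can be checked numerically: the backpropagated gradient of F = (t - y)^T (t - y) should agree with a finite-difference estimate. The sketch below does this for a 1-2-1 network; the function name loss_and_grads and all numeric values are assumptions chosen only for illustration.

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

def loss_and_grads(W1, b1, W2, b2, x, t):
    # Forward pass, then backpropagated gradients of F = (t - y)^T (t - y).
    y1 = logsig(W1 @ x + b1)
    y2 = W2 @ y1 + b2                      # linear output layer
    F = float((t - y2).T @ (t - y2))
    s2 = -2.0 * (t - y2)                   # s^2 = -2 G^2 (t - y), with G^2 = 1
    s1 = ((1.0 - y1) * y1) * (W2.T @ s2)   # s^1 = G^1 [W^2]^T s^2
    return F, s1 @ x.T, s1, s2 @ y1.T, s2  # F, dF/dW1, dF/db1, dF/dW2, dF/db2

# Arbitrary placeholder parameters and one training pair.
W1 = np.array([[0.5], [-0.3]]); b1 = np.array([[0.2], [-0.1]])
W2 = np.array([[0.4, 0.6]]);    b2 = np.array([[0.1]])
x = np.array([[1.0]]); t = np.array([[1.0 + np.sin(np.pi / 4.0)]])

F0, dW1, db1, dW2, db2 = loss_and_grads(W1, b1, W2, b2, x, t)

# Finite-difference check of one entry of dF/dW1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
F1 = loss_and_grads(W1p, b1, W2, b2, x, t)[0]
print((F1 - F0) / eps, dW1[0, 0])  # these two numbers should agree closely
```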