Classification (klasifikácia) Feedforward Multi-Layer Perceptron (Dopredná viacvrstvová sieť) 14/11/2016. Perceptron (Frank Rosenblatt, 1957)


IAI: Lecture 09: Feedforward Multi-Layer Perceptron (Dopredná viacvrstvová sieť)
Lubica Benuskova, 14/11/2016
AIMA 3rd ed., Ch. 18.6.4 - 18.7.5

Classification (klasifikácia)

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (classes) a new observation belongs, on the basis of a training set of data containing observations whose category (class) membership is known. Let us now consider the task of classification of points into two categories, i.e. our classifier must find a boundary that separates two classes of objects. We assume the boundary between the classes is not linear, but curved, i.e. nonlinear.

Perceptron (Frank Rosenblatt, 1957)

Perceptron: input/output formulas. a_i ∈ R is the activation of input i, a real number from (0, 1); w_{j,i} ∈ R is the weight of input i, and j is the index of the output unit. The total input of unit j is

    in_j = Σ_{i=0}^{n} w_{j,i} a_i        (a_0 is the bias input)

and the output is the activation function g applied to the total input:

    a_j = g(in_j) = g( Σ_{i=0}^{n} w_{j,i} a_i )

Nonlinear activation function g(in_j)

A continuous, differentiable sigmoid (logistic) function, where l is the slope of the sigmoid:

    a_j = g(in_j) = 1 / (1 + e^(-l·in_j))

We will call such a perceptron a continuous perceptron (as opposed to a binary perceptron).

Perceptron training (learning)

The goal of a learning algorithm is to automatically find the values of the weights that fit any training set of examples. The task can be any nonlinear problem. For a single perceptron initialized with small random weights, upon presentation of each example a new weight array w = (w_1, w_2, ...) is calculated to move the output of the perceptron closer to the desired (target) output.
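
The input/output formulas above can be sketched directly in code. A minimal Python sketch (NumPy assumed; the slope parameter l and the example numbers are illustrative, not values from the lecture):

    import numpy as np

    def continuous_perceptron_output(a, w, l=1.0):
        """Output of a single continuous perceptron with a logistic activation.
        a : array of input activations a_i
        w : array of weights w_{j,i} for output unit j
        l : slope of the sigmoid (assumed value; the lecture calls it l)
        """
        in_j = np.dot(w, a)                      # total input in_j = sum_i w_{j,i} * a_i
        return 1.0 / (1.0 + np.exp(-l * in_j))   # logistic activation g(in_j)

    # Illustrative usage with made-up numbers
    a = np.array([0.2, 0.7, 0.1])
    w = np.array([0.5, -0.3, 0.8])
    print(continuous_perceptron_output(a, w))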

Training set and error function

Let the training set be

    A_train = { (x^1, y^1), (x^2, y^2), ..., (x^p, y^p), ..., (x^P, y^P) }

where x^p is the input array or vector (also called a pattern) and y^p is the target or desired output, being +1 for one class of inputs and 0 for the other class of inputs. The error function is

    E = (1/2) Σ_{p=1}^{P} ( y^p - g(x^p) )²

where g(x^p) is the output of the perceptron for input x^p (the logistic function, with slope l, of the weighted sum of the inputs).

Generalisation to the continuous perceptron

Let the error function be E = (1/2) Σ_{p=1}^{P} (y^p - g(x^p))². We want to adjust the perceptron's weights after each input pattern so as to minimize the error E step by step, in order to reach a (global) minimum of E at the end of training. This algorithm of step-like error minimisation is called gradient descent. (Figure: step-wise descent toward the minimum of E.)

Weights optimisation by gradient descent

Minimisation of the error function E means moving against the gradient of E:

    Δw_i = -η ∂E/∂w_i

The second term, the partial derivative of E with respect to the weight, is the so-called generalised error signal. (Figure: error surface E(w_1, w_2); a step moves from (w_1, w_2) to (w_1 + Δw_1, w_2 + Δw_2).)

Gradient of a function

The gradient of a scalar function is a vector which points in the direction of the greatest rate of increase of the function and whose magnitude is that greatest rate of change. The gradient of an error function E(w) with respect to an independent vector variable w = (w_1, ..., w_n) is defined as a vector whose components are the partial derivatives of E with respect to the weights:

    grad(E) = ∇E = ( ∂E/∂w_1, ..., ∂E/∂w_n )

Gradient descent rule

The weights are updated in the direction of the negative gradient, thus

    Δw_i = -η ∂E/∂w_i

It is guaranteed that this rule always finds the local minimum of E(w) which is nearest to the initial state (defined by the initial weights w_1, w_2, etc.).

Generalised (delta) rule

The weights are updated after each example x as

    Δw_i = η δ g' x_i

where delta is the error signal, i.e. δ = (y - g(x)), and the constant η > 0 is the learning speed. If x_i > 0, g' > 0 and the error signal δ > 0, then the weight w_i is increased. If x_i > 0, g' > 0 and δ < 0, then the weight w_i is decreased.
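
As an illustration of the delta rule, here is a minimal Python sketch of one weight update for a single continuous perceptron (NumPy assumed; the learning speed eta, the slope l and the numbers are illustrative values of mine, not ones prescribed by the lecture):

    import numpy as np

    def sigmoid(z, l=1.0):
        return 1.0 / (1.0 + np.exp(-l * z))

    def delta_rule_update(w, x, y, eta=0.1, l=1.0):
        """One delta-rule update  w_i <- w_i + eta * delta * g' * x_i  for a single pattern."""
        g = sigmoid(np.dot(w, x), l)      # actual output g(x)
        delta = y - g                     # error signal  delta = y - g(x)
        g_deriv = l * g * (1.0 - g)       # derivative of the logistic function
        return w + eta * delta * g_deriv * x

    # Illustrative usage with made-up numbers
    w = np.array([0.1, -0.2, 0.05])
    x = np.array([1.0, 0.5, -1.5])
    print(delta_rule_update(w, x, y=1.0))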

Derivative of the activation function g

After presentation of each input pattern, each weight is updated as

    Δw_i = η δ g' x_i

where g is the sigmoid function g(x) = 1/(1 + e^(-l·x)). From calculus we know that the derivative of the sigmoid function is g' = g(1 - g). Thus

    Δw_i = η (y - g) g(1 - g) x_i

Perceptron training algorithm in pseudo-code

    Start with random initial weights (e.g., in [-0.5, 0.5])
    Do
        For all patterns from the training set
            Calculate activation
            Error = Target_value_for_pattern_p - Activation
            For all input weights i
                DeltaWeight_i = alpha * Error * Input_i * g_deriv
                Weight_i = Weight_i + DeltaWeight_i
    Until "Total error for all patterns < e" or "time-out"

The same algorithm in more detail:

    Start with random initial weights in [-0.5, 0.5] and alpha = 0.5
    do
        epoch++
        CORRECT = 0
        for p = 1 to P                    // loop through all training samples
            in = 0
            for i = 0 to N
                in = in + w_i * x_i
            if (in > 0) out = 1 else out = 0
            if (out == desired) CORRECT++
            else
                for i = 0 to N
                    w_i = w_i + alpha * (desired - out) * x_i * g_deriv
    while (E(w) > e)                      // training stops when the error is minimal

Continuous perceptron

A nonlinear unit with the sigmoid activation function g(in) = 1/(1 + e^(-in)) has good properties (boundedness, monotonicity, differentiability). Gradient descent learning corresponds to minimization of the total error E; the necessary condition for stopping is E(w*) ≤ e. This type of learning happens online and is deterministic.

Perceptron as a nonlinear classifier

The ultimate goal is a complex nonlinear boundary, but a single continuous perceptron can provide only one sigmoid boundary. What needs to be done to be able to find a more complex nonlinear boundary? And what needs to be done to find several nonlinear boundaries?

Feedforward multi-layer perceptron (MLP)

Two (or more) layers of continuous perceptrons connected with feedforward connections (inputs x_1, x_2, ..., x_k).
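
A runnable Python version of the training loop above, under assumptions of my own (a logistic unit rather than a hard threshold, NumPy, and made-up values for alpha, the tolerance e and the data); it is a sketch of the idea, not the lecturer's reference implementation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_perceptron(X, y, alpha=0.5, e=0.01, max_epochs=1000):
        """Online (pattern-by-pattern) delta-rule training of one continuous perceptron.
        X : (P, N) array of input patterns, y : (P,) array of targets in {0, 1}."""
        rng = np.random.default_rng(0)
        w = rng.uniform(-0.5, 0.5, X.shape[1])               # random initial weights
        for epoch in range(max_epochs):
            for x_p, y_p in zip(X, y):
                out = sigmoid(w @ x_p)                        # activation of the unit
                delta = y_p - out                             # error signal
                w += alpha * delta * out * (1 - out) * x_p    # delta rule with g' = g(1-g)
            E = 0.5 * np.sum((y - sigmoid(X @ w)) ** 2)       # total error over all patterns
            if E < e:                                         # stop when the total error is small
                break
        return w

    # Illustrative usage: learn the (linearly separable) logical OR; first input column is a bias
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    y = np.array([0, 1, 1, 1], dtype=float)
    w = train_perceptron(X, y)
    print(np.round(sigmoid(X @ w), 2))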

MLP (Multi-Layer Perceptron)

(Figure: an MLP with an input layer x_1, ..., x_5, a hidden layer of units with activations a_j, and an output layer producing y; weight matrices connect successive layers, indexed over inputs k = 1..K, hidden units j = 1..J and output units i = 1..I.) g(in) is a nonlinear differentiable function (sigmoid, hyperbolic tangent, Gaussian function, etc.).

Training set and error

Let the training set be

    A_train = { (x^1, y^1), (x^2, y^2), ..., (x^p, y^p), ..., (x^P, y^P) }

where x^p is the input vector (also called a pattern) and y^p is the target or desired output value. The goal of training is to minimise the total error

    E = (1/2) Σ_{p=1}^{P} ( y^p - g(x^p) )²

where g(x^p) = a^p is the actual output for the current weight matrix W.

Gradient of a function; weights optimisation by gradient descent

As before, the gradient ∇E = (∂E/∂w_1, ..., ∂E/∂w_n) of the error function E(w) is the vector that points in the direction of the greatest rate of increase of E, and minimisation proceeds by moving against it; the partial derivative of E with respect to a weight is the generalised error signal.

Gradient descent rule

All the weights are updated in the direction of the negative gradient:

    Δw = -η ∂E/∂w

It is guaranteed that this rule always finds the local minimum of E which is nearest to the initial state (defined by the initial weights).

Error-backpropagation algorithm

1. Choose η ∈ (0, 1] and generate random initial weights w(0) ∈ [-0.5, 0.5]. Set E = 0, input pattern counter p = 0, epoch counter k = 0.
2. For pattern p calculate the actual output of the MLP.
3. Calculate the generalised learning signal delta for the output unit(s).
4. Update each weight between the hidden and output unit(s).
5. Calculate the generalised learning signal delta for the hidden units.
6. Update each weight between the input and hidden units.
7. If p < P, go to step 2, else continue.
8. Freeze the weights and calculate the total error E.
9. If E < e, stop. Else set E = 0, p = 0, k = k + 1, and go to step 2.
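
A compact Python sketch of the nine steps above for a one-hidden-layer MLP with sigmoid hidden and output units (NumPy assumed; the network size, eta, the tolerance e and the XOR data are illustrative choices of mine, not values fixed by the lecture; the weight updates follow the step order of the slide):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_mlp(X, Y, n_hidden=3, eta=0.5, e=0.01, max_epochs=5000):
        """Online error-backpropagation for a 1-hidden-layer MLP (steps 1-9 of the slide)."""
        rng = np.random.default_rng(0)
        # Step 1: random initial weights in [-0.5, 0.5]; the last column of each matrix is a bias
        W1 = rng.uniform(-0.5, 0.5, (n_hidden, X.shape[1] + 1))   # input  -> hidden
        W2 = rng.uniform(-0.5, 0.5, (Y.shape[1], n_hidden + 1))   # hidden -> output
        for k in range(max_epochs):
            for x, y in zip(X, Y):                        # steps 2-7: loop over the patterns
                x1 = np.append(x, 1.0)                    # input with bias term
                h = sigmoid(W1 @ x1)                      # hidden activations
                h1 = np.append(h, 1.0)                    # hidden with bias term
                out = sigmoid(W2 @ h1)                    # step 2: actual MLP output
                delta_out = (y - out) * out * (1 - out)   # step 3: delta for output units
                W2 += eta * np.outer(delta_out, h1)       # step 4: hidden -> output update
                delta_hid = (W2[:, :-1].T @ delta_out) * h * (1 - h)  # step 5: delta for hidden units
                W1 += eta * np.outer(delta_hid, x1)       # step 6: input -> hidden update
            E = 0.0                                       # step 8: total error with frozen weights
            for x, y in zip(X, Y):
                h1 = np.append(sigmoid(W1 @ np.append(x, 1.0)), 1.0)
                E += 0.5 * np.sum((y - sigmoid(W2 @ h1)) ** 2)
            if E < e:                                     # step 9: stop when the error is small
                break
        return W1, W2

    # Illustrative usage: learn XOR, which a single perceptron cannot represent
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    Y = np.array([[0], [1], [1], [0]], dtype=float)
    W1, W2 = train_mlp(X, Y)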

Nonlinear uni- or multivariate regression with an MLP

The complexity of the fitted curve depends on the number of hidden units. In this example, the green function is the unknown (target) function which generates the data points, and the data points have some random noise added to them. The fitted function is shown in red for 1, 3 and 9 hidden units.

Simple example of an MLP

The inputs are the x coordinates of points of some unknown function y = f(x). Two hidden units (1 and 2) have the hyperbolic tangent activation function. One output unit (No. 3) sums linearly the outputs of the hidden units minus an adjustable bias. Input: x; desired output: the functional value y.

Note on the input and target output

The task is to approximate the nonlinear function y = f(x). The inputs are the x coordinates of points and the desired output is the value y. The input will be a single number, the value of the x coordinate of the point in the 2D plane, and the target output will be the value of the y coordinate of that point.

(Figures: MLP output after 8 sweeps through all the data points, and after further sweeps.)
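
A small Python sketch of the network described above: one input x, two tanh hidden units, and one output unit that sums the hidden outputs linearly minus a bias. The weight and bias values are made-up placeholders, since the trained values are not given in the text:

    import numpy as np

    def tiny_mlp(x, w_hidden, b_hidden, w_out, b_out):
        """1 input -> 2 tanh hidden units -> 1 linear output, as in the slide's example."""
        h = np.tanh(w_hidden * x + b_hidden)   # activations of hidden units 1 and 2
        return np.dot(w_out, h) - b_out        # output unit 3: linear sum minus adjustable bias

    # Made-up parameters, only to show the shapes involved
    w_hidden = np.array([1.5, -2.0])   # input -> hidden weights
    b_hidden = np.array([0.3, -0.1])   # hidden biases
    w_out    = np.array([0.8, 1.1])    # hidden -> output weights
    b_out    = 0.2                     # output bias

    print(tiny_mlp(0.5, w_hidden, b_hidden, w_out, b_out))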

Approximation of the nonlinear data has been achieved.

MLP representational power

Continuous functions: any bounded continuous function can be approximated with arbitrarily small error by a two-layer feedforward MLP. Sigmoid functions in the hidden layer act as a set of basis functions for composing more complex functions, like sine waves in Fourier analysis. Arbitrary functions: any function can be approximated to arbitrary accuracy by a multiple-layer perceptron. Boolean functions: any Boolean function can be represented by an MLP.

Generalisation (prediction)

A network is said to generalise if it gives correct results for inputs not in the training set, i.e. it predicts new values. Generalisation is tested by using an additional test set: after training, the weights are frozen, and for each test input we evaluate the MLP's prediction of the functional value. Often the training set and the test set are obtained by separating the original data set into two parts.

Example of good and bad generalisation

Two feed-forward MLPs, one with 5 hidden sigmoid neurons and the other with a larger number of hidden sigmoid neurons; the output layer is one linear neuron in both cases. The target function is F(x) = sin(x) on the interval -π/2 ≤ x ≤ π/2, and the training set is {(-π/2 + i·π/8, sin(-π/2 + i·π/8)) : i = 0, ..., 8}.

Training results

Both MLPs give a good approximation to sin(x) in all training set points. (Example taken from Alexandra I. Cristea.)

Overtraining (overfitting, preučenie)

The small MLP generalises well, the big MLP very badly: it memorized the training set but gives wrong results for other inputs, i.e. it overfits. Too many neurons and weights lead to a polynomial of a high degree. Overtraining: a network that performs very well on the training set but very badly on test points is said to be overtrained.
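
A minimal Python sketch of how such a generalisation test can be set up, assuming the reconstructed interval -π/2 ≤ x ≤ π/2 and the nine training points described above; the test points and the trivial placeholder model are my own illustrations:

    import numpy as np

    # Training set: nine points spaced pi/8 apart, with targets sin(x)
    x_train = -np.pi / 2 + np.arange(9) * np.pi / 8
    y_train = np.sin(x_train)

    # A separate test set of points not in the training set, used to measure generalisation
    x_test = np.linspace(-np.pi / 2, np.pi / 2, 50)
    y_test = np.sin(x_test)

    def test_error(predict):
        """Mean squared error of a trained model's predictions on the held-out test points."""
        return np.mean((predict(x_test) - y_test) ** 2)

    # Illustrative usage with a trivial 'model' that always predicts 0
    print(test_error(lambda x: np.zeros_like(x)))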

Early stopping: how to avoid overfitting

(Figure: error during training, with the annotation "Stop the training here!")

Model selection

We do not know how many hidden units to use for an MLP to approximate the given nonlinear function well and obtain good generalisation. In model selection we experimentally evaluate several MLPs with different numbers of hidden units according to how well they perform on test data. K-fold cross-validation: run K experiments, each time setting aside a different 1/K of the data to test on. Leave-one-out: we leave only one example for testing and train on the remaining (N - 1) examples; testing is repeated N times (for a set of N examples).

Pattern classification with an MLP

Circles and crosses are objects belonging to different classes (e.g. cats and dogs). During learning, the weights in the MLP are gradually adjusted to the values of the parameters of the separating boundaries between the classes. The number of output neurons equals the number of classes. In the case of pattern classification, the training set consists of pairs: input and desired output (i.e. class labels). The task is to learn how to correctly classify input vectors: we know which class the object/input falls into, and we provide an error signal based on the desired (target) outputs.

Summary

Supervised learning by error-backpropagation can be used either for nonlinear regression or for pattern (object) classification. In the case of nonlinear regression, the training set consists of real data values, i.e. the set of pairs (x, F(x)), where F(x) is the unknown function and x is the input vector. The task is to approximate F(x) by the output of the MLP, G(x); the error signal is calculated based on the difference between G(x) and F(x) over the training set.

The drawback of error-backpropagation

The quality of the solution depends on the starting values of the weights: error-backpropagation ALWAYS converges to the nearest minimum of the total error in weight space. There are various ways to improve the chances of finding the global minimum, which we are not going to deal with in this course.

Some historical notes

Paul Werbos: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, Ph.D. thesis, Harvard University, 1974.
Rumelhart, Hinton, Williams: Learning internal representations by back-propagating errors, Nature 323, pp. 533-536, 1986.
The Rumelhart Prize is awarded annually to an individual or collaborative team making a significant contemporary contribution to the theoretical foundations of human cognition (US$100,000).
Mathematical proof of the theorem of universal approximation of functions: Hornik 1989 and Kurkova 1989.

Next lecture: practical applications of MLP
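
A minimal Python sketch of K-fold cross-validation as described above (pure NumPy; the fit and score functions and the toy data are placeholders of mine, not part of the lecture):

    import numpy as np

    def k_fold_cross_validation(X, y, K, fit, score):
        """Run K experiments, each time setting aside a different 1/K of the data for testing.
        fit(X_train, y_train) -> model;  score(model, X_test, y_test) -> error.  Returns the mean error."""
        indices = np.arange(len(X))
        folds = np.array_split(indices, K)          # K disjoint test folds
        errors = []
        for test_idx in folds:
            train_idx = np.setdiff1d(indices, test_idx)
            model = fit(X[train_idx], y[train_idx])
            errors.append(score(model, X[test_idx], y[test_idx]))
        return np.mean(errors)

    # Illustrative usage with a trivial model that predicts the mean of the training targets
    X = np.linspace(0, 1, 12).reshape(-1, 1)
    y = np.sin(X).ravel()
    fit = lambda Xt, yt: yt.mean()
    score = lambda m, Xs, ys: np.mean((ys - m) ** 2)
    print(k_fold_cross_validation(X, y, K=4, fit=fit, score=score))

    # Leave-one-out is the special case K = N, where each fold holds a single example:
    # k_fold_cross_validation(X, y, K=len(X), fit=fit, score=score)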