Artificial Neural Networks MLP, Backpropagation

Artificial Neural Networks MLP, Backpropagation. Jan Drchal, drchajan@fel.cvut.cz, Computational Intelligence Group, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague

Outline MultiLayer Perceptron (MLP). How many layers and neurons? How to train ANNs? Backpropagation. Derived and other methods.

Layered ANNs: single layer, two layers, feed-forward networks, recurrent networks.

MultiLayer Perceptron (MLP) [Diagram: inputs x_1, x_2, x_3 feed the input layer, then hidden layers with weights w_ij and w_jk, then the output layer with weights w_kl producing outputs y_1, y_2; hidden and output neurons apply an activation function φ.]

Neurons in MLPs [Diagram of a McCulloch-Pitts perceptron: neuron inputs x_1 ... x_n with weights w_1 ... w_n, threshold Θ, a nonlinear transfer function, and neuron output y.]

Logistic Sigmoid Function: S(s) = 1 / (1 + e^(-γs)), shown for different values of the gain/slope parameter γ. But also many other (non-)linear functions...
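As a quick illustration (not part of the original slides), a minimal NumPy sketch of the logistic sigmoid with a gain parameter; the argument name gamma mirrors the slope parameter γ above.

```python
import numpy as np

def sigmoid(s, gamma=1.0):
    """Logistic sigmoid S(s) = 1 / (1 + exp(-gamma * s))."""
    return 1.0 / (1.0 + np.exp(-gamma * s))

# A larger gamma gives a steeper transition around s = 0.
print(sigmoid(0.0), sigmoid(1.0, gamma=5.0))
```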

How Many Hidden Layers? MLPs with Discrete Activation Functions (see ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hl for an overview). Table of structure vs. types of decision regions: Single-layer: half plane bounded by a hyperplane. Two-layer: convex open or closed regions. Three-layer: arbitrary regions (complexity limited by the number of nodes). The remaining columns of the table illustrate each structure on the exclusive-OR problem, on classes with meshed regions, and on the most general region shapes.

How Many Hidden Layers? Continuous MLPs have the Universal Approximation property. Kurt Hornik: for MLPs using continuous, bounded, and non-constant activation functions, a single hidden layer is enough to approximate any function. Q: what about linear activation? A: it is continuous and non-constant, but not bounded! We will get back to this topic later...

Continuous MLPs. Although one hidden layer is enough for a continuous MLP: we don't know how many neurons to use, and fewer neurons are often sufficient for ANN architectures with two (or more) hidden layers. See ftp://ftp.sas.com/pub/neural/FAQ3.html#A_hl for an example.

How Many Neurons? No one knows :( We have only rough estimates (upper bounds). ANN with a single hidden layer: N_hid = sqrt(N_in · N_out). ANN with two hidden layers: N_hid1 = N_out · (N_in / N_out)^(2/3), N_hid2 = N_out · (N_in / N_out)^(1/3). You have to experiment.
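A tiny helper that evaluates these rough upper-bound estimates as reconstructed above; since the slide's formulas are partly garbled, treat the exact exponents as an assumption (geometric-pyramid style heuristics), not as the lecture's definitive rule.

```python
import math

def hidden_estimate_one_layer(n_in, n_out):
    # N_hid ~ sqrt(N_in * N_out)
    return round(math.sqrt(n_in * n_out))

def hidden_estimate_two_layers(n_in, n_out):
    # Geometric-pyramid style estimates for two hidden layers (assumed form).
    r = (n_in / n_out) ** (1.0 / 3.0)
    return round(n_out * r * r), round(n_out * r)

print(hidden_estimate_one_layer(100, 2))    # ~14
print(hidden_estimate_two_layers(100, 2))   # ~(27, 7)
```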

Generalization vs. Overfitting. When training ANNs we typically want them to perform accurately on new, previously unseen data. This ability is known as generalization. When an ANN rather memorizes the training data while giving bad results on new data, we talk about overfitting (overtraining).

Training/Testing Sets. Random samples: the training set (~70%) is used to train the ANN model; the testing set (~30%) is held out. The testing set error measures generalization.
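A minimal sketch of the random ~70/30 split described above, using NumPy only; the function name and defaults are illustrative.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.3, seed=0):
    """Randomly split a dataset into training (~70%) and testing (~30%) parts."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, y_train, X_test, y_test = train_test_split(X, y)
print(len(X_train), len(X_test))  # 7 3
```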

Overfitting Example [Plot of training error and testing error over training time: here, we should stop training where the testing error starts to grow while the training error keeps decreasing.] http://upload.wikimedia.org/wikipedia/commons/1/1f/Overfitting_svg.svg

Example: Bad Choice of a Training Set

Training/Validation/Testing Sets. Ripley, Pattern Recognition and Neural Networks, 1996: Training set: a set of examples used for learning, that is to fit the parameters [i.e., weights] of the ANN. Validation set: a set of examples used to tune the parameters [i.e., architecture, not weights] of an ANN, for example to choose the number of hidden units. Test set: a set of examples used only to assess the performance (generalization) of a fully-specified ANN. Separated: ~60%, ~20%, ~20%. Note: the meaning of the validation and test sets is often reversed in the literature (machine learning vs. statistics). For example see Priddy, Keller: Artificial Neural Networks: An Introduction (Google Books, p. 44).

Training/Validation/Testing Sets II. Taken from Ricardo Gutierrez-Osuna's slides: http://courses.cs.tamu.edu/rgutier/ceg499_s02/l13.pdf

k-fold Cross-validation. Example 10-fold cross-validation: split the training data into 10 folds of equal size. Repeat 10 times, always choosing a different fold as the validation set (1 of 10), so each run uses 90% as the training set and 10% as the validation set. Create 10 ANN models and choose the best one for the testing data. Suitable for small datasets; reduces the problems caused by random selection of training/testing sets. The cross-validation error is the average over all (10) validation sets.
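A sketch of the 10-fold loop; train_model and validation_error in the usage comment are placeholders for whatever ANN training and evaluation routine is used, not functions from the lecture.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

# Usage sketch: average the validation error over all folds.
# errors = []
# for train_idx, val_idx in k_fold_indices(len(X), k=10):
#     model = train_model(X[train_idx], y[train_idx])                   # placeholder
#     errors.append(validation_error(model, X[val_idx], y[val_idx]))    # placeholder
# cv_error = np.mean(errors)
```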

Cleaning the Dataset. Imputing missing values. Outlier identification: an instance of the data distant from the rest of the data. Smoothing out the noisy data.

Data Reduction. Often needed for large data sets. The reduced dataset should be a representative sample of the original dataset. The simplest algorithm: randomly remove data instances.

Data Transformation. Normalization: scaling/shifting values to fit a given interval. Aggregation: e.g. binning, i.e. discretization (continuous values to classes).
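A minimal min-max normalization sketch for the scaling/shifting step; the target interval [0, 1] is just an example choice.

```python
import numpy as np

def min_max_normalize(X, lo=0.0, hi=1.0):
    """Scale/shift each column of X into the interval [lo, hi]."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return lo + (X - x_min) * (hi - lo) / (x_max - x_min)

X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]])
print(min_max_normalize(X))
```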

Learning Process Notes. We don't have to choose instances sequentially: random selection works. We can apply certain instances more frequently than others. We often need hundreds to thousands of epochs to train the network. A good strategy might speed things up.

Backpropagation (BP). Paul Werbos, 1974, Harvard, PhD thesis. Still a popular method, with many modifications. BP is a learning method for MLPs: it requires continuous, differentiable activation functions!

BP Overview:
random weight initialization
repeat
    repeat // epoch
        choose an instance from the training set
        apply it to the network
        evaluate network outputs
        compare outputs to desired values
        modify the weights
    until all instances selected from the training set
until global error < criterion
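The same loop as a Python skeleton, assuming a hypothetical network object with forward, backward_update and global_error methods (placeholders, not a real library API).

```python
import numpy as np

def train(network, X, d, eta=0.1, max_error=1e-3, max_epochs=10000):
    """Backpropagation training loop following the slide's pseudocode.

    `network` is assumed to provide forward(x), backward_update(x, d, eta)
    and global_error(X, d); these are placeholder methods for this sketch.
    """
    rng = np.random.default_rng(0)
    for epoch in range(max_epochs):
        for i in rng.permutation(len(X)):             # choose instances (here: randomly)
            y = network.forward(X[i])                 # apply instance, evaluate outputs
            network.backward_update(X[i], d[i], eta)  # compare to desired values, modify weights
        if network.global_error(X, d) < max_error:    # stop once the global error is small enough
            break
    return network
```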

ANN Energy. Backpropagation is based on minimization of the ANN energy (= error). Energy is a measure describing how well the network is trained on given data. For BP we define the energy function E_TOTAL = Σ_p E_p, where E_p = 1/2 Σ_{i=1}^{N_o} (d_i - y_i)^2. The total sum is computed over all patterns of the training set; we will omit p in the following slides. Note: the 1/2 is only for convenience, as we will see later...
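The per-pattern energy in code (d and y are NumPy arrays of desired and actual outputs):

```python
import numpy as np

def pattern_energy(d, y):
    """E_p = 1/2 * sum_i (d_i - y_i)^2 for one training pattern."""
    return 0.5 * np.sum((d - y) ** 2)

print(pattern_energy(np.array([1.0, 0.0]), np.array([0.8, 0.3])))  # 0.065
```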

ANN Energy II. The energy is a function of both the inputs and the weights: E = f(X, W), where W are the weights (thresholds), which are variable, and X are the inputs, which are fixed (for a given pattern).

Backpropagation Keynote. For given values at the network inputs we obtain an energy value. Our task is to minimize this value. The minimization is done via modification of weights and thresholds. We have to identify how the energy changes when a certain weight is changed by Δw. This corresponds to the partial derivatives ∂E/∂w. We employ a gradient method.

Gradient Descent in Energy Landscape [Figure: energy/error landscape over the weights, with a descent step from (w_1, w_2) to (w_1 + Δw_1, w_2 + Δw_2).]

Weight Update. We want to update weights in the opposite direction to the gradient: Δw_jk = -η ∂E/∂w_jk, where Δw_jk is the weight delta and η is the learning rate. Note: the gradient of the energy function is a vector which contains the partial derivatives for all weights (thresholds).
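The update rule in code; grad stands for the vector (or matrix) of partial derivatives ∂E/∂w, however it was obtained.

```python
import numpy as np

def weight_update(w, grad, eta=0.1):
    """Gradient descent step: delta_w = -eta * dE/dw."""
    return w + (-eta * grad)

w = np.array([0.5, -0.2])
grad = np.array([0.1, -0.4])
print(weight_update(w, grad))  # [0.49, -0.16]
```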

Notation: w_jk: weight of the connection from neuron j to neuron k; w_0k^m: threshold of neuron k in layer m; w_jk^m: weight of a connection from layer m-1 to layer m; s_k^m: inner potential of neuron k in layer m; y_k^m: output of neuron k in layer m; x_k: k-th input; N_i, N_h, N_o: number of neurons in the input, hidden and output layers. [Diagram: inputs x_1 ... x_N_i, a hidden layer with outputs y_1^h ... y_N_h^h, an output layer with outputs y_1 ... y_N_o, and constant +1 inputs carrying the thresholds w_01, ...]

Energy as a Function Composition. Using the chain rule: ∂E/∂w_jk = (∂E/∂s_k) (∂s_k/∂w_jk). Use s_k = Σ_j w_jk y_j, hence ∂s_k/∂w_jk = y_j. Denote δ_k = ∂E/∂s_k. Then Δw_jk = -η ∂E/∂w_jk = -η δ_k y_j. Remember the delta rule?

Output Layer. For the output layer, δ_k = ∂E/∂s_k = (∂E/∂y_k) (∂y_k/∂s_k). The second factor is the derivative of the activation function: ∂y_k/∂s_k = S'(s_k). For the first factor, use E = 1/2 Σ_{i=1}^{N_o} (d_i - y_i)^2, the dependency of the energy on a network output: ∂E/∂y_k = -(d_k - y_k). That is why we used the 1/2. Together: δ_k = -(d_k - y_k) S'(s_k), and therefore Δw_jk = -η δ_k y_j = η (d_k - y_k) S'(s_k) y_j. Again, remember the delta rule?
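The output-layer quantities in code, assuming the logistic sigmoid so that S'(s_k) = y_k (1 - y_k), as derived a few slides below.

```python
import numpy as np

def output_deltas(d, y):
    """delta_k = -(d_k - y_k) * S'(s_k), with S'(s_k) = y_k * (1 - y_k) for the sigmoid."""
    return -(d - y) * y * (1.0 - y)

def output_weight_deltas(delta_out, y_hidden, eta=0.1):
    """delta_w_jk = -eta * delta_k * y_j for all j, k at once (outer product)."""
    return -eta * np.outer(y_hidden, delta_out)

d = np.array([1.0, 0.0])
y = np.array([0.8, 0.3])
print(output_deltas(d, y))  # [-0.032  0.063]
```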

Hidden Layer. For a hidden layer, again δ_k = ∂E/∂s_k = (∂E/∂y_k) (∂y_k/∂s_k), and as for the output layer ∂y_k/∂s_k = S'(s_k). Note that y_k is now the output of a hidden neuron, so the remaining partial derivative ∂E/∂y_k needs a closer look.

Hidden Layer II. Apply the chain rule (http://en.wikipedia.org/wiki/Chain_rule) over all N_o output neurons fed by the hidden neuron k: ∂E/∂y_k = Σ_{l=1}^{N_o} (∂E/∂s_l) (∂s_l/∂y_k) = Σ_{l=1}^{N_o} (∂E/∂s_l) ∂(Σ_{i=1}^{N_h} w_il y_i)/∂y_k = Σ_{l=1}^{N_o} (∂E/∂s_l) w_kl = Σ_{l=1}^{N_o} δ_l w_kl. But we know δ_l already: take the error of the output neuron l and multiply it by the weight w_kl of the connection from the hidden neuron. Here the back-propagation actually happens... [Diagram: hidden neuron output y_k feeds output neurons 1 ... N_o through weights w_k1, w_k2, ..., w_kN_o.]

Hidden Layer III. Now, let's put it all together! δ_k = ∂E/∂s_k = (∂E/∂y_k) (∂y_k/∂s_k) = (Σ_{l=1}^{N_o} δ_l w_kl) S'(s_k). Hence Δw_jk = -η δ_k x_j = -η (Σ_{l=1}^{N_o} δ_l w_kl) S'(s_k) x_j. The derivative of the activation function is the last thing to deal with!
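The backpropagated hidden-layer delta in code, again assuming sigmoid units; W_out[k, l] holds the weight w_kl from hidden neuron k to output neuron l.

```python
import numpy as np

def hidden_deltas(delta_out, W_out, y_hidden):
    """delta_k = (sum_l delta_l * w_kl) * S'(s_k), with sigmoid S'(s_k) = y_k * (1 - y_k)."""
    return (W_out @ delta_out) * y_hidden * (1.0 - y_hidden)

W_out = np.array([[0.2, -0.5], [0.7, 0.1]])   # shape (n_hidden, n_output)
delta_out = np.array([-0.032, 0.063])
y_hidden = np.array([0.6, 0.4])
print(hidden_deltas(delta_out, W_out, y_hidden))
```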

Sigmoid Derivative. S'(s_k) = (1 / (1 + e^(-s_k)))' = e^(-s_k) / (1 + e^(-s_k))^2 = y_k (1 - y_k). That is why we needed continuous & differentiable activation functions!
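In code, the sigmoid derivative needs only the neuron output, not the inner potential:

```python
def sigmoid_prime_from_output(y):
    """S'(s) expressed via the sigmoid output y = S(s): S'(s) = y * (1 - y)."""
    return y * (1.0 - y)

print(sigmoid_prime_from_output(0.5))   # 0.25, the maximum slope of the sigmoid
```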

How About the General Case? Arbitrary number of hidden layers? [Diagram: inputs x_1 ... x_N_i, several hidden layers, outputs y_1 ... y_N_o.] It's the same: for layer m-1, use the δ_k computed in layer m.

BP Put All Together. Output layer: Δw_jk = η y_k (1 - y_k) (d_k - y_k) y_j, where y_j is equal to x_j when we get to the inputs. Hidden layer m (note that for the topmost hidden layer, m+1 is the output layer): Δw_jk^m = -η δ_k^m y_j^(m-1), with δ_k^m = y_k^m (1 - y_k^m) Σ_{l=1}^{N^(m+1)} δ_l^(m+1) w_kl^(m+1). Weight (threshold) updates: w_jk(t+1) = w_jk(t) + Δw_jk(t).
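Putting the formulas together, a minimal NumPy sketch of one backpropagation step for a network with a single hidden layer and sigmoid units everywhere; thresholds are folded into the weight matrices via a constant +1 input. This is an illustrative reconstruction under those assumptions, not the lecture's reference code.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def bp_step(x, d, W_hid, W_out, eta=0.1):
    """One backpropagation step for an MLP with one hidden layer.

    W_hid: (n_in + 1, n_hid) weights input->hidden, row 0 holds the thresholds.
    W_out: (n_hid + 1, n_out) weights hidden->output, row 0 holds the thresholds.
    """
    # Forward pass (prepend the constant +1 carrying the thresholds).
    x1 = np.concatenate(([1.0], x))
    y_hid = sigmoid(x1 @ W_hid)
    y_hid1 = np.concatenate(([1.0], y_hid))
    y_out = sigmoid(y_hid1 @ W_out)

    # Output layer: delta_k = -(d_k - y_k) * y_k * (1 - y_k).
    delta_out = -(d - y_out) * y_out * (1.0 - y_out)
    # Hidden layer: delta_k = (sum_l delta_l * w_kl) * y_k * (1 - y_k); skip the threshold row.
    delta_hid = (W_out[1:, :] @ delta_out) * y_hid * (1.0 - y_hid)

    # Weight (threshold) updates: w <- w - eta * delta * input.
    W_out -= eta * np.outer(y_hid1, delta_out)
    W_hid -= eta * np.outer(x1, delta_hid)
    return y_out, W_hid, W_out

# Tiny usage example (XOR-sized shapes).
rng = np.random.default_rng(0)
W_hid = rng.normal(scale=0.5, size=(3, 2))   # 2 inputs + threshold -> 2 hidden neurons
W_out = rng.normal(scale=0.5, size=(3, 1))   # 2 hidden + threshold -> 1 output neuron
y, W_hid, W_out = bp_step(np.array([0.0, 1.0]), np.array([1.0]), W_hid, W_out)
print(y)
```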

Potential Problems. High dimension of the weight (threshold) space. Complexity of the energy function. Different shape of the energy function for different input vectors.

Modifications to Standard BP: Momentum. Simple, but greatly helps with avoiding local minima: Δw_ij(t) = -η δ_j(t) y_i(t) + α Δw_ij(t-1), where α is the momentum constant, α ∈ [0, 1).
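The momentum rule in code; prev_delta_w holds Δw(t-1) from the previous step and alpha is the momentum constant.

```python
def momentum_update(grad, prev_delta_w, eta=0.1, alpha=0.9):
    """delta_w(t) = -eta * dE/dw + alpha * delta_w(t-1)."""
    return -eta * grad + alpha * prev_delta_w

# Usage sketch: delta_w = momentum_update(grad, delta_w); w = w + delta_w
```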

Modifications to Standard BP II. Normalized cumulated delta: all changes applied together. Delta-bar-delta rule: using also the previous gradient. Extended: both of the above together. Quickprop, Rprop: second-order methods (approximate the Hessian).

Other Approaches Based on Numerical Optimization. Compute the partial derivatives over the total energy, ∂E_TOTAL/∂w_jk, and use any numerical optimization method, e.g.: conjugate gradients, quasi-Newton methods, Levenberg-Marquardt.
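A hedged sketch of the idea (not the lecture's tooling): flatten all weights into one vector and hand the total energy to a generic optimizer, here SciPy's L-BFGS-B quasi-Newton method; total_energy is a placeholder that would normally rebuild the network from the flat vector and sum E_p over all patterns.

```python
import numpy as np
from scipy.optimize import minimize

def total_energy(w_vec):
    """Placeholder: unpack w_vec into network weights, run all patterns, return E_TOTAL."""
    # A dummy quadratic stands in for the real network energy so the sketch runs.
    return np.sum((w_vec - 1.0) ** 2)

w0 = np.zeros(5)                                      # initial flattened weights
res = minimize(total_energy, w0, method="L-BFGS-B")   # quasi-Newton method
print(res.x)                                          # optimized weight vector
```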

Typical Energy Behaviour During Learning

Typical Learning Runs [Three plots of error E vs. number of iterations:] 1) This is best! Constant and fast decrease of the error. 2) Seems like a local minimum. Change the learning rate, momentum, or ANN architecture and run again (random weight initialization)... 3) Noisy data.

Next Lecture Unsupervised learning, SOM etc.