Fundamentals of Neural Networks

Xiaodong Cui
IBM T. J. Watson Research Center, Yorktown Heights, NY 10598
EECS 6894, Columbia University, Fall 2018

Outline
- Feedforward neural networks
- Forward propagation
- Neural networks as universal approximators
- Back propagation
- Jacobian
- Vanishing gradient problem
- Generalization

Feedforward Neural Networks
[Figure: a feedforward network with an input layer x, hidden layers, and an output layer; the +1 nodes denote bias units, a denotes pre-activations, z denotes activations, h the nonlinearity.]
Forward propagation through layer l:
    a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)} + b_i^{(l)},    z_i^{(l)} = h^{(l)}(a_i^{(l)})
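To make the forward equations concrete, here is a minimal plain-Lua sketch of a single layer's forward pass; the weights, biases, and layer sizes are made up for illustration and are not part of the original slides.

    -- Minimal sketch of one forward-propagation step (illustrative only).
    -- a[i] = sum_j w[i][j] * z_prev[j] + b[i],   z[i] = h(a[i])
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end

    local function forward_layer(w, b, z_prev, h)
      local z = {}
      for i = 1, #w do
        local a = b[i]
        for j = 1, #z_prev do
          a = a + w[i][j] * z_prev[j]
        end
        z[i] = h(a)
      end
      return z
    end

    -- toy example: 2 inputs -> 2 hidden units (hypothetical weights and biases)
    local w = { {0.5, -0.3}, {0.8, 0.1} }
    local b = { 0.0, 0.1 }
    local z = forward_layer(w, b, {1.0, 2.0}, sigmoid)
    print(z[1], z[2])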

Activation Functions: Sigmoid
Also known as the logistic function:
    h(a) = \frac{1}{1 + e^{-a}},    h'(a) = \frac{e^{-a}}{(1 + e^{-a})^2} = h(a)[1 - h(a)]
[Figure: the sigmoid f and its derivative df/dx plotted over [-10, 10].]

Activation Functions: Hyperbolic Tangent (tanh)
    h(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}},    h'(a) = \frac{4}{(e^{a} + e^{-a})^2} = 1 - [h(a)]^2
[Figure: tanh f and its derivative df/dx plotted over [-5, 5].]

Activation Functions: Rectified Linear Unit (ReLU)
    h(a) = \max(0, a) = a if a > 0, and 0 if a <= 0
    h'(a) = 1 if a > 0, and 0 if a <= 0
[Figure: ReLU f and its derivative df/dx plotted over [-3, 3].]

Activation Functions: Softmax
Also known as the normalized exponential function, a generalization of the logistic function to multiple classes:
    h(a_k) = \frac{e^{a_k}}{\sum_j e^{a_j}}
    \frac{\partial h(a_k)}{\partial a_j} = h(a_k) \frac{\partial \log h(a_k)}{\partial a_j} = h(a_k)(\delta_{kj} - h(a_j))
Why "softmax"?
    \max\{x_1, ..., x_n\} \le \log(e^{x_1} + ... + e^{x_n}) \le \max\{x_1, ..., x_n\} + \log n
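A small plain-Lua sketch of these activations (an illustrative aside, not from the slides); the softmax subtracts the maximum activation before exponentiating, which the log-sum-exp bound above justifies and which avoids overflow:

    -- Activation functions from the slides, written as plain Lua (illustrative sketch).
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end
    local function tanh_act(a)
      local ea, ema = math.exp(a), math.exp(-a)
      return (ea - ema) / (ea + ema)
    end
    local function relu(a) return math.max(0, a) end

    -- Numerically stable softmax over a table of activations.
    local function softmax(a)
      local m = a[1]
      for k = 2, #a do m = math.max(m, a[k]) end   -- max{a_1, ..., a_n}
      local exps, s = {}, 0
      for k = 1, #a do exps[k] = math.exp(a[k] - m); s = s + exps[k] end
      for k = 1, #a do exps[k] = exps[k] / s end
      return exps
    end

    local p = softmax({1.0, 2.0, 3.0})
    print(p[1], p[2], p[3])   -- entries sum to 1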

Loss Functions
- Regression
- Classification

Regression: Least Squares Error
Let y_{nk} and \bar{y}_{nk} be the output and target respectively. For regression, the typical loss function is the least squares error (LSE):
    L(y, \bar{y}) = \frac{1}{2} \sum_n \|y_n - \bar{y}_n\|^2 = \frac{1}{2} \sum_n \sum_k (y_{nk} - \bar{y}_{nk})^2
where n is the sample index and k is the dimension index. The derivative of LSE with respect to each dimension of each sample:
    \frac{\partial L(y, \bar{y})}{\partial y_{nk}} = y_{nk} - \bar{y}_{nk}

Classification: Cross-Entropy (CE)
Suppose there are two (discrete) probability distributions p and q; the "distance" between the two distributions can be measured by the Kullback-Leibler divergence:
    D_{KL}(p \| q) = \sum_k p_k \log \frac{p_k}{q_k}
For classification, the targets \bar{y}_n are given:
    \bar{y}_n = \{\bar{y}_{n1}, ..., \bar{y}_{nk}, ..., \bar{y}_{nK}\}
The predictions after the softmax layer are posterior probabilities after normalization:
    y_n = \{y_{n1}, ..., y_{nk}, ..., y_{nK}\}
Now measure the "distance" of these two distributions over all samples:
    \sum_n D_{KL}(\bar{y}_n \| y_n) = \sum_n \sum_k \bar{y}_{nk} \log \frac{\bar{y}_{nk}}{y_{nk}}
                                    = \sum_n \sum_k \bar{y}_{nk} \log \bar{y}_{nk} - \sum_n \sum_k \bar{y}_{nk} \log y_{nk}
                                    = - \sum_n \sum_k \bar{y}_{nk} \log y_{nk} + C

Classification: Cross-Entropy (CE)
Cross-entropy as the loss function:
    L(\bar{y}, y) = - \sum_n \sum_k \bar{y}_{nk} \log y_{nk}
Typically, targets are given in the form of 1-of-K coding:
    \bar{y}_n = \{0, ..., 1, ..., 0\},  with  \bar{y}_{nk} = 1 if k = k_n, and 0 otherwise.
It follows that the cross-entropy has an even simpler form:
    L(\bar{y}, y) = - \sum_n \log y_{n k_n}

Classification: Cross-Entropy (CE)
Derivative of the composite function of cross-entropy and softmax. Write L(\bar{y}, y) = \sum_n E_n with E_n = - \sum_i \bar{y}_{ni} \log y_{ni} and y_{ni} = e^{a_{ni}} / \sum_j e^{a_{nj}}. Then
    \frac{\partial E_n}{\partial a_{nk}} = - \sum_i \bar{y}_{ni} \frac{\partial \log y_{ni}}{\partial a_{nk}}
                                         = - \sum_i \frac{\bar{y}_{ni}}{y_{ni}} \frac{\partial y_{ni}}{\partial a_{nk}}
                                         = - \sum_i \frac{\bar{y}_{ni}}{y_{ni}} y_{ni} (\delta_{ik} - y_{nk})
                                         = - \sum_i \bar{y}_{ni} (\delta_{ik} - y_{nk})
                                         = - \bar{y}_{nk} + y_{nk} \sum_i \bar{y}_{ni}   (using \sum_i \bar{y}_{ni} = 1)
                                         = y_{nk} - \bar{y}_{nk}
What is the derivative for an arbitrary differentiable loss function? Left as an exercise.
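The final result \partial E_n / \partial a_{nk} = y_{nk} - \bar{y}_{nk} is simple to compute; below is a plain-Lua sketch for a single sample with a hypothetical activation vector and 1-of-K target index (values are made up for illustration).

    -- Gradient of softmax + cross-entropy for one sample: dE/da_k = y_k - ybar_k (illustrative sketch).
    local function softmax(a)
      local m = a[1]
      for k = 2, #a do m = math.max(m, a[k]) end
      local e, s = {}, 0
      for k = 1, #a do e[k] = math.exp(a[k] - m); s = s + e[k] end
      for k = 1, #a do e[k] = e[k] / s end
      return e
    end

    local a  = {0.3, -1.2, 0.7}   -- hypothetical pre-softmax activations a_nk
    local kn = 3                  -- 1-of-K target: ybar_nk = 1 only for k = kn
    local y  = softmax(a)
    local grad = {}
    for k = 1, #a do
      grad[k] = y[k] - (k == kn and 1 or 0)
    end
    print(grad[1], grad[2], grad[3])   -- the entries sum to 0, as expected for a softmax layer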

A DNN Example in Torch
Algorithm 1: Definition of a DNN with 3 hidden layers

require 'nn'
require 'cunn'   -- needed for the :cuda() calls below

function create_model()
    -- MODEL
    local n_inputs  = 460
    local n_outputs = 3000
    local n_hidden  = 1024
    local model = nn.Sequential()
    model:add(nn.Linear(n_inputs, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_hidden))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(n_hidden, n_outputs))
    model:add(nn.LogSoftMax())
    -- LOSS FUNCTION
    local criterion = nn.ClassNLLCriterion()
    return model:cuda(), criterion:cuda()
end
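For context, here is a sketch of how such a model and criterion are typically used for one SGD step with the standard Torch7 nn interface (forward, backward, zeroGradParameters, updateParameters). It is shown as a reduced, CPU-only one-hidden-layer variant, and the input data, label, and learning rate are made up for illustration.

    require 'nn'

    -- Reduced CPU variant of the network above, plus one SGD update (illustrative sketch).
    local model = nn.Sequential()
    model:add(nn.Linear(460, 1024))
    model:add(nn.Sigmoid())
    model:add(nn.Linear(1024, 3000))
    model:add(nn.LogSoftMax())
    local criterion = nn.ClassNLLCriterion()

    local input  = torch.randn(460)        -- one hypothetical 460-dimensional input frame
    local target = torch.random(1, 3000)   -- hypothetical class label in {1, ..., 3000}

    model:zeroGradParameters()
    local output  = model:forward(input)                -- log-probabilities from LogSoftMax
    local loss    = criterion:forward(output, target)   -- negative log-likelihood
    local gradOut = criterion:backward(output, target)
    model:backward(input, gradOut)                      -- back propagation through the network
    model:updateParameters(0.01)                        -- SGD step with learning rate 0.01
    print('loss = ' .. loss)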

Neural Networks as Universal Approximators
Universal Approximation Theorem [1][2][3]
Let \phi(\cdot) be a nonconstant, bounded, and monotonically increasing continuous function. Let I_n denote the n-dimensional unit cube [0, 1]^n and C(I_n) the space of continuous functions on I_n. Then, given any function f \in C(I_n) and \epsilon > 0, there exist an integer N and real constants a_i, b_i \in R and w_i \in R^n, i = 1, 2, ..., N, such that
    F(x) = \sum_{i=1}^{N} a_i \phi(w_i^T x + b_i)
is an approximate realization of the function f, that is,
    |F(x) - f(x)| < \epsilon   for all x \in I_n.
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
[3] Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257.

Neural Networks as Universal Classifiers
Extend f(x) to decision functions of the form [1]
    f(x) = j  iff  x \in P_j,   j = 1, 2, ..., K
where P_1, ..., P_K partition A_n into K disjoint measurable subsets and A_n is a compact subset of R^n.
Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity! [1]
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396.

Back Propagation
How to compute the gradient of the loss function L with respect to a weight w_{ij}?
[Figure: units i, j, k in successive layers connected by weights w_{ij} and w_{jk}, feeding into the loss L.]
D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). "Learning representations by back-propagating errors," Nature, vol. 323, 533-536.

Back Propagation
Recall the forward equations:
    a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)},    z_i^{(l)} = h^{(l)}(a_i^{(l)})
Then
    \frac{\partial L}{\partial w_{ij}^{(l)}} = \frac{\partial L}{\partial a_i^{(l)}} \frac{\partial a_i^{(l)}}{\partial w_{ij}^{(l)}}
It follows that \partial a_i^{(l)} / \partial w_{ij}^{(l)} = z_j^{(l-1)}. Define
    \delta_i^{(l)} \equiv \frac{\partial L}{\partial a_i^{(l)}}
The \delta's are often referred to as errors. By the chain rule,
    \delta_i^{(l)} = \frac{\partial L}{\partial a_i^{(l)}} = \sum_k \frac{\partial L}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = \sum_k \delta_k^{(l+1)} \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}}

Back Propagation
Since
    a_k^{(l+1)} = \sum_i w_{ki}^{(l+1)} h(a_i^{(l)})
it follows that
    \frac{\partial a_k^{(l+1)}}{\partial a_i^{(l)}} = w_{ki}^{(l+1)} h'(a_i^{(l)})
and therefore
    \delta_i^{(l)} = h'(a_i^{(l)}) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}
which tells us that the errors can be evaluated recursively.

Back Propagation: An Example
Consider a neural network with multiple layers, a softmax output layer, and a cross-entropy loss function.
Forward propagation: push the input x_n through the network to get the activations of the hidden layers:
    a_i^{(l)} = \sum_j w_{ij}^{(l)} z_j^{(l-1)},    z_i^{(l)} = h^{(l)}(a_i^{(l)})
Back propagation: start from the output layer,
    \delta_k^{(L)} = y_{nk} - \bar{y}_{nk}
back-propagate the errors from the output layer all the way down to the input layer,
    \delta_i^{(l)} = h'(a_i^{(l)}) \sum_k w_{ki}^{(l+1)} \delta_k^{(l+1)}
and evaluate the gradients:
    \frac{\partial L}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} z_j^{(l-1)}
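To make the three steps concrete, here is a self-contained plain-Lua sketch of forward propagation, error back propagation, and gradient evaluation for a one-hidden-layer network with a sigmoid hidden layer and a softmax/cross-entropy output; all sizes and values are made up for illustration.

    -- Forward + backward pass for a tiny 1-hidden-layer network (illustrative sketch).
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end

    local function softmax(a)
      local m = a[1]; for k = 2, #a do m = math.max(m, a[k]) end
      local e, s = {}, 0
      for k = 1, #a do e[k] = math.exp(a[k] - m); s = s + e[k] end
      for k = 1, #a do e[k] = e[k] / s end
      return e
    end

    -- hypothetical 2-3-2 network
    local W1 = { {0.1, -0.2}, {0.4, 0.3}, {-0.5, 0.2} }   -- hidden weights w^(1)_{ij}
    local W2 = { {0.2, -0.1, 0.3}, {-0.3, 0.4, 0.1} }     -- output weights w^(2)_{ki}
    local x, kn = {1.0, -1.0}, 2                          -- input and 1-of-K target index

    -- forward propagation
    local a1, z1 = {}, {}
    for i = 1, #W1 do
      a1[i] = 0
      for j = 1, #x do a1[i] = a1[i] + W1[i][j] * x[j] end
      z1[i] = sigmoid(a1[i])
    end
    local a2 = {}
    for k = 1, #W2 do
      a2[k] = 0
      for i = 1, #z1 do a2[k] = a2[k] + W2[k][i] * z1[i] end
    end
    local y = softmax(a2)

    -- back propagation: output errors delta^(L)_k = y_k - ybar_k
    local d2 = {}
    for k = 1, #y do d2[k] = y[k] - (k == kn and 1 or 0) end

    -- hidden errors delta^(1)_i = h'(a_i) * sum_k w^(2)_{ki} delta^(2)_k, with h'(a) = z(1 - z)
    local d1 = {}
    for i = 1, #z1 do
      local s = 0
      for k = 1, #d2 do s = s + W2[k][i] * d2[k] end
      d1[i] = z1[i] * (1 - z1[i]) * s
    end

    -- gradients dL/dw^(l)_{ij} = delta^(l)_i * z^(l-1)_j
    local gW2, gW1 = {}, {}
    for k = 1, #W2 do
      gW2[k] = {}
      for i = 1, #z1 do gW2[k][i] = d2[k] * z1[i] end
    end
    for i = 1, #W1 do
      gW1[i] = {}
      for j = 1, #x do gW1[i][j] = d1[i] * x[j] end
    end
    print(gW2[1][1], gW1[1][1])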

Gradient Checking
In practice, when you implement the gradient of a network or a module, one way to check the correctness of your implementation is gradient checking via a (symmetric) finite difference:
    \frac{\partial L}{\partial w_{ij}} \approx \frac{L(w_{ij} + \epsilon) - L(w_{ij} - \epsilon)}{2\epsilon}
Then check the "closeness" of the two gradients (your own implementation and the one from the above difference):
    \frac{\|g_1 - g_2\|}{\|g_1\| + \|g_2\|}
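A minimal plain-Lua sketch of the symmetric finite-difference check for a single parameter of an arbitrary scalar loss; the function names and the toy loss are hypothetical and only illustrate the recipe above.

    -- Symmetric finite-difference gradient check for one parameter (illustrative sketch).
    -- loss_fn(w) returns L(w); analytic_grad is your implementation's dL/dw at w.
    local function check_gradient(loss_fn, w, analytic_grad, eps)
      eps = eps or 1e-6
      local numeric = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)
      -- relative "closeness" |g1 - g2| / (|g1| + |g2|)
      local rel = math.abs(analytic_grad - numeric) /
                  (math.abs(analytic_grad) + math.abs(numeric) + 1e-12)
      return numeric, rel
    end

    -- toy example: L(w) = w^2 has dL/dw = 2w
    local w = 1.5
    local numeric, rel = check_gradient(function(v) return v * v end, w, 2 * w)
    print(numeric, rel)   -- rel should be tiny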

Jacobian Matrix
    J_{ki} = \frac{\partial y_k}{\partial x_i}
- measures the sensitivity of the outputs with respect to the inputs of a (sub-)network
- can be used as a modular operation for error backprop in a larger network

Jacobian Matrix
The Jacobian matrix can be computed using a similar back-propagation procedure:
    \frac{\partial y_k}{\partial x_i} = \sum_l \frac{\partial y_k}{\partial a_l} \frac{\partial a_l}{\partial x_i} = \sum_l w_{li} \frac{\partial y_k}{\partial a_l}
where the a_l are the activations of the units having immediate connections with the inputs x.
Analogous to the error propagation, define
    \delta_{kl} = \frac{\partial y_k}{\partial a_l}
which can be computed recursively:
    \delta_{kl} = \frac{\partial y_k}{\partial a_l} = \sum_j \frac{\partial y_k}{\partial a_j} \frac{\partial a_j}{\partial a_l} = \sum_j \delta_{kj} \left[ w_{jl} h'(a_l) \right]
where j sweeps over all the units connected to unit l through the weights w_{jl}.
What is the Jacobian matrix of the softmax outputs with respect to the inputs? Left as an exercise.

Vanishing Gradients
    nonlinearity   function                                   gradient
    sigmoid        x = 1 / (1 + e^{-a})                       \partial L/\partial a = x(1 - x) \partial L/\partial x
    tanh           x = (e^{a} - e^{-a}) / (e^{a} + e^{-a})    \partial L/\partial a = (1 - x^2) \partial L/\partial x
    softsign       x = a / (1 + |a|)                          \partial L/\partial a = (1 - |x|)^2 \partial L/\partial x
What is the problem? In each case the local derivative \partial x/\partial a is at most 1 and approaches 0 for large |a|, so |\partial L/\partial a| <= |\partial L/\partial x|: the nonlinearity causes vanishing gradients.
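The effect is easy to see numerically: back-propagating through L sigmoid layers multiplies the error by L local derivatives h'(a) = x(1 - x) <= 0.25, so the factor shrinks geometrically. A plain-Lua sketch with a hypothetical depth and activation, ignoring the weights:

    -- How the sigmoid's local derivative shrinks the back-propagated error (illustrative sketch).
    local function sigmoid(a) return 1 / (1 + math.exp(-a)) end

    local a, depth, factor = 1.0, 10, 1.0
    for l = 1, depth do
      local x = sigmoid(a)
      factor = factor * x * (1 - x)   -- multiply by h'(a) = x(1 - x) <= 0.25 at each layer
      print(string.format("after %2d layers: gradient factor = %.3e", l, factor))
    end
    -- after 10 layers the factor is below 0.25^10 (roughly 1e-6), ignoring the weights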

Generalization: Statistical Learning Theory
Statistical learning theory refers to the process of inferring general rules by machines from observed samples. It attempts to answer the following questions about learning [1]:
- Which learning tasks can be performed by machines in general?
- What kind of assumptions do we have to make such that machine learning can be successful?
- What are the key properties a learning algorithm needs to satisfy in order to be successful?
- Which performance guarantees can we give on the results of certain learning algorithms?
[1] U. von Luxburg and B. Scholkopf, "Statistical learning theory: models, concepts, and results," arXiv:0810.4752v1, 2008.

Formulation of Supervised Learning
Suppose X is the input space and Y is the output (label) space; learning is to estimate a mapping function f between the two spaces,
    f : X \to Y
More specifically, we choose f from a function space F. In order to estimate f, we have access to a set of training samples
    (x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n) \in X \times Y
which are independently drawn from the underlying joint distribution p(x, y) on X \times Y.
A loss function \ell is defined on X \times Y to measure the goodness of f:
    \ell(x, y, f(x))
On top of that, a risk function R(f) is defined to measure the average loss over the underlying data distribution p(x, y):
    R(f) = E_p[\ell(x, y, f(x))] = \int \ell(x, y, f(x)) \, p(x, y) \, dx \, dy
and we pick the minimizer of the risk within F:
    f_F = argmin_{f \in F} R(f)

Formulation of Supervised Learning
If the function space F includes all functions, then the Bayesian risk is the minimal risk we can ever achieve. Assuming we know the underlying distribution p(x, y), we can compute its conditional distribution p(y|x), from which we can then compute the Bayes classifier:
    f_{Bayes}(x) = argmax_{y \in Y} p(y|x)
In the formulation of supervised learning, we make no assumptions on the underlying distribution p(x, y); it can be any distribution on X \times Y. However, we assume p(x, y) is fixed but unknown to us at the time of learning.
In practice, we deal with the empirical risk and pick its minimizer:
    R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i, f(x_i)),    f_n = argmin_{f \in F} R_{emp}(f)
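As a trivial illustration of the empirical risk with the 0-1 loss, here is a plain-Lua sketch; the samples and the threshold classifier are made up.

    -- Empirical risk R_emp(f) = (1/n) sum_i l(x_i, y_i, f(x_i)) with the 0-1 loss (illustrative sketch).
    local function zero_one_loss(y, prediction) return (y == prediction) and 0 or 1 end

    local samples = { {x = 0.2, y = 1}, {x = 0.7, y = -1}, {x = 0.4, y = 1}, {x = 0.9, y = -1} }
    local function f(x) return (x < 0.5) and 1 or -1 end   -- hypothetical classifier

    local risk = 0
    for _, s in ipairs(samples) do
      risk = risk + zero_one_loss(s.y, f(s.x))
    end
    risk = risk / #samples
    print("empirical risk = " .. risk)   -- 0 on this toy sample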

The Bias-Variance Trade-off
[Figure: the space of all functions F_all, containing the Bayes classifier f_Bayes; the chosen function class F, containing its best element f_F; and the learned classifier f_n.]
    R(f_n) - R(f_Bayes) = [ R(f_n) - R(f_F) ] + [ R(f_F) - R(f_Bayes) ]
The first bracket is the estimation error (variance), which depends on the training samples; the second is the approximation error (bias), which depends only on the choice of F.

Generalization and Consistency
Let (x_i, y_i) be an infinite sequence of training samples drawn independently from some underlying distribution p(x, y), and let \ell be a loss function. For each n, let f_n be a classifier constructed by some learning algorithm on the basis of the first n training samples.
1. The learning algorithm is called consistent with respect to F and p if the risk R(f_n) converges in probability to the risk R(f_F) of the best classifier in F, that is, for all \epsilon > 0,
       P(R(f_n) - R(f_F) > \epsilon) \to 0  as  n \to \infty
2. The learning algorithm is called Bayes-consistent with respect to F and p if the risk R(f_n) converges in probability to the risk R(f_Bayes) of the Bayes classifier, that is, for all \epsilon > 0,
       P(R(f_n) - R(f_Bayes) > \epsilon) \to 0  as  n \to \infty
3. The learning algorithm is called universally consistent with respect to F if it is consistent with respect to F for all distributions p.
4. The learning algorithm is called universally Bayes-consistent if it is Bayes-consistent for all distributions p.

Empirical Risk Minimization
    R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i, f(x_i)),    f_n = argmin_{f \in F} R_{emp}(f)
As n changes, we are actually dealing with a sequence of classifiers {f_n}. We hope that R(f_n) is consistent with respect to R(f_F):
    R(f_n) \to R(f_F)  as  n \to \infty
where R(f_F) is the best risk we can achieve given F.

An Overfitting Example
Suppose the data space is X = [0, 1], the underlying distribution on X is uniform, and the label y for input x is defined as follows [1]:
    y = 1 if x < 0.5,  and  y = -1 if x >= 0.5
Obviously, we have R(f_Bayes) = 0. Suppose we observe a set of samples (x_i, y_i), i = 1, ..., n, and construct the classifier
    f_n(x) = y_i if x = x_i for some i = 1, ..., n,  and  f_n(x) = 1 otherwise
The constructed classifier f_n perfectly classifies all training samples, which minimizes the empirical risk and drives it to 0. Suppose we draw test samples from the underlying distribution and assume they are not identical to the training samples; then the f_n constructed above simply predicts every test sample with label 1, which is wrong on half of the test samples:
    1/2 = R(f_n) \nrightarrow R(f_Bayes) = 0  as  n \to \infty
The classifier f_n is not consistent. Obviously, the classifier does not learn anything from the training samples other than memorizing them.
[1] U. von Luxburg and B. Scholkopf, "Statistical learning theory: models, concepts, and results," arXiv:0810.4752v1, 2008.
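The failure of this memorizing classifier can be simulated directly. In the plain-Lua sketch below (sample sizes are made up), the empirical risk is 0 by construction while the estimated true risk stays near 1/2:

    -- Simulating the memorizing classifier f_n from the slide (illustrative sketch).
    math.randomseed(0)

    local n = 100
    local train = {}
    for i = 1, n do
      local x = math.random()                       -- uniform on [0, 1)
      train[i] = { x = x, y = (x < 0.5) and 1 or -1 }
    end

    -- f_n(x): return the memorized label if x is a training point, otherwise 1
    local function f_n(x)
      for _, s in ipairs(train) do
        if s.x == x then return s.y end
      end
      return 1
    end

    -- empirical risk is 0 by construction; estimate the true risk on fresh test samples
    local errors, n_test = 0, 10000
    for i = 1, n_test do
      local x = math.random()
      local y = (x < 0.5) and 1 or -1
      if f_n(x) ~= y then errors = errors + 1 end
    end
    print("estimated true risk = " .. errors / n_test)   -- close to 0.5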

Uniform Convergence of Empirical Risk
For learning consistency with respect to F:
    R(f_n) - R(f_F) \le |R(f_n) - R_{emp}(f_n)| + |R_{emp}(f_F) - R(f_F)|
The Chernoff-Hoeffding inequality:
    P\left( \left| \frac{1}{n} \sum_{i=1}^{n} \xi_i - E(\xi) \right| \ge \epsilon \right) \le 2 \exp(-2n\epsilon^2)
It follows that
    P(|R_{emp}(f) - R(f)| \ge \epsilon) \le 2 \exp(-2n\epsilon^2)
which shows that for any fixed function and a sufficiently large number of samples, it is highly probable that the training error provides a good estimate of the test error.
Theorem (Vapnik & Chervonenkis). Uniform convergence
    P\left( \sup_{f \in F} |R(f) - R_{emp}(f)| > \epsilon \right) \to 0  as  n \to \infty
for all \epsilon > 0 is a necessary and sufficient condition for consistency of empirical risk minimization with respect to F.

Capacity of Function Spaces
What kind of property should a function space F have to ensure such uniform convergence?
A larger F gives rise to a larger P(\sup_{f \in F} |R(f) - R_{emp}(f)| > \epsilon), which makes it harder to ensure uniform convergence. This leads to the concept of the capacity of the function space F.
Uniform convergence bound:
    P\left( \sup_{f \in F} |R(f) - R_{emp}(f)| > \epsilon \right) \le 2 N(F, 2n) \exp(-n\epsilon^2)
The quantity N(F, n) is referred to as the shattering coefficient of the function class F with respect to sample size n, also known as the growth function. It measures the number of ways the function space can separate the patterns into two classes, i.e., it measures the "size" of a function space by counting the effective number of functions given a sample size n.
Theorem (Vapnik & Chervonenkis).
    \frac{1}{n} \log N(F, n) \to 0
is a necessary and sufficient condition for consistency of empirical risk minimization on F.

Generalization Bounds
Given \delta > 0, with probability at least 1 - \delta, any function f \in F satisfies
    R(f) \le R_{emp}(f) + \sqrt{ \frac{1}{n} \left( \log(2 N(F, n)) - \log \delta \right) }
Or,
    R(f) \le R_{emp}(f) + \sqrt{ \frac{C + \log \frac{1}{\delta}}{n} }
holds with probability at least 1 - \delta, where C is a constant representing the complexity of the function space F.
Generalization bounds are typically written in the following form, for any f \in F:
    R(f) \le R_{emp}(f) + capacity(F) + confidence(\delta)
Other common capacity measures:
- VC dimension
- Rademacher complexity