Fundamentals of Neural Networks
Xiaodong Cui
IBM T. J. Watson Research Center
Yorktown Heights, NY 10598
Fall, 2018
Outline
- Feedforward neural networks
- Forward propagation
- Neural networks as universal approximators
- Back propagation
- Jacobian
- Vanishing gradient problem
- Generalization
EECS 6894, Columbia University
Feedforward Neural Networks
[Figure: a feedforward network with an input layer x, hidden layers, and an output layer; each unit has an activation a, a nonlinearity h, and an output z; +1 nodes denote biases.]
a_j^{(l)} = \sum_i w_{ji}^{(l)} z_i^{(l-1)} + b_j^{(l)},    z_j^{(l)} = h^{(l)}(a_j^{(l)})
Activation Functions
Sigmoid, also known as the logistic function:
h(a) = 1 / (1 + e^{-a})
h'(a) = e^{-a} / (1 + e^{-a})^2 = h(a)[1 - h(a)]
[Plot: f and df/dx for the sigmoid over [-10, 10].]
Activation Functions
Hyperbolic tangent (tanh):
h(a) = (e^a - e^{-a}) / (e^a + e^{-a})
h'(a) = 4 / (e^a + e^{-a})^2 = 1 - [h(a)]^2
[Plot: f and df/dx for tanh over [-5, 5].]
Activation Functions
Rectified Linear Unit (ReLU):
h(a) = max(0, a) = { a if a > 0;  0 if a <= 0 }
h'(a) = { 1 if a > 0;  0 if a <= 0 }
[Plot: f and df/dx for ReLU over [-3, 3].]
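The three activations above and their derivatives can be sketched in a few lines. The code below is illustrative (numpy rather than the Torch used later in these slides); each analytic derivative is checked against a symmetric finite difference, the same check introduced later under gradient checking.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dsigmoid(a):
    h = sigmoid(a)
    return h * (1.0 - h)          # h(a)[1 - h(a)]

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2  # 1 - [h(a)]^2

def relu(a):
    return np.maximum(0.0, a)

def drelu(a):
    return (a > 0).astype(float)

def finite_diff(f, a, eps=1e-6):
    # symmetric finite difference, as used for gradient checking
    return (f(a + eps) - f(a - eps)) / (2 * eps)

a = np.array([-3.0, -1.0, -0.5, 0.5, 1.0, 3.0])  # avoid the ReLU kink at 0
assert np.allclose(dsigmoid(a), finite_diff(sigmoid, a), atol=1e-6)
assert np.allclose(dtanh(a), finite_diff(np.tanh, a), atol=1e-6)
assert np.allclose(drelu(a), finite_diff(relu, a), atol=1e-6)
```

Note that the sigmoid derivative peaks at h'(0) = 0.25, a fact that matters for the vanishing-gradient discussion later.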
Activation Functions
Softmax, also known as the normalized exponential function, a generalization of the logistic function to multiple classes:
h(a_k) = e^{a_k} / \sum_j e^{a_j}
\partial h(a_k) / \partial a_j = h(a_k) (\delta_{kj} - h(a_j)),    \partial \log h(a_k) / \partial a_j = \delta_{kj} - h(a_j)
Why "softmax"?
max{x_1, ..., x_n} <= log(e^{x_1} + ... + e^{x_n}) <= max{x_1, ..., x_n} + log n
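A minimal numpy sketch of the softmax and the log-sum-exp bound above. Subtracting the maximum before exponentiating is a standard numerical-stability trick (it does not change the output, since softmax is shift-invariant); it is an implementation detail, not something the slide specifies.

```python
import numpy as np

def softmax(a):
    # subtract the max for numerical stability; the output is unchanged
    z = np.exp(a - np.max(a))
    return z / z.sum()

x = np.array([1.0, 3.0, -2.0, 3.5])
p = softmax(x)
assert np.isclose(p.sum(), 1.0)          # a proper probability distribution

# log-sum-exp is squeezed between max and max + log n: a "soft" max
lse = np.log(np.exp(x).sum())
assert x.max() <= lse <= x.max() + np.log(len(x))
```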
Loss Functions
- Regression
- Classification
Regression: Least Square Error
Let y_{nk} and \bar{y}_{nk} be the output and target respectively. For regression, the typical loss function is least square error (LSE):
L(y, \bar{y}) = (1/2) \sum_n \sum_k (y_{nk} - \bar{y}_{nk})^2
where n is the sample index and k is the dimension index. The derivative of LSE with respect to each dimension of each sample:
\partial L(y, \bar{y}) / \partial y_{nk} = y_{nk} - \bar{y}_{nk}
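The LSE loss and its per-element derivative can be verified numerically. A short sketch (shapes and values are arbitrary, chosen only for illustration):

```python
import numpy as np

def lse(y, t):
    # L(y, t) = 1/2 * sum over samples n and dimensions k of (y_nk - t_nk)^2
    return 0.5 * np.sum((y - t) ** 2)

rng = np.random.default_rng(0)
y = rng.normal(size=(4, 3))   # 4 samples, 3 output dimensions
t = rng.normal(size=(4, 3))

grad = y - t                  # analytic derivative from the slide

# symmetric finite difference on one element
i, j, eps = 1, 2, 1e-6
yp = y.copy(); yp[i, j] += eps
ym = y.copy(); ym[i, j] -= eps
num = (lse(yp, t) - lse(ym, t)) / (2 * eps)
assert np.isclose(num, grad[i, j], atol=1e-6)
```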
Classification: Cross-Entropy (CE)
Suppose there are two (discrete) probability distributions p and q; the "distance" between the two distributions can be measured by the Kullback-Leibler divergence:
D_KL(p || q) = \sum_k p_k log(p_k / q_k)
For classification, the targets are given:
\bar{y}_n = {\bar{y}_{n1}, ..., \bar{y}_{nk}, ..., \bar{y}_{nK}}
The predictions after the softmax layer are posterior probabilities after normalization:
y_n = {y_{n1}, ..., y_{nk}, ..., y_{nK}}
Now measure the "distance" of these two distributions over all samples:
\sum_n D_KL(\bar{y}_n || y_n) = \sum_n \sum_k \bar{y}_{nk} log(\bar{y}_{nk} / y_{nk}) = \sum_n \sum_k \bar{y}_{nk} log \bar{y}_{nk} - \sum_n \sum_k \bar{y}_{nk} log y_{nk}
= -\sum_n \sum_k \bar{y}_{nk} log y_{nk} + C
Classification: Cross-Entropy (CE)
Cross-entropy as the loss function:
L(\bar{y}, y) = -\sum_n \sum_k \bar{y}_{nk} log y_{nk}
Typically, targets are given in the form of 1-of-K coding:
\bar{y}_n = {0, ..., 1, ..., 0}  where  \bar{y}_{nk} = { 1 if k = k_n;  0 otherwise }
It follows that the cross-entropy has an even simpler form:
L(\bar{y}, y) = -\sum_n log y_{n,k_n}
Classification: Cross-Entropy (CE)
Derivative of the composite function of cross-entropy and softmax, with E_n = -\sum_i \bar{y}_{ni} log y_{ni} and y_{ni} = e^{a_{ni}} / \sum_j e^{a_{nj}}:
\partial E_n / \partial a_{nk} = -\sum_i \bar{y}_{ni} \partial log y_{ni} / \partial a_{nk}
= -\sum_i \bar{y}_{ni} (\delta_{ik} - y_{nk})
= -\bar{y}_{nk} + y_{nk} \sum_i \bar{y}_{ni}
= y_{nk} - \bar{y}_{nk}
using \sum_i \bar{y}_{ni} = 1.
What is the derivative for an arbitrary differentiable loss function? Left as an exercise.
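The elegant result above, \partial E_n / \partial a_{nk} = y_{nk} - \bar{y}_{nk}, can be confirmed numerically. A sketch (numpy, illustrative values) comparing the analytic gradient of cross-entropy composed with softmax against symmetric finite differences:

```python
import numpy as np

def softmax(a):
    z = np.exp(a - a.max())
    return z / z.sum()

def ce(a, t):
    # cross-entropy of softmax(a) against a one-hot target t
    return -np.sum(t * np.log(softmax(a)))

rng = np.random.default_rng(1)
a = rng.normal(size=5)
t = np.zeros(5); t[2] = 1.0        # 1-of-K target

analytic = softmax(a) - t          # y - ybar, as derived above
eps = 1e-6
numeric = np.array([
    (ce(a + eps * np.eye(5)[k], t) - ce(a - eps * np.eye(5)[k], t)) / (2 * eps)
    for k in range(5)
])
assert np.allclose(analytic, numeric, atol=1e-6)
```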
A DNN Example in Torch
Algorithm 1: Definition of a DNN with 3 hidden layers

function create_model()
   -- MODEL
   local n_inputs = 460
   local n_outputs = 3000
   local n_hidden = 1024
   local model = nn.Sequential()
   model:add(nn.Linear(n_inputs, n_hidden))
   model:add(nn.Sigmoid())
   model:add(nn.Linear(n_hidden, n_hidden))
   model:add(nn.Sigmoid())
   model:add(nn.Linear(n_hidden, n_hidden))
   model:add(nn.Sigmoid())
   model:add(nn.Linear(n_hidden, n_outputs))
   model:add(nn.LogSoftMax())
   -- LOSS FUNCTION
   local criterion = nn.ClassNLLCriterion()
   return model:cuda(), criterion:cuda()
end
Neural Networks As Universal Approximators
Universal Approximation Theorem [1][2][3]
Let φ(·) be a nonconstant, bounded, and monotonically increasing continuous function. Let I_n represent the n-dimensional unit cube [0, 1]^n and C(I_n) be the space of continuous functions on I_n. Then given any function f ∈ C(I_n) and ε > 0, there exist an integer N and real constants a_i, b_i ∈ R and w_i ∈ R^n, where i = 1, 2, ..., N, such that
F(x) = \sum_{i=1}^N a_i φ(w_i^T x + b_i)
is an approximate realization of the function f; that is,
|F(x) - f(x)| < ε for all x ∈ I_n.
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
[3] Hornik, K. (1991). "Approximation Capabilities of Multilayer Feedforward Networks," Neural Networks, 4(2), 251-257.
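A small numerical illustration of the theorem (not part of the slides, and not a proof): fix random w_i and b_i for N sigmoid units, and fit only the output weights a_i by least squares to a smooth target on [0, 1]. With enough units, F matches f closely on the sampled grid. The target sin(2πx) and all constants are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

f = lambda x: np.sin(2 * np.pi * x)     # target function (arbitrary choice)
x = np.linspace(0.0, 1.0, 50)

N = 200                                 # number of hidden units
w = rng.uniform(-20, 20, size=N)        # random inner weights, held fixed
b = rng.uniform(-20, 20, size=N)
Phi = sigmoid(np.outer(x, w) + b)       # 50 x 200 feature matrix phi(w_i x + b_i)

# fit the output weights a_i by least squares: F(x) = Phi @ alpha
alpha, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
err = np.max(np.abs(Phi @ alpha - f(x)))
assert err < 0.05                       # close fit on the sampled grid
```

This mirrors the form F(x) = Σ a_i φ(w_i^T x + b_i) in the theorem; a real network would also train the inner weights w_i, b_i.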
Neural Networks As Universal Classifiers
Extend f(x) to decision functions of the form [1]:
f(x) = j  iff  x ∈ P_j,  j = 1, 2, ..., K
where the P_j partition A_n into K disjoint measurable subsets and A_n is a compact subset of R^n.
Arbitrary decision regions can be arbitrarily well approximated by continuous feedforward neural networks with only a single internal, hidden layer and any continuous sigmoidal nonlinearity!
[1] Cybenko, G. (1989). "Approximation by Superpositions of a Sigmoidal Function," Math. Control Signals Systems, 2, 303-314.
[2] Huang, W. Y. and Lippmann, R. P. (1988). "Neural Nets and Traditional Classifiers," in Neural Information Processing Systems (Denver 1987), D. Z. Anderson, Editor. American Institute of Physics, New York, 387-396.
Back Propagation
How to compute the gradient of the loss function L with respect to a weight w_{ji}?
[Figure: units i, j, k in successive layers connected by weights w_{ji} and w_{kj}, with the loss L at the top.]
D. E. Rumelhart, G. E. Hinton and R. J. Williams (1986). "Learning representations by back-propagating errors," Nature, vol. 323, 533-536.
Back Propagation
a_j^{(l)} = \sum_i w_{ji}^{(l)} z_i^{(l-1)},    z_j^{(l)} = h^{(l)}(a_j^{(l)})
By the chain rule,
\partial L / \partial w_{ji}^{(l)} = (\partial L / \partial a_j^{(l)}) (\partial a_j^{(l)} / \partial w_{ji}^{(l)})
It follows that \partial a_j^{(l)} / \partial w_{ji}^{(l)} = z_i^{(l-1)}. Define
δ_j^{(l)} ≡ \partial L / \partial a_j^{(l)}
where the δ's are often referred to as errors. By the chain rule,
δ_j^{(l)} = \partial L / \partial a_j^{(l)} = \sum_k (\partial L / \partial a_k^{(l+1)}) (\partial a_k^{(l+1)} / \partial a_j^{(l)}) = \sum_k δ_k^{(l+1)} (\partial a_k^{(l+1)} / \partial a_j^{(l)})
Back Propagation
a_k^{(l+1)} = \sum_j w_{kj}^{(l+1)} h(a_j^{(l)}),  therefore  \partial a_k^{(l+1)} / \partial a_j^{(l)} = w_{kj}^{(l+1)} h'(a_j^{(l)})
It follows that
δ_j^{(l)} = h'(a_j^{(l)}) \sum_k w_{kj}^{(l+1)} δ_k^{(l+1)}
which tells us that the errors can be evaluated recursively.
Back Propagation: An Example
A neural network with multiple layers, a softmax output layer, and a cross-entropy loss function.
- Forward propagation: push the input x_n through the network to get the activations of the hidden layers:
a_j^{(l)} = \sum_i w_{ji}^{(l)} z_i^{(l-1)},    z_j^{(l)} = h^{(l)}(a_j^{(l)})
- Back propagation: start from the output layer,
δ_k^{(L)} = y_{nk} - \bar{y}_{nk}
back-propagate the errors from the output layer all the way down to the input layer,
δ_j^{(l)} = h'(a_j^{(l)}) \sum_k w_{kj}^{(l+1)} δ_k^{(l+1)}
and evaluate the gradients:
\partial L / \partial w_{ji}^{(l)} = δ_j^{(l)} z_i^{(l-1)}
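The recipe above can be sketched end to end for a tiny two-layer net (sigmoid hidden layer, softmax output, cross-entropy loss). This is an illustrative numpy sketch, not the Torch model from the earlier slide; sizes and values are arbitrary, and the analytic gradients are verified by the symmetric finite difference from the gradient-checking slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    z = np.exp(a - a.max())
    return z / z.sum()

D, H, K = 4, 5, 3                          # input, hidden, output sizes
W1 = rng.normal(scale=0.5, size=(H, D)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.5, size=(K, H)); b2 = np.zeros(K)
x = rng.normal(size=D)
t = np.zeros(K); t[1] = 1.0                # one-hot target

def loss(W1_, W2_):
    z1 = sigmoid(W1_ @ x + b1)
    y = softmax(W2_ @ z1 + b2)
    return -np.sum(t * np.log(y))

# forward propagation
a1 = W1 @ x + b1
z1 = sigmoid(a1)
y = softmax(W2 @ z1 + b2)

# back propagation: output-layer error is y - t
d2 = y - t
gW2 = np.outer(d2, z1)                     # dL/dW2[k,j] = delta_k * z_j
d1 = (z1 * (1 - z1)) * (W2.T @ d2)         # delta_j = h'(a_j) sum_k w_kj delta_k
gW1 = np.outer(d1, x)                      # dL/dW1[j,i] = delta_j * x_i

# gradient check on one weight in each layer
eps = 1e-6
W1p = W1.copy(); W1p[2, 1] += eps
W1m = W1.copy(); W1m[2, 1] -= eps
num1 = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
assert np.isclose(num1, gW1[2, 1], atol=1e-6)

W2p = W2.copy(); W2p[0, 3] += eps
W2m = W2.copy(); W2m[0, 3] -= eps
num2 = (loss(W1, W2p) - loss(W1, W2m)) / (2 * eps)
assert np.isclose(num2, gW2[0, 3], atol=1e-6)
```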
Gradient Checking
In practice, when you implement the gradient of a network or a module, one way to check the correctness of your implementation is via the following gradient check by a (symmetric) finite difference:
\partial L / \partial w_{ji} ≈ [L(w_{ji} + ε) - L(w_{ji} - ε)] / (2ε)
Then check the "closeness" of the two gradients (your own implementation and the one from the above difference):
|g_1 - g_2| / (|g_1| + |g_2|)
Jacobian Matrix
J_{ki} = \partial y_k / \partial x_i
- measures the sensitivity of the outputs with respect to the inputs of a (sub-)network
- can be used as a modular operation for error backprop in a larger network
Jacobian Matrix
The Jacobian matrix can be computed using a similar back-propagation procedure:
\partial y_k / \partial x_i = \sum_l (\partial y_k / \partial a_l) (\partial a_l / \partial x_i) = \sum_l w_{li} (\partial y_k / \partial a_l)
where the a_l are the activations having immediate connections with the inputs x.
Analogous to the error propagation, define
δ_{kl} = \partial y_k / \partial a_l
and it can be computed recursively:
δ_{kl} = \sum_j (\partial y_k / \partial a_j) (\partial a_j / \partial a_l) = \sum_j δ_{kj} [w_{jl} h'(a_l)]
where j sweeps over all the units with connections w_{jl} to unit l.
What is the Jacobian matrix of the softmax outputs with respect to the inputs? Left as an exercise.
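As a concrete check of a Jacobian computation (this uses the softmax derivative quoted on the earlier activation slide, so it is only a numerical hint toward the exercise, not a full derivation): build the analytic Jacobian J[k, j] = y_k (δ_{kj} - y_j) and compare it column by column against finite differences.

```python
import numpy as np

def softmax(a):
    z = np.exp(a - a.max())
    return z / z.sum()

a = np.array([0.5, -1.0, 2.0, 0.0])   # arbitrary input
y = softmax(a)

# analytic Jacobian: J[k, j] = y_k * (delta_kj - y_j)
J = np.diag(y) - np.outer(y, y)

# numerical Jacobian, one input dimension at a time
eps = 1e-6
J_num = np.zeros((4, 4))
for j in range(4):
    e = np.zeros(4); e[j] = eps
    J_num[:, j] = (softmax(a + e) - softmax(a - e)) / (2 * eps)

assert np.allclose(J, J_num, atol=1e-8)
```

Each column of J sums to zero, reflecting that the softmax outputs always sum to one.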
Vanishing Gradients
function | nonlinearity | gradient
sigmoid  | x = 1/(1+e^{-a})              | \partial L/\partial a = x(1-x) \partial L/\partial x
tanh     | x = (e^a - e^{-a})/(e^a + e^{-a}) | \partial L/\partial a = (1-x^2) \partial L/\partial x
softsign | x = a/(1+|a|)                 | \partial L/\partial a = (1-|x|)^2 \partial L/\partial x
What is the problem??? Since |x| <= 1, in each case |\partial L/\partial a| <= |\partial L/\partial x|: the nonlinearity shrinks the gradient at every layer it passes through. The nonlinearity causes vanishing gradients.
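The shrinkage compounds with depth. An illustrative sketch (a scalar chain of sigmoid layers with unit weights, not a trained network): each layer multiplies the back-propagated gradient by h'(a) <= 0.25, so the gradient decays at least geometrically in the number of layers.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = 0.3
grads = []
for depth in (1, 5, 10, 20):
    a, g = x, 1.0
    for _ in range(depth):
        h = sigmoid(a)
        g *= h * (1.0 - h)   # chain rule through one sigmoid: factor <= 0.25
        a = h
    grads.append(g)

# the deeper the chain, the smaller the gradient reaching the input
assert all(g2 < g1 for g1, g2 in zip(grads, grads[1:]))
assert grads[-1] < 0.25 ** 20
```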
Generalization: Statistical Learning Theory
Statistical learning theory refers to the process of inferring general rules by machines from observed samples. It attempts to answer the following questions in terms of learning:
- Which learning tasks can be performed by machines in general?
- What kind of assumptions do we have to make such that machine learning can be successful?
- What are the key properties a learning algorithm needs to satisfy in order to be successful?
- Which performance guarantees can we give on the results of certain learning algorithms?
[1] U. v. Luxburg and B. Scholkopf, "Statistical learning theory: models, concepts, and results," arXiv:0810.4752v1, 2008.
Formulation of Supervised Learning
Suppose X is the input space and Y is the output (label) space; learning is to estimate a mapping function f between the two spaces:
f : X → Y
More specifically, we choose f from a function space F. In order to estimate f, we have access to a set of training samples
(x_1, y_1), ..., (x_i, y_i), ..., (x_n, y_n) ∈ X × Y
which are independently drawn from the underlying joint distribution p(x, y) on X × Y. A loss function l is defined on X × Y to measure the goodness of f:
l(x, y, f(x))
On top of that, a risk function R(f) is defined to measure the average loss over the underlying data distribution p(x, y):
R(f) = E_p[l(x, y, f(x))] = ∫ p(x, y) l(x, y, f(x)) dx dy
and we pick the minimizer of the risk:
f_F = argmin_{f ∈ F} R(f)
Formulation of Supervised Learning
If the function space F includes all functions, then the Bayes risk is the minimal risk we can ever achieve. Assuming we know the underlying distribution p(x, y), we can compute its conditional distribution p(y|x), from which we can then compute the Bayes classifier:
f_Bayes(x) = argmax_{y ∈ Y} p(y|x)
In the formulation of supervised learning, we make no assumptions on the underlying distribution p(x, y). It can be any distribution on X × Y. However, we assume p(x, y) is fixed but unknown to us at the time of learning. In practice, we deal with the empirical risk and pick its minimizer:
R_emp(f) = (1/n) \sum_{i=1}^n l(x_i, y_i, f(x_i))
f_n = argmin_{f ∈ F} R_emp(f)
The Bias-Variance Trade-off
[Figure: the space of all functions F_all containing f_Bayes, the chosen function space F containing f_F, and the learned classifier f_n.]
R(f_n) - R(f_Bayes) = [R(f_n) - R(f_F)] + [R(f_F) - R(f_Bayes)]
The first term is the estimation error (variance) from learning on finite data; the second is the approximation error (bias) from restricting ourselves to F.
Generalization and Consistency
Let (x_i, y_i) be an infinite sequence of training samples which have been drawn independently from some underlying distribution p(x, y). Let l be a loss function. For each n, let f_n be a classifier constructed by some learning algorithm on the basis of the first n training samples.
1. The learning algorithm is called consistent with respect to F and p if the risk R(f_n) converges in probability to the risk R(f_F) of the best classifier in F; that is, for all ε > 0,
P(R(f_n) - R(f_F) > ε) → 0 as n → ∞
2. The learning algorithm is called Bayes-consistent with respect to F and p if the risk R(f_n) converges in probability to the risk R(f_Bayes) of the Bayes classifier; that is, for all ε > 0,
P(R(f_n) - R(f_Bayes) > ε) → 0 as n → ∞
3. The learning algorithm is called universally consistent with respect to F if it is consistent with respect to F for all distributions p.
4. The learning algorithm is called universally Bayes-consistent if it is Bayes-consistent for all distributions p.
Empirical Risk Minimization
R_emp(f) = (1/n) \sum_{i=1}^n l(x_i, y_i, f(x_i)),    f_n = argmin_{f ∈ F} R_emp(f)
As n changes, we are actually dealing with a sequence of classifiers {f_n}. We hope that R(f_n) is consistent with respect to R(f_F):
R(f_n) → R(f_F) as n → ∞
where R(f_F) is the best risk we can achieve given F.
An Overfitting Example
Suppose the data space is X = [0, 1], the underlying distribution on X is uniform, and the label y for input x is defined as follows [1]:
y = { 1 if x < 0.5;  -1 if x >= 0.5 }
Obviously, we have R(f_Bayes) = 0. Suppose we observe a set of samples (x_i, y_i), i = 1, ..., n, and construct the classifier below:
f_n(x) = { y_i if x = x_i for some i = 1, ..., n;  1 otherwise }
The constructed classifier f_n perfectly classifies all training samples, which minimizes the empirical risk and drives it to 0. Suppose we draw test samples from the underlying distribution and assume they are not identical to the training samples; then the f_n constructed above simply predicts every sample with label 1, which is wrong on half of the test samples:
R(f_n) = 1/2 ↛ R(f_Bayes) = 0 as n → ∞
The classifier f_n is not consistent. Obviously the classifier does not learn anything from the training samples other than memorizing them.
[1] U. v. Luxburg and B. Scholkopf, "Statistical learning theory: models, concepts, and results," arXiv:0810.4752v1, 2008.
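The example above is easy to simulate. A sketch (sample sizes are arbitrary): a classifier that memorizes the training points and predicts +1 everywhere else has zero empirical risk but roughly 50% test error, since fresh uniform samples almost surely miss the memorized points.

```python
import numpy as np

rng = np.random.default_rng(0)

def label(x):
    return np.where(x < 0.5, 1, -1)

x_train = rng.uniform(size=100)
y_train = label(x_train)
memory = dict(zip(x_train.tolist(), y_train.tolist()))

def f_n(x):
    # predict the memorized label if x was a training point, else +1
    return np.array([memory.get(v, 1) for v in x.tolist()])

# empirical risk on the training set is 0 ...
assert np.mean(f_n(x_train) != y_train) == 0.0

# ... but fresh test samples (almost surely distinct from training points)
# all get label +1, which is wrong on about half of them
x_test = rng.uniform(size=100000)
test_err = np.mean(f_n(x_test) != label(x_test))
assert abs(test_err - 0.5) < 0.01
```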
Uniform Convergence of Empirical Risk
For learning consistency with respect to F:
R(f_n) - R(f_F) <= [R(f_n) - R_emp(f_n)] + [R_emp(f_F) - R(f_F)]
The Chernoff-Hoeffding inequality:
P(|(1/n) \sum_{i=1}^n ξ_i - E(ξ)| >= ε) <= 2 exp(-2nε^2)
It follows that
P(|R_emp(f) - R(f)| >= ε) <= 2 exp(-2nε^2)
which shows that for any fixed function and a sufficiently large number of samples, it is highly probable that the training error provides a good estimate of the test error.
Theorem (Vapnik & Chervonenkis). Uniform convergence
P(sup_{f ∈ F} |R(f) - R_emp(f)| > ε) → 0 as n → ∞
for all ε > 0 is a necessary and sufficient condition for consistency of empirical risk minimization with respect to F.
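The Hoeffding bound can be sanity-checked by simulation. An illustrative sketch for 0/1 losses (Bernoulli(0.5), an arbitrary choice): the observed frequency of large deviations of the empirical mean stays below 2 exp(-2nε²).

```python
import numpy as np

rng = np.random.default_rng(0)

n, trials, eps = 100, 20000, 0.1
xi = rng.integers(0, 2, size=(trials, n))    # xi in {0, 1}, so E(xi) = 0.5
dev = np.abs(xi.mean(axis=1) - 0.5) >= eps   # large-deviation events

bound = 2 * np.exp(-2 * n * eps ** 2)        # Hoeffding bound for this n, eps
assert dev.mean() <= bound
```

The bound is loose here (the observed frequency is far below it), which is typical: Hoeffding holds for any bounded distribution, so it cannot be tight for every one.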
Capacity of Function Spaces
What kind of property should a function space F have to ensure such uniform convergence? A larger F gives rise to a larger P(sup_{f ∈ F} |R(f) - R_emp(f)| > ε), which makes it harder to ensure uniform convergence. This leads to the concept of the capacity of the function space F.
Uniform convergence bound:
P(sup_{f ∈ F} |R(f) - R_emp(f)| > ε) <= 2 N(F, 2n) exp(-nε^2)
The quantity N(F, n) is referred to as the shattering coefficient of the function class F with respect to sample size n, which is also known as the growth function. It measures the number of ways the function space can separate the patterns into two classes; that is, it measures the "size" of a function space by counting the effective number of functions given a sample size n.
Theorem (Vapnik & Chervonenkis).
(1/n) log N(F, n) → 0
is a necessary and sufficient condition for consistency of empirical risk minimization on F.
Generalization Bounds
Given δ > 0, with probability at least 1 - δ, any function f ∈ F satisfies
R(f) <= R_emp(f) + sqrt( [log(2N(F, n)) - log δ] / n )
Or
R(f) <= R_emp(f) + sqrt( (C + log(1/δ)) / n )
holds with probability at least 1 - δ, where C is a constant representing the complexity of the function space F.
Generalization bounds are typically written in the following form for any f ∈ F:
R(f) <= R_emp(f) + capacity(F) + confidence(δ)
Other common capacity measures:
- VC dimension
- Rademacher complexity