Chapter 7. Neural Networks


Wei Pan
Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, MN
weip@biostat.umn.edu
PubH 7475/8475
© Wei Pan

Introduction
Chapter 11. Only focus on feedforward NNs.
Related to projection pursuit regression: $f(x) = \sum_{m=1}^{M} g_m(w_m^T x)$, where each $w_m$ is a vector of weights and $g_m$ is a smooth nonparametric function; both are to be estimated.
Really? Here: + CNN; but no recurrent NNs (for sequence data), autoencoders (unsupervised), ...
Goodfellow, Bengio, Courville (2016). Deep Learning.
Two high waves, in the 1960s and the late 1980s-90s.
McCulloch & Pitts model (1943): $n_j(t) = I\left(\sum_{i \to j} w_{ij} n_i(t-1) > \theta_j\right)$. $w_{ij}$ can be $> 0$ (excitatory) or $< 0$ (inhibitory).

A biological neuron vs an artificial neuron (perceptron)
Google: images, "biological neural network tutorial".
Minsky & Papert's (1969) XOR problem: $\mathrm{XOR}(X_1, X_2) = 1$ if $X_1 \ne X_2$; $= 0$ otherwise; $X_1, X_2 \in \{0, 1\}$.
Perceptron: $f = I(\alpha_0 + \alpha^T X > 0)$.
Feldman's (1985) "one hundred step program": at most 100 steps within a human reaction time, because a human can recognize another person in 100 ms, while the processing time of a neuron is 1 ms. $\Rightarrow$ The human brain works in a massively parallel and distributed way.
Cognitive science: human vision is performed in a series of layers in the brain.
Humans can learn. Hebb (1949) model: $w_{ij} \leftarrow w_{ij} + \eta y_i y_j$, reinforcing learning by simultaneous activations.
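The XOR limitation can be checked numerically. The sketch below is an illustration only, not part of the original slides: it searches a coarse, arbitrarily chosen grid of weights for a single threshold unit $f = I(\alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 > 0)$. The function names and the grid are assumptions. A grid search is not a proof; the actual argument is algebraic: XOR requires $\alpha_0 \le 0$, $\alpha_0 + \alpha_1 > 0$, $\alpha_0 + \alpha_2 > 0$ and $\alpha_0 + \alpha_1 + \alpha_2 \le 0$, and adding the two strict inequalities contradicts the other two.

```python
from itertools import product

# The four binary inputs and two target functions.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]


def perceptron(a0, a1, a2, x1, x2):
    """Single threshold unit: f = I(a0 + a1*x1 + a2*x2 > 0)."""
    return int(a0 + a1 * x1 + a2 * x2 > 0)


def find_weights(targets, grid):
    """Return the first (a0, a1, a2) on the grid reproducing targets, else None."""
    for a0, a1, a2 in product(grid, repeat=3):
        if [perceptron(a0, a1, a2, *x) for x in INPUTS] == targets:
            return (a0, a1, a2)
    return None


# Hypothetical search grid: -4.0, -3.5, ..., 4.0 (an arbitrary choice).
grid = [i / 2 for i in range(-8, 9)]
and_weights = find_weights(AND, grid)  # some weight triple exists for AND
xor_weights = find_weights(XOR, grid)  # None: no single unit computes XOR
```

AND is linearly separable, so the search succeeds; for XOR it returns `None`, which is consistent with the need for a hidden layer.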

Feed-forward NNs
Fig 11.2.
Input: $X$.
A (hidden) layer: for $m = 1, \dots, M$, $Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$, $Z = (Z_1, \dots, Z_M)^T$.
Activation function: $\sigma(v) = 1/(1 + \exp(-v))$, sigmoid (or $\mathrm{logit}^{-1}$); hyperbolic tangent: $\tanh(v) = 2\sigma(2v) - 1$.
... (may have multiple (hidden) layers) ...
Output: $f_1(X), \dots, f_K(X)$. $T_k = \beta_{0k} + \beta_k^T Z$, $T = (T_1, \dots, T_K)^T$, $f_k(X) = g_k(T)$.
Regression: $g_k(T) = T_k$.
Classification: $g_k(T) = \exp(T_k) / \sum_{j=1}^{K} \exp(T_j)$; softmax or multi-$\mathrm{logit}^{-1}$ function.
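As a concrete reading of these formulas, here is a minimal sketch of one forward pass through a single-hidden-layer network in plain Python. All weights are made-up values for illustration ($p = 2$ inputs, $M = 3$ hidden units, $K = 2$ outputs), and the function names are assumptions rather than anything from the course code.

```python
import math


def sigmoid(v):
    """sigma(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


def softmax(t):
    """g_k(T) = exp(T_k) / sum_j exp(T_j), shifted by max(T) for numerical stability."""
    m = max(t)
    e = [math.exp(tk - m) for tk in t]
    s = sum(e)
    return [ek / s for ek in e]


def forward(x, alpha0, alpha, beta0, beta, classify=True):
    """Z_m = sigma(alpha_0m + alpha_m^T x); T_k = beta_0k + beta_k^T Z; f = g(T)."""
    z = [sigmoid(a0 + sum(a * xl for a, xl in zip(am, x)))
         for a0, am in zip(alpha0, alpha)]
    t = [b0 + sum(b * zm for b, zm in zip(bk, z))
         for b0, bk in zip(beta0, beta)]
    return softmax(t) if classify else t


# Hypothetical weights, chosen only for illustration.
alpha0 = [0.1, -0.2, 0.0]
alpha = [[0.5, -0.3], [0.8, 0.2], [-0.6, 0.4]]
beta0 = [0.0, 0.1]
beta = [[1.0, -1.0, 0.5], [-0.5, 0.3, 0.8]]

probs = forward([1.0, 2.0], alpha0, alpha, beta0, beta)                     # class probabilities
outputs = forward([1.0, 2.0], alpha0, alpha, beta0, beta, classify=False)   # regression outputs T
```

With `classify=True` the outputs are positive and sum to one; with `classify=False` the identity output $g_k(T) = T_k$ is used.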

Figure (Elements of Statistical Learning, 2nd Ed., © Hastie, Tibshirani & Friedman 2009, Chap. 11): Schematic of a single hidden layer, feed-forward neural network, with inputs $X_1, \dots, X_p$, hidden units $Z_1, \dots, Z_M$ and outputs $Y_1, \dots, Y_K$.

Figure: Plot of the sigmoid function $\sigma(v) = 1/(1 + \exp(-v))$ (red curve), commonly used in the hidden layer of a neural network. Included are $\sigma(sv)$ for $s = 1/2$ (blue curve) and $s = 10$ (purple curve). The scale parameter $s$ controls the activation rate, and we can see that large $s$ amounts to a hard activation at $v = 0$. Note that $\sigma(s(v - v_0))$ shifts the activation threshold from $0$ to $v_0$.

How to fit the model?
Given training data: $(Y_i, X_i)$, $i = 1, \dots, n$.
For regression, minimize $R(\theta) = \sum_{k=1}^{K} \sum_{i=1}^{n} (Y_{ik} - f_k(X_i))^2$.
For classification, minimize $R(\theta) = -\sum_{k=1}^{K} \sum_{i=1}^{n} Y_{ik} \log f_k(X_i)$, and classify by $G(x) = \arg\max_k f_k(x)$.
Can use other loss functions.
How to minimize $R(\theta)$? Gradient descent, called back-propagation. Very popular and appealing! Recall the Hebb model.
Other algorithms: Newton's, conjugate-gradient, ...
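The two criteria and the classification rule can be written out directly. The following is a small sketch with made-up targets and fitted values (the data and function names are assumptions for illustration); `Y` holds one-hot rows $Y_{ik}$ and `F` holds the fitted $f_k(X_i)$.

```python
import math


def squared_error(Y, F):
    """R(theta) = sum_i sum_k (Y_ik - f_k(X_i))^2."""
    return sum((y - f) ** 2 for yi, fi in zip(Y, F) for y, f in zip(yi, fi))


def cross_entropy(Y, F):
    """R(theta) = -sum_i sum_k Y_ik log f_k(X_i); terms with Y_ik = 0 contribute nothing."""
    return -sum(y * math.log(f)
                for yi, fi in zip(Y, F) for y, f in zip(yi, fi) if y > 0)


def classify(f):
    """G(x) = argmax_k f_k(x), returned as a 0-based class index."""
    return max(range(len(f)), key=lambda k: f[k])


# Hypothetical one-hot targets and fitted class probabilities (n = 2, K = 3).
Y = [[1, 0, 0], [0, 1, 0]]
F = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]]

sse = squared_error(Y, F)
ce = cross_entropy(Y, F)
labels = [classify(f) for f in F]
```

Here both observations are classified correctly, but the second one only narrowly, which the cross-entropy penalizes through the term $-\log 0.4$.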

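Back-propagation algorithm
Given: training data $(Y_i, X_i)$, $i = 1, \dots, n$.
Goal: estimate the $\alpha$'s and $\beta$'s.
Consider $R(\theta) = \sum_i \sum_k (Y_{ik} - f_k(X_i))^2 := \sum_i R_i := \sum_i r_i^2$.
NN: input $X_i$, output $(f_1(X_i), \dots, f_K(X_i))$.
$Z_{mi} = \sigma(\alpha_{0m} + \alpha_m^T X_i)$, $Z_i = (Z_{1i}, \dots, Z_{Mi})^T$,
$T_{ki} = \beta_{0k} + \beta_k^T Z_i$, $T_i = (T_{1i}, \dots, T_{Ki})^T$,
$f_k(X_i) = g_k(T_i) = T_{ki}$.
Chain rule:
$$\frac{\partial R_i}{\partial \beta_{km}} = \frac{\partial R_i}{\partial r_i} \frac{\partial r_i}{\partial g_k} \frac{\partial g_k}{\partial T_i} \frac{\partial T_i}{\partial \beta_{km}},$$
$$\frac{\partial R_i}{\partial \beta_{km}} = -2 (Y_{ik} - f_k(X_i)) \, g_k'(\beta_k^T Z_i) \, Z_{mi} := \delta_{ki} Z_{mi}.$$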

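Back-propagation algorithm (cont'd)
$$\frac{\partial R_i}{\partial \alpha_{ml}} = \frac{\partial R_i}{\partial r_i} \frac{\partial r_i}{\partial g_k} \frac{\partial g_k}{\partial T_i} \frac{\partial T_i}{\partial Z_i} \frac{\partial Z_i}{\partial \alpha_{ml}},$$
$$\frac{\partial R_i}{\partial \alpha_{ml}} = -\sum_k 2 (Y_{ik} - f_k(X_i)) \, g_k'(\beta_k^T Z_i) \, \beta_{km} \, \sigma'(\alpha_m^T X_i) \, X_{il} := s_{mi} X_{il},$$
where $\delta_{ki}$, $s_{mi}$ are "errors" from the current model.
Update at step $r + 1$:
$$\beta_{km}^{(r+1)} = \beta_{km}^{(r)} - \gamma_r \sum_i \left. \frac{\partial R_i}{\partial \beta_{km}} \right|_{\beta^{(r)}, \alpha^{(r)}}, \qquad \alpha_{ml}^{(r+1)} = \alpha_{ml}^{(r)} - \gamma_r \sum_i \left. \frac{\partial R_i}{\partial \alpha_{ml}} \right|_{\beta^{(r)}, \alpha^{(r)}}.$$
$\gamma_r$: learning rate; a tuning parameter; can be fixed or selected/decayed. Too large/small, then ...
Training epoch: a cycle of updating.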
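The back-propagation formulas can be sanity-checked against finite differences. The sketch below handles one observation with identity outputs ($g_k(T) = T_k$, so $g_k' = 1$) and uses $\sigma'(v) = \sigma(v)(1 - \sigma(v))$. The weights, data point, learning rate and function names are all made-up assumptions for illustration; the forward pass includes the intercepts, so the derivatives are evaluated at the full pre-activations.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, a0, A, b0, B):
    """Hidden units Z_m and identity outputs f_k = T_k for one observation."""
    z = [sigmoid(a0[m] + sum(A[m][l] * x[l] for l in range(len(x))))
         for m in range(len(A))]
    f = [b0[k] + sum(B[k][m] * z[m] for m in range(len(z)))
         for k in range(len(B))]
    return z, f


def loss_i(x, y, a0, A, b0, B):
    """R_i = sum_k (Y_ik - f_k(X_i))^2."""
    _, f = forward(x, a0, A, b0, B)
    return sum((yk - fk) ** 2 for yk, fk in zip(y, f))


def backprop_i(x, y, a0, A, b0, B):
    """Return (dR_i/dbeta_km, dR_i/dalpha_ml) via the delta and s 'errors'."""
    z, f = forward(x, a0, A, b0, B)
    delta = [-2.0 * (y[k] - f[k]) for k in range(len(B))]        # g_k' = 1
    s = [z[m] * (1.0 - z[m]) * sum(B[k][m] * delta[k] for k in range(len(B)))
         for m in range(len(A))]                                  # sigma' = z(1 - z)
    dB = [[delta[k] * z[m] for m in range(len(A))] for k in range(len(B))]
    dA = [[s[m] * x[l] for l in range(len(x))] for m in range(len(A))]
    return dB, dA


# Hypothetical data point and weights (p = 2, M = 3, K = 1).
x, y = [0.5, -1.2], [1.0]
a0 = [0.1, -0.3, 0.2]
A = [[0.4, -0.6], [0.3, 0.8], [-0.5, 0.1]]
b0 = [0.05]
B = [[0.7, -0.2, 0.5]]

dB, dA = backprop_i(x, y, a0, A, b0, B)


def numeric_dA(m, l, eps=1e-6):
    """Central finite difference of R_i with respect to alpha_ml."""
    A_plus = [row[:] for row in A]
    A_minus = [row[:] for row in A]
    A_plus[m][l] += eps
    A_minus[m][l] -= eps
    return (loss_i(x, y, a0, A_plus, b0, B)
            - loss_i(x, y, a0, A_minus, b0, B)) / (2.0 * eps)


# One gradient-descent update with an assumed learning rate gamma = 0.01
# (intercepts are left unchanged to keep the sketch short).
gamma = 0.01
B_new = [[B[k][m] - gamma * dB[k][m] for m in range(3)] for k in range(1)]
A_new = [[A[m][l] - gamma * dA[m][l] for l in range(2)] for m in range(3)]
loss_before = loss_i(x, y, a0, A, b0, B)
loss_after = loss_i(x, y, a0, A_new, b0, B_new)
```

Comparing `dA[m][l]` with `numeric_dA(m, l)` is a common way to catch sign or index mistakes in a hand-derived gradient before training.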

Some issues
Starting values: existence of many local minima and saddle points. Multiple tries; model averaging, ...
Data preprocessing: centering at 0 and scaling.
Stochastic gradient descent (SGD): use a minibatch (i.e. a random subset) of the training data for a few iterations; minibatch size: 32 or 64 or 128 or ..., a tuning parameter.
+: simple and intuitive; -: slow.
Modifications: SGD + Momentum.
SGD: $x_{t+1} = x_t - \gamma \nabla f(x_t)$.
SGD+M: $v_{t+1} = \rho v_t + \nabla f(x_t)$, $x_{t+1} = x_t - \gamma v_{t+1}$.
(AdaGrad, RMSProp) ... Adam, default (now!)
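A minimal sketch of minibatch SGD with momentum follows, applied to a made-up, noise-free linear regression problem so that the result is easy to check. The data, step size $\gamma = 0.01$, momentum $\rho = 0.9$, batch size and seed are all assumptions chosen for illustration. The update used is $v \leftarrow \rho v + \nabla f$, $x \leftarrow x - \gamma v$, with the freshly updated velocity; setting `rho=0` recovers plain SGD.

```python
import random


def make_data(n=64):
    """Hypothetical noise-free data y = 2x + 1, x evenly spaced in [-1, 1]."""
    xs = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return xs, [2.0 * x + 1.0 for x in xs]


def minibatch_grad(w, b, xs, ys, idx):
    """Gradient of the minibatch mean squared error with respect to (w, b)."""
    gw = gb = 0.0
    for i in idx:
        resid = w * xs[i] + b - ys[i]
        gw += 2.0 * resid * xs[i]
        gb += 2.0 * resid
    return gw / len(idx), gb / len(idx)


def sgd_momentum(xs, ys, gamma=0.01, rho=0.9, batch_size=16, epochs=300, seed=0):
    """Minibatch SGD with momentum; one epoch is one pass over shuffled data."""
    rng = random.Random(seed)
    w = b = 0.0       # starting values
    vw = vb = 0.0     # velocity terms
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # new random minibatches each epoch
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            gw, gb = minibatch_grad(w, b, xs, ys, batch)
            vw, vb = rho * vw + gw, rho * vb + gb      # v <- rho*v + grad
            w, b = w - gamma * vw, b - gamma * vb      # x <- x - gamma*v
    return w, b


xs, ys = make_data()
w_hat, b_hat = sgd_momentum(xs, ys)  # should approach slope 2 and intercept 1
```

Because the toy data are noise-free, every minibatch gradient vanishes at the true parameters, so the iterates settle there; with noisy data the iterates would keep fluctuating unless the learning rate is decayed.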

Some issues (cont'd)
Over-fitting? Universal Approximation Theorem. If we add more units or layers, then ...
1) Early stopping!
2) Regularization: add a penalty term, e.g. ridge; use $R(\theta) + \lambda J(\theta)$ with $J(\theta) = \sum_{km} \beta_{km}^2 + \sum_{ml} \alpha_{ml}^2$; called weight decay (see figure). Performance: see figures.
3) Regularization: dropout, i.e. (randomly) dropping a subset/proportion of nodes/units or connections during training; an ensemble; more robust.
A main technical issue with a deep NN: gradients vanishing or exploding. Why? Use ReLU: $f(x) = \max(0, x)$; batch normalization; ...
Transfer learning: reusing trained networks. Why? http://jmlr.org/proceedings/papers/v32/donahue14.pdf
Example code: ex7.1.r
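One way to see why gradients vanish or explode is to follow the chain rule through a deliberately simplified "network": a chain of one unit per layer, $h_{l+1} = \mathrm{act}(w h_l)$. This toy setup, the shared weight $w$ and the function names are assumptions for illustration, not the slides' model. The derivative of the last layer with respect to the input is a product of per-layer factors $w \cdot \mathrm{act}'(\cdot)$; since $\sigma' \le 1/4$, the sigmoid chain shrinks geometrically when $|w| \le 1$, while ReLU passes a factor of exactly $w$ on its active side.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def d_sigmoid(v):
    """sigma'(v) = sigma(v) * (1 - sigma(v)), never larger than 1/4."""
    s = sigmoid(v)
    return s * (1.0 - s)


def relu(v):
    return max(0.0, v)


def d_relu(v):
    """Derivative of max(0, v): 1 on the active side, 0 otherwise."""
    return 1.0 if v > 0 else 0.0


def chain_gradient(x, depth, act, d_act, w=1.0):
    """d h_depth / d x for the toy chain h_{l+1} = act(w * h_l), h_0 = x."""
    h, grad = x, 1.0
    for _ in range(depth):
        pre = w * h
        grad *= w * d_act(pre)  # chain rule: multiply one factor per layer
        h = act(pre)
    return grad


g_sigmoid = chain_gradient(1.0, 20, sigmoid, d_sigmoid)        # vanishes
g_relu = chain_gradient(1.0, 20, relu, d_relu)                 # stays at 1
g_relu_large_w = chain_gradient(1.0, 20, relu, d_relu, w=1.5)  # explodes
```

With 20 layers the sigmoid chain's gradient is below $(1/4)^{20}$, the ReLU chain with $w = 1$ keeps it at 1, and $w = 1.5$ inflates it to $1.5^{20}$, which is why weight scaling and batch normalization matter alongside the choice of activation.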

Figure: two panels, "Neural Network - 10 Units, No Weight Decay" and "Neural Network - 10 Units, Weight Decay", each reporting the training error, test error and Bayes error (Bayes error: 0.210).

Figure (panels "Sum of Sigmoids" and "Radial"; test error vs number of hidden units): Boxplots of test error, for simulated data example, relative to the Bayes error (broken horizontal line). True function is a sum of two sigmoids on the left, and a radial function is on the right. The test error is displayed for 10 different starting weights, for a single hidden layer neural network with the number of units as indicated.

Figure (panels "No Weight Decay" and "Weight Decay = 0.1"; test error vs number of hidden units): Boxplots of test error, for simulated data example, relative to the Bayes error. True function is a sum of two sigmoids. The test error is displayed for ten different starting weights, for a single hidden layer neural network with the number of units as indicated. The two panels represent no weight decay (left) and strong weight decay $\lambda = 0.1$ (right).

Figure ("Sum of Sigmoids, 10 Hidden Unit Model"; test error vs weight decay parameter): Boxplots of test error, for simulated data example. True function is a sum of two sigmoids. The test error is displayed for ten different starting weights, for a single hidden layer neural network with ten hidden units and weight decay parameter value as indicated.

Current and future...
Deep learning: deep NNs (Wikipedia; Google).
Facebook hired Yann LeCun at NYU; Google hired Geoffrey Hinton at U Toronto; Bengio stays at U Montreal; Baidu hired Andrew Ng, who recently left; ...
Impressive applications: image recognition (Krizhevsky et al); playing the game of Go (Silver et al 2016, Nature); ...
Keys: AlexNet (Krizhevsky et al), "60 million parameters... of five convolutional layers... three fully-connected layers with a final 1000-way softmax." "There are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images." Needs regularization too!
Qs: another wave? Yes! Just check the constantly appearing papers on arXiv, ICLR, NIPS, ...

Convolutional NNs
LeCun et al (1998, Proc of the IEEE).
Keys: "to ensure some degree of shift, scale, and distortion invariance: local receptive fields, shared weights... and spatial or temporal sub-sampling." "Local correlations are the reasons for the well-known advantages of extracting and combining local features..."
Hubel and Wiesel (1962): locally-sensitive, orientation-selective neurons in the cat's visual system.
New: a convolution layer uses the rectified linear function, $\mathrm{ReLU}(x) = \max(0, x)$.
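The three ingredients named above (local receptive fields, shared weights, sub-sampling) can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not LeCun et al's architecture: the 5x5 images, the 2x2 edge-detecting kernel and the function names are made up, the "convolution" is the cross-correlation commonly used in CNN code, and max pooling stands in for sub-sampling.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over every local patch (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    rows, cols = len(image), len(image[0])
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(cols - kw + 1)]
            for i in range(rows - kh + 1)]


def relu_map(fm):
    """Apply ReLU(x) = max(0, x) elementwise to a feature map."""
    return [[max(0.0, v) for v in row] for row in fm]


def max_pool(fm, size=2):
    """Sub-sample by taking the max over non-overlapping size x size blocks."""
    return [[max(fm[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fm[0]) - size + 1, size)]
            for i in range(0, len(fm) - size + 1, size)]


# Hypothetical kernel responding to a dark-to-bright vertical edge.
kernel = [[-1, 1],
          [-1, 1]]

# Two hypothetical 5x5 images: the same edge, shifted right by one pixel.
image_a = [[0, 1, 1, 1, 1] for _ in range(5)]
image_b = [[0, 0, 1, 1, 1] for _ in range(5)]

pooled_a = max_pool(relu_map(conv2d_valid(image_a, kernel)))
pooled_b = max_pool(relu_map(conv2d_valid(image_b, kernel)))
```

The raw feature maps differ (the edge response moves by one column), but after pooling both images give the same 2x2 summary, a small instance of the "some degree of shift invariance" in the quotation. The same four kernel weights are reused at all sixteen patch positions, which is what "shared weights" refers to.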

Figure: LeCun et al 1998, Proc of the IEEE.

Figure: Angermueller et al 2016, Mol Sys Biol.

Resources
Today's "standards": mostly in Python.
1. Caffe (UC Berkeley) $\Rightarrow$ Caffe2 (Facebook);
2. Torch (NYU/Facebook) $\Rightarrow$ PyTorch (Facebook);
3. Theano (U Montreal) $\Rightarrow$ TensorFlow (Google);
3b. Keras: on top of TensorFlow.
Others: MXNet (Amazon), Paddle (Baidu), CNTK (Microsoft), ...
CPU vs GPU.
Matlab: ConvNet, DeepLearnToolBox, MatConvNet, ...
Java: Deeplearning4j, ...
R packages: deepnet, darch, mxnet, h2o, ... now...
