19 Better Neural Network Training; Convolutional Neural Networks

Size: px

Start display at page:

Download "19 Better Neural Network Training; Convolutional Neural Networks"

Evangeline Ophelia Owen
5 years ago
Views:

1 108 Jnathan Richard Shewchuk 19 Better Neural Netwrk Training; Cnvlutinal Neural Netwrks [I m ging t talk abut a bunch f heuristics that make gradient descent faster, r make it find better lcal minima, r prevent it frm verfitting. I suggest yu implement vanilla stchastic backprp first, and experiment with the ther heuristics nly after yu get that wrking.] Heuristics fr Faster Training [A big disadvantage f neural nets is that they take a lng, lng time t train cmpared t ther classificatin methds we ve studied. Here are sme ways t speed them up. Unfrtunately, yu usually have t experiment with techniques and hyperparameters t find which nes will help with yur particular applicatin.] (1) (5) abve. [Fix the unit saturatin prblem.] Stchastic gradient descent: faster than batch n large, redundant data sets. [Whereas batch gradient descent walks dwnhill n ne cst functin, stchastic descent takes a very shrt step dwnhill n ne pint s lss functin and then anther shrt step n anther pint s lss functin. The cst functin is the sum f the lss functins ver all the sample pints, s ne batch step behaves similarly t n stchastic steps and takes the same amunt f time. But if yu have many di erent examples f the number 9, they cntain much redundant infrmatin, and stchastic gradient descent learns the redundant infrmatin mre quickly.] y batchvsstch.pdf (LeCun et al., E cient BackPrp ) [Left: a simple neural net with nly three weights, and its 2D training data. Center: batch gradient descent makes nly a little prgress each epch. (Epchs alternate between red and blue.) Right: stchastic descent decreases the errr much faster than batch descent.] One epch presents every training example nce. Training usually takes many epchs, but if sample is huge [and carries lts f redundant infrmatin] it can take less than ne.

2 Better Neural Netwrk Training; Cnvlutinal Neural Netwrks 109 Nrmalizing the data. Center each feature s mean is zer. Then scale each feature s variance 1. [The first step seems t make it easier fr hidden units t get int a gd perating regin f the sigmid r ReLU. The secnd step makes the bjective functin better cnditined, s gradient descent cnverges faster.] nrmalize.jpg [A 2D example f data nrmalizatin.] 6 x x illcnditin.pdf [Skewed data ften leads t an bjective functin with an ill-cnditined (highly eccentric) Hessian. Gradient descent in these functins can be painfully slw, as this figure shws. Nrmalizing helps by reducing the eccentricity. Whitening reduces the eccentricity even mre, but it s much mre expensive. Anther thing that helps with elngated trughs like this is mmentum, which we ll discuss shrtly. It eventually gets us mving fast alng the lng axis.]

3 110 Jnathan Richard Shewchuk Centering the hidden units helps t. Replace sigmids with tanh = e e e +e = 2 s(2 ) 1. [This functin ranges frm 1 t 1 instead f frm 0 t 1.] [If yu use tanh units, dn t frget that yu als need t change backprp t replace s 0 with the derivative f tanh. Als, gd utput target values change t rughly 0.7 and 0.7.] Di erent learning rate fr each layer f weights. Earlier layers have smaller gradients, need larger learning rate. curvaturelayers.pdf [In this illustratin, the inputs are at the bttm, and the utputs at the tp. The derivatives tend t be smaller at the earlier layers.] Emphasizing schemes. [Neural netwrks learn the mst redundant examples quickly, and the mst rare examples slwly. S we try t emphasize the uncmmn examples.] Present examples frm rare classes mre ften, r w/bigger. Same fr misclassified examples. [I have t warn yu that emphasizing schemes can backfire n bad utliers.] Secnd-rder ptimizatin. [Unfrtunately, Newtn s methd is cmpletely impractical, because the Hessian is t large and expensive t cmpute. There have been a lt f attempts t incrprate curvature infrmatin int neural net learning in cheaper ways, but nne f them are ppular yet.] Nnlinear cnjugate gradient: wrks well fr small nets + small data + regressin. Batch descent nly!! T slw with redundant data. Stchastic Levenberg Marquardt: apprximates a diagnal Hessian. [The authrs claim cnvergence is typically three times faster than well-tuned stchastic gradient descent. The algrithm is cmplicated.]

4 Better Neural Netwrk Training; Cnvlutinal Neural Netwrks 111 Heuristics fr Aviding Bad Lcal Minima (1) (5) abve. [Fix the unit saturatin prblem.] Stchastic gradient descent. Randm mtin gets yu ut f shallw lcal minima. [Even if yu re at a lcal minimum f the cst functin, each sample pint s lss functin will be trying t push yu in a di erent directin. Stchastic gradient descent lks like a randm walk r Brwnian mtin, which can shake yu ut f a shallw minimum.] Mmentum. Gradient descent changes velcity slwly. Carries us thrugh shallw lcal minima t deeper nes. w rw repeat w w + w w rw + w Gd fr bth batch & stchastic. Chse hyperparameter <1. [Think f w as the velcity. The hyperparameter specifies hw much mmentum persists frm iteratin t iteratin.] [I ve seen cnflicting advice n. Sme researchers set it t 0.9; sme set it clse t zer. Ge Hintn suggests starting at 0.5 and slwly increasing it t 0.9 r higher as the gradients get small.] [If is large, yu shuld usually chse smaller t cmpensate, and yu might als want t change the first line s the initial velcity is larger.] [A prblem with mmentum is that nce it gets clse t a gd minimum, it scillates arund the minimum. But it s mre likely t get clse t a gd minimum in the first place.] Heuristics t Avid Overfitting Ensemble f neural nets. Randm initial weights + bagging. [We saw hw well ensemble learning wrks fr decisin trees. It wrks well fr neural nets t. The cmbinatin f randm initial weights and bagging helps ensure that each neural net cmes ut di erently. Obviusly, ensembles f neural nets are slw.] `2 regularizatin, aka weight decay. Add w 0 2 t the cst/lss fn, where w 0 is vectr f all weights except biases. [w 0 includes all the weights in matrices V and W except the last clumn f each, rewritten as a vectr.] [We d this fr the same reasn we d it in ridge regressin: penalizing large weights reduces verfitting by reducing the variance f the methd.] i has extra term 2 w i Weight decays by factr 1 2 if nt reinfrced by training.

5 112 Jnathan Richard Shewchuk Neural Netwrk - 10 Units, N Weight Decay Neural Netwrk - 10 Units, Weight Decay= Training Errr: Test Errr: Bayes Errr: Training Errr: Test Errr: Bayes Errr: weightdecay.pdf, weightdecayn.pdf (ESL, Figure 11.4) Write sftmax + crssentrpy lss. [Examples f 2D classificatin withut (left) and with (right) weight decay. Observe that the secnd example better apprximates the Bayes ptimal bundary (dashed purple curve).] Drput emulates an ensemble in ne netwrk. drput.pdf [During training, we temprarily disable a randm subset f the units, alng with all the edges in and ut f thse units. It seems t wrk well t disable each hidden unit with prbability 0.5, and t disable input units with a smaller prbability. We d stchastic gradient descent and we frequently change which randm subset f units is disabled. The authrs claim that their methd gives even better generalizatin than `2 regularizatin. It gives sme f the advantages f an ensemble, but it s faster t train.] Fewer hidden units. [The number f hidden units is a hyperparameter yu can use t adjust the bias-variance trade. If there s t few, yu can t learn well, but if there s t many, yu tend t verfit. `2 regularizatin and drput make it safer t have t many hidden units, s it s less critical t find just the right number.]

6 Better Neural Netwrk Training; Cnvlutinal Neural Netwrks 113 CONVOLUTIONAL NEURAL NETWORKS (CnvNets; CNNs) [Cnvlutinal neural nets have caused a big resurgence f interest in neural nets in the last few years. Often yu ll hear the buzzwrd deep learning, which refers t neural nets with many layers. Many f the mst successful deep netwrks have been cnvlutinal.] Visin: inputs are large images image = 40,000 pixels. If we cnnect them all t 40,000 hidden units! 1.6 billin cnnectins. Neural nets are ften verparametrized: t many weights, t little data. [As a rule f thumb, yu shuld have mre data than yu have weights t estimate. If yu dn t fllw that rule, yu usually verfit. With images, it s impssible t get enugh data t crrectly train billins f weights. We culd shrink the images, but then we re thrwing away useful data. Anther prblem with having billins f weights is that the netwrk becmes very slw t train r even t use.] [Researchers have addressed these prblems by taking inspiratin frm the neurlgy f the visual system. Remember that early in the semester, I tld yu that yu can get better perfrmance n the handwriting recgnitin task by using edge detectrs. Edge detectrs have tw interesting prperties. First, each edge detectr lks at just ne small part f the image. Secnd, the edge detectin cmputatin is the same n matter which part f the image yu apply it t. S let s apply these tw prperties t neural net design.] CnvNet ideas: (1) Lcal cnnectivity: A hidden unit (in early layer) cnnects nly t small patch f units in previus layer. [This imprves the verparametrizatin prblem, and speeds up bth the frward pass and the training cnsiderably.] (2) Shared weights: Grups f hidden units share same set f input weights, called a mask aka filter aka kernel. [N relatin t the kernels f Lecture 14.] We learn several masks. [Each mask perates n every patch f image.] Masks patches = hidden units in first hidden layer. If ne patch learns t detect edges, every patch has an edge detectr. [Because the mask that detects edges is applied t every patch.] CnvNets explit repeated structure in images, audi. Cnvlutin: the same linear transfrmatin applied t di erent parts f the input by shifting. [Shared weights imprve the verparametrizatin prblem even mre, because shared weights means fewer weights. It s a kind f regularizatin.] [But shared weights have anther big advantage. Suppse that gradient descent starts t develp an edge detectr. That edge detectr is being trained n every part f every image, nt just n ne spt. And that s gd, because edges appear at di erent lcatins in di erent images. The lcatin n lnger matters; the edge detectr can learn frm edges in every part f the image.] [In a neural net, yu can think f hidden units as features that we learn, as ppsed t features that yu cde up yurself. Cnvlutinal neural nets take them t the next level by learning features frm multiple patches simultaneusly and then applying thse features everywhere, nt just in the patches where they were riginally learned.] [By the way, althugh lcal cnnectivity is inspired by ur visual systems, shared weights bviusly dn t happen in bilgy.]

7 114 Jnathan Richard Shewchuk [Shw slides n cmputing in the visual crtex and CnvNets (cnn.pdf), available frm the CS 189 web page. Srry, readers, there are t many images t include here. The narratin is belw.] [Neurlgists can stick needles int individual neurns in animal brains. After a few hurs the neurn dies, but until then they can recrd its actin ptentials. In this way, bilgists quickly learned hw sme f the neurns in the retina, called retinal ganglin cells, respnd t light. They have interesting receptive fields, illustrated in the slides, which shw that each ganglin cell receives excitatry stimulatin frm receptrs in a small patch f the retina but inhibitry stimulatin frm ther receptrs arund it.] [The signals frm these cells prpagate t the V1 visual crtex in the ccipital lbe at the back f yur skull. The V1 cells prved much harder t understand. David Hubel and Trsten Wiesel f the Jhns Hpkins University put prbes int the V1 visual crtex f cats, but they had a very hard time getting any neurns t fire there. Hwever, a lucky accident unlcked the secret and ultimately wn them the 1981 Nbel Prize in Physilgy.] [Shw vide HubelWiesel.mp4, taken frm YuTube: ] [The glass slide happened t be at the particular rientatin the neurn was sensitive t. The neurn desn t respnd t ther rientatins; just that ne. S they were pretty lucky t catch that.] [The simple cells act as line detectrs and/r edge detectrs by taking a linear cmbinatin f inputs frm retinal ganglin cells.] [The cmplex cells act as lcatin-independent line detectrs by taking inputs frm many simple cells, which are lcatin dependent.] [Later researchers shwed that lcal cnnectivity runs thrugh the V1 crtex by prjecting certain images nt the retina and using radiactive tracers in the crtex t mark which neurns had been firing. Thse images shw that the neural mapping frm the retina t V1 is retinatpic, i.e., lcality preserving. This is a big part f the inspiratin fr cnvlutinal neural netwrks!] [Unfrtunately, as we g deeper int the visual system, layers V2 and V3 and s n, we knw less and less abut what prcessing the visual crtex des.] [CnvNets were first ppularized by the success f Yann LeCun s LeNet 5 handwritten digit recgnitin sftware. LeNet 5 has six hidden layers! Layers 1 and 3 are cnvlutinal layers in which grups f units share weights. Layers 2 and 4 are pling layers that make the image smaller. These are just hardcded max-functins with n weights and nthing t train. Layers 5 and 6 are just regular layers f hidden units with n shared weights. A great deal f experimentatin went int figuring ut the number f layers and their sizes. At its peak, LeNet 5 was respnsible fr reading the zip cdes n 10% f US Mail. Anther Yann LeCun system was deplyed in ATMs and check reading machines and was reading 10 t 20% f all the checks in the US by the late 90 s.] [Shw Yann LeCun s vide LeNet5.mv, illustrating LeNet 5.]

8 Better Neural Netwrk Training; Cnvlutinal Neural Netwrks 115 LeNet5.png Architecture f LeNet5. [When CnvNets were first applied t image analysis, researchers fund that sme f the learned masks are edge detectrs r line detectrs, similar t the nes that Hubel and Wiesel discvered! This created a lt f excitement in bth the cmputer learning cmmunity and the neurscience cmmunity. The fact that a neural net can naturally learn the same features as the mammalian visual crtex is impressive.] [I tld yu tw lectures ag that neural nets research was ppular in the 60 s, but the 1969 bk Perceptrns killed interest in them thrughut the 70 s. They came back in the 80 s, but interest was partly killed a secnd time in the 00 s by... guess what? By supprt vectr machines. SVMs wrk well fr a lt f tasks, they re much faster t train, and they mre r less have nly ne hyperparameter, whereas neural nets take a lt f wrk t tune.] [Neural nets are nw in their third wave f ppularity. The single biggest factr in bringing them back is prbably big data. Thanks t the internet, we nw have abslutely huge cllectins f images t train neural nets with, and researchers have discvered that neural nets ften give better perfrmance than cmpeting algrithms when yu have huge amunts f data t train them with. In particular, cnvlutinal neural nets are nw learning better features than hand-tuned features. That s a recent change.] [One event that brught attentin back t neural nets was the ImageNet Image Classificatin Challenge in The winner f that cmpetitin was a neural net, and it wn by a huge margin, abut 10%. It s called AlexNet, and it s surprisingly similarly t LeNet 5, in terms f hw its layers are structured. Hwever, there are sme new innvatins that led t their prize-winning perfrmance, in additin t the fact that the training set had 1.4 millin images: they used ReLUs, GPUs fr training, and drput.] alexnet.pdf Architecture f AlexNet.

IAML: Support Vector Machines

IAML: Support Vector Machines 1 / 22 IAML: Supprt Vectr Machines Charles Suttn and Victr Lavrenk Schl f Infrmatics Semester 1 2 / 22 Outline Separating hyperplane with maimum margin Nn-separable training data Epanding the input int