Statistical Machine Learning Methods for Bioinformatics III. Neural Network & Deep Learning Theory


1 Statistical Machine Learning Methods for Bioinformatics III. Neural Network & Deep Learning Theory. Jianlin Cheng, PhD, Department of Computer Science, University of Missouri, 2016. Free for Academic Use. Jianlin Cheng & original sources of some materials.

2 Classification Problem Features: legs, weight, size, ..., feature m. Input -> Output: category / label (mammal, bug). Question: How to automatically predict the output given the input? Idea: learn from known examples and generalize to unknown ones.

3 Data-Driven Machine Learning Approach Split the data with labels into training data and test data. Training: build a model (classifier) that maps input to output. Test: test the model and make predictions on new data. Example: input is the words of a news article; output is politics, sports, entertainment, ... Key idea: learn from known data and generalize to unseen data.

4 Outline Introduction, linear regression, linear discriminant function (classification), one-layer neural network / perceptron, multi-layer network, recurrent neural network, preventing overfitting, speeding up learning, deep learning.

5 Machine Learning Supervised learning (training with labeled data), unsupervised learning (clustering unlabeled data), and semi-supervised learning (using both labeled and unlabeled data). Supervised learning: classification and regression. Classification: the output is a discrete value. Regression: the output is a real value.

6 Learning Example: Recognize Handwriting Classification: recognize each number. Clustering: cluster the same numbers together. Regression: predict the index of the Dow Jones.

7 Neural Network A neural network can do both supervised learning and unsupervised learning. A neural network can do both regression and classification. Neural networks have both statistical and artificial intelligence roots.

8 Roots of Neural Network Artificial intelligence root (neuroscience). Statistical root (linear regression, generalized linear regression, discriminant analysis; this is our focus).

9 A Typical Cortical Neuron Dendritic tree: junctions between neurons that collect chemical signals. Axon: generates potentials (fire / not fire). Synapse: controls the release of chemical transmitters.

10 A Neural Model Weights, inputs, activation, and an activation function. Adapted from

11 Statistics Root: Linear Regression Example Fish length vs. weight? X: input or predictor. Y: output or response. Goal: learn a linear function E[y|x] = wx + b. Adapted from A. Moore, 2003.

12 Linear Regression Definition of a linear model: y = wx + b + noise, with noise ~ N(0, σ^2), assuming σ is a constant, so y ~ N(wx + b, σ^2). Estimate the expected value of y given x (E[y|x] = wx + b). Given a set of data (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), find the optimal parameters w and b.

13 Objective Function Least square error: Σ_{i=1}^{N} (y_i - wx_i - b)^2. Maximum likelihood: Π_{i=1}^{N} P(y_i | x_i, w, b). Minimizing the square error is equivalent to maximizing the likelihood.

14 Maximize Likelihood P(y_i | x_i, w, b) = (1/√(2πσ^2)) e^{-(y_i - wx_i - b)^2 / (2σ^2)}. Maximizing the likelihood is the same as minimizing the negative log-likelihood: -Σ_{i=1}^{N} log P(y_i | x_i, w, b) = Σ_{i=1}^{N} [ log √(2πσ^2) + (y_i - wx_i - b)^2 / (2σ^2) ]. Note: σ is a constant, so this reduces to minimizing Σ_{i=1}^{N} (y_i - wx_i - b)^2.

15 1-Variable Linear Regression Minimize E = Σ_{i=1}^{N} (y_i - wx_i - b)^2. Setting ∂E/∂w = Σ_{i=1}^{N} -2(y_i - wx_i - b) x_i = 0 and ∂E/∂b = Σ_{i=1}^{N} -2(y_i - wx_i - b) = 0 gives w = (N Σ x_i y_i - Σ x_i Σ y_i) / (N Σ x_i^2 - (Σ x_i)^2) and b = (Σ (y_i - w x_i)) / N.

16 Multivariate Linear Regression What about multiple predictors (x_1, x_2, ..., x_d)? y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d + ε. For multiple data points, each data point is represented as (y_i, x_i), where x_i consists of d predictors (x_i1, x_i2, ..., x_id): y_i = w_0 + w_1 x_i1 + w_2 x_i2 + ... + w_d x_id + ε_i.

17 A Motivating Example Each day you get lunch at the cafeteria. Your diet consists of fish, chips, and beer. You get several portions of each. The cashier only tells you the total price of the meal. After several days, you should be able to figure out the price of each portion. Each meal price gives a linear constraint on the prices of the portions: price = x_fish w_fish + x_chips w_chips + x_beer w_beer. G. Hinton, 2006.

18 Matrix Representation With n data points of d dimensions, stack the outputs into an n×1 vector Y, the inputs into an n×(d+1) matrix X (a leading column of 1s for the bias), and the weights into a (d+1)×1 vector W. Matrix representation: Y = XW + ε.

19 Multivariate Linear Regression Goal: minimize the square error (Y - XW)^T (Y - XW) = Y^T Y - 2 W^T X^T Y + W^T X^T X W. Setting the derivative -2 X^T Y + 2 X^T X W to zero gives W = (X^T X)^{-1} X^T Y. Thus, we can solve linear regression using matrix inversion, transposition, and multiplication.
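
To make the normal-equation solution above concrete, here is a minimal NumPy sketch; the data values and variable names are made up for illustration:

```python
import numpy as np

# Toy data: n = 5 points, d = 2 predictors (values invented for illustration).
X_raw = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 4.0, 11.0, 10.0, 15.0])

# Prepend a column of 1s so that w0 acts as the bias term.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equation W = (X^T X)^{-1} X^T Y; np.linalg.solve avoids forming the inverse explicitly.
W = np.linalg.solve(X.T @ X, X.T @ y)
print("weights (w0, w1, ..., wd):", W)
print("fitted values:", X @ W)
```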

20 Difficulty and Generalization Numerical computation issues (with a lot of data points, matrix inversion becomes impractical). A singular matrix (determinant is zero) has no inverse. How do we handle non-linear data? It turns out that a neural network and its iterative learning algorithm can address these problems.

21 Graphical Representation: One-Layer Neural Network for Regression Output unit o, target y, weights w_0, w_1, ..., w_d, activation a = Σ_i w_i x_i, input units 1, x_1, ..., x_d. The activation function f converts the activation a to the output; here it is the linear (identity) function, so o = a.

22 Gradient Descent Algorithm For a data point x = (x_1, x_2, ..., x_d), the error is E = (y - o)^2 = (y - w_0 x_0 - w_1 x_1 - ... - w_d x_d)^2. Partial derivative: ∂E/∂w_i = 2(y - o)(-x_i) = -2(y - o) x_i. Move against the gradient (increase w_i when ∂E/∂w_i < 0, decrease it when ∂E/∂w_i > 0), giving the update rule w_i(t+1) = w_i(t) + η(y - o) x_i, the famous delta rule.

23 Algorithm of a One-Layer Regression Neural Network Initialize the weights w_i (small random numbers). Repeat: present a data point x = (x_1, x_2, ..., x_d) to the network and compute the output o; if y > o, add ηx_i to w_i; if y < o, add -ηx_i to w_i. Until Σ_k (y_k - o_k)^2 is zero, falls below a threshold, or the predefined number of iterations is reached. Comments: online learning updates the weights for every x; batch learning updates the weights once per batch of x (i.e. by Σ ηx_i).
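
A minimal NumPy sketch of this online delta-rule loop (the function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def train_delta_rule(X, y, lr=0.01, epochs=100):
    """Online delta-rule training of a one-layer linear network.
    X: (n, d) inputs, y: (n,) real-valued targets."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # constant input 1 for the bias weight w0
    w = np.random.uniform(-0.2, 0.2, d + 1)   # small random initial weights
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            o = w @ xi                        # linear output o = sum_i w_i x_i
            w += lr * (yi - o) * xi           # delta rule: w_i <- w_i + eta (y - o) x_i
    return w
```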

24 Graphical Representation: One-Layer Neural Network for Regression Output O, target y. Output unit: O = f(Σ_i w_i x_i), where f is the activation function. Activation a = Σ_i w_i x_i, weights w_0, w_1, ..., w_d, input units 1, x_1, ..., x_d.

25 What About a Hyperbolic Tangent Function for the Output Unit? Can we use an activation function other than a linear function? For instance, if we want to limit the output to be in [-1, +1], we can use the hyperbolic tangent function: tanh(x) = (e^x - e^{-x}) / (e^x + e^{-x}). The only thing to change is to use the new gradient.

26 Two-Category Classification Two classes: C_1 and C_2. Input feature vector: x. Define a discriminant function y(x) such that x is assigned to C_1 if y(x) > 0 and to class C_2 if y(x) < 0. Linear discriminant function: y(x) = w^T x + w_0 = w'^T x', where x' = (1, x). w: weight vector, w_0: bias.

27 A Linear Decision Boundary in 2-D Input Space w: orientation of the decision boundary. w_0: defines the position of the plane in terms of its perpendicular distance from the origin. The boundary is y(x) = w^T x + w_0 = 0; for any x on it, w^T x / ||w|| = -w_0 / ||w||, which is the perpendicular distance l from the origin to the boundary.

28 Graphical Representation: Perceptron, One-Layer Classification Neural Network Activation / transfer function: out = g(w^T x), a threshold function: w^T x > 0 gives +1 (class 1), w^T x < 0 gives -1 (class 2). Activation w^T x = Σ_i w_i x_i, weights w_0, w_1, ..., w_d, input units 1, x_1, ..., x_d.

29 Perceptron Criterion Minimize the classification error. Input data (vectors): x_1, x_2, ..., x_N and corresponding target values t_1, t_2, ..., t_N. Goal: for all x in C_1 (t = 1), w^T x > 0; for all x in C_2 (t = -1), w^T x < 0. In other words, for all x: w^T x t > 0. Error: E_perc(w) = -Σ_{n ∈ M} w^T x_n t_n, where M is the set of misclassified data points.

30 Gradient Descent For each misclassified data point, adjust the weights as follows: w = w - η ∂E/∂w = w + η x_n t_n.

31 Perceptron Algorithm Initialize the weight vector w. Repeat: for each data point (x_n, t_n), classify it using the current w; if w^T x_n t_n > 0 (correct), do nothing; if w^T x_n t_n < 0 (wrong), set w_new = w + η x_n t_n and w = w_new. Until w no longer changes (all the data will be separated correctly if the data is linearly separable) or the error falls below a threshold. Rosenblatt, 1962.
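
The perceptron algorithm above maps almost line-for-line into code; a minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def perceptron_train(X, t, lr=1.0, max_epochs=100):
    """Perceptron learning. X: (n, d) inputs, t: (n,) targets in {-1, +1}."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])    # x0 = 1 carries the bias w0
    w = np.zeros(d + 1)
    for _ in range(max_epochs):
        changed = False
        for xn, tn in zip(Xb, t):
            if tn * (w @ xn) <= 0:           # misclassified (or on the boundary)
                w += lr * tn * xn            # w <- w + eta * x_n * t_n
                changed = True
        if not changed:                      # all points classified correctly
            break
    return w
```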

32 Perceptron Convergence Theorem For any data set which is linearly separable, the algorithm is guaranteed to find a solution in a finite number of steps (Rosenblatt, 1962; Block, 1962; Nilsson, 1965; Minsky and Papert, 1969; Duda and Hart, 1973; Hand, 1981; Arbib, 1987; Hertz et al., 1991).

33 Perceptron Demo v=vgwemzhplsa

34 Multi-Class Linear Discriminant Function c classes. Use one discriminant function y_k(x) = w_k^T x + w_k0 for each class C_k. A new data point x is assigned to class C_k if y_k(x) > y_j(x) for all j ≠ k.

35 One-Layer Multi-Class Perceptron Outputs y_1, ..., y_c, with weights w_10, w_11, ..., w_1d through w_c0, w_c1, ..., w_cd connecting the inputs x_0 = 1, x_1, ..., x_d to each output. How do we learn it?

36 Multi-Threshold Perceptron Algorithm Initialize the weights w. Repeat: present a data point x to the network; if the classification is correct, do nothing; if x is wrongly classified as C_i instead of its true class C_j, adjust the weights connected to C_i and C_j as follows: add -ηx_k to w_ik and add ηx_k to w_jk. Until the misclassification count is zero or below a threshold. Note: one may also add -ηx_k to w_lk for any l with y_l > y_j.

37 Limitation of the Perceptron It cannot separate non-linear data completely, and it cannot fit non-linear data well. Two directions to attack the problem: (1) extend to a multi-layer neural network; (2) map the data into a high dimension (the SVM approach).

38 Exclusive OR Problem The points (0,1) and (1,0) belong to class C1, while (0,0) and (1,1) belong to class C2. A perceptron (one-layer neural network) cannot learn a function that separates the two classes perfectly.

39 Logistic Regression Estimate the posterior distribution P(C_1 | x). Dose-response estimation: in a bioassay, the relation between the dose level and the death rate P(death | x). We cannot use 0/1 hard classification. We cannot use unconstrained linear regression, because P(death | x) must be in [0,1].

40 Logistic Regression and a One-Layer Neural Network with a Sigmoid Output P(death | x) = 1 / (1 + e^{-w^T x}) (the sigmoid function). Target: t (0 or 1). Activation function: sigmoid y = 1 / (1 + e^{-z}). Activation z = Σ_i w_i x_i over the inputs 1, x_1, ..., x_d.

41 How to Adjust Weights? Minimize the error E = (t - y)^2. For simplicity, we derive the formula for one data point; for multiple data points, just add the gradients together. ∂E/∂w_i = (∂E/∂y)(∂y/∂z)(∂z/∂w_i) = -2(t - y) y(1 - y) x_i. Notice that for the sigmoid y = 1 / (1 + e^{-z}), ∂y/∂z = y(1 - y).

42 Error Function and Learning Least squares vs. maximum likelihood: the output y is the probability of being in C_1 (t = 1), and 1 - y is the probability of being in C_2, so P(t | x) = y^t (1 - y)^{1-t}. Maximum likelihood is equivalent to minimizing the negative log-likelihood: E = -log P(t | x) = -t log y - (1 - t) log(1 - y) (cross / relative entropy).

43 How to Adjust Weights? Minimize the error E = -t log y - (1 - t) log(1 - y). For simplicity, we derive the formula for one data point; for multiple data points, just add the gradients together. ∂E/∂y = -t/y + (1 - t)/(1 - y) = (y - t) / (y(1 - y)), so ∂E/∂w_i = (∂E/∂y)(∂y/∂z)(∂z/∂w_i) = [(y - t) / (y(1 - y))] · y(1 - y) · x_i = (y - t) x_i. Update rule: w_i(t+1) = w_i(t) + η(t - y) x_i.
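
Putting the sigmoid, the cross-entropy error, and the (t - y) x update together, a minimal logistic-regression training sketch in NumPy (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, t, lr=0.1, epochs=200):
    """Online gradient descent for logistic regression with cross-entropy error.
    X: (n, d) inputs, t: (n,) targets in {0, 1}."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # bias input x0 = 1
    w = np.random.uniform(-0.2, 0.2, d + 1)   # small random initial weights
    for _ in range(epochs):
        for xi, ti in zip(Xb, t):
            y = sigmoid(w @ xi)               # predicted probability of class 1
            w += lr * (ti - y) * xi           # gradient of the cross-entropy is (y - t) x
    return w
```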

44 Multi-Class Logistic Regression The transfer (or activation) function is the normalized exponential (or softmax): y_k = e^{a_k} / Σ_{j=1}^{c} e^{a_j}, where a_k is the activation of output node k. Outputs y_1, ..., y_c, weights w_10, w_11, ..., w_1d through w_c0, ..., w_cd, inputs x_0, x_1, ..., x_d. How do we learn this network? Once again, gradient descent.
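
A small sketch of the softmax (normalized exponential) transfer function; subtracting the maximum activation is a common numerical-stability trick, not something stated on the slide:

```python
import numpy as np

def softmax(a):
    """y_k = exp(a_k) / sum_j exp(a_j), computed in a numerically stable way."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # class probabilities summing to 1
```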

45 Questions? Is logistic regression a linear regression? Can logistic regression handle non-linearly separable data? How do we introduce non-linearity?

46 Support Vector Machine Approach Map the data points into a high dimension, e.g. by adding some non-linear features. Suppose we augment the features into three dimensions (x_1, x_2, x_1^2 + x_2^2). All data points in class C2 have a larger value for the third feature than the data points in C1 (e.g. the boundary x_1^2 + x_2^2 = 10). Now the data is linearly separable.

47 Neural Network Approach: Multi-Layer Perceptrons In addition to input nodes and output nodes, some hidden nodes between the input and output nodes are introduced. Use hidden units to learn internal features to represent the data. Hidden nodes can learn internal representations of the data that are not explicit in the input features. The transfer functions of the hidden units are non-linear functions.

48 Multi-Layer Perceptron Connections go from a lower layer to a higher layer (usually from the input layer to the hidden layer, then to the output layer). Connections between input/hidden nodes, input/output nodes, hidden/hidden nodes, and hidden/output nodes are arbitrary as long as there is no loop (the network must be feedforward). However, for simplicity, we usually only allow connections from input nodes to hidden nodes and from hidden nodes to output nodes. Connections within a layer are disallowed.

49 Multi-Layer Perceptron A two-layer neural network (one hidden and one output layer) with a non-linear activation function is a universal function approximator (see Baldi and Brunak 2001 or Bishop 1996 for the proof), i.e. it can approximate any numeric function with arbitrary precision given a set of appropriate weights and hidden units. In the early days, people usually used two-layer (or three-layer, if you count the input as one layer) neural networks; increasing the number of layers was occasionally helpful. Later this expanded into deep learning with many layers!

50 Two-Layer Neural Network Outputs y_1, ..., y_k, ..., y_c; hidden units z_1, ..., z_j, ..., z_M (with z_0 = 1); inputs x_0 = 1, x_1, ..., x_i, ..., x_d. Output activation function f (linear, sigmoid, softmax); activation of output unit k: a_k = Σ_{j=0}^{M} w_kj z_j. Hidden activation function g (linear, tanh, sigmoid); activation of hidden unit j: a_j = Σ_{i=0}^{d} w_ji x_i. Overall: y_k = f(Σ_{j=0}^{M} w_kj g(Σ_{i=0}^{d} w_ji x_i)).
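
For illustration, a minimal forward pass for this two-layer network, assuming tanh hidden units and linear outputs (a regression setting); the weight-matrix shapes are one possible convention:

```python
import numpy as np

def forward(x, W1, W2):
    """x: (d,) input; W1: (M, d+1) input-to-hidden weights (column 0 is the bias);
    W2: (c, M+1) hidden-to-output weights (column 0 is the bias)."""
    xb = np.concatenate(([1.0], x))   # x0 = 1
    z = np.tanh(W1 @ xb)              # hidden outputs z_j = g(a_j)
    zb = np.concatenate(([1.0], z))   # z0 = 1
    y = W2 @ zb                       # outputs y_k = f(a_k) with f linear
    return y, z
```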

51 Adjust Weights by Training How do we adjust the weights? Adjust the weights using known examples (training data) (x_1, x_2, x_3, ..., x_d, t). Try to adjust the weights so that the difference between the output of the neural network y and the target t becomes smaller and smaller. The goal is to minimize the error (difference), as we did for the one-layer neural network.

52 Adjust Weights Using Gradient Descent (Back-Propagation) Known: data (x_1, x_2, x_3, ..., x_d) and target t. Unknown weights w: w_11, w_12, ... Randomly initialize the weights. Repeat: for each example, compute the output y; calculate the error E = (y - t)^2; compute the derivative of E with respect to w: dw = ∂E/∂w; update w_new = w_prev - η·dw. Until the error stops decreasing or the maximum number of iterations is reached. Note: η is the learning rate or step size.

53 Insights We know how to compute the derivative for a one-layer neural network. How do we change the weights between the input layer and the hidden layer? Should we compute the derivative of each w separately, or can we reuse intermediate results? We will derive an efficient back-propagation algorithm. We will derive the learning rule for one data example; for multiple examples, we can simply add together their derivatives for each weight parameter.

54 Neural Network Learning: Two Processes Forward propagation: present an example (data point) to the neural network; compute the activations into the units and the outputs from the units. Backward propagation: propagate the error back from the output layer to the input layer and compute the derivatives (gradients).

55 Forward Propagation Output activation function f (linear, sigmoid, softmax); activation of output unit k: a_k = Σ_{j=1}^{M} w_kj z_j. Hidden activation function g (linear, tanh, sigmoid); activation of hidden unit j: a_j = Σ_{i=1}^{d} w_ji x_i. Time complexity? O(dM + MC) = O(W), where W is the total number of weights.

56 Backward Propagation Error E = (1/2) Σ_{k=1}^{C} (y_k - t_k)^2. Output units: ∂E/∂y_k = y_k - t_k, so δ_k = ∂E/∂a_k = (y_k - t_k) f'(a_k), and ∂E/∂w_kj = δ_k z_j. Hidden units: δ_j = ∂E/∂a_j = Σ_{k=1}^{C} δ_k w_kj g'(a_j), and ∂E/∂w_ji = δ_j x_i. Time complexity? O(CM + Md) = O(W). Without back-propagation (computing each derivative separately), the time complexity is O(MdC + CM).
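
A minimal back-propagation step matching the forward pass sketched earlier (tanh hidden units, linear outputs, squared error); the shapes and names follow the same illustrative conventions:

```python
import numpy as np

def backprop(x, t, W1, W2):
    """Returns (dE/dW1, dE/dW2) for one example with E = 0.5 * sum_k (y_k - t_k)^2."""
    xb = np.concatenate(([1.0], x))
    z = np.tanh(W1 @ xb)                          # hidden outputs
    zb = np.concatenate(([1.0], z))
    y = W2 @ zb                                   # linear outputs, so f'(a_k) = 1

    delta_out = y - t                             # delta_k = (y_k - t_k) f'(a_k)
    grad_W2 = np.outer(delta_out, zb)             # dE/dw_kj = delta_k z_j

    # delta_j = sum_k delta_k w_kj * g'(a_j), with g'(a_j) = 1 - z_j^2 for tanh
    delta_hidden = (W2[:, 1:].T @ delta_out) * (1.0 - z ** 2)
    grad_W1 = np.outer(delta_hidden, xb)          # dE/dw_ji = delta_j x_i
    return grad_W1, grad_W2
```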

57 Example With E = (1/2)(y - t)^2, a linear output unit f, and sigmoid hidden units g: at the output, δ = ∂E/∂a = y - t, so ∂E/∂w_j = δ z_j = (y - t) z_j; at hidden unit j, δ_j = δ w_j g'(a_j) = (y - t) w_j z_j (1 - z_j), so ∂E/∂w_ji = δ_j x_i = (y - t) w_j z_j (1 - z_j) x_i.

58 Algorithm Initialize the weights w. Repeat: for each data point x, do the following: forward propagation: compute the outputs and activations; backward propagation: compute the errors for each output unit and hidden unit; compute the gradient for each weight; update the weight w = w - η (∂E/∂w). Until a set number of iterations is reached or the error drops below a threshold.

59 Implementation Issues What should we store? An input vector x of d dimensions. An M×d matrix {w_ji} for the weights between input and hidden units. An activation vector of M dimensions for the hidden units. An output vector of M dimensions for the hidden units. A C×M matrix {w_kj} for the weights between hidden and output units. An activation vector of C dimensions for the output units. An output vector of C dimensions for the output units. An error vector of C dimensions for the output units. An error vector of M dimensions for the hidden units.

60 Recurrent Network Forward: at time 1, present x_1 (together with zeros for the recurrent inputs); at time 2, present x_2 together with the previous output y_1. Backward: at time t, back-propagate the output errors; at time t-1, back-propagate the output errors together with the errors passed back from the previous step.

61 Recurrent Neural Network 1. A recurrent network is essentially a series of feed-forward neural networks sharing the same weights. 2. A recurrent network is good for time series data and sequence data, such as biological sequences and stock series.

62 Overfitting The training data contains information about the regularities in the mapping from input to output, but it also contains noise: the target values may be unreliable; there is sampling error; there will be accidental regularities just because of the particular training cases that were chosen. When we fit the model, it cannot tell which regularities are real and which are caused by sampling error, so it fits both kinds of regularity. If the model is very flexible, it can model the sampling error really well. This is a disaster. G. Hinton, 2006.

63 Example of Overfitting and Good Fitting An overfitted function cannot generalize well to unseen data.

64 Preventing Overfitting Use a model that has the right capacity: enough to model the true regularities, but not enough to also model the spurious regularities (assuming they are weaker). Standard ways to limit the capacity of a neural net: limit the number of hidden units; limit the size of the weights; stop the learning before it has time to overfit. G. Hinton, 2006.

65 Limiting the Size of the Weights Weight decay involves adding an extra term to the cost function that penalizes the squared weights: C = E + (λ/2) Σ_i w_i^2, so ∂C/∂w_i = ∂E/∂w_i + λw_i. This keeps the weights small unless they have big error derivatives: when ∂C/∂w_i = 0, w_i = -(1/λ) ∂E/∂w_i. G. Hinton, 2006.
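
In gradient-descent code, the weight-decay term simply adds λw to the error gradient; a short sketch:

```python
import numpy as np

def step_with_weight_decay(w, grad_E, lr=0.01, lam=1e-3):
    """One update on C = E + (lambda/2) * sum_i w_i^2, so dC/dw = dE/dw + lambda * w."""
    return w - lr * (grad_E + lam * w)
```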

66 The Effect of Weight Decay It prevents the network from using weights that it does not need. This can often improve generalization a lot, and it helps to stop the network from fitting the sampling error. It makes a smoother model in which the output changes more slowly as the input changes. If the network has two very similar inputs, it prefers to put half the weight on each (w/2 and w/2) rather than all the weight on one (w and 0). G. Hinton, 2006.

67 Deciding How Much to Restrict the Capacity How do we decide which limit to use and how strong to make it? If we use the test data, we get an unfair prediction of the error rate we would get on new test data. Suppose we compared a set of models that gave random results; the best one on a particular dataset would do better than chance, but it won't do better than chance on another test set. So use a separate validation set to do model selection. G. Hinton, 2006.

68 Using a Validation Set Divide the total dataset into three subsets: training data is used for learning the parameters of the model; validation data is not used for learning but is used for deciding what type of model and what amount of regularization works best; test data is used to get a final, unbiased estimate of how well the network works. We expect this estimate to be worse than on the validation data. We could then re-divide the total dataset to get another unbiased estimate of the true error rate. G. Hinton, 2006.

69 Preventing Overfitting by Early Stopping If we have lots of data and a big model, it is very expensive to keep re-training it with different amounts of weight decay. It is much cheaper to start with very small weights and let them grow until the performance on the validation set starts getting worse (but don't get fooled by noise!). The capacity of the model is limited because the weights have not had time to grow big. G. Hinton, 2006.

70 Why Early Stopping Works When the weights are very small, every hidden unit is in its linear range, so a net with a large layer of hidden units is linear: it has no more capacity than a linear net in which the inputs are directly connected to the outputs! As the weights grow, the hidden units start using their non-linear ranges, so the capacity grows. G. Hinton, 2006.

71 Combining Networks When the amount of training data is limited, we need to avoid overfitting. Averaging the predictions of many different networks is a good way to do this; it works best if the networks are as different as possible. Combining networks reduces variance. If the data is really a mixture of several different regimes, it is helpful to identify these regimes and use a separate, simple model for each regime. We want to use the desired outputs to help cluster cases into regimes; just clustering the inputs is not as efficient. G. Hinton, 2006.

72 How the Combined Predictor Compares with the Individual Predictors On any one test case, some individual predictors will be better than the combined predictor, but different individuals will be better on different cases. If the individual predictors disagree a lot, the combined predictor is typically better than all of the individual predictors when we average over test cases. So how do we make the individual predictors disagree (without making them much worse individually)? G. Hinton, 2006.

73 Ways to Make Predictors Differ Rely on the learning algorithm getting stuck in a different local optimum on each run (a dubious hack unworthy of a true computer scientist, but definitely worth a try). Use lots of different kinds of models: different architectures, different learning algorithms. Use different training data for each model: Bagging: resample (with replacement) from the training set: a,b,c,d,e -> a,c,c,d,d. Boosting: fit models one at a time, re-weighting each training case by how badly it is predicted by the models already fitted. This makes efficient use of computer time because it does not bother to back-fit models that were fitted earlier. G. Hinton, 2006.

74 How to Speed up Learning? The Error Surface for a Linear Neuron The error surface lies in a space with a horizontal axis for each weight and one vertical axis for the error. It is a quadratic bowl, i.e. the height can be expressed as a function of the weights without using powers higher than 2. Quadratics have constant curvature (because the second derivative must be a constant): vertical cross-sections are parabolas, and horizontal cross-sections are ellipses. G. Hinton, 2006.

75 Convergence Speed The direction of steepest descent does not point at the minimum unless the ellipse is a circle. The gradient is big in the direction in which we only want to travel a small distance, and small in the direction in which we want to travel a large distance: Δw_i = -ε ∂E/∂w_i. This equation is sick. G. Hinton, 2006.

76 How the Learning Goes Wrong If the learning rate is big, the weight vector sloshes to and fro across the ravine. If the rate is too big, this oscillation diverges. How can we move quickly in directions with small gradients without getting divergent oscillations in directions with big gradients? G. Hinton, 2006.

77 Five Ways to Speed up Learning Use an adaptive global learning rate: increase the rate slowly if it is not diverging; decrease the rate quickly if it starts diverging. Use a separate adaptive learning rate on each connection: adjust it using the consistency of the gradient on that weight axis. Use momentum: instead of using the gradient to change the position of the weight "particle", use it to change its velocity. Use a stochastic estimate of the gradient from a few cases: this works very well on large, redundant datasets. G. Hinton, 2006.

78 The Momentum Method Imagine a ball on the error surface with velocity v. It starts off by following the gradient, but once it has velocity, it no longer does steepest descent. It damps oscillations by combining gradients with opposite signs. It builds up speed in directions with a gentle but consistent gradient. Δw(t) = v(t) = α Δw(t-1) - ε ∂E/∂w(t). G. Hinton, 2006.
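
The momentum update as a small function (a sketch; the hyperparameter values are just typical defaults, not from the slide):

```python
def momentum_step(w, v, grad, lr=0.01, alpha=0.9):
    """v(t) = alpha * v(t-1) - epsilon * dE/dw, then w <- w + v(t)."""
    v = alpha * v - lr * grad
    return w + v, v
```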

79 How to Initialize Weights? Use small random numbers, for instance numbers in [-0.2, 0.2]; some are positive and some are negative. Why should the initial weights be small? For a sigmoid unit 1/(1 + e^{-w^T x}), large weights push the activation into the saturated flat regions of the sigmoid, where the gradient is nearly zero.

80 Neural Network Software Weka (Java): ml/weka/ NNClass and NNRank (C++): J. Cheng, Z. Wang, G. Pollastri. A Neural Network Approach to Ordinal Regression. IJCNN, 2008.

81 NNClass Demo Abalone data: abalone (from the Spanish abulón) are a group of shellfish (mollusks) in the family Haliotidae and the genus Haliotis; they are marine snails. multicom_toolbox/tools.html

82 Problems of Neural Networks Vanishing gradients. Cannot use unlabeled data. Hard to understand the relationship between input and output. Cannot generate data.

83

84

85

86

87

88

89

90 Deep Learning Revolution 2012: Is deep learning a revolution in artificial intelligence? Accomplishments: Apple's Siri virtual personal assistant; Google's Street View and self-driving car; deep learning acquisitions by Google/Facebook/Twitter/Yahoo; Hinton's handwriting recognition; CASP10 protein contact map prediction.

91

92

93

94

95

96

97

98

99

100

101

102 A model for a distribution over binary vectors. The probability of a vector v under the model is defined via an energy. Hidden layer h (with biases c_j), visible layer v (with biases b_i), and weights w_ij between them.

103

104 Instead of attempting to sample from the joint distribution p(v, h) (i.e. p_∞), sample from p_1(v, h). Faster and lower variance in the sample. Hinton, Neural Computation (2002).

105 The partial derivatives of E(v, h) are easy to calculate (shown at t = 0 and t = 1). Hinton, Neural Computation (2002).

106 The gradient of the likelihood with respect to w_ij is the difference between the interaction of v_i and h_j at time 0 and at time 1 (hidden layer unit j, visible layer unit i, at t = 0 and t = 1). Hinton, Neural Computation (2002).

107 The gradient of the likelihood with respect to w_ij is the difference between the interaction of v_i and h_j at time 0 and at time 1. Hinton, Neural Computation (2002).

108 The gradient of the likelihood with respect to w_ij is the difference between the interaction of v_i and h_j at time 0 and at time 1: Δw_ij = <v_i p_j0> - <p_i1 p_j1>. Hinton, Neural Computation (2002).
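
A minimal sketch of the corresponding contrastive-divergence (CD-1) update for a binary RBM, following the Δw_ij rule above; the array shapes and helper names are assumptions made for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 step. v0: (n_visible,) binary data vector; W: (n_visible, n_hidden)
    weights; b: visible biases; c: hidden biases."""
    p_h0 = sigmoid(v0 @ W + c)                        # p(h_j = 1 | v) at time 0
    h0 = (np.random.rand(p_h0.size) < p_h0) * 1.0     # sample hidden states
    p_v1 = sigmoid(h0 @ W.T + b)                      # reconstruction at time 1
    p_h1 = sigmoid(p_v1 @ W + c)                      # hidden probabilities at time 1
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))   # <v_i p_j>_0 - <v_i p_j>_1
    b += lr * (v0 - p_v1)
    c += lr * (p_h0 - p_h1)
    return W, b, c
```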

109 ε is the learning rate, η is the weight cost, and υ is the momentum: the gradient term drives learning, the weight cost keeps the weights smaller, and the momentum helps avoid local minima.

110

111 Brain learning as a feature hierarchy: image pixels -> lines, circles, squares -> face or not?

112 Objective of Unsupervised Learning: find the w_ij that maximize the likelihood p(v) of the visible data. Iterative gradient descent approach: adjust w_ij to increase the likelihood according to the gradient.

113

114

115 A stack of layers: ~400 input nodes (a vector of ~400 features, numbers between 0 and 1), then ~500 nodes, ~500 nodes, and ~350 nodes, with weights w_ij between layers.

116 1. Weights are learned layer by layer via unsupervised learning. 2. The final layer is learned as a supervised neural network. 3. All weights are fine-tuned using supervised back-propagation. Inputs are in [0,1]. Hinton and Salakhutdinov, Science, 2006.

117 1. Weights are learned layer by layer via unsupervised learning. 2. The final layer is learned as a supervised neural network. 3. All weights are fine-tuned using supervised back-propagation. Inputs are in [0,1]. Hinton and Salakhutdinov, Science, 2006.

118 Speed up training with CUDAMat and GPUs: train deep networks with over 1M parameters in about an hour. Example inputs are sequence windows such as LSDEKIINVDF and KPSEERVREII, encoded as numbers in [0,1].

119

120

121

122

123 Demo:

124

125

126

127 Various Deep Learning Architectures Deep belief network, deep neural networks, deep autoencoder, deep convolutional networks, deep residual network, deep recurrent network.

128 Deep Belief Network

129

130 Deep AutoEncoder

131 Deep Convolutional Neural Network

132 Deep Recurrent Neural Network

133 An Example

134 Deep Residual Network The rectifier is the activation function defined as f(x) = max(0, x). A unit employing the rectifier is also called a rectified linear unit (ReLU).
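
A small sketch of the rectifier and of the basic residual idea (output = F(x) + x); the fully-connected form here is only for illustration, since residual networks are usually convolutional:

```python
import numpy as np

def relu(x):
    """Rectifier: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """A plain fully-connected residual block: relu(W2 @ relu(W1 @ x) + x).
    W1 and W2 are square so the skip connection can be added directly."""
    return relu(W2 @ relu(W1 @ x) + x)
```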

135 Dropout Prevents over-fitting by preventing units from co-adapting. Training: remove randomly selected units according to a rate (0.5). Testing: multiply all the units by the dropout rate.
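
A minimal sketch of dropout at training and test time (with a dropout rate of 0.5, scaling by the keep probability at test time is the same as the slide's "multiply by the dropout rate"):

```python
import numpy as np

def dropout_train(h, rate=0.5):
    """Training: zero out each hidden unit independently with probability `rate`."""
    mask = (np.random.rand(*h.shape) >= rate).astype(h.dtype)
    return h * mask

def dropout_test(h, rate=0.5):
    """Testing: keep all units but scale them by the keep probability (1 - rate)."""
    return h * (1.0 - rate)
```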

136 Deep Learning Tools Pylearn2, Theano, Caffe, Torch, Cuda-convnet, Deeplearning4j.

137 Google's TensorFlow TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

138 Acknowledgements Geoffrey Hinton's slides, Jesse Eickholt's slides, Images.google.com.
