Deep Learning: A Quick Overview


1 Deep Learning: A Quick Overview
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr
October 16

2 Machine Learning?
(Figure: environments and data sources feed a model; learning and inference connect them.)
Machine Learning: a scientific discipline that is concerned with the design and development of algorithms that allow computers to learn from empirical data (sensor data or databases) and to make predictions.
Descriptive or predictive learning.

3 Models in ML
Generative model: joint distribution $p(v, h) = p(v \mid h)\, p(h)$.
Discriminative model: directly model $p(h \mid v)$.
Undirected model: energy-based model,
$p(v, h) = \frac{\exp\{-E(v, h)\}}{\sum_{v', h'} \exp\{-E(v', h')\}}$.

4 Generative vs Energy-based Models
Generative model: $p(v, h) = p(v \mid h)\, p(h)$.
- Generation from the model is easy.
- Inference can be hard.
- Learning is easy after inference.
Energy-based model: $p(v, h) = \frac{\exp\{-E(v, h)\}}{\sum_{v', h'} \exp\{-E(v', h')\}}$.
- Generation from the model is hard.
- Inference can be easy.
- Learning is relatively hard?

5 Neural Networks History
- Warren McCulloch and Walter Pitts (1943): McCulloch-Pitts model
- Donald Hebb (1949): The Organization of Behavior (Hebbian learning)
- Frank Rosenblatt (1957): Perceptron
- Bernard Widrow and Marcian Hoff (1959): ADALINE
- John Hopfield (1982): Hopfield net
- Erkki Oja (1982): Oja's rule
- David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams (1986): Error backpropagation algorithm for MLPs
- George Cybenko (1989): Universal approximation theorem
- Radford Neal (1996): Bayesian learning for neural networks

6 Recent Time Line of Deep Learning
- Kunihiko Fukushima (1980): Neocognitron
- Paul Smolensky (1986): Harmonium
- Geoffrey E. Hinton and Terry Sejnowski (1986): Boltzmann machines
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998): Convolutional neural networks
- Geoffrey E. Hinton and his colleagues (mid-2000s): RBM pre-training
- Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh (2006): Deep belief networks
- Ruslan Salakhutdinov and Geoffrey E. Hinton (2009): Deep Boltzmann machines

7 Success and Issues
Why successful?
- Pre-training: restricted Boltzmann machine (RBM), auto-encoder, nonnegative matrix factorization (NMF)
- Training: dropout
- Rectified linear units: no vanishing gradient, sparse activation (see the sketch after this slide)
- Thanks to plenty of data and computing power!
Issues
- Distributions: exponential family harmonium, deep exponential families
- Hyperparameter tuning: Bayesian optimization
- Scalability: squeezing deep nets, training and inference
- Multi-modal extensions: dual-wing harmonium, restricted deep belief nets, multi-modal stacked auto-encoders, and multi-modal deep Boltzmann machines
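A minimal NumPy sketch (not from the slides) of the two ReLU properties mentioned above: the gradient does not vanish for large pre-activations on the active side, and negative pre-activations are zeroed out, giving sparse activations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gradients at pre-activations far from zero: the sigmoid saturates,
# while the ReLU derivative stays exactly 1 on the active side.
x = np.array([-10.0, -1.0, 1.0, 10.0])

sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # near 0 for |x| large
relu_grad = (x > 0).astype(float)                # 0 or 1, never shrinks

print("sigmoid grad:", sigmoid_grad)
print("relu grad:   ", relu_grad)
print("sparse ReLU output:", np.maximum(0.0, x))  # negatives are zeroed out
```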

8 Restricted Boltzmann Machines

9 Harmonium (or Restricted Boltzmann Machine)
(Figure: bipartite graph with hidden units $h_1, \ldots, h_K$ and visible units $v_1, v_2, \ldots, v_D$.)
Harmonium (Smolensky, 1986), aka RBM (Hinton and Sejnowski, 1986).
Undirected model which allows only inter-layer connections (bipartite graph).
An energy-based probabilistic model defines a probability distribution through an energy function, associating an energy with each configuration of the variables of interest:
$p(v) = \sum_h p(v, h) = \sum_h \frac{1}{Z} e^{-E(v, h)}$.
Learning corresponds to modifying the energy function so that its shape has desirable properties (maximum likelihood estimation).

10 RBM: $v \in \{0, 1\}^D$ and $h \in \{0, 1\}^K$
Energy is given by
$E(v, h) = -b^\top v - c^\top h - v^\top W h$.
Conditional probabilities are calculated as:
$p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_{i=1}^{D} W_{ij} v_i\Big)$, $\quad p(v_i = 1 \mid h) = \sigma\Big(b_i + \sum_{j=1}^{K} W_{ij} h_j\Big)$.

Gaussian-Bernoulli RBM: $v \in \mathbb{R}^D$ and $h \in \{0, 1\}^K$
Energy is given by
$E(v, h) = \sum_{i=1}^{D} \frac{(v_i - b_i)^2}{2\sigma_i^2} - \sum_{j=1}^{K} c_j h_j - \sum_{i=1}^{D} \sum_{j=1}^{K} \frac{v_i}{\sigma_i} W_{ij} h_j$.
Conditional probabilities are calculated as:
$p(h_j = 1 \mid v) = \sigma\Big(c_j + \sum_{i=1}^{D} \frac{W_{ij} v_i}{\sigma_i}\Big)$,
$p(v_i \mid h) = \mathcal{N}\Big(v_i \,\Big|\, b_i + \sigma_i \sum_{j=1}^{K} W_{ij} h_j,\ \sigma_i^2\Big)$.
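The binary-RBM conditionals translate directly into code. The following NumPy sketch (sizes and initialization are illustrative assumptions, not from the slides) samples $h$ given $v$ and $v$ given $h$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, K = 6, 4                              # illustrative sizes
W = 0.01 * rng.standard_normal((D, K))   # weights W[i, j]
b = np.zeros(D)                          # visible biases b_i
c = np.zeros(K)                          # hidden biases c_j

def sample_h_given_v(v):
    """p(h_j = 1 | v) = sigma(c_j + sum_i W_ij v_i), then a Bernoulli draw."""
    p = sigmoid(c + v @ W)
    return (rng.random(K) < p).astype(float), p

def sample_v_given_h(h):
    """p(v_i = 1 | h) = sigma(b_i + sum_j W_ij h_j), then a Bernoulli draw."""
    p = sigmoid(b + W @ h)
    return (rng.random(D) < p).astype(float), p
```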

11 Gibbs Sampling in an RBM with Binary Units
$v^{(1)} \sim p(v)$, $\quad h^{(1)} \sim p(h \mid v^{(1)})$,
$v^{(2)} \sim p(v \mid h^{(1)})$, $\quad h^{(2)} \sim p(h \mid v^{(2)})$,
$\ldots$
$v^{(k+1)} \sim p(v \mid h^{(k)})$ (after $k$ Gibbs steps).
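Reusing the two samplers from the block above, the alternating chain is a few lines (a sketch that deliberately continues the previous code block).

```python
def gibbs_chain(v, k):
    """Run k alternating Gibbs steps v -> h -> v and return the final v."""
    for _ in range(k):
        h, _ = sample_h_given_v(v)
        v, _ = sample_v_given_h(h)
    return v
```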

12 Contrastive Divergence Learning
The average log-likelihood gradient is approximated by $k$ Gibbs steps,
$\frac{\partial \log p(v)}{\partial \theta} \approx \Big\langle -\frac{\partial E(v, h)}{\partial \theta} \Big\rangle_{\hat p} - \Big\langle -\frac{\partial E(v, h)}{\partial \theta} \Big\rangle_{p^{(k)}(v, h)}$,
where $p^{(k)}(v, h)$ is the joint distribution determined by $k$ Gibbs steps,
$\hat p(v) = \frac{1}{N} \sum_{n=1}^{N} \delta(v - v_n)$, and expectations under $\hat p$ pair each $v_n$ with $p(h \mid v_n)$.
Gradient ascent learning leads to
$W \leftarrow W + \eta \Big( \langle v h^\top \rangle_{\hat p} - \langle v h^\top \rangle_{p^{(k)}(v, h)} \Big)$,
$b \leftarrow b + \eta \Big( \frac{1}{N} \sum_{n=1}^{N} v_n - \langle v \rangle_{p^{(k)}(v, h)} \Big)$,
$c \leftarrow c + \eta \Big( \langle h \rangle_{\hat p} - \langle h \rangle_{p^{(k)}(v, h)} \Big)$.
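Putting the pieces together, a CD-k update for a single training vector might look as follows (a sketch building on the two blocks above; the learning rate and the use of hidden probabilities rather than samples for the statistics are common choices, not prescriptions from the slides).

```python
def cd_k_update(v_data, k=1, eta=0.1):
    """One contrastive-divergence update of W, b, c for one training vector."""
    global W, b, c
    _, ph_data = sample_h_given_v(v_data)      # positive-phase statistics
    v_model = gibbs_chain(v_data.copy(), k)    # negative sample after k steps
    _, ph_model = sample_h_given_v(v_model)

    W += eta * (np.outer(v_data, ph_data) - np.outer(v_model, ph_model))
    b += eta * (v_data - v_model)
    c += eta * (ph_data - ph_model)
```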

13 Exponential Family Harmonium
Max Welling, Michal Rosen-Zvi, Geoffrey E. Hinton (2004).
Allows both observed and hidden variables to follow exponential family distributions.
Assume independent exponential family distributions for the observed and hidden variables, and couple them in the log-domain by introducing a quadratic interaction term; this yields a random field:
$p(v, h) \propto \exp\Big\{ \sum_{i,a} \xi_{ia} f_{ia}(v_i) + \sum_{j,b} \lambda_{jb} g_{jb}(h_j) + \sum_{i,a,j,b} W_{ia,jb}\, f_{ia}(v_i)\, g_{jb}(h_j) \Big\}$,
where $\{f_{ia}(v_i), g_{jb}(h_j)\}$ are sufficient statistics and $\{\xi_{ia}, \lambda_{jb}\}$ are canonical parameters.
Apply the standard contrastive divergence learning.

14 Auto-Encoders

15 Auto-Encoder
Encoding: $h = f(Wv + b)$; decoding: $\hat v = g(Uh + c)$.
Linear activations and tied weights $U = W^\top$: equivalent to PCA.
Trained by backpropagation: minimize $\sum_v J(v, \hat v)$.
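A tiny tied-weight autoencoder with one backpropagation step on the squared-error loss (a self-contained sketch; the sizes, learning rate, sigmoid encoder, and linear decoder are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

D, K = 8, 3                              # illustrative sizes
W = 0.1 * rng.standard_normal((K, D))    # encoder weights; decoder uses W.T
b = np.zeros(K)                          # encoder bias
c = np.zeros(D)                          # decoder bias

def autoencode(v):
    h = sigmoid(W @ v + b)   # encoding h = f(Wv + b)
    v_hat = W.T @ h + c      # decoding with tied weights U = W^T, linear g
    return h, v_hat

def backprop_step(v, eta=0.1):
    """One gradient step on J(v, v_hat) = ||v - v_hat||^2 / 2."""
    global W, b, c
    h, v_hat = autoencode(v)
    dv = v_hat - v                    # dJ / d v_hat
    dh = (W @ dv) * h * (1.0 - h)     # backprop through W^T and the sigmoid
    W -= eta * (np.outer(dh, v) + np.outer(h, dv))  # tied weights: two terms
    b -= eta * dh
    c -= eta * dv
```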

16 Denoising Auto-Encoder
(Figures: a plain autoencoder reconstructing X, and a denoising autoencoder reconstructing X from a corrupted input.)
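One common corruption for a denoising autoencoder is masking noise: zero a random fraction of the inputs and train the network to reconstruct the clean vector (a sketch; the corruption level is an assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(v, level=0.3):
    """Masking noise: randomly zero a fraction `level` of the inputs.
    The DAE is then trained to reconstruct the clean v from corrupt(v)."""
    mask = (rng.random(v.shape) >= level).astype(float)
    return v * mask
```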

17 Stacked Denoising Auto-Encoder for Face Pose Normalization (Kang and Choi, 2013)
(Figures: pre-training on frontal faces; training another DAE mapping non-frontal faces to frontal faces.)

18 (Figure: 10 examples of corrupted face images from the Georgia Tech face database (left), their pose-normalized versions obtained by the proposed method (middle), and ground-truth frontal face images (right).)

19 Variational Auto-Encoder (Kingma and Welling, 2014)
Model parameters $\theta$ and variational parameters $\phi$ are learned by maximizing a variational lower bound on the marginal likelihood (stochastic gradient variational Bayes).
Decoding MLP and encoding MLP:
probabilistic decoder $p_\theta(x \mid z)$, probabilistic encoder $q_\phi(z \mid x)$.
For instance,
$p(x \mid z) = \mathcal{N}(x \mid \mu, \Sigma)$, with
$h = \sigma(W_1 z + b_1)$, $\quad \mu = W_2 h + b_2$, $\quad \log \operatorname{diag}(\Sigma) = W_3 h + b_3$.
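The decoder on the slide and the reparameterization trick at the heart of stochastic gradient variational Bayes can be sketched as follows (parameter names W1..W3, b1..b3 mirror the slide; everything else is an illustrative assumption).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode(z, theta):
    """Decoder MLP from the slide: p_theta(x | z) = N(x | mu, diag(Sigma))."""
    h = sigmoid(theta["W1"] @ z + theta["b1"])
    mu = theta["W2"] @ h + theta["b2"]
    log_var = theta["W3"] @ h + theta["b3"]      # log diag(Sigma)
    return mu, log_var

def reparameterize(mu, log_var):
    """Draw z = mu + sigma * eps, keeping the sample differentiable in (mu, log_var)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```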

20 (Figure from Kingma and Welling, 2014.)

21 Dropout

22 Dropout
(Figure taken from Srivastava et al., JMLR 2014: a standard network and a thinned network after dropping units.)
Form a vector of independent Bernoulli random variables $z^{(l)}$, where $z_i^{(l)} \sim \mathrm{Bern}(p)$.
The feedforward operations are:
$y^{(l+1)} = \sigma\Big( W^{(l+1)} \big( y^{(l)} \odot z^{(l)} \big) + b^{(l+1)} \Big)$.

23 (Figure taken from Srivastava et al., JMLR 2014: at test time the full network is used with the weights scaled down.)
Dropout approximates the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights.
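A single dropout layer in both modes (a sketch; the sigmoid nonlinearity follows the slides, everything else is an illustrative assumption). At test time the mask is dropped and the weights are scaled by $p$, which is exactly the "single unthinned network with smaller weights" above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout_layer(y, W, b, p=0.5, train=True):
    """y^{l+1} = sigma(W (y^l * z) + b) with z ~ Bern(p) at train time;
    at test time, the full layer is used with weights scaled by p."""
    if train:
        z = (rng.random(y.shape) < p).astype(float)  # Bernoulli(p) keep-mask
        return sigmoid(W @ (y * z) + b)
    return sigmoid((p * W) @ y + b)
```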

24 Gaussian Dropout (Wang and Manning, 2013)
$v_i^{(l+1)} = \sum_j W_{ij}^{(l+1)} y_j^{(l)} z_j^{(l)} + b_i^{(l+1)}$, $\quad y_i^{(l+1)} = \sigma\big( v_i^{(l+1)} \big)$.
Approximate $v_i^{(l+1)}$ as Gaussian, $\tilde v_i^{(l+1)} \sim \mathcal{N}\big( \mu_i^{(l+1)}, (\sigma_i^{(l+1)})^2 \big)$, and sample $v_i^{(l+1)}$ directly from this Gaussian distribution:
$\tilde v_i^{(l+1)} = \mu_i^{(l+1)} + \sigma_i^{(l+1)} \epsilon$, $\quad \epsilon \sim \mathcal{N}(0, 1)$,
$\mu_i^{(l+1)} = \mathbb{E}\big[ \tilde v_i^{(l+1)} \big] = \sum_j p_j W_{ij}^{(l+1)} y_j^{(l)} + b_i^{(l+1)}$,
$\big( \sigma_i^{(l+1)} \big)^2 = \operatorname{var}\big( \tilde v_i^{(l+1)} \big) = \sum_j p_j (1 - p_j) \big( W_{ij}^{(l+1)} y_j^{(l)} \big)^2$.
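Moment matching makes the sampling direct (a sketch continuing the dropout block above; `p` is assumed to be the vector of keep probabilities $p_j$).

```python
def gaussian_dropout_layer(y, W, b, p):
    """Sample the pre-activation from N(mu, sigma^2) instead of masking.
    Reuses sigmoid() and rng from the standard-dropout sketch above."""
    mu = W @ (p * y) + b                           # mu_i = sum_j p_j W_ij y_j + b_i
    var = (W ** 2) @ (p * (1.0 - p) * y ** 2)      # var_i = sum_j p_j (1-p_j) (W_ij y_j)^2
    eps = rng.standard_normal(mu.shape)
    return sigmoid(mu + np.sqrt(var) * eps)        # y^{l+1} = sigma(v)
```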

25 Standard dropout
- Sample $z_i^{(l)} \sim \mathrm{Bern}(p_i)$.
- Compute activations: $v^{(l+1)} = W^{(l+1)} \big( y^{(l)} \odot z^{(l)} \big) + b^{(l+1)}$, $\quad y^{(l+1)} = \sigma\big( v^{(l+1)} \big)$.
Gaussian dropout
- Directly sample $v_i^{(l+1)} \sim \mathcal{N}\big( \mu_i^{(l+1)}, (\sigma_i^{(l+1)})^2 \big)$.
- Compute activations: $y^{(l+1)} = \sigma\big( v^{(l+1)} \big)$.
Gaussian dropout ignores the dependencies between the $\{ v_i^{(l+1)} \}$ that are present in standard dropout. Nonetheless, good empirical results were reported by Wang and Manning (2013).

26 Bayesian Learning for Feedforward Neural Networks

27 Bayesian Learning for Neural Networks (Neal, 1996)
Question: does it make sense to use a network with an infinite number of hidden units, training it by maximizing the likelihood of a finite amount of data? The network will overfit the data and its generalization performance will be poor. However, the idea of selecting the network size depending on the amount of training data makes little sense to a Bayesian.
Radford M. Neal (1996) showed that:
- It is sensible to consider a limit where the number of hidden units in a net tends to infinity, and good predictions can be obtained from such models using the Bayesian machinery.
- For fixed hyperparameters, a large class of neural network models converge to a Gaussian process prior over functions in the limit of an infinite number of hidden units.

28 Computing with Infinite Networks (Williams, 1996)
Neal proved that an infinite neural net will converge to a Gaussian process, but did not give the covariance function needed to actually specify the particular Gaussian process.
Williams showed that for certain weight priors and transfer functions in the neural network model, the covariance function which describes the behavior of the corresponding Gaussian process can be calculated analytically.
This allows predictions to be made using neural networks with an infinite number of hidden units in time $O(n^3)$, where $n$ is the number of training examples.
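As a concrete instance (quoted from the literature as an illustration rather than taken from the slides, so treat the exact form as an assumption): for the error-function transfer and a zero-mean Gaussian weight prior with covariance $\Sigma$, the analytic covariance takes an arcsine form, where $\tilde{x} = (1, x_1, \ldots, x_d)^\top$ is the bias-augmented input.

```latex
K_{\mathrm{erf}}(x, x') = \frac{2}{\pi}
  \sin^{-1}\!\left(
    \frac{2\,\tilde{x}^{\top}\Sigma\,\tilde{x}'}
         {\sqrt{\left(1 + 2\,\tilde{x}^{\top}\Sigma\,\tilde{x}\right)
                \left(1 + 2\,\tilde{x}'^{\top}\Sigma\,\tilde{x}'\right)}}
  \right)
```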

29 Multi-view Extension of Deep Learning is EASY!

30 Multi-Task Learning (Caruana, 1997)

31 (Figure.)

32 Multi-View Learning: Mobile Examples

33 Multi-View Learning: Recommendation Systems

34 Multi-View Learning: Image Annotation

35 Dual-Wing Harmonium (E. P. Xing, R. Yan, and A. G. Hauptmann, 2005)
(Figure: hidden units $h$ connected to two wings, $x$ and $y$.)
For each word (word-count profile):
$p(x_i \mid h) = \mathrm{Poisson}\Big( x_i \,\Big|\, \exp\Big\{ b_i + \sum_j W_{ij} h_j \Big\} \Big)$.
For each image (color-histogram representation):
$p(y_k \mid h) = \mathcal{N}\Big( y_k \,\Big|\, \sigma^2 \Big( c_k + \sum_j U_{kj} h_j \Big),\ \sigma^2 \Big)$.
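Sampling both wings given the hidden units is direct (a sketch; the dimensions and parameter values are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

Dx, Dy, K = 5, 4, 3                        # illustrative sizes
W = 0.01 * rng.standard_normal((Dx, K))    # word-wing weights
U = 0.01 * rng.standard_normal((Dy, K))    # image-wing weights
b, c = np.zeros(Dx), np.zeros(Dy)
sigma2 = 1.0

def sample_views_given_h(h):
    """x_i ~ Poisson(exp(b_i + W_i h)); y_k ~ N(sigma^2 (c_k + U_k h), sigma^2)."""
    x = rng.poisson(np.exp(b + W @ h))                      # word counts
    y = rng.normal(sigma2 * (c + U @ h), np.sqrt(sigma2))   # color histogram
    return x, y
```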

36 (Figure.)

37 Multi-View Harmonium (Y. Kang and S. Choi, 2011)
(Figure: (a) a multi-view RBM and (b) a dual-wing harmonium, each with hidden units $h$, views $x$ and $y$, and latent variables $t$.)

38 Multi-Modal Autoencoders (J. Ngiam et al., 2011)
(Figure: two modality-specific networks joined by a shared representation.)

39 Restricted Deep Belief Net (Kang and Choi, 2011)

40 Deep Multiview Hashing (Kang, Kim, and Choi, 2012)
(Figure: panels (a)-(c).)

41 Experiments: Similarity Search Results
(Figure: query images and the retrieved images.)

42 Deep Belief Net vs Deep Boltzmann Machine
(Figure: a deep belief net, with an undirected top layer $p(h^{(2)}, h^{(3)})$ and directed factors $p(h^{(1)} \mid h^{(2)})$ and $p(v \mid h^{(1)})$, next to a deep Boltzmann machine with undirected connections between $v$, $h^{(1)}$, $h^{(2)}$, and $h^{(3)}$.)

43 Question and Discussion
