Deep Learning — Boyang Albert Li, Jie Jay Tan
An Unrelated Video: A bicycle controller learned using NEAT (Stanley)
What do you mean, deep?
Shallow: Hidden Markov models; ANNs with one hidden layer; manually selected and designed features.
Deep: Stacked Restricted Boltzmann Machines; ANNs with multiple hidden layers; learning complex features.
Algorithms of Deep Learning: Recurrent Neural Networks; Stacked Autoencoders (i.e. deep neural networks); Stacked Restricted Boltzmann Machines (i.e. deep belief networks); Convolutional Deep Belief Networks; a growing list.
But What is Wrong with Shallow?
A shallow network needs more nodes / computing units and weights [Bengio, Y., et al. (2007). Greedy layer-wise training of deep networks]: Boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer may require O(2^d) elements when expressed with only 2 layers.
Reliance on manually selected features vs. automatically learning the features.
Disentangling interacting factors, creating invariant features (will come back to that).
Disentangling factors
Is the brain deep, too? http://thebrain.mcgill.ca/flash/a/a_02/a_02_cr/a_02_cr_vis/a_02_cr_vis.html Eric R. Kandel. (2012). The Age of Insight: The Quest to Understand the Unconscious in Art, Mind and Brain, from Vienna 1900 to the Present.
A general algorithm for the brain? One part of the brain can learn the function of another part. If the visual input is sent to the auditory cortex of a newborn ferret, the "auditory" cells learn to do vision (Sharma, Angelucci, and Sur. Nature 2000). People blinded at a young age can hear better, possibly because their brain can still adapt (Gougoux et al. Nature 2004). Different regions of the brain look similar.
Feature Learning vs. Deep Neural Network: pixels → edges → object parts → object models
Artificial Neural Networks: y = h_W(x). Input Layer → Hidden Layer → Output Layer.
Backpropagation
Minimize J(w) = 1/2 ||h_w(x) - y||^2
Gradient computation:
∂J(w)/∂w_11^(2) = ∂/∂w_11^(2) [ 1/2 (h_w(x) - y)^2 ] = (a^(3) - y) ∂a^(3)/∂w_11^(2) = (a^(3) - y) f'( Σ_{j=1}^{4} w_j1^(2) a_j^(2) ) a_1^(2)
where h_w(x) = a^(3).
Backpropagation (continued)
∂J(w)/∂w_11^(1) = ∂/∂w_11^(1) [ 1/2 (h_w(x) - y)^2 ] = (a^(3) - y) ∂a^(3)/∂w_11^(1) = (a^(3) - y) f'( Σ_{j=1}^{4} w_j1^(2) a_j^(2) ) w_11^(2) ∂a_1^(2)/∂w_11^(1)
where ∂a_1^(2)/∂w_11^(1) = f'( Σ_j w_j1^(1) x_j ) x_1.
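A minimal sketch of these gradients in code, assuming a single hidden layer, sigmoid activations, and squared error (variable names and sizes are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network: 3 inputs -> 4 hidden units -> 1 output, squared-error loss.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 3))   # first-layer weights  w^(1)
W2 = rng.normal(scale=0.1, size=(1, 4))   # second-layer weights w^(2)
x = rng.normal(size=3)
y = np.array([1.0])

# Forward pass
z2 = W1 @ x            # hidden pre-activations
a2 = sigmoid(z2)       # a^(2)
z3 = W2 @ a2
a3 = sigmoid(z3)       # a^(3) = h_w(x)

# Backward pass (chain rule, as on the slides)
delta3 = (a3 - y) * a3 * (1 - a3)          # (a^(3) - y) f'(z^(3))
grad_W2 = np.outer(delta3, a2)             # dJ/dw^(2)
delta2 = (W2.T @ delta3) * a2 * (1 - a2)   # propagate the error to the hidden layer
grad_W1 = np.outer(delta2, x)              # dJ/dw^(1)
```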
More than one hidden layer? "I thought of that, too. Didn't work!" Lack of data and computational power; weight initialization; poor local minima; diffusion of gradient; overfitting (a multi-layer model is too powerful / complex).
Diffusion of Gradient
The error signal shrinks as it is propagated back through the layers:
δ_i^(l) = ( Σ_{j=1}^{s_{l+1}} w_ji^(l) δ_j^(l+1) ) f'( z_i^(l) )
∂J(w)/∂w_ij^(l) = a_j^(l) δ_i^(l+1)
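A small numerical illustration of this diffusion (vanishing) of the gradient, assuming a deep stack of sigmoid layers with small random weights (an illustrative sketch, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_layers, width = 10, 50
weights = [rng.normal(scale=0.1, size=(width, width)) for _ in range(n_layers)]

# Forward pass through a deep stack of sigmoid layers.
a = rng.normal(size=width)
activations = []
for W in weights:
    a = sigmoid(W @ a)
    activations.append(a)

# Backward pass: watch the norm of the error signal shrink layer by layer.
delta = np.ones(width)                                      # pretend error signal at the top
for l in range(n_layers - 1, 0, -1):
    a_l = activations[l - 1]                                # activation of layer l
    delta = (weights[l].T @ delta) * a_l * (1 - a_l)        # delta^(l) = (w^(l)^T delta^(l+1)) f'(z^(l))
    print(f"layer {l}: ||delta|| = {np.linalg.norm(delta):.2e}")
```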
Prevention of Overfitting: generative pre-training (a way to initialize the weights), learning p(x) or p(x, h) instead of p(y|x); early stopping; weight sharing; and many other methods.
Autoencoders: x → h_W(x) → x̂
W* = arg min_W Σ_i || x^(i) - x̂^(i) ||^2
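A minimal sketch of a single-layer autoencoder trained with this reconstruction objective (numpy, sigmoid units, plain gradient descent; the layer sizes, learning rate, and untied decoder weights are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((500, 100))                        # toy data: 500 flattened 10x10 "images"
n_hidden, lr = 25, 0.5

W1 = rng.normal(scale=0.1, size=(100, n_hidden))  # encoder weights
W2 = rng.normal(scale=0.1, size=(n_hidden, 100))  # decoder weights
b1, b2 = np.zeros(n_hidden), np.zeros(100)

for epoch in range(300):
    A = sigmoid(X @ W1 + b1)                      # hidden code a^(2)
    X_hat = sigmoid(A @ W2 + b2)                  # reconstruction x_hat
    # Squared-error gradient, backpropagated through decoder then encoder.
    d_out = (X_hat - X) * X_hat * (1 - X_hat)
    d_hid = (d_out @ W2.T) * A * (1 - A)
    W2 -= lr * A.T @ d_out / len(X);  b2 -= lr * d_out.mean(0)
    W1 -= lr * X.T @ d_hid / len(X);  b1 -= lr * d_hid.mean(0)

print("reconstruction error:", np.mean((X_hat - X) ** 2))
```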
Sparse Autoencoder: x → h_W(x) → x̂
Sparse Autoencoder: for input x, the hidden code a^(2) = (a_1^(2), a_2^(2), ..., a_n^(2)) should have most entries equal to 0 (sparse activations).
Sparse Autoencoder: x → h_W(x) → x̂
W* = arg min_W Σ_i ( || x^(i) - x̂^(i) ||^2 + S(a^(2)) )
Sparsity Regularizer
L0 norm: S(a) = Σ_i I(a_i ≠ 0)
L1 norm: S(a) = Σ_i |a_i|
L2 norm: S(a) = Σ_i a_i^2
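The three penalties side by side, on a hypothetical hidden activation vector (a small illustrative numpy sketch):

```python
import numpy as np

a = np.array([0.0, 0.8, 0.0, -0.3, 0.0, 1.2])   # a hypothetical hidden activation vector

l0 = np.sum(a != 0)        # L0: number of non-zero activations (not differentiable)
l1 = np.sum(np.abs(a))     # L1: sum of absolute values (convex, promotes sparsity)
l2 = np.sum(a ** 2)        # L2: sum of squares (smooth, shrinks but rarely zeroes)

print(l0, l1, l2)          # 3, 2.3, 2.17
```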
L1 vs. L2 Regularizer
Efficient sparse coding: Lee et al. (2006). Efficient sparse coding algorithms. NIPS.
Dimension Reduction vs. Sparsity
Visualize a Trained Autoencoder
Suppose the autoencoder is trained on 10 × 10 images: a_i^(2) = f( Σ_{j=1}^{100} W_ij x_j )
Visualize a Trained Autoencoder
What image will maximally activate a_i^(2)? Less formally, what is the feature that hidden unit i is looking for?
max_x f( Σ_{j=1}^{100} W_ij x_j )   s.t.   Σ_{j=1}^{100} x_j^2 ≤ 1
Solution: x_j = W_ij / sqrt( Σ_{j=1}^{100} W_ij^2 )
Visualize a Trained Autoencoder
Train a Deep Autoencoder: x → x̂ (figures: the stacked layers are trained one at a time)
Train a Deep Autoencoder: Fine Tuning (x → x̂)
Train a Deep Autoencoder: x → Feature Vector
Train an Image Classifier: x → Image Label (car or people)
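A compact sketch of the overall recipe — greedy layer-wise autoencoder pretraining, then a supervised classifier on the top-level feature vector (numpy; the layer sizes, learning rates, and the logistic classifier are illustrative assumptions, and fine-tuning of the whole stack by backpropagation is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, lr=0.5, epochs=300, seed=0):
    """Train one autoencoder layer on X and return its encoder (W, b)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))
    b1, b2 = np.zeros(n_hidden), np.zeros(X.shape[1])
    for _ in range(epochs):
        A = sigmoid(X @ W1 + b1)
        X_hat = sigmoid(A @ W2 + b2)
        d_out = (X_hat - X) * X_hat * (1 - X_hat)
        d_hid = (d_out @ W2.T) * A * (1 - A)
        W2 -= lr * A.T @ d_out / len(X); b2 -= lr * d_out.mean(0)
        W1 -= lr * X.T @ d_hid / len(X); b1 -= lr * d_hid.mean(0)
    return W1, b1

# Toy data: 200 flattened 10x10 images with binary labels (car vs. people).
rng = np.random.default_rng(1)
X, y = rng.random((200, 100)), rng.integers(0, 2, 200)

# 1) Greedy layer-wise pretraining: each layer reconstructs the layer below.
encoders, H = [], X
for n_hidden in (64, 25):
    W, b = train_autoencoder_layer(H, n_hidden)
    encoders.append((W, b))
    H = sigmoid(H @ W + b)            # feature vector fed to the next layer

# 2) Supervised stage: logistic regression on the top feature vector.
w, b = np.zeros(H.shape[1]), 0.0
for _ in range(500):
    p = sigmoid(H @ w + b)
    w -= 0.5 * H.T @ (p - y) / len(H)
    b -= 0.5 * np.mean(p - y)
print("training accuracy:", np.mean((p > 0.5) == y))
```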
Visualize a Trained Autoencoder
Learning Independent Features? Le, Zou, Yeung, and Ng, CVPR 2011. Invariant features, disentangling factors. Introducing independence to improve the results.
Results
Recurrent Neural Networks: Sutskever, Martens, Hinton. 2011. Generating Text with Recurrent Neural Networks. ICML.
RNN to predict characters: 1500 hidden units; input character: 1 of 86; softmax output giving the predicted distribution for the next character. It is a lot easier to predict 86 characters than 100,000 words.
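A minimal sketch of one forward step of such a character-level RNN (numpy; the 86-character vocabulary and 1500 hidden units follow the slide, everything else — names, initialization, tanh units — is an illustrative assumption):

```python
import numpy as np

n_chars, n_hidden = 86, 1500
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.01, size=(n_hidden, n_chars))   # input-to-hidden
W_hh = rng.normal(scale=0.01, size=(n_hidden, n_hidden))  # hidden-to-hidden
W_hy = rng.normal(scale=0.01, size=(n_chars, n_hidden))   # hidden-to-output

def step(h_prev, char_index):
    """One RNN time step: update the hidden state and predict the next character."""
    x = np.zeros(n_chars)
    x[char_index] = 1.0                                    # one-hot current character
    h = np.tanh(W_xh @ x + W_hh @ h_prev)                  # new hidden state
    logits = W_hy @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                                           # softmax over the 86 characters
    return h, p

h = np.zeros(n_hidden)
h, p = step(h, char_index=5)   # p is the distribution over the next character
```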
A sub-tree in the tree of all character strings. There are exponentially many nodes in the tree of all character strings of length N (figure: a sub-tree around the prefix "...fix", with nodes such as "fixe" and "fixin"). In an RNN, each node is a hidden state vector. The next character must transform this to a new node. If the nodes are implemented as hidden states in an RNN, different nodes can share structure because they use distributed representations. The next hidden representation needs to depend on the conjunction of the current character and the current hidden representation.
Multiplicative connections. Instead of using the inputs to the recurrent net to provide additive extra input to the hidden units, we could use the current input character to choose the whole hidden-to-hidden weight matrix. But this requires 86 × 1500 × 1500 parameters and could make the net overfit. Can we achieve the same kind of multiplicative interaction using fewer parameters? We want a different transition matrix for each of the 86 characters, but we want these 86 character-specific weight matrices to share parameters (the characters 9 and 8 should have similar matrices).
Using factors to implement multiplicative interactions. We can get groups a and b to interact multiplicatively by using factors. Each factor f first computes a weighted sum for each of its input groups, then sends the product of the weighted sums to its output group:
c_f = (b^T w_f)(a^T u_f) v_f
where b^T w_f is the scalar input to f from group b, a^T u_f is the scalar input to f from group a, and c_f is the vector of inputs to group c from factor f.
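A small numpy sketch of this factored multiplicative interaction (the group sizes and the number of factors are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_a, n_b, n_c, n_factors = 1500, 86, 1500, 512     # e.g. hidden state, character, next hidden

U = rng.normal(scale=0.01, size=(n_a, n_factors))  # u_f: weights from group a into each factor
W = rng.normal(scale=0.01, size=(n_b, n_factors))  # w_f: weights from group b into each factor
V = rng.normal(scale=0.01, size=(n_factors, n_c))  # v_f: weights from each factor to group c

a = rng.normal(size=n_a)          # e.g. current hidden state
b = np.zeros(n_b); b[5] = 1.0     # e.g. one-hot current character

# Each factor f computes (a^T u_f)(b^T w_f), then contributes that scalar times v_f to group c.
factor_outputs = (a @ U) * (b @ W)     # shape (n_factors,)
c = factor_outputs @ V                 # total input to group c: sum_f (a^T u_f)(b^T w_f) v_f

# Parameter count: (n_a + n_b + n_c) * n_factors, far fewer than n_b * n_a * n_c.
```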
Generated text sample: "He was elected President during the Revolutionary War and forgave Opus Paul at Rome. The regime of his crew of England, is now Arab women's icons in and the demons that use something between the characters sisters in lower coil trains were always operated on the line of the ephemerable street, respectively, the graphic or other facility for deformation of a given proportion of large segments at RTUS). The B every chord was a "strongly cold internal palette pour even the white blade."
The meaning of life is 42? "The meaning of life is the tradition of the ancient human reproduction: it is less favorable to the good boy for when to remove her bigger."
Is RNN deep enough? This deep structure provides memory, not hierarchical processing. Adding hierarchical processing: Pascanu, Gulcehre, Cho, and Bengio (2013).
Why Unsupervised Pre-training Works (from Bengio's talk)
Optimization Hypothesis: unsupervised training initializes weights near localities of better minima than random initialization can.
Regularization Hypothesis (prevent over-fitting): the unsupervised pre-training dataset is larger; features extracted from the unsupervised set are more general and have better discriminant power.
Why Unsupervised Pre-training Works
Bengio: learning P(x) or P(x, h) helps you with P(y|x). Structures and features that can generate the inputs (whether or not a probabilistic formulation is used) also happen to be useful for your supervised task. This requires P(x) and P(y|x) to be related, i.e. similar-looking x produce similar y. This is probably more true for vision / audio than for text.
Conclusion: motivation for deep learning; backpropagation; autoencoder and sparsity; generative, layer-wise pre-training (stacked autoencoder); recurrent neural networks; speculation on why these things work.