CS294A Lecture notes
Andrew Ng

Sparse autoencoder

1 Introduction

Supervised learning is one of the most powerful tools of AI, and has led to automatic zip code recognition, speech recognition, self-driving cars, and a continually improving understanding of the human genome. Despite its significant successes, supervised learning today is still severely limited. Specifically, most applications of it still require that we manually specify the input features x given to the algorithm. Once a good feature representation is given, a supervised learning algorithm can do well. But in such domains as computer vision, audio processing, and natural language processing, there are now hundreds or perhaps thousands of researchers who have spent years of their lives slowly and laboriously hand-engineering vision, audio or text features. While much of this feature-engineering work is extremely clever, one has to wonder if we can do better. Certainly this labor-intensive hand-engineering approach does not scale well to new problems; further, ideally we would like to have algorithms that can automatically learn even better feature representations than the hand-engineered ones.

These notes describe the sparse autoencoder learning algorithm, which is one approach to automatically learn features from unlabeled data. In some domains, such as computer vision, this approach is not by itself competitive with the best hand-engineered features, but the features it can learn do turn out to be useful for a range of problems (including ones in audio, text, etc). Further, there are more sophisticated versions of the sparse autoencoder (not described in these notes, but that you will hear more about later in the class) that do surprisingly well, and in some cases are competitive with or sometimes even better than some of the hand-engineered representations.
This set of notes is organized as follows. We will first describe feedforward neural networks and the backpropagation algorithm for supervised learning. Then, we show how this is used to construct an autoencoder, which is an unsupervised learning algorithm, and finally how we can build on this to derive a sparse autoencoder. Because these notes are fairly notation-heavy, the last page also contains a summary of the symbols used.

2 Neural networks

Consider a supervised learning problem where we have access to labeled training examples (x^{(i)}, y^{(i)}). Neural networks give a way of defining a complex, non-linear form of hypotheses h_{W,b}(x), with parameters W, b that we can fit to our data.

To describe neural networks, we use the following diagram to denote a single neuron:

This "neuron" is a computational unit that takes as input x_1, x_2, x_3 (and a +1 intercept term), and outputs h_{w,b}(x) = f(w^T x) = f(\sum_{i=1}^{3} w_i x_i + b), where f : R -> R is called the activation function. One possible choice for f(.) is the sigmoid function f(z) = 1/(1 + exp(-z)); in that case, our single neuron corresponds exactly to the input-output mapping defined by logistic regression. In these notes, however, we will use a different activation function, the hyperbolic tangent, or tanh, function:

    f(z) = tanh(z) = (e^z - e^{-z}) / (e^z + e^{-z})    (1)

Here is a plot of the tanh(z) function:
The tanh(z) function is a rescaled version of the sigmoid, and its output range is [-1, 1] instead of [0, 1]. Our description of neural networks will use this activation function.

Note that unlike CS221 and (parts of) CS229, we are not using the convention here of x_0 = 1. Instead, the intercept term is handled separately by the parameter b.

Finally, one identity that will be useful later: if f(z) = tanh(z), then its derivative is given by f'(z) = 1 - (f(z))^2. (Derive this yourself using the definition of tanh(z) given in Equation 1.)

2.1 Neural network formulation

A neural network is put together by hooking together many of our simple neurons, so that the output of a neuron can be the input of another. For example, here is a small neural network:
In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

We will let n_l denote the number of layers in our network; thus n_l = 3 in our example. We label layer l as L_l, so layer L_1 is the input layer, and layer L_{n_l} the output layer. Our neural network has parameters (W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}), where we write W^{(l)}_{ij} to denote the parameter (or weight) associated with the connection between unit j in layer l, and unit i in layer l+1. (Note the order of the indices.) Also, b^{(l)}_i is the bias associated with unit i in layer l+1. Thus, in our example, we have W^{(1)} in R^{3x3}, and W^{(2)} in R^{1x3}. Note that bias units don't have inputs or connections going into them, since they always output the value +1. We also let s_l denote the number of nodes in layer l (not counting the bias unit).

We will write a^{(l)}_i to denote the activation (meaning output value) of unit i in layer l. For l = 1, we also use a^{(1)}_i = x_i to denote the i-th input. Given a fixed setting of the parameters W, b, our neural network defines a hypothesis h_{W,b}(x) that outputs a real number. Specifically, the computation that this neural network represents is given by:

    a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)    (2)
    a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)    (3)
    a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)    (4)
    h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1)    (5)

In the sequel, we also let z^{(l)}_i denote the total weighted sum of inputs to unit i in layer l, including the bias term (e.g., z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i), so that a^{(l)}_i = f(z^{(l)}_i).

Note that this easily lends itself to a more compact notation. Specifically, if we extend the activation function f(.) to apply to vectors in an element-wise fashion (i.e., f([z_1, z_2, z_3]) = [tanh(z_1), tanh(z_2), tanh(z_3)]), then we can
write Equations (2)-(5) more compactly as:

    z^{(2)} = W^{(1)} x + b^{(1)}
    a^{(2)} = f(z^{(2)})
    z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}
    h_{W,b}(x) = a^{(3)} = f(z^{(3)})

More generally, recalling that we also use a^{(1)} = x to denote the values from the input layer, then given layer l's activations a^{(l)}, we can compute layer l+1's activations a^{(l+1)} as:

    z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}    (6)
    a^{(l+1)} = f(z^{(l+1)})    (7)

By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.

We have so far focused on one example neural network, but one can also build neural networks with other architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers. The most common choice is an n_l-layered network where layer 1 is the input layer, layer n_l is the output layer, and each layer l is densely connected to layer l+1. In this setting, to compute the output of the network, we can successively compute all the activations in layer L_2, then layer L_3, and so on, up to layer L_{n_l}, using Equations (6)-(7). This is one example of a feedforward neural network, since the connectivity graph does not have any directed loops or cycles.

Neural networks can also have multiple output units. For example, here is a network with two hidden layers (layers L_2 and L_3) and two output units in layer L_4:
To train this network, we would need training examples (x^{(i)}, y^{(i)}) where y^{(i)} in R^2. This sort of network is useful if there are multiple outputs that you are interested in predicting. (For example, in a medical diagnosis application, the vector x might give the input features of a patient, and the different outputs y_i's might indicate presence or absence of different diseases.)

2.2 Backpropagation algorithm

We will train our neural network using stochastic gradient descent. For much of CS221 and CS229, we considered a setting in which we have a fixed training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, and we ran either batch or stochastic gradient descent on that fixed training set. In these notes, we will take an online learning view, in which we imagine that our algorithm has access to an unending sequence of training examples {(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), (x^{(3)}, y^{(3)}), ...}. In practice, if we have only a finite training set, then we can form such a sequence by repeatedly visiting our fixed training set, so that the examples in the sequence will repeat. But even in this case, the online learning view will make some of our algorithms easier to describe.

In this setting, stochastic gradient descent will proceed as follows:

For i = 1, 2, 3, ...
    Get next training example (x^{(i)}, y^{(i)}).
    Update
        W^{(l)}_{jk} := W^{(l)}_{jk} - \alpha \frac{\partial}{\partial W^{(l)}_{jk}} J(W, b; x^{(i)}, y^{(i)})
        b^{(l)}_j := b^{(l)}_j - \alpha \frac{\partial}{\partial b^{(l)}_j} J(W, b; x^{(i)}, y^{(i)})

Here, \alpha is the learning rate parameter, and J(W, b) = J(W, b; x, y) is a cost function defined with respect to a single training example. (When there is no risk of ambiguity, we drop the dependence of J on the training example x, y, and simply write J(W, b).) If the training examples are drawn IID from some training distribution D, we can think of this algorithm as trying to minimize E_{(x,y)~D}[J(W, b; x, y)]. Alternatively, if our sequence of examples is obtained by repeating some fixed, finite training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})}, then this algorithm is standard stochastic gradient descent for minimizing (1/m) \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}).
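As a toy illustration of this update rule, the following sketch (not the neural-network case, which requires the backpropagation machinery described next) applies the same per-example update to a hypothetical scalar model h_w(x) = w*x with cost J(w; x, y) = (1/2)(w*x - y)^2. The model, constants, and data stream here are all invented for illustration.

```python
import numpy as np

# Illustrative only: stochastic gradient descent on the scalar model
# h_w(x) = w*x with per-example cost J(w; x, y) = 0.5*(w*x - y)^2,
# following the same pattern w := w - alpha * dJ/dw as in the notes.
rng = np.random.default_rng(0)
w, alpha = 0.0, 0.1
true_w = 2.0  # the examples are generated by y = true_w * x

for i in range(500):
    x = rng.normal()            # "get next training example" from the stream
    y = true_w * x
    grad = (w * x - y) * x      # dJ/dw for this single example
    w -= alpha * grad           # stochastic gradient descent update
```

After a few hundred single-example updates, w drifts to the minimizer of the expected cost, even though each individual update uses only one noisy example.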
To train our neural network, we will use the cost function:

    J(W, b; x, y) = \frac{1}{2} ||h_{W,b}(x) - y||^2 + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} (W^{(l)}_{ji})^2

The first term is a sum-of-squares error term; the second is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.^1 The weight decay parameter \lambda controls the relative importance of the two terms.

^1 Usually weight decay is not applied to the bias terms b^{(l)}_i, as reflected in our definition for J(W, b; x, y). Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you took CS229, you may also recognize weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.

This cost function is often used both for classification and for regression problems. For classification, we let y = +1 or -1 represent the two class labels (recall that the tanh(z) activation function outputs values in [-1, 1], so we use +1/-1 valued outputs instead of 0/1). For regression problems, we first scale our outputs to ensure that they lie in the [-1, 1] range.

Our goal is to minimize E_{(x,y)}[J(W, b; x, y)] as a function of W and b. To train our neural network, we will initialize each parameter W^{(l)}_{ij} and each b^{(l)}_i to a small random value near zero (say according to a N(0, \epsilon^2) distribution for some small \epsilon, say 0.01), and then apply stochastic gradient descent. Since J(W, b; x, y) is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Also, in neural network training, stochastic gradient descent is almost always used rather than batch gradient descent.

Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, W^{(1)}_{ij} will be the same for all values of i, so that a^{(2)}_1 = a^{(2)}_2 = ... for any input x). The random initialization serves the purpose of symmetry breaking.

We now describe the backpropagation algorithm, which gives an efficient way to compute the partial derivatives we need in order to perform stochastic gradient descent. The intuition behind the algorithm is as follows. Given a training example (x, y), we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis h_{W,b}(x). Then, for each node i in layer l, we would like to compute
an "error term" δ^{(l)}_i that measures how much that node was "responsible" for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define δ^{(n_l)}_i (where layer n_l is the output layer). How about hidden units? For those, we will compute δ^{(l)}_i based on a weighted average of the error terms of the nodes that use a^{(l)}_i as an input. In detail, here is the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers L_2, L_3, and so on up to the output layer L_{n_l}.

2. For each output unit i in layer n_l (the output layer), set

    δ^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} ||y - h_{W,b}(x)||^2 = -(y_i - a^{(n_l)}_i) f'(z^{(n_l)}_i)

3. For l = n_l - 1, n_l - 2, n_l - 3, ..., 2:

   For each node i in layer l, set

    δ^{(l)}_i = ( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} δ^{(l+1)}_j ) f'(z^{(l)}_i)

4. Update each weight W^{(l)}_{ij} and b^{(l)}_i according to:

    W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha ( a^{(l)}_j δ^{(l+1)}_i + \lambda W^{(l)}_{ij} )
    b^{(l)}_i := b^{(l)}_i - \alpha δ^{(l+1)}_i.

Although we have not proved it here, it turns out that

    \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y) = a^{(l)}_j δ^{(l+1)}_i + \lambda W^{(l)}_{ij},
    \frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y) = δ^{(l+1)}_i.

Thus, this algorithm is exactly implementing stochastic gradient descent.

Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use "•" to denote the element-wise product operator (denoted ".*" in Matlab or Octave, and also called the Hadamard product), so that if
a = b • c, then a_i = b_i c_i. Similar to how we extended the definition of f(.) to apply element-wise to vectors, we also do the same for f'(.) (so that f'([z_1, z_2, z_3]) = [\frac{\partial}{\partial z_1} tanh(z_1), \frac{\partial}{\partial z_2} tanh(z_2), \frac{\partial}{\partial z_3} tanh(z_3)]). The algorithm can then be written:

1. Perform a feedforward pass, computing the activations for layers L_2, L_3, up to the output layer L_{n_l}, using Equations (6)-(7).

2. For the output layer (layer n_l), set

    δ^{(n_l)} = -(y - a^{(n_l)}) • f'(z^{(n_l)})

3. For l = n_l - 1, n_l - 2, n_l - 3, ..., 2:

   Set δ^{(l)} = ( (W^{(l)})^T δ^{(l+1)} ) • f'(z^{(l)})

4. Update the parameters according to:

    W^{(l)} := W^{(l)} - \alpha ( δ^{(l+1)} (a^{(l)})^T + \lambda W^{(l)} )
    b^{(l)} := b^{(l)} - \alpha δ^{(l+1)}.

Implementation note 1: In steps 2 and 3 above, we need to compute f'(z^{(l)}_i) for each value of i. Assuming f(z) is the tanh activation function, we would already have a^{(l)}_i stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for f'(z), we can compute this as f'(z^{(l)}_i) = 1 - (a^{(l)}_i)^2.

Implementation note 2: Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices and that thus only trains some of the layers of weights, or an implementation that omits the bias term, etc.) will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. So, when implementing backpropagation, do read and re-read your code to check it carefully. Some people also numerically check their computation of the derivatives; if you know how to do this, it is worth considering too. (Feel free to ask us if you want to learn more about this.)
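To make the matrix-vectorial form concrete, here is a minimal NumPy sketch of the forward pass, the backward pass, and the numerical derivative check just mentioned. The network shapes, \lambda value, and inputs below are hypothetical, chosen only for illustration; this is a sketch of the technique, not a reference implementation.

```python
import numpy as np

def forward(x, weights, biases):
    """Feedforward pass (Equations 6-7); a[k] holds layer k+1's activations."""
    a = [x]
    for W, b in zip(weights, biases):
        a.append(np.tanh(W @ a[-1] + b))
    return a

def cost(x, y, weights, biases, lam):
    """J(W, b; x, y): squared error plus weight decay (biases excluded)."""
    h = forward(x, weights, biases)[-1]
    return 0.5 * np.sum((h - y) ** 2) + 0.5 * lam * sum(np.sum(W * W) for W in weights)

def backprop(x, y, weights, biases, lam):
    """Matrix-vector backpropagation; returns the dJ/dW and dJ/db lists."""
    a = forward(x, weights, biases)
    L = len(weights)                       # number of weight layers (n_l - 1)
    fp = [1.0 - ai ** 2 for ai in a]       # f'(z) = 1 - a^2 for tanh (note 1)
    delta = [None] * (L + 1)
    delta[L] = -(y - a[L]) * fp[L]         # output-layer error term (step 2)
    for l in range(L - 1, 0, -1):          # propagate error terms back (step 3)
        delta[l] = (weights[l].T @ delta[l + 1]) * fp[l]
    gW = [np.outer(delta[l + 1], a[l]) + lam * weights[l] for l in range(L)]
    gb = [delta[l + 1] for l in range(L)]
    return gW, gb

# Numerically check dJ/dW against centered finite differences
# (the debugging technique suggested in implementation note 2).
rng = np.random.default_rng(1)
weights = [rng.normal(0, 0.1, (3, 3)), rng.normal(0, 0.1, (1, 3))]
biases = [np.zeros(3), np.zeros(1)]
x, y, lam, eps = np.array([0.3, -0.2, 0.5]), np.array([0.4]), 0.01, 1e-5

gW, gb = backprop(x, y, weights, biases, lam)
for l, W in enumerate(weights):
    for idx in np.ndindex(W.shape):
        old = W[idx]
        W[idx] = old + eps; Jp = cost(x, y, weights, biases, lam)
        W[idx] = old - eps; Jm = cost(x, y, weights, biases, lam)
        W[idx] = old
        assert abs((Jp - Jm) / (2 * eps) - gW[l][idx]) < 1e-6
```

If the backward pass has an indexing bug, omits the bias, or drops the weight decay term, the finite-difference assertions fail immediately, which is exactly why this check is worth the small implementation effort.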
3 Autoencoders and sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples {x^{(1)}, x^{(2)}, x^{(3)}, ...}, where x^{(i)} in R^n. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. I.e., it uses y^{(i)} = x^{(i)}. Here is an autoencoder:

The autoencoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it is trying to learn an approximation to the identity function, so as to output \hat{x} that is similar to x. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs x are the pixel intensity values from a 10x10 image (100 pixels), so n = 100, and there are s_2 = 50 hidden units in layer L_2. Note that we also have y in R^100. Since there are only 50 hidden units, the network is forced to learn a compressed representation of the input. I.e., given only the vector of hidden unit activations a^{(2)} in R^50, it must try to reconstruct the 100-pixel
input x. If the input were completely random (say, each x_i comes from an IID Gaussian independent of the other features) then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations.^2

Our argument above relied on the number of hidden units s_2 being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a sparsity constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to -1. We would like to constrain the neurons to be inactive most of the time.^3 We will do this in an online learning fashion. More formally, we again imagine that our algorithm has access to an unending sequence of training examples {x^{(1)}, x^{(2)}, x^{(3)}, ...} drawn IID from some distribution D. Also, let a^{(2)}_i as usual denote the activation of hidden unit i in the autoencoder. We would like to (approximately) enforce the constraint that

    E_{x~D}[a^{(2)}_i] = ρ,

where ρ is our sparsity parameter, typically a value slightly above -1.0 (say, ρ ≈ -0.9). In other words, we would like the expected activation of each hidden neuron i to be close to -0.9 (say). To satisfy this expectation constraint, the hidden unit i's activations must mostly be near -1.

Our algorithm for (approximately) enforcing the expectation constraint will have two major components: First, for each hidden unit i, we will keep a running estimate of E_{x~D}[a^{(2)}_i]. Second, after each iteration of stochastic gradient descent, we will slowly adjust that unit's parameters to make this expected value closer to ρ.
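These two components can be sketched in a few lines. In this sketch the 0.999/0.001 running-average constants and the β-scaled bias nudge are one reasonable choice (they match the concrete rules given next); the toy activations are invented for illustration.

```python
import numpy as np

def sparsity_step(rho_hat, a2, b1, rho=-0.9, alpha=0.01, beta=3.0):
    """One sparsity update, given this example's hidden activations a2:
    (1) update the running estimate rho_hat of E[a^(2)];
    (2) nudge each hidden unit's bias b^(1) toward the sparsity target rho."""
    rho_hat = 0.999 * rho_hat + 0.001 * a2     # exponentially-decayed average
    b1 = b1 - alpha * beta * (rho_hat - rho)   # push activations toward rho
    return rho_hat, b1

# Hypothetical toy run: 5 hidden units that fire on every example (a2 = +1)
# should see their biases pushed down, driving future activations toward -1.
rho_hat = np.zeros(5)
b1 = np.zeros(5)
for _ in range(200):
    rho_hat, b1 = sparsity_step(rho_hat, np.ones(5), b1)
```

Because the running averages start at 0 and the units are always "on", every rho_hat rises above the target ρ = -0.9, so every bias is repeatedly decremented, which is the intended corrective pressure.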
^2 In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.

^3 The term "sparsity" comes from an alternative formulation of these ideas using networks with a sigmoid activation function f, so that the activations are between 0 and 1 (rather than -1 and 1). In this case, "sparsity" refers to most of the activations being near 0.
In each iteration of gradient descent, when we see each training input x, we will compute the hidden units' activations a^{(2)}_i for each i. We will keep a running estimate \hat{ρ}_i of E_{x~D}[a^{(2)}_i] by updating:

    \hat{ρ}_i := 0.999 \hat{ρ}_i + 0.001 a^{(2)}_i.

(Or, in vector notation, \hat{ρ} := 0.999 \hat{ρ} + 0.001 a^{(2)}.) Here, the "0.999" (and "0.001") is a parameter of the algorithm, and there is a wide range of values that will work fine. This particular choice causes \hat{ρ}_i to be an exponentially-decayed weighted average of about the last 1000 observed values of a^{(2)}_i. Our running estimates \hat{ρ}_i's can be initialized to 0 at the start of the algorithm.

The second part of the algorithm modifies the parameters so as to try to satisfy the expectation constraint. If \hat{ρ}_i > ρ, then we would like hidden unit i to become less active, or equivalently, for its activations to become closer to -1. Recall that unit i's activation is

    a^{(2)}_i = f( \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i ),    (8)

where b^{(1)}_i is the bias term. Thus, we can make unit i less active by decreasing b^{(1)}_i. Similarly, if \hat{ρ}_i < ρ, then we would like unit i's activations to become larger, which we can do by increasing b^{(1)}_i. Finally, the further \hat{ρ}_i is from ρ, the more aggressively we might want to decrease or increase b^{(1)}_i so as to drive the expectation towards ρ. Concretely, we can use the following learning rule:

    b^{(1)}_i := b^{(1)}_i - \alpha \beta (\hat{ρ}_i - ρ)    (9)

where \beta is an additional learning rate parameter.

To summarize, in order to learn a sparse autoencoder using online learning, upon getting an example x, we will (i) run a forward pass on our network on input x, to compute all units' activations; (ii) perform one step of stochastic gradient descent using backpropagation; (iii) perform the update given in Equation (9).

4 Visualization

Having trained a (sparse) autoencoder, we would now like to visualize the function learned by the algorithm, to try to understand what it has learned.
Consider the case of training an autoencoder on 10x10 images, so that n = 100. Each hidden unit i computes a function of the input:

    a^{(2)}_i = f( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i ).

We will visualize the function computed by hidden unit i (which depends on the parameters W^{(1)}_{ij}, ignoring the bias term for now) using a 2D image. In particular, we think of a^{(2)}_i as some non-linear feature of the input x. We ask: What input image x would cause a^{(2)}_i to be maximally activated? For this question to have a non-trivial answer, we must impose some constraints on x. If we suppose that the input is norm constrained by ||x||^2 = \sum_{i=1}^{100} x_i^2 <= 1, then one can show (try doing this yourself) that the input which maximally activates hidden unit i is given by setting pixel x_j (for all 100 pixels, j = 1, ..., 100) to

    x_j = W^{(1)}_{ij} / \sqrt{ \sum_{j=1}^{100} (W^{(1)}_{ij})^2 }.

By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit i is looking for.

If we have an autoencoder with 100 hidden units (say), then our visualization will have 100 such images, one per hidden unit. By examining these 100 images, we can try to understand what the ensemble of hidden units is learning. When we do this for a sparse autoencoder (trained with 100 hidden units on 10x10 pixel inputs^4) we get the following result:

^4 The results below were obtained by training on whitened natural images. Whitening is a preprocessing step which removes redundancy in the input, by causing adjacent pixels to become less correlated.
Each square in the figure above shows the (norm bounded) input image x that maximally activates one of 100 hidden units. We see that the different hidden units have learned to detect edges at different positions and orientations in the image. These features are, not surprisingly, useful for such tasks as object recognition and other vision tasks. When applied to other input domains (such as audio), this algorithm also learns useful representations/features for those domains too.
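The norm-bounded maximizing input derived above is easy to compute directly: each hidden unit's maximizing image is just its row of W^{(1)}, normalized to unit norm. A short sketch, where the weight matrix W1 is hypothetical random data standing in for a trained autoencoder's first-layer weights:

```python
import numpy as np

def max_activating_inputs(W1):
    """Row i is the norm-1 input x maximizing hidden unit i's activation:
    x_j = W1[i, j] / sqrt(sum_j W1[i, j]^2)."""
    return W1 / np.linalg.norm(W1, axis=1, keepdims=True)

# Hypothetical: 50 hidden units on 10x10 (= 100-pixel) inputs; a trained
# W^(1) would go here. Each row reshapes to a 10x10 image for display.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(50, 100))
X = max_activating_inputs(W1)
images = X.reshape(50, 10, 10)
```

With a trained sparse autoencoder, plotting each of these 10x10 images (e.g., as a grid of grayscale tiles) reproduces the kind of edge-detector mosaic shown in the figure.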
5 Summary of notation

x             Input features for a training example, x in R^n.
y             Output/target values. Here, y can be vector valued. In the case of an autoencoder, y = x.
(x^{(i)}, y^{(i)})   The i-th training example.
h_{W,b}(x)    Output of our hypothesis on input x, using parameters W, b. This should be a vector of the same dimension as the target value y.
W^{(l)}_{ij}  The parameter associated with the connection between unit j in layer l, and unit i in layer l+1.
b^{(l)}_i     The bias term associated with unit i in layer l+1. Can also be thought of as the parameter associated with the connection between the bias unit in layer l and unit i in layer l+1.
a^{(l)}_i     Activation (output) of unit i in layer l of the network. In addition, since layer L_1 is the input layer, we also have a^{(1)}_i = x_i.
f(.)          The activation function. Throughout these notes, we used f(z) = tanh(z).
z^{(l)}_i     Total weighted sum of inputs to unit i in layer l. Thus, a^{(l)}_i = f(z^{(l)}_i).
\alpha        Learning rate parameter.
s_l           Number of units in layer l (not counting the bias unit).
n_l           Number of layers in the network. Layer L_1 is usually the input layer, and layer L_{n_l} the output layer.
\lambda       Weight decay parameter.
\hat{x}       For an autoencoder, its output; i.e., its reconstruction of the input x. Same meaning as h_{W,b}(x).
ρ             Sparsity parameter, which specifies our desired level of sparsity.
\hat{ρ}_i     Our running estimate of the expected activation of unit i (in the sparse autoencoder).
\beta         Learning rate parameter for the algorithm trying to (approximately) satisfy the sparsity constraint.