CS294A Lecture notes. Andrew Ng

CS294A Lecture notes Andrew Ng Sparse autoencoder 1 Introducton Supervsed learnng s one of the most powerful tools of AI, and has led to automatc zp code recognton, speech recognton, self-drvng cars, and a contnually mprovng understandng of the human genome. Despte ts sgnfcant successes, supervsed learnng today s stll severely lmted. Specfcally, most applcatons of t stll requre that we manually specfy the nput features x gven to the algorthm. Once a good feature representaton s gven, a supervsed learnng algorthm can do well. But n such domans as computer vson, audo processng, and natural language processng, there re now hundreds or perhaps thousands of researchers who ve spent years of ther lves slowly and laborously hand-engneerng vson, audo or text features. Whle much of ths feature-engneerng work s extremely clever, one has to wonder f we can do better. Certanly ths labor-ntensve hand-engneerng approach does not scale well to new problems; further, deally we d lke to have algorthms that can automatcally learn even better feature representatons than the hand-engneered ones. These notes descrbe the sparse autoencoder learnng algorthm, whch s one approach to automatcally learn features from unlabeled data. In some domans, such as computer vson, ths approach s not by tself compettve wth the best hand-engneered features, but the features t can learn do turn out to be useful for a range of problems (ncludng ones n audo, text, etc). Further, there re more sophstcated versons of the sparse autoencoder (not descrbed n these notes, but that you ll hear more about later n the class) that do surprsngly well, and n many cases are compettve wth or superor to even the best hand-engneered representatons. 1

These notes are organzed as follows. We wll frst descrbe feedforward neural networks and the backpropagaton algorthm for supervsed learnng. Then, we show how ths s used to construct an autoencoder, whch s an unsupervsed learnng algorthm. Fnally, we buld on ths to derve a sparse autoencoder. Because these notes are farly notaton-heavy, the last page also contans a summary of the symbols used. 2 Neural networks Consder a supervsed learnng problem where we have access to labeled tranng examples (x (), y () ). Neural networks gve a way of defnng a complex, non-lnear form of hypotheses h W,b (x), wth parameters W, b that we can ft to our data. To descrbe neural networks, we wll begn by descrbng the smplest possble neural network, one whch comprses a sngle neuron. We wll use the followng dagram to denote a sngle neuron: Ths neuron s a computatonal unt that takes as nput x 1, x 2, x 3 (and a +1 ntercept term), and outputs h W,b (x) = f(w T x) = f( 3 =1 W x + b), where f : R R s called the actvaton functon. In these notes, we wll choose f( ) to be the sgmod functon: f(z) = 1 1 + exp( z). Thus, our sngle neuron corresponds exactly to the nput-output mappng defned by logstc regresson. Although these notes wll use the sgmod functon, t s worth notng that another common choce for f s the hyperbolc tangent, or tanh, functon: f(z) = tanh(z) = ez e z, (1) e z + e z Here are plots of the sgmod and tanh functons: 2

The tanh(z) functon s a rescaled verson of the sgmod, and ts output range s [ 1, 1] nstead of [0, 1]. Note that unlke CS221 and (parts of) CS229, we are not usng the conventon here of x 0 = 1. Instead, the ntercept term s handled separately by the parameter b. Fnally, one dentty that ll be useful later: If f(z) = 1/(1 + exp( z)) s the sgmod functon, then ts dervatve s gven by f (z) = f(z)(1 f(z)). (If f s the tanh functon, then ts dervatve s gven by f (z) = 1 (f(z)) 2.) You can derve ths yourself usng the defnton of the sgmod (or tanh) functon. 2.1 Neural network formulaton A neural network s put together by hookng together many of our smple neurons, so that the output of a neuron can be the nput of another. For example, here s a small neural network: 3

In ths fgure, we have used crcles to also denote the nputs to the network. The crcles labeled +1 are called bas unts, and correspond to the ntercept term. The leftmost layer of the network s called the nput layer, and the rghtmost layer the output layer (whch, n ths example, has only one node). The mddle layer of nodes s called the hdden layer, because ts values are not observed n the tranng set. We also say that our example neural network has 3 nput unts (not countng the bas unt), 3 hdden unts, and 1 output unt. We wll let n l denote the number of layers n our network; thus n l = 3 n our example. We label layer l as L l, so layer L 1 s the nput layer, and layer L nl the output layer. Our neural network has parameters (W, b) = (W (1), b (1), W (2), b (2) ), where we wrte W (l) j to denote the parameter (or weght) assocated wth the connecton between unt j n layer l, and unt n layer l+1. (Note the order of the ndces.) Also, b (l) s the bas assocated wth unt n layer l+1. Thus, n our example, we have W (1) R 3 3, and W (2) R 1 3. Note that bas unts don t have nputs or connectons gong nto them, snce they always output the value +1. We also let s l denote the number of nodes n layer l (not countng the bas unt). to denote the actvaton (meanng output value) of unt n layer l. For l = 1, we also use a (1) = x to denote the -th nput. Gven a fxed settng of the parameters W, b, our neural network defnes a hypothess h W,b (x) that outputs a real number. Specfcally, the computaton that ths neural network represents s gven by: We wll wrte a (l) a (2) 1 = f(w (1) 11 x 1 + W (1) 12 x 2 + W (1) 13 x 3 + b (1) 1 ) (2) a (2) 2 = f(w (1) 21 x 1 + W (1) 22 x 2 + W (1) 23 x 3 + b (1) 2 ) (3) a (2) 3 = f(w (1) 31 x 1 + W (1) 32 x 2 + W (1) 33 x 3 + b (1) 3 ) (4) h W,b (x) = a (3) 1 = f(w (2) 11 a (2) 1 + W (2) 12 a (2) 2 + W (2) 13 a (2) 3 + b (2) 1 ) (5) In the sequel, we also let z (l) n layer l, ncludng the bas term (e.g., z (2) a (l) = f(z (l) ). denote the total weghted sum of nputs to unt = n j=1 W (1) j x j + b (1) ), so that Note that ths easly lends tself to a more compact notaton. Specfcally, f we extend the actvaton functon f( ) to apply to vectors n an elementwse fashon (.e., f([z 1, z 2, z 3 ]) = [f(z 1 ), f(z 2 ), f(z 3 )]), then we can wrte 4

Equatons (2-5) more compactly as: z (2) = W (1) x + b (1) a (2) = f(z (2) ) z (3) = W (2) a (2) + b (2) h W,b (x) = a (3) = f(z (3) ) More generally, recallng that we also use a (1) = x to also denote the values from the nput layer, then gven layer l s actvatons a (l), we can compute layer l + 1 s actvatons a (l+1) as: z (l+1) = W (l) a (l) + b (l) (6) a (l+1) = f(z (l+1) ) (7) By organzng our parameters n matrces and usng matrx-vector operatons, we can take advantage of fast lnear algebra routnes to quckly perform calculatons n our network. We have so far focused on one example neural network, but one can also buld neural networks wth other archtectures (meanng patterns of connectvty between neurons), ncludng ones wth multple hdden layers. The most common choce s a n l -layered network where layer 1 s the nput layer, layer n l s the output layer, and each layer l s densely connected to layer l + 1. In ths settng, to compute the output of the network, we can successvely compute all the actvatons n layer L 2, then layer L 3, and so on, up to layer L nl, usng Equatons (6-7). Ths s one example of a feedforward neural network, snce the connectvty graph does not have any drected loops or cycles. Neural networks can also have multple output unts. For example, here s a network wth two hdden layers layers L 2 and L 3 and two output unts n layer L 4 : 5

To tran ths network, we would need tranng examples (x (), y () ) where y () R 2. Ths sort of network s useful f there re multple outputs that you re nterested n predctng. (For example, n a medcal dagnoss applcaton, the vector x mght gve the nput features of a patent, and the dfferent outputs y s mght ndcate presence or absence of dfferent dseases.) 2.2 Backpropagaton algorthm Suppose we have a fxed tranng set {(x (1), y (1) ),..., (x (m), y (m) )} of m tranng examples. We can tran our neural network usng batch gradent descent. In detal, for a sngle tranng example (x, y), we defne the cost functon wth respect to that sngle example to be J(W, b; x, y) = 1 2 h W,b(x) y 2. Ths s a (one-half) squared-error cost functon. Gven a tranng set of m examples, we then defne the overall cost functon to be [ ] 1 m J(W, b) = J(W, b; x (), y () ) + λ n l 1 s l s l+1 ( ) 2 W (l) j (8) m 2 =1 l=1 =1 j=1 [ 1 m ( 1 = hw,b (x () ) y () ) ] 2 + λ n l 1 s l s l+1 ( ) 2 W (l) j m 2 2 =1 l=1 =1 j=1 The frst term n the defnton of J(W, b) s an average sum-of-squares error term. The second term s a regularzaton term (also called a weght decay term) that tends to decrease the magntude of the weghts, and helps prevent overfttng. 1 The weght decay parameter λ controls the relatve mportance of the two terms. Note also the slghtly overloaded notaton: J(W, b; x, y) s the squared error cost wth respect to a sngle example; J(W, b) s the overall cost functon, whch ncludes the weght decay term. Ths cost functon above s often used both for classfcaton and for regresson problems. For classfcaton, we let y = 0 or 1 represent the two class labels (recall that the sgmod actvaton functon outputs values n [0, 1]; f 1 Usually weght decay s not appled to the bas terms b (l), as reflected n our defnton for J(W, b). Applyng weght decay to the bas unts usually makes only a small dfferent to the fnal network, however. If you took CS229, you may also recognze weght decay ths as essentally a varant of the Bayesan regularzaton method you saw there, where we placed a Gaussan pror on the parameters and dd MAP (nstead of maxmum lkelhood) estmaton. 6

we were usng a tanh actvaton functon, we would nstead use -1 and +1 to denote the labels). For regresson problems, we frst scale our outputs to ensure that they le n the [0, 1] range (or f we were usng a tanh actvaton functon, then the [ 1, 1] range). Our goal s to mnmze J(W, b) as a functon of W and b. To tran our neural network, we wll ntalze each parameter W (l) j and each b (l) to a small random value near zero (say accordng to a N (0, ɛ 2 ) dstrbuton for some small ɛ, say 0.01), and then apply an optmzaton algorthm such as batch gradent descent. Snce J(W, b) s a non-convex functon, gradent descent s susceptble to local optma; however, n practce gradent descent usually works farly well. Fnally, note that t s mportant to ntalze the parameters randomly, rather than to all 0 s. If all the parameters start off at dentcal values, then all the hdden layer unts wll end up learnng the same wll be the same for all values of, so that a (2) 1 = a (2) 2 = a (2) 3 =... for any nput x). The random ntalzaton serves the purpose of symmetry breakng. One teraton of gradent descent updates the parameters W, b as follows: functon of the nput (more formally, W (1) j W (l) j := W (l) j α b (l) := b (l) α W (l) j b (l) J(W, b) J(W, b) where α s the learnng rate. The key step s computng the partal dervatves above. We wll now descrbe the backpropagaton algorthm, whch gves an effcent way to compute these partal dervatves. We wll frst descrbe how backpropagaton can be used to compute W (l) j J(W, b; x, y) and b (l) J(W, b; x, y), the partal dervatves of the cost functon J(W, b; x, y) defned wth respect to a sngle example (x, y). Once we can compute these, then by referrng to Equaton (8), we see that the dervatve of the overall cost functon J(W, b) can be computed as W (l) j b (l) [ ] 1 m J(W, b) = J(W, b; x (), y () ) m =1 W (l) j J(W, b) = 1 m J(W, b; x (), y () ). m =1 b (l) + λw (l) j, The two lnes above dffer slghtly because weght decay s appled to W but not b. 7

The ntuton behnd the backpropagaton algorthm s as follows. Gven a tranng example (x, y), we wll frst run a forward pass to compute all the actvatons throughout the network, ncludng the output value of the hypothess h W,b (x). Then, for each node n layer l, we would lke to compute an error term δ (l) that measures how much that node was responsble for any errors n our output. For an output node, we can drectly measure the dfference between the network s actvaton and the true target value, and use that to defne δ (n l) (where layer n l s the output layer). How about hdden unts? For those, we wll compute δ (l) based on a weghted average of the error terms of the nodes that uses a (l) as an nput. In detal, here s the backpropagaton algorthm: 1. Perform a feedforward pass, computng the actvatons for layers L 2, L 3, and so on up to the output layer L nl. 2. For each output unt n layer n l (the output layer), set δ (n l) = z (n l) 1 2 y h W,b(x) 2 = (y a (n l) ) f (z (n l) ) 3. For l = n l 1, n l 2, n l 3,..., 2 For each node n layer l, set δ (l) = ( sl+1 j=1 W (l) j δ(l+1) j ) f (z (l) ) 4. Compute the desred partal dervatves, whch are gven as: W (l) j b (l) J(W, b; x, y) = a (l) j δ(l+1) J(W, b; x, y) = δ (l+1). Fnally, we can also re-wrte the algorthm usng matrx-vectoral notaton. We wll use to denote the element-wse product operator (denoted.* n Matlab or Octave, and also called the Hadamard product), so that f a = b c, then a = b c. Smlar to how we extended the defnton of f( ) to apply element-wse to vectors, we also do the same for f ( ) (so that f ([z 1, z 2, z 3 ]) = [ z 1 f(z 1 ), z 2 f(z 2 ), z 3 f(z 3 )]). The algorthm can then be wrtten: 8

1. Perform a feedforward pass, computng the actvatons for layers L 2, L 3, up to the output layer L nl, usng Equatons (6-7). 2. For the output layer (layer n l ), set δ (n l) = (y a (n l) ) f (z (n) ) 3. For l = n l 1, n l 2, n l 3,..., 2 Set δ (l) = ( (W (l) ) T δ (l+1)) f (z (l) ) 4. Compute the desred partal dervatves: W (l)j(w, b; x, y) = δ (l+1) (a (l) ) T, b (l)j(w, b; x, y) = δ (l+1). Implementaton note: In steps 2 and 3 above, we need to compute f (z (l) ) for each value of. Assumng f(z) s the sgmod actvaton functon, we would already have a (l) stored away from the forward pass through the network. Thus, usng the expresson that we worked out earler for f (z), we can compute ths as f (z (l) ) = a (l) (1 a (l) ). Fnally, we are ready to descrbe the full gradent descent algorthm. In the pseudo-code below, W (l) s a matrx (of the same dmenson as W (l) ), and b (l) s a vector (of the same dmenson as b (l) ). Note that n ths notaton, W (l) s a matrx, and n partcular t sn t tmes W (l). We mplement one teraton of batch gradent descent as follows: 1. Set W (l) := 0, b (l) := 0 (matrx/vector of zeros) for all l. 2. For = 1 to m, 2a. Use backpropagaton to compute W (l)j(w, b; x, y) and b (l)j(w, b; x, y). 2b. Set W (l) := W (l) + W (l)j(w, b; x, y). 2c. Set b (l) := b (l) + b (l)j(w, b; x, y). 9

3. Update the parameters: [( ) ] 1 W (l) := W (l) (l) α W + λw (l) m [ ] 1 b (l) := b (l) α m b(l) To tran our neural network, we can now repeatedly take steps of gradent descent to reduce our cost functon J(W, b). 2.3 Gradent checkng and advanced optmzaton Backpropagaton s a notorously dffcult algorthm to debug and get rght, especally snce many subtly buggy mplementatons of t for example, one that has an off-by-one error n the ndces and that thus only trans some of the layers of weghts, or an mplementaton that omts the bas term wll manage to learn somethng that can look surprsngly reasonable (whle performng less well than a correct mplementaton). Thus, even wth a buggy mplementaton, t may not at all be apparent that anythng s amss. In ths secton, we descrbe a method for numercally checkng the dervatves computed by your code to make sure that your mplementaton s correct. Carryng out the dervatve checkng procedure descrbed here wll sgnfcantly ncrease your confdence n the correctness of your code. Suppose we want to mnmze J(θ) as a functon of θ. For ths example, suppose J : R R, so that θ R. In ths 1-dmensonal case, one teraton of gradent descent s gven by θ := θ α d dθ J(θ). Suppose also that we have mplemented some functon g(θ) that purportedly computes d J(θ), so that we mplement gradent descent usng the update dθ θ := θ αg(θ). How can we check f our mplementaton of g s correct? Recall the mathematcal defnton of the dervatve as d J(θ) = lm dθ ɛ 0 J(θ + ɛ) J(θ ɛ). 2ɛ Thus, at any specfc value of θ, we can numercally approxmate the dervatve as follows: J(θ + EPSILON) J(θ EPSILON) 2 EPSILON 10

In practce, we set EPSILON to a small constant, say around 10 4. (There s a large range of values of EPSILON that should work well, but we don t set EPSILON to be extremely small, say 10 20, as that would lead to numercal roundoff errors.) Thus, gven a functon g(θ) that s supposedly computng d J(θ), we can dθ now numercally verfy ts correctness by checkng that g(θ) J(θ + EPSILON) J(θ EPSILON). 2 EPSILON The degree to whch these two values should approxmate each other wll depend on the detals of J. But assumng EPSILON = 10 4, you ll usually fnd that the left- and rght-hand sdes of the above wll agree to at least 4 sgnfcant dgts (and often many more). Now, consder the case where θ R n s a vector rather than a sngle real number (so that we have n parameters that we want to learn), and J : R n R. In our neural network example we used J(W, b), but one can magne unrollng the parameters W, b nto a long vector θ. We now generalze our dervatve checkng procedure to the case where θ may be a vector. Suppose we have a functon g (θ) that purportedly computes θ J(θ); we d lke to check f g s outputtng correct dervatve values. Let θ (+) = θ + EPSILON e, where 0 0. e = s the -th bass vector (a vector of the same dmenson as θ, wth a 1 n the -th poston and 0 s everywhere else). So, θ (+) s the same as θ, except ts -th element has been ncremented by EPSILON. Smlarly, let θ ( ) = θ EPSILON e be the correspondng vector wth the -th element decreased by EPSILON. We can now numercally verfy g (θ) s correctness by checkng, for each, that: 1. 0 g (θ) J(θ(+) ) J(θ ( ) ). 2 EPSILON When mplementng backpropagaton to tran a neural network, n a cor- 11

rect mplementaton we wll have that ( ) 1 (l) W (l)j(w, b) = W + λw (l) m b (l)j(w, b) = 1 m b(l). Ths result shows that the fnal block of psuedo-code n Secton 2.2 s ndeed mplementng gradent descent. To make sure your mplementaton of gradent descent s correct, t s usually very helpful to use the method descrbed above to numercally compute the dervatves of J(W, b), and thereby verfy that your computatons of ( 1 W (l)) +λw and 1 m m b(l) are ndeed gvng the dervatves you want. Fnally, so far our dscusson has centered on usng gradent descent to mnmze J(θ). If you have mplemented a functon that computes J(θ) and θ J(θ), t turns out there are more sophstcated algorthms than gradent descent for tryng to mnmze J(θ). For example, one can envson an algorthm that uses gradent descent, but automatcally tunes the learnng rate α so as to try to use a step-sze that causes θ to approach a local optmum as quckly as possble. There are other algorthms that are even more sophstcated than ths; for example, there are algorthms that try to fnd an approxmaton to the Hessan matrx, so that t can take more rapd steps towards a local optmum (smlar to Newton s method). A full dscusson of these algorthms s beyond the scope of these notes, but one example s the L-BFGS algorthm. (Another example s conjugate gradent.) You wll use one of these algorthms n the programmng exercse. The man thng you need to provde to these advanced optmzaton algorthms s that for any θ, you have to be able to compute J(θ) and θ J(θ). These optmzaton algorthms wll then do ther own nternal tunng of the learnng rate/step-sze α (and compute ts own approxmaton to the Hessan, etc.) to automatcally search for a value of θ that mnmzes J(θ). Algorthms such as L-BFGS and conjugate gradent can often be much faster than gradent descent. 3 Autoencoders and sparsty So far, we have descrbed the applcaton of neural networks to supervsed learnng, n whch we are have labeled tranng examples. Now suppose we have only unlabeled tranng examples set {x (1), x (2), x (3),...}, where x () R n. An autoencoder neural network s an unsupervsed learnng algorthm that apples backpropagaton, settng the target values to be equal to the nputs. I.e., t uses y () = x (). 12

Here s an autoencoder: The autoencoder tres to learn a functon h W,b (x) x. In other words, t s tryng to learn an approxmaton to the dentty functon, so as to output ˆx that s smlar to x. The dentty functon seems a partcularly trval functon to be tryng to learn; but by placng constrants on the network, such as by lmtng the number of hdden unts, we can dscover nterestng structure about the data. As a concrete example, suppose the nputs x are the pxel ntensty values from a 10 10 mage (100 pxels) so n = 100, and there are s 2 = 50 hdden unts n layer L 2. Note that we also have y R 100. Snce there are only 50 hdden unts, the network s forced to learn a compressed representaton of the nput. I.e., gven only the vector of hdden unt actvatons a (2) R 50, t must try to reconstruct the 100-pxel nput x. If the nput were completely random say, each x comes from an IID Gaussan ndependent of the other features then ths compresson task would be very dffcult. But f there s structure n the data, for example, f some of the nput features are correlated, then ths algorthm wll be able to dscover some of those correlatons. 2 2 In fact, ths smple autoencoder often ends up learnng a low-dmensonal representaton very smlar to PCA s. 13

Our argument above reled on the number of hdden unts s 2 beng small. But even when the number of hdden unts s large (perhaps even greater than the number of nput pxels), we can stll dscover nterestng structure, by mposng other constrants on the network. In partcular, f we mpose a sparsty constrant on the hdden unts, then the autoencoder wll stll dscover nterestng structure n the data, even f the number of hdden unts s large. Informally, we wll thnk of a neuron as beng actve (or as frng ) f ts output value s close to 1, or as beng nactve f ts output value s close to 0. We would lke to constran the neurons to be nactve most of the tme. 3 Recall that a (2) j denotes the actvaton of hdden unt j n the autoencoder. However, ths notaton doesn t make explct what was the nput x that led to that actvaton. Thus, we wll wrte a (2) j (x) to denote the actvaton of ths hdden unt when the network s gven a specfc nput x. Further, let ˆρ j = 1 m m =1 [ ] a (2) j (x () ) be the average actvaton of hdden unt j (averaged over the tranng set). We would lke to (approxmately) enforce the constrant ˆρ j = ρ, where ρ s a sparsty parameter, typcally a small value close to zero (say ρ = 0.05). In other words, we would lke the average actvaton of each hdden neuron j to be close to 0.05 (say). To satsfy ths constrant, the hdden unt s actvatons must mostly be near 0. To acheve ths, we wll add an extra penalty term to our optmzaton objectve that penalzes ˆρ j devatng sgnfcantly from ρ. Many choces of the penalty term wll gve reasonable results. We wll choose the followng: s 2 j=1 ρ log ρˆρ j + (1 ρ) log 1 ρ 1 ˆρ j. Here, s 2 s the number of neurons n the hdden layer, and the ndex j s summng over the hdden unts n our network. If you are famlar wth the 3 Ths dscusson assumes a sgmod actvaton functon. If you are usng a tanh actvaton functon, then we thnk of a neuron as beng nactve when t outputs values close to -1. 14

concept of KL dvergence, ths penalty term s based on t, and can also be wrtten s 2 j=1 KL(ρ ˆρ j ), where KL(ρ ˆρ j ) = ρ log ρˆρ j + (1 ρ) log 1 ρ 1 ˆρ j s the Kullback-Lebler (KL) dvergence between a Bernoull random varable wth mean ρ and a Bernoull random varable wth mean ˆρ j. KL-dvergence s a standard functon for measurng how dfferent two dfferent dstrbutons are. (If you ve not seen KL-dvergence before, don t worry about t; everythng you need to know about t s contaned n these notes.) Ths penalty functon has the property that KL(ρ ˆρ j ) = 0 f ˆρ j = ρ, and otherwse t ncreases monotoncally as ˆρ j dverges from ρ. For example, n the fgure below, we have set ρ = 0.2, and plotted KL(ρ ˆρ j ) for a range of values of ˆρ j : We see that the KL-dvergence reaches ts mnmum of 0 at ˆρ j = ρ, and blows up (t actually approaches ) as ˆρ j approaches 0 or 1. Thus, mnmzng ths penalty term has the effect of causng ˆρ j to be close to ρ. Our overall cost functon s now J sparse (W, b) = J(W, b) + β s 2 j=1 KL(ρ ˆρ j ), where J(W, b) s as defned prevously, and β controls the weght of the sparsty penalty term. The term ˆρ j (mplctly) depends on W, b also, because t s the average actvaton of hdden unt j, and the actvaton of a hdden unt depends on the parameters W, b. To ncorporate the KL-dvergence term nto your dervatve calculaton, there s a smple-to-mplement trck nvolvng only a small change to your 15

code. Specfcally, where prevously for the second layer (l = 2), durng backpropagaton you would have computed ( s2 ) δ (2) = W (2) j δ (3) j f (z (2) ), now nstead compute (( s2 ) δ (2) = W (2) j δ (3) j j=1 j=1 + β ( ρˆρ + 1 ρ ) ) f (z (2) ). 1 ˆρ One subtlety s that you ll need to know ˆρ to compute ths term. Thus, you ll need to compute a forward pass on all the tranng examples frst to compute the average actvatons on the tranng set, before computng backpropagaton on any example. If your tranng set s small enough to ft comfortably n computer memory (ths wll be the case for the programmng assgnment), you can compute forward passes on all your examples and keep the resultng actvatons n memory and compute the ˆρ s. Then you can use your precomputed actvatons to perform backpropagaton on all your examples. If your data s too large to ft n memory, you may have to scan through your examples computng a forward pass on each to accumulate (sum up) the actvatons and compute ˆρ (dscardng the result of each forward pass after you have taken ts actvatons a (2) nto account for computng ˆρ ). Then after havng computed ˆρ, you d have to redo the forward pass for each example so that you can do backpropagaton on that example. In ths latter case, you would end up computng a forward pass twce on each example n your tranng set, makng t computatonally less effcent. The full dervaton showng that the algorthm above results n gradent descent s beyond the scope of these notes. But f you mplement the autoencoder usng backpropagaton modfed ths way, you wll be performng gradent descent exactly on the objectve J sparse (W, b). Usng the dervatve checkng method, you wll be able to verfy ths for yourself as well. 4 Vsualzaton Havng traned a (sparse) autoencoder, we would now lke to vsualze the functon learned by the algorthm, to try to understand what t has learned. Consder the case of tranng an autoencoder on 10 10 mages, so that 16

n = 100. Each hdden unt computes a functon of the nput: ( 100 ) a (2) = f W (1) j x j + b (1). j=1 We wll vsualze the functon computed by hdden unt whch depends on the parameters W (1) j (gnorng the bas term for now) usng a 2D mage. In partcular, we thnk of a (1) as some non-lnear feature of the nput x. We ask: What nput mage x would cause a (1) to be maxmally actvated? For ths queston to have a non-trval answer, we must mpose some constrants on x. If we suppose that the nput s norm constraned by x 2 = 100 =1 x2 1, then one can show (try dong ths yourself) that the nput whch maxmally actvates hdden unt s gven by settng pxel x j (for all 100 pxels, j = 1,..., 100) to x j = 100 W (1) j j=1 (W (1) j )2. By dsplayng the mage formed by these pxel ntensty values, we can begn to understand what feature hdden unt s lookng for. If we have an autoencoder wth 100 hdden unts (say), then we our vsualzaton wll have 100 such mages one per hdden unt. By examnng these 100 mages, we can try to understand what the ensemble of hdden unts s learnng. When we do ths for a sparse autoencoder (traned wth 100 hdden unts on 10x10 pxel nputs 4 ) we get the followng result: 4 The results below were obtaned by tranng on whtened natural mages. Whtenng s a preprocessng step whch removes redundancy n the nput, by causng adjacent pxels to become less correlated. 17

Each square n the fgure above shows the (norm bounded) nput mage x that maxmally actves one of 100 hdden unts. We see that the dfferent hdden unts have learned to detect edges at dfferent postons and orentatons n the mage. These features are, not surprsngly, useful for such tasks as object recognton and other vson tasks. When appled to other nput domans (such as audo), ths algorthm also learns useful representatons/features for those domans too. 18

5 Summary of notaton x Input features for a tranng example, x R n. y Output/target values. Here, y can be vector valued. In the case of an autoencoder, y = x. (x (), y () ) The -th tranng example h W,b (x) Output of our hypothess on nput x, usng parameters W, b. Ths should be a vector of the same dmenson as the target value y. W (l) j The parameter assocated wth the connecton between unt j n layer l, and unt n layer l + 1. b (l) The bas term assocated wth unt n layer l + 1. Can also be thought of as the parameter assocated wth the connecton between the bas unt n layer l and unt n layer l + 1. θ Our parameter vector. It s useful to thnk of ths as the result of takng the parameters W, b and unrollng them nto a long column vector. Actvaton (output) of unt n layer l of the network. In addton, snce layer L 1 s the nput layer, we also have a (1) = x. f( ) The actvaton functon. Throughout these notes, we used f(z) = tanh(z). Total weghted sum of nputs to unt n layer l. Thus, a (l) a (l) z (l) α s l n l λ ˆx ρ ˆρ β f(z (l) = ). Learnng rate parameter Number of unts n layer l (not countng the bas unt). Number layers n the network. Layer L 1 s usually the nput layer, and layer L nl the output layer. Weght decay parameter. For an autoencoder, ts output;.e., ts reconstructon of the nput x. Same meanng as h W,b (x). Sparsty parameter, whch specfes our desred level of sparsty The average actvaton of hdden unt (n the sparse autoencoder). Weght of the sparsty penalty term (n the sparse autoencoder objectve). 19