
CS294A Lecture notes
Andrew Ng

Sparse autoencoder

1 Introduction

Supervised learning is one of the most powerful tools of AI, and has led to automatic zip code recognition, speech recognition, self-driving cars, and a continually improving understanding of the human genome. Despite its significant successes, supervised learning today is still severely limited. Specifically, most applications of it still require that we manually specify the input features x given to the algorithm. Once a good feature representation is given, a supervised learning algorithm can do well. But in such domains as computer vision, audio processing, and natural language processing, there are now hundreds or perhaps thousands of researchers who have spent years of their lives slowly and laboriously hand-engineering vision, audio or text features. While much of this feature-engineering work is extremely clever, one has to wonder if we can do better. Certainly this labor-intensive hand-engineering approach does not scale well to new problems; further, ideally we would like to have algorithms that can automatically learn even better feature representations than the hand-engineered ones.

These notes describe the sparse autoencoder learning algorithm, which is one approach to automatically learn features from unlabeled data. In some domains, such as computer vision, this approach is not by itself competitive with the best hand-engineered features, but the features it can learn do turn out to be useful for a range of problems (including ones in audio, text, etc). Further, there are more sophisticated versions of the sparse autoencoder (not described in these notes, but that you will hear more about later in the class) that do surprisingly well, and in many cases are competitive with or superior to even the best hand-engineered representations.

These notes are organized as follows. We will first describe feedforward neural networks and the backpropagation algorithm for supervised learning. Then, we show how this is used to construct an autoencoder, which is an unsupervised learning algorithm. Finally, we build on this to derive a sparse autoencoder. Because these notes are fairly notation-heavy, the last page also contains a summary of the symbols used.

2 Neural networks

Consider a supervised learning problem where we have access to labeled training examples (x^{(i)}, y^{(i)}). Neural networks give a way of defining a complex, non-linear form of hypotheses h_{W,b}(x), with parameters W, b that we can fit to our data.

To describe neural networks, we will begin by describing the simplest possible neural network, one which comprises a single neuron. We will use the following diagram to denote a single neuron:

[Figure omitted: a single neuron with inputs x_1, x_2, x_3 and a +1 intercept term.]

This neuron is a computational unit that takes as input x_1, x_2, x_3 (and a +1 intercept term), and outputs

    h_{W,b}(x) = f(W^T x) = f\left( \sum_{i=1}^{3} W_i x_i + b \right),

where f : \mathbb{R} \to \mathbb{R} is called the activation function. In these notes, we will choose f(\cdot) to be the sigmoid function:

    f(z) = \frac{1}{1 + \exp(-z)}.

Thus, our single neuron corresponds exactly to the input-output mapping defined by logistic regression.

Although these notes will use the sigmoid function, it is worth noting that another common choice for f is the hyperbolic tangent, or tanh, function:

    f(z) = \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}.     (1)

Here are plots of the sigmoid and tanh functions:

[Figure omitted: plots of the sigmoid and tanh functions.]
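As a quick computational companion to these definitions, here is a minimal sketch in Python with NumPy (the notes themselves do not prescribe a language; Matlab/Octave notation is referenced elsewhere) of the two activation functions together with the derivative identities discussed in the next paragraph.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z)); outputs lie in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid: f'(z) = f(z) * (1 - f(z))."""
    fz = sigmoid(z)
    return fz * (1.0 - fz)

def tanh_prime(z):
    """Derivative of tanh: f'(z) = 1 - f(z)^2."""
    return 1.0 - np.tanh(z) ** 2

if __name__ == "__main__":
    z = np.array([-2.0, 0.0, 2.0])
    print(sigmoid(z))          # e.g. sigmoid(0) = 0.5
    print(np.tanh(z))          # tanh(0) = 0.0, range (-1, 1)
    print(sigmoid_prime(z))    # maximum value 0.25, attained at z = 0
```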

The tanh(z) function is a rescaled version of the sigmoid, and its output range is [-1, 1] instead of [0, 1].

Note that unlike CS221 and (parts of) CS229, we are not using the convention here of x_0 = 1. Instead, the intercept term is handled separately by the parameter b.

Finally, one identity that will be useful later: If f(z) = 1/(1 + \exp(-z)) is the sigmoid function, then its derivative is given by f'(z) = f(z)(1 - f(z)). (If f is the tanh function, then its derivative is given by f'(z) = 1 - (f(z))^2.) You can derive this yourself using the definition of the sigmoid (or tanh) function.

2.1 Neural network formulation

A neural network is put together by hooking together many of our simple neurons, so that the output of a neuron can be the input of another. For example, here is a small neural network:

[Figure omitted: a small neural network with 3 inputs, one hidden layer of 3 units, and one output unit.]

In this figure, we have used circles to also denote the inputs to the network. The circles labeled "+1" are called bias units, and correspond to the intercept term. The leftmost layer of the network is called the input layer, and the rightmost layer the output layer (which, in this example, has only one node). The middle layer of nodes is called the hidden layer, because its values are not observed in the training set. We also say that our example neural network has 3 input units (not counting the bias unit), 3 hidden units, and 1 output unit.

We will let n_l denote the number of layers in our network; thus n_l = 3 in our example. We label layer l as L_l, so layer L_1 is the input layer, and layer L_{n_l} the output layer. Our neural network has parameters (W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}), where we write W^{(l)}_{ij} to denote the parameter (or weight) associated with the connection between unit j in layer l, and unit i in layer l+1. (Note the order of the indices.) Also, b^{(l)}_i is the bias associated with unit i in layer l+1. Thus, in our example, we have W^{(1)} \in \mathbb{R}^{3 \times 3}, and W^{(2)} \in \mathbb{R}^{1 \times 3}. Note that bias units don't have inputs or connections going into them, since they always output the value +1. We also let s_l denote the number of nodes in layer l (not counting the bias unit).

We will write a^{(l)}_i to denote the activation (meaning output value) of unit i in layer l. For l = 1, we also use a^{(1)}_i = x_i to denote the i-th input. Given a fixed setting of the parameters W, b, our neural network defines a hypothesis h_{W,b}(x) that outputs a real number. Specifically, the computation that this neural network represents is given by:

    a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)     (2)
    a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)     (3)
    a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)     (4)
    h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1)     (5)

In the sequel, we also let z^{(l)}_i denote the total weighted sum of inputs to unit i in layer l, including the bias term (e.g., z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i), so that a^{(l)}_i = f(z^{(l)}_i).

Note that this easily lends itself to a more compact notation. Specifically, if we extend the activation function f(\cdot) to apply to vectors in an element-wise fashion (i.e., f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]), then we can write

Equations (2)-(5) more compactly as:

    z^{(2)} = W^{(1)} x + b^{(1)}
    a^{(2)} = f(z^{(2)})
    z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}
    h_{W,b}(x) = a^{(3)} = f(z^{(3)})

More generally, recalling that we also use a^{(1)} = x to also denote the values from the input layer, then given layer l's activations a^{(l)}, we can compute layer l+1's activations a^{(l+1)} as:

    z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}     (6)
    a^{(l+1)} = f(z^{(l+1)})                  (7)

By organizing our parameters in matrices and using matrix-vector operations, we can take advantage of fast linear algebra routines to quickly perform calculations in our network.

We have so far focused on one example neural network, but one can also build neural networks with other architectures (meaning patterns of connectivity between neurons), including ones with multiple hidden layers. The most common choice is an n_l-layered network where layer 1 is the input layer, layer n_l is the output layer, and each layer l is densely connected to layer l+1. In this setting, to compute the output of the network, we can successively compute all the activations in layer L_2, then layer L_3, and so on, up to layer L_{n_l}, using Equations (6)-(7). This is one example of a feedforward neural network, since the connectivity graph does not have any directed loops or cycles.

Neural networks can also have multiple output units. For example, here is a network with two hidden layers (layers L_2 and L_3) and two output units in layer L_4:

[Figure omitted: a four-layer network with two hidden layers and two output units.]
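As a concrete companion to Equations (6)-(7), here is a minimal sketch of vectorized forward propagation in Python/NumPy. The layer sizes below (3-4-4-2) are hypothetical, chosen only to mirror a network with two hidden layers and two output units like the one just described.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W, b):
    """Compute all activations of a densely connected feedforward network.

    W[l] and b[l] play the role of W^(l+1) and b^(l+1) in the notes (Python
    lists are 0-indexed), so a^(l+2) = f(W[l] @ a^(l+1) + b[l]).  Returns the
    list of activations [a^(1), a^(2), ..., a^(n_l)].
    """
    activations = [x]                     # a^(1) = x
    for Wl, bl in zip(W, b):
        z = Wl @ activations[-1] + bl     # Equation (6)
        activations.append(sigmoid(z))    # Equation (7)
    return activations

# Hypothetical example: 3 inputs, two hidden layers of 4 units, 2 output units.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
W = [0.01 * rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]
x = rng.standard_normal(3)
h = forward_pass(x, W, b)[-1]             # h_{W,b}(x), here a vector in R^2
print(h)
```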

To train this network, we would need training examples (x^{(i)}, y^{(i)}) where y^{(i)} \in \mathbb{R}^2. This sort of network is useful if there are multiple outputs that you're interested in predicting. (For example, in a medical diagnosis application, the vector x might give the input features of a patient, and the different outputs y_i's might indicate presence or absence of different diseases.)

2.2 Backpropagation algorithm

Suppose we have a fixed training set {(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})} of m training examples. We can train our neural network using batch gradient descent. In detail, for a single training example (x, y), we define the cost function with respect to that single example to be

    J(W, b; x, y) = \frac{1}{2} \| h_{W,b}(x) - y \|^2.

This is a (one-half) squared-error cost function. Given a training set of m examples, we then define the overall cost function to be

    J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2     (8)

            = \left[ \frac{1}{m} \sum_{i=1}^{m} \left( \frac{1}{2} \| h_{W,b}(x^{(i)}) - y^{(i)} \|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2

The first term in the definition of J(W, b) is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.¹ The weight decay parameter λ controls the relative importance of the two terms. Note also the slightly overloaded notation: J(W, b; x, y) is the squared error cost with respect to a single example; J(W, b) is the overall cost function, which includes the weight decay term.

This cost function is often used both for classification and for regression problems.

¹ Usually weight decay is not applied to the bias terms b^{(l)}_i, as reflected in our definition for J(W, b). Applying weight decay to the bias units usually makes only a small difference to the final network, however. If you took CS229, you may also recognize weight decay as essentially a variant of the Bayesian regularization method you saw there, where we placed a Gaussian prior on the parameters and did MAP (instead of maximum likelihood) estimation.

For classification, we let y = 0 or 1 represent the two class labels (recall that the sigmoid activation function outputs values in [0, 1]; if we were using a tanh activation function, we would instead use -1 and +1 to denote the labels). For regression problems, we first scale our outputs to ensure that they lie in the [0, 1] range (or, if we were using a tanh activation function, then the [-1, 1] range).

Our goal is to minimize J(W, b) as a function of W and b. To train our neural network, we will initialize each parameter W^{(l)}_{ij} and each b^{(l)}_i to a small random value near zero (say according to a N(0, ε²) distribution for some small ε, say 0.01), and then apply an optimization algorithm such as batch gradient descent. Since J(W, b) is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Finally, note that it is important to initialize the parameters randomly, rather than to all 0's. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input (more formally, W^{(1)}_{ij} will be the same for all values of i, so that a^{(2)}_1 = a^{(2)}_2 = a^{(2)}_3 = ... for any input x). The random initialization serves the purpose of "symmetry breaking".

One iteration of gradient descent updates the parameters W, b as follows:

    W^{(l)}_{ij} := W^{(l)}_{ij} - \alpha \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b)
    b^{(l)}_i := b^{(l)}_i - \alpha \frac{\partial}{\partial b^{(l)}_i} J(W, b)

where α is the learning rate. The key step is computing the partial derivatives above. We will now describe the backpropagation algorithm, which gives an efficient way to compute these partial derivatives.

We will first describe how backpropagation can be used to compute \partial J(W, b; x, y) / \partial W^{(l)}_{ij} and \partial J(W, b; x, y) / \partial b^{(l)}_i, the partial derivatives of the cost function J(W, b; x, y) defined with respect to a single example (x, y). Once we can compute these, then by referring to Equation (8), we see that the derivative of the overall cost function J(W, b) can be computed as

    \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x^{(i)}, y^{(i)}) \right] + \lambda W^{(l)}_{ij},
    \frac{\partial}{\partial b^{(l)}_i} J(W, b) = \frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial b^{(l)}_i} J(W, b; x^{(i)}, y^{(i)}).

The two lines above differ slightly because weight decay is applied to W but not b.
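As a small illustration of Equation (8), here is a hedged sketch of the overall cost in Python/NumPy. It assumes the forward_pass helper sketched earlier is in scope; names such as lam are illustrative and not taken from the notes.

```python
import numpy as np

def cost(W, b, X, Y, lam):
    """J(W, b) from Equation (8): average one-half squared error over the m
    examples plus the weight decay term (biases are excluded from decay)."""
    m = len(X)
    err = 0.0
    for x, y in zip(X, Y):
        h = forward_pass(x, W, b)[-1]        # h_{W,b}(x^{(i)})
        err += 0.5 * np.sum((h - y) ** 2)    # J(W, b; x^{(i)}, y^{(i)})
    weight_decay = 0.5 * lam * sum(np.sum(Wl ** 2) for Wl in W)
    return err / m + weight_decay
```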

The intuition behind the backpropagation algorithm is as follows. Given a training example (x, y), we will first run a "forward pass" to compute all the activations throughout the network, including the output value of the hypothesis h_{W,b}(x). Then, for each node i in layer l, we would like to compute an "error term" δ^{(l)}_i that measures how much that node was responsible for any errors in our output. For an output node, we can directly measure the difference between the network's activation and the true target value, and use that to define δ^{(n_l)}_i (where layer n_l is the output layer). How about hidden units? For those, we will compute δ^{(l)}_i based on a weighted average of the error terms of the nodes that use a^{(l)}_i as an input. In detail, here is the backpropagation algorithm:

1. Perform a feedforward pass, computing the activations for layers L_2, L_3, and so on up to the output layer L_{n_l}.

2. For each output unit i in layer n_l (the output layer), set

       \delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} \frac{1}{2} \| y - h_{W,b}(x) \|^2 = -(y_i - a^{(n_l)}_i) \cdot f'(z^{(n_l)}_i)

3. For l = n_l - 1, n_l - 2, n_l - 3, ..., 2:
   For each node i in layer l, set

       \delta^{(l)}_i = \left( \sum_{j=1}^{s_{l+1}} W^{(l)}_{ji} \delta^{(l+1)}_j \right) f'(z^{(l)}_i)

4. Compute the desired partial derivatives, which are given as:

       \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x, y) = a^{(l)}_j \delta^{(l+1)}_i
       \frac{\partial}{\partial b^{(l)}_i} J(W, b; x, y) = \delta^{(l+1)}_i.

Finally, we can also re-write the algorithm using matrix-vectorial notation. We will use "•" to denote the element-wise product operator (denoted ".*" in Matlab or Octave, and also called the Hadamard product), so that if a = b • c, then a_i = b_i c_i. Similar to how we extended the definition of f(·) to apply element-wise to vectors, we also do the same for f'(·) (so that f'([z_1, z_2, z_3]) = [f'(z_1), f'(z_2), f'(z_3)]). The algorithm can then be written:

1. Perform a feedforward pass, computing the activations for layers L_2, L_3, up to the output layer L_{n_l}, using Equations (6)-(7).

2. For the output layer (layer n_l), set

       \delta^{(n_l)} = -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})

3. For l = n_l - 1, n_l - 2, n_l - 3, ..., 2, set

       \delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \bullet f'(z^{(l)})

4. Compute the desired partial derivatives:

       \nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} (a^{(l)})^T,
       \nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}.

Implementation note: In steps 2 and 3 above, we need to compute f'(z^{(l)}_i) for each value of i. Assuming f(z) is the sigmoid activation function, we would already have a^{(l)}_i stored away from the forward pass through the network. Thus, using the expression that we worked out earlier for f'(z), we can compute this as f'(z^{(l)}_i) = a^{(l)}_i (1 - a^{(l)}_i).
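The four steps above translate almost line for line into code. Here is a hedged sketch for a single training example (Python/NumPy, sigmoid units, reusing the forward_pass helper sketched earlier; an illustration only, not official exercise code).

```python
import numpy as np

def backprop_single(x, y, W, b):
    """Gradients of J(W, b; x, y) for one example, following steps 1-4 above.
    Returns lists dW, db where dW[l], db[l] correspond to W^(l+1), b^(l+1)."""
    a = forward_pass(x, W, b)                      # step 1: a[0] = x, ..., a[-1] = h_{W,b}(x)

    # Step 2: output-layer error term, delta = -(y - a) .* f'(z),
    # using f'(z) = a .* (1 - a) for the sigmoid.
    delta = -(y - a[-1]) * a[-1] * (1.0 - a[-1])

    dW = [None] * len(W)
    db = [None] * len(b)
    # Steps 3-4: walk backwards through the layers.
    for l in range(len(W) - 1, -1, -1):
        dW[l] = np.outer(delta, a[l])              # delta^(l+1) (a^(l))^T
        db[l] = delta                              # delta^(l+1)
        if l > 0:
            delta = (W[l].T @ delta) * a[l] * (1.0 - a[l])
    return dW, db
```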

Finally, we are ready to describe the full gradient descent algorithm. In the pseudo-code below, ΔW^{(l)} is a matrix (of the same dimension as W^{(l)}), and Δb^{(l)} is a vector (of the same dimension as b^{(l)}). Note that in this notation, "ΔW^{(l)}" is a matrix, and in particular it isn't "Δ times W^{(l)}". We implement one iteration of batch gradient descent as follows:

1. Set ΔW^{(l)} := 0, Δb^{(l)} := 0 (matrix/vector of zeros) for all l.

2. For i = 1 to m,

   2a. Use backpropagation to compute ∇_{W^{(l)}} J(W, b; x, y) and ∇_{b^{(l)}} J(W, b; x, y).

   2b. Set ΔW^{(l)} := ΔW^{(l)} + ∇_{W^{(l)}} J(W, b; x, y).

   2c. Set Δb^{(l)} := Δb^{(l)} + ∇_{b^{(l)}} J(W, b; x, y).

3. Update the parameters:

       W^{(l)} := W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right]
       b^{(l)} := b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]

To train our neural network, we can now repeatedly take steps of gradient descent to reduce our cost function J(W, b).

2.3 Gradient checking and advanced optimization

Backpropagation is a notoriously difficult algorithm to debug and get right, especially since many subtly buggy implementations of it (for example, one that has an off-by-one error in the indices and that thus only trains some of the layers of weights, or an implementation that omits the bias term) will manage to learn something that can look surprisingly reasonable (while performing less well than a correct implementation). Thus, even with a buggy implementation, it may not at all be apparent that anything is amiss. In this section, we describe a method for numerically checking the derivatives computed by your code to make sure that your implementation is correct. Carrying out the derivative checking procedure described here will significantly increase your confidence in the correctness of your code.

Suppose we want to minimize J(θ) as a function of θ. For this example, suppose J : \mathbb{R} \to \mathbb{R}, so that θ \in \mathbb{R}. In this 1-dimensional case, one iteration of gradient descent is given by

    \theta := \theta - \alpha \frac{d}{d\theta} J(\theta).

Suppose also that we have implemented some function g(θ) that purportedly computes dJ(θ)/dθ, so that we implement gradient descent using the update θ := θ - α g(θ). How can we check if our implementation of g is correct?

Recall the mathematical definition of the derivative as

    \frac{d}{d\theta} J(\theta) = \lim_{\epsilon \to 0} \frac{J(\theta + \epsilon) - J(\theta - \epsilon)}{2 \epsilon}.

Thus, at any specific value of θ, we can numerically approximate the derivative as follows:

    \frac{J(\theta + {\rm EPSILON}) - J(\theta - {\rm EPSILON})}{2 \times {\rm EPSILON}}

In practice, we set EPSILON to a small constant, say around 10^{-4}. (There is a large range of values of EPSILON that should work well, but we don't set EPSILON to be extremely small, say 10^{-20}, as that would lead to numerical roundoff errors.) Thus, given a function g(θ) that is supposedly computing dJ(θ)/dθ, we can now numerically verify its correctness by checking that

    g(\theta) \approx \frac{J(\theta + {\rm EPSILON}) - J(\theta - {\rm EPSILON})}{2 \times {\rm EPSILON}}.

The degree to which these two values should approximate each other will depend on the details of J. But assuming EPSILON = 10^{-4}, you'll usually find that the left- and right-hand sides of the above will agree to at least 4 significant digits (and often many more).

Now, consider the case where θ \in \mathbb{R}^n is a vector rather than a single real number (so that we have n parameters that we want to learn), and J : \mathbb{R}^n \to \mathbb{R}. In our neural network example we used J(W, b), but one can imagine "unrolling" the parameters W, b into a long vector θ. We now generalize our derivative checking procedure to the case where θ may be a vector.

Suppose we have a function g_i(θ) that purportedly computes \partial J(\theta) / \partial \theta_i; we'd like to check if g_i is outputting correct derivative values. Let θ^{(i+)} = θ + EPSILON × e_i, where

    e_i = [0, 0, \ldots, 0, 1, 0, \ldots, 0]^T

is the i-th basis vector (a vector of the same dimension as θ, with a "1" in the i-th position and "0"s everywhere else). So, θ^{(i+)} is the same as θ, except its i-th element has been incremented by EPSILON. Similarly, let θ^{(i-)} = θ - EPSILON × e_i be the corresponding vector with the i-th element decreased by EPSILON. We can now numerically verify g_i(θ)'s correctness by checking, for each i, that:

    g_i(\theta) \approx \frac{J(\theta^{(i+)}) - J(\theta^{(i-)})}{2 \times {\rm EPSILON}}.
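Here is a hedged sketch of this check in Python/NumPy. J and grad stand for user-supplied functions computing J(θ) and the purported gradient g(θ); the quadratic example at the end is purely illustrative.

```python
import numpy as np

def check_gradient(J, grad, theta, epsilon=1e-4):
    """Compare grad(theta) against the two-sided numerical approximation
    (J(theta + eps * e_i) - J(theta - eps * e_i)) / (2 * eps) for each i."""
    numgrad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = epsilon
        numgrad[i] = (J(theta + e_i) - J(theta - e_i)) / (2.0 * epsilon)
    analytic = grad(theta)
    # Relative difference; for a correct gradient this should be tiny (e.g. < 1e-8).
    diff = np.linalg.norm(numgrad - analytic) / np.linalg.norm(numgrad + analytic)
    return diff, numgrad

# Illustrative example: J(theta) = theta_1^2 + 3 * theta_1 * theta_2.
J = lambda t: t[0] ** 2 + 3.0 * t[0] * t[1]
grad = lambda t: np.array([2.0 * t[0] + 3.0 * t[1], 3.0 * t[0]])
print(check_gradient(J, grad, np.array([4.0, 10.0]))[0])   # very small, e.g. ~1e-10
```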

When implementing backpropagation to train a neural network, in a correct implementation we will have that

    \nabla_{W^{(l)}} J(W, b) = \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)}
    \nabla_{b^{(l)}} J(W, b) = \frac{1}{m} \Delta b^{(l)}.

This result shows that the final block of pseudo-code in Section 2.2 is indeed implementing gradient descent. To make sure your implementation of gradient descent is correct, it is usually very helpful to use the method described above to numerically compute the derivatives of J(W, b), and thereby verify that your computations of (1/m) ΔW^{(l)} + λW^{(l)} and (1/m) Δb^{(l)} are indeed giving the derivatives you want.

Finally, so far our discussion has centered on using gradient descent to minimize J(θ). If you have implemented a function that computes J(θ) and ∇_θ J(θ), it turns out there are more sophisticated algorithms than gradient descent for trying to minimize J(θ). For example, one can envision an algorithm that uses gradient descent, but automatically tunes the learning rate α so as to try to use a step-size that causes θ to approach a local optimum as quickly as possible. There are other algorithms that are even more sophisticated than this; for example, there are algorithms that try to find an approximation to the Hessian matrix, so that they can take more rapid steps towards a local optimum (similar to Newton's method). A full discussion of these algorithms is beyond the scope of these notes, but one example is the L-BFGS algorithm. (Another example is conjugate gradient.) You will use one of these algorithms in the programming exercise. The main thing you need to provide to these advanced optimization algorithms is that for any θ, you have to be able to compute J(θ) and ∇_θ J(θ). These optimization algorithms will then do their own internal tuning of the learning rate/step-size α (and compute their own approximation to the Hessian, etc.) to automatically search for a value of θ that minimizes J(θ). Algorithms such as L-BFGS and conjugate gradient can often be much faster than gradient descent.

3 Autoencoders and sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled training examples. Now suppose we have only a set of unlabeled training examples {x^{(1)}, x^{(2)}, x^{(3)}, ...}, where x^{(i)} \in \mathbb{R}^n. An autoencoder neural network is an unsupervised learning algorithm that applies backpropagation, setting the target values to be equal to the inputs, i.e., it uses y^{(i)} = x^{(i)}.

Here is an autoencoder:

[Figure omitted: an autoencoder network whose output layer has the same number of units as its input layer.]

The autoencoder tries to learn a function h_{W,b}(x) ≈ x. In other words, it is trying to learn an approximation to the identity function, so as to output x̂ that is similar to x. The identity function seems a particularly trivial function to be trying to learn; but by placing constraints on the network, such as by limiting the number of hidden units, we can discover interesting structure about the data. As a concrete example, suppose the inputs x are the pixel intensity values from a 10×10 image (100 pixels), so n = 100, and there are s_2 = 50 hidden units in layer L_2. Note that we also have y \in \mathbb{R}^{100}. Since there are only 50 hidden units, the network is forced to learn a "compressed" representation of the input. I.e., given only the vector of hidden unit activations a^{(2)} \in \mathbb{R}^{50}, it must try to reconstruct the 100-pixel input x. If the input were completely random, say, each x_i comes from an IID Gaussian independent of the other features, then this compression task would be very difficult. But if there is structure in the data, for example, if some of the input features are correlated, then this algorithm will be able to discover some of those correlations.²

² In fact, this simple autoencoder often ends up learning a low-dimensional representation very similar to PCA's.
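Putting the earlier sketches together, an ordinary (non-sparse) autoencoder is just such a network trained with targets equal to inputs. The sketch below is illustrative only: it assumes the backprop_single helper from before, hypothetical random data standing in for images, and batch gradient descent as described in Section 2.2.

```python
import numpy as np

# Hypothetical data: m examples of 10x10 images, flattened to 100-dimensional vectors.
rng = np.random.default_rng(0)
m, n, hidden = 1000, 100, 50
X = rng.random((m, n))

# A 100-50-100 autoencoder: the training targets are the inputs themselves, y^(i) = x^(i).
sizes = [n, hidden, n]
W = [0.01 * rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
b = [np.zeros(sizes[l + 1]) for l in range(2)]

alpha, lam = 0.1, 1e-4
for _ in range(10):                                # a few illustrative batch gradient steps
    dW = [np.zeros_like(Wl) for Wl in W]
    db = [np.zeros_like(bl) for bl in b]
    for x in X:
        gW, gb = backprop_single(x, x, W, b)       # note: the target is x itself
        for l in range(2):
            dW[l] += gW[l]
            db[l] += gb[l]
    for l in range(2):
        W[l] -= alpha * (dW[l] / m + lam * W[l])   # weight decay applied to W only
        b[l] -= alpha * (db[l] / m)
```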

Our argument above relied on the number of hidden units s_2 being small. But even when the number of hidden units is large (perhaps even greater than the number of input pixels), we can still discover interesting structure, by imposing other constraints on the network. In particular, if we impose a "sparsity" constraint on the hidden units, then the autoencoder will still discover interesting structure in the data, even if the number of hidden units is large.

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time.³

Recall that a^{(2)}_j denotes the activation of hidden unit j in the autoencoder. However, this notation doesn't make explicit what was the input x that led to that activation. Thus, we will write a^{(2)}_j(x) to denote the activation of this hidden unit when the network is given a specific input x. Further, let

    \hat\rho_j = \frac{1}{m} \sum_{i=1}^{m} \left[ a^{(2)}_j (x^{(i)}) \right]

be the average activation of hidden unit j (averaged over the training set). We would like to (approximately) enforce the constraint ρ̂_j = ρ, where ρ is a "sparsity parameter", typically a small value close to zero (say ρ = 0.05). In other words, we would like the average activation of each hidden neuron j to be close to 0.05 (say). To satisfy this constraint, the hidden unit's activations must mostly be near 0.

To achieve this, we will add an extra penalty term to our optimization objective that penalizes ρ̂_j deviating significantly from ρ. Many choices of the penalty term will give reasonable results. We will choose the following:

    \sum_{j=1}^{s_2} \rho \log \frac{\rho}{\hat\rho_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat\rho_j}.

Here, s_2 is the number of neurons in the hidden layer, and the index j is summing over the hidden units in our network.

³ This discussion assumes a sigmoid activation function. If you are using a tanh activation function, then we think of a neuron as being inactive when it outputs values close to -1.
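In code, the average activations and this penalty might be computed as follows (a sketch under the same assumptions as before: Python/NumPy, sigmoid units, and the forward_pass helper; rho_hat has one entry per hidden unit).

```python
import numpy as np

def average_activations(X, W, b):
    """rho_hat_j = (1/m) sum_i a_j^(2)(x^(i)): mean activation of each hidden
    unit over the training set (one forward pass per example)."""
    hidden = np.array([forward_pass(x, W, b)[1] for x in X])   # m x s_2 matrix of a^(2)
    return hidden.mean(axis=0)

def sparsity_penalty(rho, rho_hat):
    """sum_j [ rho*log(rho/rho_hat_j) + (1-rho)*log((1-rho)/(1-rho_hat_j)) ]."""
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

# The sparse objective discussed below adds beta times this penalty to J(W, b).
```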

If you are familiar with the concept of KL divergence, this penalty term is based on it, and can also be written

    \sum_{j=1}^{s_2} {\rm KL}(\rho \,\|\, \hat\rho_j),

where

    {\rm KL}(\rho \,\|\, \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat\rho_j}

is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean ρ and a Bernoulli random variable with mean ρ̂_j. KL-divergence is a standard function for measuring how different two distributions are. (If you've not seen KL-divergence before, don't worry about it; everything you need to know about it is contained in these notes.)

This penalty function has the property that KL(ρ || ρ̂_j) = 0 if ρ̂_j = ρ, and otherwise it increases monotonically as ρ̂_j diverges from ρ. For example, in the figure below, we have set ρ = 0.2, and plotted KL(ρ || ρ̂_j) for a range of values of ρ̂_j:

[Figure omitted: plot of KL(ρ || ρ̂_j) as a function of ρ̂_j, for ρ = 0.2.]

We see that the KL-divergence reaches its minimum of 0 at ρ̂_j = ρ, and blows up (it actually approaches ∞) as ρ̂_j approaches 0 or 1. Thus, minimizing this penalty term has the effect of causing ρ̂_j to be close to ρ.

Our overall cost function is now

    J_{\rm sparse}(W, b) = J(W, b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho \,\|\, \hat\rho_j),

where J(W, b) is as defined previously, and β controls the weight of the sparsity penalty term. The term ρ̂_j (implicitly) depends on W, b also, because it is the average activation of hidden unit j, and the activation of a hidden unit depends on the parameters W, b.

To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement trick involving only a small change to your code.

Specifically, where previously for the second layer (l = 2), during backpropagation you would have computed

    \delta^{(2)}_i = \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),

now instead compute

    \delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( -\frac{\rho}{\hat\rho_i} + \frac{1 - \rho}{1 - \hat\rho_i} \right) \right) f'(z^{(2)}_i).

One subtlety is that you'll need to know ρ̂_i to compute this term. Thus, you'll need to compute a forward pass on all the training examples first to compute the average activations on the training set, before computing backpropagation on any example. If your training set is small enough to fit comfortably in computer memory (this will be the case for the programming assignment), you can compute forward passes on all your examples, keep the resulting activations in memory, and compute the ρ̂_i's. Then you can use your precomputed activations to perform backpropagation on all your examples. If your data is too large to fit in memory, you may have to scan through your examples, computing a forward pass on each to accumulate (sum up) the activations and compute ρ̂_i (discarding the result of each forward pass after you have taken its activations a^{(2)}_i into account for computing ρ̂_i). Then, after having computed ρ̂_i, you would have to redo the forward pass for each example so that you can do backpropagation on that example. In this latter case, you would end up computing a forward pass twice on each example in your training set, making it computationally less efficient.

The full derivation showing that the algorithm above results in gradient descent is beyond the scope of these notes. But if you implement the autoencoder using backpropagation modified this way, you will be performing gradient descent exactly on the objective J_sparse(W, b). Using the derivative checking method, you will be able to verify this for yourself as well.
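In terms of the earlier sketches, this modification is a small change to the per-example backpropagation. The hedged sketch below assumes a three-layer autoencoder with sigmoid units, the forward_pass helper from before, and a rho_hat vector precomputed over the whole training set (e.g. with average_activations); variable names are illustrative.

```python
import numpy as np

def backprop_sparse_single(x, y, W, b, rho, rho_hat, beta):
    """Per-example gradients for the sparse autoencoder objective J_sparse.
    Identical to backprop_single except for the extra
    beta * (-rho/rho_hat + (1 - rho)/(1 - rho_hat)) term added to delta^(2)."""
    a = forward_pass(x, W, b)                      # a[0] = x, a[1] = a^(2), a[2] = a^(3)
    delta3 = -(y - a[2]) * a[2] * (1.0 - a[2])     # output-layer error term
    sparsity_term = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    delta2 = (W[1].T @ delta3 + sparsity_term) * a[1] * (1.0 - a[1])
    dW = [np.outer(delta2, a[0]), np.outer(delta3, a[1])]
    db = [delta2, delta3]
    return dW, db
```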

4 Visualization

Having trained a (sparse) autoencoder, we would now like to visualize the function learned by the algorithm, to try to understand what it has learned. Consider the case of training an autoencoder on 10×10 images, so that n = 100. Each hidden unit i computes a function of the input:

    a^{(2)}_i = f\left( \sum_{j=1}^{100} W^{(1)}_{ij} x_j + b^{(1)}_i \right).

We will visualize the function computed by hidden unit i, which depends on the parameters W^{(1)}_{ij} (ignoring the bias term for now), using a 2D image. In particular, we think of a^{(2)}_i as some non-linear feature of the input x. We ask: What input image x would cause a^{(2)}_i to be maximally activated? For this question to have a non-trivial answer, we must impose some constraints on x. If we suppose that the input is norm constrained by

    \| x \|^2 = \sum_{i=1}^{100} x_i^2 \le 1,

then one can show (try doing this yourself) that the input which maximally activates hidden unit i is given by setting pixel x_j (for all 100 pixels, j = 1, ..., 100) to

    x_j = \frac{W^{(1)}_{ij}}{\sqrt{\sum_{j=1}^{100} (W^{(1)}_{ij})^2}}.

By displaying the image formed by these pixel intensity values, we can begin to understand what feature hidden unit i is looking for.

If we have an autoencoder with 100 hidden units (say), then our visualization will have 100 such images, one per hidden unit. By examining these 100 images, we can try to understand what the ensemble of hidden units is learning.

When we do this for a sparse autoencoder (trained with 100 hidden units on 10x10 pixel inputs⁴) we get the following result:

⁴ The results below were obtained by training on whitened natural images. Whitening is a preprocessing step which removes redundancy in the input, by causing adjacent pixels to become less correlated.

[Figure omitted: a grid of the 100 norm-bounded input images that maximally activate each hidden unit.]

Each square in the figure above shows the (norm bounded) input image x that maximally activates one of 100 hidden units. We see that the different hidden units have learned to detect edges at different positions and orientations in the image.

These features are, not surprisingly, useful for such tasks as object recognition and other vision tasks. When applied to other input domains (such as audio), this algorithm also learns useful representations/features for those domains too.
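A hedged sketch of this visualization, under the same Python/NumPy assumptions as before and using matplotlib purely for display (the notes do not prescribe a plotting library); W1 below stands for the trained first-layer weight matrix W^(1), with one row per hidden unit.

```python
import numpy as np
import matplotlib.pyplot as plt

def maximally_activating_images(W1, shape=(10, 10)):
    """Row i of W1 holds (W^(1)_i1, ..., W^(1)_in); normalizing each row by its
    Euclidean norm gives the norm-bounded input that maximally activates unit i."""
    norms = np.sqrt(np.sum(W1 ** 2, axis=1, keepdims=True))
    return (W1 / norms).reshape(len(W1), *shape)

# Hypothetical trained first-layer weights for 100 hidden units on 10x10 inputs.
W1 = np.random.default_rng(0).standard_normal((100, 100))
images = maximally_activating_images(W1)
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
for img, ax in zip(images, axes.ravel()):
    ax.imshow(img, cmap="gray")
    ax.axis("off")
plt.show()
```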

5 Summary of notation

x : Input features for a training example, x \in \mathbb{R}^n.
y : Output/target values. Here, y can be vector valued. In the case of an autoencoder, y = x.
(x^{(i)}, y^{(i)}) : The i-th training example.
h_{W,b}(x) : Output of our hypothesis on input x, using parameters W, b. This should be a vector of the same dimension as the target value y.
W^{(l)}_{ij} : The parameter associated with the connection between unit j in layer l, and unit i in layer l+1.
b^{(l)}_i : The bias term associated with unit i in layer l+1. Can also be thought of as the parameter associated with the connection between the bias unit in layer l and unit i in layer l+1.
θ : Our parameter vector. It is useful to think of this as the result of taking the parameters W, b and "unrolling" them into a long column vector.
a^{(l)}_i : Activation (output) of unit i in layer l of the network. In addition, since layer L_1 is the input layer, we also have a^{(1)}_i = x_i.
f(·) : The activation function. Throughout these notes, we used the sigmoid function f(z) = 1/(1 + exp(-z)).
z^{(l)}_i : Total weighted sum of inputs to unit i in layer l. Thus, a^{(l)}_i = f(z^{(l)}_i).
α : Learning rate parameter.
s_l : Number of units in layer l (not counting the bias unit).
n_l : Number of layers in the network. Layer L_1 is usually the input layer, and layer L_{n_l} the output layer.
λ : Weight decay parameter.
x̂ : For an autoencoder, its output; i.e., its reconstruction of the input x. Same meaning as h_{W,b}(x).
ρ : Sparsity parameter, which specifies our desired level of sparsity.
ρ̂_i : The average activation of hidden unit i (in the sparse autoencoder).
β : Weight of the sparsity penalty term (in the sparse autoencoder objective).


More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Lossy Compression. Compromise accuracy of reconstruction for increased compression.

Lossy Compression. Compromise accuracy of reconstruction for increased compression. Lossy Compresson Compromse accuracy of reconstructon for ncreased compresson. The reconstructon s usually vsbly ndstngushable from the orgnal mage. Typcally, one can get up to 0:1 compresson wth almost

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Calculation of time complexity (3%)

Calculation of time complexity (3%) Problem 1. (30%) Calculaton of tme complexty (3%) Gven n ctes, usng exhaust search to see every result takes O(n!). Calculaton of tme needed to solve the problem (2%) 40 ctes:40! dfferent tours 40 add

More information

Solving Nonlinear Differential Equations by a Neural Network Method

Solving Nonlinear Differential Equations by a Neural Network Method Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,

More information

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud

Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

SDMML HT MSc Problem Sheet 4

SDMML HT MSc Problem Sheet 4 SDMML HT 06 - MSc Problem Sheet 4. The recever operatng characterstc ROC curve plots the senstvty aganst the specfcty of a bnary classfer as the threshold for dscrmnaton s vared. Let the data space be

More information

Support Vector Machines

Support Vector Machines Separatng boundary, defned by w Support Vector Machnes CISC 5800 Professor Danel Leeds Separatng hyperplane splts class 0 and class 1 Plane s defned by lne w perpendcular to plan Is data pont x n class

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

MULTISPECTRAL IMAGE CLASSIFICATION USING BACK-PROPAGATION NEURAL NETWORK IN PCA DOMAIN

MULTISPECTRAL IMAGE CLASSIFICATION USING BACK-PROPAGATION NEURAL NETWORK IN PCA DOMAIN MULTISPECTRAL IMAGE CLASSIFICATION USING BACK-PROPAGATION NEURAL NETWORK IN PCA DOMAIN S. Chtwong, S. Wtthayapradt, S. Intajag, and F. Cheevasuvt Faculty of Engneerng, Kng Mongkut s Insttute of Technology

More information