
Representational Power of Restricted Boltzmann Machines and Deep Belief Networks

Nicolas Le Roux and Yoshua Bengio
Dept. IRO, Université de Montréal
C.P. 6128, Montreal, Qc, H3C 3J7, Canada
http:// lisa

Abstract

Deep Belief Networks (DBN) are generative neural network models with many layers of hidden explanatory factors, recently introduced by Hinton et al., along with a greedy layer-wise unsupervised learning algorithm. The building block of a DBN is a probabilistic model called a Restricted Boltzmann Machine (RBM), used to represent one layer of the model. Restricted Boltzmann Machines are interesting because inference is easy in them, and because they have been successfully used as building blocks for training deeper models. We first prove that adding hidden units yields strictly improved modelling power, while a second theorem shows that RBMs are universal approximators of discrete distributions. We then study the question of whether DBNs with more layers are strictly more powerful in terms of representational power. This suggests a new and less greedy criterion for training RBMs within DBNs.

1 Introduction

Learning algorithms that learn to represent functions with many levels of composition are said to have a deep architecture. Bengio and Le Cun (2007) discuss results in the computational theory of circuits that strongly suggest that deep architectures are much more efficient in terms of representation (number of computational elements, number of parameters) than their shallow counterparts. In spite of the fact that 2-level architectures (e.g., a one-hidden-layer neural network, a kernel machine, or a 2-level digital circuit) are able to represent any function (see for example Hornik, Stinchcombe, & White, 1989), they may need a huge number of elements and, consequently, of training examples. For example, the parity function on d bits (which associates the value 1 with a vector if it has an odd number of bits equal to 1, and 0 otherwise) can be implemented by a digital circuit of depth log(d) with O(d) elements, but requires O(2^d) elements to be represented by a 2-level digital circuit (Ajtai, 1983) (e.g., in conjunctive or disjunctive normal form). We proved a similar result for Gaussian kernel machines: they require O(2^d) non-zero coefficients (i.e., support vectors in a Support Vector Machine) to represent such highly varying functions (Bengio, Delalleau, & Le Roux, 2006a).

On the other hand, training learning algorithms with a deep architecture (such as neural networks with many hidden layers) appears to be a challenging optimization problem (Tesauro, 1992; Bengio, Lamblin, Popovici, & Larochelle, 2007). Hinton, Osindero, and Teh (2006) introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN). The training strategy for such networks may hold great promise as a principle to help address the problem of training deep networks. Upper layers of a DBN are supposed to represent more abstract concepts that explain the input data, whereas lower layers extract low-level features from the data. In (Bengio et al., 2007; Ranzato, Poultney, Chopra, & LeCun, 2007), this greedy layer-wise principle is found to be applicable to models other than DBNs. DBNs and RBMs have already been applied successfully to a number of classification, dimensionality reduction, information retrieval, and modelling tasks (Welling, Rosen-Zvi, & Hinton, 2005; Hinton et al., 2006; Hinton & Salakhutdinov, 2006; Bengio et al., 2007; Salakhutdinov & Hinton, 2007).

In this paper we show that adding hidden units yields strictly improved modelling power, unless the RBM already perfectly models the data. Then, we prove that an RBM can model any discrete distribution, a property similar to those of neural networks with one hidden layer. Finally, we discuss the representational power of DBNs and find a puzzling result about the best that could be achieved when going from 1-layer to 2-layer DBNs. Note that the proofs of universal approximation by RBMs are constructive, but these constructions are not practical as they would lead to RBMs with potentially as many hidden units as examples, and this would defy the purpose of using RBMs as building blocks of a deep network that efficiently represents the input distribution. Important theoretical questions therefore remain unanswered concerning the potential for DBNs that stack multiple RBMs to represent a distribution efficiently.

1.1 Background on RBMs

Definition and properties

A Restricted Boltzmann Machine (RBM) is a particular form of the Product of Experts model (Hinton, 1999, 2002) which is also a Boltzmann Machine (Ackley, Hinton, & Sejnowski, 1985) with a bipartite connectivity graph. An RBM with n hidden units is a parametric model of the joint distribution between hidden variables h_i (explanatory factors, collected in vector h) and observed variables v_j (the example, collected in vector v), of the form

p(v, h) ∝ exp(−E(v, h)) = e^(h^T W v + b^T v + c^T h)

with parameters θ = (W, b, c) and v_j, h_i ∈ {0, 1}. E(v, h) is called the energy of the state (v, h). We consider here the simpler case of binary units. It is straightforward to show that P(v|h) = ∏_j P(v_j|h) and P(v_j = 1|h) = sigm(b_j + Σ_i W_ij h_i), where sigm is the sigmoid function defined as sigm(x) = 1/(1 + exp(−x)), and that P(h|v) has a similar form: P(h|v) = ∏_i P(h_i|v) and P(h_i = 1|v) = sigm(c_i + Σ_j W_ij v_j). Although the marginal distribution p(v) is not tractable, it can be easily computed up to a normalizing constant.
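
To make the conditionals and the unnormalized marginal concrete, here is a small sketch (ours, in Python/NumPy; the sizes and random parameters are only illustrative and not taken from the paper):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigm(x):
        """Logistic sigmoid, sigm(x) = 1 / (1 + exp(-x))."""
        return 1.0 / (1.0 + np.exp(-x))

    # A small binary RBM.  Energy: E(v, h) = -(h^T W v + b^T v + c^T h),
    # so that p(v, h) is proportional to exp(-E(v, h)).
    d, n = 4, 3                                  # illustrative sizes
    W = 0.1 * rng.standard_normal((n, d))        # W[i, j] connects h_i and v_j
    b = np.zeros(d)                              # visible biases
    c = np.zeros(n)                              # hidden biases

    def p_h_given_v(v):
        """Vector of P(h_i = 1 | v) = sigm(c_i + sum_j W_ij v_j)."""
        return sigm(c + W @ v)

    def p_v_given_h(h):
        """Vector of P(v_j = 1 | h) = sigm(b_j + sum_i W_ij h_i)."""
        return sigm(b + W.T @ h)

    def unnormalized_p_v(v):
        """p(v) up to the (intractable) normalizing constant, using
        sum_h exp(h^T W v + b^T v + c^T h) = exp(b^T v) * prod_i (1 + exp(c_i + (W v)_i))."""
        return np.exp(b @ v) * np.prod(1.0 + np.exp(c + W @ v))

    v = rng.integers(0, 2, size=d).astype(float)
    print("P(h = 1 | v):", p_h_given_v(v))
    print("P(v = 1 | h = 1):", p_v_given_h(np.ones(n)))
    print("p(v) up to a constant:", unnormalized_p_v(v))

Summing the joint over h factorizes across hidden units, which is why the last function can evaluate p(v) up to the normalizing constant without enumerating hidden configurations.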

Furthermore, one can also sample from the model distribution using Gibbs sampling. Consider a Monte-Carlo Markov chain (MCMC) initialized with v sampled from the empirical data distribution (distribution denoted p_0). After sampling h from P(h|v), sample v from P(v|h), which follows a distribution denoted p_1. After k such steps we have samples from p_k, and the model's generative distribution is p_∞ (due to convergence of the Gibbs MCMC).

Training and Contrastive Divergence

Carreira-Perpiñan and Hinton (2005) showed that the derivative of the log-likelihood of the data under the RBM with respect to the parameters is

∂ log p(v) / ∂θ = − ⟨ ∂E(v, h)/∂θ ⟩_0 + ⟨ ∂E(v, h)/∂θ ⟩_∞    (1)

where averaging is over both h and v, ⟨·⟩_0 denotes an average with respect to p_0 (the data distribution) multiplied by P(h|v), and ⟨·⟩_∞ denotes an average with respect to p_∞ (the model distribution: p_∞(v, h) = p(v, h)). Since computing the average over the true model distribution is intractable, Hinton et al. (2006) use an approximation of that derivative called contrastive divergence (Hinton, 1999, 2002): one replaces the average ⟨·⟩_∞ with ⟨·⟩_k for relatively small values of k. For example, in Hinton et al. (2006), Hinton and Salakhutdinov (2006), Bengio et al. (2007) and Salakhutdinov and Hinton (2007), one uses k = 1 with great success. The average over v's from p_0 is replaced by a sample from the empirical distribution (this is the usual stochastic gradient sampling trick), and the average over v's from p_1 is replaced by a single sample from the Markov chain. The resulting gradient estimator involves only very simple computations; for the case of binary units, the gradient estimator on weight W_ij is simply P(h_i = 1|v) v_j − P(h_i = 1|ṽ) ṽ_j, where ṽ is a sample from p_1 and v is the input example that starts the chain. The procedure can easily be generalized to input or hidden units that are not binary (e.g., Gaussian or exponential, for continuous-valued units) (Welling et al., 2005; Bengio et al., 2007).
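
In code, the CD-1 estimator just described amounts to one stochastic up-down pass followed by an outer-product update. A minimal sketch (ours; the learning rate and sizes are illustrative, and the bias updates are an assumption since the text only spells out the update for W):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    d, n = 6, 4                               # illustrative sizes
    W = 0.01 * rng.standard_normal((n, d))
    b, c = np.zeros(d), np.zeros(n)           # visible and hidden biases

    def cd1_step(v0, lr=0.1):
        """One contrastive-divergence (k = 1) update from a single training vector v0."""
        global W, b, c
        # Up: sample h0 from P(h | v0).
        ph0 = sigm(c + W @ v0)
        h0 = (rng.random(n) < ph0).astype(float)
        # Down: sample v1 from P(v | h0), then recompute the hidden probabilities.
        pv1 = sigm(b + W.T @ h0)
        v1 = (rng.random(d) < pv1).astype(float)
        ph1 = sigm(c + W @ v1)
        # Gradient estimator of the text for W_ij: P(h_i = 1 | v0) v0_j - P(h_i = 1 | v1) v1_j.
        W += lr * (np.outer(ph0, v0) - np.outer(ph1, v1))
        # Analogous (standard) bias updates (an assumption, not spelled out in the text).
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)

    # Usage: one sweep over a tiny random binary dataset.
    data = rng.integers(0, 2, size=(10, d)).astype(float)
    for v in data:
        cd1_step(v)
    print(W[:2, :3])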

2 RBMs are Universal Approximators

We will now prove that RBMs with a data-selected number of hidden units become non-parametric and possess universal approximation properties, relating them closely to classical multilayer neural networks, but in the context of probabilistic unsupervised learning of an input distribution.

2.1 Better Model with Increasing Number of Units

We show below that when the number of hidden units of an RBM is increased, there are weight values for the new units that guarantee improvement in the training log-likelihood or, equivalently, in the KL divergence between the data distribution p_0 and the model distribution p = p_∞. These are equivalent since

KL(p_0 ‖ p) = Σ_v p_0(v) log( p_0(v) / p(v) ) = −H(p_0) − (1/N) Σ_{i=1}^{N} log p(v^(i))

when p_0 is the empirical distribution, with v^(i) the i-th training vector and N the number of training vectors.

Consider the objective of approximating an arbitrary distribution p_0 with an RBM. Let p denote the distribution over visible units obtained with an RBM that has n hidden units, and p_{w,c} denote the input distribution obtained when adding a hidden unit with weights w and bias c to that RBM. The RBM with this extra unit has the same weights and biases for all other hidden units, and the same input biases.

Lemma 2.1. Let R_p be the equivalence class containing the RBMs whose associated marginal distribution over the visible units is p. The operation of adding a hidden unit to an RBM of R_p preserves the equivalence class. Thus, the set of RBMs composed of an RBM of R_p and an additional hidden unit is also an equivalence class (meaning that all the RBMs of this set have the same marginal distribution over visible units).

Proof in appendix.

R_p will be used here to denote any RBM in this class. We also define R_{p_{w,c}} as the set of RBMs obtained by adding a hidden unit with weight w and bias c to an RBM from R_p, and p_{w,c} the associated marginal distribution over the visible units. As demonstrated in the above lemma, this does not depend on which particular RBM from R_p we choose.

We then wish to prove that, regardless of p and p_0, if p ≠ p_0, there exists a pair (w, c) such that KL(p_0 ‖ p_{w,c}) < KL(p_0 ‖ p), i.e., that one can improve the approximation of p_0 by inserting an extra hidden unit with weight vector w and bias c. We will first state a trivial lemma needed for the rest of the proof. It says that inserting a unit with bias c = −∞ does not change the input distribution associated with the RBM.

Lemma 2.2. Let p be the distribution over binary vectors v in {0, 1}^d obtained with an RBM R_p, and let p_{w,c} be the distribution obtained when adding a hidden unit with weights w and bias c to R_p. Then

∀p, ∀w ∈ R^d,  p = p_{w,−∞}

Proof. Denoting h̃ = [h, h_{n+1}] (h extended with the new unit), W̃ = [W; w^T] (W extended with the row w^T) and C̃ = [C; c], where w^T denotes the transpose of w, and introducing z(v, h) = exp(h^T W v + B^T v + C^T h), we can express p(v, h) and p_{w,c}(v, h̃) as follows:

p(v, h) ∝ z(v, h)
p_{w,c}(v, h̃) ∝ exp(h̃^T W̃ v + B^T v + C̃^T h̃) = z(v, h) exp(h_{n+1} w^T v + c h_{n+1})

If c = −∞, p_{w,c}(v, h̃) = 0 whenever h_{n+1} = 1. Thus, we can discard all terms where h_{n+1} = 1, keeping only those where h_{n+1} = 0. Marginalizing over the hidden units, we have:

p(v) = Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0)

p_{w,−∞}(v) = Σ_{h̃} z(v, h) exp(h_{n+1} w^T v + c h_{n+1}) / Σ_{v^0, h̃^0} z(v^0, h^0) exp(h^0_{n+1} w^T v^0 + c h^0_{n+1})
            = Σ_h z(v, h) exp(0) / Σ_{v^0, h^0} z(v^0, h^0) exp(0) = p(v)

We now state the main theorem.

Theorem 2.3. Let p_0 be an arbitrary distribution over {0, 1}^n and let R_p be an RBM with marginal distribution p over the visible units such that KL(p_0 ‖ p) > 0. Then there exists an RBM R_{p_{w,c}}, composed of R_p and an additional hidden unit with parameters (w, c), whose marginal distribution p_{w,c} over the visible units achieves KL(p_0 ‖ p_{w,c}) < KL(p_0 ‖ p).

Proof in appendix.

2.2 A Huge Model can Represent Any Distribution

The second set of results is for the limit case when the number of hidden units is very large, so that we can represent any discrete distribution exactly.

Theorem 2.4. Any distribution over {0, 1}^n can be approximated arbitrarily well (in the sense of the KL divergence) with an RBM with k + 1 hidden units, where k is the number of input vectors whose probability is not 0.

Proof sketch (universal approximator property). We constructively build an RBM with as many hidden units as the number of input vectors whose probability is strictly positive. Each hidden unit will be assigned to one input vector. Namely, when v_i is the visible units vector, all hidden units have a probability 0 of being on except the one corresponding to v_i, which has a probability sigm(λ_i) of being on. The value of λ_i is directly tied to p(v_i). On the other hand, when all hidden units are off but the i-th one, p(v_i|h) = 1. With probability 1 − sigm(λ_i), all the hidden units are turned off, which yields independent draws of the visible units. The proof consists in finding the appropriate weights (and values λ_i) to yield that behaviour.

Proof in appendix.
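
Lemma 2.2 (and the tractability of p(v) up to a constant) can be checked numerically on a tiny RBM by brute-force enumeration. The sketch below (ours; a bias of −50 stands in for c = −∞, and the model sizes are arbitrary) verifies that the added unit leaves the marginal over the visible units unchanged:

    import itertools
    import numpy as np

    rng = np.random.default_rng(1)

    def exact_marginal(W, b, c):
        """Exact p(v) of a binary RBM by enumerating every (v, h) configuration."""
        n, d = W.shape
        vs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
        hs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
        # z(v, h) = exp(h^T W v + b^T v + c^T h)
        z = np.exp(hs @ W @ vs.T + (vs @ b)[None, :] + (hs @ c)[:, None])
        p_v = z.sum(axis=0)
        return vs, p_v / p_v.sum()

    d, n = 3, 2
    W = rng.standard_normal((n, d))
    b = rng.standard_normal(d)
    c = rng.standard_normal(n)

    vs, p = exact_marginal(W, b, c)

    # Add a hidden unit with arbitrary weights w and a very negative bias (c -> -infinity).
    w = rng.standard_normal(d)
    W2 = np.vstack([W, w])
    c2 = np.append(c, -50.0)
    _, p_wc = exact_marginal(W2, b, c2)

    print("max |p - p_{w,c}| =", np.max(np.abs(p - p_wc)))   # ~0, as Lemma 2.2 states

The same enumeration routine is also a convenient way to inspect the constructions of Theorems 2.3 and 2.4 on small models.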

3 Representational power of Deep Belief Networks

3.1 Background on Deep Belief Networks

A DBN with l layers models the joint distribution between observed variables v_j and l hidden layers h^(k), k = 1, ..., l, made of binary units h_i^(k) (here all binary variables), as follows:

p(v, h^(1), h^(2), ..., h^(l)) = P(v|h^(1)) P(h^(1)|h^(2)) ... P(h^(l−2)|h^(l−1)) p(h^(l−1), h^(l))

Denoting v = h^(0), b^(k) the bias vector of layer k and W^(k) the weight matrix between layer k and layer k + 1, we have:

P(h^(k)|h^(k+1)) = ∏_i P(h_i^(k)|h^(k+1))   (factorial conditional distribution)
P(h_i^(k) = 1|h^(k+1)) = sigm( b_i^(k) + Σ_j W_ij^(k) h_j^(k+1) )    (2)

and p(h^(l−1), h^(l)) is an RBM.

The original motivation found in Hinton et al. (2006) for having a deep network versus a single hidden layer (i.e., a DBN versus an RBM) was that the representational power of an RBM would be too limited and that more capacity could be achieved by having more hidden layers. However, we have found here that an RBM with enough hidden units can model any discrete distribution. Another motivation for deep architectures is discussed in Bengio and Le Cun (2007) and Bengio et al. (2007): deep architectures can represent functions much more efficiently (in terms of number of required parameters) than shallow ones. In particular, theoretical results on circuit complexity theory prove that shallow digital circuits can be exponentially less efficient than deeper ones (Ajtai, 1983; Hastad, 1987; Allender, 1996). Hence the original motivation (Hinton et al., 2006) was probably right when one considers the restriction to reasonably sized models.

3.2 Trying to Anticipate a High-Capacity Top Layer

In the greedy training procedure of Deep Belief Networks proposed in (Hinton et al., 2006), one layer is added on top of the network at each stage, and only that top layer is trained (as an RBM; see Figure 1). In that greedy phase, one does not take into account the fact that other layers will be added next. Indeed, while trying to optimize the weights of an RBM, we restrict the marginal distribution over its hidden units to be the one induced by the RBM. On the contrary, when we add a new layer, that distribution (which is the marginal distribution over the visible units of the new RBM) does not have that restriction (but another one, which is to be representable by an RBM of a given size). Thus, we might be able to better optimize the weights of the RBM, knowing that the marginal distribution over the hidden units will have more freedom when extra layers are added. This would lead to an alternative training criterion for DBNs.
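
To sample from the DBN joint written above, one runs a few Gibbs steps in the top RBM p(h^(l−1), h^(l)) and then propagates a sample down through the factorial conditionals of eq. (2). A sketch for a 2-layer DBN (ours; the layer sizes, random weights and number of Gibbs steps are placeholders):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))
    bern = lambda p: (rng.random(p.shape) < p).astype(float)

    # A 2-layer DBN (l = 2): v = h^(0), hidden layers h^(1) and h^(2).
    sizes = [8, 6, 4]                       # units in h^(0), h^(1), h^(2)
    # W[k] is the weight matrix between layer k and layer k + 1, as in eq. (2).
    W = [0.1 * rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(2)]
    b = [np.zeros(s) for s in sizes]        # biases b^(0), b^(1), b^(2)

    def sample_dbn(gibbs_steps=50):
        """Ancestral sample: Gibbs in the top RBM p(h^(1), h^(2)), then one top-down pass."""
        h1 = bern(0.5 * np.ones(sizes[1]))
        for _ in range(gibbs_steps):        # approximate sample from the top RBM
            h2 = bern(sigm(b[2] + W[1].T @ h1))
            h1 = bern(sigm(b[1] + W[1] @ h2))
        # Factorial conditional of eq. (2): P(h_i^(0) = 1 | h^(1)).
        v = bern(sigm(b[0] + W[0] @ h1))
        return v

    print(sample_dbn())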

Figure 1: Greedy learning of an RBM. After each RBM has been trained, the weights are frozen and a new layer is added. The new layer is trained as an RBM. [Panels: (a) Stage 1, (b) Stage 2, (c) Stage 3.]

Consider a 2-layer DBN (l = 2), that is, with three layers in total. To train the weights between h^(1) and h^(2) (see Figure 1), the greedy strategy maximizes a lower bound on the likelihood of the data (instead of the likelihood itself), called the variational bound (Hinton et al., 2006):

log p(v) ≥ Σ_{h^(1)} Q(h^(1)|v) [ log p(h^(1)) + log P(v|h^(1)) ] − Σ_{h^(1)} Q(h^(1)|v) log Q(h^(1)|v)    (3)

where Q(h^(1)|v) is the posterior on hidden units h^(1) given visible vector v, according to the first RBM model, and is determined by W^(1); it is the assumed distribution used in the variational bound on the DBN likelihood. p(h^(1)) is the marginal distribution over h^(1) in the DBN (thus induced by the second RBM, between h^(1) and h^(2)). P(v|h^(1)) is the posterior over v given h^(1), in the DBN and in the first RBM, and is determined by W^(1).

Once the weights of the first layer (W^(1)) are frozen, the only element that can be optimized is p(h^(1)). We can show that there is an analytic formulation for the distribution p*(h^(1)) that maximizes this variational bound:

p*(h^(1)) = Σ_v p_0(v) Q(h^(1)|v)    (4)

where p_0 is the empirical distribution of input examples.
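
For a small first-level RBM, both the optimal p*(h^(1)) of eq. (4) and the one-up-down-pass distribution p_1 used in Proposition 3.1 below can be computed exactly by enumeration. The sketch below (ours; a random toy p_0 over 3-bit vectors stands in for real data) also checks that the resulting DBN marginal equals p_1:

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    d, n = 3, 2                                   # visible / hidden sizes of the first RBM
    W1 = rng.standard_normal((n, d))
    b1, c1 = rng.standard_normal(d), rng.standard_normal(n)

    vs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)   # all v
    hs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)   # all h^(1)

    # A toy empirical distribution p_0 over the 2^d visible vectors.
    p0 = rng.random(len(vs)); p0 /= p0.sum()

    def Q(h, v):
        """Factorial posterior Q(h^(1) | v) of the first RBM."""
        q = sigm(c1 + W1 @ v)
        return np.prod(np.where(h == 1, q, 1 - q))

    def P_v_given_h(v, h):
        """Factorial conditional P(v | h^(1)) of the first RBM."""
        q = sigm(b1 + W1.T @ h)
        return np.prod(np.where(v == 1, q, 1 - q))

    # Eq. (4): p*(h^(1)) = sum_v p_0(v) Q(h^(1) | v).
    p_star = np.array([sum(p0[i] * Q(h, vs[i]) for i in range(len(vs))) for h in hs])

    # A 2-layer DBN whose top RBM achieves p* has marginal p(v) = sum_h P(v | h) p*(h),
    # which equals p_1, the distribution after one up-down pass from p_0.
    p_dbn = np.array([sum(P_v_given_h(v, hs[k]) * p_star[k] for k in range(len(hs)))
                      for v in vs])
    p_1 = np.array([sum(p0[i] * Q(hs[k], vs[i]) * P_v_given_h(v, hs[k])
                        for i in range(len(vs)) for k in range(len(hs))) for v in vs])

    print("p* sums to", p_star.sum())
    print("max |p_dbn - p_1| =", np.max(np.abs(p_dbn - p_1)))    # ~0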

One can sample from p*(h^(1)) by first randomly sampling a v from the empirical distribution and then propagating it stochastically through Q(h^(1)|v). Using Theorem 2.4, there exists an RBM that can approximate this optimal distribution p*(h^(1)) arbitrarily well.

Using an RBM that achieves this optimal p*(h^(1)) (optimal in terms of the variational bound, but not necessarily with respect to the likelihood), we can determine the distribution represented by the DBN. Let p_1 be the distribution one obtains when starting from p_0 clamped on the visible units of the lower layer (v), sampling the hidden units h^(1) given v and then sampling a v given h^(1).

Proposition 3.1. In a 2-layer DBN, using a second-layer RBM achieving p*(h^(1)), the model distribution p is equal to p_1. This is equivalent to making one up-down pass in the first RBM trained.

Proof. We can write the marginal p*(h^(1)) by summing over values v^0: p*(h^(1)) = Σ_{v^0} p_0(v^0) Q(h^(1)|v^0). Thus, the probability of the data under the 2-layer DBN when the top-layer RBM achieves p*(h^(1)) is

p(v^(0)) = Σ_{h^(1)} P(v^(0)|h^(1)) p*(h^(1))    (5)
         = Σ_{v^0} p_0(v^0) Σ_{h^(1)} Q(h^(1)|v^0) P(v^(0)|h^(1))
p(v^(0)) = p_1(v^(0))    (6)

The last line can be seen to be true by considering the stochastic process of first picking a v^0 from the empirical distribution p_0, then sampling an h^(1) from Q(h^(1)|v^0), and finally computing the probability of v^(0) under P(v^(0)|h^(1)) for that h^(1).

Proposition 3.1 tells us that, even with the best possible model for p(h^(1), h^(2)) according to the variational bound (i.e., the model that can achieve p*(h^(1))), we obtain a KL divergence between the DBN and the data equal to KL(p_0 ‖ p_1). Hence, if we train the 2nd level RBM to model the stochastic output of the 1st level RBM (as suggested in Hinton et al. (2006)), the best KL(p_0 ‖ p) we can achieve with model p of the 2-level DBN cannot be better than KL(p_0 ‖ p_1). Note that this result does not preclude that a better likelihood could be achieved with p if a better criterion is used to train the 2nd level RBM.

For KL(p_0 ‖ p_1) to be 0, one should have p_0 = p_1. Note that a weight vector with this property would not only be a fixed point of KL(p_0 ‖ p_1) but also of the likelihood and of contrastive divergence for the first-level RBM. p_0 = p_1 could have been obtained with a one-level DBN (i.e., a single RBM) that perfectly fits the data. This can happen when the first RBM has infinite weights, i.e., is deterministic, and just encodes v^(0) = v in h^(1) perfectly.

In that case the second layer h^(2) seems useless. Does that mean that adding layers is useless? We believe the answer is no; first, even though having the distribution that maximizes the variational bound yields p = p_1, this does not mean that we cannot achieve KL(p_0 ‖ p) < KL(p_0 ‖ p_1) with a 2-layer DBN (though we have no proof that it can be achieved either). Indeed, since the variational bound is not the quantity we truly want to optimize, another criterion might lead to a better model (in terms of the likelihood of the data). Besides that, even if adding layers does not allow us to perfectly fit the data (which might actually only be the case when we optimize the variational bound rather than the likelihood), the distribution of the 2-layer DBN is closer to the empirical distribution than is the first-layer RBM (we do only one up-down Gibbs step instead of doing an infinite number of such steps). Furthermore, the extra layers allow us to regularize and hopefully obtain a representation in which even a very high capacity top layer (e.g., a memory-based non-parametric density estimator) could generalize well.

This approach suggests using alternative criteria to train DBNs, criteria that approximate KL(p_0 ‖ p_1) and can be computed before h^(2) is added, but, unlike contrastive divergence, take into account the fact that more layers will be added later. Note that computing KL(p_0 ‖ p_1) exactly is intractable in an RBM because it involves summing over all possible values of the hidden vector h. One could use a sampling or mean-field approximation (replacing the summation over values of the hidden unit vector by a sample or a mean-field value), but even then there would remain a double sum over examples:

− (1/N) Σ_{i=1}^{N} log [ (1/N) Σ_{j=1}^{N} P̂(V_1 = v_i | V_0 = v_j) ]

where v_i denotes the i-th example and P̂(V_1 = v_i | V_0 = v_j) denotes an estimator of the probability of observing V_1 = v_i at iteration 1 of the Gibbs chain (that is, after one up-down pass) given that the chain is started from V_0 = v_j. We write P̂ rather than P because computing P exactly might involve an intractable summation over all possible values of h. In Bengio et al. (2007), the reconstruction error for training an autoencoder corresponding to one layer of a deep network is −log P̂(V_1 = v_i | V_0 = v_i). Hence −log P̂(V_1 = v_i | V_0 = v_j) is like a reconstruction error when one tries to reconstruct or predict v_i according to P(v|h) when starting from v_j, sampling h from P(h|v_j). This criterion is essentially the same as one already introduced in a different context in (Bengio, Larochelle, & Vincent, 2006b), where P̂(V_1 = v_i | V_0 = v_j) is computed deterministically (no hidden random variable is involved), and the inner sum (over the v_j's) is approximated by using only the 5 nearest neighbours of v_i in the training set. However, the overall computation time in (Bengio et al., 2006b) is O(N^2) because, like most non-parametric learning algorithms, it involves comparing all training examples with each other. In contrast, the contrastive divergence gradient estimator can be computed in O(N) for a training set of size N.
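
The double sum above translates directly into an O(N^2) estimator of the data-dependent part of KL(p_0 ‖ p_1) (the cross-entropy −E_{p_0}[log p_1]). A sketch (ours; here P̂ uses a single sampled hidden vector per starting example v_j, i.e. the sampling approximation mentioned in the text, and the model sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

    def neg_loglik_p1(data, W, b, c):
        """Sampling estimator of -(1/N) sum_i log[(1/N) sum_j P_hat(V1 = v_i | V0 = v_j)],
        the data-dependent part of KL(p0 || p1).  O(N^2) in the number of examples."""
        N, d = data.shape
        n = W.shape[0]
        # One sampled hidden vector per starting example v_j.
        H = (rng.random((N, n)) < sigm(c + data @ W.T)).astype(float)
        PV = sigm(b + H @ W)                     # PV[j] = P(v = 1 | h_j), shape (N, d)
        total = 0.0
        for i in range(N):
            vi = data[i]
            # P_hat(V1 = v_i | V0 = v_j) for every j, under the factorial P(v | h_j).
            probs = np.prod(np.where(vi == 1, PV, 1 - PV), axis=1)
            total += np.log(probs.mean() + 1e-12)
        return -total / N

    # Usage on a toy binary dataset and a small random RBM.
    d, n, N = 10, 5, 40
    W = 0.1 * rng.standard_normal((n, d))
    b, c = np.zeros(d), np.zeros(n)
    data = rng.integers(0, 2, size=(N, d)).astype(float)
    print("estimated -E_{p0}[log p1] =", neg_loglik_p1(data, W, b, c))

Minimizing a quantity of this kind with respect to the first RBM's parameters is the sort of criterion evaluated in the experiment below.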

To evaluate whether tractable approximations of KL(p_0 ‖ p_1) would be worth investigating, we performed an experiment on a toy dataset and toy model where the computations are feasible. The data are 10-element bit vectors with patterns of 1, 2 or 3 consecutive ones (or zeros) against a background of zeros (or ones), demonstrating simple shift invariance. There are 60 possible examples (p_0), 40 of which are randomly chosen to train first an RBM with 5 binomial hidden units, and then a 2-layer DBN. The remaining 20 are a test set. The second RBM has 10 hidden units (so that we could guarantee improvement of the likelihood by the addition of the second layer). The first RBM is either trained by contrastive divergence or to minimize KL(p_0 ‖ p_1), using gradient descent and a learning rate of 0.1 for 500 epochs (parameters are updated after each epoch). Other learning rates and random initialization seeds gave similar results, diverged, or converged slower. The second RBM is then trained for the same number of epochs, by contrastive divergence with the same learning rate.

Figure 2: KL divergence, as a function of the number of epochs after adding the 2nd level RBM, between the empirical distribution p_0 (either training or test set) and (top curves) a DBN trained greedily with contrastive divergence at each layer, or (bottom curves) a DBN trained greedily with KL(p_0 ‖ p_1) on the 1st layer and contrastive divergence on the 2nd stage. [Curves: CD-DBN(train), KLp0p1-DBN(train), CD-DBN(test), KLp0p1-DBN(test); vertical axis: KL divergence to p_0.]

Figure 2 shows the exact KL(p_0 ‖ p) of the DBN p while training the 2nd RBM. The advantage of the KL(p_0 ‖ p_1) training is clear. This suggests that future research should investigate tractable approximations of KL(p_0 ‖ p_1).
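
For concreteness, the 60-pattern toy dataset described above can be generated as follows (a sketch under our reading that the runs of identical bits may wrap around the 10-bit vector, which is what yields exactly 60 distinct patterns; the paper does not spell this detail out):

    import numpy as np

    def toy_patterns(d=10, max_run=3):
        """d-bit vectors with a run of 1..max_run consecutive ones (or zeros),
        wrapping around, on a background of zeros (or ones)."""
        patterns = set()
        for run in range(1, max_run + 1):
            for start in range(d):
                v = np.zeros(d, dtype=int)
                v[[(start + k) % d for k in range(run)]] = 1
                patterns.add(tuple(v))            # run of ones on zeros
                patterns.add(tuple(1 - v))        # run of zeros on ones
        return np.array(sorted(patterns))

    data = toy_patterns()
    print(data.shape)        # (60, 10)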

3.3 Open Questions on DBN Representational Power

The results described in the previous section were motivated by the following question: since an RBM can represent any distribution, what can be gained by adding layers to a DBN, in terms of representational power? More formally, let R_l^n be a Deep Belief Network with l + 1 hidden layers, each of them composed of n units. Can we say something about the representational power of R_l^n as l increases? Denoting D_l^n the set of distributions one can obtain with R_l^n, it follows from the unfolding argument in Hinton et al. (2006) that D_l^n ⊆ D_{l+1}^n. The unfolding argument shows that the last layer of an l-layer DBN corresponds to an infinite directed graphical model with tied weights. By untying the weights in the (l + 1)-th RBM of this construction from those above, we obtain an (l + 1)-layer DBN. Hence every element of D_l^n can be represented in D_{l+1}^n. Two questions remain: do we have D_l^n ⊊ D_{l+1}^n, at least for l = 1? What is D_∞^n?

4 Conclusions

We have shown that when the number of hidden units is allowed to vary, Restricted Boltzmann Machines are very powerful and can approximate any distribution, eventually representing it exactly when the number of hidden units is allowed to become very large (possibly 2 to the power of the number of inputs). This only says that parameter values exist for doing so, but it does not prescribe how to obtain them efficiently. In addition, the above result is only concerned with the case of discrete inputs. It remains to be shown how to extend that type of result to the case of continuous inputs.

Restricted Boltzmann Machines are interesting chiefly because they are the building blocks of Deep Belief Networks, which can have many layers and can theoretically be much more efficient at representing complicated distributions (Bengio & Le Cun, 2007). We have introduced open questions about the expressive power of Deep Belief Networks. We have not answered these questions, but in trying to do so, we obtained an apparently puzzling result concerning Deep Belief Networks: the best that can be achieved by adding a second layer (with respect to some bound) is limited by the first layer's ability to map the data distribution to something close to itself (KL(p_0 ‖ p_1)), and this ability is good when the first layer is large and models the data well. So why do we need the extra layers? We believe that the answer lies in the ability of a Deep Belief Network to generalize better by having a more compact representation. This analysis also suggests investigating KL(p_0 ‖ p_1) (or an efficient approximation of it) as a less greedy alternative to contrastive divergence for training each layer, because it would take into account that more layers will be added.

Acknowledgements

The authors would like to thank the following funding organizations for support: NSERC, MITACS, and the Canada Research Chairs. They are also grateful for the help and comments from Olivier Delalleau and Aaron Courville.

References

Ackley, D., Hinton, G., & Sejnowski, T. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9.

Ajtai, M. (1983). Σ¹₁-formulae on finite structures. Annals of Pure and Applied Logic, 24(1), 1-48.

Allender, E. (1996). Circuit complexity before the dawn of the new millennium. In 16th Annual Conference on Foundations of Software Technology and Theoretical Computer Science, Lecture Notes in Computer Science.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy layer-wise training of deep networks. In Schölkopf, B., Platt, J., & Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. MIT Press.

Bengio, Y., Delalleau, O., & Le Roux, N. (2006a). The curse of highly variable functions for local kernel machines. In Weiss, Y., Schölkopf, B., & Platt, J. (Eds.), Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA.

Bengio, Y., Larochelle, H., & Vincent, P. (2006b). Non-local manifold Parzen windows. In Weiss, Y., Schölkopf, B., & Platt, J. (Eds.), Advances in Neural Information Processing Systems 18. MIT Press.

Bengio, Y., & Le Cun, Y. (2007). Scaling learning algorithms towards AI. In Bottou, L., Chapelle, O., DeCoste, D., & Weston, J. (Eds.), Large Scale Kernel Machines. MIT Press.

Carreira-Perpiñan, M., & Hinton, G. (2005). On contrastive divergence learning. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Jan 6-8, 2005, Savannah Hotel, Barbados.

Hastad, J. T. (1987). Computational Limitations for Small Depth Circuits. MIT Press.

Hinton, G. E., Osindero, S., & Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18.

Hinton, G. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14.

Hinton, G., & Salakhutdinov, R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786).

Hinton, G. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN), Vol. 1.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2.

Ranzato, M., Poultney, C., Chopra, S., & LeCun, Y. (2007). Efficient learning of sparse representations with an energy-based model. In Schölkopf, B., Platt, J., & Hoffman, T. (Eds.), Advances in Neural Information Processing Systems 19. MIT Press.

Salakhutdinov, R., & Hinton, G. (2007). Learning a nonlinear embedding by preserving class neighbourhood structure. To appear in Proceedings of AISTATS 2007.

Tesauro, G. (1992). Practical issues in temporal difference learning. Machine Learning, 8.

Welling, M., Rosen-Zvi, M., & Hinton, G. (2005). Exponential family harmoniums with an application to information retrieval. In Saul, L., Weiss, Y., & Bottou, L. (Eds.), Advances in Neural Information Processing Systems 17. MIT Press.

5 Appendix

5.1 Proof of Lemma 2.1

Proof. Denoting h̃ = [h, h_{n+1}], W̃ = [W; w^T] and C̃ = [C; c], where w^T denotes the transpose of w, and introducing z(v, h) = exp(h^T W v + B^T v + C^T h), we can express p(v, h) and p_{w,c}(v, h̃) as follows:

p(v, h) ∝ z(v, h)
p_{w,c}(v, h̃) ∝ exp(h̃^T W̃ v + B^T v + C̃^T h̃) = z(v, h) exp(h_{n+1} w^T v + c h_{n+1})

Expanding the expression of p_{w,c}(v) and regrouping the terms similar to the expression of p(v), we get:

p_{w,c}(v) = Σ_{h̃} exp(h^T W v + h_{n+1} w^T v + B^T v + C^T h + c h_{n+1}) / Σ_{v^0, h̃^0} exp((h^0)^T W v^0 + h^0_{n+1} w^T v^0 + B^T v^0 + C^T h^0 + c h^0_{n+1})
           = Σ_h z(v, h) (1 + exp(w^T v + c)) / Σ_{v^0, h^0} z(v^0, h^0) (1 + exp(w^T v^0 + c))
           = (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0} (1 + exp(w^T v^0 + c)) Σ_{h^0} z(v^0, h^0)

But Σ_h z(v, h) = p(v) K with K = Σ_{v,h} z(v, h). Thus,

p_{w,c}(v) = (1 + exp(w^T v + c)) p(v) / Σ_{v^0} (1 + exp(w^T v^0 + c)) p(v^0)

which does not depend on our particular choice of R_p (since it only depends on p).

5.2 Proof of Theorem 2.3

Proof. Expanding the expression of p_{w,c}(v) and regrouping the terms similar to the expression of p(v), we get:

p_{w,c}(v) = Σ_{h̃} exp(h^T W v + h_{n+1} w^T v + B^T v + C^T h + c h_{n+1}) / Σ_{v^0, h̃^0} exp((h^0)^T W v^0 + h^0_{n+1} w^T v^0 + B^T v^0 + C^T h^0 + c h^0_{n+1})
           = Σ_h z(v, h) (1 + exp(w^T v + c)) / Σ_{v^0, h^0} z(v^0, h^0) (1 + exp(w^T v^0 + c))
           = (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0)

Therefore, we have:

KL(p_0 ‖ p_{w,c}) = Σ_v p_0(v) log p_0(v) − Σ_v p_0(v) log p_{w,c}(v)
 = −H(p_0) − Σ_v p_0(v) log [ (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0) ]
 = −H(p_0) − Σ_v p_0(v) log(1 + exp(w^T v + c)) − Σ_v p_0(v) log Σ_h z(v, h) + Σ_v p_0(v) log [ Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0) ]

Assuming w^T v + c is a very large negative value for all v, we can use the logarithmic series identity around 0 (log(1 + x) = x + o_{x→0}(x)) for the second and the last term. The second term becomes

− Σ_v p_0(v) log(1 + exp(w^T v + c)) = − Σ_v p_0(v) exp(w^T v + c) + o_c(exp(c))

and the last term becomes

Σ_v p_0(v) log [ Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0) ]
 = log [ Σ_{v^0, h^0} z(v^0, h^0) ] + log [ 1 + Σ_{v^0, h^0} exp(w^T v^0 + c) z(v^0, h^0) / Σ_{v^0, h^0} z(v^0, h^0) ]
 = log Σ_{v^0, h^0} z(v^0, h^0) + Σ_{v^0, h^0} exp(w^T v^0 + c) z(v^0, h^0) / Σ_{v^0, h^0} z(v^0, h^0) + o_c(exp(c))

But

Σ_{v^0, h^0} exp(w^T v^0 + c) z(v^0, h^0) / Σ_{v^0, h^0} z(v^0, h^0) = Σ_v exp(w^T v + c) Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0) = Σ_v exp(w^T v + c) p(v)

Footnote 1: o notation: f(x) = o_{x→+∞}(g(x)) if lim_{x→+∞} f(x)/g(x) exists and equals 0.

Putting all terms back together, we have

KL(p_0 ‖ p_{w,c}) = −H(p_0) − Σ_v p_0(v) exp(w^T v + c) + Σ_v p(v) exp(w^T v + c) + o_c(exp(c)) − Σ_v p_0(v) log Σ_h z(v, h) + log Σ_{v^0, h^0} z(v^0, h^0)
                  = KL(p_0 ‖ p) + Σ_v exp(w^T v + c) (p(v) − p_0(v)) + o_c(exp(c))

Finally, we have

KL(p_0 ‖ p_{w,c}) − KL(p_0 ‖ p) = exp(c) Σ_v exp(w^T v) (p(v) − p_0(v)) + o_c(exp(c))    (7)

The question now becomes: can we find a w such that Σ_v exp(w^T v) (p(v) − p_0(v)) is negative? As p_0 ≠ p, there is a v̂ such that p(v̂) < p_0(v̂). Then there exists a positive scalar a such that ŵ = a (v̂ − ½ e) (with e = [1 ... 1]^T) yields Σ_v exp(ŵ^T v) (p(v) − p_0(v)) < 0. Indeed, for v ≠ v̂, we have

exp(ŵ^T v) / exp(ŵ^T v̂) = exp( ŵ^T (v − v̂) ) = exp( a (v̂ − ½ e)^T (v − v̂) ) = exp( a Σ_i (v̂_i − ½)(v_i − v̂_i) )

For i such that v_i − v̂_i > 0, we have v_i = 1 and v̂_i = 0. Thus, v̂_i − ½ = −½ and the term inside the exponential is negative (since a is positive). For i such that v_i − v̂_i < 0, we have v_i = 0 and v̂_i = 1. Thus, v̂_i − ½ = ½ and the term inside the exponential is also negative. Furthermore, the terms come close to 0 as a goes to infinity. Since the sum can be decomposed as

Σ_v exp(ŵ^T v) (p(v) − p_0(v)) = exp(ŵ^T v̂) Σ_v [exp(ŵ^T v) / exp(ŵ^T v̂)] (p(v) − p_0(v))
 = exp(ŵ^T v̂) [ p(v̂) − p_0(v̂) + Σ_{v ≠ v̂} [exp(ŵ^T v) / exp(ŵ^T v̂)] (p(v) − p_0(v)) ]

we have

Σ_v exp(ŵ^T v) (p(v) − p_0(v))  ∼_{a→+∞}  exp(ŵ^T v̂) (p(v̂) − p_0(v̂)) < 0.

Footnote 2: ∼ notation: f(x) ∼_{x→+∞} g(x) if lim_{x→+∞} f(x)/g(x) exists and equals 1.

Therefore, there is a value â such that, if a > â, Σ_v exp(ŵ^T v) (p(v) − p_0(v)) < 0. This concludes the proof.

5.3 Proof of Theorem 2.4

Proof. In the former proof, we had

p_{w,c}(v) = (1 + exp(w^T v + c)) Σ_h z(v, h) / Σ_{v^0, h^0} (1 + exp(w^T v^0 + c)) z(v^0, h^0)

Let ṽ be an arbitrary input vector and let ŵ be defined in the same way as before, i.e. ŵ = a (ṽ − ½ e). Now define ĉ = −ŵ^T ṽ + λ with λ ∈ R. We have:

lim_{a→+∞} 1 + exp(ŵ^T v + ĉ) = 1   for v ≠ ṽ
1 + exp(ŵ^T ṽ + ĉ) = 1 + exp(λ)

Thus, we can see that, for v ≠ ṽ:

lim_{a→+∞} p_{ŵ,ĉ}(v) = Σ_h z(v, h) / [ Σ_{v^0 ≠ ṽ, h^0} z(v^0, h^0) + (1 + exp(ŵ^T ṽ + ĉ)) Σ_{h^0} z(ṽ, h^0) ]
 = Σ_h z(v, h) / [ Σ_{v^0, h^0} z(v^0, h^0) + exp(λ) Σ_{h^0} z(ṽ, h^0) ]
 = [ Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0) ] · 1 / [ 1 + exp(λ) Σ_{h^0} z(ṽ, h^0) / Σ_{v^0, h^0} z(v^0, h^0) ]

Remembering p(v) = Σ_h z(v, h) / Σ_{v^0, h^0} z(v^0, h^0), we have for v ≠ ṽ:

lim_{a→+∞} p_{ŵ,ĉ}(v) = p(v) / (1 + exp(λ) p(ṽ))    (8)

Similarly, we can see that

lim_{a→+∞} p_{ŵ,ĉ}(ṽ) = (1 + exp(λ)) p(ṽ) / (1 + exp(λ) p(ṽ))    (9)

Depending on the value of λ, one can see that adding a hidden unit allows one to increase the probability of an arbitrary ṽ and to uniformly decrease the probability of every other v by a multiplicative factor. However, one can also see that, if p(ṽ) = 0, then p_{ŵ,ĉ}(ṽ) = 0 for all λ.

We can therefore build the desired RBM as follows. Let us index the v's over the integers from 1 to 2^n and sort them such that

p(v_{k+1}) = ... = p(v_{2^n}) = 0 < p(v_1) ≤ p(v_2) ≤ ... ≤ p(v_k)

Let us denote p^i the distribution of an RBM with i hidden units. We start with an RBM whose weights and biases are all equal to 0. The marginal distribution over the visible units induced by that RBM is the uniform distribution. Thus,

p^0(v_1) = ... = p^0(v_{2^n}) = 2^{−n}

We define w_1 = a_1 (v_1 − ½ e) and c_1 = −w_1^T v_1 + λ_1. As shown before, we now have:

lim_{a_1→+∞} p^1(v_1) = (1 + exp(λ_1)) 2^{−n} / (1 + exp(λ_1) 2^{−n})
lim_{a_1→+∞} p^1(v_i) = 2^{−n} / (1 + exp(λ_1) 2^{−n}),   i ≥ 2

As we can see, we can set p^1(v_1) to a value arbitrarily close to 1, with a uniform distribution over v_2, ..., v_{2^n}. Then, we can choose λ_2 such that p^2(v_2)/p^2(v_1) = p(v_2)/p(v_1). This is possible since we can arbitrarily increase p^2(v_2) while multiplying the other probabilities by a constant factor, and since p(v_2)/p(v_1) ≥ p^1(v_2)/p^1(v_1). We can continue the procedure until obtaining p^k(v_k). The ratio p^i(v_j)/p^i(v_{j−1}) does not depend on the value of i as long as i > j (because at each such step i, the two probabilities are multiplied by the same factor). We will then have

p^k(v_k)/p^k(v_{k−1}) = p(v_k)/p(v_{k−1}), ..., p^k(v_2)/p^k(v_1) = p(v_2)/p(v_1)
p^k(v_{k+1}) = ... = p^k(v_{2^n})

From that, we can deduce that p^k(v_1) = ν_k p(v_1), ..., p^k(v_k) = ν_k p(v_k) with ν_k = 1 − (2^n − k) p^k(v_{2^n}). We also have p^k(v_1)/p^k(v_{2^n}) = p^1(v_1)/p^1(v_{2^n}) = 1 + exp(λ_1). Thus, p^k(v_1) = p(v_1) [1 − (2^n − k) p^k(v_{2^n})] = (1 + exp(λ_1)) p^k(v_{2^n}). Solving the above equations, we have

p^k(v_i) = p(v_1) / [ 1 + exp(λ_1) + p(v_1) (2^n − k) ]   for i > k    (10)
p^k(v_i) = p(v_i) (1 + exp(λ_1)) / [ 1 + exp(λ_1) + p(v_1) (2^n − k) ]   for i ≤ k    (11)

Using the logarithmic series identity around 0 (log(1 + x) = x + o_{x→0}(x)) for KL(p ‖ p^k) when λ_1 goes to infinity, we have

KL(p ‖ p^k) = Σ_i p(v_i) (2^n − k) p(v_1) / (1 + exp(λ_1)) + o_{λ_1→+∞}(exp(−λ_1))    (12)

This concludes the proof.
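
As a numerical sanity check of this construction (ours, not part of the paper), one can add a single hidden unit with ŵ = a (ṽ − ½ e) and ĉ = −ŵ^T ṽ + λ to a tiny RBM and compare the exact marginals against the limits (8) and (9); with a moderately large a the agreement is already very close:

    import itertools
    import numpy as np

    rng = np.random.default_rng(2)

    def exact_marginal(W, b, c):
        """Exact p(v) of a binary RBM by enumeration (tiny models only)."""
        n, d = W.shape
        vs = np.array(list(itertools.product([0, 1], repeat=d)), dtype=float)
        hs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
        z = np.exp(hs @ W @ vs.T + (vs @ b)[None, :] + (hs @ c)[:, None])
        p = z.sum(axis=0)
        return vs, p / p.sum()

    d, n = 3, 2
    W = rng.standard_normal((n, d))
    b, c = rng.standard_normal(d), rng.standard_normal(n)
    vs, p = exact_marginal(W, b, c)

    # Construction of the proof: pick a target vector v_tilde, set
    # w_hat = a (v_tilde - 1/2) and c_hat = -w_hat . v_tilde + lam, with a large.
    v_tilde = vs[5]
    a, lam = 60.0, 1.0
    w_hat = a * (v_tilde - 0.5)
    c_hat = -w_hat @ v_tilde + lam

    _, p_new = exact_marginal(np.vstack([W, w_hat]), b, np.append(c, c_hat))

    i = 5
    denom = 1.0 + np.exp(lam) * p[i]
    print("predicted p_new(v_tilde):", (1.0 + np.exp(lam)) * p[i] / denom,
          "actual:", p_new[i])
    mask = np.arange(len(vs)) != i
    print("predicted others:", (p[mask] / denom)[:2], "actual:", p_new[mask][:2])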


How to Find the Derivative of a Function: Calculus 1 Introduction How to Find te Derivative of a Function: Calculus 1 Calculus is not an easy matematics course Te fact tat you ave enrolled in suc a difficult subject indicates tat you are interested in te

More information

Some Review Problems for First Midterm Mathematics 1300, Calculus 1

Some Review Problems for First Midterm Mathematics 1300, Calculus 1 Some Review Problems for First Midterm Matematics 00, Calculus. Consider te trigonometric function f(t) wose grap is sown below. Write down a possible formula for f(t). Tis function appears to be an odd,

More information

Cubic Functions: Local Analysis

Cubic Functions: Local Analysis Cubic function cubing coefficient Capter 13 Cubic Functions: Local Analysis Input-Output Pairs, 378 Normalized Input-Output Rule, 380 Local I-O Rule Near, 382 Local Grap Near, 384 Types of Local Graps

More information

Analytic Functions. Differentiable Functions of a Complex Variable

Analytic Functions. Differentiable Functions of a Complex Variable Analytic Functions Differentiable Functions of a Complex Variable In tis capter, we sall generalize te ideas for polynomials power series of a complex variable we developed in te previous capter to general

More information

Lecture 10: Carnot theorem

Lecture 10: Carnot theorem ecture 0: Carnot teorem Feb 7, 005 Equivalence of Kelvin and Clausius formulations ast time we learned tat te Second aw can be formulated in two ways. e Kelvin formulation: No process is possible wose

More information

Edge Detection Based on the Newton Interpolation s Fractional Differentiation

Edge Detection Based on the Newton Interpolation s Fractional Differentiation Te International Arab Journal of Information Tecnology, Vol. 11, No. 3, May 014 3 Edge Detection Based on te Newton Interpolation s Fractional Differentiation Caobang Gao 1,, Jiliu Zou, 3, and Weiua Zang

More information

MTH-112 Quiz 1 Name: # :

MTH-112 Quiz 1 Name: # : MTH- Quiz Name: # : Please write our name in te provided space. Simplif our answers. Sow our work.. Determine weter te given relation is a function. Give te domain and range of te relation.. Does te equation

More information

Math 212-Lecture 9. For a single-variable function z = f(x), the derivative is f (x) = lim h 0

Math 212-Lecture 9. For a single-variable function z = f(x), the derivative is f (x) = lim h 0 3.4: Partial Derivatives Definition Mat 22-Lecture 9 For a single-variable function z = f(x), te derivative is f (x) = lim 0 f(x+) f(x). For a function z = f(x, y) of two variables, to define te derivatives,

More information

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist

1. Questions (a) through (e) refer to the graph of the function f given below. (A) 0 (B) 1 (C) 2 (D) 4 (E) does not exist Mat 1120 Calculus Test 2. October 18, 2001 Your name Te multiple coice problems count 4 points eac. In te multiple coice section, circle te correct coice (or coices). You must sow your work on te oter

More information

2.3 Algebraic approach to limits

2.3 Algebraic approach to limits CHAPTER 2. LIMITS 32 2.3 Algebraic approac to its Now we start to learn ow to find its algebraically. Tis starts wit te simplest possible its, and ten builds tese up to more complicated examples. Fact.

More information

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4.

Solution. Solution. f (x) = (cos x)2 cos(2x) 2 sin(2x) 2 cos x ( sin x) (cos x) 4. f (π/4) = ( 2/2) ( 2/2) ( 2/2) ( 2/2) 4. December 09, 20 Calculus PracticeTest s Name: (4 points) Find te absolute extrema of f(x) = x 3 0 on te interval [0, 4] Te derivative of f(x) is f (x) = 3x 2, wic is zero only at x = 0 Tus we only need

More information

arxiv: v3 [cs.lg] 14 Oct 2015

arxiv: v3 [cs.lg] 14 Oct 2015 Learning Deep Generative Models wit Doubly Stocastic MCMC Cao Du Jun Zu Bo Zang Dept. of Comp. Sci. & Tec., State Key Lab of Intell. Tec. & Sys., TNList Lab, Center for Bio-Inspired Computing Researc,

More information

232 Calculus and Structures

232 Calculus and Structures 3 Calculus and Structures CHAPTER 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS FOR EVALUATING BEAMS Calculus and Structures 33 Copyrigt Capter 17 JUSTIFICATION OF THE AREA AND SLOPE METHODS 17.1 THE

More information

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line

Teaching Differentiation: A Rare Case for the Problem of the Slope of the Tangent Line Teacing Differentiation: A Rare Case for te Problem of te Slope of te Tangent Line arxiv:1805.00343v1 [mat.ho] 29 Apr 2018 Roman Kvasov Department of Matematics University of Puerto Rico at Aguadilla Aguadilla,

More information

Math 161 (33) - Final exam

Math 161 (33) - Final exam Name: Id #: Mat 161 (33) - Final exam Fall Quarter 2015 Wednesday December 9, 2015-10:30am to 12:30am Instructions: Prob. Points Score possible 1 25 2 25 3 25 4 25 TOTAL 75 (BEST 3) Read eac problem carefully.

More information

7 Semiparametric Methods and Partially Linear Regression

7 Semiparametric Methods and Partially Linear Regression 7 Semiparametric Metods and Partially Linear Regression 7. Overview A model is called semiparametric if it is described by and were is nite-dimensional (e.g. parametric) and is in nite-dimensional (nonparametric).

More information

Mixing Rates for the Alternating Gibbs Sampler over Restricted Boltzmann Machines and Friends

Mixing Rates for the Alternating Gibbs Sampler over Restricted Boltzmann Machines and Friends Mixing Rates for te Alternating Gibbs Sampler oer Restricted Boltzmann Macines and Friends Cristoper Tos CTOSH@CS.UCSD.EDU Department of Computer Science and Engineering, UC San Diego, 9500 Gilman Dr.,

More information

(a) At what number x = a does f have a removable discontinuity? What value f(a) should be assigned to f at x = a in order to make f continuous at a?

(a) At what number x = a does f have a removable discontinuity? What value f(a) should be assigned to f at x = a in order to make f continuous at a? Solutions to Test 1 Fall 016 1pt 1. Te grap of a function f(x) is sown at rigt below. Part I. State te value of eac limit. If a limit is infinite, state weter it is or. If a limit does not exist (but is

More information

Symmetry Labeling of Molecular Energies

Symmetry Labeling of Molecular Energies Capter 7. Symmetry Labeling of Molecular Energies Notes: Most of te material presented in tis capter is taken from Bunker and Jensen 1998, Cap. 6, and Bunker and Jensen 2005, Cap. 7. 7.1 Hamiltonian Symmetry

More information

Greedy Layer-Wise Training of Deep Networks

Greedy Layer-Wise Training of Deep Networks Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn

More information

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems

5 Ordinary Differential Equations: Finite Difference Methods for Boundary Problems 5 Ordinary Differential Equations: Finite Difference Metods for Boundary Problems Read sections 10.1, 10.2, 10.4 Review questions 10.1 10.4, 10.8 10.9, 10.13 5.1 Introduction In te previous capters we

More information

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES

A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES A MONTE CARLO ANALYSIS OF THE EFFECTS OF COVARIANCE ON PROPAGATED UNCERTAINTIES Ronald Ainswort Hart Scientific, American Fork UT, USA ABSTRACT Reports of calibration typically provide total combined uncertainties

More information

Fractional Derivatives as Binomial Limits

Fractional Derivatives as Binomial Limits Fractional Derivatives as Binomial Limits Researc Question: Can te limit form of te iger-order derivative be extended to fractional orders? (atematics) Word Count: 669 words Contents - IRODUCIO... Error!

More information

Section 3: The Derivative Definition of the Derivative

Section 3: The Derivative Definition of the Derivative Capter 2 Te Derivative Business Calculus 85 Section 3: Te Derivative Definition of te Derivative Returning to te tangent slope problem from te first section, let's look at te problem of finding te slope

More information