A Better Way to Pretrain Deep Boltzmann Machines


Ruslan Salakhutdinov, Department of Statistics and Computer Science, University of Toronto
Geoffrey Hinton, Department of Computer Science, University of Toronto

Abstract

We describe how the pretraining algorithm for Deep Boltzmann Machines (DBMs) is related to the pretraining algorithm for Deep Belief Networks, and we show that under certain conditions, the pretraining procedure improves the variational lower bound of a two-hidden-layer DBM. Based on this analysis, we develop a different method of pretraining DBMs that distributes the modeling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn better generative models.

1 Introduction

A Deep Boltzmann Machine (DBM) is a type of binary pairwise Markov Random Field with multiple layers of hidden random variables. Maximum likelihood learning in DBMs, and other related models, is very difficult because of the hard inference problem induced by the partition function [13, 11, 12, 6]. Multiple layers of hidden units make learning in DBMs far more difficult [3]. Learning meaningful DBM models, particularly when modeling high-dimensional data, relies on the heuristic greedy pretraining procedure introduced by [7], which is based on learning a stack of modified Restricted Boltzmann Machines (RBMs). Unfortunately, unlike the pretraining algorithm for Deep Belief Networks (DBNs), the existing procedure lacks a proof that adding additional layers improves the variational bound on the log-probability that the model assigns to the training data.

In this paper, we first show that under certain conditions, the pretraining algorithm improves a variational lower bound of a two-layer DBM. This result gives a much deeper understanding of the relationship between the pretraining algorithms for Deep Boltzmann Machines and Deep Belief Networks. Using this understanding, we introduce a new pretraining procedure for DBMs and show that it allows us to learn better generative models of handwritten digits and 3D objects.

2 Deep Boltzmann Machines (DBMs)

A Deep Boltzmann Machine is a network of symmetrically coupled stochastic binary units. It contains a set of visible units v ∈ {0, 1}^D, and a series of layers of hidden units h^(1) ∈ {0, 1}^{F_1}, h^(2) ∈ {0, 1}^{F_2}, ..., h^(L) ∈ {0, 1}^{F_L}. There are connections only between units in adjacent layers. Consider a DBM with three hidden layers, as shown in Fig. 1, left panel. The probability that the DBM assigns to a visible vector v is:

    P(v; θ) = (1/Z(θ)) Σ_h exp( Σ_{ij} W^(1)_{ij} v_i h^(1)_j + Σ_{jl} W^(2)_{jl} h^(1)_j h^(2)_l + Σ_{lm} W^(3)_{lm} h^(2)_l h^(3)_m ),    (1)

where h = {h^(1), h^(2), h^(3)} are the set of hidden units, and θ = {W^(1), W^(2), W^(3)} are the model parameters, representing visible-to-hidden and hidden-to-hidden symmetric interaction terms (we omit the bias terms for clarity of presentation). Setting W^(2) = 0 and W^(3) = 0 recovers the Restricted Boltzmann Machine (RBM) model.
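To make Eq. (1) concrete, the following minimal sketch (ours, not part of the paper; the layer sizes, weights, and names are illustrative) evaluates the probability of a visible vector by brute-force enumeration of the hidden states, which is feasible only for a toy model:

import itertools
import numpy as np

# Illustrative layer sizes and random weights (not from the paper).
rng = np.random.default_rng(0)
D, F1, F2, F3 = 3, 3, 2, 2
W1 = 0.1 * rng.standard_normal((D, F1))
W2 = 0.1 * rng.standard_normal((F1, F2))
W3 = 0.1 * rng.standard_normal((F2, F3))

def states(n):
    """All binary vectors of length n."""
    return [np.array(s) for s in itertools.product([0, 1], repeat=n)]

def unnormalized_prob(v):
    """Sum of exp(v'W1 h1 + h1'W2 h2 + h2'W3 h3) over all hidden configurations."""
    return sum(np.exp(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)
               for h1 in states(F1) for h2 in states(F2) for h3 in states(F3))

Z = sum(unnormalized_prob(v) for v in states(D))   # partition function Z(theta)
v = np.array([1, 0, 1])
print("P(v) =", unnormalized_prob(v) / Z)          # Eq. (1) for this v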

Figure 1: Left: Deep Belief Network (DBN) and Deep Boltzmann Machine (DBM). The top two layers of a DBN form an undirected graph and the remaining layers form a belief net with directed, top-down connections. For a DBM, all the connections are undirected. Right: Pretraining a DBM with three hidden layers consists of learning a stack of RBMs that are then composed to create a DBM. The first and last RBMs in the stack need to be modified by using asymmetric weights.

Approximate Learning: Exact maximum likelihood learning in this model is intractable, but efficient approximate learning of DBMs can be carried out by using mean-field inference to estimate data-dependent expectations, and an MCMC-based stochastic approximation procedure to approximate the model's expected sufficient statistics [7]. In particular, consider approximating the true posterior P(h|v; θ) with a fully factorized approximating distribution over the three sets of hidden units: Q(h|v; µ) = ∏_{j=1}^{F_1} ∏_{l=1}^{F_2} ∏_{k=1}^{F_3} q(h^(1)_j) q(h^(2)_l) q(h^(3)_k), where µ = {µ^(1), µ^(2), µ^(3)} are the mean-field parameters with q(h^(l)_i = 1) = µ^(l)_i for l = 1, 2, 3. In this case, we can write down the variational lower bound on the log-probability of the data, which takes a particularly simple form:

    log P(v; θ) ≥ v^T W^(1) µ^(1) + µ^(1)T W^(2) µ^(2) + µ^(2)T W^(3) µ^(3) − log Z(θ) + H(Q),    (2)

where H(·) is the entropy functional. Learning proceeds by finding the value of µ that maximizes this lower bound for the current value of the model parameters θ, which results in a set of mean-field fixed-point equations. Given the variational parameters µ, the model parameters θ are then updated to maximize the variational bound using stochastic approximation (for details see [7, 11, 14, 15]).
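As an illustration of those mean-field fixed-point equations, here is a minimal sketch (ours; damping, convergence checks, and the stochastic-approximation weight update are omitted, and all names are hypothetical) for a three-hidden-layer DBM:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(v, W1, W2, W3, n_steps=20):
    """Mean-field updates for Q(h|v; mu) of a 3-hidden-layer DBM (sketch).

    v is a binary data vector of shape (D,); W1, W2, W3 have shapes
    (D, F1), (F1, F2), (F2, F3).  Returns the variational parameters mu.
    """
    mu1 = sigmoid(v @ W1)                      # bottom-up initialization
    mu2 = sigmoid(mu1 @ W2)
    mu3 = sigmoid(mu2 @ W3)
    for _ in range(n_steps):
        # Each layer's update combines messages from the adjacent layers.
        mu1 = sigmoid(v @ W1 + mu2 @ W2.T)
        mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)
        mu3 = sigmoid(mu2 @ W3)
    return mu1, mu2, mu3

With µ held fixed, the bound of Eq. (2) can be evaluated up to the − log Z(θ) term, and θ is then updated by stochastic approximation as described above.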

3 Pretraining Deep Boltzmann Machines

The above learning procedure works quite poorly when applied to DBMs that start with randomly initialized weights. Hidden units in higher layers are very under-constrained, so there is no consistent learning signal for their weights. To alleviate this problem, [7] introduced a layer-wise pretraining algorithm based on learning a stack of modified Restricted Boltzmann Machines (RBMs).

The idea behind the pretraining algorithm is straightforward. When learning the parameters of the first-layer RBM, the bottom-up weights are constrained to be twice the top-down weights (see Fig. 1, right panel). Intuitively, using twice the weights when inferring the states of the hidden units h^(1) compensates for the initial lack of top-down feedback. Conversely, when pretraining the last RBM in the stack, the top-down weights are constrained to be twice the bottom-up weights. For all the intermediate RBMs the weights are halved in both directions when composing them to form a DBM, as shown in Fig. 1, right panel.

This heuristic pretraining algorithm works surprisingly well in practice. However, it is solely motivated by the need to end up with a model that has symmetric weights, and does not provide any useful insights into what is happening during the pretraining stage. Furthermore, unlike the pretraining algorithm for Deep Belief Networks (DBNs), it lacks a proof that each time a layer is added to the DBM, the variational bound improves.

3.1 Pretraining Algorithm for Deep Belief Networks

We first briefly review the pretraining algorithm for Deep Belief Networks [2], which will form the basis for developing a new pretraining algorithm for Deep Boltzmann Machines. Consider pretraining a two-layer DBN using a stack of RBMs. After learning the first RBM in the stack, we can write the generative model as: p(v; W^(1)) = Σ_{h^(1)} p(h^(1); W^(1)) p(v|h^(1); W^(1)). The second RBM in the stack attempts to replace the prior p(h^(1); W^(1)) by a better model p(h^(1); W^(2)) = Σ_{h^(2)} p(h^(1), h^(2); W^(2)), thus improving the fit to the training data. More formally, for any approximating distribution Q(h^(1)|v), the DBN's log-likelihood has the following variational lower bound on the log probability of the training data {v_1, ..., v_N}:

    Σ_{n=1}^N log P(v_n) ≥ Σ_n E_{Q(h^(1)|v_n)}[ log P(v_n|h^(1); W^(1)) ] − Σ_n KL( Q(h^(1)|v_n) || P(h^(1); W^(2)) ).

We set Q(h^(1)|v_n; W^(1)) = P(h^(1)|v_n; W^(1)), which is the true factorial posterior of the first-layer RBM. Initially, when W^(2) = W^(1)T, Q(h^(1)|v_n) defines the DBN's true posterior over h^(1), and the bound is tight. Maximizing the bound with respect to W^(2) only affects the last KL term in the above equation, and amounts to maximizing:

    (1/N) Σ_{n=1}^N Σ_{h^(1)} Q(h^(1)|v_n; W^(1)) log P(h^(1); W^(2)).    (3)

This is equivalent to training the second-layer RBM with vectors drawn from Q(h^(1)|v; W^(1)) as data. Hence, the second RBM in the stack learns a better model of the mixture over all N training cases: (1/N) Σ_n Q(h^(1)|v_n; W^(1)), called the "aggregated posterior". This scheme can be extended to training higher-layer RBMs.

Observe that during the pretraining stage the whole prior of the lower-layer RBM is replaced by the next RBM in the stack. This leads to the hybrid Deep Belief Network model, with the top two layers forming a Restricted Boltzmann Machine, and the lower layers forming a directed sigmoid belief network (see Fig. 1, left panel).

3.2 A Variational Bound for Pretraining a Two-layer Deep Boltzmann Machine

Consider a simple two-layer DBM with tied weights W^(2) = W^(1), as shown in Fig. 2a:

    P(v; W^(1)) = (1/Z(W^(1))) Σ_{h^(1), h^(2)} exp( v^T W^(1) h^(1) + h^(2)T W^(1) h^(1) ).    (4)

Similar to DBNs, for any approximate posterior Q(h^(1)|v), we can write a variational lower bound on the log probability that this DBM assigns to the training data:

    Σ_{n=1}^N log P(v_n) ≥ Σ_n E_{Q(h^(1)|v_n)}[ log P(v_n|h^(1); W^(1)) ] − Σ_n KL( Q(h^(1)|v_n) || P(h^(1); W^(1)) ).    (5)

The key insight is to note that the model's marginal distribution over h^(1) is the product of two identical distributions, one defined by an RBM composed of v and h^(1), and the other defined by an identical RBM composed of h^(1) and h^(2) [8]:

    P(h^(1); W^(1)) = (1/Z(W^(1))) ( Σ_v e^{v^T W^(1) h^(1)} ) ( Σ_{h^(2)} e^{h^(2)T W^(1) h^(1)} ),    (6)

where the first factor is the contribution of the RBM over v and h^(1), and the second factor is that of the identical RBM over h^(1) and h^(2).
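The factorization in Eq. (6) is easy to verify numerically. The following minimal sketch (ours, with toy sizes; it assumes h^(2) has the same dimensionality as v so that the weights can be tied) checks it by brute force:

import itertools
import numpy as np

rng = np.random.default_rng(1)
D, F1 = 3, 2                          # toy sizes: v, h2 in {0,1}^D, h1 in {0,1}^F1
W = 0.2 * rng.standard_normal((D, F1))

def states(n):
    return [np.array(s) for s in itertools.product([0, 1], repeat=n)]

for h1 in states(F1):
    # Unnormalized marginal of the tied-weight DBM: sum over v and h2.
    marginal = sum(np.exp(v @ W @ h1 + h2 @ W @ h1)
                   for v in states(D) for h2 in states(D))
    # Product of the two identical RBM factors from Eq. (6).
    factor = sum(np.exp(v @ W @ h1) for v in states(D))
    assert np.isclose(marginal, factor * factor)
print("Eq. (6) holds: the prior over h1 is a product of two identical RBM terms.")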

Figure 2: Left: Pretraining a Deep Boltzmann Machine with two hidden layers. a) The DBM with tied weights. b) The second RBM with two sets of replicated hidden units, which will replace half of the 1st RBM's prior. c) The resulting DBM with modified second hidden layer. Right: The DBM with tied weights is trained to model the data using one-step contrastive divergence.

The idea is to keep one of these two RBMs and replace the other by the square root of a better prior P(h^(1); W^(2)). In particular, another RBM with two sets of replicated hidden units and tied weights, P(h^(1); W^(2)) = Σ_{h^(2a), h^(2b)} P(h^(1), h^(2a), h^(2b); W^(2)), is trained to be a better model of the aggregated variational posterior (1/N) Σ_n Q(h^(1)|v_n; W^(1)) of the first model (see Fig. 2b). By initializing W^(2) = W^(1), the second-layer RBM has exactly the same prior over h^(1) as the original DBM. If the RBM is trained by maximizing the log-likelihood objective:

    Σ_n Σ_{h^(1)} Q(h^(1)|v_n) log P(h^(1); W^(2)),    (7)

then we obtain:

    Σ_n KL( Q(h^(1)|v_n) || P(h^(1); W^(2)) ) ≤ Σ_n KL( Q(h^(1)|v_n) || P(h^(1); W^(1)) ).    (8)

Similar to Eq. 6, the distribution over h^(1) defined by the second-layer RBM is also the product of two identical distributions. Once the two RBMs are composed to form a two-layer DBM model (see Fig. 2c), the marginal distribution over h^(1) is the geometric mean of the two probability distributions P(h^(1); W^(1)) and P(h^(1); W^(2)) defined by the first- and second-layer RBMs:

    P(h^(1); W^(1), W^(2)) = (1/Z(W^(1), W^(2))) ( Σ_v e^{v^T W^(1) h^(1)} ) ( Σ_{h^(2)} e^{h^(2)T W^(2) h^(1)} ).    (9)

Based on Eqs. 8 and 9, it is easy to show that the variational lower bound of Eq. 5 improves because replacing half of the prior by a better model reduces the KL divergence from the variational posterior:

    Σ_n KL( Q(h^(1)|v_n) || P(h^(1); W^(1), W^(2)) ) ≤ Σ_n KL( Q(h^(1)|v_n) || P(h^(1); W^(1)) ).    (10)

Due to the convexity of the asymmetric divergence, this is guaranteed to improve the variational bound on the training data by at least half as much as fully replacing the original prior.

This result highlights a major difference between DBNs and DBMs. The procedure for adding an extra layer to a DBN replaces the full prior over the previous top layer, whereas the procedure for adding an extra layer to a DBM only replaces half of the prior. So in a DBM, the weights of the bottom-level RBM perform much more of the work than in a DBN, where the weights are only used to define the last stage of the generative process P(v|h^(1); W^(1)). This result also suggests that adding layers to a DBM will give diminishing improvements in the variational bound, compared to adding layers to a DBN. This may explain why DBMs with three hidden layers typically perform worse than DBMs with two hidden layers [7, 8].

On the other hand, the disadvantage of the pretraining procedure for Deep Belief Networks is that the top-layer RBM is forced to do most of the modeling work. This may also explain the need to use a large number of hidden units in the top-layer RBM [2]. There is, however, a way to design a new pretraining algorithm that spreads the modeling work more equally across all layers, hence bypassing the shortcomings of the existing pretraining algorithms for DBNs and DBMs.
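One way to make the "at least half" claim precise is the following short argument (our own sketch, not spelled out in the text). Write P_1 and P_2 for the priors P(h^(1); W^(1)) and P(h^(1); W^(2)), and P_12 for the geometric-mean prior of Eq. 9:

    P_12(h^(1)) = (1/Z') √( P_1(h^(1)) P_2(h^(1)) ),    Z' = Σ_{h^(1)} √( P_1(h^(1)) P_2(h^(1)) ) ≤ 1    (by Cauchy–Schwarz),

    KL(Q || P_12) = (1/2) KL(Q || P_1) + (1/2) KL(Q || P_2) + log Z' ≤ (1/2) KL(Q || P_1) + (1/2) KL(Q || P_2).

Hence the reduction in the KL term of Eq. 5 obtained by replacing P_1 with P_12 is at least (1/2)[ KL(Q || P_1) − KL(Q || P_2) ], i.e. at least half of the reduction obtained by fully replacing P_1 with P_2; together with Eq. 8 this also yields Eq. 10.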

Figure 3: Left: Pretraining a Deep Boltzmann Machine with two hidden layers. a) The DBM with tied weights. b) The second-layer RBM is trained to model 2/3 of the 1st RBM's prior. c) The resulting DBM with modified second hidden layer. Right: The corresponding practical implementation of the pretraining algorithm that uses asymmetric weights.

3.3 Controlling the Amount of Modeling Work done by Each Layer

Consider a slightly modified two-layer DBM with two groups of replicated 2nd-layer units, h^(2a) and h^(2b), and tied weights (see Fig. 3a). The model's marginal distribution over h^(1) is the product of three identical RBM distributions, defined by h^(1) and v, h^(1) and h^(2a), and h^(1) and h^(2b):

    P(h^(1); W^(1)) = (1/Z(W^(1))) ( Σ_v e^{v^T W^(1) h^(1)} ) ( Σ_{h^(2a)} e^{h^(2a)T W^(1) h^(1)} ) ( Σ_{h^(2b)} e^{h^(2b)T W^(1) h^(1)} ).    (11)

During the pretraining stage, we keep one of these RBMs and replace the other two by a better prior P(h^(1); W^(2)). To do so, similar to Sec. 3.2, we train another RBM, but with three sets of replicated hidden units and tied weights (see Fig. 3b). When we combine the two RBMs into a DBM, the marginal distribution over h^(1) is the geometric mean of three probability distributions: one defined by the first-layer RBM, and the remaining two defined by the second-layer RBMs:

    P(h^(1); W^(1), W^(2)) = (1/Z(W^(1), W^(2))) P^{1/3}(h^(1); W^(1)) P^{1/3}(h^(1); W^(2)) P^{1/3}(h^(1); W^(2))
                           = (1/Z(W^(1), W^(2))) ( Σ_v e^{v^T W^(1) h^(1)} ) ( Σ_{h^(2a)} e^{h^(2a)T W^(2) h^(1)} ) ( Σ_{h^(2b)} e^{h^(2b)T W^(2) h^(1)} ).    (12)

In this DBM, 2/3 of the first RBM's prior over the first hidden layer has been replaced by the prior defined by the second-layer RBM. The variational bound on the training data is guaranteed to improve by at least 2/3 as much as fully replacing the original prior. Hence in this slightly modified DBM model, the second layer performs 2/3 of the modeling work compared to the first layer. Clearly, controlling the number of replicated hidden groups allows us to easily control the amount of modeling work left to the higher layers in the stack.

3.4 Practical Implementation

So far, we have made the assumption that we start with a two-layer DBM with tied weights. We now specify how one would train this initial set of tied weights. Let us consider the original two-layer DBM in Fig. 2a with tied weights. If we knew the initial state vector h^(1), we could train this DBM using one-step contrastive divergence (CD) with mean-field reconstructions of both the visible states v and the top-layer states h^(2), as shown in Fig. 2, right panel. Instead, we simply set the initial state vector h^(1) to be equal to the data, v. Using mean-field reconstructions for v and h^(2), one-step CD is exactly equivalent to training a modified RBM with only one hidden layer but with bottom-up weights that are twice the top-down weights, as defined in the original pretraining algorithm (see Fig. 1, right panel). This way of training the simple DBM with tied weights is unlikely to maximize the likelihood objective, but in practice it produces surprisingly good models that reconstruct the training data well.

When learning the second RBM in the stack, instead of maintaining a set of replicated hidden groups, it will often be convenient to approximate CD learning by training a modified RBM with one hidden layer but with asymmetric bottom-up and top-down weights.

For example, consider pretraining a two-layer DBM, in which we would like to split the modeling work between the 1st- and 2nd-layer RBMs as 1/3 and 2/3. In this case, we train the first-layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. 3, right panel). The conditional distributions needed for CD learning take the form:

    P(h^(1)_j = 1 | v) = 1 / (1 + exp(−Σ_i 3W^(1)_{ij} v_i)),        P(v_i = 1 | h^(1)) = 1 / (1 + exp(−Σ_j W^(1)_{ij} h^(1)_j)).

Conversely, for the second modified RBM in the stack, the top-down weights are constrained to be 3/2 times the bottom-up weights. The conditional distributions take the form:

    P(h^(2)_l = 1 | h^(1)) = 1 / (1 + exp(−Σ_j 2W^(2)_{jl} h^(1)_j)),        P(h^(1)_j = 1 | h^(2)) = 1 / (1 + exp(−Σ_l 3W^(2)_{jl} h^(2)_l)).

Note that this second-layer modified RBM simply approximates the proper RBM with three sets of replicated h^(2) groups. In practice, this simple approximation works well compared to training a proper RBM, and is much easier to implement. When combining the RBMs into a two-layer DBM, we end up with W^(1) and 2W^(2) in the first and second layers, each performing 1/3 and 2/3 of the modeling work respectively:

    P(v; θ) = (1/Z(θ)) Σ_{h^(1), h^(2)} exp( v^T W^(1) h^(1) + h^(1)T 2W^(2) h^(2) ).    (13)

Parameters of the entire model can be generatively fine-tuned using the combination of the mean-field algorithm and the stochastic approximation algorithm described in Sec. 2.

4 Pretraining a Three Layer Deep Boltzmann Machine

In the previous section, we showed that provided we start with a two-layer DBM with tied weights, we can train the second-layer RBM in a way that is guaranteed to improve the variational bound. For DBMs with more than two layers, we have not been able to develop a pretraining algorithm that is guaranteed to improve a variational bound. However, the results of Sec. 3 suggest that using simple modifications when pretraining a stack of RBMs allows us to approximately control the amount of modeling work done by each layer.

Figure 4: Layer-wise pretraining of a 3-layer Deep Boltzmann Machine.

Consider learning a 3-layer DBM, in which each layer is forced to perform approximately 1/3 of the modeling work. This can easily be accomplished by learning a stack of three modified RBMs. Similar to the two-layer model, we train the first-layer RBM using one-step CD, but with the bottom-up weights constrained to be three times the top-down weights (see Fig. 4). Two-thirds of this RBM's prior will be modeled by the 2nd- and 3rd-layer RBMs.

For the second modified RBM in the stack, we use 4W^(2) bottom-up and 3W^(2) top-down. Note that we are using 4W^(2) bottom-up, as we are expecting to replace half of the second RBM's prior by a third RBM, hence splitting the remaining 2/3 of the work equally between the top two layers. If we were to pretrain only a two-layer DBM, we would use 2W^(2) bottom-up and 3W^(2) top-down, as discussed in Sec. 3.4. For the last RBM in the stack, we use 2W^(3) bottom-up and 4W^(3) top-down. When combining the three RBMs into a three-layer DBM, we end up with symmetric weights W^(1), 2W^(2), and 2W^(3) in the first, second, and third layers, with each layer performing 1/3 of the modeling work:

    P(v; θ) = (1/Z(θ)) Σ_h exp( v^T W^(1) h^(1) + h^(1)T 2W^(2) h^(2) + h^(2)T 2W^(3) h^(3) ).    (14)

Algorithm 1 Greedy Pretraining Algorithm for a 3-layer Deep Boltzmann Machine
1: Train the 1st-layer RBM using one-step CD learning with mean-field reconstructions of the visible vectors. Constrain the bottom-up weights, 3W^(1), to be three times the top-down weights, W^(1).
2: Freeze 3W^(1) that defines the 1st layer of features, and use samples from P(h^(1)|v; 3W^(1)) as the data for training the second RBM.
3: Train the 2nd-layer RBM using one-step CD learning with mean-field reconstructions of the visible vectors. Set the bottom-up weights to 4W^(2), and the top-down weights to 3W^(2).
4: Freeze 4W^(2) that defines the 2nd layer of features and use the samples from P(h^(2)|h^(1); 4W^(2)) as the data for training the next RBM.
5: Train the 3rd-layer RBM using one-step CD learning with mean-field reconstructions of its visible vectors. During the learning, set the bottom-up weights to 2W^(3), and the top-down weights to 4W^(3).
6: Use the weights {W^(1), 2W^(2), 2W^(3)} to compose a three-layer Deep Boltzmann Machine.

The new pretraining procedure for a 3-layer DBM is shown in Alg. 1. Note that compared to the original algorithm, it requires almost no extra work and can be easily integrated into existing code. Extension to training DBMs with more layers is trivial. As we show in our experimental results, this pretraining can improve the generative performance of Deep Boltzmann Machines.
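To make the procedure concrete, here is a minimal sketch of Alg. 1 (ours, not the authors' code; it assumes binary data in a NumPy array, uses full-batch updates, and the CD-1 weight update shown is only the usual approximation, with minibatching, momentum, and weight decay omitted):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_asymmetric(data, num_hidden, up, down, lr=0.05, epochs=100, seed=0):
    """One-step CD for a modified RBM whose bottom-up weights are up * W and
    whose top-down weights are down * W, using mean-field reconstructions
    as described in Sec. 3.4.  Returns the base weight matrix W (a sketch)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((data.shape[1], num_hidden))
    for _ in range(epochs):
        v0 = data
        h0 = sigmoid(up * (v0 @ W))                    # P(h = 1 | v), bottom-up
        v1 = sigmoid(down * (h0 @ W.T))                # mean-field reconstruction
        h1 = sigmoid(up * (v1 @ W))
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)  # approximate CD-1 update
    return W

def pretrain_dbm_3layer(data, sizes=(500, 500, 1000), seed=0):
    """Greedy pretraining following Alg. 1, so each layer does roughly 1/3 of
    the modeling work.  Returns the composed DBM weights {W1, 2*W2, 2*W3}."""
    rng = np.random.default_rng(seed)
    # Step 1: 1st-layer RBM with bottom-up weights 3*W1 and top-down weights W1.
    W1 = cd1_asymmetric(data, sizes[0], up=3.0, down=1.0)
    # Step 2: samples from P(h1 | v; 3*W1) become the data for the next RBM.
    h1 = (rng.random((len(data), sizes[0])) < sigmoid(3.0 * (data @ W1))).astype(float)
    # Step 3: 2nd-layer RBM with 4*W2 bottom-up and 3*W2 top-down.
    W2 = cd1_asymmetric(h1, sizes[1], up=4.0, down=3.0)
    # Step 4: samples from P(h2 | h1; 4*W2).
    h2 = (rng.random((len(h1), sizes[1])) < sigmoid(4.0 * (h1 @ W2))).astype(float)
    # Step 5: 3rd-layer RBM with 2*W3 bottom-up and 4*W3 top-down.
    W3 = cd1_asymmetric(h2, sizes[2], up=2.0, down=4.0)
    # Step 6: compose the three-layer DBM with weights {W1, 2*W2, 2*W3}.
    return W1, 2.0 * W2, 2.0 * W3

The composed weights would then be fine-tuned generatively with the mean-field plus stochastic approximation procedure of Sec. 2.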

5 Experimental Results

In our experiments we used the MNIST and NORB datasets. During greedy pretraining, each layer was trained for 100 epochs using one-step contrastive divergence. Generative fine-tuning of the full DBM model, using mean-field together with stochastic approximation, required 300 epochs. In order to estimate the variational lower bounds achieved by the different pretraining algorithms, we need to estimate the global normalization constant. Recently, [10] demonstrated that Annealed Importance Sampling (AIS) can be used to efficiently estimate the partition function of an RBM. We adopt AIS in our experiments as well. Together with variational inference this allows us to obtain good estimates of the lower bound on the log-probability of the training and test data.

5.1 MNIST

The MNIST digit dataset contains 60,000 training and 10,000 test images of ten handwritten digits (0 to 9), with 28×28 pixels. In our first experiment, we considered a standard two-layer DBM with 500 and 1000 hidden units², and used two different algorithms for pretraining it. The first pretraining algorithm, which we call DBM-1/2-1/2, is the original algorithm for pretraining DBMs, as introduced by [7] (see Fig. 1). Here, the modeling work between the 1st- and 2nd-layer RBMs is split equally. The second algorithm, DBM-1/3-2/3, uses the modified pretraining procedure of Sec. 3.4, so that the second RBM in the stack ends up doing 2/3 of the modeling work compared to the 1st-layer RBM. Results are shown in Table 1. Prior to the global generative fine-tuning, the estimate of the lower bound on the average test log-probability for DBM-1/3-2/3 was about 7 nats higher per test case than the −114.32 achieved by the standard pretraining algorithm DBM-1/2-1/2. This large difference shows that leaving more of the modeling work to the second layer, which has a larger number of hidden units, substantially improves the variational bound. After the global generative fine-tuning, DBM-1/3-2/3 achieves a lower bound of −83.43, which is better than the −84.62 achieved by DBM-1/2-1/2. This also improves upon the lower bound of −85.97 achieved by a carefully trained two-hidden-layer Deep Belief Network [10].
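These per-test-case numbers combine the mean-field bound of Eq. (2) with an AIS estimate of log Z(θ). A minimal sketch of that final computation for a two-layer DBM (ours; it assumes the mean-field parameters and the AIS estimate have already been computed, and all names are hypothetical):

import numpy as np

def bernoulli_entropy(mu, eps=1e-8):
    """Entropy H(Q) of a factorized Bernoulli distribution with means mu."""
    mu = np.clip(mu, eps, 1.0 - eps)
    return -np.sum(mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu))

def lower_bound(v, mu1, mu2, W1, W2, log_Z):
    """Variational lower bound of Eq. (2) for one test case of a two-layer DBM,
    given mean-field parameters mu1, mu2 and an AIS estimate of log Z(theta).
    Bias terms are omitted, matching the presentation in the text."""
    return (v @ W1 @ mu1 + mu1 @ W2 @ mu2 - log_Z
            + bernoulli_entropy(mu1) + bernoulli_entropy(mu2))

Averaging lower_bound over all test cases would give per-test-case estimates of the kind reported in Table 1.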
In our second experiment, we pretrained a 3-layer Deep Boltzmann Machine with 500, 500, and 1000 hidden units. The existing pretraining algorithm, DBM-1/2-1/4-1/4, approximately splits the modeling work between the three RBMs in the stack as 1/2, 1/4, 1/4, so the weights in the 1st-layer RBM perform half of the modeling work, compared to a quarter each for the higher-level RBMs. On the other hand, the new pretraining procedure (see Alg. 1), which we call DBM-1/3-1/3-1/3, splits the modeling work equally across all three layers.

²These architectures have been considered before in [7, 9], which allows us to provide a direct comparison.

Table 1: MNIST: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (500 and 1000 hidden units), and the other one with three layers (500, 500, and 1000 hidden units). Results are shown for various pretraining algorithms, followed by generative fine-tuning.

                              Pretraining            Generative Fine-Tuning
                              Train      Test        Train      Test
    2 layers  DBM-1/2-1/2
              DBM-1/3-2/3
    3 layers  DBM-1/2-1/4-1/4
              DBM-1/3-1/3-1/3

Table 2: NORB: Estimating the lower bound on the average training and test log-probabilities for two DBMs: one with two layers (1000 and 2000 hidden units), and the other one with three layers (1000, 1000, and 2000 hidden units). Results are shown for various pretraining algorithms, followed by generative fine-tuning.

                              Pretraining            Generative Fine-Tuning
                              Train      Test        Train      Test
    2 layers  DBM-1/2-1/2
              DBM-1/3-2/3
    3 layers  DBM-1/2-1/4-1/4
              DBM-1/3-1/3-1/3

Table 1 shows that DBM-1/3-1/3-1/3 achieves a lower bound on the average test log-probability of −107.65, improving upon the bound of DBM-1/2-1/4-1/4 by about 10 nats. This further demonstrates that during the pretraining stage, it is rather crucial to push more of the modeling work to the higher layers. After generative fine-tuning, the bound on the test log-probabilities for DBM-1/3-1/3-1/3 was −83.02, so with the new pretraining procedure, the three-hidden-layer DBM performs slightly better than the two-hidden-layer DBM. With the original pretraining procedure, the 3-layer DBM achieves a bound of −85.10, which is worse than the bound of −84.62 achieved by the 2-layer DBM, as reported by [7, 9].

5.2 NORB

The NORB dataset [4] contains images of 50 different 3D toy objects with 10 objects in each of five generic classes: cars, trucks, planes, animals, and humans. Each object is photographed from different viewpoints and under various lighting conditions. The training set contains 24,300 stereo image pairs of 25 objects, 5 per class, while the test set contains 24,300 stereo pairs of the remaining, different 25 objects. From the training data, 4,300 were set aside for validation. To deal with raw pixel data, we followed the approach of [5] by first learning a Gaussian-binary RBM with 4000 hidden units, and then treating the activities of its hidden layer as preprocessed binary data.

Similar to the MNIST experiments, we trained two Deep Boltzmann Machines: one with two layers (1000 and 2000 hidden units), and the other one with three layers (1000, 1000, and 2000 hidden units). Table 2 reveals that for both DBMs, the new pretraining achieves much better variational bounds on the average test log-probability. Even after the global generative fine-tuning, Deep Boltzmann Machines pretrained using the new algorithm improve upon standard DBMs by at least 5 nats.

6 Conclusion

In this paper we provided a better understanding of how the pretraining algorithms for Deep Belief Networks and Deep Boltzmann Machines are related, and used this understanding to develop a different method of pretraining. Unlike many of the existing pretraining algorithms for DBNs and DBMs, the new procedure can distribute the modeling work more evenly over the hidden layers. Our results on the MNIST and NORB datasets demonstrate that the new pretraining algorithm allows us to learn much better generative models.

Acknowledgments

This research was funded by NSERC, an Early Researcher Award, and gifts from Microsoft and Google. G.H. and R.S. are fellows of the Canadian Institute for Advanced Research.

References

[1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2009.
[2] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[3] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. Journal of Machine Learning Research, 10:1–40, 2009.
[4] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR (2), pages 97–104, 2004.
[5] V. Nair and G. E. Hinton. Implicit mixtures of restricted Boltzmann machines. In Advances in Neural Information Processing Systems, volume 21, 2009.
[6] M. A. Ranzato. Unsupervised learning of feature hierarchies. PhD thesis, New York University, 2009.
[7] R. R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 12, 2009.
[8] R. R. Salakhutdinov and G. E. Hinton. An efficient learning procedure for Deep Boltzmann Machines. Neural Computation, 24:1967–2006, 2012.
[9] R. R. Salakhutdinov and H. Larochelle. Efficient learning of deep Boltzmann machines. In Proceedings of the International Conference on Artificial Intelligence and Statistics, volume 13, 2010.
[10] R. R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the International Conference on Machine Learning, volume 25, 2008.
[11] T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML. ACM, 2008.
[12] M. Welling and G. E. Hinton. A new learning algorithm for mean field Boltzmann machines. Lecture Notes in Computer Science, 2415, 2002.
[13] M. Welling and C. Sutton. Learning in Markov random fields with contrastive free energies. In International Workshop on AI and Statistics (AISTATS 2005), 2005.
[14] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates, March 2000.
[15] A. L. Yuille. The convergence of contrastive divergences. In Advances in Neural Information Processing Systems, 2005.
