Cascaded redundancy reduction

Network: Comput. Neural Syst. 9 (1998) 73–84. Printed in the UK

Cascaded redundancy reduction

Virginia R de Sa† and Geoffrey E Hinton

Department of Computer Science, University of Toronto, Toronto, Ontario, M5S 1A4, Canada

Received 13 September 1997

Abstract. We describe a method for incrementally constructing a hierarchical generative model of an ensemble of binary data vectors. The model is composed of stochastic, binary, logistic units. Hidden units are added to the model one at a time with the goal of minimizing the information required to describe the data vectors using the model. In addition to the top-down generative weights that define the model, there are bottom-up recognition weights that determine the binary states of the hidden units given a data vector. Even though the stochastic generative model can produce each data vector in many ways, the recognition model is forced to pick just one of these ways. The recognition model therefore underestimates the ability of the generative model to predict the data, but this underestimation greatly simplifies the process of searching for the generative and recognition weights of a new hidden unit.

1. Introduction

Unsupervised learning algorithms attempt to extract statistical structure from an ensemble of input vectors without the help of an additional teaching signal. One way of defining what it means to extract statistical structure is to appeal to a stochastic generative model that produces data vectors from hidden stochastic variables. If a relatively simple generative model is likely to have produced the observed data vectors, then we can view the underlying states that the model uses to generate a particular data vector as an economical representation of that vector.

In this paper we consider the problem of fitting a generative model that is composed of multiple layers of stochastic binary units. The bottom layer consists of visible units which correspond to the individual components of a binary data vector. The model generates data by starting at the top layer and working downwards. At the top level, each unit turns on randomly and independently with a probability that depends only on its generative bias. At each subsequent level, a unit, $i$, receives top-down input that depends on the already decided binary states, $s_j$, of units in the layer above and the generative weights, $g_{j,i}$. This top-down input is combined with the unit's generative bias, $g_{0,i}$,‡ to produce the unit's total generative input, $x_i$:

$$x_i = g_{0,i} + \sum_{j>i} s_j g_{j,i}. \qquad (1)$$

† Present address: Sloan Center for Theoretical Neurobiology, Department of Physiology Box 0444, 513 Parnassus Ave, San Francisco, CA, USA.
‡ The bias is equivalent to a generative weight from a bias unit that is always on. For notational convenience we define this node as unit 0.
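To make the generative pass concrete, the following sketch (our illustration, not code from the paper) samples one binary vector using equation (1) and the logistic function of equation (2) below; the network size and all weight values are arbitrary placeholder assumptions.

```python
# A minimal sketch (our illustration, not the paper's code) of one top-down
# generative pass through a model of stochastic binary logistic units.
# Units are indexed so that j > i means unit j is above unit i; g[j, i] is
# the generative weight from j to i and g0[i] is unit i's generative bias.
# All weight values below are arbitrary placeholders.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(g0, g, rng):
    """Sample one binary vector, deciding units from the top down."""
    n = len(g0)
    s = np.zeros(n, dtype=int)
    for i in reversed(range(n)):                 # highest (top) unit first
        x = g0[i] + s[i + 1:] @ g[i + 1:, i]     # equation (1)
        s[i] = rng.random() < sigmoid(x)         # equation (2), sampled
    return s

rng = np.random.default_rng(0)
n_units = 8                                      # e.g. 3 hidden above 5 visible
g0 = rng.normal(size=n_units)
g = np.tril(rng.normal(size=(n_units, n_units)), k=-1)  # only g[j, i], j > i
print(generate(g0, g, rng))                      # the low indices are the data
```

Because each unit is sampled, repeated calls with the same weights produce different vectors, which is the many-to-one generation described next.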

The total generative input is then put through a logistic function to determine the probability, $p_i$, of the unit being turned on:

$$p_i = \sigma(x_i) = \frac{1}{1 + e^{-x_i}}. \qquad (2)$$

Because the units are stochastic, each top-down generative pass through the model will typically produce a different data vector and each data vector can typically be produced in many different ways.

If we start with a multilayer model with a fixed architecture and we want to learn the generative weights and biases, the obvious objective function is the log likelihood of the observed data under the generative model:

$$\sum_d \log p(d \mid \theta) = \sum_d \log \left( \sum_\alpha p(\alpha \mid \theta)\, p(d \mid \alpha, \theta) \right) \qquad (3)$$

where $d$ is an index over data vectors, $\alpha$ is an index over all possible assignments of binary states to the hidden units, and $\theta$ is the generative weights and biases. If units receive top-down generative connections from many units in the layer above, maximizing this objective function using either the EM algorithm or gradient ascent is an intractable problem. Both of these algorithms need to compute the posterior distribution over exponentially many hidden state vectors, $\alpha$, given a data vector, $d$, and the generative parameters, $\theta$. Instead of computing the posterior distribution exactly, it can be approximated using Gibbs sampling (Neal 1992) or mean-field methods (Jaakkola et al 1996), though neither of these approaches seems to have much biological plausibility. A Helmholtz machine (Dayan et al 1995) uses a separate set of recognition connections to compute an approximation to the posterior distribution. It can be shown that the use of an approximation still allows EM to maximize a lower bound on the log probability of the data, with the tightness of the bound depending on the quality of the approximation (Neal and Hinton 1998). For one version of the Helmholtz machine (Hinton et al 1995), there is a very simple wake–sleep learning algorithm that uses only locally available information.

An extreme version of the idea behind Helmholtz machines is for the recognition connections to approximate the posterior distribution using a single vector of hidden states. Although this provides a much worse bound than the other approximations, it greatly simplifies the search for the recognition and generative weights because there is no need to integrate across the many alternative ways of producing the data. The hope is that the computational convenience will compensate for the looser bound. A secondary aim is to see how much is gained by allowing the recognition connections of a Helmholtz machine to produce a whole distribution rather than a single hidden state vector. In this form, the algorithm is similar to that of Redlich (1993). The biggest difference is in the search strategy and the construction in this case of a stochastic generative model.

Given a data vector, $d$, the recognition connections produce a binary state vector, $s^d$, over the hidden units. Using $s^d$ we get an upper bound on the negative log probability of the data:†

$$-\log p(d \mid \theta) = -\log \sum_\alpha p(\alpha \mid \theta)\, p(d \mid \alpha, \theta) \le -\log p(s^d \mid \theta)\, p(d \mid s^d, \theta). \qquad (4)$$

† This is essentially the same debate as Viterbi versus Baum–Welch in the fitting of hidden Markov models, except that Baum–Welch computes the true posterior distribution rather than just an approximation to it.
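On a model small enough to enumerate every hidden configuration, the relationship between equations (3) and (4) can be verified directly. The sketch below (ours; the one-hidden-layer model, its weights and the recognition choice $s$ are all arbitrary assumptions) computes the exact negative log probability by brute force and checks that the single-state expression never falls below it.

```python
# Illustrative sketch only: compare the exact negative log probability of
# equation (3), computed by enumerating every hidden state vector alpha,
# against the single-state upper bound of equation (4). The small
# one-hidden-layer model and all its weights are arbitrary assumptions.
import itertools
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_bernoulli(s, p):
    return np.sum(s * np.log(p) + (1 - s) * np.log(1 - p))

rng = np.random.default_rng(1)
n_hid, n_vis = 3, 4
g0_h = rng.normal(size=n_hid)            # hidden biases (single hidden layer)
g0_v = rng.normal(size=n_vis)            # visible biases
g = rng.normal(size=(n_hid, n_vis))      # generative weights, hidden -> visible

d = rng.integers(0, 2, size=n_vis)       # one binary data vector

# Exact: -log p(d) = -log sum_alpha p(alpha) p(d | alpha)   -- equation (3)
log_terms = []
for alpha in itertools.product([0, 1], repeat=n_hid):
    a = np.array(alpha)
    log_p_alpha = log_bernoulli(a, sigmoid(g0_h))
    log_p_d = log_bernoulli(d, sigmoid(g0_v + a @ g))
    log_terms.append(log_p_alpha + log_p_d)
exact_nll = -np.logaddexp.reduce(log_terms)

# Bound: any single state s gives -log p(s) p(d|s) >= exact_nll  -- equation (4)
s = np.array([1, 0, 1])                  # an arbitrary recognition choice
bound = -(log_bernoulli(s, sigmoid(g0_h)) + log_bernoulli(d, sigmoid(g0_v + s @ g)))
print(exact_nll, bound, bound >= exact_nll)   # the bound is never smaller
```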

This bound and its derivatives can be computed in a time that is linear in the number of connections, so it is much more efficient to perform gradient descent in the upper bound than in the true negative log probability of the data.

2. A minimum description length (MDL) interpretation of the objective function

There is an alternative interpretation of the upper bound which is mathematically equivalent, but conceptually different. We think in terms of a communication game in which a sender must communicate each data vector to a receiver. Rather than sending the individual components of each data vector separately, the sender first uses the recognition connections to produce a hierarchical representation of the data vector. Then she sends this representation starting at the top level and working downwards. The idea is that if the sender has found a good model for representing the data, it should take fewer bits to send the data using the representations prescribed by the model than to send the raw data. In the full MDL framework, we must also include the one-time cost of communicating the model itself. In this paper we ignore the model cost, but it could be used as a criterion for deciding when to stop enlarging the model (only add the new unit when the improvement in coding cost is more than the cost of communicating the new unit; this is dependent on the coding strategy for the weights).

According to Shannon's coding theorem, if a sender and a receiver have agreed upon a probability distribution under which a discrete event has probability $p$, the event can be communicated using $-\log_2 p$ bits. This is an asymptotic result that involves using block coding techniques, but the details are irrelevant here. On average, it is best for the sender and receiver to agree on the correct probability distribution, but this is not required. They can use any distribution they like. For communicating the binary state, $s_i^d$, of unit $i$ that is produced by the recognition connections, the sender and receiver use the Bernoulli distribution $(p_i, 1-p_i)$ that is obtained by applying the generative model to the states in the layer above (see equation (2)). This distribution is available to the receiver because the states of the units in the layer above have already been communicated. So it requires $-s_i^d \log_2 p_i - (1-s_i^d)\log_2(1-p_i)$ bits to communicate $s_i^d$. The cost of communicating a data vector is simply the sum of this quantity over all units, including the visible units that represent the data. So the description length is

$$C(d) = \sum_i \left[ -s_i^d \log_2 p_i - (1-s_i^d)\log_2(1-p_i) \right] = -\log_2 p(s^d \mid \theta)\, p(d \mid s^d, \theta). \qquad (5)$$

3. Why a hidden unit helps

Consider the cost of describing the data using no hidden units. The only adjustable parameters in the model are the generative biases of the visible units. The cost in equation (5) is minimized by setting these biases so that $p_i = \sigma(g_{0,i})$ is equal to the fraction of the data vectors in which unit $i$ is on. We call this the base-rate model. The base-rate model makes good use of the individual frequencies with which visible units come on, but it ignores all correlations.
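As a concrete instance of the base-rate model, this sketch (ours; the random data are a placeholder) sets each visible bias to the logit of the unit's observed frequency and evaluates the description length of equation (5) in bits.

```python
# Illustrative sketch only: the base-rate model sets each visible bias so
# that sigma(g_0i) equals the fraction of data vectors in which unit i is
# on; the description length of equation (5) is then the summed Bernoulli
# code lengths in bits. The random data below are a placeholder assumption.
import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(0, 2, size=(1000, 16))       # binary data vectors

freq = data.mean(axis=0).clip(1e-6, 1 - 1e-6)    # fraction on, per unit
g0 = np.log(freq / (1 - freq))                   # logit: sigma(g0) = freq
p = 1.0 / (1.0 + np.exp(-g0))                    # recovers freq

# C(d) = sum_i -s_i log2 p_i - (1 - s_i) log2 (1 - p_i)   -- equation (5)
cost = -(data * np.log2(p) + (1 - data) * np.log2(1 - p)).sum(axis=1)
print("average base-rate coding cost: %.2f bits" % cost.mean())
```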

Now, suppose we have a single hidden unit, $k$. If a subset of the visible units tend to come on together, this redundancy can be captured by setting the recognition weights, $r_{i,k}$, of the hidden unit so that it comes on in these circumstances, and setting its generative weights, $g_{k,i}$, so as to increase $p_i$ for all the units in the subset. By modifying the generative biases of the visible units it is also possible to reduce $p_i$ when the hidden unit is off. In effect, the hidden unit allows the base rates to be dependent on which data vector is being described. With just one hidden unit we obtain a mixture of two base-rate models.

After adding one hidden unit, we could split the data into two disjoint subsets and apply the same algorithm again to each subset separately. This would create a tree structure and would be a natural extension of decision tree algorithms like CART (Breiman et al 1984) to the task of probability density estimation. Unfortunately, splitting the data in this way is usually a bad strategy when there are multiple different regularities in the data. Suppose, for example, that the components of the data vector can be split into two subsets. Within each subset there are weak correlations, but between subsets the components are independent. The first split in a tree can use all of the data to detect the redundancy within one of the subsets, but at the next level down the other redundancy must be discovered separately in each half of the tree using only half the data.

To avoid this problem, we consider all of the data when adding each new hidden unit, but we take into account the work that is already being done by the previous hidden units in creating appropriate top-down probabilities for describing the states of lower units. We also allow new hidden units to receive recognition connections from existing hidden units and to send generative connections to them. In adding a new hidden unit, $k$, we greedily attempt to minimize the cost of describing the data using the new unit and all the previous ones. This cost includes the cost of describing the state of $k$ and the cost of describing the state of every pre-existing unit, $i$, using the new generative probabilities created by combining the new bias for $i$ with the generative weight from $k$ and the pre-existing generative weights from units above $i$:

$$C = \sum_d \left( -s_k^d \log \sigma(g_{0,k}) - (1-s_k^d) \log[1-\sigma(g_{0,k})] \right) + \sum_d \sum_i \left( -s_i^d \log[\sigma(x_i^d)] - (1-s_i^d) \log[1-\sigma(x_i^d)] \right) \qquad (6)$$

where $x_i^d$ is the generative input to unit $i$:

$$x_i^d = g_{0,i} + s_k^d g_{k,i} + \sum_{k>j>i} s_j^d g_{j,i}.$$

The term $\sum_{k>j>i} s_j^d g_{j,i}$ is unaffected by adding unit $k$, so it can be stored for each data vector in the training set, thus eliminating much computation. To emphasize this and to simplify subsequent equations we define

$$G^d_{k>j>i} \equiv \sum_{k>j>i} s_j^d g_{j,i}. \qquad (7)$$

In searching for the best generative and recognition weights for $k$, we allow the generative biases of all earlier units to be modified, but not the other generative weights or the recognition weights or biases of other units.
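The sketch below (our illustration; all shapes and values are placeholder assumptions) evaluates the cost of equation (6) for one candidate unit, reusing a cached $G^d_{k>j>i}$ from equation (7) so that considering the candidate only touches the new bias, the new unit's code cost and the generative weights from $k$.

```python
# Illustrative sketch only: the cost of equation (6) for one candidate unit
# k, using a cached G[d, i] = sum_{k>j>i} s_j g_ji per equation (7). All
# shapes and values are assumptions; s_k and s_i come from the (fixed)
# recognition pass.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_cost(s, p):
    return -(s * np.log(p) + (1 - s) * np.log(1 - p))

rng = np.random.default_rng(3)
n_data, n_below = 500, 20
G = rng.normal(size=(n_data, n_below))            # cached top-down input, eq. (7)
s_i = rng.integers(0, 2, size=(n_data, n_below))  # states of earlier units
s_k = rng.integers(0, 2, size=n_data)             # binary state of new unit k
g0_k = 0.1                                        # new unit's generative bias
g0 = rng.normal(size=n_below) * 0.1               # re-fitted earlier biases
g_k = rng.normal(size=n_below) * 0.1              # generative weights from k

x = G + g0 + np.outer(s_k, g_k)                   # total generative input
C = bernoulli_cost(s_k, sigmoid(g0_k)).sum() + bernoulli_cost(s_i, sigmoid(x)).sum()
print("candidate cost:", C)
```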

4. Creating a smooth search space

Once a new hidden unit has been added to the network, it behaves deterministically. Its recognition weights cause it to be either on or off for each data vector. However, the search for a suitable set of recognition weights is easier if the weights have a smooth, differentiable effect on the unit's behaviour. This can be achieved by using a stochastic unit whose probability of being on is a smooth monotonic function of its recognition weights.

Each data vector determines a single state for all previous hidden units, but for the new unit we consider both possible states and compute the expected value of the cost function in equation (6) given this stochastic behaviour. To make the cost function represent the negative log probability of the data vector allowing for both states of the new hidden unit, it is necessary to subtract the entropy of the state of the new hidden unit, as explained in Hinton and Zemel (1994). If this entropy term is omitted, the expected cost is always minimized by scaling up all of the unit's recognition weights so that it behaves deterministically.

While searching for the best recognition weights we want the hidden unit to behave stochastically in order to smooth the search space, but at the conclusion of the search we want the hidden unit to behave deterministically. This can be achieved by using a temperature parameter which scales both the entropy term and the softness of the logistic function used in recognition (but not the one used in generation). During the search the temperature is reduced from 1 to 0 in small steps. At a given temperature, the cost function to be minimized is

$$\begin{aligned} C = \sum_d \Big( & q_k^d \sum_{i<k} \big[ -s_i^d \log\big(\sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i})\big) - (1-s_i^d) \log\big(1 - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i})\big) \big] \\ & + (1-q_k^d) \sum_{i<k} \big[ -s_i^d \log\big(\sigma(G^d_{k>j>i} + g_{0,i})\big) - (1-s_i^d) \log\big(1 - \sigma(G^d_{k>j>i} + g_{0,i})\big) \big] \\ & + \big[ -q_k^d \log \sigma(g_{0,k}) - (1-q_k^d) \log\big(1 - \sigma(g_{0,k})\big) \big] \\ & + T \big[ q_k^d \log q_k^d + (1-q_k^d) \log(1-q_k^d) \big] \Big) \end{aligned} \qquad (8)$$

where $q_k^d$ is the recognition probability of unit $k$ for data vector $d$. The first line of equation (8) is the cost of coding all the other units given that unit $k$ is on, weighted by the probability that $k$ is on; the second line is the cost if $k$ is off, weighted by the corresponding probability. The third line is the expected cost of coding the state of unit $k$ given its generative bias, and the last line is the (negative) entropy of unit $k$, weighted by $T$.
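To see how the temperature trades smoothness for determinism, the following sketch (ours; every weight is a placeholder assumption) evaluates the per-pattern terms of equation (8) on a grid of $q_k$ values. At $T = 1$ the entropy term keeps the optimal $q_k$ away from 0 and 1; as $T$ shrinks, the optimum is pushed to an extreme and the unit becomes effectively deterministic.

```python
# Illustrative sketch only: the expected cost of equation (8) for one data
# vector, with the entropy term weighted by a temperature T. As T -> 0 the
# optimal q_k is driven to 0 or 1. All numerical values are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bern(s, p):                      # Bernoulli code length (nats)
    return -(s * np.log(p) + (1 - s) * np.log(1 - p))

def expected_cost(q_k, s_i, G, g0, g_k, g0_k, T):
    on  = bern(s_i, sigmoid(G + g0 + g_k)).sum()   # cost of lower units, k on
    off = bern(s_i, sigmoid(G + g0)).sum()         # cost of lower units, k off
    code_k = bern(q_k, sigmoid(g0_k))              # expected cost of coding k
    neg_entropy = q_k * np.log(q_k) + (1 - q_k) * np.log(1 - q_k)
    return q_k * on + (1 - q_k) * off + code_k + T * neg_entropy

rng = np.random.default_rng(4)
n = 10
s_i = rng.integers(0, 2, size=n)
G, g0, g_k = rng.normal(size=n), rng.normal(size=n) * 0.1, rng.normal(size=n) * 0.1
for T in (1.0, 0.5, 0.1):
    qs = np.linspace(0.01, 0.99, 99)
    best_q = qs[np.argmin([expected_cost(q, s_i, G, g0, g_k, 0.0, T) for q in qs])]
    print("T=%.1f  best q_k=%.2f" % (T, best_q))
```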

[Figure 1 shows the CRR network, with generative connections ($g_{0,k}$, $g_{0,j}$, $g_{k,i}$, $g_{0,i}$, $G^d_{k>j>i}$) drawn alongside recognition connections ($r_{0,k}$, $r_{i,k}$, $r_{0,j}$) between units with activities $q_k^d$, $s_j^d$ and $s_i^d$.]

Figure 1. The CRR network. The binary activity of unit $i$, resulting from application of pattern $d$ to the network, is given by $s_i^d$. The unit currently being considered ($k$) has analogue activity given by $q_k^d$. The recognition weight from unit $i$ to unit $k$ is given by $r_{i,k}$ and the generative weight from $k$ to $i$ by $g_{k,i}$. The bias unit is defined to be unit 0. For a given input pattern $d$ we represent the summed generative input from all added units $j$ to input unit $i$ by $G^d_{k>j>i}$.

5. The standard CRR learning procedure

We have explored several variants of the cascaded redundancy reduction (CRR) learning procedure. To simplify the discussion we first describe the standard procedure. The outermost loop of the procedure consists of adding hidden units one at a time in a cascaded fashion (shown in figure 1) as in Fahlman and Lebiere (1990). Once a unit has been added, its recognition weights and bias are never changed.

In searching for the best generative and recognition weights for the new unit, CRR decreases the temperature from 1 to 0 in steps, and at each temperature it performs an alternating optimization that closely resembles the EM algorithm. Holding the recognition weights fixed, an iterative search is performed for the optimal generative weights and biases (an M-step). Then, holding the generative weights and biases fixed, the recognition weights and bias are iteratively improved (a partial E-step). The overall algorithm has the form:

    Set the initial generative biases for visible units
    Repeat adding hidden units until stopping criterion is met
        Create a new unit, k, with random recognition weights and recognition bias
        Set T = 1
        Repeat until T = 0
            Repeat N times
                Do Newton searches for new generative biases of all previous units, j
                Do Newton searches for generative weights from new unit, k
                Do a conjugate gradient search for recognition weights and bias of k
            Decrease T using a temperature schedule
        Add the new hidden unit to the network
        Set the generative bias of the new hidden unit
        Use conjugate gradient to back-fit ALL generative weights and biases
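The listing above translates almost line for line into code. The skeleton below is our paraphrase, with every inner search reduced to a stub so that only the control structure remains; the temperature schedule, the value of $N$ and the stopping criterion are assumptions, since the standard procedure does not fix them.

```python
# Illustrative sketch only: the control structure of the standard CRR
# procedure, with the inner searches stubbed out so the skeleton runs.
# The schedule, n_inner (N) and max_units are assumed values.
import numpy as np

rng = np.random.default_rng(5)

def newton_bias_search(model):   return model   # stub for the bias searches
def newton_weight_search(model): return model   # stub for the g_{k,i} searches
def conjgrad_recognition(model): return model   # stub for the r_{.,k} search
def conjgrad_backfit(model):     return model   # stub for back-fitting

def crr_train(data, max_units=5, n_inner=3,
              schedule=(1.0, 0.5, 0.25, 0.1, 0.0)):
    model = {"units": 0}
    for _ in range(max_units):                   # add hidden units one at a time
        # create a new unit k with random recognition weights and bias
        model["new_rec_w"] = rng.normal(size=data.shape[1])
        for T in schedule:                       # anneal T from 1 to 0
            for _ in range(n_inner):             # the 'Repeat N times' loop
                model = newton_bias_search(model)
                model = newton_weight_search(model)
                model = conjgrad_recognition(model)
        model["units"] += 1                      # add the unit, set its bias
        model = conjgrad_backfit(model)          # back-fit ALL generative weights
    return model

print(crr_train(np.zeros((10, 8)))["units"])
```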

5.1. Updating the recognition weights

The update rule for the recognition weights can be obtained by taking the partial derivative of the cost with respect to each modifiable recognition weight. This gives

$$\frac{\partial C}{\partial r_{j,k}} = \frac{1}{T} \sum_d \frac{\partial C}{\partial q_k^d}\, q_k^d (1-q_k^d)\, s_j^d \qquad (9)$$

where

$$\frac{\partial C}{\partial q_k^d} = \sum_{i<k} \left( -s_i^d \log \frac{\sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i})}{\sigma(G^d_{k>j>i} + g_{0,i})} - (1-s_i^d) \log \frac{1 - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i})}{1 - \sigma(G^d_{k>j>i} + g_{0,i})} \right) - \log \frac{\sigma(g_{0,k})}{1 - \sigma(g_{0,k})} + T \log \frac{q_k^d}{1 - q_k^d}. \qquad (10)$$

A conjugate gradient search technique is used to determine successive steps in the search.

5.2. Updating the generative weights

The generative weights could also be updated using a gradient method. However, we can do better by taking advantage of the independence between the individual weights and the monotonicity of the derivative. First note that the partial derivative with respect to one of the generative weights depends only on that generative weight and the bias weight to the same unit:

$$\frac{\partial C}{\partial g_{k,i}} = -\sum_d q_k^d \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i}) \right). \qquad (11)$$

Similarly, for the partial derivative with respect to one of the bias weights:

$$\frac{\partial C}{\partial g_{0,i}} = -\sum_d \left[ q_k^d \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i}) \right) + (1-q_k^d) \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i}) \right) \right]. \qquad (12)$$

This gives two equations (with two unknowns). However, we will show below that finding the local extrema involves solving equations of only one variable. Setting the derivatives equal to zero in equations (11) and (12) gives

$$0 = \sum_d q_k^d \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i}) \right)$$

$$0 = \sum_d \left[ q_k^d \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i}) \right) + (1-q_k^d) \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i}) \right) \right]$$

and combining gives

$$0 = \sum_d (1-q_k^d) \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i}) \right)$$

which we can solve for $g_{0,i}$. Unfortunately, this equation cannot be solved directly, but as it is a monotonic function it could be solved using a binary search technique. We chose, however, to use Newton's method. Letting

$$A = \sum_d (1-q_k^d) \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i}) \right)$$

we solve for $g_{0,i}$ using

$$g_{0,i}(t+1) = g_{0,i}(t) + \operatorname{sign}(A) \min\left( 2,\ \left| \frac{A(g_{0,i}(t))}{\partial A/\partial g_{0,i} - \epsilon} \right| \right) \qquad (13)$$

where $\epsilon$ is a positive safety factor used to avoid instability where the second derivative (the first derivative of $A$) vanishes. This, together with the constraint that the maximum step size has magnitude 2, gives a robust implementation of Newton's algorithm. Note that due to the monotonicity of $A$, $\partial A/\partial g_{0,i}$ is always non-positive.
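A direct transcription of this safeguarded Newton update (our sketch, with placeholder data) solves $A(g_{0,i}) = 0$ as in equation (13); the derivative of $A$ is always non-positive, and the $\epsilon$ term together with the step cap of 2 keeps the iteration stable.

```python
# Illustrative sketch only: the safeguarded Newton update of equation (13)
# for a single bias g_0i. A(g) decreases monotonically in g, so dA/dg is
# non-positive; eps keeps the step finite where it vanishes, and the step
# magnitude is capped at 2. All numbers below are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(6)
n_data = 200
q_k = rng.random(n_data)                   # recognition probabilities of k
s_i = rng.integers(0, 2, size=n_data)      # states of unit i
G = rng.normal(size=n_data)                # cached input from units above i

def A(g0):                                 # sum_d (1-q)(s_i - sigma(G + g0))
    return np.sum((1 - q_k) * (s_i - sigmoid(G + g0)))

def dA(g0):                                # derivative; always non-positive
    p = sigmoid(G + g0)
    return -np.sum((1 - q_k) * p * (1 - p))

g0, eps = 0.0, 1e-4
for t in range(20):                        # equation (13)
    a = A(g0)
    step = np.sign(a) * min(2.0, abs(a / (dA(g0) - eps)))
    g0 += step
print("solved g_0i = %.4f, residual A = %.2e" % (g0, A(g0)))
```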

Once we have solved for $g_{0,i}$ in equation (13), we can use it to find $g_{k,i}$ in a similar manner:

$$B = \sum_d q_k^d \left( s_i^d - \sigma(G^d_{k>j>i} + g_{0,i} + g_{k,i}) \right)$$

$$g_{k,i}(t+1) = g_{k,i}(t) + \operatorname{sign}(B) \min\left( 2,\ \left| \frac{B(g_{0,i}, g_{k,i}(t))}{\partial B/\partial g_{k,i} - \epsilon} \right| \right). \qquad (14)$$

5.3. Back-fitting the generative weights

Greedy algorithms have the disadvantage that a step taken early may lead to subsequent poor performance. Ideally, we would like to go back and modify earlier connections with added hindsight. Unfortunately, to do this for all connections would defeat the efficiency advantages of a greedy algorithm. Changing the recognition weights to a lower unit could invalidate the recognition weights to higher units, as they depend on activity propagated from lower units. This would also invalidate the generative weights, as they depend on the activated recognition pattern $s^d$. The generative weights and biases, however, can be updated for little cost and without invalidating the recognition connections. The update rule is determined from the appropriate gradient:

$$\frac{\partial C}{\partial g_{j,i}} = -\sum_d s_j^d \left( s_i^d - \sigma\Big( g_{0,i} + \sum_{k>i} s_k^d g_{k,i} \Big) \right) \qquad (15)$$

$$\frac{\partial C}{\partial g_{0,i}} = -\sum_d \left( s_i^d - \sigma\Big( g_{0,i} + \sum_{k>i} s_k^d g_{k,i} \Big) \right). \qquad (16)$$

The conjugate gradient algorithm is also used to update these weights. A few conjugate gradient steps of back-fitting are performed after the addition of each new unit.

6. Algorithm modifications

6.1. Adding a unit more carefully

One way to avoid some of the worst local minima while adding units is to consider a pool of units for each new unit, as in cascade-correlation (Fahlman and Lebiere 1990). These units can be initialized with different random weights and trained independently in parallel. After training, the performance of each candidate unit can be assessed and the best unit added. We found that for a fixed number of units added, this did improve the performance but, as discussed below, for better performance with a fixed amount of time, serial consideration of candidate units was better. Only one candidate unit was trained at any time, but if it did not lead to a lower coding cost it was not included. To avoid overfitting, the coding costs were evaluated on a separate validation set, rather than on the data used to train the weights. This evaluation was performed before the back-fitting stage.
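The accept/reject logic of section 6.1 is easy to isolate. In the sketch below (ours), train_candidate and coding_cost are stand-ins for the weight searches and for equation (5); the point is only that a trained candidate is added exactly when it lowers the coding cost on the held-out validation set.

```python
# Illustrative sketch only: serial consideration of candidate units. Train
# one candidate at a time and keep it only if it lowers the validation
# coding cost. The stand-in functions and numbers are assumptions, not the
# paper's code.
import numpy as np

rng = np.random.default_rng(7)

def train_candidate(model):
    # stand-in: a trained candidate changes the cost by a random amount
    return {"cost_delta": rng.normal(loc=-0.5)}

def coding_cost(model, candidate, val_data):
    return model["val_cost"] + candidate["cost_delta"]

model = {"val_cost": 60.0, "units": 0}
val_data = None                              # placeholder validation set
for attempt in range(20):
    cand = train_candidate(model)
    new_cost = coding_cost(model, cand, val_data)
    if new_cost < model["val_cost"]:         # accept only if validation improves
        model = {"val_cost": new_cost, "units": model["units"] + 1}
        # (back-fitting of the generative weights would happen here)
print("units added: %d, final validation cost: %.1f bits"
      % (model["units"], model["val_cost"]))
```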

6.2. Mini-batches

All the iterative updates are done using batch algorithms. For the data-set we used, and most suitable data-sets, the size of the set is very large and the patterns within it are quite redundant. This suggests that an on-line algorithm would be more efficient. To maintain the parameter-free advantages of the conjugate gradient batch algorithm while increasing the learning speed, we investigated the effect of using mini-batches. For each considered unit, 20% of the patterns, balanced for class content, were used for the recognition weight searches and the generative Newton searches. Using this strategy, we achieved a rapid initial decrease in coding cost. At this stage it became more efficient to increase the size of the mini-batches instead of increasing the number of minimizing steps pursued for each unit. The batch size trades off speed for appropriate descent direction. Early in the search, an exact estimate of the gradient is not important for significant improvement. When the search has evolved to a reasonable solution, further progress depends on progressively more accurate estimates of the gradient of the desired function. This suggests that an adaptively growing batch size, which monitors the percentage of rejected units and increases the batch size when it gets too large, would be a good strategy.
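A growing, class-balanced mini-batch scheme of this kind can be sketched as follows (our illustration; the data, the labels and the particular schedule, which mirrors the one reported in section 7, are assumptions).

```python
# Illustrative sketch only: a gradually increasing batch-size schedule with
# class-balanced subsampling (10% for 20 iterations, 20% for 20, 50% for 10,
# 100% for 30, as in section 7). The data and digit labels are placeholders.
import numpy as np

rng = np.random.default_rng(8)
data = rng.integers(0, 2, size=(6000, 64))
labels = rng.integers(0, 10, size=6000)      # digit classes, for balancing

schedule = [(0.10, 20), (0.20, 20), (0.50, 10), (1.00, 30)]

def balanced_minibatch(frac):
    """Sample frac of the patterns, balanced for class content."""
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        take = max(1, int(round(frac * len(members))))
        idx.append(rng.choice(members, size=take, replace=False))
    return data[np.concatenate(idx)]

for frac, n_iters in schedule:
    for _ in range(n_iters):
        batch = balanced_minibatch(frac)
        # ... one optimization step on `batch` would go here ...
    print("fraction %.0f%% -> batch of %d patterns" % (100 * frac, len(batch)))
```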

7. Results

We tested the algorithm using a data-set previously used to test similar algorithms (Hinton et al 1995; Frey et al 1996). This data-set consists of normalized digits quantized into 8 × 8 binary images from the US Postal Service Office of Advanced Technology. As in Frey et al (1996), the data were divided into 6000 training examples, 2000 validation examples and 5000 examples for testing. The validation data were used to evaluate whether a unit should be added and also, as they are already loaded, for back-fitting the generative weights after the addition of a unit.

Training curves are shown in figures 2 and 3 for training with the full (training) data-set, the 20% mini-batches, as well as a run with a gradually increasing batch size. The batches were 10% for the first 20 iterations, 20% for the next 20 iterations, 50% for the next 10 iterations, and 100% for the last 30 iterations. The curves plot the CRR-calculated coding cost (an upper bound on the true coding cost) on the validation set.

[Figure: three training curves labelled 'full batches', '1/5 batches' and 'adaptive batches'; axes: Coding Cost (bits) versus Time (seconds).]

Figure 2. Coding cost versus time. The x-axis gives the number of seconds when run on a 200 MHz R4400 chip with a 4 MB secondary cache. Each plotted point represents the addition of one unit. The plot for the mini-batches was run for longer but added no units after 5500 s. The y-axis gives the average coding cost over the data-set (in bits) for the network at that stage.

[Figure: three curves labelled 'full batches', '1/5 batches' and 'adaptive batches'; axes: Coding Cost (bits) versus Number of Hidden Units.]

Figure 3. Coding cost versus number of added units. This graph plots the same information as in figure 2, but with respect to the number of units in the current network. This is not just a simply rescaled version of figure 2, as the consideration of each unit did not always result in a unit being added to the network.

As mentioned, figure 2 shows that the mini-batches allow faster learning (particularly at the beginning) but are limited in their asymptotic accuracy. Increasing the batch size throughout learning allows fast initial learning and good late learning. Figure 3 shows that when using the validation set to decide about adding a new unit, the resulting efficiencies of the networks are similar for all the training methods.

While the objective function of our network was specifically designed, for simplicity, to optimize the coding cost using a single hidden state per data pattern, the resulting generative model is a stochastic model capable of coding each pattern with a distribution over hidden states, just like those constructed using Gibbs sampling, mean-field methods, and Helmholtz machines. We can therefore estimate the actual cost of coding the data using the true posterior distribution of our trained networks. Testing the network trained with the increasing batch size, we found that the CRR-estimated coding cost (using $s^d$ as the sole posterior state) significantly overestimated the true coding cost of the network, particularly as the size of the network grew. Figure 4 shows the estimated coding cost using the CRR (single hidden state) cost on the validation set (as shown in figure 2) reproduced beside the coding cost on the test set calculated using Gibbs sampling of the full posterior distribution of hidden states. Note that the Gibbs-sampled estimate of the test error gives a smaller coding cost (for all but the smallest network) and that the difference increases with the size of the network. The Gibbs-sampled coding cost, on the test data, for the network with 30 hidden units was 38.6 bits, compared with the CRR validation coding cost of 40.8 bits.

For comparison, a Helmholtz machine trained using the wake–sleep algorithm with 72 hidden units (in a layered network architecture) achieved a test-set coding cost of 39.1 bits in 3120 s (Frey et al 1996). This coding cost was achieved using the recognition distribution learned by the algorithm. For networks of this size, it was not possible to compute the true log-likelihood of the data under the generative network learned by the wake–sleep algorithm.

However, Brendan Frey (1997) has found that in a variety of cases in which the networks are small enough to compute the log-likelihood of the data exactly, the coding cost given by the recognition distribution was very close to the true log-likelihood under the generative model. Frey et al (1996) reported the total training time as 7200 s, which included training several architectures from which they picked the best.

[Figure: two curves labelled 'CRR validation error (upper bound)' and 'Gibbs sampled test error (close approximation)'; axes: Coding Cost (bits) versus Time (seconds).]

Figure 4. Coding cost versus time. This graph shows the effect of considering the whole posterior distribution over hidden states when calculating the coding cost. The whole posterior distribution is approximated using prolonged Gibbs sampling.

8. Discussion

The central idea of this training algorithm is to avoid the computational cost of computing the posterior distribution over the hidden states by using a single-state approximation. This gives computational simplicity at the expense of increased coding cost. We found that even though the network was trained to optimize coding cost using a single hidden state, it had a reduced cost when the full posterior distribution over hidden states was considered. Considered in this way, the algorithm produced networks with similar coding cost to those created using the wake–sleep algorithm. The advantage of the CRR method is that the size of the network does not have to be pre-determined. Also, it is hoped that because the networks are biased towards performing well with single-state posteriors, they might lead to simpler, more understandable representations. On the particular problem we tried, our networks of 30 units achieved a lower coding cost than the 72 hidden unit network trained with the wake–sleep algorithm.

Acknowledgments

We thank Brendan Frey for providing us with the code for the Gibbs sampling calculations and the performance results for the stochastic Helmholtz machine. This research was funded by the Institute for Robotics and Intelligent Systems, the Information Technology Research Center, and NSERC. GH is the Nesbitt Burns Fellow of the Canadian Institute for Advanced Research.

References

Breiman L, Friedman J H, Olshen R A and Stone C J 1984 Classification and Regression Trees (Belmont, CA: Wadsworth)

Dayan P, Hinton G E, Neal R M and Zemel R S 1995 The Helmholtz machine Neural Comput. 7 889–904

Fahlman S E and Lebiere C 1990 The Cascade-Correlation Learning Architecture Advances in Neural Information Processing Systems 2 ed D S Touretzky (San Mateo, CA: Morgan Kaufmann)

Frey B J 1997 personal communication

Frey B J, Hinton G E and Dayan P 1996 Does the wake–sleep algorithm produce good density estimators? Advances in Neural Information Processing Systems 8 ed D S Touretzky, M C Mozer and M E Hasselmo (Cambridge, MA: MIT Press)

Hinton G E, Dayan P, Frey B J and Neal R M 1995 The wake–sleep algorithm for unsupervised neural networks Science 268 1158–61

Hinton G E and Zemel R S 1994 Autoencoders, minimum description length and Helmholtz free energy Advances in Neural Information Processing Systems 6 ed J D Cowan, G Tesauro and J Alspector (San Mateo, CA: Morgan Kaufmann) pp 3–10

Jaakkola T, Saul L K and Jordan M I 1996 Fast learning by bounding likelihoods in sigmoid type belief networks Advances in Neural Information Processing Systems 8 ed D S Touretzky, M C Mozer and M E Hasselmo (Cambridge, MA: MIT Press)

Neal R M 1992 Connectionist learning of belief networks Artificial Intelligence 56 71–113

Neal R M and Hinton G E 1998 A new view of the EM algorithm that justifies incremental and other variants Learning in Graphical Models ed M I Jordan (Dordrecht: Kluwer) to be published (currently available from ftp://ftp.cs.utoronto.ca/pub/radford/em.ps.Z)

Redlich A N 1993 Redundancy reduction as a strategy for unsupervised learning Neural Comput. 5 289–304


More information

Role of parameters in the stochastic dynamics of a stick-slip oscillator

Role of parameters in the stochastic dynamics of a stick-slip oscillator Proceeing Series of the Brazilian Society of Applie an Computational Mathematics, v. 6, n. 1, 218. Trabalho apresentao no XXXVII CNMAC, S.J. os Campos - SP, 217. Proceeing Series of the Brazilian Society

More information

The Press-Schechter mass function

The Press-Schechter mass function The Press-Schechter mass function To state the obvious: It is important to relate our theories to what we can observe. We have looke at linear perturbation theory, an we have consiere a simple moel for

More information

Neural Network Training By Gradient Descent Algorithms: Application on the Solar Cell

Neural Network Training By Gradient Descent Algorithms: Application on the Solar Cell ISSN: 319-8753 Neural Networ Training By Graient Descent Algorithms: Application on the Solar Cell Fayrouz Dhichi*, Benyounes Ouarfi Department of Electrical Engineering, EEA&TI laboratory, Faculty of

More information

Lecture 2: Correlated Topic Model

Lecture 2: Correlated Topic Model Probabilistic Moels for Unsupervise Learning Spring 203 Lecture 2: Correlate Topic Moel Inference for Correlate Topic Moel Yuan Yuan First of all, let us make some claims about the parameters an variables

More information

Research Article When Inflation Causes No Increase in Claim Amounts

Research Article When Inflation Causes No Increase in Claim Amounts Probability an Statistics Volume 2009, Article ID 943926, 10 pages oi:10.1155/2009/943926 Research Article When Inflation Causes No Increase in Claim Amounts Vytaras Brazauskas, 1 Bruce L. Jones, 2 an

More information

Entanglement is not very useful for estimating multiple phases

Entanglement is not very useful for estimating multiple phases PHYSICAL REVIEW A 70, 032310 (2004) Entanglement is not very useful for estimating multiple phases Manuel A. Ballester* Department of Mathematics, University of Utrecht, Box 80010, 3508 TA Utrecht, The

More information

Lower Bounds for the Smoothed Number of Pareto optimal Solutions

Lower Bounds for the Smoothed Number of Pareto optimal Solutions Lower Bouns for the Smoothe Number of Pareto optimal Solutions Tobias Brunsch an Heiko Röglin Department of Computer Science, University of Bonn, Germany brunsch@cs.uni-bonn.e, heiko@roeglin.org Abstract.

More information

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks A PAC-Bayesian Approach to Spectrally-Normalize Margin Bouns for Neural Networks Behnam Neyshabur, Srinah Bhojanapalli, Davi McAllester, Nathan Srebro Toyota Technological Institute at Chicago {bneyshabur,

More information

arxiv: v3 [cs.lg] 3 Dec 2017

arxiv: v3 [cs.lg] 3 Dec 2017 Context-Aware Generative Aversarial Privacy Chong Huang, Peter Kairouz, Xiao Chen, Lalitha Sankar, an Ram Rajagopal arxiv:1710.09549v3 [cs.lg] 3 Dec 2017 Abstract Preserving the utility of publishe atasets

More information

VI. Linking and Equating: Getting from A to B Unleashing the full power of Rasch models means identifying, perhaps conceiving an important aspect,

VI. Linking and Equating: Getting from A to B Unleashing the full power of Rasch models means identifying, perhaps conceiving an important aspect, VI. Linking an Equating: Getting from A to B Unleashing the full power of Rasch moels means ientifying, perhaps conceiving an important aspect, efining a useful construct, an calibrating a pool of relevant

More information

Robust Bounds for Classification via Selective Sampling

Robust Bounds for Classification via Selective Sampling Nicolò Cesa-Bianchi DSI, Università egli Stui i Milano, Italy Clauio Gentile DICOM, Università ell Insubria, Varese, Italy Francesco Orabona Iiap, Martigny, Switzerlan cesa-bianchi@siunimiit clauiogentile@uninsubriait

More information

Sensors & Transducers 2015 by IFSA Publishing, S. L.

Sensors & Transducers 2015 by IFSA Publishing, S. L. Sensors & Transucers, Vol. 184, Issue 1, January 15, pp. 53-59 Sensors & Transucers 15 by IFSA Publishing, S. L. http://www.sensorsportal.com Non-invasive an Locally Resolve Measurement of Soun Velocity

More information

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France

APPROXIMATE SOLUTION FOR TRANSIENT HEAT TRANSFER IN STATIC TURBULENT HE II. B. Baudouy. CEA/Saclay, DSM/DAPNIA/STCM Gif-sur-Yvette Cedex, France APPROXIMAE SOLUION FOR RANSIEN HEA RANSFER IN SAIC URBULEN HE II B. Bauouy CEA/Saclay, DSM/DAPNIA/SCM 91191 Gif-sur-Yvette Ceex, France ABSRAC Analytical solution in one imension of the heat iffusion equation

More information

CONTROL CHARTS FOR VARIABLES

CONTROL CHARTS FOR VARIABLES UNIT CONTOL CHATS FO VAIABLES Structure.1 Introuction Objectives. Control Chart Technique.3 Control Charts for Variables.4 Control Chart for Mean(-Chart).5 ange Chart (-Chart).6 Stanar Deviation Chart

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information