Approximating Bayesian inference with a sparse distributed memory system

Joshua T. Abbott (joshua.abbott@berkeley.edu)
Jessica B. Hamrick (jhamrick@berkeley.edu)
Thomas L. Griffiths (tom_griffiths@berkeley.edu)
Department of Psychology, University of California, Berkeley, CA, USA

Abstract

Probabilistic models of cognition have enjoyed recent success in explaining how people make inductive inferences. Yet, the difficult computations over structured representations that are often required by these models seem incompatible with the continuous and distributed nature of human minds. To reconcile this issue, and to understand the implications of constraints on probabilistic models, we take the approach of formalizing the mechanisms by which cognitive and neural processes could approximate Bayesian inference. Specifically, we show that an associative memory system using sparse, distributed representations can be reinterpreted as an importance sampler, a Monte Carlo method of approximating Bayesian inference. This capacity is illustrated through two case studies: a simple letter reconstruction task, and the classic problem of property induction. Broadly, our work demonstrates that probabilistic models can be implemented in a practical, distributed manner, and helps bridge the gap between algorithmic- and computational-level models of cognition.

Keywords: Bayesian inference, importance sampling, rational process models, associative memory models, sparse distributed memory

Introduction

Probabilistic models of cognition can be used to explain the complex inductive inferences people make every day, such as identifying the content of images or learning new concepts from limited evidence (Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010; Tenenbaum, Kemp, Griffiths, & Goodman, 2011). However, these models are typically formulated at what Marr (1982) called the computational level, focusing on the abstract problems people have to solve and their ideal solutions. As a result, they explain why people behave the way they do, rather than how cognitive and neural processes support these behaviors. This approach is thus quite different from previous work on modeling human cognition, which focused on Marr's algorithmic and implementation levels, and it has been criticized because it seems to imply that human minds and brains need to solve intractable computational problems and use structured representations (Gigerenzer & Todd, 1999; McClelland et al., 2010).

Understanding the actual commitments that computational-level accounts of human cognition based on probabilistic models make at the algorithmic and implementation levels requires considering how these levels of analysis could be bridged (Griffiths, Vul, & Sanborn, 2012). Identifying specific cognitive algorithms and neural architectures that can approximate Bayesian inference is a key step towards knowing whether it really poses an intractable problem for human minds, or whether structured representations need to be used to implement models that involve structured probability distributions.

In this paper, we take on this challenge by showing that an associative memory using sparse distributed representations can be used to approximate Bayesian inference, producing behavior consistent with a structured statistical model while using distributed representations of the kind normally associated with artificial neural networks. The associative memory that we use to approximate Bayesian inference implements a Monte Carlo algorithm known as importance sampling.
Previous work has shown that this algorithm can be implemented in a common psychological process model, the exemplar model (Shi, Griffiths, Feldman, & Sanborn, 2010). Shi and Griffiths (2009) further demonstrated that importance sampling can be implemented with a radial basis function neural network. However, this neural network used a localist representation, in which each hypothesis considered by the model had to be represented with a single neuron, a "grandmother cell". While this might be plausible for modeling aspects of perception in which a wide range of neurons prefer specific stimuli, it becomes less plausible for modeling complex cognitive tasks in which hypotheses correspond to structured representations. For example, having separate neurons for each concept or causal structure we consider seems implausible.

We demonstrate that an associative memory that uses distributed representations, specifically sparse distributed memory (SDM; Kanerva, 1988, 1993), can be used to approximate Bayesian inference through importance sampling. The underlying idea is simple: we use the associative memory to store and retrieve exemplars, allowing us to build on the equivalence between exemplar models and importance sampling. The critical advance is that this is done using distributed representations, meaning that arbitrary hypotheses can be represented, and arbitrary distributions of exemplars encoded, by a single architecture. We show that the SDM naturally implements one class of Bayesian models, and explain how to generalize it to implement a broader range of models.

The plan of the paper is as follows. First, we give a brief overview of performing Bayesian inference with importance sampling and summarize how the sparse distributed memory system is implemented. Next, we formalize how importance sampling can be performed using a SDM. We then provide two case studies drawn from the existing literature in which we use the SDM to approximate existing Bayesian models: the first is a simple example involving reconstructing English letters from noisy inputs, and the second is a more sophisticated model of property induction. We conclude the paper with a discussion of implications and future directions.

Background

Our results depend on two important sets of mathematical ideas: approximating Bayesian inference by importance sampling, and sparse distributed memories. We introduce these ideas in turn.

Bayesian inference and importance sampling

Probabilistic models of cognition provide rational solutions to problems of inductive inference, where probability distributions represent degrees of belief and are updated as more data become available. Beliefs are updated by applying Bayes' rule, which says that the posterior probability of a hypothesis h given observed data d is proportional to the probability of observing d if h were the correct hypothesis (known as the likelihood) multiplied by the prior probability of that hypothesis:

$$p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\int_{H} p(d \mid h)\, p(h)\, dh} \quad (1)$$

Unfortunately, computing the integral in the denominator is computationally expensive and often intractable. This has resulted in the development of many algorithms for approximating Bayesian inference.

For the sake of illustration, consider the case in which we have noisy observations x of a stimulus x*. To recover the value of x*, we use Bayes' rule to compute the posterior distribution over x*:

$$p(x^* \mid x) = \frac{p(x \mid x^*)\, p(x^*)}{\int_{x^*} p(x \mid x^*)\, p(x^*)\, dx^*} \quad (2)$$

It is often desirable to compute the expectation of the posterior distribution over some function f(x*):

$$E[f(x^*) \mid x] = \int f(x^*)\, p(x^* \mid x)\, dx^* \quad (3)$$

where the choice of f(x*) depends on the task. However, evaluating this expectation still requires computing the full posterior distribution.

To approximate expectations over posterior distributions, we can use a Monte Carlo method known as importance sampling (see, e.g., Neal, 1993), in which a finite set of samples is used to represent the posterior. These samples are drawn from a surrogate distribution q(x*) and assigned weights proportional to the ratio p(x* | x)/q(x*):

$$E[f(x^*) \mid x] = \int f(x^*)\, \frac{p(x^* \mid x)}{q(x^*)}\, q(x^*)\, dx^* \quad (4)$$

Given a set of K samples {x*_k} distributed according to q(x*), this integral can be approximated by:

$$E[f(x^*) \mid x] \approx \frac{1}{K} \sum_{k=1}^{K} f(x^*_k)\, \frac{p(x^*_k \mid x)}{q(x^*_k)} \quad (5)$$

with the approximation becoming more precise as K becomes larger.

One possible choice for q(x*) is the prior, p(x*), which yields importance weights proportional to the likelihood, p(x | x*). Formally,

$$E[f(x^*) \mid x] \approx \frac{1}{K} \sum_{k=1}^{K} f(x^*_k)\, \frac{p(x^*_k \mid x)}{p(x^*_k)} = \frac{1}{K} \sum_{k=1}^{K} f(x^*_k)\, \frac{p(x \mid x^*_k)\, p(x^*_k)}{p(x^*_k)\, p(x)} = \alpha(x) \sum_{k=1}^{K} f(x^*_k)\, p(x \mid x^*_k) \quad (6)$$

where we assume x*_k is drawn from the prior, p(x*), and α(x) is a constant of proportionality that depends only on x.

Returning to the general case of data d and hypotheses h, we can use the same approximation to compute the expectation of a function f(h) given observed data:

$$E[f(h) \mid d] \approx \alpha(d) \sum_{k=1}^{K} f(h_k)\, p(d \mid h_k) \quad (7)$$

where h_k is drawn from the prior p(h), and α(d) is a constant of proportionality that depends only on d.
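As a concrete illustration of Equations 6 and 7, the sketch below estimates a posterior expectation by weighting samples drawn from the prior by their likelihood. The code is ours rather than the paper's, and the Gaussian toy problem, function names, and parameter values are illustrative assumptions; normalizing by the summed weights absorbs the constant α(x).

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_mean_is(f, likelihood, prior_samples, x):
    """Importance-sampling estimate of E[f(x*) | x] with the prior as the
    proposal (Equations 6-7): each prior sample is weighted by p(x | x*_k)."""
    w = np.array([likelihood(x, xs) for xs in prior_samples])
    v = np.array([f(xs) for xs in prior_samples], dtype=float)
    return (w * v).sum() / w.sum()

# Toy check (not from the paper): x* ~ N(0, 1), x | x* ~ N(x*, sigma^2).
sigma = 0.5
prior_samples = rng.normal(0.0, 1.0, size=10_000)          # draws from p(x*)
likelihood = lambda x, xs: np.exp(-0.5 * ((x - xs) / sigma) ** 2)
x_obs = 1.2
estimate = posterior_mean_is(lambda xs: xs, likelihood, prior_samples, x_obs)
exact = x_obs / (1.0 + sigma ** 2)   # conjugate-Gaussian posterior mean
print(estimate, exact)               # the two values should be close
```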
Sparse distributed memory

Sparse distributed memory (SDM) was developed as an algorithmic-level model of human memory, designed to encapsulate the notion that distances between concepts in memory correspond to distances between points in high-dimensional space (Kanerva, 1988, 1993). In particular, it has a natural interpretation as an artificial neural network that uses distributed representations. SDMs preserve distances between items in memory by storing them in a distributed manner. Assume that items enter memory as two strings of bits, with N bits indicating a location and L bits indicating its content.

A conventional approach would be to sequentially enumerate the set of 2^N locations (more technically, addresses), storing items by setting the content bits at the appropriate address in turn. In contrast, a SDM samples M ≪ 2^N addresses a_j from the space of 2^N possible addresses to use as registers. Items are then stored by changing the bits that encode the content associated with multiple addresses, according to how close those addresses are to the target location. The algorithms for writing to and reading from a SDM are given below and are illustrated in Figure 1.

Figure 1: An illustration of the basic read and write operations over SDMs. The outer dotted line represents the space of 2^N possible addresses, while the squares labeled A_m represent the M sampled hard addresses used for storage. The address requested for an operation is the x at the center of the blue circle of radius D. The hard addresses selected for the operation correspond to the blue squares within the Hamming radius of x.

Writing

A SDM stores an L-bit vector z associated with an N-bit location x by storing the pattern at multiple addresses a_j near x. Since the set of M available addresses does not enumerate the total space of 2^N, there may be very few addresses near x. Consequently, z is written to all addresses a_j that are within a Hamming distance D of x (i.e., those a_j that differ from x in D or fewer bits). The contents of these selected addresses are modified to store the pattern z such that each bit in the contents is increased or decreased by 1, depending on whether the corresponding bit in z is a 1 or a 0, respectively.

A SDM can be constructed as a neural network with N units in the input layer, a hidden layer with M units (one for each sampled address), and an output layer with L units. The weights between the input and hidden layers correspond to the M × N matrix A = [a_1; a_2; ...; a_M] of hard addresses, and the weights between the hidden and output layers correspond to the M × L matrix C of contents stored at each address. The rule for writing z to memory address x is expressed as:

$$y = \Theta_D(A x) \quad (8)$$
$$C = C + y z^{\top} \quad (9)$$

where Θ_D is a function that converts its argument to zeros and ones, with Θ_D(w) = 1 if ½(N − w) ≤ D and 0 otherwise. y is thus a selection vector that picks out a particular set of addresses. The expected number of addresses selected in y is a function of N, M, and D:

$$E[y^{\top} y] = \frac{M}{2^N} \sum_{d=0}^{D} \binom{N}{d} \quad (10)$$

Reading

To read a pattern out of memory from address x, the SDM again computes an M-bit selection vector y of addresses within Hamming distance D of x. The contents of each address selected by this vector are summed, resulting in a vector ẑ of length L. The rule for reading ẑ from memory address x is expressed as:

$$y = \Theta_D(A x) \quad (11)$$
$$\hat{z} = C^{\top} y \quad (12)$$

The output ẑ can then be passed through a non-linearity to return a binary vector if desired.
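These read and write rules are compact enough to state directly in code. The sketch below is our own illustration (the class and parameter names are not from the paper): it selects hard addresses by computing Hamming distances explicitly rather than through the Θ_D threshold on a matrix product, and it follows the verbal description of the write rule by adding +1 or −1 to the counters according to each content bit.

```python
import numpy as np

class SparseDistributedMemory:
    """Minimal sketch of the SDM write/read rules (Equations 8-12)."""

    def __init__(self, n_bits, l_bits, n_hard_addresses, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        # M hard addresses sampled uniformly from the 2^N possible locations.
        self.A = rng.integers(0, 2, size=(n_hard_addresses, n_bits))
        # Counter matrix C (M x L), initially empty.
        self.C = np.zeros((n_hard_addresses, l_bits))

    def _select(self, x, radius):
        # Selection vector y: 1 for hard addresses within Hamming distance D of x.
        hamming = (self.A != np.asarray(x)).sum(axis=1)
        return (hamming <= radius).astype(float)

    def write(self, x, z, radius):
        # Update the counters of all selected addresses, incrementing for
        # content bits equal to 1 and decrementing for bits equal to 0.
        y = self._select(x, radius)
        self.C += np.outer(y, 2 * np.asarray(z) - 1)

    def read(self, x, radius):
        # Sum the counters of the selected addresses (Equation 12).
        y = self._select(x, radius)
        return self.C.T @ y
```

Storing an item and recalling it from a noisy cue then amounts to `mem.write(x, z, D_w)` followed by `mem.read(x_noisy, D_r)`, thresholding the result at zero if a binary pattern is wanted.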
SDMs as importance samplers

Previous work formalizing a probabilistic interpretation of SDMs (Anderson, 1989) lays the groundwork for using SDMs to perform Bayesian inference. Here, we show that the output of a SDM approximates the expectation of a function f(x*) over the posterior distribution p(x* | x) by linking its behavior to that of the importance sampler in Equation 6.

Writing

Let w(a_j, x*) be the probability of writing to address a_j given an input address x*. In the standard SDM, this is 1 if the Hamming distance between a_j and x* is less than or equal to D, and 0 otherwise. In the limit, the number of addresses increases to the point where we will always be able to write to exactly x* (i.e., to set D = 0). Thus, this writing probability must satisfy the following constraint:

$$\lim_{M \to 2^N} w(a_j, x^*) = \prod_{i=1}^{N} \delta(x^*_i - a_{ji}) \quad (13)$$

After writing K (address, data) pairs (x*_k, z_k), the value of the counter associated with bit i of address a_j will be:

$$c_{ji} = \sum_{k=1}^{K} w(a_j, x^*_k)\, z_{ki} \quad (14)$$

Reading

We are given a location x, which as before is a corrupted version of x*. Let r(x, a_j) be the probability that we read from address a_j given input x. In the standard SDM, this is 1 if the Hamming distance between a_j and x is less than or equal to D, and 0 otherwise. Then, the output of the SDM for a particular set of addresses A is:

$$\hat{z} = \sum_{j=1}^{M} c_j\, r(x, a_j) \quad (15)$$

where c_j is the vector of counters defined in Equation 14. To see how this output behaves for any SDM, we consider the expected value of ẑ over sampled sets of addresses A. We first substitute Equation 14 into Equation 15 and simplify according to the linearity of expectation:

$$E_A[\hat{z} \mid x] = E_A\!\left[\, \sum_{j=1}^{M} \left( \sum_{k=1}^{K} w(a_j, x^*_k)\, z_k \right) r(x, a_j) \right] \quad (16)$$
$$= \sum_{k=1}^{K} z_k\, E_A\!\left[\, \sum_{j=1}^{M} w(a_j, x^*_k)\, r(x, a_j) \right] \quad (17)$$

As our address space grows larger (as in Equation 13), this approaches:

$$\lim_{M \to 2^N} E_A[\hat{z} \mid x] = \sum_{k=1}^{K} z_k \int \delta(x^*_k - a_j)\, r(x, a_j)\, da_j \quad (18)$$

Thus, in the limit, the expected value of ẑ read from the SDM will be:

$$E_A[\hat{z} \mid x] = \sum_{k=1}^{K} z_k\, r(x, x^*_k) \quad (19)$$

Comparing Equation 19 to Equation 6 yields our main result: SDMs can perform importance sampling by defining a reading density proportional to a likelihood function, approximating the posterior expectation of the items stored in memory.
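Equation 19 can be checked directly for the standard read rule. In the sketch below (our own code, with a deliberately small N so that the full address space can be enumerated), exemplars are written with D_w = 0, their contents are accumulated exactly as in Equation 14, and the value read from the memory is compared against the sum over exemplars weighted by the read rule.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N, L, K, D_r = 8, 5, 50, 2

# Full address space (M = 2^N), so writing with D_w = 0 stores each exemplar
# exactly at its own address, as in the limit of Equation 13.
A = np.array(list(product([0, 1], repeat=N)))
C = np.zeros((len(A), L))

# K (address, content) exemplar pairs with arbitrary binary contents.
X_star = rng.integers(0, 2, size=(K, N))
Z = rng.integers(0, 2, size=(K, L))

# Write with D_w = 0: only the exactly matching hard address is updated.
for x_star, z in zip(X_star, Z):
    j = int((A == x_star).all(axis=1).argmax())
    C[j] += z

# Read from a corrupted probe x with radius D_r.
x = X_star[0].copy()
x[:2] ^= 1                                        # flip two bits
y = ((A != x).sum(axis=1) <= D_r).astype(float)
z_hat = C.T @ y

# Equation 19: the same quantity written directly as a sum over exemplars,
# with r(x, x*_k) = 1 when the Hamming distance is at most D_r.
r = ((X_star != x).sum(axis=1) <= D_r).astype(float)
assert np.allclose(z_hat, (Z * r[:, None]).sum(axis=0))
```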

More formally, the expectation of ẑ given x is proportional to the output of the importance sampling approximation of the expectation of f(x*) with respect to p(x* | x):

$$E_A[\hat{z} \mid x] \propto \sum_{k=1}^{K} f(x^*_k)\, p(x \mid x^*_k) \quad (20)$$

provided z_k ∝ f(x*_k) and r(x, x*_k) ∝ p(x | x*_k).

The utility of this result is limited with the standard formulation of the SDM, as it only holds in the limit where the address space becomes large and D becomes small, meaning that r(x, x*) becomes a delta function. Instead, we can consider generalizations in which we use different values of D for reading and writing (D_r and D_w, respectively), or in which we choose r(x, x*) more freely. These modifications allow us to approximate a variety of Bayesian models using SDMs.

We note further that in most practical applications the address space will not approach the limit (i.e., M ≪ 2^N). For any sampled set of addresses, we would still expect the value of ẑ to be near the posterior mean E[f(x*) | x], but we cannot make any statement about how close. We leave it as an area for future work to place analytic bounds on the accuracy of the SDM's approximation. For the case studies we present here, we estimate the variance of the SDM using Monte Carlo approximations.

In the following two sections, we evaluate SDMs as a scheme for approximating Bayesian inference in two tasks: one where the Bayesian likelihood matches the standard SDM read rule, and one where the SDM read rule is adjusted to match the Bayesian likelihood. The second case further illustrates how SDMs can be applied to more general problems of Bayesian inference that go beyond simply removing noise from a stimulus.

Letter reconstruction

As a simple illustration of approximating Bayesian inference with a SDM, we consider the task of recovering images of English letters, x*, from noisy observations x. To solve this problem, we set up a Bayesian model loosely based on that presented in Rumelhart and Siple (1974). Each letter of the alphabet is encoded as a binary feature vector of length N = 14 based on the Rumelhart-Siple font template (Figure 2).

Figure 2: The Rumelhart-Siple font feature map and the Rumelhart-Siple representation for the letter A, along with its binary feature pattern.

SDM approximation

In the Bayesian model, we wish to reconstruct the original letters x* from the exemplars x by computing the mean of the posterior distribution over original letters, p(x* | x); i.e., f(x*) = x*. Each of the original letters is associated with a prior probability, p(x*), set to the relative letter frequency of English text (Lewand, 2000). Following the generative model, a noisy image x is produced from the original letter x* via a noise process in which at most B bits are flipped (uniformly). Given a noisy image vector x, we define the likelihood p(x | x*) to be uniformly distributed over the number of possible bit strings in a hypersphere of radius D_r around x*.
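The Bayesian side of this task is small enough to compute exactly by enumerating the 26 letters. The sketch below is ours, not the paper's code, and it substitutes placeholders for quantities we do not reproduce here: random 14-bit vectors stand in for the Rumelhart-Siple letter codes, and a uniform prior stands in for the Lewand (2000) letter frequencies.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(2)
N, D_r = 14, 2
letters = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ")

# Placeholder stand-ins for the Rumelhart-Siple codes and letter frequencies.
codes = {c: rng.integers(0, 2, size=N) for c in letters}
prior = {c: 1.0 / len(letters) for c in letters}

def likelihood(x, x_star, radius=D_r):
    """Uniform over the bit strings within Hamming distance `radius` of x*."""
    if int((x != x_star).sum()) > radius:
        return 0.0
    return 1.0 / sum(comb(N, d) for d in range(radius + 1))

def posterior_mean(x):
    """Exact posterior mean of x* given the noisy image x, i.e., the quantity
    the SDM is meant to approximate."""
    w = np.array([prior[c] * likelihood(x, codes[c]) for c in letters])
    if w.sum() == 0:
        return None                 # observation falls outside every hypersphere
    w /= w.sum()
    return sum(wk * codes[c] for wk, c in zip(w, letters))

# Example: corrupt the code for "A" by flipping two bits, then reconstruct.
x = codes["A"].copy()
x[rng.choice(N, size=2, replace=False)] ^= 1
print(posterior_mean(x))
```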
In the SDM, we sample exemplars x* of the original letters from the prior p(x*) and write them to the SDM at inputs x* (i.e., z = x*). The likelihood defined for the Bayesian model is naturally compatible with the standard read rule of a SDM, where we only consider hypotheses x* that are within Hamming distance D_r of x. Thus, we can set the read function r(x*, x) of the SDM to be a variation of this likelihood:

$$p(x \mid x^*) \propto r(x^*, x) = \begin{cases} \left[\, \sum_{d=0}^{D_r} \binom{N}{d} \right]^{-1} & \text{if } \lVert x - x^* \rVert \le D_r \\ 0 & \text{otherwise} \end{cases}$$

The corrupted images x are the inputs that we attempt to read from, and D_r is the SDM's read radius; we additionally define D_w to be the write radius. The intuition is that we write the original letters z = x* to input x*, and read from x outputs ẑ that are similar to the mean of the Bayesian posterior.
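The pieces above can be combined into a compact end-to-end simulation of the procedure described in the Analysis below. The code is our own sketch: the letter codes and prior are again placeholders, the address counts and random seeds are arbitrary, and contents are accumulated directly as in Equation 14 rather than in the ±1 form of the classical write rule.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, M, D_w, D_r = 14, 1000, 4096, 2, 2
letters = rng.integers(0, 2, size=(26, N))      # placeholder letter codes
prior = np.full(26, 1.0 / 26)                   # placeholder prior

# Write K exemplars sampled from the prior into an SDM with M hard addresses:
# each exemplar's code is added to every hard address within D_w of it.
A = rng.integers(0, 2, size=(M, N))
C = np.zeros((M, N))
for idx in rng.choice(26, size=K, p=prior):
    x_star = letters[idx]
    near = (A != x_star).sum(axis=1) <= D_w
    C[near] += x_star

# Read a corrupted letter: sum the contents of hard addresses within D_r of it.
x = letters[0].copy()
x[rng.choice(N, size=2, replace=False)] ^= 1
y = ((A != x).sum(axis=1) <= D_r).astype(float)
z_hat = C.T @ y

# Exact posterior mean under the matching likelihood (uniform within radius D_r;
# the constant normalizer cancels when the posterior is normalized).
lik = np.array([1.0 if (x != l).sum() <= D_r else 0.0 for l in letters])
post = prior * lik
post_mean = (post[:, None] * letters).sum(axis=0) / post.sum()

# The pre-thresholded SDM output should correlate with the posterior mean.
print(np.corrcoef(z_hat, post_mean)[0, 1])
```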

Analysis

To evaluate how well the SDM approximates the posterior mean, we sampled 1000 exemplars from the prior distribution p(x*). We then created three images x for each letter in the alphabet, each with two bits of corruption, yielding a total of 72 test images. We repeated this simulation, sampling from the prior and generating test images, 20 times for each SDM with parameters D_r, D_w, and M, where each simulation used a different set of sampled addresses A.

We determined the appropriate settings of the read and write radii, D_r and D_w, by considering the constraints imposed by the SDM and the specific problem of letter reconstruction. The mean Hamming distance between pairs of letters indicated that the read radius must lie somewhere between 2 (the amount of corruption) and 5. Choosing a radius outside these bounds would have the effect of returning an inaccurate expectation, either because the noise was greater than the signal (D_r < 2) or because too many hypotheses were considered (D_r > 5). So, we chose D_r = 2, ensuring that the true x* would always fall within this radius while also minimizing the number of incorrect hypotheses considered. For the write radius, we considered four different values, D_w = {0, 1, 2, 3}. We stored all 1000 letters in each SDM, varying the number of hard addresses, M, among eight evenly spaced values between 2048 and 2^N = 16384. We then read the value ẑ for each corrupted input x, for the different address spaces A.

For the Bayesian model, we analytically calculated the posterior mean by evaluating the full posterior for each x and then computing the average x* weighted by its posterior probability. We then computed the average correlation between the SDM values of ẑ and the Bayesian posterior mean across sampled addresses A. These results are displayed in Figure 3.

Figure 3: The average correlations between the Bayesian posterior mean and pre-thresholded SDM outputs for the task of recovering a Rumelhart-Siple letter from a noisy observation. D_r = 2 for each of the four values of D_w presented.

The SDMs with D_w = 0 performed similarly to the Bayesian model only when M = 2^N (ρ = , se = ), reflecting the intuition that it is highly unlikely to find an exact address in a random sample of fewer than 2^N addresses. Conversely, the SDMs with D_w = 2 and D_w = 3 had near-constant correlations with the Bayesian model (ρ ≈ 0.88 for D_w = 2 and ρ ≈ 0.79 for D_w = 3), regardless of the size of the hard address space. This behavior was also in line with our expectations: with D_w = 2 and D_w = 3, the probability of having no hard addresses within D_w of x* is extremely low even for M = 2048, and this probability only decreases as M increases.

In summary, these results show that SDMs can naturally approximate Bayesian models of noisy stimulus reconstruction. Here, we set the likelihood function to follow the standard SDM read rule. In the next section we consider a more general example of Bayesian inference and explore the consequences of adjusting the SDM read rule to match the likelihood in question.

Property induction

If you are told that a horse has a particular property, say a certain protein, what other animals do you think share this protein? Questions of this type are considered problems of property induction, where one or more categories in a domain are observed to have some novel property and the task is to determine how the property is distributed over the domain. For our analyses we explore the category-based induction task introduced by Osherson, Smith, Wilkie, Lopez, and Shafir (1990), where judgments are made as to whether certain animals can get a disease, given knowledge of other animals that can catch it.

SDM approximation

We solve this problem with a Bayesian model of property induction based on Kemp and Tenenbaum (2009). Here, we observe a set of examples d of a concept C (known to have the property in question), and we aim to calculate p(y ∈ C | d), the probability that object y is also a member of C. Averaging over all possible concepts, we compute:

$$p(y \in C \mid d) = \sum_{h \in H} p(y \in C \mid h)\, p(h \mid d) \quad (21)$$

where the hypothesis space H is the set of all possible concepts in a domain, and p(y ∈ C | h) = 1 if y is in hypothesis h, and 0 otherwise. The posterior probability p(h | d) can be computed via Bayes' rule from Equation 1.

We explore two variations of likelihood functions based on assumptions about how the data were generated. If the data are generated uniformly at random, the likelihood follows from weak sampling: p(d | h) = 1 if all examples in d are in h, and p(d | h) = 0 otherwise. If the data are generated at random from the true concept C, the likelihood follows from strong sampling, where p(d | h) = 1/|h|^n if all n examples in d belong to h, and p(d | h) = 0 otherwise.
This likelihood function incorporates the size principle, whereby hypotheses containing fewer items are given more weight than hypotheses containing more items for the same set of data (Tenenbaum & Griffiths, 2001).

In the SDM, the location x corresponds to a hypothesis h, and the content z corresponds to data d. For both the weak and strong sampling assumptions, we set the read function of the SDM, r(d, h), to be proportional to the Bayesian likelihood by weighting the selection vector y by p(d | h).
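Equation 21 and the two sampling assumptions can be sketched in a few lines. The code below is ours: the hypothesis space is a randomly generated stand-in with a uniform prior, rather than the taxonomic hypothesis space of Kemp and Tenenbaum (2009), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n_animals = 10

# Placeholder hypothesis space: each hypothesis is a boolean vector over the
# 10 animals indicating which of them belong to the concept.
H = rng.integers(0, 2, size=(200, n_animals)).astype(bool)
prior = np.full(len(H), 1.0 / len(H))

def likelihood(d, h, strong=False):
    """d: indices of the observed premise animals; h: boolean concept vector.
    Weak sampling: 1 if all premises are in h. Strong sampling: 1 / |h|^n."""
    if not h[d].all():
        return 0.0
    return (1.0 / h.sum() ** len(d)) if strong else 1.0

def induction(d, strong=False):
    """Equation 21: p(y in C | d) for every animal y, averaging concept
    membership over the posterior distribution on hypotheses."""
    w = prior * np.array([likelihood(d, h, strong) for h in H])
    w /= w.sum()
    return (w[:, None] * H).sum(axis=0)

premises = [0, 3, 7]                        # e.g., "animals 0, 3, 7 can get disease X"
print(induction(premises, strong=False))    # weak sampling
print(induction(premises, strong=True))     # strong sampling
```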

Analysis

To evaluate the performance of this modified SDM, we used the category-based induction dataset from Osherson et al. (1990), consisting of 36 two-premise arguments and 45 three-premise arguments (e.g., "Cat, Dog, Horse can get disease X") for a domain of 10 animals. Thus, each observation d is encoded as a binary feature vector of length N = 10. We used a taxonomic hypothesis space following Kemp and Tenenbaum (2009), constructed from human similarity judgments for each possible pairing of animals. As before, we stored 1000 exemplars sampled from p(h) in the SDM, and varied the number of hard addresses among eight evenly spaced values between 128 and 2^N = 1024. We evaluated this modified SDM against the Bayesian model of property induction for three values of D_w over a constant reading radius D_r = 0.

The results are presented in Figure 4. We find that the SDM approximates Bayesian inference for a variety of write radii and, as predicted, matches Bayes when D_w = 0 and M = 2^N. (We note that these correlations are not strictly monotonic, as one might intuit, due to sampling error in the address space; this is most notable when D_w = 0.) Furthermore, the SDMs that use a weak or strong sampling read rule correlate equally well with the Bayesian models that use weak or strong sampling likelihoods, respectively. This nicely illustrates the correspondence between the SDM's read rule and a Bayesian likelihood, and implies that SDMs can approximate a broad range of Bayesian models by adjusting the read rule to match the likelihood function.

Figure 4: The average correlations between the Bayesian posterior and pre-thresholded SDM outputs assuming weak sampling (left panel) and strong sampling (right panel) for the property induction task. D_r = 0 for each of the three values of D_w presented.

Conclusion

What constraints do the algorithmic and implementation levels impose on probabilistic models of cognition? We explored this question by considering whether the computations employed by a distributed representation of associative memory could approximate Bayesian inference. By choosing the SDM read rule to appropriately match the likelihood function, we showed in two separate scenarios that SDMs can accurately implement a specific form of approximate Bayesian inference, importance sampling.

Future work will take our analyses one step further and investigate whether SDMs can approximate Bayesian inference without modifying their read rule. Specifically, can we encode the data in a manner such that reading it from a standard SDM is still proportional to the likelihood? If so, it would make SDMs an even more general and appealing approach to performing Bayesian inference. Regardless, the results presented in this paper are an important step towards an explanation of how the structured representations assumed by probabilistic models of cognition could be expressed in the distributed, continuous representations commonly used in algorithmic-level models such as neural networks.

Acknowledgments

This work was supported by grant number FA from the Air Force Office of Scientific Research and a Berkeley Fellowship awarded to JBH.

References

Anderson, C. H. (1989). A conditional probability interpretation of Kanerva's sparse distributed memory. In International Joint Conference on Neural Networks.

Gigerenzer, G., & Todd, P. (1999). Simple heuristics that make us smart. Oxford University Press.

Griffiths, T., Chater, N., Kemp, C., Perfors, A., & Tenenbaum, J. (2010). Probabilistic models of cognition: Exploring representations and inductive biases. Trends in Cognitive Sciences, 14(8).

Griffiths, T., Vul, E., & Sanborn, A. (2012). Bridging levels of analysis for probabilistic models of cognition. Current Directions in Psychological Science, 21(4).

Kanerva, P. (1988). Sparse distributed memory. MIT Press.

Kanerva, P. (1993). Sparse distributed memory and related models. In M. Hassoun (Ed.), Associative neural memories: Theory and implementation. Oxford University Press.

Kemp, C., & Tenenbaum, J. B. (2009). Structured statistical models of inductive reasoning. Psychological Review, 116(1).

Lewand, R. (2000). Cryptological mathematics. Mathematical Association of America.

Marr, D. (1982). Vision. San Francisco, CA: W. H. Freeman.
McClelland, J., Botvinick, M., Noelle, D., Plaut, D., Rogers, T., Seidenberg, M., et al. (2010). Letting structure emerge: Connectionist and dynamical systems approaches to cognition. Trends in Cognitive Sciences, 14(8).

Neal, R. (1993). Probabilistic inference using Markov chain Monte Carlo methods (Tech. Rep. No. CRG-TR-93-1). Department of Computer Science, University of Toronto.

Osherson, D., Smith, E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97(2), 185.

Rumelhart, D., & Siple, P. (1974). Process of recognizing tachistoscopically presented words. Psychological Review, 81(2), 99.

Shi, L., & Griffiths, T. (2009). Neural implementation of hierarchical Bayesian inference by importance sampling. Advances in Neural Information Processing Systems, 22.

Shi, L., Griffiths, T. L., Feldman, N. H., & Sanborn, A. N. (2010). Exemplar models as a mechanism for performing Bayesian inference. Psychonomic Bulletin & Review, 17(4).

Tenenbaum, J., & Griffiths, T. (2001). Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4).

Tenenbaum, J., Kemp, C., Griffiths, T., & Goodman, N. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 331(6022).
