# CPSC 540: Machine Learning


1 CPSC 540: Machine Learning
MCMC and Non-Parametric Bayes
Mark Schmidt, University of British Columbia, Winter 2016

2 Admin
I went through the project proposals: some of you got a message on Piazza; no news is good news. A5 coming tomorrow. Project submission details coming next week.

5 Overview of Bayesian Inference Tasks
In the Bayesian approach, we typically work with the posterior
p(θ | x) = (1/Z) p(x | θ) p(θ) = (1/Z) p̃(θ),
where Z makes the distribution sum/integrate to 1.
Typically, we need to compute the expectation of some f with respect to the posterior,
E[f(θ)] = ∫ f(θ) p(θ | x) dθ.
Examples:
- If f(θ) = p(x̃ | θ), we get the posterior predictive.
- If f(θ) = 1 and we use p̃(θ), we get the marginal likelihood Z.
- If f(θ) = I(θ ∈ S), we get the probability of S (e.g., marginals or conditionals).

8 Last Time: Conjugate Priors and Monte Carlo Methods
Last time we saw two ways to deal with this:
1. Conjugate priors: apply when p(x | θ) is in the exponential family. Set p(θ) to a conjugate prior, and the posterior will have the same form. The integrals often have closed-form solutions, but this restricts the class of models.
2. Monte Carlo methods: sample θ^i from p(θ | x) and use
E[f(θ)] = ∫ f(θ) p(θ | x) dθ ≈ (1/n) Σ_{i=1}^n f(θ^i).
We discussed basic Monte Carlo methods: the inverse CDF, ancestral sampling, rejection sampling, and importance sampling. These work well in low dimensions or for posteriors with analytic properties.
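As a concrete reminder of the Monte Carlo approximation above, here is a minimal sketch (assuming NumPy, with a toy Beta(3, 2) posterior standing in for p(θ | x) and f(θ) = θ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior: Beta(3, 2), e.g., a conjugate Beta update from
# Bernoulli data.  We estimate E[f(theta)] for f(theta) = theta by
# simple Monte Carlo, and compare to the exact mean 3 / (3 + 2).
samples = rng.beta(3, 2, size=100_000)
mc_estimate = samples.mean()

exact = 3 / (3 + 2)
print(mc_estimate, exact)  # the estimate should agree to ~2-3 decimals
```

Here the sampler is exact, so the only error is the O(1/√n) Monte Carlo error; the rest of the lecture is about what to do when we cannot sample from the posterior directly.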

11 Limitations of Simple Monte Carlo Methods
These methods tend not to work in complex situations:
- The inverse CDF may not be available.
- The conditionals needed for ancestral sampling may be hard to compute.
- Rejection sampling tends to reject almost all samples.
- Importance sampling tends to give almost zero weight to all samples.
We want an algorithm that gets better over time. There are two main strategies:
- Sequential Monte Carlo: importance sampling where the proposal q_t changes over time, from simple towards the posterior (see "Particle Filter Explained without Equations").
- Markov chain Monte Carlo (MCMC): design a Markov chain whose stationary distribution is the posterior.

15 Motivating Example: Sampling from a UGM
High-dimensional integration problems arise in other settings:
- Bayesian graphical models and Bayesian neural networks.
- Deep belief networks, Boltzmann machines.
Recall the definition of a discrete pairwise undirected graphical model (UGM):
p(x) = (1/Z) ∏_{j=1}^d φ_j(x_j) ∏_{(i,j)∈E} φ_ij(x_i, x_j) = p̃(x)/Z.
In this model:
- Computing p̃(x) is easy.
- Computing Z is #P-hard.
- Generating a sample is NP-hard (at least).
With rejection sampling, the probability of acceptance might be arbitrarily small. But there is a simple MCMC method...

18 Gibbs Sampling for Discrete UGMs
A Gibbs sampling algorithm for pairwise UGMs: start with some configuration x^0, then repeat the following:
1. Choose a variable j uniformly at random.
2. Keep the other variables fixed, x^{t+1}_{-j} = x^t_{-j}, and sample x^{t+1}_j from its conditional,
x^{t+1}_j ∼ p(x_j | x^t_{-j}) = p(x_j | x^t_{MB(j)}),
where MB(j) is the Markov blanket of j.
Analogy: this is a sampling version of coordinate descent: we have transformed d-dimensional sampling into 1-dimensional sampling.
These iterations are very cheap: we only need p̃(x^t) for each value of x^t_j, and then we sample from a single discrete random variable.
Does this work? How long does it take?
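The two steps above can be sketched in code. This is a hypothetical example (not the lecture's model): a 4-node binary chain UGM with assumed unary log-potentials and an agreement bonus on edges, where the Markov blanket of a node is its chain neighbours:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy pairwise UGM: binary states x_j in {0, 1} on a 4-node chain.
d = 4
unary = np.array([0.5, -0.2, 0.1, 0.3])   # log phi_j(x_j = 1); log phi_j(0) = 0
w = 1.0                                    # log phi_ij bonus when neighbours agree
edges = [(0, 1), (1, 2), (2, 3)]
neighbours = {j: [] for j in range(d)}
for i, j in edges:
    neighbours[i].append(j)
    neighbours[j].append(i)

def gibbs_sample(n_iters):
    x = rng.integers(0, 2, size=d)         # some initial configuration x^0
    samples = []
    for t in range(n_iters):
        j = rng.integers(d)                # 1. pick a coordinate at random
        # 2. unnormalized log-probabilities of x_j = 0 and x_j = 1
        #    given the Markov blanket (the chain neighbours of j)
        logp = np.zeros(2)
        logp[1] += unary[j]
        for k in neighbours[j]:
            for v in (0, 1):
                logp[v] += w * (v == x[k])
        p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))  # sigmoid of the gap
        x[j] = rng.random() < p1           # sample the single discrete variable
        samples.append(x.copy())
    return np.array(samples)

S = gibbs_sample(5000)
print(S[1000:].mean(axis=0))  # approximate marginals p(x_j = 1)
```

Note that each iteration only touches the Markov blanket of one variable, which is why the updates are so cheap even when Z for the full model is intractable.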

19 Gibbs Sampling in Action
Start with some initial value: x^0 = [ ].
Select random j: j = 3. Sample variable j: x^1 = [ ].
Select random j: j = 1. Sample variable j: x^2 = [ ].
Select random j: j = 2. Sample variable j: x^3 = [ ].
...
Use all these samples to form an approximation of p(x).

20 Gibbs Sampling in Action: UGMs
Consider using a UGM for image denoising:
- We have unary potentials φ_j for each position.
- Pairwise potentials φ_ij for neighbours on the grid.
- Parameters are trained as a CRF (next time).
- The goal is to produce a noise-free image.

21 Gibbs Sampling in Action: UGMs
Gibbs samples after every 100d iterations:

22 Gibbs Sampling in Action: UGMs
Mean image and marginal decoding:

23 Gibbs Sampling in Action: Multivariate Gaussian
Gibbs sampling works for general distributions. E.g., we can sample from a multivariate Gaussian using only univariate Gaussian sampling.
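A minimal sketch of this idea, assuming a bivariate Gaussian target with a known mean and covariance, using the standard formulas for Gaussian conditionals:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed 2D target: N(mu, Sigma) with correlated coordinates.
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])

# Gaussian conditionals:
# x1 | x2 ~ N(mu1 + (s12/s22)(x2 - mu2), s11 - s12^2/s22), and symmetrically.
def gibbs(n_iters):
    x = np.zeros(2)
    out = np.empty((n_iters, 2))
    for t in range(n_iters):
        cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (x[1] - mu[1])
        cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]
        x[0] = rng.normal(cond_mean, np.sqrt(cond_var))   # univariate draw
        cond_mean = mu[1] + Sigma[0, 1] / Sigma[0, 0] * (x[0] - mu[0])
        cond_var = Sigma[1, 1] - Sigma[0, 1] ** 2 / Sigma[0, 0]
        x[1] = rng.normal(cond_mean, np.sqrt(cond_var))   # univariate draw
        out[t] = x
    return out

X = gibbs(20_000)
print(X[2000:].mean(axis=0))  # should be close to mu = [1, -1]
```

The stronger the correlation in Σ, the smaller each conditional step, which previews why Gibbs samples can be very correlated.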

24 Outline
1. Gibbs Sampling
2. Markov Chain Monte Carlo
3. Metropolis-Hastings
4. Non-Parametric Bayes

27 Homogeneous Markov Chains and Invariant Distributions
Given an initial distribution p(x^0), a Markov chain assumes that
p(x^t | x^{1:t-1}) = p(x^t | x^{t-1}),
which we call the Markov property.
An important special case is homogeneous Markov chains, where
p(x^t = s | x^{t-1} = s') = p(x^{t-1} = s | x^{t-2} = s'),
for all s, s', and t (the transition probabilities don't change over time).
Under weak conditions, homogeneous chains converge to an invariant distribution,
p(s) = Σ_{s'} p(x^t = s | x^{t-1} = s') p(s').
E.g., p(x^t | x^{t-1}) > 0 is sufficient, or the weaker condition of being irreducible and aperiodic.

32 Markov Chain Monte Carlo
Markov chain Monte Carlo (MCMC): given a target p, design transitions such that
(1/n) Σ_{t=1}^n f(x^t) → ∫ f(x) p(x) dx, and/or x^n → p,
as n → ∞. We are generating dependent samples that still solve the integral.
There are many transitions that yield the posterior as the invariant distribution. It is typically easy to design a sampler, but hard to characterize its rate of convergence. Gibbs sampling satisfies the above under very weak conditions.
Typically, we don't keep all of the samples:
- Burn-in: throw away the initial samples, from before the chain converged to the stationary distribution.
- Thinning: only keep every kth sample, since consecutive samples are highly correlated.
It can be very hard to diagnose whether we have reached the invariant distribution. Recent work showed that this is PSPACE-hard (much worse than NP-hard).
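Burn-in and thinning can be illustrated on a toy chain (an assumed AR(1) process, not one of the lecture's samplers), where both the bad initialization and the correlation between consecutive samples are easy to see:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed toy chain: x_t = 0.95 x_{t-1} + N(0, 1) noise.  Its stationary
# distribution is N(0, 1 / (1 - 0.95**2)), but the chain starts far from
# stationarity and consecutive samples are highly correlated.
n = 50_000
x = np.empty(n)
x[0] = 50.0                      # bad initialization, far from the mode
for t in range(1, n):
    x[t] = 0.95 * x[t - 1] + rng.normal()

burn_in, thin = 1000, 20
kept = x[burn_in::thin]          # drop burn-in, keep every 20th sample

print(kept.mean(), kept.var())   # roughly 0 and 1 / (1 - 0.95**2)
```

The thinned samples are much closer to independent, at the cost of keeping far fewer of them.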


36 Gibbs Sampling: Variations
Block Gibbs sampling samples multiple variables jointly:
- Sample a number of variables k > 1 jointly.
- Sample a tree-structured subgraph of a UGM.
Auxiliary-variable sampling introduces variables in order to sample bigger blocks:
- E.g., introduce the z variables in mixture models.
- Also used in Bayesian logistic regression.
Collapsed or Rao-Blackwellized sampling integrates out variables that are not of interest:
- Provably decreases the variance of the sampler.
- E.g., integrate out the hidden states in a Bayesian hidden Markov model.

37 Block Gibbs Sampling in Action For denoising task, we could use two tree-structured blocks:

38 Block Gibbs Sampling in Action
Gibbs vs. tree-structured block-Gibbs samples:

41 Limitations of Gibbs Sampling
Gibbs sampling is nice because it has no parameters: you just need to decide on the blocks and auxiliary variables. But it isn't always ideal:
- Samples can be very correlated, leading to slow progress.
- The conditionals may not have a nice form: if the Markov blanket is not conjugate, we need rejection/importance sampling.
A generalization that can address these issues is Metropolis-Hastings: the oldest algorithm among the "Best of the 20th Century".

42 Outline
1. Gibbs Sampling
2. Markov Chain Monte Carlo
3. Metropolis-Hastings
4. Non-Parametric Bayes

44 Metropolis Algorithm
The Metropolis algorithm for sampling from a continuous p(x): start from some x^0, and on iteration t:
1. Add zero-mean Gaussian noise to x^t to generate a proposal x̂^t.
2. Generate u from U(0, 1).
3. Accept the proposal and set x^{t+1} = x̂^t if
u ≤ p̃(x̂^t) / p̃(x^t),
and otherwise reject it and set x^{t+1} = x^t.
This is a random walk that sometimes rejects steps that decrease the probability. It is another valid MCMC algorithm, although convergence may again be slow.
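The three steps can be sketched as follows, for an assumed 1D target known only up to its normalizing constant (a mixture of two Gaussians), working in log space for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed unnormalized target: equal mixture of Gaussians at -2 and +2.
def log_p_tilde(x):
    return np.logaddexp(-0.5 * (x - 2.0) ** 2, -0.5 * (x + 2.0) ** 2)

def metropolis(n_iters, step=1.0):
    x = 0.0
    out = np.empty(n_iters)
    for t in range(n_iters):
        proposal = x + step * rng.normal()        # 1. Gaussian random-walk step
        log_u = np.log(rng.random())              # 2. u ~ U(0, 1)
        # 3. accept if u <= p~(proposal) / p~(x), i.e. log u <= log ratio
        if log_u <= log_p_tilde(proposal) - log_p_tilde(x):
            x = proposal
        out[t] = x                                # on reject, repeat the old x
    return out

X = metropolis(50_000)
print(X[5000:].mean())  # symmetric target, so the mean is roughly 0
```

Rejected iterations still emit a sample (the old state); dropping them would bias the estimate.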

45 Metropolis Algorithm in Action

47 Metropolis Algorithm Analysis
A Markov chain with transitions p_{ss'} = p(x^t = s' | x^{t-1} = s) is reversible if there exists a distribution p such that
p(s) p_{ss'} = p(s') p_{s's},
which is called detailed balance.
Assuming we reach stationarity, detailed balance is sufficient for p to be the stationary distribution: summing both sides over s gives
Σ_s p(s) p_{ss'} = Σ_s p(s') p_{s's} = p(s') Σ_s p_{s's} = p(s'),
since Σ_s p_{s's} = 1, which is the stationary condition.

49 Metropolis Algorithm Analysis
The Metropolis algorithm has p_{ss'} > 0 and satisfies detailed balance,
p(s) p_{ss'} = p(s') p_{s's}.
We can show this by defining the transition kernel
T_{ss'} = min{1, p̃(s')/p̃(s)},
and observing that
p(s) T_{ss'} = p(s) min{1, p̃(s')/p̃(s)}
= p(s) min{1, ((1/Z) p̃(s')) / ((1/Z) p̃(s))}
= p(s) min{1, p(s')/p(s)}
= min{p(s), p(s')}
= p(s') min{1, p(s)/p(s')}
= p(s') T_{s's}.
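Detailed balance can also be checked numerically. A small sketch with an assumed 3-state target and a uniform (symmetric) proposal, building the full Metropolis transition matrix:

```python
import numpy as np

# Assumed 3-state target distribution and uniform proposal.
p = np.array([0.2, 0.3, 0.5])
n = len(p)
q = np.full((n, n), 1.0 / n)           # symmetric proposal q(s' | s)

# Metropolis transitions: propose s' from q, accept w.p. min(1, p(s')/p(s));
# rejected proposals stay at s, which fills the diagonal.
T = np.zeros((n, n))
for s in range(n):
    for s2 in range(n):
        if s2 != s:
            T[s, s2] = q[s, s2] * min(1.0, p[s2] / p[s])
    T[s, s] = 1.0 - T[s].sum()

# Detailed balance: p(s) T[s, s'] == p(s') T[s', s] for all pairs.
flow = p[:, None] * T
assert np.allclose(flow, flow.T)

# And p is invariant: p T = p.
assert np.allclose(p @ T, p)
print("detailed balance holds")
```

For a symmetric proposal, p(s) T[s, s'] = q min{p(s), p(s')}, which is symmetric in s and s', mirroring the algebra on the slide.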

52 Metropolis-Hastings
Instead of Gaussian noise, consider a general proposal distribution q: the value q(x̂^t | x^t) is the probability of proposing x̂^t.
Metropolis-Hastings accepts the proposal if
u ≤ [p̃(x̂^t) q(x^t | x̂^t)] / [p̃(x^t) q(x̂^t | x^t)],
where the extra terms ensure detailed balance for an asymmetric q: e.g., if you are more likely to propose moving from x^t to x̂^t than the reverse.
This again works under very weak conditions, such as q(x̂^t | x^t) > 0. Gibbs sampling is a special case, but we have a lot of flexibility: you can make performance much better or worse with an appropriate q.
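A sketch of the Hastings correction in action, with an assumed Exp(1) target on x > 0 and a multiplicative (log-normal) proposal. This proposal is asymmetric, and for it the correction ratio q(x | x') / q(x' | x) works out to x' / x:

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed target: Exp(1), so log p~(x) = -x for x > 0.
# Assumed proposal: x' = x * exp(sigma * eps), a log-normal centred at x;
# for this q, the Hastings correction q(x | x') / q(x' | x) equals x' / x.
def metropolis_hastings(n_iters, sigma=0.5):
    x = 1.0
    out = np.empty(n_iters)
    for t in range(n_iters):
        x_new = x * np.exp(sigma * rng.normal())
        # log acceptance ratio: [log p~(x') - log p~(x)] + log q-correction
        log_alpha = (-x_new) - (-x) + np.log(x_new / x)
        if np.log(rng.random()) <= log_alpha:
            x = x_new
        out[t] = x
    return out

X = metropolis_hastings(100_000)
print(X[10_000:].mean())  # Exp(1) has mean 1
```

Dropping the log(x_new / x) term would leave plain Metropolis with an asymmetric proposal, which converges to the wrong distribution.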

57 Metropolis-Hastings
Simple choices for the proposal distribution q:
- Metropolis originally used random walks: x^t = x^{t-1} + ε for ε ∼ N(0, Σ).
- Hastings originally used an independent proposal: q(x^t | x^{t-1}) = q(x^t).
- Gibbs sampling updates a single variable based on its conditional: in this case the acceptance rate is 1, so we never reject.
- A mixture model for q: e.g., mixing between big and small moves.
- "Adaptive MCMC" tries to update q as we go; this needs to be done carefully.
- "Particle MCMC" uses a particle filter to make the proposal.
Unlike rejection sampling, we don't want the acceptance rate to be as high as possible:
- A high acceptance rate may mean we're not moving very much.
- A low acceptance rate definitely means we're not moving very much.
Designing q is an art.

58 Metropolis-Hastings
Metropolis-Hastings for sampling from a mixture of Gaussians:
- A high acceptance rate could mean we are staying in one mode.
- We may want the proposal to be a mixture between a random walk and mode jumping.

59 Outline
1. Gibbs Sampling
2. Markov Chain Monte Carlo
3. Metropolis-Hastings
4. Non-Parametric Bayes

60 Stochastic Processes and Non-Parametric Bayes
A stochastic process is an infinite collection of random variables {x_i}. Non-parametric Bayesian methods use priors defined on stochastic processes:
- This allows an extremely flexible prior, and posterior complexity grows with the data size.
- Typically set up so that samples from the posterior are finite-sized.
The two most common priors are Gaussian processes and Dirichlet processes:
- Gaussian processes define a prior on the space of functions (universal approximators).
- Dirichlet processes define a prior on the space of probabilities (without fixing the dimension).

63 Gaussian Processes
Recall that we can partition a multivariate Gaussian,
μ = [μ_x, μ_y],  Σ = [Σ_xx Σ_xy; Σ_yx Σ_yy],
and the marginal distribution with respect to the x variables is just a Gaussian, N(μ_x, Σ_xx).
The generalization of this to infinitely many variables is the Gaussian process (GP):
- An infinite collection of random variables.
- Any finite set from the collection follows a Gaussian distribution.
GPs are specified by a mean function m and a covariance function k: if
m(x) = E[f(x)],  k(x, x') = E[(f(x) - m(x))(f(x') - m(x'))],
then we say that f(x) ∼ GP(m(x), k(x, x')).

64 Regression Models as Gaussian Processes
For example, the predictions made by linear regression with a Gaussian prior,
f(x) = φ(x)^T w,  with w ∼ N(0, Σ),
are a Gaussian process with mean function
E[f(x)] = E[φ(x)^T w] = φ(x)^T E[w] = 0,
and covariance function
E[f(x) f(x')] = φ(x)^T E[w w^T] φ(x') = φ(x)^T Σ φ(x').


68 Gaussian Process Model Selection
We can view a Gaussian process as a prior distribution over smooth functions. The most common choice of covariance is the RBF.
Is this the same as using kernels? Yes: this is Bayesian linear regression plus the kernel trick.
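A minimal sketch of this prior over smooth functions, with an RBF covariance (the length scale of 0.5 is an assumed hyperparameter): any finite grid of inputs gives a joint Gaussian that we can sample directly:

```python
import numpy as np

rng = np.random.default_rng(5)

# RBF (squared-exponential) covariance between two sets of 1D inputs.
def rbf(X1, X2, length=0.5):
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2 * length ** 2))

x = np.linspace(-3, 3, 100)
K = rbf(x, x) + 1e-6 * np.eye(len(x))   # small jitter for numerical stability

# Any finite set of points is jointly Gaussian: sample f ~ N(0, K)
# via a Cholesky factor, f = L z with z ~ N(0, I).
L = np.linalg.cholesky(K)
f = L @ rng.normal(size=len(x))          # one smooth random function on the grid

print(f.shape)
```

Shrinking the length scale makes the sampled functions wigglier; this is exactly the kind of hyperparameter the marginal likelihood can be used to select, as discussed next.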

70 Gaussian Process Model Selection
So why do we care?
- We get an estimate of the uncertainty in the prediction.
- We can use the marginal likelihood to learn the kernel/covariance.
Non-hierarchical approach: write the kernel in terms of parameters, and optimize the parameters to learn the kernel.
Hierarchical approach: put a hyper-prior over types of kernels. This can be viewed as an "automatic statistician":

71 Dirichlet Process
Recall the finite mixture model:
p(x | θ) = Σ_{c=1}^k π_c p(x | θ_c).
Non-parametric Bayesian methods allow us to consider the infinite mixture model,
p(x | θ) = Σ_{c=1}^∞ π_c p(x | θ_c).
A common choice of prior on the π values is the Dirichlet process:
- Also called the Chinese restaurant process and the stick-breaking process.
- For finite datasets, only a finite number of clusters have π_c ≠ 0.
- We don't need to pick the number of clusters; it grows with the data size.
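The stick-breaking view of the π values can be sketched as follows (truncated at an assumed 50 components, with α as the concentration parameter): each weight is a Beta-distributed fraction of the stick that remains after the earlier breaks:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stick-breaking construction of Dirichlet process weights, truncated at
# an assumed 50 components: pi_c = beta_c * prod_{j < c} (1 - beta_j),
# with beta_c ~ Beta(1, alpha).
def stick_breaking(alpha, n_components=50):
    betas = rng.beta(1.0, alpha, size=n_components)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - betas[:-1])])
    return betas * remaining

pi = stick_breaking(alpha=2.0)
print(pi[:5], pi.sum())  # weights decay; the sum approaches 1 as the truncation grows
```

Smaller α concentrates the mass on fewer clusters, matching the intuition that only a finite number of clusters have appreciable weight for a finite dataset.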

72 Dirichlet Process Gibbs sampling in Dirichlet process mixture model in action:

73 Dirichlet Process
We could alternatively put a prior on k: reversible-jump MCMC can be used to sample from models of different sizes. There are a variety of interesting extensions:
- Beta processes.
- Hierarchical Dirichlet processes.
- Polya trees.
- Infinite hidden Markov models.

77 Summary
Markov chain Monte Carlo generates a sequence of dependent samples, but asymptotically these samples come from the posterior.
Gibbs sampling is the special case of repeatedly sampling one variable at a time. The basic version can work poorly, but there are effective extensions like block and collapsed Gibbs.
Metropolis-Hastings is a generalization allowing arbitrary proposals.
Non-parametric Bayesian methods use flexible infinite-dimensional priors, allowing model complexity to grow with the data size.
Next time: the most-cited ML paper of the 00s, and variational inference.


CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

### Linear Models for Classification

Linear Models for Classification Oliver Schulte - CMPT 726 Bishop PRML Ch. 4 Classification: Hand-written Digit Recognition CHINE INTELLIGENCE, VOL. 24, NO. 24, APRIL 2002 x i = t i = (0, 0, 0, 1, 0, 0,

### Bayesian Learning. HT2015: SC4 Statistical Data Mining and Machine Learning. Maximum Likelihood Principle. The Bayesian Learning Framework

HT5: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Maximum Likelihood Principle A generative model for

### MCMC notes by Mark Holder

MCMC notes by Mark Holder Bayesian inference Ultimately, we want to make probability statements about true values of parameters, given our data. For example P(α 0 < α 1 X). According to Bayes theorem:

### p L yi z n m x N n xi

y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

### Paul Karapanagiotidis ECO4060

Paul Karapanagiotidis ECO4060 The way forward 1) Motivate why Markov-Chain Monte Carlo (MCMC) is useful for econometric modeling 2) Introduce Markov-Chain Monte Carlo (MCMC) - Metropolis-Hastings (MH)

### UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013

UNIVERSITY of PENNSYLVANIA CIS 520: Machine Learning Final, Fall 2013 Exam policy: This exam allows two one-page, two-sided cheat sheets; No other materials. Time: 2 hours. Be sure to write your name and

### STA 294: Stochastic Processes & Bayesian Nonparametrics

MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

### CSC 446 Notes: Lecture 13

CSC 446 Notes: Lecture 3 The Problem We have already studied how to calculate the probability of a variable or variables using the message passing method. However, there are some times when the structure

### Machine Learning Techniques for Computer Vision

Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM

### Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

### Lecture 2: From Linear Regression to Kalman Filter and Beyond

Lecture 2: From Linear Regression to Kalman Filter and Beyond Department of Biomedical Engineering and Computational Science Aalto University January 26, 2012 Contents 1 Batch and Recursive Estimation

### Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Introduction to Markov chain Monte Carlo The Gibbs Sampler Examples Overview of the Lecture

### Outline. Binomial, Multinomial, Normal, Beta, Dirichlet. Posterior mean, MAP, credible interval, posterior distribution

Outline A short review on Bayesian analysis. Binomial, Multinomial, Normal, Beta, Dirichlet Posterior mean, MAP, credible interval, posterior distribution Gibbs sampling Revisit the Gaussian mixture model

### Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

### SAMSI Astrostatistics Tutorial. More Markov chain Monte Carlo & Demo of Mathematica software

SAMSI Astrostatistics Tutorial More Markov chain Monte Carlo & Demo of Mathematica software Phil Gregory University of British Columbia 26 Bayesian Logical Data Analysis for the Physical Sciences Contents:

### Unsupervised Learning

Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College

### Tutorial on Probabilistic Programming with PyMC3

185.A83 Machine Learning for Health Informatics 2017S, VU, 2.0 h, 3.0 ECTS Tutorial 02-04.04.2017 Tutorial on Probabilistic Programming with PyMC3 florian.endel@tuwien.ac.at http://hci-kdd.org/machine-learning-for-health-informatics-course

### Bayesian Prediction of Code Output. ASA Albuquerque Chapter Short Course October 2014

Bayesian Prediction of Code Output ASA Albuquerque Chapter Short Course October 2014 Abstract This presentation summarizes Bayesian prediction methodology for the Gaussian process (GP) surrogate representation

### Linear Dynamical Systems

Linear Dynamical Systems Sargur N. srihari@cedar.buffalo.edu Machine Learning Course: http://www.cedar.buffalo.edu/~srihari/cse574/index.html Two Models Described by Same Graph Latent variables Observations

### Nonparametric Bayesian Methods - Lecture I

Nonparametric Bayesian Methods - Lecture I Harry van Zanten Korteweg-de Vries Institute for Mathematics CRiSM Masterclass, April 4-6, 2016 Overview of the lectures I Intro to nonparametric Bayesian statistics

### Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

### Machine Learning. Lecture 4: Regularization and Bayesian Statistics. Feng Li. https://funglee.github.io

Machine Learning Lecture 4: Regularization and Bayesian Statistics Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 207 Overfitting Problem

### Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Bayesian Learning. Tobias Scheffer, Niels Landwehr

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning Tobias Scheffer, Niels Landwehr Remember: Normal Distribution Distribution over x. Density function with parameters

### Metropolis-Hastings Algorithm

Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

### Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 23, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,

### RANDOM TOPICS. stochastic gradient descent & Monte Carlo

RANDOM TOPICS stochastic gradient descent & Monte Carlo MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM

Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

### Lecture 14. Clustering, K-means, and EM

Lecture 14. Clustering, K-means, and EM Prof. Alan Yuille Spring 2014 Outline 1. Clustering 2. K-means 3. EM 1 Clustering Task: Given a set of unlabeled data D = {x 1,..., x n }, we do the following: 1.

### Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes

Statistical Techniques in Robotics (16-831, F12) Lecture#21 (Monday November 12) Gaussian Processes Lecturer: Drew Bagnell Scribe: Venkatraman Narayanan 1, M. Koval and P. Parashar 1 Applications of Gaussian

### Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ Bayesian paradigm Consistent use of probability theory

### Outline Lecture 2 2(32)

Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

### Convergence Rate of Markov Chains

Convergence Rate of Markov Chains Will Perkins April 16, 2013 Convergence Last class we saw that if X n is an irreducible, aperiodic, positive recurrent Markov chain, then there exists a stationary distribution

### MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

### σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

### Approximate Slice Sampling for Bayesian Posterior Inference

Approximate Slice Sampling for Bayesian Posterior Inference Anonymous Author 1 Anonymous Author 2 Anonymous Author 3 Unknown Institution 1 Unknown Institution 2 Unknown Institution 3 Abstract In this paper,

### Approximate Inference Part 1 of 2

Approximate Inference Part 1 of 2 Tom Minka Microsoft Research, Cambridge, UK Machine Learning Summer School 2009 http://mlg.eng.cam.ac.uk/mlss09/ 1 Bayesian paradigm Consistent use of probability theory

### Pattern Recognition and Machine Learning. Bishop Chapter 6: Kernel Methods

Pattern Recognition and Machine Learning Chapter 6: Kernel Methods Vasil Khalidov Alex Kläser December 13, 2007 Training Data: Keep or Discard? Parametric methods (linear/nonlinear) so far: learn parameter

### Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

### Introduction to Gaussian Process

Introduction to Gaussian Process CS 778 Chris Tensmeyer CS 478 INTRODUCTION 1 What Topic? Machine Learning Regression Bayesian ML Bayesian Regression Bayesian Non-parametric Gaussian Process (GP) GP Regression

### Sampling Rejection Sampling Importance Sampling Markov Chain Monte Carlo. Sampling Methods. Machine Learning. Torsten Möller.

Sampling Methods Machine Learning orsten Möller Möller/Mori 1 Recall Inference or General Graphs Junction tree algorithm is an exact inference method for arbitrary graphs A particular tree structure defined

### Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods

Introduction to Stochastic Gradient Markov Chain Monte Carlo Methods Changyou Chen Department of Electrical and Computer Engineering, Duke University cc448@duke.edu Duke-Tsinghua Machine Learning Summer

### Statistical learning. Chapter 20, Sections 1 4 1

Statistical learning Chapter 20, Sections 1 4 Chapter 20, Sections 1 4 1 Outline Bayesian learning Maximum a posteriori and maximum likelihood learning Bayes net learning ML parameter learning with complete

### Lecture 6: Bayesian Inference in SDE Models

Lecture 6: Bayesian Inference in SDE Models Bayesian Filtering and Smoothing Point of View Simo Särkkä Aalto University Simo Särkkä (Aalto) Lecture 6: Bayesian Inference in SDEs 1 / 45 Contents 1 SDEs

### CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

### ELEC633: Graphical Models

ELEC633: Graphical Models Tahira isa Saleem Scribe from 7 October 2008 References: Casella and George Exploring the Gibbs sampler (1992) Chib and Greenberg Understanding the Metropolis-Hastings algorithm

### Bayesian networks: approximate inference

Bayesian networks: approximate inference Machine Intelligence Thomas D. Nielsen September 2008 Approximative inference September 2008 1 / 25 Motivation Because of the (worst-case) intractability of exact

### Approximate Bayesian Computation: a simulation based approach to inference

Approximate Bayesian Computation: a simulation based approach to inference Richard Wilkinson Simon Tavaré 2 Department of Probability and Statistics University of Sheffield 2 Department of Applied Mathematics

### Lecture 9: PGM Learning

13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and

### Bayesian Networks BY: MOHAMAD ALSABBAGH

Bayesian Networks BY: MOHAMAD ALSABBAGH Outlines Introduction Bayes Rule Bayesian Networks (BN) Representation Size of a Bayesian Network Inference via BN BN Learning Dynamic BN Introduction Conditional

### Markov Chain Monte Carlo Lecture 6

Sequential parallel tempering With the development of science and technology, we more and more need to deal with high dimensional systems. For example, we need to align a group of protein or DNA sequences

### Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning

Stochastic Backpropagation, Variational Inference, and Semi-Supervised Learning Diederik (Durk) Kingma Danilo J. Rezende (*) Max Welling Shakir Mohamed (**) Stochastic Gradient Variational Inference Bayesian

### Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine

Bayesian Inference: Principles and Practice 3. Sparse Bayesian Models and the Relevance Vector Machine Mike Tipping Gaussian prior Marginal prior: single α Independent α Cambridge, UK Lecture 3: Overview

### Machine Learning, Midterm Exam

10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12 th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have

### Bayesian model selection in graphs by using BDgraph package

Bayesian model selection in graphs by using BDgraph package A. Mohammadi and E. Wit March 26, 2013 MOTIVATION Flow cytometry data with 11 proteins from Sachs et al. (2005) RESULT FOR CELL SIGNALING DATA

### 19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

### Based on slides by Richard Zemel

CSC 412/2506 Winter 2018 Probabilistic Learning and Reasoning Lecture 3: Directed Graphical Models and Latent Variables Based on slides by Richard Zemel Learning outcomes What aspects of a model can we

### Machine Learning Final Exam May 5, 2015

Name: Andrew ID: Instructions Anything on paper is OK in arbitrary shape size, and quantity. Electronic devices are not acceptable. This includes ipods, ipads, Android tablets, Blackberries, Nokias, Windows

### Bayesian Nonparametrics

Bayesian Nonparametrics Lorenzo Rosasco 9.520 Class 18 April 11, 2011 About this class Goal To give an overview of some of the basic concepts in Bayesian Nonparametrics. In particular, to discuss Dirichelet

### Markov-Chain Monte Carlo

Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. References Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can

### CPSC 540: Machine Learning

CPSC 540: Machine Learning Kernel Density Estimation, Factor Analysis Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 2: 2 late days to hand it in tonight. Assignment 3: Due Feburary

### Introduction to Bayesian methods in inverse problems

Introduction to Bayesian methods in inverse problems Ville Kolehmainen 1 1 Department of Applied Physics, University of Eastern Finland, Kuopio, Finland March 4 2013 Manchester, UK. Contents Introduction