RANDOM TOPICS. stochastic gradient descent & Monte Carlo

Size: px
Start display at page:

Download "RANDOM TOPICS. stochastic gradient descent & Monte Carlo"

Transcription

1 RANDOM TOPICS stochastic gradient descent & Monte Carlo

2 MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM 2 kwk2 + h(ldw) = 1 2 kwk2 + X i h(l i d i w) minimize low-rank factorization 1 X 2 kd XY 1 k2 = 2 (d ij x i y j ) 2 ij

3 THE BIG IDEA minimize f(x) = 1 n nx i=1 f i (x) True gradient rf = 1 nx rf i (x) n i=1 Idea: choose a subset of the data [1,N] rf g = 1 X rf i (x) i2 usually just one sample

4 INFINITE SAMPLE VS FINITE SAMPLE finite sample minimize f(x) = 1 n nx f i (x) i=1 We can solve finite sample problem to high accuracy. infinite sample but true accuracy is limited by sample size Z minimize f(x) =E s [f s (x)] = s f s (x)p(s)ds

5 STOCHASTIC ORACLE approximate gradient rf g = 1 X rf i (x) i2 big error solution improves necessary properties g = rf(x)+ why? E( ) =0 small error solutions gets worse Error must decrease as we approach solution

6 EXAMPLE minimize 1 2 kax bk2 what's happening? inexact gradient why does this happen?

7 DECREASING STEP SIZE x k+1 = x k k rf k (x) big stepsize k = a b + k why? (almost) equivalent to choose larger sample small stepsize used for strongly convex problems

8 AVERAGING x k+1 = x k k rf k (x) k = a p k + b averaging ergodic averaging x k+1 = 1 k +1 X xi x why is this bad? used for weakly convex problems does this limit convergence rate?

9 ergodic averaging AVERAGING x k+1 = 1 k +1 X xi compute without storage x k+1 = k k +1 xk + 1 k +1 xk+1 x k+1 = short memory version k k + xk + k + xk+1 tradeoff variance for bias 1 averaging x

10 KNOWN RATES: WEAKLY CONVEX Theorem Suppose f is convex, krf(x)k appleg, and that the diameter of dom(f) is less than D. If we use the stepsize k = See Shamir and Zhang, ICML 13 Rakhlin, Shamir, Sridharan, CoRR 11 p c, k then D log(k) E[f( x k ) f? ] apple c + cg p k

11 KNOWN RATES: STRONGLY CONVEX Shamir and Zhang, ICML 13 E[f( x k ) x k+1 = f? ] apple 58(1 + /k) Theorem Suppose f is strongly convex with parameter m, and that krf(x)k appleg. If you use stepsize k =1/mk, and the limited memory averaging with 1, then k k + xk + k + xk+1 ( + 1) + ( +.5)3 (1 + log k) G 2 k mk

12 EXAMPLE: SVM PEGASOS: Primal Estimated sub-gradient SOlver for SVM minimize 1 2 kwk2 + Ch(Aw) rh(x) = ( 1, x < 1 0, otherwise note: this is a subgradient descent method

13 PEGOSOS PEGASOS: Primal Estimated sub-gradient SOlver for SVM minimize X i 2 kwk2 + h(a i w) While not converged : k = 1 k If a T k wk 1: w k+1 = w k k w k If a T k w k < 1: w k+1 = w k k ( w k a T i ) ŵ k+1 = 1 k +1 k+1 X i=1 w k not used in practice

14 PEGOSOS stochastic methods finite sample method gradient method

15 PEGOSOS classification error (CV) gradient method stochastic methods You hope to converge before SGD gets slow!

16 ALLREDUCE but if you don t use SGD as warm start for iterative method Agarwal. Effective Terascale Linear Learning. 12

17 GENERALIZATION: FBS NX minimize g(x)+ 1 N f i (x) i=1 ˆx k+1 = x k k rf k (x k ) x k+1 =prox g (ˆx k+1, ) diminishing stepsize+averaging: Duchi and Singer 09 (FOBOS) non-ergodic convergence: Rosasco, Villa, Vu 14 all known results make lots of assumptions, gradients must be bounded

18 MONTO CARLO METHODS Methods that involve randomly sampling a distribution Monte Carlo casino

19 BAYESIAN LEARNING estimate parameters in a model measure these data = M(parameters) estimate these ingredients prior probability distribution of unknown parameters likelihood probability of data given (unknown) parameters Thomas Bayes

20 BAYES RULE P (A B) = P (B A)P (A) P (B) likelihood P (parameters data) = P (data parameters)p (parameters) P (data) prior we really care about this P (parameters data) / P (data parameters)p (parameters) posterior distribution

21 EXAMPLE: LOGISTIC REGRESSION P (l i =1 d i )= exp(d iw) 1 + exp(d i w) P (l, D w) = Y i exp(l i d i w) 1 + exp(l i d i w) P (parameters data) / P (data parameters)p (parameters) prior

22 EXAMPLE: LOGISTIC REGRESSION P (parameters data) / P (data parameters)p (parameters)! exp(l i d i w) P (w l, D) = P (w) 1 + exp(l i d i w) Y i log P (w l, D) = log P (w)+ X i prior log(1 + exp( l i d i w)) Example: log P (w) = w

23 BAYESIANS OFFER NO SOLUTIONS Bayes Lagrange solution space P (w D, l) optimization solution Monte Carlo methods randomly sample this solution space

24 WHY MONTE CARLO? You need to know more than just a max/minimizer error bars optimization solution Z E(w) = wp(w) dw Z Var(w) =E[(w µ)(w µ) T ]= ww T P (w) dw µµ T

25 EXAMPLE: POKER BOTS You can t solve your problem by other (faster) methods maximize log P (w) we don t know derivative poker bots model parameters describe player behavior 2.5M different community cards 10 players = 59K raise/fold/call perms per round

26 WHY MONTE CARLO? You re incredibly lazy maximize log P (w) I could differentiate this, I just don t wanna arg max log P (w) max k {P (wk )}

27 MARKOV CHAINS What is an MC? MC has steady state if Irreducible: can visit any state starting at any state Aperiodic: does not get trapped in deterministic cycles

28 METROPOLIS HASTINGS proposal distribution q(y x) ingredients posterior distribution p(x) Start with x 0 For k =1, 2, 3,... MH Algorithm Choose candidate y from q(y x k ) Compute acceptance probability p(y)q(x k y) =min 1, p(x k )q(y x k ) Set x k+1 = y with probability otherwise x k+1 = x k

29 CONVERGENCE Irreducible: support of q must contain support of p Aperiodic: positive rejection probability guarantees no infinite cycle Theorem Suppose the support of q contains the support of p. Then the MH sampler has a stationary distribution, and the distribution is equal to p.

30 METROPOLIS ALGORITHM proposal distribution q(y x) ingredients posterior distribution p(x) Assume proposal is symmetric q(x k y) =q(y x k ) Metropolis Hastings Metropolis =min 1, p(y)q(x k y) p(x k )q(y x k ) =min 1, p(y) p(x k )

31 EXAMPLE: GMM histogram of Metropolis iterates Andrieu, Freitas, Doucet, Jordan 03

32 PROPERTIES OF MH Pro s We don t need a normalization constant for posterior We can run many chains in parallel (cluster/gpu) We don t need any derivatives P (parameters data) = P (data parameters)p (parameters) P (data) P (parameters data) / P (data parameters)p (parameters)

33 PROPERTIES OF MH Pro s We don t need a normalization constant for posterior We can run many chains in parallel (cluster/gpu) We don t need any derivatives Con s Mixing time depends on proposal distribution too wide = constant rejections = slow mixing too narrow = short movements=slow mixing Samples only meaningful at stationary distribution Burn in samples must be discarded Many samples needed because of correlations

34 SIMULATED ANNEALING maximize p(x) Easy choice: run MCMC, then max k p(x k ) Better choice: sample the distribution p 1 T k (x k ) temperature Why is this better? Where does the name come from?

35 COOLING SCHEDULE hot warm cool cold Andrieu, Freitas, Doucet, Jordan 03

36 CONVERGENCE SA solve non-convex problem, even NP-complete problems, as time goes to infinity. Theorem Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time at every temperature. For an annealing schedule with temperature T k = 1 C log(k + T 0 ) simulated annealing converges to a global optima with probability 1. Granville, "Simulated annealing: A proof of convergence

37 WHEN TO USE SIMULATED ANNEALING flow chart Should I use simulated annealing? no. No practical way to choose temperature schedule Too fast = stuck in local minimum (risky) Too slow = no different from MCMC Act of desperation!

38 GIBBS SAMPLER Want to sample P (x 1,x 2,x 3 ) on stage k, pick some coordinate j q(y x k )= ( p(y j x k j c ), if y j c 0, otherwise = x k j c = p(y)q(xk y) p(x k )q(y x k ) = p(y)p(xk j xkjc) p(x k )p(y j y j c) = p(y j c ) p(x k j c ) =1 P (B) = P (A and B) P (A B)

39 GIBBS SAMPLER Want to sample P (x 1,x 2,x 3 ) iterates x 2 P (x 1 x 1 2,x 1 3,x 1 4,...,x 1 n) x 3 P (x 2 x 2 1,x 2 3,x 2 4,...,x 2 n) x 4 P (x 3 x 3 1,x 3 2,x 3 4,...,x 3 n)

40 APPLICATION: SAMPLING GRAPHICAL MODELS Restricted Boltzmann Machine (RBM) binary random variables 0/1 weights E(v, h) = a T v b T h v T Wh P (v, h) = 1 Z e E(v,h) partition function

41 APPLICATION: SAMPLING GRAPHICAL MODELS Restricted Boltzmann Machine (RBM) E(v, h) = a T v b T h v T Wh P (v, h) = 1 Z e E(v,h) P (v i =1 h) = = P (v i =1 h) P (v i =0 h)+p (v i =1 h) exp( a i + P j w ijh j + C) exp(c)+exp( a i + P j w ijh j + C) remove normalization aggregate constant = exp( a i + P j w ijh j ) 1 + exp( a i + P j w ijh j ) cancel constants = ( a i + X i w ij h j ) sigmoid function

42 BLOCK GIBBS FOR RBM visible stage 1 freeze hidden randomly sample visible hidden P (v i =1 h) = ( a i + X j w ij h j ) stage 2 freeze visible randomly sample hidden P (h j =1 v) = ( b j + X i w ij v i )

43 DEEP BELIEF NETS visible DBN = layered RBM Each layer only depends on layer beneath it (feed forward) Probability for each hidden node is sigmoid function hidden

44 EXAMPLE: MNIST Gan, Henao, Carlson, Carin 15 Train 3-layer DBN with 200 hidden units trained on 60K MNIST digits pre-training: layer-by-layer training training: train all weights simultaneously training done using Gibbs sampler Gibbs sampler used to explore final solution

45 EXAMPLE: MNIST training data Learned features observations sampled from deep belief network using Gibbs sampler

46 WHAT S WRONG WITH GIBBS? susceptible to strong correlations also a problem for MH: bad proposal distribution

47 SLICE SAMPLER Problems with MH: hard to choose proposal distribution does not exploit analytical information about problem long mixing time Slice sampler ( 1, if 0 apple u apple p(x) q(x, u) = 0, otherwise 1 1 p(x) u x x

48 LIFTED INTEGRAL Problems with MH: hard to choose proposal distribution does not exploit analytical information about problem long mixing time Z Slice sampler Z Z Z p(x) = q(x, u) du f(x)p(x) = f(x)q(x, u) du dx sample this 1 1 p(x) u x x

49 SLICE SAMPLER Z Z Z f(x)p(x) = f(x)q(x, u) du dx do Gibbs on this Freeze x : Choose u uniformly from [0,p(x)] Freeze u : Choose x uniformly from {x p(x) u} need analytical formula for this set p(x) u

50 MODEL FITTING ny p(x) = p i (x) i=1 support is a box q(x, u 1,u 2,...,u n )= log p(x) = nx i=1 ( 1, if p i (x) apple u i 8i 0, otherwise log p i (x) Z q(x, u 1,u 2,...,u n ) du = Z p1 (x) u 1 =0 Z p2 (x) u 2 =0 Z pn (x) u n =0 du = ny k=1 p i (x) Z Z Z f(x)p(x) dx = x u f(x)q(x, u) dx du

51 MODEL FITTING Z Z Z f(x)p(x) dx = x u f(x)q(x, u) dx du q(x, u 1,u 2,...,u n )= ( 1, if p i (x) apple u i 8i 0, otherwise Freeze x : For all i, choose u i uniformly from [0,p i (x)] Freeze u : Choose x uniformly from T {x p i (x) u i } i

52 SLICE SAMPLER PROPERTIES pros does not require proposal distribution mixes super fast simple implementation cons analytical formula for super-level sets hard in multiple dimensions fixes / generalizations step-out methods random hyper-rectangles

53 STEP-OUT METHODS pick small interval double width size until slice is contained sample uniformly from slice using rejection sampling - throw away selections outside of slice p(x) p(x) p(x) u u u

54 HIGH DIMS: COORDINATE Z Z Z METHOD f(x)p(x) = f(x)q(x, u) du dx do coordinate Gibbs on this Freeze x : Choose u uniformly from [0,p(x)] Freeze u : For i =1 n Choose x i uniformly from {x i p(x 1,,x i,x n ) u} can use stepping -out method Radford Neal: Slice Sampling 03

55 HIGH DIMS: HYPER-RECTANGLE choose large rectangle randomly/uniformly place rectangle around current location draw random proposal point from rectangle if proposal lies outside slice, shrink box so proposal is on boundary performs much better for poorly-conditioned objectives Radford Neal: Slice Sampling 03

56 COMPARISON Gradient/ Splitting rate e ck when to use you value reliability and precision (moderate speed, high accuracy) SGD 1/k you value speed over accuracy (high speed, moderate accuracy) MCMC 1/ p k you value simplicity (no gradient) or need statistical inference (slow and inaccurate)

57 DO MATLAB EXERCISE MCMC

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

MCMC and Gibbs Sampling. Kayhan Batmanghelich

MCMC and Gibbs Sampling. Kayhan Batmanghelich MCMC and Gibbs Sampling Kayhan Batmanghelich 1 Approaches to inference l Exact inference algorithms l l l The elimination algorithm Message-passing algorithm (sum-product, belief propagation) The junction

More information

The Origin of Deep Learning. Lili Mou Jan, 2015

The Origin of Deep Learning. Lili Mou Jan, 2015 The Origin of Deep Learning Lili Mou Jan, 2015 Acknowledgment Most of the materials come from G. E. Hinton s online course. Outline Introduction Preliminary Boltzmann Machines and RBMs Deep Belief Nets

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning

April 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Monte Carlo Methods. Geoff Gordon February 9, 2006

Monte Carlo Methods. Geoff Gordon February 9, 2006 Monte Carlo Methods Geoff Gordon ggordon@cs.cmu.edu February 9, 2006 Numerical integration problem 5 4 3 f(x,y) 2 1 1 0 0.5 0 X 0.5 1 1 0.8 0.6 0.4 Y 0.2 0 0.2 0.4 0.6 0.8 1 x X f(x)dx Used for: function

More information

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision

The Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC Markov chains Let S = {1, 2,..., N} be a finite set consisting of N states. A Markov chain Y 0, Y 1, Y 2,... is a sequence of random variables, with Y t S for all points in time

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) School of Computer Science 10-708 Probabilistic Graphical Models Markov Chain Monte Carlo (MCMC) Readings: MacKay Ch. 29 Jordan Ch. 21 Matt Gormley Lecture 16 March 14, 2016 1 Homework 2 Housekeeping Due

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics

Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Chapter 11. Stochastic Methods Rooted in Statistical Mechanics Neural Networks and Learning Machines (Haykin) Lecture Notes on Self-learning Neural Algorithms Byoung-Tak Zhang School of Computer Science

More information

Introduction to Restricted Boltzmann Machines

Introduction to Restricted Boltzmann Machines Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,

More information

Monte Carlo Methods. Leon Gu CSD, CMU

Monte Carlo Methods. Leon Gu CSD, CMU Monte Carlo Methods Leon Gu CSD, CMU Approximate Inference EM: y-observed variables; x-hidden variables; θ-parameters; E-step: q(x) = p(x y, θ t 1 ) M-step: θ t = arg max E q(x) [log p(y, x θ)] θ Monte

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Warm up: risk prediction with logistic regression

Warm up: risk prediction with logistic regression Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

Density estimation. Computing, and avoiding, partition functions. Iain Murray

Density estimation. Computing, and avoiding, partition functions. Iain Murray Density estimation Computing, and avoiding, partition functions Roadmap: Motivation: density estimation Understanding annealing/tempering NADE Iain Murray School of Informatics, University of Edinburgh

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari

Deep Learning Srihari. Deep Belief Nets. Sargur N. Srihari Deep Belief Nets Sargur N. Srihari srihari@cedar.buffalo.edu Topics 1. Boltzmann machines 2. Restricted Boltzmann machines 3. Deep Belief Networks 4. Deep Boltzmann machines 5. Boltzmann machines for continuous

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth Lecture 25: Markov Chain Monte Carlo (MCMC) Course Review and Advanced Topics Many figures courtesy Kevin

More information

I. Bayesian econometrics

I. Bayesian econometrics I. Bayesian econometrics A. Introduction B. Bayesian inference in the univariate regression model C. Statistical decision theory D. Large sample results E. Diffuse priors F. Numerical Bayesian methods

More information

Why should you care about the solution strategies?

Why should you care about the solution strategies? Optimization Why should you care about the solution strategies? Understanding the optimization approaches behind the algorithms makes you more effectively choose which algorithm to run Understanding the

More information

Markov chain Monte Carlo Lecture 9

Markov chain Monte Carlo Lecture 9 Markov chain Monte Carlo Lecture 9 David Sontag New York University Slides adapted from Eric Xing and Qirong Ho (CMU) Limitations of Monte Carlo Direct (unconditional) sampling Hard to get rare events

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

Lecture notes on Regression: Markov Chain Monte Carlo (MCMC)

Lecture notes on Regression: Markov Chain Monte Carlo (MCMC) Lecture notes on Regression: Markov Chain Monte Carlo (MCMC) Dr. Veselina Kalinova, Max Planck Institute for Radioastronomy, Bonn Machine Learning course: the elegant way to extract information from data,

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version)

A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) A quick introduction to Markov chains and Markov chain Monte Carlo (revised version) Rasmus Waagepetersen Institute of Mathematical Sciences Aalborg University 1 Introduction These notes are intended to

More information

Deep unsupervised learning

Deep unsupervised learning Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.

More information

Simulated Annealing for Constrained Global Optimization

Simulated Annealing for Constrained Global Optimization Monte Carlo Methods for Computation and Optimization Final Presentation Simulated Annealing for Constrained Global Optimization H. Edwin Romeijn & Robert L.Smith (1994) Presented by Ariel Schwartz Objective

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 10-708 Probabilistic Graphical Models Homework 3 (v1.1.0) Due Apr 14, 7:00 PM Rules: 1. Homework is due on the due date at 7:00 PM. The homework should be submitted via Gradescope. Solution to each problem

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) Markov Chain Monte Carlo (MCMC Dependent Sampling Suppose we wish to sample from a density π, and we can evaluate π as a function but have no means to directly generate a sample. Rejection sampling can

More information

Learning the hyper-parameters. Luca Martino

Learning the hyper-parameters. Luca Martino Learning the hyper-parameters Luca Martino 2017 2017 1 / 28 Parameters and hyper-parameters 1. All the described methods depend on some choice of hyper-parameters... 2. For instance, do you recall λ (bandwidth

More information

Convex Optimization CMU-10725

Convex Optimization CMU-10725 Convex Optimization CMU-10725 Simulated Annealing Barnabás Póczos & Ryan Tibshirani Andrey Markov Markov Chains 2 Markov Chains Markov chain: Homogen Markov chain: 3 Markov Chains Assume that the state

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Stochastic optimization Markov Chain Monte Carlo

Stochastic optimization Markov Chain Monte Carlo Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya Weizmann Institute of Science 1 Motivation Markov chains Stationary distribution Mixing time 2 Algorithms Metropolis-Hastings Simulated Annealing

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos Contents Markov Chain Monte Carlo Methods Sampling Rejection Importance Hastings-Metropolis Gibbs Markov Chains

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling

CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling CS242: Probabilistic Graphical Models Lecture 7B: Markov Chain Monte Carlo & Gibbs Sampling Professor Erik Sudderth Brown University Computer Science October 27, 2016 Some figures and materials courtesy

More information

Results: MCMC Dancers, q=10, n=500

Results: MCMC Dancers, q=10, n=500 Motivation Sampling Methods for Bayesian Inference How to track many INTERACTING targets? A Tutorial Frank Dellaert Results: MCMC Dancers, q=10, n=500 1 Probabilistic Topological Maps Results Real-Time

More information

Markov Chains and MCMC

Markov Chains and MCMC Markov Chains and MCMC CompSci 590.02 Instructor: AshwinMachanavajjhala Lecture 4 : 590.02 Spring 13 1 Recap: Monte Carlo Method If U is a universe of items, and G is a subset satisfying some property,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

MCMC Methods: Gibbs and Metropolis

MCMC Methods: Gibbs and Metropolis MCMC Methods: Gibbs and Metropolis Patrick Breheny February 28 Patrick Breheny BST 701: Bayesian Modeling in Biostatistics 1/30 Introduction As we have seen, the ability to sample from the posterior distribution

More information

Lecture 6: Markov Chain Monte Carlo

Lecture 6: Markov Chain Monte Carlo Lecture 6: Markov Chain Monte Carlo D. Jason Koskinen koskinen@nbi.ku.dk Photo by Howard Jackman University of Copenhagen Advanced Methods in Applied Statistics Feb - Apr 2016 Niels Bohr Institute 2 Outline

More information

Quantifying Uncertainty

Quantifying Uncertainty Sai Ravela M. I. T Last Updated: Spring 2013 1 Markov Chain Monte Carlo Monte Carlo sampling made for large scale problems via Markov Chains Monte Carlo Sampling Rejection Sampling Importance Sampling

More information

CSC 2541: Bayesian Methods for Machine Learning

CSC 2541: Bayesian Methods for Machine Learning CSC 2541: Bayesian Methods for Machine Learning Radford M. Neal, University of Toronto, 2011 Lecture 3 More Markov Chain Monte Carlo Methods The Metropolis algorithm isn t the only way to do MCMC. We ll

More information

On Markov chain Monte Carlo methods for tall data

On Markov chain Monte Carlo methods for tall data On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational

More information

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

Restricted Boltzmann Machines

Restricted Boltzmann Machines Restricted Boltzmann Machines Boltzmann Machine(BM) A Boltzmann machine extends a stochastic Hopfield network to include hidden units. It has binary (0 or 1) visible vector unit x and hidden (latent) vector

More information

UNSUPERVISED LEARNING

UNSUPERVISED LEARNING UNSUPERVISED LEARNING Topics Layer-wise (unsupervised) pre-training Restricted Boltzmann Machines Auto-encoders LAYER-WISE (UNSUPERVISED) PRE-TRAINING Breakthrough in 2006 Layer-wise (unsupervised) pre-training

More information

Speaker Representation and Verification Part II. by Vasileios Vasilakakis

Speaker Representation and Verification Part II. by Vasileios Vasilakakis Speaker Representation and Verification Part II by Vasileios Vasilakakis Outline -Approaches of Neural Networks in Speaker/Speech Recognition -Feed-Forward Neural Networks -Training with Back-propagation

More information

CS 188: Artificial Intelligence. Bayes Nets

CS 188: Artificial Intelligence. Bayes Nets CS 188: Artificial Intelligence Probabilistic Inference: Enumeration, Variable Elimination, Sampling Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew

More information

for Global Optimization with a Square-Root Cooling Schedule Faming Liang Simulated Stochastic Approximation Annealing for Global Optim

for Global Optimization with a Square-Root Cooling Schedule Faming Liang Simulated Stochastic Approximation Annealing for Global Optim Simulated Stochastic Approximation Annealing for Global Optimization with a Square-Root Cooling Schedule Abstract Simulated annealing has been widely used in the solution of optimization problems. As known

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines

Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Mixing Rates for the Gibbs Sampler over Restricted Boltzmann Machines Christopher Tosh Department of Computer Science and Engineering University of California, San Diego ctosh@cs.ucsd.edu Abstract The

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Simulation - Lectures - Part III Markov chain Monte Carlo

Simulation - Lectures - Part III Markov chain Monte Carlo Simulation - Lectures - Part III Markov chain Monte Carlo Julien Berestycki Part A Simulation and Statistical Programming Hilary Term 2018 Part A Simulation. HT 2018. J. Berestycki. 1 / 50 Outline Markov

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017

Algorithms other than SGD. CS6787 Lecture 10 Fall 2017 Algorithms other than SGD CS6787 Lecture 10 Fall 2017 Machine learning is not just SGD Once a model is trained, we need to use it to classify new examples This inference task is not computed with SGD There

More information

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods Markov Chain Monte Carlo Methods p. /36 Markov Chain Monte Carlo Methods Michel Bierlaire michel.bierlaire@epfl.ch Transport and Mobility Laboratory Markov Chain Monte Carlo Methods p. 2/36 Markov Chains

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Markov chain Monte Carlo methods in atmospheric remote sensing

Markov chain Monte Carlo methods in atmospheric remote sensing 1 / 45 Markov chain Monte Carlo methods in atmospheric remote sensing Johanna Tamminen johanna.tamminen@fmi.fi ESA Summer School on Earth System Monitoring and Modeling July 3 Aug 11, 212, Frascati July,

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait

A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling. Christopher Jennison. Adriana Ibrahim. Seminar at University of Kuwait A Search and Jump Algorithm for Markov Chain Monte Carlo Sampling Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj Adriana Ibrahim Institute

More information

Hamiltonian Monte Carlo for Scalable Deep Learning

Hamiltonian Monte Carlo for Scalable Deep Learning Hamiltonian Monte Carlo for Scalable Deep Learning Isaac Robson Department of Statistics and Operations Research, University of North Carolina at Chapel Hill isrobson@email.unc.edu BIOS 740 May 4, 2018

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Peter Beerli October 10, 2005 [this chapter is highly influenced by chapter 1 in Markov chain Monte Carlo in Practice, eds Gilks W. R. et al. Chapman and Hall/CRC, 1996] 1 Short

More information

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods

Pattern Recognition and Machine Learning. Bishop Chapter 11: Sampling Methods Pattern Recognition and Machine Learning Chapter 11: Sampling Methods Elise Arnaud Jakob Verbeek May 22, 2008 Outline of the chapter 11.1 Basic Sampling Algorithms 11.2 Markov Chain Monte Carlo 11.3 Gibbs

More information

Learning Energy-Based Models of High-Dimensional Data

Learning Energy-Based Models of High-Dimensional Data Learning Energy-Based Models of High-Dimensional Data Geoffrey Hinton Max Welling Yee-Whye Teh Simon Osindero www.cs.toronto.edu/~hinton/energybasedmodelsweb.htm Discovering causal structure as a goal

More information

Metropolis-Hastings Algorithm

Metropolis-Hastings Algorithm Strength of the Gibbs sampler Metropolis-Hastings Algorithm Easy algorithm to think about. Exploits the factorization properties of the joint probability distribution. No difficult choices to be made to

More information

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets

Neural Networks for Machine Learning. Lecture 11a Hopfield Nets Neural Networks for Machine Learning Lecture 11a Hopfield Nets Geoffrey Hinton Nitish Srivastava, Kevin Swersky Tijmen Tieleman Abdel-rahman Mohamed Hopfield Nets A Hopfield net is composed of binary threshold

More information

Markov chain Monte Carlo

Markov chain Monte Carlo Markov chain Monte Carlo Probabilistic Models of Cognition, 2011 http://www.ipam.ucla.edu/programs/gss2011/ Roadmap: Motivation Monte Carlo basics What is MCMC? Metropolis Hastings and Gibbs...more tomorrow.

More information

16 : Markov Chain Monte Carlo (MCMC)

16 : Markov Chain Monte Carlo (MCMC) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 16 : Markov Chain Monte Carlo MCMC Lecturer: Matthew Gormley Scribes: Yining Wang, Renato Negrinho 1 Sampling from low-dimensional distributions

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Logistic Regression. Stochastic Gradient Descent

Logistic Regression. Stochastic Gradient Descent Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}

More information

MSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at

MSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at MSc MT15. Further Statistical Methods: MCMC Lecture 5-6: Markov chains; Metropolis Hastings MCMC Notes and Practicals available at www.stats.ox.ac.uk\ nicholls\mscmcmc15 Markov chain Monte Carlo Methods

More information

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN

Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University

More information

16 : Approximate Inference: Markov Chain Monte Carlo

16 : Approximate Inference: Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Knowledge Extraction from DBNs for Images

Knowledge Extraction from DBNs for Images Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Introduction to MCMC. DB Breakfast 09/30/2011 Guozhang Wang

Introduction to MCMC. DB Breakfast 09/30/2011 Guozhang Wang Introduction to MCMC DB Breakfast 09/30/2011 Guozhang Wang Motivation: Statistical Inference Joint Distribution Sleeps Well Playground Sunny Bike Ride Pleasant dinner Productive day Posterior Estimation

More information

Answers and expectations

Answers and expectations Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information

Approximate Inference using MCMC

Approximate Inference using MCMC Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov

More information

Probabilistic Machine Learning

Probabilistic Machine Learning Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent

More information

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc.

Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

Bayesian Phylogenetics:

Bayesian Phylogenetics: Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes

More information

Announcements. Proposals graded

Announcements. Proposals graded Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:

More information