RANDOM TOPICS: stochastic gradient descent & Monte Carlo
1 RANDOM TOPICS stochastic gradient descent & Monte Carlo
2 MASSIVE MODEL FITTING
minimize f(x) = (1/n) Σ_{i=1}^n f_i(x), where n is big (over 100K)
least squares: minimize (1/2)‖Ax − b‖² = Σ_i (1/2)(a_i x − b_i)²
SVM: minimize (1/2)‖w‖² + h(LDw) = (1/2)‖w‖² + Σ_i h(l_i d_i w)
low-rank factorization: minimize (1/2)‖D − XY‖² = (1/2) Σ_{ij} (d_ij − x_i y_j)²
3 THE BIG IDEA
minimize f(x) = (1/n) Σ_{i=1}^n f_i(x)
True gradient: ∇f = (1/n) Σ_{i=1}^n ∇f_i(x)
Idea: choose a random subset Γ ⊂ [1, n] of the data and use the approximate gradient
∇f ≈ g = (1/|Γ|) Σ_{i∈Γ} ∇f_i(x)
usually Γ is just one sample
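The subsampled gradient above fits in a few lines of Python. This is an illustrative toy, not code from the slides; the data, the function name, and the stepsize constant are my own choices:

```python
import random

# Minimal SGD sketch: minimize f(x) = (1/n) * sum_i 0.5*(a_i*x - b_i)^2
# using the gradient of ONE randomly chosen term per step, with the
# decreasing stepsize tau_k = 1/(20 + k) (constants chosen for this toy).
def sgd_least_squares(a, b, steps=5000, seed=0):
    rng = random.Random(seed)
    x = 0.0
    for k in range(1, steps + 1):
        i = rng.randrange(len(a))          # pick one random sample i
        g = (a[i] * x - b[i]) * a[i]       # grad of f_i(x) = 0.5*(a_i*x - b_i)^2
        x -= g / (20.0 + k)                # tau_k = 1/(20 + k)
    return x

data_a = [1.0, 2.0, 3.0, 4.0]
data_b = [2.0 * ai for ai in data_a]       # noiseless data: minimizer is x* = 2
print(sgd_least_squares(data_a, data_b))   # converges toward 2
```

Because this toy data is noiseless, the stochastic gradients vanish at the solution and the iterates converge accurately; with noisy data the iterates stall at a noise floor, which is exactly the issue the next few slides address.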
4 INFINITE SAMPLE VS FINITE SAMPLE
finite sample: minimize f(x) = (1/n) Σ_{i=1}^n f_i(x)
We can solve the finite-sample problem to high accuracy, but true accuracy is limited by the sample size.
infinite sample: minimize f(x) = E_s[f_s(x)] = ∫ f_s(x) p(s) ds
5 STOCHASTIC ORACLE
approximate gradient: g = (1/|Γ|) Σ_{i∈Γ} ∇f_i(x), i.e. g = ∇f(x) + ε with E(ε) = 0
Far from the solution the error is big, yet the solution improves; near the solution the error is small, yet the solution gets worse. Why?
The error must decrease as we approach the solution.
6 EXAMPLE
minimize (1/2)‖Ax − b‖² with an inexact gradient
what's happening? why does this happen?
7 DECREASING STEP SIZE
x^{k+1} = x^k − τ_k ∇f_k(x), with stepsize τ_k = a/(b + k)
(big stepsize early, small stepsize later)
why? shrinking the stepsize is (almost) equivalent to choosing a larger sample
used for strongly convex problems
8 AVERAGING
x^{k+1} = x^k − τ_k ∇f_k(x), with τ_k = a/(√k + b)
ergodic averaging: x̄^{k+1} = (1/(k+1)) Σ_i x^i
why is this bad? does this limit the convergence rate?
used for weakly convex problems
9 AVERAGING
ergodic averaging: x̄^{k+1} = (1/(k+1)) Σ_i x^i
compute without storage: x̄^{k+1} = (k/(k+1)) x̄^k + (1/(k+1)) x^{k+1}
short-memory version: x̄^{k+1} = (k/(k+η)) x̄^k + (η/(k+η)) x^{k+1}
trades variance for bias
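The storage-free recursion is easy to check numerically. A sketch of my own, where η = 1 recovers the plain ergodic average and η > 1 gives the short-memory version that down-weights old iterates:

```python
# Running average without storing past iterates. With eta = 1 this equals
# the plain mean of xs; larger eta weights recent iterates more heavily
# (trading variance for bias, as on the slide).
def running_average(xs, eta=1.0):
    avg = 0.0
    for k, x in enumerate(xs):        # k = number of iterates already averaged
        w = eta / (k + eta)           # weight on the newest iterate
        avg = (1.0 - w) * avg + w * x
    return avg

print(running_average([1.0, 2.0, 3.0, 4.0]))  # close to 2.5, the plain mean
```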
10 KNOWN RATES: WEAKLY CONVEX
See Shamir and Zhang, ICML '13; Rakhlin, Shamir, Sridharan, CoRR '11
Theorem. Suppose f is convex, ‖∇f(x)‖ ≤ G, and the diameter of dom(f) is less than D. If we use the stepsize τ_k = c/√k, then
E[f(x̄^k) − f⋆] ≤ (D²/c + cG²) log(k)/√k
11 KNOWN RATES: STRONGLY CONVEX
See Shamir and Zhang, ICML '13
Theorem. Suppose f is strongly convex with parameter m, and that ‖∇f(x)‖ ≤ G. If you use the stepsize τ_k = 1/(mk) and the limited-memory averaging
x̄^{k+1} = (k/(k+η)) x̄^k + (η/(k+η)) x^{k+1}
with η ≥ 1, then
E[f(x̄^k) − f⋆] ≤ ((η+1) + (η+.5)³ (1 + log k)) G²/(mk)
12 EXAMPLE: SVM
PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM
minimize (1/2)‖w‖² + C h(Aw)
subgradient of the hinge: ∇h(x) = −1 if x < 1, and 0 otherwise
note: this is a subgradient descent method
13 PEGASOS
PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM
minimize (λ/2)‖w‖² + Σ_i h(a_i w)
While not converged:
  τ_k = 1/(λk)
  if a_k^T w^k ≥ 1:  w^{k+1} = w^k − τ_k λ w^k
  if a_k^T w^k < 1:  w^{k+1} = w^k − τ_k (λ w^k − a_k)
  ŵ^{k+1} = (1/(k+1)) Σ_{i=1}^{k+1} w^i    (averaging; not used in practice)
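A hedged Python sketch of the Pegasos loop above, with the labels assumed folded into the rows a_i as on the slide; the regularization weight lam and the toy data are my own hypothetical choices:

```python
import random

# Pegasos-style loop: shrink w every step, and additionally push along a_k
# whenever the hinge is active (margin < 1).
def pegasos(A, lam=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(A[0])
    for k in range(1, steps + 1):
        tau = 1.0 / (lam * k)                        # stepsize tau_k = 1/(lam*k)
        a = A[rng.randrange(len(A))]                 # pick one random row
        margin = sum(ai * wi for ai, wi in zip(a, w))
        if margin >= 1:                              # hinge inactive: shrink only
            w = [wi - tau * lam * wi for wi in w]
        else:                                        # hinge active: shrink and push
            w = [wi - tau * (lam * wi - ai) for ai, wi in zip(a, w)]
    return w

A = [[1.0, 1.0], [1.0, -0.5], [0.9, 0.8]]            # toy rows, labels folded in
w = pegasos(A)
```

On this separable toy set the learned w ends up with a positive margin on every row; as the slide notes, the trailing average ŵ is usually skipped in practice.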
14 PEGASOS
[Figure: objective vs. iterations for stochastic methods, the finite-sample method, and the gradient method]
15 PEGASOS
[Figure: classification error (CV) for the gradient method vs. stochastic methods]
You hope to converge before SGD gets slow!
16 ALLREDUCE
If SGD alone isn't enough, use it as a warm start for an iterative (batch) method.
Agarwal et al., "Effective Terascale Linear Learning," '12
17 GENERALIZATION: FBS
minimize g(x) + (1/N) Σ_{i=1}^N f_i(x)
x̂^{k+1} = x^k − τ_k ∇f_k(x^k)
x^{k+1} = prox_g(x̂^{k+1}, τ_k)
diminishing stepsize + averaging: Duchi and Singer '09 (FOBOS)
non-ergodic convergence: Rosasco, Villa, Vũ '14
all known results make lots of assumptions; gradients must be bounded
18 MONTE CARLO METHODS
Methods that involve randomly sampling a distribution.
Named after the Monte Carlo casino.
19 BAYESIAN LEARNING
estimate parameters in a model: data = M(parameters)
measure the data, estimate the parameters
ingredients:
prior: probability distribution of the unknown parameters
likelihood: probability of the data given the (unknown) parameters
[Portrait: Thomas Bayes]
20 BAYES RULE
P(A|B) = P(B|A) P(A) / P(B)
P(parameters|data) = P(data|parameters) P(parameters) / P(data)    (likelihood × prior)
we really care about the posterior distribution:
P(parameters|data) ∝ P(data|parameters) P(parameters)
21 EXAMPLE: LOGISTIC REGRESSION
P(l_i = 1 | d_i) = exp(d_i w) / (1 + exp(d_i w))
P(l, D | w) = Π_i exp(l_i d_i w) / (1 + exp(l_i d_i w))
P(parameters|data) ∝ P(data|parameters) P(parameters)    (still need a prior)
22 EXAMPLE: LOGISTIC REGRESSION
P(w | l, D) ∝ P(w) Π_i exp(l_i d_i w) / (1 + exp(l_i d_i w))
log P(w | l, D) = log P(w) − Σ_i log(1 + exp(−l_i d_i w)) + const
Example prior: log P(w) = −λ‖w‖
23 BAYESIANS OFFER NO SOLUTIONS
Bayes vs. Lagrange: instead of a single optimization solution, Bayes gives a whole solution space, the posterior P(w | D, l).
Monte Carlo methods randomly sample this solution space.
24 WHY MONTE CARLO?
You need to know more than just a max/minimizer: error bars, not only the optimization solution.
E(w) = ∫ w P(w) dw
Var(w) = E[(w − µ)(w − µ)^T] = ∫ w w^T P(w) dw − µµ^T
25 EXAMPLE: POKER BOTS
You can't solve your problem by other (faster) methods: maximize log P(w) when we don't know the derivative.
poker bots: model parameters describe player behavior
2.5M different community cards; 10 players = 59K raise/fold/call permutations per round
26 WHY MONTE CARLO?
You're incredibly lazy: "I could differentiate this, I just don't wanna."
Replace arg max_w log P(w) with the best sample, max_k {P(w^k)}.
27 MARKOV CHAINS
What is an MC? An MC has a steady state if it is:
Irreducible: can visit any state starting at any state
Aperiodic: does not get trapped in deterministic cycles
28 METROPOLIS HASTINGS
ingredients: proposal distribution q(y|x), posterior distribution p(x)
MH Algorithm:
Start with x^0. For k = 1, 2, 3, ...
  Choose a candidate y from q(y|x^k)
  Compute the acceptance probability α = min{1, p(y) q(x^k|y) / (p(x^k) q(y|x^k))}
  Set x^{k+1} = y with probability α, otherwise x^{k+1} = x^k
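The MH loop translates almost line for line into Python. A sketch of my own toy, using a symmetric Gaussian proposal (so the q-ratio in the acceptance probability cancels) and an unnormalized standard normal as the target:

```python
import math, random

# Metropolis sampler: symmetric proposal, so alpha = min(1, p(y)/p(x)).
# p only needs to be known up to a normalization constant.
def metropolis(p, x0=0.0, steps=20000, width=1.0, seed=0):
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        y = x + rng.gauss(0.0, width)     # candidate from q(y | x)
        alpha = min(1.0, p(y) / p(x))     # acceptance probability
        if rng.random() < alpha:
            x = y                         # accept
        samples.append(x)                 # on rejection, the chain repeats x
    return samples

p = lambda x: math.exp(-0.5 * x * x)      # unnormalized standard normal
chain = metropolis(p)[5000:]              # discard burn-in
mean = sum(chain) / len(chain)
```

After burn-in the sample mean and variance approach 0 and 1, the moments of the target; note the rejected steps still append the current state, which is what makes the stationary distribution correct.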
29 CONVERGENCE
Irreducible: the support of q must contain the support of p
Aperiodic: a positive rejection probability guarantees no deterministic cycle
Theorem. Suppose the support of q contains the support of p. Then the MH sampler has a stationary distribution, and that distribution is equal to p.
30 METROPOLIS ALGORITHM
ingredients: proposal distribution q(y|x), posterior distribution p(x)
Assume the proposal is symmetric: q(x^k|y) = q(y|x^k)
Metropolis Hastings: α = min{1, p(y) q(x^k|y) / (p(x^k) q(y|x^k))}
Metropolis: α = min{1, p(y) / p(x^k)}
31 EXAMPLE: GMM
[Figure: histogram of Metropolis iterates]
Andrieu, de Freitas, Doucet, Jordan '03
32 PROPERTIES OF MH
Pros:
We don't need a normalization constant for the posterior:
P(parameters|data) = P(data|parameters) P(parameters) / P(data), but
P(parameters|data) ∝ P(data|parameters) P(parameters) is enough.
We can run many chains in parallel (cluster/GPU).
We don't need any derivatives.
33 PROPERTIES OF MH
Pros: no normalization constant needed for the posterior; many chains can run in parallel (cluster/GPU); no derivatives needed.
Cons:
Mixing time depends on the proposal distribution: too wide = constant rejections = slow mixing; too narrow = short movements = slow mixing.
Samples are only meaningful at the stationary distribution.
Burn-in samples must be discarded.
Many samples are needed because of correlations.
34 SIMULATED ANNEALING
maximize p(x)
Easy choice: run MCMC, then take max_k p(x^k)
Better choice: sample the distribution p^{1/T_k}(x^k), where T is the temperature
Why is this better? Where does the name come from?
35 COOLING SCHEDULE
[Figure: samples of p^{1/T} at decreasing temperatures: hot, warm, cool, cold]
Andrieu, de Freitas, Doucet, Jordan '03
36 CONVERGENCE
SA solves non-convex problems, even NP-complete problems, as time goes to infinity.
Theorem. Suppose the MCMC mixes fast enough that ε-dense sampling occurs in finite time at every temperature. For an annealing schedule with temperature T_k = C / log(k + T_0), simulated annealing converges to a global optimum with probability 1.
Granville, "Simulated annealing: A proof of convergence"
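A minimal annealing sketch assuming the logarithmic schedule above; the bimodal target, the proposal width, and the constants C and T_0 are all hypothetical choices of mine:

```python
import math, random

# Simulated annealing: Metropolis sampling of p(x)^(1/T_k) with the
# schedule T_k = C/log(k + T0). The toy target below has a local maximum
# at x = -2 and its global maximum at x = 2.
def anneal(log_p, x0=0.0, steps=10000, C=3.0, T0=2.0, seed=0):
    rng = random.Random(seed)
    x, best = x0, x0
    for k in range(1, steps + 1):
        T = C / math.log(k + T0)                          # temperature
        y = x + rng.gauss(0.0, 1.0)                       # symmetric proposal
        alpha = math.exp(min(0.0, (log_p(y) - log_p(x)) / T))
        if rng.random() < alpha:
            x = y
        if log_p(x) > log_p(best):                        # track the best visit
            best = x
    return best

log_p = lambda x: max(-(x + 2.0) ** 2, 0.5 - (x - 2.0) ** 2)
best = anneal(log_p)
```

Early on, T is large, so downhill moves are often accepted and the chain can hop between the two modes; as T decays the chain concentrates on the global maximum near x = 2.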
37 WHEN TO USE SIMULATED ANNEALING
Flow chart: Should I use simulated annealing? No.
There is no practical way to choose the temperature schedule:
too fast = stuck in a local minimum (risky)
too slow = no different from MCMC
An act of desperation!
38 GIBBS SAMPLER
Want to sample P(x_1, x_2, x_3). On stage k, pick some coordinate j and propose
q(y|x^k) = p(y_j | x^k_{j^c}) if y_{j^c} = x^k_{j^c}, and 0 otherwise
(j^c denotes the coordinates other than j). The MH acceptance ratio is always 1:
α = p(y) q(x^k|y) / (p(x^k) q(y|x^k)) = p(y) p(x^k_j | x^k_{j^c}) / (p(x^k) p(y_j | y_{j^c})) = p(y_{j^c}) / p(x^k_{j^c}) = 1
using P(B) = P(A and B) / P(A|B)
39 GIBBS SAMPLER
Want to sample P(x_1, x_2, x_3, ..., x_n). The iterates update one coordinate at a time:
x^2: sample x_1 ~ P(x_1 | x_2^1, x_3^1, x_4^1, ..., x_n^1)
x^3: sample x_2 ~ P(x_2 | x_1^2, x_3^2, x_4^2, ..., x_n^2)
x^4: sample x_3 ~ P(x_3 | x_1^3, x_2^3, x_4^3, ..., x_n^3)
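These coordinate sweeps are easy to demonstrate when the full conditionals have closed forms. A sketch of my own for a bivariate Gaussian with correlation ρ, where each conditional is again Gaussian, x_i | x_other ~ N(ρ·x_other, 1 − ρ²):

```python
import math, random

# Gibbs sampler for a standard bivariate Gaussian with correlation rho:
# both sweep steps are exact draws from the Gaussian full conditionals.
def gibbs_bivariate_normal(rho=0.8, steps=20000, seed=0):
    rng = random.Random(seed)
    x1 = x2 = 0.0
    sd = math.sqrt(1.0 - rho * rho)      # conditional standard deviation
    samples = []
    for _ in range(steps):
        x1 = rng.gauss(rho * x2, sd)     # sample x1 ~ P(x1 | x2)
        x2 = rng.gauss(rho * x1, sd)     # sample x2 ~ P(x2 | x1)
        samples.append((x1, x2))
    return samples

chain = gibbs_bivariate_normal()[1000:]  # drop burn-in
```

The empirical E[x1 x2] of the chain approaches ρ = 0.8; the stronger the correlation, the slower the sweeps mix, which previews the "what's wrong with Gibbs" slide.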
40 APPLICATION: SAMPLING GRAPHICAL MODELS
Restricted Boltzmann Machine (RBM): binary (0/1) random variables with real-valued weights
E(v, h) = −a^T v − b^T h − v^T W h
P(v, h) = (1/Z) e^{−E(v,h)}, where Z is the partition function
41 APPLICATION: SAMPLING GRAPHICAL MODELS
Restricted Boltzmann Machine (RBM): E(v, h) = −a^T v − b^T h − v^T W h, P(v, h) = (1/Z) e^{−E(v,h)}
P(v_i = 1 | h) = P(v_i = 1 | h) / (P(v_i = 0 | h) + P(v_i = 1 | h))    (normalization removed)
= exp(a_i + Σ_j w_ij h_j + C) / (exp(C) + exp(a_i + Σ_j w_ij h_j + C))    (C aggregates the constant terms)
= exp(a_i + Σ_j w_ij h_j) / (1 + exp(a_i + Σ_j w_ij h_j))    (constants cancel)
= σ(a_i + Σ_j w_ij h_j), the sigmoid function
42 BLOCK GIBBS FOR RBM
stage 1: freeze the hidden units, randomly sample the visible units: P(v_i = 1 | h) = σ(a_i + Σ_j w_ij h_j)
stage 2: freeze the visible units, randomly sample the hidden units: P(h_j = 1 | v) = σ(b_j + Σ_i w_ij v_i)
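The two-stage scheme above, as a Python sketch for a tiny RBM; the weights and biases here are hypothetical and exist only to make the loop runnable:

```python
import math, random

# Block Gibbs for an RBM: alternately resample ALL visible units given the
# hidden layer, then ALL hidden units given the visible layer, using the
# sigmoid conditionals from the previous slide.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def block_gibbs_rbm(W, a, b, steps=100, seed=0):
    rng = random.Random(seed)
    nv, nh = len(a), len(b)
    v = [0] * nv
    h = [0] * nh
    for _ in range(steps):
        # stage 1: freeze hidden, sample every visible unit independently
        v = [1 if rng.random() < sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(nh))) else 0
             for i in range(nv)]
        # stage 2: freeze visible, sample every hidden unit independently
        h = [1 if rng.random() < sigmoid(b[j] + sum(W[i][j] * v[i] for i in range(nv))) else 0
             for j in range(nh)]
    return v, h

W = [[2.0, -1.0], [2.0, -1.0], [-1.0, 2.0]]   # 3 visible x 2 hidden (made up)
v, h = block_gibbs_rbm(W, a=[0.0, 0.0, 0.0], b=[-1.0, -1.0])
```

The bipartite structure is what makes this fast: within each stage all units are conditionally independent, so a whole layer is resampled in one vectorizable step.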
43 DEEP BELIEF NETS
DBN = layered RBM
Each layer depends only on the layer beneath it (feed forward).
The probability for each hidden node is a sigmoid function.
44 EXAMPLE: MNIST
Gan, Henao, Carlson, Carin '15
A 3-layer DBN with 200 hidden units, trained on 60K MNIST digits.
pre-training: layer-by-layer training
training: train all weights simultaneously
Training is done using a Gibbs sampler; the Gibbs sampler is also used to explore the final solution.
45 EXAMPLE: MNIST
[Figure: training data, learned features, and observations sampled from the deep belief network using the Gibbs sampler]
46 WHAT'S WRONG WITH GIBBS?
Susceptible to strong correlations between variables.
This is also a problem for MH with a bad proposal distribution.
47 SLICE SAMPLER
Problems with MH: hard to choose the proposal distribution; does not exploit analytical information about the problem; long mixing time.
Slice sampler: lift p to the joint density
q(x, u) = 1 if 0 ≤ u ≤ p(x), and 0 otherwise
[Figure: the region under the curve p(x), with the auxiliary variable u]
48 LIFTED INTEGRAL
Problems with MH: hard to choose the proposal distribution; does not exploit analytical information; long mixing time.
Slice sampler:
p(x) = ∫ q(x, u) du
∫ f(x) p(x) dx = ∫∫ f(x) q(x, u) du dx, so sample q instead
49 SLICE SAMPLER
∫ f(x) p(x) dx = ∫∫ f(x) q(x, u) du dx: do Gibbs on q
Freeze x: choose u uniformly from [0, p(x)]
Freeze u: choose x uniformly from {x : p(x) ≥ u}
(we need an analytical formula for this super-level set)
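For a target whose super-level set is available analytically, the two freeze steps become two lines each. A sketch of my own for p(x) ∝ exp(−x²/2), whose slice {x : p(x) ≥ u} is the interval [−√(−2 ln u), √(−2 ln u)]:

```python
import math, random

# Slice sampler for an unnormalized standard normal. Both Gibbs steps are
# exact uniform draws, so no proposal distribution is needed.
def slice_sample(steps=20000, seed=0):
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(steps):
        u = rng.uniform(0.0, math.exp(-0.5 * x * x))  # freeze x: draw u | x
        r = math.sqrt(-2.0 * math.log(u))             # slice half-width
        x = rng.uniform(-r, r)                        # freeze u: draw x | u
        samples.append(x)
    return samples

draws = slice_sample()
var = sum(x * x for x in draws) / len(draws)          # approaches 1
```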
50 MODEL FITTING
p(x) = Π_{i=1}^n p_i(x),  log p(x) = Σ_{i=1}^n log p_i(x)
lift every factor: q(x, u_1, u_2, ..., u_n) = 1 if u_i ≤ p_i(x) for all i, and 0 otherwise
(the support in u is a box)
∫ q(x, u_1, u_2, ..., u_n) du = ∫_{u_1=0}^{p_1(x)} ∫_{u_2=0}^{p_2(x)} ⋯ ∫_{u_n=0}^{p_n(x)} du = Π_{i=1}^n p_i(x)
∫ f(x) p(x) dx = ∫_x ∫_u f(x) q(x, u) dx du
51 MODEL FITTING
∫ f(x) p(x) dx = ∫_x ∫_u f(x) q(x, u) dx du, with q(x, u_1, ..., u_n) = 1 if u_i ≤ p_i(x) for all i, and 0 otherwise
Freeze x: for all i, choose u_i uniformly from [0, p_i(x)]
Freeze u: choose x uniformly from ∩_i {x : p_i(x) ≥ u_i}
52 SLICE SAMPLER PROPERTIES
pros: does not require a proposal distribution; mixes super fast; simple implementation
cons: needs analytical formulas for super-level sets; hard in multiple dimensions
fixes / generalizations: step-out methods; random hyper-rectangles
53 STEP-OUT METHODS
pick a small interval
double its width until the slice is contained
sample uniformly from the interval using rejection sampling: throw away draws outside of the slice
[Figure: three panels showing the interval growing past the slice {x : p(x) ≥ u}]
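A sketch of the stepping-out idea; for simplicity it expands by fixed widths rather than doubling (the doubling variant grows the interval faster but follows the same logic), and the target and widths are my own choices:

```python
import math, random

# One slice-sampling update using stepping out: widen an interval until both
# endpoints leave the slice {x : p(x) >= u}, then sample by rejection,
# shrinking the interval toward the current point on every miss.
def step_out_sample(p, x, u, w, rng):
    lo = x - w * rng.random()            # randomly position the interval on x
    hi = lo + w
    while p(lo) > u:                     # step out to the left
        lo -= w
    while p(hi) > u:                     # step out to the right
        hi += w
    while True:                          # rejection sample inside [lo, hi]
        y = rng.uniform(lo, hi)
        if p(y) > u:
            return y
        if y < x:
            lo = y                       # shrink toward the current point
        else:
            hi = y

rng = random.Random(0)
p = lambda t: math.exp(-0.5 * t * t)     # unnormalized standard normal
x, samples = 0.0, []
for _ in range(5000):
    u = rng.uniform(0.0, p(x))           # slice height, as on slide 49
    x = step_out_sample(p, x, u, w=0.5, rng=rng)
    samples.append(x)
```

Shrinking on rejections (rather than redrawing from the full interval) keeps the expected number of evaluations small while leaving the stationary distribution unchanged.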
54 HIGH DIMS: COORDINATE METHOD
∫ f(x) p(x) dx = ∫∫ f(x) q(x, u) du dx: do coordinate-wise Gibbs on q
Freeze x: choose u uniformly from [0, p(x)]
Freeze u: for i = 1, ..., n, choose x_i uniformly from {x_i : p(x_1, ..., x_i, ..., x_n) ≥ u}
can use the stepping-out method
Radford Neal, "Slice Sampling," '03
55 HIGH DIMS: HYPER-RECTANGLE
choose a large rectangle
randomly/uniformly place the rectangle around the current location
draw a random proposal point from the rectangle
if the proposal lies outside the slice, shrink the box so the proposal is on the boundary
performs much better for poorly-conditioned objectives
Radford Neal, "Slice Sampling," '03
56 COMPARISON
method: rate; when to use
Gradient/Splitting: e^{−ck}; you value reliability and precision (moderate speed, high accuracy)
SGD: 1/k; you value speed over accuracy (high speed, moderate accuracy)
MCMC: 1/√k; you value simplicity (no gradient) or need statistical inference (slow and inaccurate)
57 DO MATLAB EXERCISE: MCMC
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationMSc MT15. Further Statistical Methods: MCMC. Lecture 5-6: Markov chains; Metropolis Hastings MCMC. Notes and Practicals available at
MSc MT15. Further Statistical Methods: MCMC Lecture 5-6: Markov chains; Metropolis Hastings MCMC Notes and Practicals available at www.stats.ox.ac.uk\ nicholls\mscmcmc15 Markov chain Monte Carlo Methods
More informationConnections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. Revised submission to IEEE TNN
Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables Revised submission to IEEE TNN Aapo Hyvärinen Dept of Computer Science and HIIT University
More information16 : Approximate Inference: Markov Chain Monte Carlo
10-708: Probabilistic Graphical Models 10-708, Spring 2017 16 : Approximate Inference: Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Yuan Yang, Chao-Ming Yen 1 Introduction As the target distribution
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationKnowledge Extraction from DBNs for Images
Knowledge Extraction from DBNs for Images Son N. Tran and Artur d Avila Garcez Department of Computer Science City University London Contents 1 Introduction 2 Knowledge Extraction from DBNs 3 Experimental
More informationClassification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012
Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationIntroduction to MCMC. DB Breakfast 09/30/2011 Guozhang Wang
Introduction to MCMC DB Breakfast 09/30/2011 Guozhang Wang Motivation: Statistical Inference Joint Distribution Sleeps Well Playground Sunny Bike Ride Pleasant dinner Productive day Posterior Estimation
More informationAnswers and expectations
Answers and expectations For a function f(x) and distribution P(x), the expectation of f with respect to P is The expectation is the average of f, when x is drawn from the probability distribution P E
More informationMarkov chain Monte Carlo
1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop
More informationBagging During Markov Chain Monte Carlo for Smoother Predictions
Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods
More informationApproximate Inference using MCMC
Approximate Inference using MCMC 9.520 Class 22 Ruslan Salakhutdinov BCS and CSAIL, MIT 1 Plan 1. Introduction/Notation. 2. Examples of successful Bayesian models. 3. Basic Sampling Algorithms. 4. Markov
More informationProbabilistic Machine Learning
Probabilistic Machine Learning Bayesian Nets, MCMC, and more Marek Petrik 4/18/2017 Based on: P. Murphy, K. (2012). Machine Learning: A Probabilistic Perspective. Chapter 10. Conditional Independence Independent
More informationLogistic Regression: Online, Lazy, Kernelized, Sequential, etc.
Logistic Regression: Online, Lazy, Kernelized, Sequential, etc. Harsha Veeramachaneni Thomson Reuter Research and Development April 1, 2010 Harsha Veeramachaneni (TR R&D) Logistic Regression April 1, 2010
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationBrief introduction to Markov Chain Monte Carlo
Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical
More informationBayesian Phylogenetics:
Bayesian Phylogenetics: an introduction Marc A. Suchard msuchard@ucla.edu UCLA Who is this man? How sure are you? The one true tree? Methods we ve learned so far try to find a single tree that best describes
More informationAnnouncements. Proposals graded
Announcements Proposals graded Kevin Jamieson 2018 1 Bayesian Methods Machine Learning CSE546 Kevin Jamieson University of Washington November 1, 2018 2018 Kevin Jamieson 2 MLE Recap - coin flips Data:
More information