RANDOM TOPICS: stochastic gradient descent & Monte Carlo


MASSIVE MODEL FITTING
minimize f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x), with n big! (over 100K)
least squares: minimize \frac{1}{2}\|Ax - b\|^2 = \sum_i \frac{1}{2}(a_i x - b_i)^2
SVM: minimize \frac{1}{2}\|w\|^2 + h(LDw) = \frac{1}{2}\|w\|^2 + \sum_i h(l_i d_i w)
low-rank factorization: minimize \frac{1}{2}\|D - XY\|^2 = \frac{1}{2}\sum_{ij} (d_{ij} - x_i y_j)^2

THE BIG IDEA
minimize f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)
True gradient: \nabla f = \frac{1}{n} \sum_{i=1}^n \nabla f_i(x)
Idea: choose a subset \Gamma of the data [1, N] and use
\nabla f_\Gamma = \frac{1}{|\Gamma|} \sum_{i \in \Gamma} \nabla f_i(x),
usually with just one sample.
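
A minimal numpy sketch of this idea for the least-squares example above; the synthetic data A, b and the batch of size one are made up for illustration:

    import numpy as np

    def stochastic_grad(A, b, x, batch):
        # gradient of (1/n) sum_i 0.5*(a_i x - b_i)^2, estimated on a subset
        Ab, bb = A[batch], b[batch]
        return Ab.T @ (Ab @ x - bb) / len(batch)

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((1000, 20)), rng.standard_normal(1000)
    x = np.zeros(20)
    g = stochastic_grad(A, b, x, rng.integers(0, 1000, size=1))  # one sample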

INFINITE SAMPLE VS FINITE SAMPLE
finite sample: minimize f(x) = \frac{1}{n} \sum_{i=1}^n f_i(x)
We can solve the finite-sample problem to high accuracy, but true accuracy is limited by the sample size.
infinite sample: minimize f(x) = E_s[f_s(x)] = \int f_s(x) p(s) \, ds

STOCHASTIC ORACLE
approximate gradient: g = \nabla f_\Gamma = \frac{1}{|\Gamma|} \sum_{i \in \Gamma} \nabla f_i(x)
Model it as g = \nabla f(x) + \epsilon with E(\epsilon) = 0. Why are these the necessary properties?
Far from the solution the true gradient is big relative to the error and the solution improves; near the solution the gradient is small, the error dominates, and the solution gets worse. The error must decrease as we approach the solution.

EXAMPLE
minimize \frac{1}{2}\|Ax - b\|^2 with an inexact gradient
(figure: convergence with an inexact gradient) What's happening? Why does this happen?

DECREASING STEP SIZE
x^{k+1} = x^k - \gamma_k \nabla f_{i_k}(x^k), with \gamma_k = \frac{a}{b + k}
Why? A small stepsize is (almost) equivalent to choosing a larger sample; a big stepsize stalls at a higher error.
Used for strongly convex problems.
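
A sketch of the full SGD loop with the schedule \gamma_k = a/(b+k); the constants a_step, b_step and the synthetic data are placeholders one would tune:

    import numpy as np

    rng = np.random.default_rng(0)
    A, b = rng.standard_normal((1000, 20)), rng.standard_normal(1000)
    x = np.zeros(20)
    a_step, b_step = 1.0, 10.0                 # schedule constants (assumed)
    for k in range(1, 5001):
        i = rng.integers(0, len(b))            # pick one random sample
        gamma = a_step / (b_step + k)          # decreasing stepsize
        x -= gamma * A[i] * (A[i] @ x - b[i])  # step on f_i only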

AVERAGING
x^{k+1} = x^k - \gamma_k \nabla f_{i_k}(x^k), with \gamma_k = \frac{a}{\sqrt{k} + b}
ergodic averaging: \bar{x}^{k+1} = \frac{1}{k+1} \sum_i x^i
Why is this bad? Does this limit the convergence rate?
Used for weakly convex problems.

AVERAGING
Ergodic averaging can be computed without storage:
\bar{x}^{k+1} = \frac{k}{k+1} \bar{x}^k + \frac{1}{k+1} x^{k+1}
short-memory version (with \eta \ge 1): \bar{x}^{k+1} = \frac{k}{k+\eta} \bar{x}^k + \frac{\eta}{k+\eta} x^{k+1}
This trades variance for bias; \eta = 1 recovers plain averaging.
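
The constant-memory recursion translates directly into code; a sketch, where the sequence `iterates` stands for the SGD iterates x^1, x^2, ...:

    import numpy as np

    def ergodic_average(iterates, eta=1.0):
        # xbar^{k+1} = k/(k+eta) * xbar^k + eta/(k+eta) * x^{k+1};
        # eta = 1 is the plain running mean, eta > 1 forgets faster.
        xbar = None
        for k, xk in enumerate(iterates):
            xbar = xk.copy() if xbar is None else (k * xbar + eta * xk) / (k + eta)
        return xbar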

KNOWN RATES: WEAKLY CONVEX
Theorem. Suppose f is convex, \|\nabla f(x)\| \le G, and the diameter of dom(f) is less than D. If we use the stepsize \gamma_k = \frac{c}{\sqrt{k}}, then
E[f(\bar{x}^k) - f^\star] \le \left( \frac{D^2}{c} + c G^2 \right) \frac{2 + \log k}{\sqrt{k}}.
See Shamir and Zhang, ICML '13; Rakhlin, Shamir, Sridharan, CoRR '11.

KNOWN RATES: STRONGLY CONVEX
Theorem (Shamir and Zhang, ICML '13). Suppose f is strongly convex with parameter m, and that \|\nabla f(x)\| \le G. If you use stepsize \gamma_k = 1/(mk) and the limited-memory averaging
\bar{x}^{k+1} = \frac{k}{k+\eta} \bar{x}^k + \frac{\eta}{k+\eta} x^{k+1} with \eta \ge 1, then
E[f(\bar{x}^k) - f^\star] \le 58\left(1 + \frac{\eta}{k}\right)\left((\eta + 1) + (\eta + 0.5)^3 \frac{1 + \log k}{k}\right) \frac{G^2}{mk}.

EXAMPLE: SVM
PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM
minimize \frac{1}{2}\|w\|^2 + C\, h(Aw),
where h is the hinge loss, with \nabla h(x) = -1 if x < 1, and 0 otherwise.
Note: this is a subgradient descent method.

PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM
minimize \frac{\lambda}{2}\|w\|^2 + \sum_i h(a_i w)
While not converged:
  \gamma_k = \frac{1}{\lambda k}
  If a_k^T w^k \ge 1:  w^{k+1} = w^k - \gamma_k \lambda w^k
  If a_k^T w^k < 1:   w^{k+1} = w^k - \gamma_k (\lambda w^k - a_k^T)
The averaging \hat{w}^{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} w^i is not used in practice.
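
A runnable sketch of these updates on synthetic data, with the labels written explicitly (the slides fold them into a_i = l_i d_i) and the stepsize 1/(\lambda k) as above:

    import numpy as np

    def pegasos(A, l, lam=0.1, iters=10000, seed=0):
        # A: rows a_i are training examples; l: labels in {-1, +1}
        rng = np.random.default_rng(seed)
        w = np.zeros(A.shape[1])
        for k in range(1, iters + 1):
            i = rng.integers(0, len(l))
            gamma = 1.0 / (lam * k)
            if l[i] * (A[i] @ w) >= 1:                 # margin satisfied
                w -= gamma * lam * w                   # only shrink
            else:                                      # margin violated
                w -= gamma * (lam * w - l[i] * A[i])   # shrink and move
        return w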

PEGASOS
(figure: convergence curves comparing stochastic methods, the finite-sample method, and the gradient method)

PEGASOS
(figure: cross-validated classification error for the gradient method vs. stochastic methods)
You hope to converge before SGD gets slow!

ALLREDUCE
Use SGD as a warm start for an iterative (batch) method.
Agarwal et al., "Effective Terascale Linear Learning," '12

GENERALIZATION: FBS
minimize g(x) + \frac{1}{N} \sum_{i=1}^N f_i(x)
\hat{x}^{k+1} = x^k - \gamma_k \nabla f_{i_k}(x^k)
x^{k+1} = \mathrm{prox}_g(\hat{x}^{k+1}, \gamma_k)
diminishing stepsize + averaging: Duchi and Singer '09 (FOBOS)
non-ergodic convergence: Rosasco, Villa, Vu '14
All known results make lots of assumptions; gradients must be bounded.

MONTE CARLO METHODS
Methods that involve randomly sampling from a distribution. Named after the Monte Carlo casino.

BAYESIAN LEARNING
Estimate the parameters in a model of measured data: data = M(parameters). We measure the data and estimate the parameters.
Ingredients:
prior: probability distribution of the unknown parameters
likelihood: probability of the data given the (unknown) parameters

BAYES RULE
P(A|B) = \frac{P(B|A) P(A)}{P(B)}
P(parameters | data) = \frac{P(data | parameters) P(parameters)}{P(data)}
likelihood: P(data | parameters); prior: P(parameters). We really care about the posterior distribution:
P(parameters | data) \propto P(data | parameters) P(parameters)

EXAMPLE: LOGISTIC REGRESSION
P(l_i = 1 | d_i) = \frac{\exp(d_i w)}{1 + \exp(d_i w)}
P(l, D | w) = \prod_i \frac{\exp(l_i d_i w)}{1 + \exp(l_i d_i w)}
P(parameters | data) \propto P(data | parameters) P(parameters), with P(parameters) the prior.

EXAMPLE: LOGISTIC REGRESSION
P(w | l, D) \propto \left( \prod_i \frac{\exp(l_i d_i w)}{1 + \exp(l_i d_i w)} \right) P(w)
-\log P(w | l, D) = -\log P(w) + \sum_i \log(1 + \exp(-l_i d_i w)) + \text{const}
Example prior: -\log P(w) = \lambda \|w\|_1 (a Laplace prior)
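
As a concrete sketch, the negative log-posterior as a function; the weight `lam` and the Laplace-prior choice are illustrative assumptions:

    import numpy as np

    def neg_log_posterior(w, D, l, lam=1.0):
        # -log P(w | l, D) up to a constant:
        #   sum_i log(1 + exp(-l_i d_i w)) - log P(w)
        nll = np.sum(np.logaddexp(0.0, -l * (D @ w)))  # stable log(1+exp(.))
        return nll + lam * np.sum(np.abs(w))           # Laplace prior (assumed)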

BAYESIANS OFFER NO SOLUTIONS
(portraits: Bayes and Lagrange)
Optimization returns a single solution; the posterior P(w | D, l) is an entire solution space. Monte Carlo methods randomly sample this solution space.

WHY MONTE CARLO?
You need to know more than just a max/minimizer: error bars around the optimization solution.
E(w) = \int w \, P(w) \, dw
Var(w) = E[(w - \mu)(w - \mu)^T] = \int w w^T P(w) \, dw - \mu\mu^T
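
Given MCMC draws of w, these integrals are approximated by sample averages; a sketch:

    import numpy as np

    def posterior_summaries(samples):
        # samples: (num_draws, dim) array of post burn-in MCMC draws of w
        mu = samples.mean(axis=0)               # estimate of E(w)
        cov = np.cov(samples, rowvar=False)     # estimate of Var(w)
        return mu, cov, np.sqrt(np.diag(cov))   # error bars = posterior stdevs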

EXAMPLE: POKER BOTS
You can't solve your problem by other (faster) methods: maximize \log P(w), but we don't know the derivative.
Poker bots: model parameters describe player behavior; 2.5M different community cards; 10 players = 59K raise/fold/call permutations per round.

WHY MONTE CARLO?
You're incredibly lazy: maximize \log P(w). "I could differentiate this, I just don't wanna."
Approximate \arg\max \log P(w) by taking \max_k \{P(w^k)\} over the samples.

MARKOV CHAINS
What is an MC? An MC has a steady state if it is:
Irreducible: can visit any state starting from any state
Aperiodic: does not get trapped in deterministic cycles

METROPOLIS HASTINGS
Ingredients: a proposal distribution q(y|x) and the posterior distribution p(x).
MH Algorithm:
Start with x^0. For k = 1, 2, 3, ...:
  Choose a candidate y from q(y | x^k)
  Compute the acceptance probability \alpha = \min\left\{1, \frac{p(y)\, q(x^k | y)}{p(x^k)\, q(y | x^k)}\right\}
  Set x^{k+1} = y with probability \alpha; otherwise x^{k+1} = x^k
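
A sketch of the algorithm with a Gaussian random-walk proposal; the proposal width `prop_std` is an arbitrary choice, and since this q is symmetric the ratio also covers the Metropolis special case from a later slide:

    import numpy as np

    def metropolis_hastings(log_p, x0, steps, prop_std=1.0, seed=0):
        # q(y|x) = N(x, prop_std^2 I) is symmetric, so the q-ratio cancels
        # and alpha = min(1, p(y)/p(x^k)); work in logs for stability.
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        chain = [x.copy()]
        for _ in range(steps):
            y = x + prop_std * rng.standard_normal(x.shape)
            if np.log(rng.uniform()) < log_p(y) - log_p(x):  # accept w.p. alpha
                x = y
            chain.append(x.copy())
        return np.array(chain)

    # usage: unnormalized log-densities are fine (no partition function needed)
    chain = metropolis_hastings(lambda x: -0.5 * x @ x, np.zeros(2), 10000)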

CONVERGENCE
Irreducible: the support of q must contain the support of p.
Aperiodic: a positive rejection probability guarantees no infinite cycle.
Theorem. Suppose the support of q contains the support of p. Then the MH sampler has a stationary distribution, and the distribution is equal to p.

METROPOLIS ALGORITHM
Ingredients: proposal distribution q(y|x), posterior distribution p(x). Assume the proposal is symmetric: q(x^k | y) = q(y | x^k). Then Metropolis-Hastings reduces to Metropolis:
\alpha = \min\left\{1, \frac{p(y)\, q(x^k|y)}{p(x^k)\, q(y|x^k)}\right\} = \min\left\{1, \frac{p(y)}{p(x^k)}\right\}

EXAMPLE: GMM
(figure: histogram of Metropolis iterates for a Gaussian mixture target)
Andrieu, de Freitas, Doucet, Jordan '03

PROPERTIES OF MH
Pros:
We don't need a normalization constant for the posterior: instead of
P(parameters | data) = \frac{P(data | parameters) P(parameters)}{P(data)},
MH only uses P(parameters | data) \propto P(data | parameters) P(parameters).
We can run many chains in parallel (cluster/GPU).
We don't need any derivatives.

PROPERTIES OF MH
Cons:
Mixing time depends on the proposal distribution:
  too wide = constant rejections = slow mixing
  too narrow = short movements = slow mixing
Samples are only meaningful at the stationary distribution, so burn-in samples must be discarded.
Many samples are needed because of correlations between successive iterates.

SIMULATED ANNEALING
maximize p(x)
Easy choice: run MCMC, then take \max_k p(x^k).
Better choice: sample the distribution p^{1/T_k}(x^k), where T_k is a temperature.
Why is this better? Where does the name come from?

COOLING SCHEDULE
(figure: the tempered distribution p^{1/T} flattens when hot and concentrates on the modes as it cools: hot, warm, cool, cold)
Andrieu, de Freitas, Doucet, Jordan '03

CONVERGENCE
SA solves non-convex problems, even NP-complete problems, as time goes to infinity.
Theorem. Suppose the MCMC mixes fast enough that \epsilon-dense sampling occurs in finite time at every temperature. For an annealing schedule with temperature
T_k = \frac{1}{C \log(k + T_0)},
simulated annealing converges to a global optimum with probability 1.
Granville et al., "Simulated annealing: A proof of convergence"
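
A sketch combining a Metropolis step with the logarithmic cooling schedule from the theorem; the proposal width and the constants C, T0 are illustrative choices:

    import numpy as np

    def simulated_annealing(log_p, x0, steps, C=1.0, T0=2.0, seed=0):
        # maximize p(x) by sampling p^{1/T_k} with T_k = 1/(C*log(k + T0))
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        best, best_lp = x.copy(), log_p(x)
        for k in range(1, steps + 1):
            T = 1.0 / (C * np.log(k + T0))               # cooling schedule
            y = x + 0.5 * rng.standard_normal(x.shape)   # proposal width assumed
            if np.log(rng.uniform()) < (log_p(y) - log_p(x)) / T:
                x = y
                if log_p(x) > best_lp:
                    best, best_lp = x.copy(), log_p(x)
        return best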

WHEN TO USE SIMULATED ANNEALING
Flow chart: "Should I use simulated annealing?" "No."
There is no practical way to choose the temperature schedule:
  too fast = stuck in a local minimum (risky)
  too slow = no different from MCMC
An act of desperation!

GIBBS SAMPLER
Want to sample P(x_1, x_2, x_3). On stage k, pick some coordinate j and propose
q(y | x^k) = p(y_j | x^k_{j^c}) if y_{j^c} = x^k_{j^c}, and 0 otherwise.
The MH acceptance ratio is then
\alpha = \frac{p(y)\, q(x^k | y)}{p(x^k)\, q(y | x^k)} = \frac{p(y)\, p(x^k_j | x^k_{j^c})}{p(x^k)\, p(y_j | y_{j^c})} = \frac{p(y_{j^c})}{p(x^k_{j^c})} = 1,
using P(B) = P(A and B) / P(A | B). Every proposal is accepted.

GIBBS SAMPLER
Want to sample P(x_1, x_2, x_3, ...). The iterates sweep through the coordinates, one conditional draw at a time:
x^2 \sim P(x_1 | x_2^1, x_3^1, x_4^1, \dots, x_n^1)
x^3 \sim P(x_2 | x_1^2, x_3^2, x_4^2, \dots, x_n^2)
x^4 \sim P(x_3 | x_1^3, x_2^3, x_4^3, \dots, x_n^3)
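
A sketch for a target whose conditionals are available in closed form: a 2D Gaussian with correlation rho (a standard textbook example, not from the slides):

    import numpy as np

    def gibbs_bivariate_gaussian(rho, steps, seed=0):
        # For a standard bivariate Gaussian with correlation rho, each full
        # conditional is N(rho * other, 1 - rho^2), so Gibbs draws are exact.
        rng = np.random.default_rng(seed)
        x1, x2, chain = 0.0, 0.0, []
        s = np.sqrt(1.0 - rho**2)
        for _ in range(steps):
            x1 = rng.normal(rho * x2, s)   # draw x1 | x2
            x2 = rng.normal(rho * x1, s)   # draw x2 | x1
            chain.append((x1, x2))
        return np.array(chain)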

APPLICATION: SAMPLING GRAPHICAL MODELS
Restricted Boltzmann Machine (RBM): binary 0/1 random variables (visible v, hidden h) connected by weights W.
E(v, h) = -a^T v - b^T h - v^T W h
P(v, h) = \frac{1}{Z} e^{-E(v,h)}, where Z is the partition function.

APPLICATION: SAMPLING GRAPHICAL MODELS
For the RBM with E(v, h) = -a^T v - b^T h - v^T W h and P(v, h) = \frac{1}{Z} e^{-E(v,h)}:
P(v_i = 1 | h) = \frac{P(v_i = 1, h)}{P(v_i = 0, h) + P(v_i = 1, h)}   (remove normalization: Z cancels)
= \frac{\exp(a_i + \sum_j w_{ij} h_j + C)}{\exp(C) + \exp(a_i + \sum_j w_{ij} h_j + C)}   (aggregate the constant C)
= \frac{\exp(a_i + \sum_j w_{ij} h_j)}{1 + \exp(a_i + \sum_j w_{ij} h_j)}   (cancel constants)
= \sigma(a_i + \sum_j w_{ij} h_j),   the sigmoid function.

BLOCK GIBBS FOR RBM
Stage 1: freeze the hidden units, randomly sample the visible units:
P(v_i = 1 | h) = \sigma(a_i + \sum_j w_{ij} h_j)
Stage 2: freeze the visible units, randomly sample the hidden units:
P(h_j = 1 | v) = \sigma(b_j + \sum_i w_{ij} v_i)
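
A numpy sketch of the two stages; shapes follow the convention W[i, j] connecting visible unit i to hidden unit j, and the initial v is arbitrary:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def block_gibbs_rbm(W, a, b, v, steps, seed=0):
        # W: (n_visible, n_hidden); a, b: visible/hidden biases; v: binary vector
        rng = np.random.default_rng(seed)
        h = np.zeros(b.shape)
        for _ in range(steps):
            h = (rng.uniform(size=b.shape) < sigmoid(b + v @ W)).astype(float)
            v = (rng.uniform(size=a.shape) < sigmoid(a + W @ h)).astype(float)
        return v, h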

DEEP BELIEF NETS
A DBN is a layered stack of RBMs. Each layer depends only on the layer beneath it (feed-forward), and the probability for each hidden node is a sigmoid function of the layer below.

EXAMPLE: MNIST
Gan, Henao, Carlson, Carin '15: a 3-layer DBN with 200 hidden units trained on 60K MNIST digits.
pre-training: layer-by-layer training
training: train all weights simultaneously, done using the Gibbs sampler
The Gibbs sampler is also used to explore the final solution.

EXAMPLE: MNIST
(figures: training data; learned features; observations sampled from the deep belief network using the Gibbs sampler)

WHAT'S WRONG WITH GIBBS?
It is susceptible to strong correlations between variables. (This is also a problem for MH with a bad proposal distribution.)

SLICE SAMPLER
Problems with MH: hard to choose the proposal distribution; does not exploit analytical information about the problem; long mixing time.
Slice sampler: lift p(x) to a joint density over (x, u):
q(x, u) = 1 if 0 \le u \le p(x), and 0 otherwise.
(figure: the region under the graph of p(x), with u the height)

LIFTED INTEGRAL
The marginal recovers p: p(x) = \int q(x, u) \, du, and so
\int f(x) p(x) \, dx = \int\!\!\int f(x) q(x, u) \, du \, dx.
Sample q(x, u) instead of p(x).

SLICE SAMPLER
\int f(x) p(x) \, dx = \int\!\!\int f(x) q(x, u) \, du \, dx: do Gibbs on q.
Freeze x: choose u uniformly from [0, p(x)]
Freeze u: choose x uniformly from the super-level set \{x : p(x) \ge u\}
We need an analytical formula for this set.
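
One distribution where the super-level set is analytic is p(x) = e^{-x} on x >= 0: the set {x : p(x) >= u} is just the interval [0, -log u]. A sketch:

    import numpy as np

    def slice_sample_exponential(steps, seed=0):
        # slice sampling p(x) = exp(-x), x >= 0
        rng = np.random.default_rng(seed)
        x, chain = 1.0, []
        for _ in range(steps):
            u = rng.uniform(0.0, np.exp(-x))   # freeze x: u ~ Unif[0, p(x)]
            x = rng.uniform(0.0, -np.log(u))   # freeze u: x ~ Unif on {p >= u}
            chain.append(x)
        return np.array(chain)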

MODEL FITTING
p(x) = \prod_{i=1}^n p_i(x), so -\log p(x) = -\sum_{i=1}^n \log p_i(x)
Lift each factor: q(x, u_1, u_2, \dots, u_n) = 1 if u_i \le p_i(x) for all i, and 0 otherwise; the support in u is a box.
\int q(x, u_1, \dots, u_n) \, du = \int_{u_1=0}^{p_1(x)} \int_{u_2=0}^{p_2(x)} \cdots \int_{u_n=0}^{p_n(x)} du = \prod_{i=1}^n p_i(x)
\int f(x) p(x) \, dx = \int_x \int_u f(x) q(x, u) \, dx \, du

MODEL FITTING
\int f(x) p(x) \, dx = \int_x \int_u f(x) q(x, u) \, dx \, du: do Gibbs on q.
Freeze x: for all i, choose u_i uniformly from [0, p_i(x)]
Freeze u: choose x uniformly from \bigcap_i \{x : p_i(x) \ge u_i\}

SLICE SAMPLER PROPERTIES
pros: does not require a proposal distribution; mixes super fast; simple implementation
cons: needs an analytical formula for super-level sets; hard in multiple dimensions
fixes / generalizations: step-out methods; random hyper-rectangles

STEP-OUT METHODS
Pick a small interval; double its width until the slice is contained; then sample uniformly from the slice using rejection sampling, throwing away selections outside the slice.
(figure: three panels showing the interval growing around the slice \{x : p(x) \ge u\})
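
A 1D sketch in the spirit of Neal's method; it steps out by a fixed width w (the doubling variant above grows the interval faster) and shrinks on rejected draws:

    import numpy as np

    def slice_sample_stepout(log_p, x0, steps, w=1.0, seed=0):
        rng = np.random.default_rng(seed)
        x, chain = float(x0), []
        for _ in range(steps):
            log_u = log_p(x) + np.log(rng.uniform())  # slice height (log scale)
            lo = x - w * rng.uniform()                # random interval around x
            hi = lo + w
            while log_p(lo) > log_u:                  # step out left
                lo -= w
            while log_p(hi) > log_u:                  # step out right
                hi += w
            while True:                               # sample, shrink on rejects
                x_new = rng.uniform(lo, hi)
                if log_p(x_new) > log_u:
                    x = x_new
                    break
                if x_new < x:
                    lo = x_new
                else:
                    hi = x_new
            chain.append(x)
        return np.array(chain)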

HIGH DIMS: COORDINATE METHOD
Do coordinate Gibbs on \int\!\!\int f(x) q(x, u) \, du \, dx:
Freeze x: choose u uniformly from [0, p(x)]
Freeze u: for i = 1, \dots, n, choose x_i uniformly from \{x_i : p(x_1, \dots, x_i, \dots, x_n) \ge u\}
The stepping-out method can be used for each coordinate.
Radford Neal, "Slice Sampling," '03

HIGH DIMS: HYPER-RECTANGLE
Choose a large rectangle and randomly/uniformly place it around the current location. Draw a random proposal point from the rectangle; if the proposal lies outside the slice, shrink the box so the proposal is on the boundary. Performs much better on poorly-conditioned objectives.
Radford Neal, "Slice Sampling," '03

COMPARISON

method               rate         when to use
Gradient/Splitting   e^{-ck}      you value reliability and precision (moderate speed, high accuracy)
SGD                  1/k          you value speed over accuracy (high speed, moderate accuracy)
MCMC                 1/\sqrt{k}   you value simplicity (no gradient) or need statistical inference (slow and inaccurate)

DO MATLAB EXERCISE: MCMC