RANDOM TOPICS stochastic gradient descent & Monte Carlo
MASSIVE MODEL FITTING nx minimize f(x) = 1 n i=1 f i (x) Big! (over 100K) minimize 1 least squares 2 kax bk2 = X i 1 2 (a ix b i ) 2 minimize 1 SVM 2 kwk2 + h(ldw) = 1 2 kwk2 + X i h(l i d i w) minimize low-rank factorization 1 X 2 kd XY 1 k2 = 2 (d ij x i y j ) 2 ij
THE BIG IDEA minimize f(x) = 1 n nx i=1 f i (x) True gradient rf = 1 nx rf i (x) n i=1 Idea: choose a subset of the data [1,N] rf g = 1 X rf i (x) i2 usually just one sample
INFINITE SAMPLE VS FINITE SAMPLE finite sample minimize f(x) = 1 n nx f i (x) i=1 We can solve finite sample problem to high accuracy. infinite sample but true accuracy is limited by sample size Z minimize f(x) =E s [f s (x)] = s f s (x)p(s)ds
STOCHASTIC ORACLE approximate gradient rf g = 1 X rf i (x) i2 big error solution improves necessary properties g = rf(x)+ why? E( ) =0 small error solutions gets worse Error must decrease as we approach solution
EXAMPLE minimize 1 2 kax bk2 what's happening? inexact gradient why does this happen?
DECREASING STEP SIZE x k+1 = x k k rf k (x) big stepsize k = a b + k why? (almost) equivalent to choose larger sample small stepsize used for strongly convex problems
AVERAGING x k+1 = x k k rf k (x) k = a p k + b averaging ergodic averaging x k+1 = 1 k +1 X xi x why is this bad? used for weakly convex problems does this limit convergence rate?
ergodic averaging AVERAGING x k+1 = 1 k +1 X xi compute without storage x k+1 = k k +1 xk + 1 k +1 xk+1 x k+1 = short memory version k k + xk + k + xk+1 tradeoff variance for bias 1 averaging x
KNOWN RATES: WEAKLY CONVEX Theorem Suppose f is convex, krf(x)k appleg, and that the diameter of dom(f) is less than D. If we use the stepsize k = See Shamir and Zhang, ICML 13 Rakhlin, Shamir, Sridharan, CoRR 11 p c, k then D 2 2 + log(k) E[f( x k ) f? ] apple c + cg p k
KNOWN RATES: STRONGLY CONVEX Shamir and Zhang, ICML 13 E[f( x k ) x k+1 = f? ] apple 58(1 + /k) Theorem Suppose f is strongly convex with parameter m, and that krf(x)k appleg. If you use stepsize k =1/mk, and the limited memory averaging with 1, then k k + xk + k + xk+1 ( + 1) + ( +.5)3 (1 + log k) G 2 k mk
EXAMPLE: SVM PEGASOS: Primal Estimated sub-gradient SOlver for SVM minimize 1 2 kwk2 + Ch(Aw) rh(x) = ( 1, x < 1 0, otherwise note: this is a subgradient descent method
PEGOSOS PEGASOS: Primal Estimated sub-gradient SOlver for SVM minimize X i 2 kwk2 + h(a i w) While not converged : k = 1 k If a T k wk 1: w k+1 = w k k w k If a T k w k < 1: w k+1 = w k k ( w k a T i ) ŵ k+1 = 1 k +1 k+1 X i=1 w k not used in practice
PEGOSOS stochastic methods finite sample method gradient method
PEGOSOS classification error (CV) gradient method stochastic methods You hope to converge before SGD gets slow!
ALLREDUCE but if you don t use SGD as warm start for iterative method Agarwal. Effective Terascale Linear Learning. 12
GENERALIZATION: FBS NX minimize g(x)+ 1 N f i (x) i=1 ˆx k+1 = x k k rf k (x k ) x k+1 =prox g (ˆx k+1, ) diminishing stepsize+averaging: Duchi and Singer 09 (FOBOS) non-ergodic convergence: Rosasco, Villa, Vu 14 all known results make lots of assumptions, gradients must be bounded
MONTO CARLO METHODS Methods that involve randomly sampling a distribution Monte Carlo casino
BAYESIAN LEARNING estimate parameters in a model measure these data = M(parameters) estimate these ingredients prior probability distribution of unknown parameters likelihood probability of data given (unknown) parameters Thomas Bayes
BAYES RULE P (A B) = P (B A)P (A) P (B) likelihood P (parameters data) = P (data parameters)p (parameters) P (data) prior we really care about this P (parameters data) / P (data parameters)p (parameters) posterior distribution
EXAMPLE: LOGISTIC REGRESSION P (l i =1 d i )= exp(d iw) 1 + exp(d i w) P (l, D w) = Y i exp(l i d i w) 1 + exp(l i d i w) P (parameters data) / P (data parameters)p (parameters) prior
EXAMPLE: LOGISTIC REGRESSION P (parameters data) / P (data parameters)p (parameters)! exp(l i d i w) P (w l, D) = P (w) 1 + exp(l i d i w) Y i log P (w l, D) = log P (w)+ X i prior log(1 + exp( l i d i w)) Example: log P (w) = w
BAYESIANS OFFER NO SOLUTIONS Bayes Lagrange solution space P (w D, l) optimization solution Monte Carlo methods randomly sample this solution space
WHY MONTE CARLO? You need to know more than just a max/minimizer error bars optimization solution Z E(w) = wp(w) dw Z Var(w) =E[(w µ)(w µ) T ]= ww T P (w) dw µµ T
EXAMPLE: POKER BOTS You can t solve your problem by other (faster) methods maximize log P (w) we don t know derivative poker bots model parameters describe player behavior 2.5M different community cards 10 players = 59K raise/fold/call perms per round
WHY MONTE CARLO? You re incredibly lazy maximize log P (w) I could differentiate this, I just don t wanna arg max log P (w) max k {P (wk )}
MARKOV CHAINS What is an MC? MC has steady state if Irreducible: can visit any state starting at any state Aperiodic: does not get trapped in deterministic cycles
METROPOLIS HASTINGS proposal distribution q(y x) ingredients posterior distribution p(x) Start with x 0 For k =1, 2, 3,... MH Algorithm Choose candidate y from q(y x k ) Compute acceptance probability p(y)q(x k y) =min 1, p(x k )q(y x k ) Set x k+1 = y with probability otherwise x k+1 = x k
CONVERGENCE Irreducible: support of q must contain support of p Aperiodic: positive rejection probability guarantees no infinite cycle Theorem Suppose the support of q contains the support of p. Then the MH sampler has a stationary distribution, and the distribution is equal to p.
METROPOLIS ALGORITHM proposal distribution q(y x) ingredients posterior distribution p(x) Assume proposal is symmetric q(x k y) =q(y x k ) Metropolis Hastings Metropolis =min 1, p(y)q(x k y) p(x k )q(y x k ) =min 1, p(y) p(x k )
EXAMPLE: GMM histogram of Metropolis iterates Andrieu, Freitas, Doucet, Jordan 03
PROPERTIES OF MH Pro s We don t need a normalization constant for posterior We can run many chains in parallel (cluster/gpu) We don t need any derivatives P (parameters data) = P (data parameters)p (parameters) P (data) P (parameters data) / P (data parameters)p (parameters)
PROPERTIES OF MH Pro s We don t need a normalization constant for posterior We can run many chains in parallel (cluster/gpu) We don t need any derivatives Con s Mixing time depends on proposal distribution too wide = constant rejections = slow mixing too narrow = short movements=slow mixing Samples only meaningful at stationary distribution Burn in samples must be discarded Many samples needed because of correlations
SIMULATED ANNEALING maximize p(x) Easy choice: run MCMC, then max k p(x k ) Better choice: sample the distribution p 1 T k (x k ) temperature Why is this better? Where does the name come from?
COOLING SCHEDULE hot warm cool cold Andrieu, Freitas, Doucet, Jordan 03
CONVERGENCE SA solve non-convex problem, even NP-complete problems, as time goes to infinity. Theorem Suppose the MCMC mixes fast enough that epsilon-dense sampling occurs in finite time at every temperature. For an annealing schedule with temperature T k = 1 C log(k + T 0 ) simulated annealing converges to a global optima with probability 1. Granville, "Simulated annealing: A proof of convergence
WHEN TO USE SIMULATED ANNEALING flow chart Should I use simulated annealing? no. No practical way to choose temperature schedule Too fast = stuck in local minimum (risky) Too slow = no different from MCMC Act of desperation!
GIBBS SAMPLER Want to sample P (x 1,x 2,x 3 ) on stage k, pick some coordinate j q(y x k )= ( p(y j x k j c ), if y j c 0, otherwise = x k j c = p(y)q(xk y) p(x k )q(y x k ) = p(y)p(xk j xkjc) p(x k )p(y j y j c) = p(y j c ) p(x k j c ) =1 P (B) = P (A and B) P (A B)
GIBBS SAMPLER Want to sample P (x 1,x 2,x 3 ) iterates x 2 P (x 1 x 1 2,x 1 3,x 1 4,...,x 1 n) x 3 P (x 2 x 2 1,x 2 3,x 2 4,...,x 2 n) x 4 P (x 3 x 3 1,x 3 2,x 3 4,...,x 3 n)
APPLICATION: SAMPLING GRAPHICAL MODELS Restricted Boltzmann Machine (RBM) binary random variables 0/1 weights E(v, h) = a T v b T h v T Wh P (v, h) = 1 Z e E(v,h) partition function
APPLICATION: SAMPLING GRAPHICAL MODELS Restricted Boltzmann Machine (RBM) E(v, h) = a T v b T h v T Wh P (v, h) = 1 Z e E(v,h) P (v i =1 h) = = P (v i =1 h) P (v i =0 h)+p (v i =1 h) exp( a i + P j w ijh j + C) exp(c)+exp( a i + P j w ijh j + C) remove normalization aggregate constant = exp( a i + P j w ijh j ) 1 + exp( a i + P j w ijh j ) cancel constants = ( a i + X i w ij h j ) sigmoid function
BLOCK GIBBS FOR RBM visible stage 1 freeze hidden randomly sample visible hidden P (v i =1 h) = ( a i + X j w ij h j ) stage 2 freeze visible randomly sample hidden P (h j =1 v) = ( b j + X i w ij v i )
DEEP BELIEF NETS visible DBN = layered RBM Each layer only depends on layer beneath it (feed forward) Probability for each hidden node is sigmoid function hidden
EXAMPLE: MNIST Gan, Henao, Carlson, Carin 15 Train 3-layer DBN with 200 hidden units trained on 60K MNIST digits pre-training: layer-by-layer training training: train all weights simultaneously training done using Gibbs sampler Gibbs sampler used to explore final solution
EXAMPLE: MNIST training data Learned features observations sampled from deep belief network using Gibbs sampler
WHAT S WRONG WITH GIBBS? susceptible to strong correlations also a problem for MH: bad proposal distribution
SLICE SAMPLER Problems with MH: hard to choose proposal distribution does not exploit analytical information about problem long mixing time Slice sampler ( 1, if 0 apple u apple p(x) q(x, u) = 0, otherwise 1 1 p(x) u x x
LIFTED INTEGRAL Problems with MH: hard to choose proposal distribution does not exploit analytical information about problem long mixing time Z Slice sampler Z Z Z p(x) = q(x, u) du f(x)p(x) = f(x)q(x, u) du dx sample this 1 1 p(x) u x x
SLICE SAMPLER Z Z Z f(x)p(x) = f(x)q(x, u) du dx do Gibbs on this Freeze x : Choose u uniformly from [0,p(x)] Freeze u : Choose x uniformly from {x p(x) u} need analytical formula for this set p(x) u
MODEL FITTING ny p(x) = p i (x) i=1 support is a box q(x, u 1,u 2,...,u n )= log p(x) = nx i=1 ( 1, if p i (x) apple u i 8i 0, otherwise log p i (x) Z q(x, u 1,u 2,...,u n ) du = Z p1 (x) u 1 =0 Z p2 (x) u 2 =0 Z pn (x) u n =0 du = ny k=1 p i (x) Z Z Z f(x)p(x) dx = x u f(x)q(x, u) dx du
MODEL FITTING Z Z Z f(x)p(x) dx = x u f(x)q(x, u) dx du q(x, u 1,u 2,...,u n )= ( 1, if p i (x) apple u i 8i 0, otherwise Freeze x : For all i, choose u i uniformly from [0,p i (x)] Freeze u : Choose x uniformly from T {x p i (x) u i } i
SLICE SAMPLER PROPERTIES pros does not require proposal distribution mixes super fast simple implementation cons analytical formula for super-level sets hard in multiple dimensions fixes / generalizations step-out methods random hyper-rectangles
STEP-OUT METHODS pick small interval double width size until slice is contained sample uniformly from slice using rejection sampling - throw away selections outside of slice p(x) p(x) p(x) u u u
HIGH DIMS: COORDINATE Z Z Z METHOD f(x)p(x) = f(x)q(x, u) du dx do coordinate Gibbs on this Freeze x : Choose u uniformly from [0,p(x)] Freeze u : For i =1 n Choose x i uniformly from {x i p(x 1,,x i,x n ) u} can use stepping -out method Radford Neal: Slice Sampling 03
HIGH DIMS: HYPER-RECTANGLE choose large rectangle randomly/uniformly place rectangle around current location draw random proposal point from rectangle if proposal lies outside slice, shrink box so proposal is on boundary performs much better for poorly-conditioned objectives Radford Neal: Slice Sampling 03
COMPARISON Gradient/ Splitting rate e ck when to use you value reliability and precision (moderate speed, high accuracy) SGD 1/k you value speed over accuracy (high speed, moderate accuracy) MCMC 1/ p k you value simplicity (no gradient) or need statistical inference (slow and inaccurate)
DO MATLAB EXERCISE MCMC