Stochastic Proximal Gradient Algorithm
|
|
- William Clarke
- 5 years ago
- Views:
Transcription
1 Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind help of A. Juditsky LJK, Grenoble
2 1 Motivation
3 Problem Statement (P) Argmin θ Θ { l(θ) + g(θ)}, where l is a smooth log-likelihood function or some other smooth statistical learning function, and g is a possibly non-smooth convex penalty term. This problem has attracted a lot of attention with the growing need to address high-dimensional statistical problems This work focuses on the case where the function l and its gradient l are both intractable, and where l is given by l(θ) = H θ (x)π θ (dx), for some probability measure π θ.
4 Network structure Problem: Estimate sparse network structures from measurements on the nodes. For discrete measurement: amounts to estimate a Gibbs measure with pair-wise interactions f θ (x 1,..., x p ) = 1 p exp θ ii B 0 (x i ) + θ ij B(x i, x j ) Z θ, i=1 1 j<i p for a function B 0 : X R, and a symmetric function B : X X R, where X is a finite set. The absence of an edge encodes conditional independence.
5 Network structure Each graph represents a model class of graphical models; learning a graph then is a model class selection problem. Constraint-based approaches: test conditional independence from the data and then determine a graph that most closely represents those independencies. Score-based approaches combine a metric for the complexity of the graph with a measure of the goodness of fit of the graph to the data... but the number of graph structures grows super-exponentially, and the problem is in general NP-hard.
6 Network structure For x X p, define B(x) def = (B jk (x j, x k )) 1 j,k p R p p. The l 1 -penalized maximum likelihood estimate of θ is obtained by solving an optimization problem of the form (P) where l and g are given by l(θ) = 1 n n i=1 θ, B(x (i) ) log Z θ, g(θ) = λ θ jk. 1 k<j p
7 Fisher identity Fact 1: Z θ is the normalization constant is given by Z θ = x exp( θ, B(x) where the sum is over all the possible configurations. Fact 2: the gradient log Z θ is the expectation of the sufficient statistics: log Z θ = B(x)f θ (x) x Problem: None of these quantities can be computed explicitly... Nevertheless, they can be estimated using Monte Carlo integration.
8 General framework where re Argmin θ Θ { l(θ) + g(θ)}, l is a smooth log-likelihood function or some other smooth statistical learning function, g is a non-smooth convex sparsity-inducing penalty. the function l and its gradient l are intractable, The score function l is given by l(θ) = H θ (x)π θ (dx), for some probability measure π θ on some measurable space (X, B), and some function H θ : X Θ.
9 1 Motivation
10 Definition Definition: Proximal mapping associated with closed convex function g and stepsize γ ( prox γ (θ) = Argmin ϑ Θ g(ϑ) + (2γ) 1 ϑ θ 2 ) 2 If g = I K, where K is a closed convex set (I K (x) = 0, x K, I K (x) = otherwise), then prox γ is the Euclidean projection on K prox γ (θ) = Argmin ϑ K ϑ θ 2 2 = P K (θ) if g(θ) = p i=1 λ i θ i then prox g is shrinkage (soft threshold) operation θ i γλ i θ i γλ i [S λ,γ (θ)] i = 0 θ i γλ i θ i + γλ i θ i γλ i
11 Proximal gradient method Unconstrained problem with cost function split in two components Minimize f(θ) = l(θ) + g(θ) l convex, differentiable with dom(g) = R n g closed, convex, possibly non differentiable... but prox g is inexpensive! Proximal gradient algorithm θ (k) = prox γk g(θ (k 1) + γ k l(θ (k 1) )) where {γ k, k N} is a sequence stepsizes, which either be constant, decreasing or determined by line search
12 Interpretation Denote θ + = prox γ (θ + γ l(θ)) from definition of proximal operator: θ + = Argmin ϑ (g(ϑ) + (2γ) 1 ϑ θ γ l(θ) 2 2) = Argmin ϑ (g(ϑ) l(θ) l(θ) T (ϑ θ) + (2γ) 1 ϑ θ 2 2). θ + minimizes g(ϑ) plus a simple quadratic local model of l(ϑ) around θ If γ 1/L, the surrogate function on the RHS majorizes the target function, and the algorithm might be seen as a specific instance of the Majorization-Minimization algorithm.
13 Some specific examples if g(θ) = 0 then proximal gradient = gradient method. θ (k) = θ (k 1) + γ k l(θ (k 1) ) if g(θ) = I K (θ), then proximal gradient = projected gradient θ (k) = P K (θ (k 1) + γ k l(θ (k 1) )). if g(θ) = i λ i θ i then proximal gradient = soft-thresholded gradient θ (k) = S λ,γk (θ (k 1) + γ k l(θ (k 1) ))
14 Gradient map The proximal gradient may be equivalently rewritten as θ (k) = θ (k 1) γ k G γk (θ (k 1) ) where the function G γ is given by G γ (θ) = 1 γ (θ prox γ(θ + γ l(θ))) The subgradient characterization of the proximal map implies G γ (θ) l(θ) + g(θ γg γ (θ)) Therefore, G γ (θ) = 0 if and only if θ minimizes f(θ) = l(θ) + g(θ)
15 Convergence of the proximal gradient Assumptions: f(θ) = l(θ) + g(θ) l is Lipschitz continuous with constant L > 0 l(θ) l(ϑ) 2 L θ ϑ 2 θ, ϑ Θ optimal value f is finite and attained at θ (not necessarily unique) Theorem f(θ (k) ) f decreases at least as fast as 1/k if fixed step size γ k 1/L is used if backtracking line search is used
16 Stochastic Approximation vs minibatch proximal Stochastic Approximation 1 Motivation
17 Stochastic Approximation vs minibatch proximal Stochastic Approximation Back to the original problem! The score function l is given by l(θ) = H θ (x)π θ (dx). Therefore, at each iteration, the score function should be approximated. The case where π θ = π and and random variables {X n, n N} each marginally distributed according to π = online learning (Juditsky, Nemirovski, 2010, Duchi et al, 2011).
18 Stochastic Approximation vs minibatch proximal Stochastic Approximation Back to the original problems! l(θ) = H θ (x)π θ (dx). π θ depends on the unknown parameter θ... Sampling directly from π θ is often not directly feasible. But one may construct a Markov chain, with Markov kernel P θ, such that π θ P θ = π θ The Metropolis-Hastings algorithm or Gibbs sampling provides a natural framework to handle such problems.
19 Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic Approximation / Mini-batches g 0 where H n+1 approximates l(θ n ). θ n+1 = θ n + γ n+1 H n+1 Stochastic Approximation: γ n 0 and H n+1 = H θn (X n+1 ) and X n+1 P θn (X n, ). Mini-batches setting: γ n γ and m n+1 1 H n+1 = m 1 n+1 H(θ n, X n+1,j ), j=0 where m n and {X n+1,j } mn+1 j=1 is a run of the length m n+1 of a Markov chain with transition kernel P θn. Beware! For SA, n iterations = n simulations. For minibatches, n iterations = n j=1 m j simulations.
20 Stochastic Approximation vs minibatch proximal Stochastic Approximation Averaging θ n def = n k=1 a kθ k n k=1 a k = ( 1 ) a n n k=1 a θ n 1 + k a n n k=1 a k θ n. Stochastic approximation: take a n 1, γ n = Cn α with α (1/2, 1), then n ( θn θ ) D N(0, σ 2 ) Mini-batch SA: take a n m n, γ n γ 1/(2L) and m n sufficiently fast, then n ( θnn θ ) D N(0, σ 2 ) where N n is the number of iterations for n simulations: Nn k=1 m k n < N n+1 k=1 m k.
21 Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic Approximation θ n+1 = θ n + γ n+1 l(θ n ) + γ n+1 η n+1 η n+1 = H θn (X n+1 ) l(θ n ) = H θn (X n+1 ) π θn (H θn ). Idea Split the error into a martingale increment + remainder term Key tool Poisson equation Ĥ θ P θ Ĥ θ = H θ π θ (H θ ).
22 Stochastic Approximation vs minibatch proximal Stochastic Approximation Decomposition of the error η n+1 = H θn (X n+1 ) π θn (H θn ) = Ĥθ n (X n+1 ) P θn Ĥ θn (X n+1 ) = Ĥθ n (X n+1 ) P θn Ĥ θn (X n ) + P θn Ĥ θn (X n+1 ) P θn Ĥ θn (X n ) We further split the error P θn Ĥ θn (X n+1 ) P θn Ĥ θn (X n ) = P θn+1 Ĥ θn+1 (X n+1 ) P θn Ĥ θn (X n )+P θn Ĥ θn (X n+1 ) P θn+1 Ĥ θn+1 (X n+1 ). To prove that the remainder term goes to zero, it is required to prove the regularity of the Poisson solution with respect to θ, to prove that θ Ĥθ and θ P θ Ĥ θ is smooth in some sense... this is not always a trivial issue!
23 Stochastic Approximation vs minibatch proximal Stochastic Approximation Minibatch case Assume that the Markov kernel is nice... Bias E [η n+1 F n ] = m 1 n+1 Fluctuation m n+1 1 m 1 n+1 j=0 m n+1 1 j=0 H θn (X j ) π θn (H θn ) (ν θn P jθn H θn π θn H θn ) = O(m 1 n+1 ) m n+1 1 = m 1 n+1 Ĥ θn (X j ) P θn Ĥ θn (X j 1 ) + remainders j=1
24 Stochastic Approximation vs minibatch proximal Stochastic Approximation Minibatches case Contrary to SA, the noise η n+1 in the recursion θ n+1 = θ n + γ l(θ n ) + γη n+1 converges to zero a.s. and the stepsize is kept constant γ n = γ. Idea: perturbation of a discrete time dynamic system θ k+1 = θ k + γ l( θ k ) having a unique fixed point and a Lyapunov function l: l(θ k+1 ) l(θ k ) in presence of vanishingly small perturbation η n+1. a.s convergence of perturbed dynamical system with a Lyapunov function can be established under very weak assumptions...
25 Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic proximal gradient The stochastic proximal gradient sequence {θ n, n N} can be rewritten as θ n+1 = prox γn+1 (θ n + γ n+1 l(θ n ) + γ n+1 η n+1 ), where η n+1 def = H n+1 l(θ n ) is the approximation error. Questions: Convergence and rate of convergence in the SA and mini-batch settings? Stochastic Approximation / Minibatch: which one should I prefer? Tuning of the parameters (stepsize for SA, size of minibatches, averaging weights, etc...) Acceleration (à la Nesterov)?
26 Stochastic Approximation vs minibatch proximal Stochastic Approximation Main result Lemma Suppose that {γ n, n N} is decreasing and 0 < Lγ n 1 for all n 1. For any θ Θ, and any nonnegative sequence {a n, n N}, n {f( θ n ) f(θ )} j=1 a j n j=1 ( aj a ) j 1 θ j 1 θ 2 + a 0 θ 0 θ 2 γ j γ j 1 2γ 0 n n a j Tγj (θ j 1 ) θ, η j + a j γ j η j 2, j=1 where T γ (θ) = def = prox γ (θ + γ l(θ)) is the gradient proximal map, j=1
27 Stochastic Approximation vs minibatch proximal Stochastic Approximation Stochastic Approximation setting Take a j = γ j and decompose, using the Poisson equation, η j = ξ j + r j, where ξ j is a martingale term and r j is a remainder term. n γ j {f( θ n ) f(θ )} a 0 θ 0 θ 2 2γ 0 j=1 + n n γ j Tγj (θ j 1 ) θ, ξ j + γj 2 η j 2 + remainders, j=1 j=1 The red term is a martingale with a bracket bounded by n γj 2 θ j 1 θ 2 E [ ξ j 2 ] F j 1 j=1 If j=1 γ j = and j=1 γ2 j <, { θ n, n N} converges. rate of convergence ln(n)/ n by taking γ j = j 1/2.
28 Stochastic Approximation vs minibatch proximal Stochastic Approximation Minibatch setting Theorem Let { θ n, n 0} be the average estimator. Then for all n 1, n E [ f( θ n ) f(θ ) ] j=1 a j j=1 n j=1 ( aj a ) j 1 E [ θ j 1 θ 2] + a 0 E[ θ 0 θ 2 ] γ j γ j 1 2γ 0 n [ ] a j E θ j 1 θ ɛ (1) j 1 + n j=1 [ ] a j γ j E ɛ (2) j 1. where ɛ (1) n def = E [η n+1 F n ], ɛ (2) def n = E [ η ] n+1 2 Fn.
29 Stochastic Approximation vs minibatch proximal Stochastic Approximation Convergence analysis Corollary Suppose that γ n (0, 1/L], and there exist constants C 1, C 2, B < such that, for n 1, E[ɛ (1) n ] C 1 m 1 n+1, E[ɛ(2) n ] C 2 m 1 n+1 Then, setting γ n = γ, m n = n and a n 1,, and sup θ n θ B, P a.s. n N E [f(θ n ) f(θ )] C/n and E [f(θ Nn ) f(θ )] C/ n where N n is the number of iterations for n simulations. This is the same rate than for the SA (without the logarithmic term).
30 1 Motivation
31 Potts model We focus on the particular case where X = {1,..., M}, and B(x, y) = 1 {x=y}, which corresponds to the well known Potts model f θ (x 1,..., x p ) = 1 p exp θ ii B 0 (x i ) + Z θ i=1 1 j<i p θ ij 1 {xi=x j}. The term p i=1 θ iib 0 (x i ) is sometimes referred to as the external field and defines the distribution in the absence of interaction. We focus on the case where the interactions terms θ ij for i j are nonnegative. This corresponds to networks with there is either no interaction, or collaborative interactions between the nodes.
32 Algorithm At the k-th iteration, and given F k = σ(θ 1,..., θ k ): 1 generate the X p -valued Markov sequence {X k+1,j } m k+1 j=0 with transition P θk and initial distribution ν θk, and compute the approximate gradient H k+1 = 1 n n i=1 B(x (i) ) 1 m k+1 m k+1 j=1 B(X k+1,j ), 2 Compute θ k+1 = Π Ka ( sγk+1,λ (θ k + γ k+1 H k+1 ) ), the operation s γ,λ (M) soft-thresholds each entry of the matrix M, and the operation Π Ka (M) projects each entry of M on [0, a].
33 MCMC scheme For j i, we set b ij = e θij. Notice that b ij 1. For x = (x 1,..., x p ), ( p ) f θ (x) = 1 ( ) exp θ ii B 0 (x i ) bij 1 {xi=x Z j} + 1 {xi x j}. θ i=1 1 j<i p Augment the likelihood with auxiliary variables {δ ij, 1 j < i p}, δ ij {0, 1} such that the joint distribution of (x, δ) is given by ( p ) f θ (x, δ) exp θ ii B 0 (x i ) j<i ( i=1 1 {xi=x j}b ij ( 1 b 1 ij ) δij ) δ b ij 1 ij + 1 {xi x j}0 δij 1 1 δij. The marginal distribution of x in this joint distribution is the same f θ given above.
34 MCMC scheme The auxiliary variables {δ ij, 1 j < i p} are conditionally independent given x = (x 1,..., x p ); if x i x j, then δ ij = 0 with probability 1. If x i = x j, then δ ij = 1 with probability 1 b 1 ij, and δ ij = 0 with probability b 1 ij. The auxiliary variables {δ ij, 1 j < i p} defines an undirected graph with nodes {1,..., p} where there is an edge between i j if δ ij = 1, and there is no edge otherwise. This graph partitions the nodes {1,..., p} into maximal clusters C 1,..., C K (a set of nodes where there is a path joining any two of them). Notice that δ ij = 1 implies x i = x j. Hence all the nodes in a given cluster holds the same value of x. ( ) K f θ (x δ) exp θ ii B 0 (x i ). i C k k=1 j<i,(i,j) C k 1 {xi=x j}
35 Wolff algorithm Given X = (X 1,..., X p ) 1 Randomly select a node i {1,..., p}, and set C 0 = {i}. 2 Do until C 0 can no longer grow. For each new addition j to C 0, and for each j / C 0 such that θ jj > 0, starting with δ jj = 0, do the following. If X j = X j, set δ jj = 1 with probability 1 e θ jj. If δ jj = 1, add j to C 0. 3 If X i = v, randomly select v {1,..., M} \ {v}, and propose a new vector X X p, where X j = v for j C 0 and X j = X j for j / C 0. Accept X with probability 1 exp (B 0 (v ) B 0 (v)) θ jj. j C0
36 Relative error rate Stoch. Grad. Stoch. Grad. Averaged Acc. scheme True positive rate Stoch. Grad. Acc. scheme False discovery rate Stoch. Grad. Acc. scheme Iterations Iterations Iterations Figure : Simulation results for p = 50, n = 500 observations, 1% of off-diagonal terms, minibatch, m n = n
37 Relative error rate Stoch. Grad. Stoch. Grad. Averaged True positive rate False discovery rate Iterations Iterations Iterations Figure : Simulation results for p = 100, n = 500 observations, 1% of off-diagonal terms, n = 500 observations, 1% of off-diagonal terms, minibatch, m n = n
38 Relative error rate Stoch. Grad. Stoch. Grad. Averaged True positive rate False discovery rate Iterations Iterations Iterations Figure : Simulation results for p = 200, n = 500 observations, 1% of off-diagonal terms, 1% of off-diagonal terms, minibatch, m n = n
39 1 Motivation
40 Take-home message Efficient and globally converging procedure for penalized likelihood inference in incomplete data models are available if the complete data likelihood is globally concave with convex sparsity-inducing penalty (provided that computing the proximal operator is easy) Stochastic Approximation and Minibatch algorithms achieve the same rate, which is 1/ n where n is the number of simulations. Minibatch algorithms are in general preferable if the computation of the proximal operator is complex. Thanks for your attention... and patience!
Perturbed Proximal Gradient Algorithm
Perturbed Proximal Gradient Algorithm Gersende FORT LTCI, CNRS, Telecom ParisTech Université Paris-Saclay, 75013, Paris, France Large-scale inverse problems and optimization Applications to image processing
More informationOn Perturbed Proximal Gradient Algorithms
On Perturbed Proximal Gradient Algorithms Yves F. Atchadé University of Michigan, 1085 South University, Ann Arbor, 48109, MI, United States, yvesa@umich.edu Gersende Fort LTCI, CNRS, Telecom ParisTech,
More informationdans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés
Inférence pénalisée dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Gersende Fort Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier Toulouse,
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2013-14) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationProbabilistic Graphical Models
2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationGradient Sliding for Composite Optimization
Noname manuscript No. (will be inserted by the editor) Gradient Sliding for Composite Optimization Guanghui Lan the date of receipt and acceptance should be inserted later Abstract We consider in this
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationLearning Gaussian Graphical Models with Unknown Group Sparsity
Learning Gaussian Graphical Models with Unknown Group Sparsity Kevin Murphy Ben Marlin Depts. of Statistics & Computer Science Univ. British Columbia Canada Connections Graphical models Density estimation
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationConsistency of the maximum likelihood estimator for general hidden Markov models
Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models
More informationA graph contains a set of nodes (vertices) connected by links (edges or arcs)
BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,
More informationMarkov Networks.
Markov Networks www.biostat.wisc.edu/~dpage/cs760/ Goals for the lecture you should understand the following concepts Markov network syntax Markov network semantics Potential functions Partition function
More informationMonte Carlo in Bayesian Statistics
Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview
More informationFast proximal gradient methods
L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationApproximation. Inderjit S. Dhillon Dept of Computer Science UT Austin. SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina.
Using Quadratic Approximation Inderjit S. Dhillon Dept of Computer Science UT Austin SAMSI Massive Datasets Opening Workshop Raleigh, North Carolina Sept 12, 2012 Joint work with C. Hsieh, M. Sustik and
More informationSome Results on the Ergodicity of Adaptive MCMC Algorithms
Some Results on the Ergodicity of Adaptive MCMC Algorithms Omar Khalil Supervisor: Jeffrey Rosenthal September 2, 2011 1 Contents 1 Andrieu-Moulines 4 2 Roberts-Rosenthal 7 3 Atchadé and Fort 8 4 Relationship
More informationDual Proximal Gradient Method
Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method
More informationInformation theoretic perspectives on learning algorithms
Information theoretic perspectives on learning algorithms Varun Jog University of Wisconsin - Madison Departments of ECE and Mathematics Shannon Channel Hangout! May 8, 2018 Jointly with Adrian Tovar-Lopez
More informationInderjit Dhillon The University of Texas at Austin
Inderjit Dhillon The University of Texas at Austin ( Universidad Carlos III de Madrid; 15 th June, 2012) (Based on joint work with J. Brickell, S. Sra, J. Tropp) Introduction 2 / 29 Notion of distance
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationLecture 16 Deep Neural Generative Models
Lecture 16 Deep Neural Generative Models CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago May 22, 2017 Approach so far: We have considered simple models and then constructed
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationStochastic Composition Optimization
Stochastic Composition Optimization Algorithms and Sample Complexities Mengdi Wang Joint works with Ethan X. Fang, Han Liu, and Ji Liu ORFE@Princeton ICCOPT, Tokyo, August 8-11, 2016 1 / 24 Collaborators
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationUndirected Graphical Models
Undirected Graphical Models 1 Conditional Independence Graphs Let G = (V, E) be an undirected graph with vertex set V and edge set E, and let A, B, and C be subsets of vertices. We say that C separates
More informationCSC 412 (Lecture 4): Undirected Graphical Models
CSC 412 (Lecture 4): Undirected Graphical Models Raquel Urtasun University of Toronto Feb 2, 2016 R Urtasun (UofT) CSC 412 Feb 2, 2016 1 / 37 Today Undirected Graphical Models: Semantics of the graph:
More informationA Sparsity Preserving Stochastic Gradient Method for Composite Optimization
A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite
More information3 : Representation of Undirected GM
10-708: Probabilistic Graphical Models 10-708, Spring 2016 3 : Representation of Undirected GM Lecturer: Eric P. Xing Scribes: Longqi Cai, Man-Chia Chang 1 MRF vs BN There are two types of graphical models:
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationNonconvex Sparse Logistic Regression with Weakly Convex Regularization
Nonconvex Sparse Logistic Regression with Weakly Convex Regularization Xinyue Shen, Student Member, IEEE, and Yuantao Gu, Senior Member, IEEE Abstract In this work we propose to fit a sparse logistic regression
More informationOn Markov chain Monte Carlo methods for tall data
On Markov chain Monte Carlo methods for tall data Remi Bardenet, Arnaud Doucet, Chris Holmes Paper review by: David Carlson October 29, 2016 Introduction Many data sets in machine learning and computational
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is
More informationScaling Neighbourhood Methods
Quick Recap Scaling Neighbourhood Methods Collaborative Filtering m = #items n = #users Complexity : m * m * n Comparative Scale of Signals ~50 M users ~25 M items Explicit Ratings ~ O(1M) (1 per billion)
More informationProbabilistic Graphical Models
Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft
More informationNetwork Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)
Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu
More informationThe deterministic Lasso
The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationUndirected Graphical Models
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Properties Properties 3 Generative vs. Conditional
More informationSTA414/2104 Statistical Methods for Machine Learning II
STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationNotes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed
18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationLecture 1: Supervised Learning
Lecture 1: Supervised Learning Tuo Zhao Schools of ISYE and CSE, Georgia Tech ISYE6740/CSE6740/CS7641: Computational Data Analysis/Machine from Portland, Learning Oregon: pervised learning (Supervised)
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationBasic math for biology
Basic math for biology Lei Li Florida State University, Feb 6, 2002 The EM algorithm: setup Parametric models: {P θ }. Data: full data (Y, X); partial data Y. Missing data: X. Likelihood and maximum likelihood
More informationBayesian model selection in graphs by using BDgraph package
Bayesian model selection in graphs by using BDgraph package A. Mohammadi and E. Wit March 26, 2013 MOTIVATION Flow cytometry data with 11 proteins from Sachs et al. (2005) RESULT FOR CELL SIGNALING DATA
More informationGraphical Model Selection
May 6, 2013 Trevor Hastie, Stanford Statistics 1 Graphical Model Selection Trevor Hastie Stanford University joint work with Jerome Friedman, Rob Tibshirani, Rahul Mazumder and Jason Lee May 6, 2013 Trevor
More informationMarkov Networks. l Like Bayes Nets. l Graphical model that describes joint probability distribution using tables (AKA potentials)
Markov Networks l Like Bayes Nets l Graphical model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov
More informationLecture 15 Newton Method and Self-Concordance. October 23, 2008
Newton Method and Self-Concordance October 23, 2008 Outline Lecture 15 Self-concordance Notion Self-concordant Functions Operations Preserving Self-concordance Properties of Self-concordant Functions Implications
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationOnline Forest Density Estimation
Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density
More informationDeep unsupervised learning
Deep unsupervised learning Advanced data-mining Yongdai Kim Department of Statistics, Seoul National University, South Korea Unsupervised learning In machine learning, there are 3 kinds of learning paradigm.
More informationSparse regression. Optimization-Based Data Analysis. Carlos Fernandez-Granda
Sparse regression Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda 3/28/2016 Regression Least-squares regression Example: Global warming Logistic
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:
More informationLearning MN Parameters with Alternative Objective Functions. Sargur Srihari
Learning MN Parameters with Alternative Objective Functions Sargur srihari@cedar.buffalo.edu 1 Topics Max Likelihood & Contrastive Objectives Contrastive Objective Learning Methods Pseudo-likelihood Gradient
More information14 : Theory of Variational Inference: Inner and Outer Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture
More informationAn Introduction to Reversible Jump MCMC for Bayesian Networks, with Application
An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application, CleverSet, Inc. STARMAP/DAMARS Conference Page 1 The research described in this presentation has been funded by the U.S.
More informationProbabilistic Graphical Networks: Definitions and Basic Results
This document gives a cursory overview of Probabilistic Graphical Networks. The material has been gleaned from different sources. I make no claim to original authorship of this material. Bayesian Graphical
More informationSparse Stochastic Inference for Latent Dirichlet Allocation
Sparse Stochastic Inference for Latent Dirichlet Allocation David Mimno 1, Matthew D. Hoffman 2, David M. Blei 1 1 Dept. of Computer Science, Princeton U. 2 Dept. of Statistics, Columbia U. Presentation
More informationLearning Parameters of Undirected Models. Sargur Srihari
Learning Parameters of Undirected Models Sargur srihari@cedar.buffalo.edu 1 Topics Difficulties due to Global Normalization Likelihood Function Maximum Likelihood Parameter Estimation Simple and Conjugate
More informationGreedy Layer-Wise Training of Deep Networks
Greedy Layer-Wise Training of Deep Networks Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle NIPS 2007 Presented by Ahmed Hefny Story so far Deep neural nets are more expressive: Can learn
More informationStochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions
International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.
More informationAfternoon Meeting on Bayesian Computation 2018 University of Reading
Gabriele Abbati 1, Alessra Tosi 2, Seth Flaxman 3, Michael A Osborne 1 1 University of Oxford, 2 Mind Foundry Ltd, 3 Imperial College London Afternoon Meeting on Bayesian Computation 2018 University of
More informationCoordinate descent methods
Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate
More informationRiemann Manifold Methods in Bayesian Statistics
Ricardo Ehlers ehlers@icmc.usp.br Applied Maths and Stats University of São Paulo, Brazil Working Group in Statistical Learning University College Dublin September 2015 Bayesian inference is based on Bayes
More informationApril 20th, Advanced Topics in Machine Learning California Institute of Technology. Markov Chain Monte Carlo for Machine Learning
for for Advanced Topics in California Institute of Technology April 20th, 2017 1 / 50 Table of Contents for 1 2 3 4 2 / 50 History of methods for Enrico Fermi used to calculate incredibly accurate predictions
More informationChapter 17: Undirected Graphical Models
Chapter 17: Undirected Graphical Models The Elements of Statistical Learning Biaobin Jiang Department of Biological Sciences Purdue University bjiang@purdue.edu October 30, 2014 Biaobin Jiang (Purdue)
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationThe Frank-Wolfe Algorithm:
The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology
More informationMarkov Networks. l Like Bayes Nets. l Graph model that describes joint probability distribution using tables (AKA potentials)
Markov Networks l Like Bayes Nets l Graph model that describes joint probability distribution using tables (AKA potentials) l Nodes are random variables l Labels are outcomes over the variables Markov
More informationNeural Networks and Deep Learning
Neural Networks and Deep Learning Professor Ameet Talwalkar November 12, 2015 Professor Ameet Talwalkar Neural Networks and Deep Learning November 12, 2015 1 / 16 Outline 1 Review of last lecture AdaBoost
More informationComputer Intensive Methods in Mathematical Statistics
Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 5 Sequential Monte Carlo methods I 31 March 2017 Computer Intensive Methods (1) Plan of today s lecture
More informationSequence Modelling with Features: Linear-Chain Conditional Random Fields. COMP-599 Oct 6, 2015
Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 6, 2015 Announcement A2 is out. Due Oct 20 at 1pm. 2 Outline Hidden Markov models: shortcomings Generative vs. discriminative
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationRandomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints
Randomized Coordinate Descent Methods on Optimization Problems with Linearly Coupled Constraints By I. Necoara, Y. Nesterov, and F. Glineur Lijun Xu Optimization Group Meeting November 27, 2012 Outline
More informationr=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J
7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured
More informationLecture 25: November 27
10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationIntroduction to Restricted Boltzmann Machines
Introduction to Restricted Boltzmann Machines Ilija Bogunovic and Edo Collins EPFL {ilija.bogunovic,edo.collins}@epfl.ch October 13, 2014 Introduction Ingredients: 1. Probabilistic graphical models (undirected,
More informationComputer Intensive Methods in Mathematical Statistics
Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of
More informationStochastic optimization Markov Chain Monte Carlo
Stochastic optimization Markov Chain Monte Carlo Ethan Fetaya Weizmann Institute of Science 1 Motivation Markov chains Stationary distribution Mixing time 2 Algorithms Metropolis-Hastings Simulated Annealing
More informationSparse Gaussian conditional random fields
Sparse Gaussian conditional random fields Matt Wytock, J. ico Kolter School of Computer Science Carnegie Mellon University Pittsburgh, PA 53 {mwytock, zkolter}@cs.cmu.edu Abstract We propose sparse Gaussian
More informationHidden Markov Models
CS769 Spring 2010 Advanced Natural Language Processing Hidden Markov Models Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Part-of-Speech Tagging The goal of Part-of-Speech (POS) tagging is to label each
More informationSequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them
HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated
More informationDeep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, Spis treści
Deep learning / Ian Goodfellow, Yoshua Bengio and Aaron Courville. - Cambridge, MA ; London, 2017 Spis treści Website Acknowledgments Notation xiii xv xix 1 Introduction 1 1.1 Who Should Read This Book?
More informationEstimators based on non-convex programs: Statistical and computational guarantees
Estimators based on non-convex programs: Statistical and computational guarantees Martin Wainwright UC Berkeley Statistics and EECS Based on joint work with: Po-Ling Loh (UC Berkeley) Martin Wainwright
More information