Stochastic Proximal Gradient Algorithm


1 Stochastic Proximal Gradient Algorithm. Institut Mines-Télécom / Télécom ParisTech / Laboratoire Traitement et Communication de l'Information. Joint work with: Y. Atchadé (Ann Arbor, USA), G. Fort (LTCI/Télécom ParisTech), and the kind help of A. Juditsky (LJK, Grenoble).

2 1 Motivation

3 Problem Statement
(P) Argmin_{θ∈Θ} { −ℓ(θ) + g(θ) },
where ℓ is a smooth log-likelihood function or some other smooth statistical learning function, and g is a possibly non-smooth convex penalty term. This problem has attracted a lot of attention with the growing need to address high-dimensional statistical problems. This work focuses on the case where the function ℓ and its gradient ∇ℓ are both intractable, and where ∇ℓ is given by
∇ℓ(θ) = ∫ H_θ(x) π_θ(dx),
for some probability measure π_θ.

4 Network structure
Problem: estimate sparse network structures from measurements on the nodes. For discrete measurements, this amounts to estimating a Gibbs measure with pairwise interactions
f_θ(x_1, ..., x_p) = (1/Z_θ) exp( Σ_{i=1}^p θ_ii B_0(x_i) + Σ_{1≤j<i≤p} θ_ij B(x_i, x_j) ),
for a function B_0 : X → R and a symmetric function B : X × X → R, where X is a finite set. The absence of an edge encodes conditional independence.

5 Network structure
Each graph represents a model class of graphical models; learning a graph is then a model-class selection problem. Constraint-based approaches: test conditional independence from the data and then determine a graph that most closely represents those independencies. Score-based approaches combine a metric for the complexity of the graph with a measure of the goodness of fit of the graph to the data... but the number of graph structures grows super-exponentially, and the problem is in general NP-hard.

6 Network structure
For x ∈ X^p, define B(x) := (B_jk(x_j, x_k))_{1≤j,k≤p} ∈ R^{p×p}. The ℓ_1-penalized maximum likelihood estimate of θ is obtained by solving an optimization problem of the form (P), where ℓ and g are given by
ℓ(θ) = (1/n) Σ_{i=1}^n ⟨θ, B(x^(i))⟩ − log Z_θ,   g(θ) = λ Σ_{1≤k<j≤p} |θ_jk|.

7 Fisher identity
Fact 1: the normalization constant Z_θ is given by Z_θ = Σ_x exp(⟨θ, B(x)⟩), where the sum is over all possible configurations.
Fact 2: the gradient ∇ log Z_θ is the expectation of the sufficient statistics: ∇ log Z_θ = Σ_x B(x) f_θ(x).
Problem: none of these quantities can be computed explicitly... Nevertheless, they can be estimated using Monte Carlo integration.
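
To make the Fisher identity concrete, here is a small brute-force check (an assumed toy example, not from the slides): for p = 3 binary nodes with B_0(x) = x and B(x, y) = 1_{x=y}, the exact gradient ∇ log Z_θ = E_θ[B(X)] computed by enumeration is compared with a plain Monte Carlo estimate.

```python
# Toy check of the Fisher identity: grad log Z_theta = E_theta[B(X)], by enumeration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
p, states = 3, [0, 1]

def suff_stat(x):
    """Matrix B(x): B_0(x_i) = x_i on the diagonal, 1{x_i == x_j} below it."""
    B = np.zeros((p, p))
    for i in range(p):
        B[i, i] = x[i]
        for j in range(i):
            B[i, j] = float(x[i] == x[j])
    return B

theta = np.tril(rng.uniform(0.0, 0.5, (p, p)))       # lower-triangular parameters

configs = list(itertools.product(states, repeat=p))
weights = np.array([np.exp(np.sum(theta * suff_stat(x))) for x in configs])
Z = weights.sum()
probs = weights / Z

grad_exact = sum(pr * suff_stat(x) for pr, x in zip(probs, configs))   # E_theta[B(X)]

idx = rng.choice(len(configs), size=20000, p=probs)                    # exact samples (tiny p)
grad_mc = np.mean([suff_stat(configs[k]) for k in idx], axis=0)

print(np.max(np.abs(grad_exact - grad_mc)))          # small Monte Carlo error
```

For realistic p the enumeration is impossible, which is exactly why the talk turns to MCMC estimates of this expectation.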

8 General framework
(P) Argmin_{θ∈Θ} { −ℓ(θ) + g(θ) },
where ℓ is a smooth log-likelihood function or some other smooth statistical learning function, g is a non-smooth convex sparsity-inducing penalty, and the function ℓ and its gradient ∇ℓ are intractable. The score function ∇ℓ is given by
∇ℓ(θ) = ∫ H_θ(x) π_θ(dx),
for some probability measure π_θ on some measurable space (X, B) and some function H_θ : X → Θ.

9 1 Motivation

10 Definition
Definition: proximal mapping associated with a closed convex function g and stepsize γ:
prox_γ(θ) = Argmin_{ϑ∈Θ} ( g(ϑ) + (2γ)^{-1} ‖ϑ − θ‖²₂ ).
If g = I_K, where K is a closed convex set (I_K(x) = 0 if x ∈ K, I_K(x) = +∞ otherwise), then prox_γ is the Euclidean projection onto K:
prox_γ(θ) = Argmin_{ϑ∈K} ‖ϑ − θ‖²₂ = P_K(θ).
If g(θ) = Σ_{i=1}^p λ_i |θ_i|, then prox_γ is the shrinkage (soft-thresholding) operation
[S_{λ,γ}(θ)]_i = θ_i − γλ_i if θ_i ≥ γλ_i,  0 if |θ_i| ≤ γλ_i,  θ_i + γλ_i if θ_i ≤ −γλ_i.
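
A minimal NumPy sketch of the two proximal operators named on this slide (illustrative helper names, not code from the talk):

```python
import numpy as np

def prox_soft_threshold(theta, gamma, lam):
    """prox of g(theta) = sum_i lam_i * |theta_i| with stepsize gamma: soft thresholding."""
    return np.sign(theta) * np.maximum(np.abs(theta) - gamma * lam, 0.0)

def prox_indicator_box(theta, lo=0.0, hi=np.inf):
    """prox of the indicator I_K of the box K = [lo, hi]^p: Euclidean projection onto K."""
    return np.clip(theta, lo, hi)

theta = np.array([-1.5, 0.2, 0.8])
print(prox_soft_threshold(theta, gamma=1.0, lam=0.5))   # [-1.   0.   0.3]
print(prox_indicator_box(theta))                         # [0.   0.2  0.8]
```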

11 Proximal gradient method
Unconstrained problem with the cost function split into two components:
Minimize f(θ) = −ℓ(θ) + g(θ),
with −ℓ convex and differentiable, dom(ℓ) = R^n, and g closed, convex, possibly non-differentiable... but prox_g is inexpensive!
Proximal gradient algorithm:
θ^(k) = prox_{γ_k g}(θ^(k-1) + γ_k ∇ℓ(θ^(k-1))),
where {γ_k, k ∈ N} is a sequence of stepsizes, which can either be constant, decreasing, or determined by line search.
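
A hedged sketch of this iteration on an assumed ℓ_1-penalized least-squares toy (lasso): the quadratic loss plays the role of the smooth term −ℓ, so the step θ − γ∇(−ℓ)(θ) below is the same as θ + γ∇ℓ(θ) on the slide. The data and helper names are illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
theta_true = np.zeros(10); theta_true[:3] = [2.0, -1.0, 0.5]
y = A @ theta_true + 0.1 * rng.normal(size=50)
lam = 0.1

def grad_smooth(theta):
    """Gradient of the smooth part ||A theta - y||^2 / (2n), playing the role of -l."""
    return A.T @ (A @ theta - y) / len(y)

def prox_l1(theta, t):
    """prox of t * lam * ||.||_1: entrywise soft thresholding."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t * lam, 0.0)

L = np.linalg.norm(A, 2) ** 2 / len(y)     # Lipschitz constant of grad_smooth
gamma = 1.0 / L                            # fixed stepsize gamma_k = 1/L
theta = np.zeros(10)
for _ in range(500):
    theta = prox_l1(theta - gamma * grad_smooth(theta), gamma)

print(np.round(theta, 2))                  # sparse estimate close to theta_true
```

With γ = 1/L this is the fixed-stepsize variant whose 1/k rate is recalled on slide 15.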

12 Interpretation
Denote θ⁺ = prox_γ(θ + γ ∇ℓ(θ)). From the definition of the proximal operator:
θ⁺ = Argmin_ϑ ( g(ϑ) + (2γ)^{-1} ‖ϑ − θ − γ ∇ℓ(θ)‖²₂ )
   = Argmin_ϑ ( g(ϑ) − ℓ(θ) − ∇ℓ(θ)ᵀ(ϑ − θ) + (2γ)^{-1} ‖ϑ − θ‖²₂ ).
θ⁺ minimizes g(ϑ) plus a simple quadratic local model of −ℓ(ϑ) around θ. If γ ≤ 1/L, the surrogate function on the RHS majorizes the target function, and the algorithm might be seen as a specific instance of the Majorization-Minimization algorithm.

13 Some specific examples
If g(θ) = 0, then proximal gradient = gradient method: θ^(k) = θ^(k-1) + γ_k ∇ℓ(θ^(k-1)).
If g(θ) = I_K(θ), then proximal gradient = projected gradient: θ^(k) = P_K(θ^(k-1) + γ_k ∇ℓ(θ^(k-1))).
If g(θ) = Σ_i λ_i |θ_i|, then proximal gradient = soft-thresholded gradient: θ^(k) = S_{λ,γ_k}(θ^(k-1) + γ_k ∇ℓ(θ^(k-1))).

14 Gradient map
The proximal gradient iteration may equivalently be rewritten as θ^(k) = θ^(k-1) − γ_k G_{γ_k}(θ^(k-1)), where the function G_γ is given by
G_γ(θ) = (1/γ) ( θ − prox_γ(θ + γ ∇ℓ(θ)) ).
The subgradient characterization of the proximal map implies
G_γ(θ) ∈ −∇ℓ(θ) + ∂g(θ − γ G_γ(θ)).
Therefore, G_γ(θ) = 0 if and only if θ minimizes f(θ) = −ℓ(θ) + g(θ).
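
A tiny one-dimensional check of this optimality characterization, on an assumed example (smooth part ½(θ − a)² plus λ|θ|, written in the usual minimization convention, so the inner step uses −γ times the gradient of the smooth part):

```python
import numpy as np

a, lam, gamma = 2.0, 0.7, 0.5

def grad_smooth(theta):                 # gradient of the smooth part 0.5 * (theta - a)^2
    return theta - a

def prox(theta, t):                     # prox of t * lam * |.|
    return np.sign(theta) * max(abs(theta) - t * lam, 0.0)

def gradient_map(theta):                # G_gamma(theta) = (theta - prox(theta - gamma * grad)) / gamma
    return (theta - prox(theta - gamma * grad_smooth(theta), gamma)) / gamma

theta_star = np.sign(a) * max(abs(a) - lam, 0.0)   # closed-form minimizer (soft threshold of a)
print(gradient_map(theta_star))                    # 0.0 at the minimizer
print(gradient_map(0.0))                           # nonzero away from it
```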

15 Convergence of the proximal gradient
Assumptions: f(θ) = −ℓ(θ) + g(θ); ∇ℓ is Lipschitz continuous with constant L > 0, i.e.
‖∇ℓ(θ) − ∇ℓ(ϑ)‖₂ ≤ L ‖θ − ϑ‖₂ for all θ, ϑ ∈ Θ;
the optimal value f* is finite and attained at θ* (not necessarily unique).
Theorem: f(θ^(k)) − f* decreases at least as fast as 1/k, if a fixed step size γ_k ≡ 1/L is used, or if backtracking line search is used.

16 Stochastic Approximation vs minibatch proximal Stochastic Approximation 1 Motivation

17 Back to the original problem! The score function ∇ℓ is given by
∇ℓ(θ) = ∫ H_θ(x) π_θ(dx).
Therefore, at each iteration, the score function must be approximated. The case where π_θ = π does not depend on θ and random variables {X_n, n ∈ N}, each marginally distributed according to π, are available corresponds to online learning (Juditsky and Nemirovski, 2010; Duchi et al., 2011).

18 Back to the original problem!
∇ℓ(θ) = ∫ H_θ(x) π_θ(dx),
where π_θ depends on the unknown parameter θ... Sampling directly from π_θ is often not feasible, but one may construct a Markov chain, with Markov kernel P_θ, such that π_θ P_θ = π_θ. The Metropolis-Hastings algorithm or Gibbs sampling provides a natural framework to handle such problems.

19 Stochastic Approximation / Mini-batches (case g ≡ 0):
θ_{n+1} = θ_n + γ_{n+1} H_{n+1},  where H_{n+1} approximates ∇ℓ(θ_n).
Stochastic Approximation: γ_n → 0 and H_{n+1} = H_{θ_n}(X_{n+1}), with X_{n+1} ∼ P_{θ_n}(X_n, ·).
Mini-batch setting: γ_n ≡ γ and
H_{n+1} = m_{n+1}^{-1} Σ_{j=1}^{m_{n+1}} H_{θ_n}(X_{n+1,j}),
where m_n → ∞ and {X_{n+1,j}}_{j=1}^{m_{n+1}} is a run of length m_{n+1} of a Markov chain with transition kernel P_{θ_n}.
Beware! For SA, n iterations = n simulations. For mini-batches, n iterations = Σ_{j=1}^n m_j simulations.
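
A hedged toy contrast of the two schemes on a one-dimensional exponential-family MLE where everything is explicit (assumed example: ℓ(θ) = θ s̄ − log(1 + e^θ), so ∇ℓ(θ) = s̄ − E_θ[X] with X ∼ Bernoulli(sigmoid(θ)); exact sampling stands in for the Markov kernel P_θ):

```python
import numpy as np

rng = np.random.default_rng(2)
s_bar = 0.7                                   # observed sufficient statistic
theta_star = np.log(s_bar / (1 - s_bar))      # exact maximizer of l

def sample(theta, size):                      # draw from pi_theta = Bernoulli(sigmoid(theta))
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-theta)), size=size)

# Stochastic approximation: one simulation per iteration, decreasing step gamma_n
theta_sa, n_sim_sa = 0.0, 0
for n in range(1, 2001):
    H = s_bar - sample(theta_sa, 1)[0]        # H_{theta_n}(X_{n+1})
    theta_sa += (0.5 / np.sqrt(n)) * H
    n_sim_sa += 1

# Mini-batch: constant step, batch size m_n = n, so n iterations cost sum_j m_j simulations
theta_mb, n_sim_mb, gamma = 0.0, 0, 0.5
for n in range(1, 101):
    H = s_bar - sample(theta_mb, n).mean()    # averaged over m_n = n draws
    theta_mb += gamma * H
    n_sim_mb += n

print(theta_star, theta_sa, theta_mb)         # both estimates approach the exact maximizer
print(n_sim_sa, n_sim_mb)                     # simulation budgets: 2000 vs 5050
```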

20 Averaging
θ̄_n := (Σ_{k=1}^n a_k θ_k) / (Σ_{k=1}^n a_k) = (1 − a_n / Σ_{k=1}^n a_k) θ̄_{n−1} + (a_n / Σ_{k=1}^n a_k) θ_n.
Stochastic approximation: take a_n ≡ 1 and γ_n = C n^{−α} with α ∈ (1/2, 1); then √n (θ̄_n − θ*) →_D N(0, σ²).
Mini-batch SA: take a_n ∝ m_n, γ_n ≡ γ ≤ 1/(2L) and m_n → ∞ sufficiently fast; then √n (θ̄_{N_n} − θ*) →_D N(0, σ²), where N_n is the number of iterations for n simulations: Σ_{k=1}^{N_n} m_k ≤ n < Σ_{k=1}^{N_n+1} m_k.
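
A minimal sketch of this weighted running average, computed with the recursive form so that past iterates need not be stored (helper name is illustrative):

```python
import numpy as np

def weighted_average(iterates, weights):
    """theta_bar_n = sum_k a_k * theta_k / sum_k a_k, via the convex-combination recursion."""
    theta_bar, weight_sum = np.zeros_like(iterates[0], dtype=float), 0.0
    for theta, a in zip(iterates, weights):
        weight_sum += a
        theta_bar += (a / weight_sum) * (theta - theta_bar)
    return theta_bar

iterates = [np.array([1.0]), np.array([2.0]), np.array([4.0])]
print(weighted_average(iterates, [1, 1, 1]))   # uniform weights (SA case): [2.33...]
print(weighted_average(iterates, [1, 2, 3]))   # weights a_n proportional to m_n: [2.83...]
```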

21 Stochastic Approximation
θ_{n+1} = θ_n + γ_{n+1} ∇ℓ(θ_n) + γ_{n+1} η_{n+1},
η_{n+1} = H_{θ_n}(X_{n+1}) − ∇ℓ(θ_n) = H_{θ_n}(X_{n+1}) − π_{θ_n}(H_{θ_n}).
Idea: split the error into a martingale increment + a remainder term.
Key tool: the Poisson equation Ĥ_θ − P_θ Ĥ_θ = H_θ − π_θ(H_θ).

22 Decomposition of the error
η_{n+1} = H_{θ_n}(X_{n+1}) − π_{θ_n}(H_{θ_n}) = Ĥ_{θ_n}(X_{n+1}) − P_{θ_n}Ĥ_{θ_n}(X_{n+1})
        = Ĥ_{θ_n}(X_{n+1}) − P_{θ_n}Ĥ_{θ_n}(X_n) − { P_{θ_n}Ĥ_{θ_n}(X_{n+1}) − P_{θ_n}Ĥ_{θ_n}(X_n) }.
We further split the second term:
P_{θ_n}Ĥ_{θ_n}(X_{n+1}) − P_{θ_n}Ĥ_{θ_n}(X_n) = { P_{θ_{n+1}}Ĥ_{θ_{n+1}}(X_{n+1}) − P_{θ_n}Ĥ_{θ_n}(X_n) } + { P_{θ_n}Ĥ_{θ_n}(X_{n+1}) − P_{θ_{n+1}}Ĥ_{θ_{n+1}}(X_{n+1}) }.
To prove that the remainder term goes to zero, one must establish the regularity of the Poisson solution with respect to θ, i.e. that θ ↦ Ĥ_θ and θ ↦ P_θĤ_θ are smooth in some sense... this is not always a trivial issue!

23 Minibatch case
Assume that the Markov kernel is nice...
Bias: E[η_{n+1} | F_n] = m_{n+1}^{-1} Σ_{j=0}^{m_{n+1}−1} ( ν_{θ_n} P_{θ_n}^j H_{θ_n} − π_{θ_n} H_{θ_n} ) = O(m_{n+1}^{-1}).
Fluctuation: m_{n+1}^{-1} Σ_{j=0}^{m_{n+1}−1} ( H_{θ_n}(X_j) − π_{θ_n}(H_{θ_n}) ) = m_{n+1}^{-1} Σ_{j=1}^{m_{n+1}−1} ( Ĥ_{θ_n}(X_j) − P_{θ_n}Ĥ_{θ_n}(X_{j−1}) ) + remainders.

24 Minibatch case
Contrary to SA, the noise η_{n+1} in the recursion θ_{n+1} = θ_n + γ ∇ℓ(θ_n) + γ η_{n+1} converges to zero a.s., and the stepsize is kept constant, γ_n ≡ γ.
Idea: perturbation of a discrete-time dynamical system θ̃_{k+1} = θ̃_k + γ ∇ℓ(θ̃_k) having a unique fixed point and a Lyapunov function ℓ: ℓ(θ̃_{k+1}) ≥ ℓ(θ̃_k), in the presence of the vanishingly small perturbation η_{n+1}.
A.s. convergence of a perturbed dynamical system with a Lyapunov function can be established under very weak assumptions...

25 Stochastic proximal gradient
The stochastic proximal gradient sequence {θ_n, n ∈ N} can be rewritten as
θ_{n+1} = prox_{γ_{n+1}}(θ_n + γ_{n+1} ∇ℓ(θ_n) + γ_{n+1} η_{n+1}),
where η_{n+1} := H_{n+1} − ∇ℓ(θ_n) is the approximation error.
Questions: Convergence and rate of convergence in the SA and mini-batch settings? Stochastic Approximation / Minibatch: which one should I prefer? Tuning of the parameters (stepsize for SA, size of the minibatches, averaging weights, etc.)? Acceleration (à la Nesterov)?
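
A hedged sketch of this perturbed recursion on an assumed ℓ_1-penalized least-squares toy, where the perturbation η comes from estimating the gradient on a random mini-batch of rows whose size grows with the iteration index (everything below is illustrative, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, lam = 400, 20, 0.1
A = rng.normal(size=(N, p))
theta_true = np.zeros(p); theta_true[:4] = [1.5, -1.0, 0.8, -0.5]
y = A @ theta_true + 0.1 * rng.normal(size=N)

def noisy_grad(theta, m):
    """Mini-batch estimate of the gradient of the smooth part (least-squares loss here)."""
    idx = rng.choice(N, size=m, replace=False)
    return A[idx].T @ (A[idx] @ theta - y[idx]) / m

def prox_l1(theta, t):
    return np.sign(theta) * np.maximum(np.abs(theta) - t * lam, 0.0)

gamma, theta = 0.05, np.zeros(p)
for n in range(1, 401):
    m_n = min(20 + n, N)                       # growing mini-batch size
    theta = prox_l1(theta - gamma * noisy_grad(theta, m_n), gamma)

print(np.round(theta[:6], 2))                  # approximately recovers the support of theta_true
```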

26 Main result
Lemma: Suppose that {γ_n, n ∈ N} is decreasing and 0 < Lγ_n ≤ 1 for all n ≥ 1. For any θ* ∈ Θ and any nonnegative sequence {a_n, n ∈ N},
(Σ_{j=1}^n a_j) { f(θ̄_n) − f(θ*) } ≤ Σ_{j=1}^n ( a_j/γ_j − a_{j−1}/γ_{j−1} ) ‖θ_{j−1} − θ*‖² + (a_0/(2γ_0)) ‖θ_0 − θ*‖²
 + Σ_{j=1}^n a_j ⟨T_{γ_j}(θ_{j−1}) − θ*, η_j⟩ + Σ_{j=1}^n a_j γ_j ‖η_j‖²,
where T_γ(θ) := prox_γ(θ + γ ∇ℓ(θ)) is the proximal gradient map.

27 Stochastic Approximation setting
Take a_j = γ_j and decompose, using the Poisson equation, η_j = ξ_j + r_j, where ξ_j is a martingale increment and r_j is a remainder term. Then
(Σ_{j=1}^n γ_j) { f(θ̄_n) − f(θ*) } ≤ (a_0/(2γ_0)) ‖θ_0 − θ*‖² + Σ_{j=1}^n γ_j ⟨T_{γ_j}(θ_{j−1}) − θ*, ξ_j⟩ + Σ_{j=1}^n γ_j² ‖η_j‖² + remainders.
The scalar-product term is a martingale with a bracket bounded by Σ_{j=1}^n γ_j² ‖θ_{j−1} − θ*‖² E[ ‖ξ_j‖² | F_{j−1} ].
If Σ_j γ_j = ∞ and Σ_j γ_j² < ∞, then {θ̄_n, n ∈ N} converges. Rate of convergence: ln(n)/√n, by taking γ_j = j^{−1/2}.

28 Minibatch setting
Theorem: Let {θ̄_n, n ≥ 0} be the averaged estimator. Then for all n ≥ 1,
(Σ_{j=1}^n a_j) E[ f(θ̄_n) − f(θ*) ] ≤ Σ_{j=1}^n ( a_j/γ_j − a_{j−1}/γ_{j−1} ) E[ ‖θ_{j−1} − θ*‖² ] + (a_0/(2γ_0)) E[ ‖θ_0 − θ*‖² ]
 + Σ_{j=1}^n a_j E[ ‖θ_{j−1} − θ*‖ ‖ε^(1)_{j−1}‖ ] + Σ_{j=1}^n a_j γ_j E[ ε^(2)_{j−1} ],
where ε^(1)_n := E[η_{n+1} | F_n] and ε^(2)_n := E[ ‖η_{n+1}‖² | F_n ].

29 Convergence analysis
Corollary: Suppose that γ_n ∈ (0, 1/L], that there exist constants C_1, C_2, B < ∞ such that, for n ≥ 1, E[‖ε^(1)_n‖] ≤ C_1 m_{n+1}^{-1} and E[ε^(2)_n] ≤ C_2 m_{n+1}^{-1}, and that sup_{n∈N} ‖θ_n − θ*‖ ≤ B, P-a.s. Then, setting γ_n ≡ γ, m_n = n and a_n ≡ 1,
E[ f(θ̄_n) − f(θ*) ] ≤ C/n  and  E[ f(θ̄_{N_n}) − f(θ*) ] ≤ C/√n,
where N_n is the number of iterations for n simulations. This is the same rate as for SA (without the logarithmic term).

30 1 Motivation

31 Potts model
We focus on the particular case where X = {1, ..., M} and B(x, y) = 1_{x=y}, which corresponds to the well-known Potts model
f_θ(x_1, ..., x_p) = (1/Z_θ) exp( Σ_{i=1}^p θ_ii B_0(x_i) + Σ_{1≤j<i≤p} θ_ij 1_{x_i = x_j} ).
The term Σ_{i=1}^p θ_ii B_0(x_i) is sometimes referred to as the external field and defines the distribution in the absence of interaction. We focus on the case where the interaction terms θ_ij, i ≠ j, are nonnegative. This corresponds to networks in which there is either no interaction or a collaborative interaction between the nodes.

32 Algorithm
At the k-th iteration, given F_k = σ(θ_1, ..., θ_k):
1. Generate the X^p-valued Markov sequence {X_{k+1,j}}_{j=0}^{m_{k+1}} with transition kernel P_{θ_k} and initial distribution ν_{θ_k}, and compute the approximate gradient
H_{k+1} = (1/n) Σ_{i=1}^n B(x^(i)) − (1/m_{k+1}) Σ_{j=1}^{m_{k+1}} B(X_{k+1,j}).
2. Compute θ_{k+1} = Π_{K_a}( s_{γ_{k+1},λ}(θ_k + γ_{k+1} H_{k+1}) ), where the operation s_{γ,λ}(M) soft-thresholds each entry of the matrix M and the operation Π_{K_a}(M) projects each entry of M onto [0, a].
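
A hedged sketch of one such iteration for a tiny Potts model, with a single-site Gibbs sweep standing in for the Markov kernel P_θ (the slides use a Wolff-type cluster sampler instead, shown a few slides below); all names, sizes, and the choice B_0(x) = x are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
p, M, a = 5, 3, 10.0                       # nodes, number of states, projection bound

def potts_stats(x):
    """Matrix B(x): B_0(x_i) = x_i on the diagonal, 1{x_i == x_j} below it."""
    B = np.zeros((p, p))
    for i in range(p):
        B[i, i] = x[i]
        for j in range(i):
            B[i, j] = float(x[i] == x[j])
    return B

def gibbs_sweep(x, theta):
    """One systematic single-site Gibbs sweep targeting f_theta (stand-in for P_theta)."""
    for i in range(p):
        logp = np.zeros(M)
        for v in range(M):
            x[i] = v
            logp[v] = np.sum(theta * potts_stats(x))
        probs = np.exp(logp - logp.max()); probs /= probs.sum()
        x[i] = rng.choice(M, p=probs)
    return x

def sgp_iteration(theta, data_stats, gamma, lam, m, x_init):
    """theta_{k+1} = Pi_{[0,a]}( soft-threshold(theta_k + gamma * H_{k+1}) )."""
    x, mc_stats = x_init.copy(), np.zeros((p, p))
    for _ in range(m):                                 # mini-batch MCMC run of length m
        x = gibbs_sweep(x, theta)
        mc_stats += potts_stats(x) / m
    H = data_stats - mc_stats                          # approximate gradient of l(theta)
    theta_new = theta + gamma * H
    off = ~np.eye(p, dtype=bool)                       # penalize only the interaction terms
    theta_new[off] = np.sign(theta_new[off]) * np.maximum(np.abs(theta_new[off]) - gamma * lam, 0.0)
    return np.clip(theta_new, 0.0, a), x               # project entries onto [0, a]

# Toy usage: "data" drawn uniformly, then a few iterations with growing mini-batches
data = rng.integers(0, M, size=(50, p))
data_stats = np.mean([potts_stats(x) for x in data], axis=0)
theta, x_state = np.zeros((p, p)), rng.integers(0, M, size=p)
for k in range(1, 11):
    theta, x_state = sgp_iteration(theta, data_stats, gamma=0.1, lam=0.5, m=k, x_init=x_state)
print(np.round(np.tril(theta), 2))
```

Only the off-diagonal interaction terms are soft-thresholded in this sketch, matching the penalty g of slide 6.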

33 MCMC scheme
For j < i, set b_ij = e^{θ_ij}; notice that b_ij ≥ 1. For x = (x_1, ..., x_p),
f_θ(x) = (1/Z_θ) exp( Σ_{i=1}^p θ_ii B_0(x_i) ) Π_{1≤j<i≤p} ( b_ij 1_{x_i=x_j} + 1_{x_i≠x_j} ).
Augment the likelihood with auxiliary variables {δ_ij, 1 ≤ j < i ≤ p}, δ_ij ∈ {0, 1}, such that the joint distribution of (x, δ) is given by
f_θ(x, δ) ∝ exp( Σ_{i=1}^p θ_ii B_0(x_i) ) Π_{j<i} [ 1_{x_i=x_j} b_ij (1 − b_ij^{-1})^{δ_ij} (b_ij^{-1})^{1−δ_ij} + 1_{x_i≠x_j} 0^{δ_ij} 1^{1−δ_ij} ].
The marginal distribution of x under this joint distribution is the same f_θ as given above.

34 MCMC scheme
The auxiliary variables {δ_ij, 1 ≤ j < i ≤ p} are conditionally independent given x = (x_1, ..., x_p): if x_i ≠ x_j, then δ_ij = 0 with probability 1; if x_i = x_j, then δ_ij = 1 with probability 1 − b_ij^{-1} and δ_ij = 0 with probability b_ij^{-1}.
The auxiliary variables {δ_ij, 1 ≤ j < i ≤ p} define an undirected graph on the nodes {1, ..., p}, with an edge between i and j if δ_ij = 1 and no edge otherwise. This graph partitions the nodes {1, ..., p} into maximal clusters C_1, ..., C_K (sets of nodes in which any two nodes are joined by a path). Notice that δ_ij = 1 implies x_i = x_j; hence all the nodes in a given cluster hold the same value of x, and
f_θ(x | δ) ∝ Π_{k=1}^K exp( Σ_{i∈C_k} θ_ii B_0(x_i) ) Π_{j<i: i,j∈C_k} 1_{x_i=x_j}.
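
A minimal sketch of the conditional draw of the bond variables δ given x, and of the clusters they induce (the helper names and the naive label-propagation routine are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_bonds(x, theta):
    """delta_ij = 1 w.p. 1 - exp(-theta_ij) if x_i == x_j, else 0 (since b_ij = exp(theta_ij))."""
    p = len(x)
    delta = np.zeros((p, p), dtype=int)
    for i in range(p):
        for j in range(i):
            if x[i] == x[j] and rng.random() < 1.0 - np.exp(-theta[i, j]):
                delta[i, j] = 1
    return delta

def clusters_from_bonds(delta):
    """Connected components of the bond graph, via naive label propagation."""
    p = delta.shape[0]
    label = list(range(p))
    changed = True
    while changed:
        changed = False
        for i in range(p):
            for j in range(i):
                if delta[i, j] and label[i] != label[j]:
                    label[i] = label[j] = min(label[i], label[j]); changed = True
    groups = {}
    for node, lab in enumerate(label):
        groups.setdefault(lab, []).append(node)
    return list(groups.values())

x = np.array([0, 0, 1, 0, 1])
theta = np.tril(1.5 * np.ones((5, 5)), k=-1)   # strong collaborative interactions
delta = sample_bonds(x, theta)
print(clusters_from_bonds(delta))              # clusters only ever join nodes with equal labels
```

Since 1 − b_ij^{-1} = 1 − e^{−θ_ij}, bonds only form between equal labels, which is what makes the cluster update on the next slide valid.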

35 Wolff algorithm
Given X = (X_1, ..., X_p):
1. Randomly select a node i ∈ {1, ..., p} and set C_0 = {i}.
2. Do until C_0 can no longer grow: for each new addition j to C_0, and for each j' ∉ C_0 such that θ_{jj'} > 0, starting with δ_{jj'} = 0, do the following. If X_{j'} = X_j, set δ_{jj'} = 1 with probability 1 − e^{−θ_{jj'}}. If δ_{jj'} = 1, add j' to C_0.
3. If X_i = v, randomly select v' ∈ {1, ..., M} \ {v}, and propose a new vector X' ∈ X^p, where X'_j = v' for j ∈ C_0 and X'_j = X_j for j ∉ C_0. Accept X' with probability 1 ∧ exp( (B_0(v') − B_0(v)) Σ_{j∈C_0} θ_jj ).
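
A hedged sketch of one Wolff-type cluster update for the Potts model with an external field B_0, following the three steps on this slide (function names and the choice B_0(v) = v are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)

def wolff_update(X, theta, M, B0=lambda v: float(v)):
    """One cluster move; interactions read from the lower triangle theta[max(i,j), min(i,j)]."""
    p = len(X)
    i = rng.integers(p)
    cluster, frontier = {i}, [i]
    # Step 2: grow the cluster by activating bonds with probability 1 - exp(-theta_jj')
    while frontier:
        j = frontier.pop()
        for jp in range(p):
            if jp in cluster:
                continue
            t = theta[max(j, jp), min(j, jp)]
            if t > 0 and X[jp] == X[j] and rng.random() < 1.0 - np.exp(-t):
                cluster.add(jp); frontier.append(jp)
    # Step 3: propose a common new label for the cluster, accept via the external-field change
    v = X[i]
    v_new = rng.choice([u for u in range(M) if u != v])
    log_ratio = (B0(v_new) - B0(v)) * sum(theta[j, j] for j in cluster)
    X_new = X.copy()
    if np.log(rng.random()) < min(0.0, log_ratio):
        X_new[list(cluster)] = v_new
    return X_new

p, M = 6, 3
theta = np.tril(0.8 * np.ones((p, p)))        # diagonal = external field, lower = interactions
X = rng.integers(0, M, size=p)
for _ in range(100):
    X = wolff_update(X, theta, M)
print(X)
```

Because the bond probabilities match the auxiliary-variable scheme above, the interaction terms cancel in the proposal and only the external-field term enters the acceptance ratio, as on the slide.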

36 [Figure: relative error rate, true positive rate, and false discovery rate vs. iterations, comparing Stoch. Grad., Stoch. Grad. Averaged, and the accelerated scheme.] Figure: Simulation results for p = 50, n = 500 observations, 1% of off-diagonal terms, minibatch, m_n = n.

37 [Figure: relative error rate, true positive rate, and false discovery rate vs. iterations, for Stoch. Grad. and Stoch. Grad. Averaged.] Figure: Simulation results for p = 100, n = 500 observations, 1% of off-diagonal terms, minibatch, m_n = n.

38 [Figure: relative error rate, true positive rate, and false discovery rate vs. iterations, for Stoch. Grad. and Stoch. Grad. Averaged.] Figure: Simulation results for p = 200, n = 500 observations, 1% of off-diagonal terms, minibatch, m_n = n.

39 1 Motivation

40 Take-home message
Efficient and globally converging procedures for penalized likelihood inference in incomplete-data models are available if the complete-data likelihood is globally concave and the sparsity-inducing penalty is convex (provided that computing the proximal operator is easy). Stochastic Approximation and Minibatch algorithms achieve the same rate, which is 1/√n where n is the number of simulations. Minibatch algorithms are in general preferable if the computation of the proximal operator is complex. Thanks for your attention... and patience!
