A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices

1 A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
Ryota Tomioka (1), Taiji Suzuki (1), Masashi Sugiyama (2), Hisashi Kashima (1)
(1) The University of Tokyo, (2) Tokyo Institute of Technology
ICML 2010

2 Introduction: Learning low-rank matrices
1. Matrix completion [Srebro et al. 05; Abernethy et al. 09]: collaborative filtering, link prediction. Y ≈ W, where only part of Y is observed and W = W_user W_movie^⊤ is low rank.
2. Multi-task learning [Argyriou et al., 07]: y = W^⊤ x + b, where W = [w_task1, w_task2, ..., w_taskR] stacks the per-task weight vectors.
3. Predicting over matrices (classification/regression) [Tomioka & Aihara, 07]: y = ⟨W, X⟩ + b with a space × time input matrix X and a low-rank W = W_space W_time^⊤ (more generally, W = W_task W_feature^⊤).

3 Introduction: Problem formulation
Primal problem:
    minimize_{W ∈ R^{R×C}}  f(W) := f_ℓ(A(W)) + φ_λ(W)
- f_ℓ : R^m → R: loss function (differentiable).
- A : R^{R×C} → R^m: design matrix (linear operator).
- φ_λ: regularizer (non-differentiable); for example, the trace norm
      φ_λ(W) = λ‖W‖_* = λ Σ_{j=1}^r σ_j(W)   (linear sum of singular values).
Separation of f_ℓ and A ⇒ convergence regardless of the design A (input data).
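
To make the formulation concrete, here is a minimal numpy sketch of the objective f(W) = f_ℓ(A(W)) + φ_λ(W), using a squared loss and the trace norm with a matrix-completion-style observation operator. The callable A, the mask obs, and the squared loss are illustrative assumptions, not the talk's fixed setting.

```python
import numpy as np

def trace_norm(W):
    # ||W||_tr: sum of singular values (dense SVD is fine at this toy scale)
    return np.linalg.svd(W, compute_uv=False).sum()

def objective(W, A, y, lam):
    # f(W) = f_l(A(W)) + lam * ||W||_tr, with an (assumed) squared loss f_l
    z = A(W)                              # A : R^{R x C} -> R^m, here a plain callable
    return 0.5 * np.sum((z - y) ** 2) + lam * trace_norm(W)

# toy matrix-completion instance: A picks out the observed entries
rng = np.random.default_rng(0)
R, C, r = 30, 20, 3
W_true = rng.standard_normal((R, r)) @ rng.standard_normal((r, C))
obs = rng.random((R, C)) < 0.3            # boolean mask of observed entries
A = lambda W: W[obs]                      # observation operator (illustrative)
y = A(W_true)
print(objective(np.zeros((R, C)), A, y, lam=1.0))
```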

4 Introduction: Existing approaches and our goal
Proximal accelerated gradient [Ji & Ye, 09]:
- Can keep the intermediate solution low-rank.
- Slow for a poorly conditioned design matrix A.
- Optimal as a first-order black-box method: O(1/k^2).
Interior point algorithm [Tomioka & Aihara, 07]:
- Can obtain a highly precise solution in a small number of iterations.
- Only low-rank in the limit; each step can be heavy.
Can we keep the intermediate solution low-rank and still converge as rapidly as an IP method?
Dual Augmented Lagrangian (DAL) [Tomioka & Sugiyama, 09] for sparse estimation.
M-DAL: generalization to low-rank matrix estimation.

7 Outline
1. Introduction: Why learn low-rank matrices? Existing algorithms.
2. Proposed algorithm: Augmented Lagrangian, proximal minimization, super-linear convergence, generalizations.
3. Experiments: synthetic 10,000 × 10,000 matrix completion; BCI classification problem (learning multiple matrices).
4. Summary.

8 M-DAL algorithm
1. Initialize W^0. Choose step sizes η_1 ≤ η_2 ≤ ...
2. Iterate until the relative duality gap < ε:
     α^t := argmin_{α ∈ R^m} φ_t(α)        (inner objective)
     W^{t+1} = ST_{λη_t}(W^t + η_t A^⊤(α^t))
Choosing α^t = −∇f_ℓ(A(W^t)) instead of minimizing φ_t yields the proximal-gradient (forward-backward) method.
The spectral soft-threshold operator is
     ST_λ(W) := argmin_{X ∈ R^{R×C}} [ φ_λ(X) + (1/2)‖X − W‖_fro^2 ].
For the trace norm φ_λ(W) = λ‖W‖_*,
     ST_λ(W) = U max(S − λI, 0) V^⊤,   where W = USV^⊤ is the singular-value decomposition.
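
Below is a minimal numpy sketch of the spectral soft-thresholding operator ST_λ defined above, assuming a dense SVD is affordable; the talk's large-scale experiments instead exploit structure, and the function names here are illustrative.

```python
import numpy as np

def svt(W, thresh):
    """Spectral soft-threshold ST_thresh(W): shrink singular values by `thresh`.

    This is the proximal operator of thresh * ||.||_tr, i.e. the argmin of
    thresh * ||X||_tr + 0.5 * ||X - W||_fro^2.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - thresh, 0.0)
    keep = s_shrunk > 0
    # only singular vectors above the threshold are needed -> low-rank output
    return (U[:, keep] * s_shrunk[keep]) @ Vt[keep]

# quick check on a random matrix: the result is exactly low rank
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))
print(np.linalg.matrix_rank(svt(W, thresh=1.0)))
```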

9 Proximal minimization [Rockafellar 76a, 76b]
1. Initialize W^0.
2. Iterate:
     W^{t+1} = argmin_W [ f(W) + (1/(2η_t))‖W − W^t‖^2 ]                    (proximal term)
             = argmin_W [ f_ℓ(A(W)) + φ_λ(W) + (1/(2η_t))‖W − W^t‖^2 ],
   where the last two terms are abbreviated as (1/η_t) Φ_{λη_t}(W; W^t), i.e.
     min_W [ f_ℓ(A(W)) + (1/η_t) Φ_{λη_t}(W; W^t) ].
Fenchel dual (see Rockafellar 1970):
     = max_α [ −f_ℓ^*(−α) − (1/η_t) Φ_{λη_t}^*(η_t A^⊤(α); W^t) ] =: −min_α φ_t(α).

12 Super-linear convergence
Definition: W^* is the unique minimizer of the objective f(W); (W^t) is the sequence generated by the M-DAL algorithm with the inner problem solved to the accuracy
     ‖∇φ_t(α^t)‖ ≤ √(γ/η_t) ‖W^{t+1} − W^t‖_fro,   where 1/γ is the Lipschitz constant of ∇f_ℓ.
Assumption: there is a constant σ > 0 such that
     f(W^{t+1}) − f(W^*) ≥ σ ‖W^{t+1} − W^*‖_fro^2   (t = 0, 1, 2, ...).
Theorem (super-linear convergence):
     ‖W^{t+1} − W^*‖_fro ≤ (1/√(1 + 2ση_t)) ‖W^t − W^*‖_fro.
I.e., W^t converges super-linearly to W^* if η_t is increasing.

13 Why is M-DAL efficient?
1. Proximation w.r.t. φ_λ is analytic (though non-smooth):
     W^{t+1} = ST_{η_t λ}(W^t + η_t A^⊤(α^t)).
2. Inner minimization is smooth:
     α^t = argmin_{α ∈ R^m} [ f_ℓ^*(−α) + (1/(2η_t))‖ST_{η_t λ}(W^t + η_t A^⊤(α))‖_fro^2 ],
   where the first term is differentiable and the second is the differentiable envelope Φ_λ^*; its cost is linear in the estimated rank.
[Figure: the non-differentiable regularizer φ_λ(w) vs. its differentiable envelope Φ_λ^*(w).]

14 Generalizations
1. Learning multiple matrices:
     φ_λ(W^(1), ..., W^(K)) = λ Σ_{k=1}^K ‖W^(k)‖_*   (the trace norm of the block-diagonal concatenation).
   No need to form the big matrix. Can be used to select informative data sources and learn feature extractors simultaneously.
2. General spectral regularization:
     φ_λ(W) = Σ_{j=1}^r g_λ(σ_j(W))
   for any convex function g_λ whose proximal operator
     ST_{g_λ}(σ_j) = argmin_{x ∈ R} [ g_λ(x) + (1/2)(x − σ_j)^2 ]
   can be computed in closed form (see the sketch below).
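
As referenced in item 2 above, here is a hedged numpy sketch of a general spectral proximal operator: given a closed-form scalar prox for g_λ, the matrix prox applies it to the singular values. The two scalar proxes shown (soft-thresholding and a quadratic shrinkage) are illustrative examples only.

```python
import numpy as np

def spectral_prox(W, scalar_prox):
    """Prox of a spectral regularizer phi(W) = sum_j g(sigma_j(W)).

    scalar_prox(s) must return argmin_x g(x) + 0.5*(x - s)^2, applied
    elementwise to the vector of singular values s (assumed closed form).
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * scalar_prox(s)) @ Vt

# illustrative scalar proxes
soft = lambda s, lam=1.0: np.maximum(s - lam, 0.0)   # g(x) = lam*|x|    -> trace norm
ridge = lambda s, lam=1.0: s / (1.0 + lam)           # g(x) = lam*x^2/2  -> squared spectral penalty

W = np.random.default_rng(1).standard_normal((6, 4))
print(np.linalg.svd(spectral_prox(W, soft), compute_uv=False))
print(np.linalg.svd(spectral_prox(W, ridge), compute_uv=False))
```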

16 Synthetic experiment 1: low-rank matrix completion
Large scale & structured.
True matrix W^*: 10,000 × 10,000 (100M elements), low rank.
Observation: m randomly chosen elements (sparse).
Quasi-Newton method for the minimization of φ_t(α); no need to form the full matrix!
[Figure: computation time (s) and estimated rank vs. the regularization constant λ, for m = 1,200,000 (true rank 10) and m = 2,400,000 (true rank 20).]
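
The "no need to form the full matrix" point can be illustrated with scipy: when W^t is kept as low-rank factors and η_t A^⊤(α) is sparse, the matrix-vector products needed for a truncated SVD can be applied without densifying anything. This is only a sketch of the idea using a LinearOperator and svds, not the talk's actual quasi-Newton implementation; all names and sizes are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, svds

def top_svd_lowrank_plus_sparse(L, R, S, k):
    """Top-k SVD of L @ R.T + S without forming the dense matrix.

    L: (n, r) and R: (m, r) are low-rank factors of W^t;
    S: sparse (n, m), standing in for eta_t * A^T(alpha).
    """
    n, m = S.shape
    mv = lambda x: L @ (R.T @ x) + S @ x      # matrix-vector product
    rmv = lambda x: R @ (L.T @ x) + S.T @ x   # adjoint product
    op = LinearOperator((n, m), matvec=mv, rmatvec=rmv)
    U, s, Vt = svds(op, k=k)
    return U, s, Vt

# toy usage
rng = np.random.default_rng(0)
n, m, r = 200, 150, 5
L, R = rng.standard_normal((n, r)), rng.standard_normal((m, r))
S = sp.random(n, m, density=0.01, random_state=0, format="csr")
U, s, Vt = top_svd_lowrank_plus_sparse(L, R, S, k=10)
print(np.sort(s)[::-1][:5])   # leading singular values
```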

17 Synthetic experiment 1: results for rank = 10, #observations m = 1,200,000
[Table: for each regularization constant λ, the computation time (s), #outer iterations, #inner iterations, estimated rank, and S-RMSE, reported as mean ± standard deviation.]
The number of inner iterations is roughly 2 times the number of outer iterations, because we don't need to solve the inner problem very precisely!

18 Synthetic experiment 1: results for rank = 20, #observations m = 2,400,000
[Table: for each regularization constant λ, the computation time (s), #outer iterations, #inner iterations, estimated rank, and S-RMSE, reported as mean ± standard deviation.]

19 Synthetic experiment 2: classifying matrices
True matrix W^*: 64 × 64, rank = 16.
Classification problem (#samples m = 1000) with the logistic loss
     f_ℓ(z) = Σ_{i=1}^m log(1 + exp(−y_i z_i)).
Design matrix A: dense; each example is drawn from a Wishart distribution.
λ = 800, chosen to roughly reproduce rank = 16.
Medium scale & dense: a Newton method for minimizing φ_t(α) works better when the conditioning is poor.
Methods compared:
- M-DAL (proposed)
- IP: interior point method [T&A 2007]
- PG: projected gradient method [T&S 2008]
- AG: accelerated gradient method [Ji & Ye 2009]
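
For reference, a small sketch of the logistic loss f_ℓ above and its gradient with respect to z, written in a numerically stable form; the function names and the toy data are illustrative.

```python
import numpy as np

def logistic_loss(z, y):
    """f_l(z) = sum_i log(1 + exp(-y_i z_i)), computed stably via logaddexp."""
    return np.logaddexp(0.0, -y * z).sum()

def logistic_loss_grad(z, y):
    """Gradient wrt z: d/dz_i log(1 + exp(-y_i z_i)) = -y_i / (1 + exp(y_i z_i))."""
    return -y / (1.0 + np.exp(y * z))

rng = np.random.default_rng(0)
z, y = rng.standard_normal(5), rng.choice([-1.0, 1.0], size=5)
print(logistic_loss(z, y), logistic_loss_grad(z, y))
```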

20 Comparison (synthetic experiment 2: classifying matrices)
[Figure: relative duality gap vs. CPU time (s) for AG, IP, PG, and M-DAL.]
- M-DAL (proposed)
- IP: interior point method [T&A 2007]
- PG: projected gradient method [T&S 2008]
- AG: accelerated gradient method [Ji & Ye 2009]

21 Super-linear convergence (synthetic experiment 2: classifying matrices)
[Figure: relative duality gap vs. number of iterations and vs. time (s).]

22 Benchmark experiment: BCI dataset
BCI competition 2003, dataset IV.
Task: predict whether the upcoming finger movement is right or left.
Original data: multichannel EEG time series (28 channels × 50 time points, 500 ms long).
Each training example is preprocessed into three matrices and whitened [T&M 2010]:
- First-order (<20 Hz) component: a channels × time-points matrix.
- Second-order alpha (7-15 Hz) component: a covariance matrix.
- Second-order beta (15-30 Hz) component: a covariance matrix.
316 training examples, 100 test samples.

23 BCI dataset: regularization path
[Figure: number of iterations, computation time (s), and test accuracy vs. the regularization constant, for M-DAL (proposed, -o-), PG [T&S 08] (--), and AG [Ji & Ye 09] (-x-).]
Stopping criterion: relative duality gap (RDG).
Note: these are the costs for obtaining the entire regularization path.

24 Summary
M-DAL: a dual augmented Lagrangian algorithm for learning low-rank matrices.
Theoretically shown that M-DAL converges super-linearly, so it requires a small number of outer steps.
Separable regularizer + differentiable loss function: each step can be solved efficiently.
Empirically, we can solve a matrix completion problem of size 10,000 × 10,000 in roughly 5 min.
Training a low-rank classifier that automatically combines multiple data sources (matrices) is sped up by a factor of ...
Future work: other AL algorithms; large-scale and unstructured or badly conditioned problems.

25 Appendix: Background
Convex conjugate of f:
     f^*(y) = sup_{x ∈ R^n} [ ⟨y, x⟩ − f(x) ].
Convex conjugate of a sum (inf-convolution of the conjugates):
     (f + g)^*(y) = (f^* □ g^*)(y) = inf_α [ f^*(α) + g^*(y − α) ].
Fenchel duality:
     inf_{x ∈ R^n} [ f(x) + g(x) ] = sup_{α ∈ R^n} [ −f^*(−α) − g^*(α) ].

26 Appendix: Inf-convolution (envelope function)
     Φ_λ^*(V; W^t) = (φ_λ^* □ (1/2)‖·‖^2)(W^t + V)
                   = inf_{Y ∈ R^{R×C}} [ φ_λ^*(Y) + (1/2)‖W^t + V − Y‖^2 ]
                   = Φ_λ^*(W^t + V; 0) =: Φ_λ^*(W^t + V).
[Figure: φ_λ(x) and φ_λ^*(v) are non-differentiable; their envelopes Φ_λ(x) and Φ_λ^*(v) are differentiable.]

27 Appendix: M-DAL algorithm for the trace norm
     W^{t+1} = ST_{λη_t}(W^t + η_t A^⊤(α^t)),
where
     α^t = argmin_α [ f_ℓ^*(−α) + (1/(2η_t))‖ST_{λη_t}(W^t + η_t A^⊤(α))‖^2 ] =: argmin_α φ_t(α),
and the second term equals (1/η_t) Φ_{λη_t}^*(η_t A^⊤(α); W^t).
φ_t(α) is differentiable:
     ∇φ_t(α) = −∇f_ℓ^*(−α) + A(ST_{λη_t}(W^t + η_t A^⊤(α))).
∇^2 φ_t(α) is a bit more involved, but it can be computed [Wright 92].
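
Putting the two formulas together, here is a hedged numpy sketch of φ_t(α) and ∇φ_t(α) for the special case of a squared loss f_ℓ(z) = (1/2)‖z − y‖^2, whose conjugate gives f_ℓ^*(−α) = (1/2)‖α‖^2 − ⟨α, y⟩, with a matrix-completion observation operator. It uses a dense SVD and illustrative names, unlike the structure-exploiting implementation described on the next slide.

```python
import numpy as np

def svt(W, thresh):
    # spectral soft-threshold (same operator as in the earlier sketch)
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - thresh, 0.0)) @ Vt

def inner_obj_and_grad(alpha, Wt, A, At, y, lam, eta):
    """phi_t(alpha) and its gradient for the squared loss f_l(z) = 0.5*||z - y||^2.

    A maps a matrix to R^m; At maps R^m back to a matrix (the adjoint).
    For this loss, f_l^*(-alpha) = 0.5*||alpha||^2 - <alpha, y>.
    """
    V = svt(Wt + eta * At(alpha), lam * eta)
    phi = 0.5 * alpha @ alpha - alpha @ y + (0.5 / eta) * np.sum(V * V)
    grad = (alpha - y) + A(V)
    return phi, grad

# toy matrix-completion instance (illustrative names and sizes)
rng = np.random.default_rng(0)
n, m_obs = 15, 60
rows, cols = rng.integers(0, n, m_obs), rng.integers(0, n, m_obs)
A = lambda W: W[rows, cols]
def At(a):
    M = np.zeros((n, n)); np.add.at(M, (rows, cols), a); return M
y = rng.standard_normal(m_obs)
phi, g = inner_obj_and_grad(np.zeros(m_obs), np.zeros((n, n)), A, At, y, lam=0.5, eta=1.0)
print(phi, np.linalg.norm(g))
```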

28 Appendix: Minimizing the inner objective φ_t(α)
φ_t(α) and ∇φ_t(α) can be computed using only the singular values σ_j(W^t(α)) > λη_t and the corresponding singular vectors; there is no need to compute the full SVD of W^t(α) := W^t + η_t A^⊤(α).
However, computation of ∇^2 φ_t(α) requires the full SVD.
Consequence: when A^⊤(α) is structured, computing the above (partial) SVD is cheap. E.g., in matrix completion A^⊤(α) is sparse with only m non-zeros, so W^t(α) = W^t_L (W^t_R)^⊤ + η_t A^⊤(α) is (low rank) + (sparse); a quasi-Newton method that maintains W^t in factored form is scalable.
When A^⊤(α) is not structured, the full Newton method converges faster and is more stable.

29 Appendix: Primal and dual proximal-minimization variants
Primal proximal minimization (A), M-DAL:
     W^{t+1} = argmin_W [ f_ℓ(A(W)) + { φ_λ(W) + (1/(2η_t))‖W − W^t‖_fro^2 } ]
Primal proximal minimization (B):
     W^{t+1} = argmin_W [ φ_λ(W) + { f_ℓ(A(W)) + (1/(2η_t))‖W − W^t‖_fro^2 } ]
Dual proximal minimization (A), split Bregman iteration:
     α^{t+1} = argmin_α [ f_ℓ^*(−α) + { φ_λ^*(A^⊤(α)) + (1/(2η_t))‖α − α^t‖^2 } ]
Dual proximal minimization (B):
     α^{t+1} = argmin_α [ φ_λ^*(A^⊤(α)) + { f_ℓ^*(−α) + (1/(2η_t))‖α − α^t‖^2 } ]
(The braces indicate which part of the objective is grouped with the proximal term.)

30 References
- Abernethy et al. A new approach to collaborative filtering: Operator estimation with spectral regularization. JMLR 10, 2009.
- Argyriou et al. Multi-task feature learning. In NIPS 19, 2007.
- Ji & Ye. An accelerated gradient method for trace norm minimization. In ICML 2009.
- Rockafellar. Monotone operators and the proximal point algorithm. SIAM J. on Control and Optimization 14, 1976a.
- Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. of Oper. Res. 1, 1976b.
- Srebro et al. Maximum-margin matrix factorization. In NIPS 17, 2005.
- Tomioka & Aihara. Classifying matrices with a spectral regularization. In ICML 2007.
- Tomioka et al. Super-linear convergence of dual augmented-Lagrangian algorithm for sparsity regularized estimation. arXiv preprint, 2009.
- Tomioka & Sugiyama. Sparse learning with duality gap guarantee. In NIPS workshop OPT 2008: Optimization for Machine Learning, 2008.
- Tomioka & Müller. A regularized discriminative framework for EEG analysis with application to brain-computer interface. NeuroImage 49, 2010.
- Wright. Differential equations for the analytic singular value decomposition of a matrix. Numer. Math. 63, 1992.
