A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices
1 A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices. Ryota Tomioka (1), Taiji Suzuki (1), Masashi Sugiyama (2), Hisashi Kashima (1). (1) The University of Tokyo, (2) Tokyo Institute of Technology. ICML 2010.
2 Introduction: Learning low-rank matrices
1 Matrix completion [Srebro et al. 05; Abernethy et al. 09]: collaborative filtering, link prediction. $Y \approx W$ with only part of $Y$ observed; $W = W_{\mathrm{user}} W_{\mathrm{movie}}^\top$.
2 Multi-task learning [Argyriou et al., 07]: $y = W x + b$ with $W = (w_{\mathrm{task}\,1}, w_{\mathrm{task}\,2}, \ldots, w_{\mathrm{task}\,R})^\top$.
3 Predicting over matrices (classification/regression) [Tomioka & Aihara, 07]: $y = \langle W, X \rangle + b$, where $X$ is a space-by-time matrix and $W = W_{\mathrm{task}} W_{\mathrm{feature}}^\top = W_{\mathrm{space}} W_{\mathrm{time}}^\top$.
3 Introduction: Problem formulation
Primal problem:
$\min_{W \in \mathbb{R}^{R \times C}} \; f_\ell(A(W)) + \phi_\lambda(W) \;=:\; f(W)$
$f_\ell : \mathbb{R}^m \to \mathbb{R}$: loss function (differentiable).
$A : \mathbb{R}^{R \times C} \to \mathbb{R}^m$: design matrix (linear operator).
$\phi_\lambda$: regularizer (non-differentiable); for example the trace norm $\phi_\lambda(W) = \lambda \|W\|_* = \lambda \sum_{j=1}^r \sigma_j(W)$, the sum of singular values.
Separating $f_\ell$ and $A$ gives convergence regardless of $A$ (the input data).
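As a concrete illustration (an added sketch, not from the slides; assumes numpy), the trace-norm penalty can be evaluated directly from the singular values:

```python
import numpy as np

def trace_norm_penalty(W, lam):
    """phi_lambda(W) = lam * sum of singular values of W (the trace norm)."""
    return lam * np.linalg.svd(W, compute_uv=False).sum()

# a rank-2 matrix: only two non-zero singular values contribute
rng = np.random.default_rng(0)
W = rng.standard_normal((100, 2)) @ rng.standard_normal((2, 80))
print(trace_norm_penalty(W, lam=0.1))
```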
4 Introduction: Existing approaches and our goal
Proximal accelerated gradient [Ji & Ye, 09]: can keep the intermediate solutions low-rank; slow for a poorly conditioned design matrix $A$; optimal as a first-order black-box method, $O(1/k^2)$.
Interior point algorithm [Tomioka & Aihara, 07]: obtains a highly precise solution in a small number of iterations; only low-rank in the limit, and each step can be heavy.
Question: can we keep the intermediate solutions low-rank and still converge as rapidly as an IP method?
Dual Augmented Lagrangian (DAL) [Tomioka & Sugiyama, 09] was proposed for sparse estimation; M-DAL generalizes it to low-rank matrix estimation.
7 Introduction: Outline
1 Introduction: why learn low-rank matrices? Existing algorithms.
2 Proposed algorithm: augmented Lagrangian, proximal minimization, super-linear convergence, generalizations.
3 Experiments: synthetic 10,000 x 10,000 matrix completion; BCI classification problem (learning multiple matrices).
4 Summary.
8 Algorithm: The M-DAL algorithm
1 Initialize $W^0$. Choose $\eta_1 \le \eta_2 \le \cdots$.
2 Iterate until the relative duality gap $< \epsilon$:
$\alpha^t := \mathrm{argmin}_{\alpha \in \mathbb{R}^m} \phi_t(\alpha)$ (inner objective),
$W^{t+1} = \mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha^t)$.
Choosing $\alpha^t = -\nabla f_\ell(A W^t)$ instead yields the proximal gradient (forward-backward) method.
Here $\mathrm{ST}_\lambda(W) := \mathrm{argmin}_{X \in \mathbb{R}^{R \times C}} \; \phi_\lambda(X) + \tfrac12 \|X - W\|_{\mathrm{fro}}^2$.
For the trace norm $\phi_\lambda(W) = \lambda \|W\|_*$: $\mathrm{ST}_\lambda(W) = U \max(S - \lambda I, 0) V^\top$, where $W = U S V^\top$ is the singular value decomposition.
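For the trace norm, the soft-thresholding operation $\mathrm{ST}_\lambda$ has the closed form above. A minimal numpy sketch (function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def svd_soft_threshold(W, lam):
    """ST_lambda(W) = U max(S - lam, 0) V^T: prox of the trace norm lam * ||.||_*."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_thr = np.maximum(s - lam, 0.0)
    keep = s_thr > 0                      # only the surviving singular directions
    return (U[:, keep] * s_thr[keep]) @ Vt[keep, :]

# the output is low-rank whenever lam exceeds the small singular values
W = np.random.default_rng(1).standard_normal((50, 40))
print(np.linalg.matrix_rank(svd_soft_threshold(W, lam=5.0)))
```

Only the singular values above the threshold matter, which is what makes partial (structured) SVDs attractive later in the talk.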
9 Algorithm: Proximal minimization [Rockafellar 76a, 76b]
1 Initialize $W^0$.
2 Iterate:
$W^{t+1} = \mathrm{argmin}_W \; f(W) + \tfrac{1}{2\eta_t}\|W - W^t\|^2 = \mathrm{argmin}_W \; f_\ell(AW) + \phi_\lambda(W) + \tfrac{1}{2\eta_t}\|W - W^t\|^2$,
where $\tfrac{1}{2\eta_t}\|W - W^t\|^2$ is the proximal term and we write $\phi_\lambda(W) + \tfrac{1}{2\eta_t}\|W - W^t\|^2 =: \tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}(W; W^t)$.
Fenchel dual (see Rockafellar 1970):
$\min_W \; f_\ell(AW) + \tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}(W; W^t) \;=\; \max_\alpha \; -f_\ell^*(-\alpha) - \tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}^*(\eta_t A^\top \alpha; W^t) \;=:\; -\min_\alpha \phi_t(\alpha)$.
12 Algorithm: Super-linear convergence
Definition. $W^*$: the unique minimizer of the objective $f(W)$. $W^t$: the sequence generated by the M-DAL algorithm with the inner problems solved to the accuracy $\|\nabla\phi_t(\alpha^t)\| \le \sqrt{\gamma/\eta_t}\,\|W^{t+1} - W^t\|_{\mathrm{fro}}$, where $1/\gamma$ is the Lipschitz constant of $\nabla f_\ell$.
Assumption. There is a constant $\sigma > 0$ such that $f(W^{t+1}) - f(W^*) \ge \sigma \|W^{t+1} - W^*\|_{\mathrm{fro}}^2$ for $t = 0, 1, 2, \ldots$
Theorem (super-linear convergence). $\|W^{t+1} - W^*\|_{\mathrm{fro}} \le \tfrac{1}{\sqrt{1 + \sigma\eta_t}} \|W^t - W^*\|_{\mathrm{fro}}$. I.e., $W^t$ converges super-linearly to $W^*$ if $\eta_t$ is increasing.
13 Algorithm: Why is M-DAL efficient?
1 Proximation with respect to $\phi_\lambda$ is analytic (though non-smooth): $W^{t+1} = \mathrm{ST}_{\eta_t\lambda}(W^t + \eta_t A^\top \alpha^t)$.
2 The inner minimization is smooth:
$\alpha^t = \mathrm{argmin}_{\alpha \in \mathbb{R}^m} \; f_\ell^*(-\alpha) + \tfrac{1}{2\eta_t}\|\mathrm{ST}_{\eta_t\lambda}(W^t + \eta_t A^\top \alpha)\|_{\mathrm{fro}}^2$.
The first term is differentiable; the second term (the $\Phi_\lambda^*$ part) costs only linear time in the estimated rank.
(Figure: the non-differentiable penalty $\phi_\lambda(w)$ compared with the differentiable envelope $\Phi_\lambda^*(w)$, with kinks at $\pm\lambda$.)
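To make the smooth inner problem concrete, here is a sketch (an added example with illustrative names) of $\phi_t(\alpha)$ and $\nabla\phi_t(\alpha)$ for matrix completion with the squared loss $f_\ell(z) = \tfrac12\|z - y\|^2$, so that $f_\ell^*(-\alpha) = \tfrac12\|\alpha\|^2 - \alpha^\top y$. For clarity it forms the dense matrix; the scalable variant keeps $W^t$ factorized (see the appendix):

```python
import numpy as np

def svd_soft_threshold(W, lam):
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * np.maximum(s - lam, 0.0)) @ Vt

def inner_objective_and_grad(alpha, W_t, rows, cols, y, lam, eta):
    """phi_t(alpha) and its gradient for matrix completion with squared loss."""
    # A^T(alpha): place the dual variables back on the observed entries
    At_alpha = np.zeros_like(W_t)
    At_alpha[rows, cols] = alpha
    S = svd_soft_threshold(W_t + eta * At_alpha, lam * eta)   # ST_{lam*eta}(W^t + eta A^T alpha)
    phi = 0.5 * alpha @ alpha - alpha @ y + 0.5 / eta * np.sum(S * S)
    grad = alpha - y + S[rows, cols]                          # -grad f_l*(-alpha) + A(ST(...))
    return phi, grad
```

This (value, gradient) pair can be handed to an off-the-shelf quasi-Newton routine such as scipy.optimize.minimize(..., jac=True, method='L-BFGS-B').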
14 Algorithm: Generalizations
1 Learning multiple matrices:
$\phi_\lambda(W) = \lambda \sum_{k=1}^K \|W_k\|_*$ for $W = (W_1, \ldots, W_K)$.
No need to form the big (concatenated) matrix. Can be used to select informative data sources and learn feature extractors simultaneously.
2 General spectral regularization:
$\phi_\lambda(W) = \sum_{j=1}^r g_\lambda(\sigma_j(W))$
for any convex function $g_\lambda$ whose proximal operator $\mathrm{ST}_{g_\lambda}(\sigma_j) = \mathrm{argmin}_{x \in \mathbb{R}} \; g_\lambda(x) + \tfrac12 (x - \sigma_j)^2$ can be computed in closed form.
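A sketch of the second generalization (an added example, illustrative names): the proximal operation for a spectral regularizer is obtained by applying the scalar prox of $g_\lambda$ to each singular value:

```python
import numpy as np

def spectral_prox(W, scalar_prox):
    """Prox of phi_lambda(W) = sum_j g_lambda(sigma_j(W)):
    apply the scalar prox of g_lambda to each singular value."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U * scalar_prox(s)) @ Vt

# g_lambda(x) = lam * |x| recovers the trace-norm case (soft thresholding)
soft_threshold = lambda s, lam=1.0: np.maximum(s - lam, 0.0)
W = np.random.default_rng(2).standard_normal((30, 20))
W_prox = spectral_prox(W, soft_threshold)
```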
15 Experiments: Synthetic experiment 1, low-rank matrix completion
Large scale & structured. True matrix $W^*$: 10,000 x 10,000 (100M elements), low rank. Observation: m randomly chosen elements (sparse). A quasi-Newton method is used for the minimization of $\phi_t(\alpha)$; no need to form the full matrix. Two settings: m = 1,200,000 and m = 2,400,000.
16 (Figure) Results for the two settings, m = 1,200,000 (rank 10) and m = 2,400,000 (rank 20): computation time (s) and estimated rank as functions of the regularization constant $\lambda$.
17 Experiments: Synthetic experiment 1, rank = 10, #observations m = 1,200,000
(Table of results for each regularization constant $\lambda$: computation time (s), #outer iterations, #inner iterations, estimated rank, and S-RMSE, with standard deviations.)
The number of inner iterations is roughly 2 times the number of outer iterations, because the inner problem does not need to be solved very precisely.
18 Experiments: Synthetic experiment 1, rank = 20, #observations m = 2,400,000
(Table of results for each regularization constant $\lambda$: computation time (s), #outer iterations, #inner iterations, estimated rank, and S-RMSE, with standard deviations.)
19 Experiments: Synthetic experiment 2, classifying matrices
True matrix $W^*$: 64 x 64, rank 16. Classification problem (#samples m = 1000) with the logistic loss
$f_\ell(z) = \sum_{i=1}^m \log(1 + \exp(-y_i z_i))$.
Design matrix $A$: dense (each example drawn from a Wishart distribution). $\lambda = 800$ chosen to roughly reproduce rank 16.
Medium scale & dense: a Newton method is used for minimizing $\phi_t(\alpha)$; it works better when the problem is poorly conditioned.
Methods compared: M-DAL (proposed), IP (interior point method [T&A 2007]), PG (projected gradient method [T&S 2008]), AG (accelerated gradient method [Ji & Ye 2009]).
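A small, numerically stable evaluation of this logistic loss and its gradient (an added sketch, assuming numpy/scipy):

```python
import numpy as np
from scipy.special import expit

def logistic_loss_and_grad(z, y):
    """f_l(z) = sum_i log(1 + exp(-y_i z_i)) and its gradient in z."""
    loss = np.sum(np.logaddexp(0.0, -y * z))
    grad = -y * expit(-y * z)      # d/dz_i log(1 + exp(-y_i z_i))
    return loss, grad
```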
20 Experiments: Comparison on synthetic experiment 2
(Figure: relative duality gap versus CPU time (s) for AG, IP, PG, and M-DAL.)
M-DAL (proposed), IP (interior point method [T&A 2007]), PG (projected gradient method [T&S 2008]), AG (accelerated gradient method [Ji & Ye 2009]).
21 Experiments: Super-linear convergence on synthetic experiment 2
(Figure: relative duality gap versus number of iterations and versus time (s).)
22 Experiments: Benchmark experiment, BCI dataset
BCI competition 2003, dataset IV. Task: predict whether the upcoming finger movement is right or left.
Original data: multichannel EEG time series (28 channels, 50 time points, 500 ms long).
Each training example is preprocessed into three matrices and whitened [T&M 2010]:
First-order (<20 Hz) component: a matrix.
Second-order alpha (7-15 Hz) component: a covariance matrix.
Second-order beta (15-30 Hz) component: a covariance matrix.
316 training examples, 100 test samples.
23 Experiments: BCI dataset, regularization path
(Figure, three panels: number of iterations, time (s), and accuracy as functions of the regularization constant, comparing M-DAL (proposed, -o-), PG [T&S 08] (--), and AG [Ji & Ye 09] (-x-).)
Stopping criterion: relative duality gap (RDG). Note: these are the costs for obtaining the entire regularization path.
24 Summary
M-DAL: a dual augmented Lagrangian algorithm for learning low-rank matrices.
Theoretically, M-DAL is shown to converge super-linearly, so it requires a small number of outer steps.
Separable regularizer + differentiable loss function: each step can be solved efficiently.
Empirically, a matrix completion problem of size 10,000 x 10,000 can be solved in roughly 5 minutes.
Training a low-rank classifier that automatically combines multiple data sources (matrices) is sped up substantially.
Future work: other AL algorithms; large-scale unstructured or badly conditioned problems.
25 Appendix: Background, convex conjugate
Convex conjugate of $f$: $f^*(y) = \sup_{x \in \mathbb{R}^n} \big( \langle y, x \rangle - f(x) \big)$.
Convex conjugate of a sum (inf-convolution): $(f + g)^*(y) = (f^* \,\square\, g^*)(y) = \inf_\alpha \big( f^*(\alpha) + g^*(y - \alpha) \big)$.
Fenchel duality: $\inf_{x \in \mathbb{R}^n} \big( f(x) + g(x) \big) = \sup_{\alpha \in \mathbb{R}^n} \big( -f^*(-\alpha) - g^*(\alpha) \big)$.
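As a worked instance of the conjugate (an added example, not from the slides), take the squared loss $f_\ell(z) = \tfrac12 \|z - y\|^2$, a natural choice for matrix completion:
$f_\ell^*(v) = \sup_z \big( \langle v, z \rangle - \tfrac12 \|z - y\|^2 \big) = \langle v, y \rangle + \tfrac12 \|v\|^2$, attained at $z = y + v$,
so that $f_\ell^*(-\alpha) = \tfrac12 \|\alpha\|^2 - \langle \alpha, y \rangle$, the term that enters the inner objective $\phi_t(\alpha)$.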
26 Appendix: Inf-convolution
Envelope function:
$\Phi_\lambda(V; W^t) = (\phi_\lambda \,\square\, \tfrac12\|\cdot\|^2)(W^t + V) = \inf_{Y \in \mathbb{R}^{R \times C}} \big( \phi_\lambda(Y) + \tfrac12 \|W^t + V - Y\|^2 \big) = \Phi_\lambda(W^t + V; 0) =: \Phi_\lambda(W^t + V)$.
(Figure: $\phi_\lambda$ and its envelope $\Phi_\lambda$, and $\phi_\lambda^*$ and $\Phi_\lambda^*$; the non-differentiable penalty is replaced by a differentiable envelope.)
27 Appendix: M-DAL algorithm for the trace norm
$W^{t+1} = \mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha^t)$, where
$\alpha^t = \mathrm{argmin}_\alpha \; f_\ell^*(-\alpha) + \tfrac{1}{2\eta_t}\|\mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha)\|^2 =: \mathrm{argmin}_\alpha \; \phi_t(\alpha)$
(the second term equals $\tfrac{1}{\eta_t}\Phi_{\lambda\eta_t}^*(W^t + \eta_t A^\top \alpha)$).
$\phi_t(\alpha)$ is differentiable:
$\nabla\phi_t(\alpha) = -\nabla f_\ell^*(-\alpha) + A\big(\mathrm{ST}_{\lambda\eta_t}(W^t + \eta_t A^\top \alpha)\big)$.
$\nabla^2\phi_t(\alpha)$ is a bit more involved but can be computed (Wright 92).
28 Appendix: Minimizing the inner objective $\phi_t(\alpha)$
$\phi_t(\alpha)$ and $\nabla\phi_t(\alpha)$ can be computed using only the singular values $\sigma_j(W^t(\alpha)) \ge \lambda\eta_t$ and the corresponding singular vectors; there is no need to compute the full SVD of $W^t(\alpha) := W^t + \eta_t A^\top \alpha$. Computation of $\nabla^2\phi_t(\alpha)$, however, requires the full SVD.
Consequence: when $A^\top \alpha$ is structured, computing the required SVD is cheap. E.g., in matrix completion $A^\top \alpha$ is sparse with only m non-zeros and $W^t(\alpha) = W^t_L W^{t\top}_R + \eta_t A^\top \alpha$; a quasi-Newton method that maintains $W^t$ in factorized form is scalable. When $A^\top \alpha$ is not structured, the full Newton method converges faster and is more stable.
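A sketch of the structured case (added example; the names and the use of scipy are assumptions, not the authors' implementation): with $W^t = L R^\top$ kept in factorized form and $A^\top \alpha$ sparse, the leading singular triplets of $W^t(\alpha)$ can be obtained from matrix-vector products alone, without forming the dense matrix:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import LinearOperator, svds

def top_svd_lowrank_plus_sparse(L, R, rows, cols, vals, k):
    """Leading k singular triplets of  L @ R.T + S  (S sparse), without forming it."""
    S = coo_matrix((vals, (rows, cols)), shape=(L.shape[0], R.shape[0])).tocsr()
    matvec = lambda x: L @ (R.T @ x) + S @ x       # (L R^T + S) x
    rmatvec = lambda x: R @ (L.T @ x) + S.T @ x    # (L R^T + S)^T x
    op = LinearOperator(S.shape, matvec=matvec, rmatvec=rmatvec, dtype=float)
    return svds(op, k=k)                           # only the top of the spectrum is needed

# toy usage: a rank-5 factor pair plus a few sparse entries
rng = np.random.default_rng(3)
L, R = rng.standard_normal((200, 5)), rng.standard_normal((150, 5))
rows, cols = rng.integers(0, 200, 1000), rng.integers(0, 150, 1000)
U, s, Vt = top_svd_lowrank_plus_sparse(L, R, rows, cols, rng.standard_normal(1000), k=10)
```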
29 Appendix: Primal and dual proximal minimization
Primal Proximal Minimization (A) (M-DAL):
$W^{t+1} = \mathrm{argmin}_W \; f_\ell(AW) + \phi_\lambda(W) + \tfrac{1}{2\eta_t}\|W - W^t\|_{\mathrm{fro}}^2$
Primal Proximal Minimization (B):
$W^{t+1} = \mathrm{argmin}_W \; \phi_\lambda(W) + f_\ell(AW) + \tfrac{1}{2\eta_t}\|W - W^t\|_{\mathrm{fro}}^2$
Dual Proximal Minimization (A) (split Bregman iteration):
$\alpha^{t+1} = \mathrm{argmin}_\alpha \; f_\ell^*(-\alpha) + \phi_\lambda^*(A^\top \alpha) + \tfrac{1}{2\eta_t}\|\alpha - \alpha^t\|^2$
Dual Proximal Minimization (B):
$\alpha^{t+1} = \mathrm{argmin}_\alpha \; \phi_\lambda^*(A^\top \alpha) + f_\ell^*(-\alpha) + \tfrac{1}{2\eta_t}\|\alpha - \alpha^t\|^2$
30 Appendix: References
Abernethy et al. A new approach to collaborative filtering: operator estimation with spectral regularization. JMLR 10, 2009.
Argyriou et al. Multi-task feature learning. In NIPS 19, 2007.
Ji & Ye. An accelerated gradient method for trace norm minimization. In ICML 2009.
Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization 14, 1976a.
Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research 1, 1976b.
Srebro et al. Maximum-margin matrix factorization. In NIPS 17, 2005.
Tomioka & Aihara. Classifying matrices with a spectral regularization. In ICML 2007.
Tomioka et al. Super-linear convergence of dual augmented-Lagrangian algorithm for sparsity regularized estimation. arXiv preprint.
Tomioka & Sugiyama. Sparse learning with duality gap guarantee. In NIPS workshop OPT 2008: Optimization for Machine Learning, 2008.
Tomioka & Müller. A regularized discriminative framework for EEG analysis with application to brain-computer interface. NeuroImage 49, 2010.
Wright. Differential equations for the analytic singular value decomposition of a matrix. Numerische Mathematik 63, 1992.