Convex Optimization Algorithms for Machine Learning in 10 Slides

Convex Optimization Algorithms for Machine Learning in 10 Slides
Presenter: Jul. 15, 2015

Outline
1. Quadratic Problem → Linear System
2. Smooth Problem → Newton-CG
3. Composite Problem → Proximal-Newton-CD
4. Non-smooth, Non-separable → Augmented Lagrangian Method

Quadratic Problem

Problem: min_w f(w) = (1/2) w^T H w + g^T w + c

Example: min_w f(w) = (1/2) ||y − Xw||^2 + (1/2) ||w||^2    (1)

Solution: solve the linear system ∇f(w) = 0, i.e., Hw = −g    (2)

How to solve? In the example, H = X^T X + I and g = −X^T y, with X of size n × d. Solving the linear system directly requires O(d^3), while a Hessian-vector product Hv = X^T (Xv) + v only needs O(nnz(X)). Conjugate Gradient (CG) produces a reasonable solution using only a few iterations, each requiring one Hessian-vector product.
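
As a concrete illustration of the slide above, here is a minimal NumPy sketch (mine, not from the slides) that solves Hw = −g for the ridge example with conjugate gradient, using only Hessian-vector products X^T(Xv) + v so that H = X^T X + I is never formed explicitly; all function and variable names are my own.

```python
import numpy as np

def cg_ridge(X, y, lam=1.0, max_iter=50, tol=1e-8):
    """Solve (X^T X + lam*I) w = X^T y by conjugate gradient.

    Only Hessian-vector products are used, so the d x d matrix
    H = X^T X + lam*I is never formed explicitly.
    """
    n, d = X.shape
    Hv = lambda v: X.T @ (X @ v) + lam * v   # costs O(nnz(X)) per product
    b = X.T @ y                              # right-hand side (= -g)

    w = np.zeros(d)
    r = b - Hv(w)                            # residual
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Hp = Hv(p)
        step = rs / (p @ Hp)
        w += step * p
        r -= step * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return w
```

A handful of iterations typically already gives a usable w, which is the point the slide makes about CG versus an O(d^3) direct solve.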

Smooth Problem: Newton-CG

Problem: min_w f(w), where ∇f(w) and ∇²f(w) are continuous.

Ex. min_w f(w) = Σ_{i=1}^n L(w^T x_i, y_i) + (1/2) ||w||^2    (3)

where L(z, y) = ln(1 + exp(−yz)) is the logistic loss [1].

Newton-CG: at each iteration t, solve a quadratic approximation to find the Newton direction Δw_nt,

Δw_nt = arg min_{Δw} (1/2) Δw^T H_t Δw + g_t^T Δw + f(w_t),    (4)

then do a line search to find the step size η_t (w_{t+1} = w_t + η_t Δw_nt).

In (4), H_t = X^T D X + I, where D is a diagonal matrix with D_ii = L''(w_t^T x_i, y_i), so a Hessian-vector product again costs only O(nnz(X)).

[1] Lin et al. Trust region Newton method for logistic regression. ICML 2007.
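
Below is a minimal sketch (my own, not the slides' code or the implementation of [1]) of one Newton-CG iteration for the L2-regularized logistic regression in (3): the system H_t Δw = −g_t is solved matrix-free with SciPy's CG via a LinearOperator, and a simple backtracking (Armijo) line search picks η_t. It assumes labels y ∈ {−1, +1}.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg
from scipy.special import expit

def fun_grad(w, X, y):
    """f(w) = sum_i log(1 + exp(-y_i w^T x_i)) + 0.5 ||w||^2 and its gradient."""
    z = y * (X @ w)
    f = np.sum(np.logaddexp(0.0, -z)) + 0.5 * (w @ w)
    sigma = expit(-z)                            # sigma(-z) = 1 / (1 + exp(z))
    grad = X.T @ (-y * sigma) + w
    return f, grad

def newton_cg_step(w, X, y, cg_iters=50):
    """One Newton-CG iteration with a backtracking line search (illustrative sketch)."""
    d = X.shape[1]
    f, grad = fun_grad(w, X, y)

    sigma = expit(-y * (X @ w))
    D = sigma * (1.0 - sigma)                    # D_ii = L''(w_t^T x_i, y_i)
    Hv = lambda v: X.T @ (D * (X @ v)) + v       # Hessian-vector product, O(nnz(X))
    H = LinearOperator((d, d), matvec=Hv)

    dw, _ = cg(H, -grad, maxiter=cg_iters)       # approximate Newton direction

    eta = 1.0                                    # backtracking (Armijo) line search
    while fun_grad(w + eta * dw, X, y)[0] > f + 1e-4 * eta * (grad @ dw):
        eta *= 0.5
    return w + eta * dw
```

Repeating newton_cg_step until the gradient norm is small gives the outer loop described on the slide.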

Outline
1. Quadratic Problem → Linear System
2. Smooth Problem → Newton-CG
3. Composite Problem → Proximal-Newton-CD
4. Non-smooth, Non-separable → Augmented Lagrangian Method

Composite Problem

Problem: min_w f(w) + h(w), where f(w) is smooth and h(w) is non-smooth but separable w.r.t. atoms.

Ex. LASSO / L1-regularized logistic regression [2]:
min_w Σ_{i=1}^n L(w^T x_i, y_i) + λ ||w||_1    (4)

Ex. Dual of SVM [3]:
min_α (1/2) α^T Q α − Σ_{i=1}^n α_i   s.t. 0 ≤ α_i ≤ C, i = 1, ..., n    (5)

Ex. Matrix Completion [4]:
min_W (1/2) Σ_{(i,j)∈Ω} (A_ij − W_ij)^2 + λ ||W||_*    (6)

[2] Yuan et al. An improved GLMNET for L1-regularized logistic regression. JMLR 2012.
[3] Hsieh et al. A dual coordinate descent method for large-scale linear SVM. ICML 2008.
[4] Hsieh et al. Nuclear norm minimization via active subspace selection. ICML 2014.
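
For the SVM dual (5), the atoms are the individual α_i, and updating one at a time already gives a practical solver. The snippet below is a simplified sketch in the spirit of the dual coordinate descent method cited in [3] (my own simplification: no bias term, labels y ∈ {−1, +1}); it maintains w = Σ_i α_i y_i x_i so each coordinate update is cheap.

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, epochs=10):
    """Simplified dual coordinate descent for the linear SVM dual in (5)."""
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)                           # w = sum_i alpha_i y_i x_i
    Qii = np.einsum('ij,ij->i', X, X)         # diagonal of Q: x_i^T x_i
    for _ in range(epochs):
        for i in np.random.permutation(n):
            if Qii[i] == 0.0:
                continue
            G = y[i] * (X[i] @ w) - 1.0       # partial gradient w.r.t. alpha_i
            new_ai = np.clip(alpha[i] - G / Qii[i], 0.0, C)
            w += (new_ai - alpha[i]) * y[i] * X[i]   # keep w consistent
            alpha[i] = new_ai
    return w, alpha
```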

Composite Problem

Problem: min_w f(w) + h(w), where f(w) is smooth and h(w) is non-smooth but separable w.r.t. atoms.

Insight: if f(w) is an atomic quadratic function, the composite problem is easy to solve. Ex.
softthd(−b/a, λ/a) = arg min_x (a/2) x^2 + b x + λ |x|,   where softthd(z, τ) = sign(z) max(|z| − τ, 0).
(Google "proximal operator for ..." to find the formula you need.)

Proximal-Newton-CD:
1. Construct a local quadratic approximation q(Δw; w_t). Solve
Δw = arg min_{Δw} q(Δw; w_t) + h(Δw + w_t)    (7)
via Coordinate Descent (optimize w.r.t. one atom at a time).
2. Do a line search to find η_t and set w_{t+1} = w_t + η_t Δw.
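
To make the insight concrete, here is a tiny sketch (mine, not from the slides) of the soft-thresholding operator and the scalar proximal step above, together with a brute-force grid check that it really minimizes (a/2)x^2 + bx + λ|x|.

```python
import numpy as np

def softthd(z, tau):
    """Soft-thresholding: sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_scalar(a, b, lam):
    """argmin_x (a/2) x^2 + b x + lam |x|, for a > 0."""
    return softthd(-b / a, lam / a)

# Sanity check against a dense grid (illustration only).
a, b, lam = 2.0, 1.5, 0.7
xs = np.linspace(-5.0, 5.0, 200001)
obj = 0.5 * a * xs**2 + b * xs + lam * np.abs(xs)
assert abs(xs[np.argmin(obj)] - prox_scalar(a, b, lam)) < 1e-3
```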

Composite Problem

Problem: min_w f(w) + h(w), where f(w) is smooth and h(w) is non-smooth but separable w.r.t. atoms.

Proximal-Newton-CD:
1. Construct a local quadratic approximation q(Δw; w_t). Solve
Δw = arg min_{Δw} q(Δw; w_t) + h(Δw + w_t)    (8)
via Coordinate Descent (optimize w.r.t. one atom at a time).
2. Do a line search to find η_t and set w_{t+1} = w_t + η_t Δw.

Key to efficiency: whether ∇q(Δw) = H_t Δw + g_t can be maintained efficiently after each coordinate update. What if not? (e.g., multiclass, CRF)

Prox-Quasi-Newton: replace H_t with a low-rank approximation B_t constructed from the historical gradients ∇f(w_1), ..., ∇f(w_{t−1}) [5].

[5] Zhong et al. Proximal Quasi-Newton for Computationally Intensive L1-regularized M-estimators. NIPS 2014.
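
The sketch below (my own illustration, not the slides' code) shows the inner coordinate-descent loop in the special case where f is itself the least-squares loss, so the subproblem (8) is simply the Lasso. The residual r = Xw − y is maintained incrementally after every coordinate update, which is exactly the "key to efficiency" point above: each update costs O(n) (or O(nnz(x_j)) for sparse X) instead of recomputing the gradient from scratch.

```python
import numpy as np

def softthd(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def cd_lasso(X, y, lam, epochs=20):
    """Coordinate descent for min_w 0.5*||Xw - y||^2 + lam*||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    r = X @ w - y                      # residual, kept up to date
    col_sq = (X ** 2).sum(axis=0)      # a_j = x_j^T x_j
    for _ in range(epochs):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            old = w[j]
            # one-dimensional problem in w_j: (a_j/2) x^2 + b x + lam |x|
            w[j] = softthd(old - (X[:, j] @ r) / col_sq[j], lam / col_sq[j])
            if w[j] != old:
                r += (w[j] - old) * X[:, j]     # incremental gradient maintenance
    return w
```

In the general Proximal-Newton-CD setting, the maintained quantity is H_t Δw + g_t instead of the residual; when that product cannot be updated cheaply (multiclass, CRF), the Prox-Quasi-Newton variant above substitutes a low-rank B_t.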

Outline
1. Quadratic Problem → Linear System
2. Smooth Problem → Newton-CG
3. Composite Problem → Proximal-Newton-CD
4. Non-smooth, Non-separable → Augmented Lagrangian Method

Non-smooth, Non-separable Problem

What if the non-smooth function is non-separable?

Linear Program:
min_{x, ξ ≥ 0} c^T x   s.t. Ax + ξ = b    (9)

Robust PCA:
min_{L, S} ||L||_* + λ ||S||_1   s.t. L + S = X    (10)

Reduce it to a composite problem by the Augmented Lagrangian Method!

Non-smooth, Non-separable Problem

What if the non-smooth function is non-separable?

Linear Program:
min_{x, ξ ≥ 0} c^T x   s.t. Ax + ξ = b    (11)

Min-max of the Lagrangian (dual variable α):
max_α min_{x, ξ ≥ 0} c^T x + α^T (Ax + ξ − b)    (12)

Augmented Lagrangian Method:
(x, ξ) = arg min_{x, ξ ≥ 0} c^T x + α_t^T (Ax + ξ − b) + (1/2) ||Ax + ξ − b||^2
α_{t+1} = α_t + (Ax + ξ − b)    (13)
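
A compact sketch of iteration (13) for the LP above (my own illustration; the inner projected-gradient solver and all parameter names are assumptions, and I assume both x and ξ are constrained to be nonnegative):

```python
import numpy as np

def alm_lp(c, A, b, outer_iters=100, inner_iters=200, rho=1.0):
    """Augmented Lagrangian method for min c^T x s.t. Ax + xi = b, x >= 0, xi >= 0."""
    m, d = A.shape
    x, xi = np.zeros(d), np.zeros(m)
    alpha = np.zeros(m)
    # conservative step size from the Lipschitz constant of the quadratic term
    step = 1.0 / (rho * (np.linalg.norm(A, 2) + 1.0) ** 2)
    for _ in range(outer_iters):
        for _ in range(inner_iters):            # approximately solve the subproblem
            r = A @ x + xi - b                  # constraint residual
            gx = c + A.T @ (alpha + rho * r)    # gradient of aug. Lagrangian in x
            gxi = alpha + rho * r               # gradient in xi
            x = np.maximum(x - step * gx, 0.0)  # projected gradient steps
            xi = np.maximum(xi - step * gxi, 0.0)
        alpha = alpha + rho * (A @ x + xi - b)  # dual update, as in (13)
    return x, xi, alpha
```

Each outer iteration is a composite problem (a smooth quadratic plus the nonnegativity constraints), which is what the reduction on the slide buys: any of the earlier composite solvers could replace the simple projected-gradient inner loop used here.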