ORIE 4741: Learning with Big Messy Data. Proximal Gradient Method
ORIE 4741: Learning with Big Messy Data — Proximal Gradient Method
Professor Udell
Operations Research and Information Engineering, Cornell
November 13
Announcements

Be a TA for CS/ORIE 1380: Data Science for All!
  syllabus: roster/sp18/class/orie/1380
  apply:
Regularized empirical risk minimization

choose model by solving

    minimize  ∑_{i=1}^n l(x_i, y_i; w) + r(w)

with variable w ∈ R^d
  - parameter vector w ∈ R^d
  - loss function l : X × Y × R^d → R
  - regularizer r : R^d → R

why?
  - want to minimize the risk E_{(x,y)∼P} l(x, y; w)
  - approximate it by the empirical risk ∑_{i=1}^n l(x_i, y_i; w)
  - add a regularizer to help the model generalize
Solving regularized risk minimization

how should we fit these models?
  - with a different software package for each model?
  - with a different algorithm for each model?
  - with a general purpose optimization solver?

desiderata:
  - fast
  - flexible

we'll use the proximal gradient method
Proximal operator

define the proximal operator of the function r : R^d → R, prox_r : R^d → R^d,

    prox_r(z) = argmin_w ( r(w) + (1/2) ‖w − z‖² )

  - generalized projection: if 1_C is the indicator of the set C,
        prox_{1_C}(w) = Π_C(w)
  - implicit gradient step: if w = prox_r(z) and r is smooth,
        ∇r(w) + w − z = 0   ⟺   w = z − ∇r(w)
  - simple to evaluate: closed form solutions for many functions

more info: [Parikh Boyd 2013]
Maps from functions to functions

there is no consistent notation for maps from functions to functions. for a function f : R^d → R,
  - prox maps f to a new function prox_f : R^d → R^d;
        prox_f(x) evaluates this function at the point x
  - ∇ maps f to a new function ∇f : R^d → R^d;
        ∇f(x) evaluates this function at the point x
  - ∂/∂x maps f to a new function ∂f/∂x : R^d → R^d;
        ∂f/∂x |_{x = x̄} evaluates this function at the point x̄
  - this last one has the most confusing notation of all...
Maps from functions to functions

in a nice programming language, you can write something like

    prox(f)(x)    or    prox(f, x)
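To make the prox(f)(x) idea concrete, here is a minimal sketch in Python (my own illustration, not the course's code): a generic prox built as a higher-order function on top of a numerical solver, assuming f is smooth enough for scipy's default method to handle.

    import numpy as np
    from scipy.optimize import minimize

    def prox(f):
        """Return the function z -> argmin_w f(w) + (1/2)||w - z||^2 (computed numerically)."""
        def prox_f(z):
            z = np.asarray(z, dtype=float)
            obj = lambda w: f(w) + 0.5 * np.sum((w - z) ** 2)
            return minimize(obj, x0=z).x  # generic solver; fine for smooth f
        return prox_f

    # usage: prox of r(w) = ||w||_2^2 at z, which should equal z / 3
    z = np.array([3.0, -6.0])
    print(prox(lambda w: np.sum(w ** 2))(z))   # approximately [1., -2.]

For the specific functions on the next slide, the minimization has a closed form and no solver is needed.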
Let's evaluate some proximal operators!

define the proximal operator of the function r : R^d → R

    prox_r(z) = argmin_w ( r(w) + (1/2) ‖w − z‖² )

  - r(w) = 0                                  (identity)
  - r(w) = ∑_{i=1}^d r_i(w_i)                 (separable)
  - r(w) = ‖w‖²₂                              (shrinkage)
  - r(w) = ‖w‖₁                               (soft-thresholding)
  - r(w) = 1(w ≥ 0)                           (projection)
  - r(w) = ∑_{i=1}^{d−1} (w_{i+1} − w_i)²     (smoothing)
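Several of these have closed forms. A possible sketch of a few of them in Python (the function names are mine, and I assume the "projection" entry means the nonnegative orthant, as in the NNLS example below):

    import numpy as np

    def prox_zero(z):
        # r(w) = 0: the prox is the identity
        return z

    def prox_sq_l2(z, mu=1.0):
        # r(w) = mu*||w||_2^2: shrinkage toward zero, prox(z) = z / (1 + 2*mu)
        return z / (1.0 + 2.0 * mu)

    def prox_l1(z, mu=1.0):
        # r(w) = mu*||w||_1: soft-thresholding, applied elementwise
        return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

    def prox_nonneg(z):
        # r(w) = 1(w >= 0): projection onto the nonnegative orthant
        return np.maximum(z, 0.0)

Note that a separable r(w) = ∑_i r_i(w_i) proxes coordinate-wise, which is why all of these act elementwise.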
Proximal gradient method

want to solve

    minimize  l(w) + r(w)

  - l : R^d → R smooth
  - r : R^d → R with a fast prox operator

proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. repeat

    w_{t+1} = prox_{α_t r}( w_t − α_t ∇l(w_t) )

complexity:
  - O(nd) to evaluate the gradient of the loss function
  - O(d) to prox and to update
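As a sketch of how the generic method might look in code (my own illustration; the gradient, prox, step size, and iteration count are whatever the problem supplies):

    import numpy as np

    def proximal_gradient(grad_l, prox_r, w0, alpha=1e-2, iters=500):
        """Minimize l(w) + r(w), given grad_l(w) and prox_r(z, t) = prox_{t*r}(z)."""
        w = np.array(w0, dtype=float)
        for t in range(iters):
            w = prox_r(w - alpha * grad_l(w), alpha)   # gradient step on l, prox step on r
        return w

The examples that follow (NNLS, lasso) are instances of this loop with specific choices of grad_l and prox_r.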
Example: NNLS

want to solve

    minimize  (1/2) ‖y − Xw‖² + 1(w ≥ 0)

recall
  - ∇( (1/2) ‖y − Xw‖² ) = −X^T (y − Xw)
  - prox_{1(· ≥ 0)}(w) = max(0, w)

proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. for t = 0, 1, ...

    w_{t+1} = max(0, w_t + α_t X^T (y − Xw_t))

note: this is not the same as max(0, (X^T X)^{−1} X^T y)
Example: NNLS

proximal gradient method for NNLS. pick a step size sequence {α_t} and w_0 ∈ R^d. for t = 0, 1, ...
  - compute g_t = X^T (y − Xw_t)          (O(nd) flops)
  - w_{t+1} = max(0, w_t + α_t g_t)       (O(d) flops)

O(nd) flops per iteration
Example: NNLS

option: do work up front to reduce per-iteration complexity

proximal gradient method for NNLS. pick a step size sequence {α_t} and w_0 ∈ R^d.
form b = X^T y (O(dn) flops) and G = X^T X (O(nd²) flops). for t = 0, 1, ...

    w_{t+1} = max(0, w_t + α_t (b − Gw_t))      (O(d²) flops)

O(nd²) flops to begin, O(d²) flops per iteration

note: can compute b and G in parallel
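A sketch of this precomputed variant (illustrative, not the course's reference code); here I assume a fixed step size α = 1/‖X‖², which is 1/L for the (1/2)-scaled loss, cf. the Lipschitz slide below:

    import numpy as np

    def nnls_prox_grad(X, y, iters=1000):
        """Nonnegative least squares, (1/2)||y - Xw||^2 + 1(w >= 0), by proximal gradient."""
        n, d = X.shape
        b = X.T @ y                                # O(nd) flops, once
        G = X.T @ X                                # O(n d^2) flops, once
        alpha = 1.0 / np.linalg.norm(X, 2) ** 2    # 1/sigma_max(X)^2, a safe fixed step size
        w = np.zeros(d)
        for t in range(iters):
            w = np.maximum(0.0, w + alpha * (b - G @ w))   # O(d^2) flops per iteration
        return w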
Example: Lasso

want to solve

    minimize  (1/2) ‖y − Xw‖² + λ ‖w‖₁

recall
  - ∇( (1/2) ‖y − Xw‖² ) = −X^T (y − Xw)
  - prox_{µ‖·‖₁}(w) = s_µ(w), the soft-thresholding operator, where

        (s_µ(w))_i =  w_i − µ    if w_i ≥ µ
                      0          if |w_i| ≤ µ
                      w_i + µ    if w_i ≤ −µ

proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d.
form b = X^T y (O(dn) flops) and G = X^T X (O(nd²) flops). for t = 0, 1, ...

    w_{t+1} = s_{α_t λ}( w_t + α_t (b − Gw_t) )      (O(d²) flops)

notice: the hard part (O(d²)) is computing the gradient...!
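In the same style, a possible sketch of the lasso iteration with the precomputed Gram matrix (my illustration; names and the fixed step size are my choices):

    import numpy as np

    def soft_threshold(z, mu):
        # prox of mu*||.||_1, applied elementwise
        return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

    def lasso_prox_grad(X, y, lam, iters=1000):
        """Minimize (1/2)||y - Xw||^2 + lam*||w||_1 by proximal gradient."""
        n, d = X.shape
        b, G = X.T @ y, X.T @ X
        alpha = 1.0 / np.linalg.norm(X, 2) ** 2
        w = np.zeros(d)
        for t in range(iters):
            w = soft_threshold(w + alpha * (b - G @ w), alpha * lam)
        return w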
Convergence

two questions to ask:
  - will the iteration ever stop?
  - what kind of point will it stop at?

if the iteration stops, we say it has converged
Convergence: what kind of point will it stop at?

let's suppose r is differentiable.¹ if we find w so that

    w = prox_{α_t r}( w − α_t ∇l(w) )

then

    w = argmin_{w'} ( α_t r(w') + (1/2) ‖w' − (w − α_t ∇l(w))‖² )
    0 = α_t ∇r(w) + w − w + α_t ∇l(w) = α_t ( ∇r(w) + ∇l(w) )

so the gradient of the objective is 0. if l and r are convex, that means w minimizes l + r.

¹ take Convex Optimization for the proof for non-differentiable r
Convergence: will it stop?

definitions:
  - p⋆ = inf_w l(w) + r(w)

assumptions:
  - the loss function is continuously differentiable with L-Lipschitz gradient:

        l(w') ≤ l(w) + ⟨∇l(w), w' − w⟩ + (L/2) ‖w' − w‖²

  - for simplicity, consider a constant step size α_t = α
Proximal point method converges

prove it for l = 0 first (aka the proximal point method). for any t = 0, 1, ...,

    w_{t+1} = argmin_w ( α r(w) + (1/2) ‖w − w_t‖² )

so in particular,

    α r(w_{t+1}) + (1/2) ‖w_{t+1} − w_t‖²  ≤  α r(w_t) + (1/2) ‖w_t − w_t‖²
    (1/(2α)) ‖w_{t+1} − w_t‖²  ≤  r(w_t) − r(w_{t+1})

now add up these inequalities for t = 0, 1, ..., T:

    (1/(2α)) ∑_{t=0}^T ‖w_{t+1} − w_t‖²  ≤  ∑_{t=0}^T ( r(w_t) − r(w_{t+1}) )  =  r(w_0) − r(w_{T+1})  ≤  r(w_0) − p⋆

it converges!
Proximal gradient method converges (I)

now prove it for l ≠ 0. for any t = 0, 1, ...,

    w_{t+1} = argmin_w ( α r(w) + (1/2) ‖w − (w_t − α ∇l(w_t))‖² )

so in particular,

    r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − (w_t − α ∇l(w_t))‖²  ≤  r(w_t) + (1/(2α)) ‖w_t − (w_t − α ∇l(w_t))‖²
    r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖² + (α/2) ‖∇l(w_t)‖² + ⟨∇l(w_t), w_{t+1} − w_t⟩  ≤  r(w_t) + (α/2) ‖∇l(w_t)‖²
    r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖² + ⟨∇l(w_t), w_{t+1} − w_t⟩  ≤  r(w_t)
Proximal gradient method converges (II)

now use

    l(w') ≤ l(w) + ⟨∇l(w), w' − w⟩ + (L/2) ‖w' − w‖²

with w' = w_{t+1}, w = w_t:

    l(w_{t+1}) + r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖² + ⟨∇l(w_t), w_{t+1} − w_t⟩
        ≤ l(w_t) + r(w_t) + ⟨∇l(w_t), w_{t+1} − w_t⟩ + (L/2) ‖w_{t+1} − w_t‖²
    l(w_{t+1}) + r(w_{t+1}) + (1/(2α)) ‖w_{t+1} − w_t‖²  ≤  l(w_t) + r(w_t) + (L/2) ‖w_{t+1} − w_t‖²
    ( 1/(2α) − L/2 ) ‖w_{t+1} − w_t‖²  ≤  l(w_t) + r(w_t) − ( l(w_{t+1}) + r(w_{t+1}) )
Proximal gradient method converges (III)

now add up these inequalities for t = 0, 1, ..., T:

    ( 1/(2α) − L/2 ) ∑_{t=0}^T ‖w_{t+1} − w_t‖²
        ≤ ∑_{t=0}^T ( l(w_t) + r(w_t) − ( l(w_{t+1}) + r(w_{t+1}) ) )
        ≤ l(w_0) + r(w_0) − p⋆

if 1/(2α) − L/2 ≥ 0, i.e., α ≤ 1/L, it converges!
Lipschitz?

the loss function is differentiable with L-Lipschitz gradient:

    l(w') ≤ l(w) + ⟨∇l(w), w' − w⟩ + (L/2) ‖w' − w‖²

what is L? for l(w) = ‖Xw − y‖²,

    l(w') = ‖Xw' − y‖²
          = ‖X(w' − w) + Xw − y‖²
          = ‖Xw − y‖² + 2 (Xw − y)^T X (w' − w) + ‖X(w' − w)‖²
          = l(w) + ⟨∇l(w), w' − w⟩ + ‖X(w' − w)‖²
          ≤ l(w) + ⟨∇l(w), w' − w⟩ + ‖X‖² ‖w' − w‖²

so L = 2‖X‖², where ‖X‖ is the maximum singular value of X
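A quick way to compute this constant and the corresponding safe step size, assuming numpy (for the (1/2)-scaled loss used in the NNLS and lasso examples, the factor of 2 drops out):

    import numpy as np

    X = np.random.randn(100, 10)
    L = 2 * np.linalg.norm(X, 2) ** 2   # 2 * sigma_max(X)^2, for l(w) = ||Xw - y||^2
    alpha = 1.0 / L                     # a step size guaranteed to converge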
Speeding it up

recall: computing the gradient is slow

idea: approximate the gradient!

a stochastic gradient ∇̃l(w) is a random variable with

    E[∇̃l(w)] = ∇l(w)
Stochastic gradient: examples

a stochastic gradient obeys E[∇̃l(w)] = ∇l(w).

examples: for l(w) = ∑_{i=1}^n (y_i − w^T x_i)²,

  - single stochastic gradient. pick a random example i. set

        ∇̃l(w) = ∇( n (y_i − w^T x_i)² ) = −2n (y_i − w^T x_i) x_i

  - minibatch stochastic gradient. pick a random set of examples S. set

        ∇̃l(w) = ∇( (n/|S|) ∑_{i∈S} (y_i − w^T x_i)² ) = −(2n/|S|) ∑_{i∈S} (y_i − w^T x_i) x_i

    (here |S| is the number of elements in the set S.)
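As a quick sanity check of the unbiasedness claim (a sketch of my own, not from the slides), one can average many single-example stochastic gradients and compare against the full gradient:

    import numpy as np

    np.random.seed(0)
    n, d = 50, 5
    X, y, w = np.random.randn(n, d), np.random.randn(n), np.random.randn(d)

    full_grad = -2 * X.T @ (y - X @ w)   # gradient of sum_i (y_i - w^T x_i)^2

    idx = np.random.randint(n, size=100000)                       # sampled example indices
    stoch = -2 * n * (y[idx] - X[idx] @ w)[:, None] * X[idx]      # one stochastic gradient per row
    approx = stoch.mean(axis=0)
    print(np.linalg.norm(approx - full_grad) / np.linalg.norm(full_grad))  # small (average approaches the full gradient)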
Stochastic proximal gradient

stochastic proximal gradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. repeat:
  - pick i ∈ {1, ..., n} uniformly at random
  - w_{t+1} = prox_{α_t r}( w_t + α_t 2n (y_i − w_t^T x_i) x_i )      (O(d) flops)

per-iteration complexity: O(d)!
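A sketch of the full loop for the lasso, as one concrete instance (my own illustration; I use the (1/2)-scaled loss from the lasso slide, so the factor is n rather than 2n, and a decreasing step size, per the convergence discussion below):

    import numpy as np

    def stochastic_prox_grad_lasso(X, y, lam, iters=10000, alpha0=1e-3):
        """Minimize (1/2)||y - Xw||^2 + lam*||w||_1 with single-example stochastic gradients."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, iters + 1):
            alpha = alpha0 / t                        # decreasing step size
            i = np.random.randint(n)                  # pick an example uniformly at random
            g = -n * (y[i] - X[i] @ w) * X[i]         # unbiased estimate of the full gradient
            z = w - alpha * g
            w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)   # soft-thresholding prox
        return w

Each iteration touches a single row of X, so the cost is O(d) rather than O(nd).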
Extensions

next lecture, we'll see loss functions that are not differentiable
  - can still use the proximal gradient method!
  - need to define the subgradient
Subgradient

define the subgradient of a convex function f : R^d → R at x:

    ∂f(x) = { g : ∀y, f(y) ≥ f(x) + ⟨g, y − x⟩ }

  - the subgradient ∂f maps points to sets!
  - it follows the chain rule: if f = h ∘ g, h : R → R, and g : R^d → R is differentiable,

        ∂f(x) = ∂h(g(x)) ∇g(x)
Subgradient

for f : R → R convex, here's a simpler equivalent condition:
  - if f is differentiable at x, ∂f(x) = { f'(x) }
  - if f is not differentiable at x, it will still be differentiable just to the left and the right of x,² so let

        g₊ = lim_{ε↓0} f'(x + ε)        g₋ = lim_{ε↓0} f'(x − ε)

    ∂f(x) is any convex combination (i.e., any weighted average) of those gradients:

        ∂f(x) = { α g₊ + (1 − α) g₋ : α ∈ [0, 1] }

² (because a convex function is differentiable almost everywhere)
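For example, applying this recipe to the absolute value (a standard example, not on the slide): for f(x) = |x| at x = 0,

    g₊ = lim_{ε↓0} f'(0 + ε) = 1,    g₋ = lim_{ε↓0} f'(0 − ε) = −1,
    so ∂f(0) = { α·1 + (1 − α)·(−1) : α ∈ [0, 1] } = [−1, 1]

and at any x ≠ 0, ∂f(x) = { sign(x) }.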
Proximal subgradient method

proximal subgradient method. pick a step size sequence {α_t} and w_0 ∈ R^d. repeat:
  - pick any g ∈ ∂l(w_t)
  - w_{t+1} = prox_{α_t r}( w_t − α_t g )
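A possible sketch for one concrete nonsmooth loss (robust regression with an ℓ1 loss and ℓ1 regularizer; this particular problem and the step size schedule are my own illustrative choices, not from the slides):

    import numpy as np

    def prox_subgradient(X, y, lam, iters=5000, alpha0=1e-2):
        """Minimize ||Xw - y||_1 + lam*||w||_1 with the proximal subgradient method."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, iters + 1):
            alpha = alpha0 / np.sqrt(t)
            g = X.T @ np.sign(X @ w - y)                                 # a subgradient of ||Xw - y||_1
            z = w - alpha * g
            w = np.sign(z) * np.maximum(np.abs(z) - alpha * lam, 0.0)    # prox of alpha*lam*||.||_1
        return w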
Convergence for stochastic proximal (sub)gradient

pick your poison:
  - stochastic (sub)gradient, fixed step size α_t = α: iterates converge quickly, then wander within a small ball
  - stochastic (sub)gradient, decreasing step size α_t = 1/t: iterates converge slowly to the solution
  - minibatch stochastic (sub)gradient with increasing minibatch size, fixed step size α_t = α: iterates converge quickly to the solution, but later iterations take (much) longer

proofs: [Bertsekas, 2010]
conditions: l is convex, subdifferentiable, and Lipschitz continuous, and r is convex and Lipschitz continuous where it is finite, or all iterates are bounded
Proximal subgradient method

Q: Why can't we just use gradient descent to solve all our problems?
A: Because some regularizers and loss functions aren't differentiable!

Q: Why can't we just use subgradient descent to solve all our problems?
A: Because some of our regularizers don't even have subgradients defined everywhere. (e.g., the indicator 1(· ≥ 0))
References

  - Vandenberghe: lecture on the proximal gradient method. lectures/proxgrad.pdf
  - Yin: lecture on the proximal method. slides/lec05_proximaloperatordual.pdf
  - Parikh and Boyd: paper on proximal algorithms. https://stanford.edu/~boyd/papers/pdf/prox_algs.pdf
  - Bertsekas: convergence proofs for every proximal-gradient-style method you can dream of.