Optimization methods
1 Optimization methods Optimization-Based Data Analysis Carlos Fernandez-Granda 2/8/2016
2 Introduction Aim: overview of optimization methods that tend to scale well with the problem dimension, are widely used in machine learning and signal processing, and are (reasonably) well understood theoretically
3 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
4 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
5 Gradient Direction of maximum variation
6 Gradient descent (aka steepest descent) Method to solve the optimization problem minimize $f(x)$, where $f$ is differentiable. Gradient-descent iteration: $x^{(0)} =$ arbitrary initialization, $x^{(k+1)} = x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right)$, where $\alpha_k$ is the step size
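A minimal sketch of this iteration in Python (the quadratic objective, step size, and iteration count below are illustrative choices, not taken from the slides):

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, n_iters=1000):
    """Gradient descent with a constant step size alpha_k = step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad_f(x)  # x^(k+1) = x^(k) - alpha * grad f(x^(k))
    return x

# Example: minimize f(x) = 0.5 * ||Ax - y||^2, whose gradient is A^T (Ax - y)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, -1.0])
x_hat = gradient_descent(lambda x: A.T @ (A @ x - y), x0=np.zeros(2), step=0.05)
```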
7 Gradient descent (1D)
8 Gradient descent (2D)
9 Small step size
10 Large step size
11 Line search Exact: $\alpha_k := \arg\min_{\beta \geq 0} f\left(x^{(k)} - \beta \nabla f\left(x^{(k)}\right)\right)$ Backtracking (Armijo rule): Given $\alpha_0 \geq 0$ and $\beta \in (0, 1)$, set $\alpha_k := \alpha_0 \beta^i$ for the smallest $i$ such that $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right) - \frac{1}{2} \alpha_k \left\| \nabla f\left(x^{(k)}\right) \right\|_2^2$
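A sketch of the backtracking rule, assuming the sufficient-decrease test above (the function names and default parameters are illustrative):

```python
import numpy as np

def backtracking_step(f, grad_f, x, alpha0=1.0, beta=0.5):
    """Armijo backtracking: shrink alpha0 by beta until the decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    # Accept alpha once f(x - alpha*g) <= f(x) - 0.5 * alpha * ||g||^2
    while f(x - alpha * g) > f(x) - 0.5 * alpha * np.dot(g, g):
        alpha *= beta
    return alpha
```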
12 Backtracking line search
13 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
14 Lipschitz continuity A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is Lipschitz continuous with Lipschitz constant $L$ if for any $x, y \in \mathbb{R}^n$ $\left\| f(y) - f(x) \right\|_2 \leq L \left\| y - x \right\|_2$ Example: $f(x) := Ax$ is Lipschitz continuous with $L = \sigma_{\max}(A)$
15 Quadratic upper bound If the gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is Lipschitz continuous with constant $L$, $\left\| \nabla f(y) - \nabla f(x) \right\|_2 \leq L \left\| y - x \right\|_2$, then for any $x, y \in \mathbb{R}^n$ $f(y) \leq f(x) + \nabla f(x)^T (y - x) + \frac{L}{2} \left\| y - x \right\|_2^2$
16 Consequence of quadratic bound Since $x^{(k+1)} = x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right)$, $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right) - \alpha_k \left(1 - \frac{\alpha_k L}{2}\right) \left\| \nabla f\left(x^{(k)}\right) \right\|_2^2$ If $\alpha_k \leq \frac{1}{L}$ the value of the function always decreases! $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right) - \frac{\alpha_k}{2} \left\| \nabla f\left(x^{(k)}\right) \right\|_2^2$
17 Gradient descent with constant step size Conditions: $f$ is convex, $\nabla f$ is $L$-Lipschitz continuous, and there exists a solution $x^*$ such that $f(x^*)$ is finite. If $\alpha_k = \alpha \leq \frac{1}{L}$, $f\left(x^{(k)}\right) - f(x^*) \leq \frac{\left\| x^{(0)} - x^* \right\|_2^2}{2 \alpha k}$ We need $O\left(\frac{1}{\epsilon}\right)$ iterations to get an $\epsilon$-optimal solution
18 Proof Recall that if $\alpha \leq \frac{1}{L}$, $f\left(x^{(i)}\right) \leq f\left(x^{(i-1)}\right) - \frac{\alpha}{2} \left\| \nabla f\left(x^{(i-1)}\right) \right\|_2^2$ By the first-order characterization of convexity $f\left(x^{(i-1)}\right) - f(x^*) \leq \nabla f\left(x^{(i-1)}\right)^T \left(x^{(i-1)} - x^*\right)$ This implies $f\left(x^{(i)}\right) - f(x^*) \leq \nabla f\left(x^{(i-1)}\right)^T \left(x^{(i-1)} - x^*\right) - \frac{\alpha}{2} \left\| \nabla f\left(x^{(i-1)}\right) \right\|_2^2 = \frac{1}{2\alpha} \left( \left\| x^{(i-1)} - x^* \right\|_2^2 - \left\| x^{(i-1)} - x^* - \alpha \nabla f\left(x^{(i-1)}\right) \right\|_2^2 \right) = \frac{1}{2\alpha} \left( \left\| x^{(i-1)} - x^* \right\|_2^2 - \left\| x^{(i)} - x^* \right\|_2^2 \right)$
19 Proof Because the value of $f$ never increases, $f\left(x^{(k)}\right) - f(x^*) \leq \frac{1}{k} \sum_{i=1}^{k} \left( f\left(x^{(i)}\right) - f(x^*) \right) \leq \frac{1}{2 \alpha k} \left( \left\| x^{(0)} - x^* \right\|_2^2 - \left\| x^{(k)} - x^* \right\|_2^2 \right) \leq \frac{\left\| x^{(0)} - x^* \right\|_2^2}{2 \alpha k}$
20 Backtracking line search If the gradient of $f : \mathbb{R}^n \to \mathbb{R}$ is Lipschitz continuous with constant $L$, the step size in the backtracking line search satisfies $\alpha_k \geq \alpha_{\min} := \min\left\{ \alpha_0, \frac{\beta}{L} \right\}$
21 Proof Line search ends when $f\left(x^{(k+1)}\right) \leq f\left(x^{(k)}\right) - \frac{\alpha_k}{2} \left\| \nabla f\left(x^{(k)}\right) \right\|_2^2$, but we know that this holds whenever $\alpha_k \leq \frac{1}{L}$. This happens as soon as $\beta / L \leq \alpha_0 \beta^i \leq 1/L$
22 Gradient descent with backtracking Under the same conditions as before, gradient descent with backtracking line search achieves $f\left(x^{(k)}\right) - f(x^*) \leq \frac{\left\| x^{(0)} - x^* \right\|_2^2}{2 \alpha_{\min} k}$ We need $O\left(\frac{1}{\epsilon}\right)$ iterations to get an $\epsilon$-optimal solution
23 Strong convexity A function $f : \mathbb{R}^n \to \mathbb{R}$ is $S$-strongly convex if for any $x, y \in \mathbb{R}^n$ $f(y) \geq f(x) + \nabla f(x)^T (y - x) + \frac{S}{2} \left\| y - x \right\|_2^2$ Example: $f(x) := \frac{1}{2} \left\| Ax - y \right\|_2^2$ where $A \in \mathbb{R}^{m \times n}$ is strongly convex with $S = \sigma_{\min}(A)^2$ if $m > n$
24 Gradient descent for strongly convex functions If $f$ is $S$-strongly convex and $\nabla f$ is $L$-Lipschitz continuous, $f\left(x^{(k)}\right) - f(x^*) \leq \frac{c^k L}{2} \left\| x^{(0)} - x^* \right\|_2^2$ where $c := \frac{L/S - 1}{L/S + 1}$ We need $O\left(\log \frac{1}{\epsilon}\right)$ iterations to get an $\epsilon$-optimal solution
25 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
26 Lower bounds for convergence rate There exist convex functions with $L$-Lipschitz-continuous gradients such that for any algorithm that selects $x^{(k)}$ from $x^{(0)} + \operatorname{span}\left\{ \nabla f\left(x^{(0)}\right), \nabla f\left(x^{(1)}\right), \ldots, \nabla f\left(x^{(k-1)}\right) \right\}$ we have $f\left(x^{(k)}\right) - f(x^*) \geq \frac{3 L \left\| x^{(0)} - x^* \right\|_2^2}{32 (k + 1)^2}$
27 Nesterov's accelerated gradient method Achieves the lower bound, i.e. $O\left(1/\sqrt{\epsilon}\right)$ convergence Uses a momentum variable: $y^{(k+1)} = x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right)$, $x^{(k+1)} = \beta_k y^{(k+1)} + \gamma_k y^{(k)}$ Despite guarantees, why this works is not completely understood
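A sketch of one common instantiation; the slide leaves $\beta_k$ and $\gamma_k$ generic, so the momentum weight $k/(k+3)$ below is an assumption borrowed from the FISTA-style schedule later in the deck:

```python
import numpy as np

def accelerated_gradient(grad_f, x0, step, n_iters=500):
    """Accelerated gradient descent with momentum weight k/(k+3) (an assumed schedule)."""
    x = np.asarray(x0, dtype=float)
    y_prev = x.copy()
    for k in range(n_iters):
        y = x - step * grad_f(x)                 # gradient step: y^(k+1)
        x = y + (k / (k + 3.0)) * (y - y_prev)   # momentum extrapolation: x^(k+1)
        y_prev = y
    return x
```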
28 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
29 Projected gradient descent Optimization problem: minimize $f(x)$ subject to $x \in S$, where $f$ is differentiable and $S$ is convex Projected-gradient-descent iteration: $x^{(0)} =$ arbitrary initialization, $x^{(k+1)} = \mathcal{P}_S\left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right)$
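A sketch with the projection passed in as a function; the nonnegative-orthant projection shown at the end is an illustrative choice of $S$, not one from the slides:

```python
import numpy as np

def projected_gradient(grad_f, project, x0, step, n_iters=500):
    """Projected gradient descent: gradient step followed by projection onto S."""
    x = project(np.asarray(x0, dtype=float))
    for _ in range(n_iters):
        x = project(x - step * grad_f(x))  # x^(k+1) = P_S(x^(k) - alpha * grad f(x^(k)))
    return x

# Example projection: S = nonnegative orthant, P_S(x) = max(x, 0) entrywise
project_nonneg = lambda x: np.maximum(x, 0.0)
```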
30 Projected gradient descent
31 Projected gradient descent
32 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
33 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
34 Subgradient method Optimization problem: minimize $f(x)$, where $f$ is convex but nondifferentiable Subgradient-method iteration: $x^{(0)} =$ arbitrary initialization, $x^{(k+1)} = x^{(k)} - \alpha_k q^{(k)}$, where $q^{(k)}$ is a subgradient of $f$ at $x^{(k)}$
35 Least-squares regression with $\ell_1$-norm regularization minimize $\frac{1}{2} \left\| Ax - y \right\|_2^2 + \lambda \left\| x \right\|_1$ A sum of subgradients is a subgradient of the sum: $q^{(k)} = A^T \left( Ax^{(k)} - y \right) + \lambda \operatorname{sign}\left(x^{(k)}\right)$ Subgradient-method iteration: $x^{(0)} =$ arbitrary initialization, $x^{(k+1)} = x^{(k)} - \alpha_k \left( A^T \left( Ax^{(k)} - y \right) + \lambda \operatorname{sign}\left(x^{(k)}\right) \right)$
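A sketch of this iteration; the diminishing step schedule $\alpha_0/\sqrt{k}$ matches one of the choices plotted on the next slides, and the iteration count is illustrative:

```python
import numpy as np

def lasso_subgradient(A, y, lam, alpha0=1.0, n_iters=2000):
    """Subgradient method for 0.5*||Ax - y||^2 + lam*||x||_1 with steps alpha0/sqrt(k)."""
    x = np.zeros(A.shape[1])
    for k in range(1, n_iters + 1):
        q = A.T @ (A @ x - y) + lam * np.sign(x)  # a subgradient at x^(k)
        x = x - (alpha0 / np.sqrt(k)) * q
    return x
```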
36 Convergence of subgradient method It is not a descent method Convergence rate can be shown to be $O\left(1/\epsilon^2\right)$ Diminishing step sizes are necessary for convergence Experiment: minimize $\frac{1}{2} \left\| Ax - y \right\|_2^2 + \lambda \left\| x \right\|_1$ with $y = Ax_0 + z$, where $x_0$ is 100-sparse and $z$ is iid Gaussian
37 Convergence of subgradient method [Plot: relative error $\left(f\left(x^{(k)}\right) - f(x^*)\right) / f(x^*)$ versus iteration $k$ for step sizes $\alpha_0$, $\alpha_0/\sqrt{k}$, and $\alpha_0/k$]
38 Convergence of subgradient method [Plot: relative error $\left(f\left(x^{(k)}\right) - f(x^*)\right) / f(x^*)$ versus iteration $k$ for step sizes $\alpha_0$, $\alpha_0/\sqrt{n}$, and $\alpha_0/n$]
39 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
40 Composite functions Interesting class of functions for data analysis: $f(x) + g(x)$, with $f$ convex and differentiable, $g$ convex but not differentiable Example: least-squares regression ($f$) + $\ell_1$-norm regularization ($g$): $\frac{1}{2} \left\| Ax - y \right\|_2^2 + \lambda \left\| x \right\|_1$
41 Interpretation of gradient descent The update is the solution of a local first-order approximation: $x^{(k+1)} := x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) = \arg\min_x f\left(x^{(k)}\right) + \nabla f\left(x^{(k)}\right)^T \left(x - x^{(k)}\right) + \frac{1}{2 \alpha_k} \left\| x - x^{(k)} \right\|_2^2 = \arg\min_x \left\| x - \left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right) \right\|_2^2$
42 Proximal gradient method Idea: minimize the local first-order approximation plus $g$: $x^{(k+1)} = \arg\min_x f\left(x^{(k)}\right) + \nabla f\left(x^{(k)}\right)^T \left(x - x^{(k)}\right) + \frac{1}{2 \alpha_k} \left\| x - x^{(k)} \right\|_2^2 + g(x) = \arg\min_x \frac{1}{2} \left\| x - \left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right) \right\|_2^2 + \alpha_k g(x) = \operatorname{prox}_{\alpha_k g}\left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right)$ Proximal operator: $\operatorname{prox}_g(y) := \arg\min_x g(x) + \frac{1}{2} \left\| y - x \right\|_2^2$
43 Proximal gradient method Method to solve the optimization problem minimize $f(x) + g(x)$, where $f$ is differentiable and $\operatorname{prox}_g$ is tractable Proximal-gradient iteration: $x^{(0)} =$ arbitrary initialization, $x^{(k+1)} = \operatorname{prox}_{\alpha_k g}\left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right)$
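A sketch of the generic iteration with the proximal operator supplied as a function; the calling convention prox_g(point, alpha) is an assumption made for illustration:

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step, n_iters=500):
    """Proximal gradient: x <- prox_{alpha g}(x - alpha * grad f(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = prox_g(x - step * grad_f(x), step)  # prox_g takes (point, alpha)
    return x
```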
44 Interpretation as a fixed-point method A vector $\hat{x}$ is a solution to minimize $f(x) + g(x)$ if and only if it is a fixed point of the proximal-gradient iteration for any $\alpha > 0$: $\hat{x} = \operatorname{prox}_{\alpha g}\left( \hat{x} - \alpha \nabla f(\hat{x}) \right)$
45 Projected gradient descent as a proximal method The proximal operator of the indicator function $I_S(x) := 0$ if $x \in S$, $\infty$ if $x \notin S$, of a convex set $S \subseteq \mathbb{R}^n$ is projection onto $S$ Proximal-gradient iteration: $x^{(k+1)} = \operatorname{prox}_{\alpha_k I_S}\left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right) = \mathcal{P}_S\left( x^{(k)} - \alpha_k \nabla f\left(x^{(k)}\right) \right)$
46 Proximal operator of $\ell_1$ norm The proximal operator of the $\ell_1$ norm is the soft-thresholding operator: $\operatorname{prox}_{\beta \left\| \cdot \right\|_1}(y) = \mathcal{S}_\beta(y)$, where $\beta > 0$ and $\mathcal{S}_\beta(y)_i := y_i - \operatorname{sign}(y_i)\, \beta$ if $|y_i| \geq \beta$, $0$ otherwise
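A one-function sketch of the entrywise soft-thresholding operator defined above:

```python
import numpy as np

def soft_threshold(y, beta):
    """Entrywise soft thresholding: the proximal operator of beta * ||.||_1."""
    return np.sign(y) * np.maximum(np.abs(y) - beta, 0.0)
```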
47 Iterative Shrinkage-Thresholding Algorithm (ISTA) The proximal gradient method for the problem minimize $\frac{1}{2} \left\| Ax - y \right\|_2^2 + \lambda \left\| x \right\|_1$ is called ISTA ISTA iteration: $x^{(0)} =$ arbitrary initialization, $x^{(k+1)} = \mathcal{S}_{\alpha_k \lambda}\left( x^{(k)} - \alpha_k A^T \left( Ax^{(k)} - y \right) \right)$
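A sketch of ISTA; the constant step $1/L$ with $L = \sigma_{\max}(A)^2$ (the Lipschitz constant of the gradient of the quadratic term) is one standard choice, stated here as an assumption:

```python
import numpy as np

def ista(A, y, lam, n_iters=500):
    """ISTA for 0.5*||Ax - y||^2 + lam*||x||_1 with constant step 1/L."""
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L with L = sigma_max(A)^2
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        x = soft(x - alpha * A.T @ (A @ x - y), alpha * lam)
    return x
```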
48 Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) ISTA can be accelerated using Nesterov's accelerated gradient method FISTA iteration: $x^{(0)} =$ arbitrary initialization, $z^{(0)} = x^{(0)}$, $x^{(k+1)} = \mathcal{S}_{\alpha_k \lambda}\left( z^{(k)} - \alpha_k A^T \left( Az^{(k)} - y \right) \right)$, $z^{(k+1)} = x^{(k+1)} + \frac{k}{k + 3} \left( x^{(k+1)} - x^{(k)} \right)$
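A sketch of the FISTA iteration above, again with the constant step $1/L$ as an assumed choice:

```python
import numpy as np

def fista(A, y, lam, n_iters=500):
    """FISTA: ISTA plus the extrapolation z^(k+1) = x^(k+1) + k/(k+3) * (x^(k+1) - x^(k))."""
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
    alpha = 1.0 / np.linalg.norm(A, 2) ** 2  # constant step 1/L
    x = np.zeros(A.shape[1])
    z = x.copy()
    for k in range(n_iters):
        x_new = soft(z - alpha * A.T @ (A @ z - y), alpha * lam)
        z = x_new + (k / (k + 3.0)) * (x_new - x)
        x = x_new
    return x
```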
49 Convergence of proximal gradient method Without acceleration: descent method; convergence rate can be shown to be $O(1/\epsilon)$ with constant step or backtracking line search With acceleration: not a descent method; convergence rate can be shown to be $O\left(1/\sqrt{\epsilon}\right)$ with constant step or backtracking line search Experiment: minimize $\frac{1}{2} \left\| Ax - y \right\|_2^2 + \lambda \left\| x \right\|_1$ with $y = Ax_0 + z$, $x_0$ sparse and $z$ iid Gaussian
50 Convergence of proximal gradient method [Plot: relative error $\left(f\left(x^{(k)}\right) - f(x^*)\right) / f(x^*)$ versus iteration $k$ for the subgradient method ($\alpha_0/\sqrt{k}$), ISTA, and FISTA]
51 Differentiable functions Gradient descent Convergence analysis of gradient descent Accelerated gradient descent Projected gradient descent Nondifferentiable functions Subgradient method Proximal gradient method Coordinate descent
52 Coordinate descent Idea: solve the $n$-dimensional problem minimize $h(x_1, x_2, \ldots, x_n)$ by solving a sequence of 1D problems Coordinate-descent iteration: $x^{(0)} =$ arbitrary initialization, $x_i^{(k+1)} = \arg\min_\alpha h\left( x_1^{(k)}, \ldots, \alpha, \ldots, x_n^{(k)} \right)$ for some $1 \leq i \leq n$
53 Coordinate descent Convergence is guaranteed for functions of the form $f(x) + \sum_{i=1}^{n} g_i(x_i)$, where $f$ is convex and differentiable and $g_1, \ldots, g_n$ are convex
54 Least-squares regression with $\ell_1$-norm regularization $h(x) := \frac{1}{2} \left\| Ax - y \right\|_2^2 + \lambda \left\| x \right\|_1$ The solution to the subproblem $\min_{x_i} h(x_1, \ldots, x_i, \ldots, x_n)$ is $\hat{x}_i = \frac{\mathcal{S}_\lambda(\gamma_i)}{\left\| A_i \right\|_2^2}$, where $A_i$ is the $i$th column of $A$ and $\gamma_i := \sum_{l=1}^{m} A_{li} \left( y_l - \sum_{j \neq i} A_{lj} x_j \right)$
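A sketch of cyclic coordinate descent implementing the closed-form coordinate update above; the sweep count is illustrative:

```python
import numpy as np

def lasso_coordinate_descent(A, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||Ax - y||^2 + lam*||x||_1."""
    m, n = A.shape
    x = np.zeros(n)
    col_sq = (A ** 2).sum(axis=0)  # ||A_i||^2 for each column
    for _ in range(n_sweeps):
        for i in range(n):
            r = y - A @ x + A[:, i] * x[i]  # residual excluding coordinate i
            gamma = A[:, i] @ r              # gamma_i
            x[i] = np.sign(gamma) * max(abs(gamma) - lam, 0.0) / col_sq[i]
    return x
```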
55 Computational experiments From Statistical Learning with Sparsity: The Lasso and Generalizations, by Hastie, Tibshirani and Wainwright
More information