MATH 829: Introduction to Data Mining and Analysis
Computing the lasso solution


Dominique Guillot, Department of Mathematical Sciences, University of Delaware. February 26, 2016.

Computing the lasso solution

The lasso is often used in high-dimensional problems. Cross-validation involves solving many lasso problems. (Note: the solutions can be computed in parallel on a computer cluster when working with large problems.) How can we efficiently compute the lasso solution?

Recall: the lasso objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is NOT differentiable everywhere on $\mathbb{R}^p$. Many strategies exist for minimizing the lasso objective function. We will look at two approaches: coordinate descent and least-angle regression (LARS).

Coordinate descent optimization

Objective: Minimize a function $f : \mathbb{R}^p \to \mathbb{R}$.
Strategy: Minimize over each coordinate separately while cycling through the coordinates:
$$x_1^{(k+1)} = \operatorname*{argmin}_x f(x, x_2^{(k)}, x_3^{(k)}, \dots, x_p^{(k)})$$
$$x_2^{(k+1)} = \operatorname*{argmin}_x f(x_1^{(k+1)}, x, x_3^{(k)}, \dots, x_p^{(k)})$$
$$x_3^{(k+1)} = \operatorname*{argmin}_x f(x_1^{(k+1)}, x_2^{(k+1)}, x, x_4^{(k)}, \dots, x_p^{(k)})$$
$$\vdots$$
$$x_p^{(k+1)} = \operatorname*{argmin}_x f(x_1^{(k+1)}, x_2^{(k+1)}, \dots, x_{p-1}^{(k+1)}, x).$$
A technique that was neglected in the past but has gained popularity recently. It can be very efficient when the coordinate-wise problems are easy to solve (e.g., if they admit a closed-form solution).
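
To make the cyclic update concrete, here is a minimal Python sketch (an addition, not part of the original slides) that applies exact coordinate-wise minimization to a smooth quadratic $f(x) = \frac{1}{2}x^T Q x - b^T x$; the function name and the stopping rule are illustrative choices.

    import numpy as np

    def coordinate_descent_quadratic(Q, b, n_cycles=100, tol=1e-10):
        """Minimize f(x) = 0.5 x^T Q x - b^T x (Q symmetric positive definite)
        by cycling through the coordinates and minimizing each one exactly."""
        p = len(b)
        x = np.zeros(p)
        for _ in range(n_cycles):
            x_old = x.copy()
            for i in range(p):
                # df/dx_i = Q[i, :] @ x - b[i]; setting it to 0 with the other
                # coordinates fixed gives the exact coordinate-wise minimizer.
                x[i] = (b[i] - Q[i, :] @ x + Q[i, i] * x[i]) / Q[i, i]
            if np.max(np.abs(x - x_old)) < tol:  # stop when a full cycle barely moves x
                break
        return x

    # Small usage example: the result should match the solution of Qx = b.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(coordinate_descent_quadratic(Q, b), np.linalg.solve(Q, b))

Each inner update solves the one-dimensional problem in closed form, which is what makes coordinate descent attractive in this setting.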

Coordinate descent optimization (illustration)

[Figure illustrating coordinate descent iterations. Source: Wikipedia (Nicoguaro).]

Convergence

Does this procedure always converge to a minimizer of the objective in general? NO!

[Figure: a case where coordinate descent gets stuck. Source: Wikipedia (Nicoguaro).]

Convergence (cont.)

Does coordinate descent work for the lasso? YES! We exploit the fact that the non-differentiable part of the objective is separable.

Theorem (see Tseng, 2001). Suppose $f : \mathbb{R}^p \to \mathbb{R}$,
$$f(x_1, \dots, x_p) = f_0(x_1, \dots, x_p) + \sum_{i=1}^p f_i(x_i),$$
satisfies:
1. $f_0 : \mathbb{R}^p \to \mathbb{R}$ is convex and continuously differentiable;
2. $f_i : \mathbb{R} \to \mathbb{R}$ is convex ($i = 1, \dots, p$);
3. the set $X^0 := \{x \in \mathbb{R}^p : f(x) \le f(x^{(0)})\}$ is compact;
4. $f$ is continuous on $X^0$.
Then every limit point of the sequence $(x^{(k)})_{k \ge 1}$ generated by cyclic coordinate descent is a global minimizer of $f$.

For the lasso, $f_0(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ is convex and continuously differentiable and each $f_i(\beta_i) = \lambda|\beta_i|$ is convex, so the theorem applies.

Lasso: individual step

Fix $x_j$ for $j \ne i$. We need to solve:
$$\min_{x_i} \frac{1}{2}\|y - Ax\|_2^2 + \lambda\sum_{k=1}^p |x_k| = \min_{x_i} \frac{1}{2}\sum_{l=1}^n \Big(y_l - \sum_{m=1}^p a_{lm}x_m\Big)^2 + \lambda\sum_{k=1}^p |x_k|.$$
Now, writing $A_i$ for the $i$-th column of $A$ and $A_{-i}x_{-i} := \sum_{m \ne i} A_m x_m$,
$$\frac{\partial}{\partial x_i}\, \frac{1}{2}\sum_{l=1}^n \Big(y_l - \sum_{m=1}^p a_{lm}x_m\Big)^2 = \sum_{l=1}^n \Big(y_l - \sum_{m=1}^p a_{lm}x_m\Big)(-a_{li}) = A_i^T(Ax - y) = A_i^T(A_{-i}x_{-i} - y) + (A_i^T A_i)\,x_i.$$
What about the non-differentiable part?
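
As a quick sanity check on this derivative (an addition, not from the slides), one can compare the formula $A_i^T(Ax - y)$ with a finite-difference approximation on random data:

    import numpy as np

    # Verify numerically that d/dx_i (1/2)||y - Ax||^2 equals A_i^T (Ax - y).
    rng = np.random.default_rng(0)
    n, p, i = 20, 5, 2
    A = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    x = rng.standard_normal(p)

    def smooth_part(x):
        return 0.5 * np.sum((y - A @ x) ** 2)

    eps = 1e-6
    e_i = np.zeros(p); e_i[i] = 1.0
    finite_diff = (smooth_part(x + eps * e_i) - smooth_part(x - eps * e_i)) / (2 * eps)
    analytic = A[:, i] @ (A @ x - y)
    print(finite_diff, analytic)   # the two values should agree to ~1e-6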

Digression: subdifferential calculus

Suppose $f$ is convex and differentiable. Then
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } y.$$
[Figure: Boyd & Vandenberghe, Figure 3.2.]

We say that $g$ is a subgradient of $f$ at $x$ if
$$f(y) \ge f(x) + g^T(y - x) \quad \text{for all } y.$$
[Figure: Boyd, lecture notes.]

Digression: subdifferential calculus (cont.)

We define $\partial f(x) := \{\text{all subgradients of } f \text{ at } x\}$.
- $\partial f(x)$ is a closed convex set (can be empty).
- $\partial f(x) = \{\nabla f(x)\}$ if $f$ is differentiable at $x$.
- If $\partial f(x) = \{g\}$, then $f$ is differentiable at $x$ and $\nabla f(x) = g$.

Basic properties: $\partial(\alpha f) = \alpha\,\partial f$ if $\alpha > 0$, and $\partial(f_1 + f_2) = \partial f_1 + \partial f_2$.

Example: for $f(x) = |x|$,
$$\partial f(x) = \begin{cases} \{-1\} & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ \{1\} & \text{if } x > 0. \end{cases}$$
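
A short verification of the example (not on the original slide): by the definition of a subgradient, $g \in \partial f(0)$ for $f(x) = |x|$ means $|y| \ge g\,y$ for all $y$, which holds exactly when $|g| \le 1$; hence $\partial f(0) = [-1, 1]$. By the scaling rule, $\partial(\lambda|\cdot|)(0) = [-\lambda, \lambda]$ for $\lambda > 0$, the fact used for the lasso below.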

Digression: subdifferential calculus (cont.)

Recall: if $f$ is convex and differentiable, then
$$f(x^*) = \inf_x f(x) \iff 0 = \nabla f(x^*).$$
Theorem. Let $f$ be a (not necessarily differentiable) convex function. Then
$$f(x^*) = \inf_x f(x) \iff 0 \in \partial f(x^*).$$
Proof. $f(x^*) = \inf_x f(x)$ holds iff $f(y) \ge f(x^*) + 0^T(y - x^*)$ for all $y$, which is exactly the statement that $0 \in \partial f(x^*)$. $\square$

Despite its simplicity, this is a very powerful and important result.
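
As a worked one-dimensional example (added here, not on the original slides, but it previews the lasso coordinate step), consider $h(x) = \frac{1}{2}(x - a)^2 + \lambda|x|$ with $\lambda > 0$. The condition $0 \in \partial h(x^*)$ reads $0 \in (x^* - a) + \lambda\,\partial|x^*|$. Checking the three cases $x^* > 0$, $x^* < 0$, $x^* = 0$ gives
$$x^* = \begin{cases} a - \lambda & \text{if } a > \lambda, \\ a + \lambda & \text{if } a < -\lambda, \\ 0 & \text{if } |a| \le \lambda, \end{cases}$$
i.e. $x^* = \operatorname{sgn}(a)(|a| - \lambda)_+$. This is exactly the soft-thresholding operation that reappears in the lasso coordinate update below.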

Back to the lasso

The function
$$f(x_i) := \frac{1}{2}\|y - Ax\|_2^2 + \lambda\sum_{k=1}^p |x_k|$$
(with $x_j$, $j \ne i$, held fixed) is convex. Its minimum is attained at $x_i^*$ if and only if $0 \in \partial f(x_i^*)$.

Let $g := \frac{\partial}{\partial x_i}\, \frac{1}{2}\|y - Ax\|_2^2 = A_i^T(A_{-i}x_{-i} - y) + (A_i^T A_i)\,x_i$, and write $g_\pm := g \pm \lambda$. Then
$$\partial f(x) = \begin{cases} \{g_-\} & \text{if } x_i < 0 \\ [g_-, g_+] & \text{if } x_i = 0 \\ \{g_+\} & \text{if } x_i > 0. \end{cases}$$
Now,
$$g_- = 0 \iff x_i = \frac{A_i^T(y - A_{-i}x_{-i}) + \lambda}{A_i^T A_i}.$$
This implies $0 \in \partial f(x_i^*)$ if
$$x_i^* = \frac{A_i^T(y - A_{-i}x_{-i}) + \lambda}{\|A_i\|_2^2} < 0.$$

Back to the lasso (cont.)

Similarly,
$$g_+ = 0 \iff x_i = \frac{A_i^T(y - A_{-i}x_{-i}) - \lambda}{A_i^T A_i}.$$
Therefore, $0 \in \partial f(x_i^*)$ if
$$x_i^* = \frac{A_i^T(y - A_{-i}x_{-i}) - \lambda}{\|A_i\|_2^2} > 0.$$
We have thus found a (unique) $x_i^*$ with $0 \in \partial f(x_i^*)$ whenever $A_i^T(y - A_{-i}x_{-i}) < -\lambda$ or $A_i^T(y - A_{-i}x_{-i}) > \lambda$. What happens when $-\lambda \le A_i^T(y - A_{-i}x_{-i}) \le \lambda$?

Back to the lasso (cont.)

The remaining case $-\lambda \le A_i^T(y - A_{-i}x_{-i}) \le \lambda$ is equivalent to $|A_i^T(y - A_{-i}x_{-i})| \le \lambda$. If $x_i = 0$, then $g = -A_i^T(y - A_{-i}x_{-i})$, so $|g| \le \lambda$ and hence $0 \in [g_-, g_+] = [g - \lambda, g + \lambda]$.

We have therefore shown that $0 \in \partial f(x_i^*)$ if $x_i^* = 0$ and $|A_i^T(y - A_{-i}x_{-i})| \le \lambda$.

Lasso: summary

Writing $z := A_i^T(y - A_{-i}x_{-i})$, we have shown the following:
$$0 \in \partial f(x_i^*) \quad \text{if} \quad \begin{cases} x_i^* = \dfrac{z + \lambda}{\|A_i\|_2^2} & \text{and } z < -\lambda, \\[1ex] x_i^* = \dfrac{z - \lambda}{\|A_i\|_2^2} & \text{and } z > \lambda, \\[1ex] x_i^* = 0 & \text{and } |z| \le \lambda. \end{cases}$$
Therefore, the minimum of $f$ is attained at
$$x_i^* = \begin{cases} \dfrac{z + \lambda}{\|A_i\|_2^2} & \text{if } z < -\lambda, \\[1ex] \dfrac{z - \lambda}{\|A_i\|_2^2} & \text{if } z > \lambda, \\[1ex] 0 & \text{if } -\lambda \le z \le \lambda. \end{cases}$$
In other words,
$$x_i^* = \eta^S_{\lambda/\|A_i\|_2^2}\!\left(\frac{A_i^T(y - A_{-i}x_{-i})}{A_i^T A_i}\right),$$
where $\eta^S_\epsilon$ is the soft-thresholding function.
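
A quick numerical check (an addition, not from the slides; variable names are illustrative): the closed-form coordinate minimizer above should agree with a brute-force grid minimization of the one-dimensional objective.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, i, lam = 30, 4, 1, 0.7
    A = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    x = rng.standard_normal(p)

    r_i = y - A @ x + A[:, i] * x[i]      # y - A_{-i} x_{-i}
    z = A[:, i] @ r_i
    closed_form = np.sign(z) * max(abs(z) - lam, 0.0) / (A[:, i] @ A[:, i])

    grid = np.linspace(-5, 5, 100001)
    objective = [0.5 * np.sum((r_i - A[:, i] * t) ** 2) + lam * abs(t) for t in grid]
    brute_force = grid[int(np.argmin(objective))]
    print(closed_form, brute_force)       # the two values should be very close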

Soft-thresholding

Hard-thresholding: $\eta^H_\epsilon(x) = x\,\mathbf{1}_{|x| > \epsilon}$. Soft-thresholding: $\eta^S_\epsilon(x) = \operatorname{sgn}(x)(|x| - \epsilon)_+$, i.e.
$$\eta^S_\epsilon(x) = \begin{cases} x - \epsilon & \text{if } x > \epsilon \\ x + \epsilon & \text{if } x < -\epsilon \\ 0 & \text{if } -\epsilon \le x \le \epsilon. \end{cases}$$
[Figures: plots of hard thresholding and soft thresholding.]

Note: soft-thresholding shrinks the value until it hits zero (and then leaves it at zero).
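
In Python, soft-thresholding is one line; the function name soft_threshold below is an arbitrary choice (a small sketch, not part of the slides).

    import numpy as np

    # Vectorized soft-thresholding: sgn(x) * max(|x| - eps, 0).
    def soft_threshold(x, eps):
        return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

    print(soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0))
    # -> [-2. -0.  0.  0.  2.]   values inside [-1, 1] are set to 0, others shrink toward 0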

Conclusion

To solve the lasso problem using coordinate descent:
- Pick an initial point $x$.
- Cycle through the coordinates and perform the updates
$$x_i \leftarrow \eta^S_{\lambda/(A_i^T A_i)}\!\left(\frac{A_i^T(y - A_{-i}x_{-i})}{A_i^T A_i}\right).$$
- Continue until convergence (i.e., stop when the coordinates vary less than some threshold).

Exercise: Implement this algorithm in Python.
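
A possible sketch of the exercise (one attempt, not the official solution), assuming no column of $A$ is identically zero and using a simple maximum-change stopping rule:

    import numpy as np

    def soft_threshold(x, eps):
        return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

    def lasso_coordinate_descent(A, y, lam, max_cycles=1000, tol=1e-8):
        """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by cyclic coordinate descent."""
        n, p = A.shape
        x = np.zeros(p)
        col_norms_sq = np.sum(A ** 2, axis=0)      # A_i^T A_i for each column
        residual = y - A @ x                        # kept up to date as coordinates change
        for _ in range(max_cycles):
            max_change = 0.0
            for i in range(p):
                # r_i = y - A_{-i} x_{-i}: add back the current contribution of column i.
                r_i = residual + A[:, i] * x[i]
                z = A[:, i] @ r_i
                x_new = soft_threshold(z / col_norms_sq[i], lam / col_norms_sq[i])
                residual = r_i - A[:, i] * x_new    # update residual with the new x_i
                max_change = max(max_change, abs(x_new - x[i]))
                x[i] = x_new
            if max_change < tol:                    # stop when a full cycle barely moves x
                break
        return x

    # Small usage example on random data with a sparse true coefficient vector.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    y = A @ np.array([2.0, -1.0] + [0.0] * 8) + 0.1 * rng.standard_normal(50)
    print(np.round(lasso_coordinate_descent(A, y, lam=5.0), 3))

As a cross-check, one could compare against sklearn.linear_model.Lasso, keeping in mind that scikit-learn scales the squared-error term by $1/(2n)$, so its alpha corresponds to $\lambda/n$ in the notation above.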
