MATH 829: Introduction to Data Mining and Analysis
Computing the lasso solution


Dominique Guillot, Department of Mathematical Sciences, University of Delaware. February 26, 2016.

Computing the lasso solution

The lasso is often used in high-dimensional problems. Cross-validation involves solving many lasso problems. (Note: the solutions can be computed in parallel on a computer cluster when working with large problems.) How can we efficiently compute the lasso solution?

Recall: the lasso objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is NOT differentiable everywhere on $\mathbb{R}^p$. Many strategies exist for minimizing the lasso objective function. We will look at two approaches: coordinate descent and least-angle regression (LARS).

Coordinate descent optimization

Objective: Minimize a function $f : \mathbb{R}^p \to \mathbb{R}$.
Strategy: Minimize over each coordinate separately while cycling through the coordinates:
$$x_1^{(k+1)} = \operatorname*{argmin}_x f(x, x_2^{(k)}, x_3^{(k)}, \dots, x_p^{(k)})$$
$$x_2^{(k+1)} = \operatorname*{argmin}_x f(x_1^{(k+1)}, x, x_3^{(k)}, \dots, x_p^{(k)})$$
$$x_3^{(k+1)} = \operatorname*{argmin}_x f(x_1^{(k+1)}, x_2^{(k+1)}, x, x_4^{(k)}, \dots, x_p^{(k)})$$
$$\vdots$$
$$x_p^{(k+1)} = \operatorname*{argmin}_x f(x_1^{(k+1)}, x_2^{(k+1)}, \dots, x_{p-1}^{(k+1)}, x).$$
A technique that was neglected in the past but has gained popularity recently. It can be very efficient when the coordinate-wise problems are easy to solve (e.g., if they admit a closed-form solution).
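
To make the cyclic update concrete, here is a minimal Python sketch (an addition, not part of the original slides) that applies exact coordinate-wise minimization to a smooth quadratic $f(x) = \frac{1}{2}x^T Q x - b^T x$; the function name and the stopping rule are illustrative choices.

    import numpy as np

    def coordinate_descent_quadratic(Q, b, n_cycles=100, tol=1e-10):
        """Minimize f(x) = 0.5 x^T Q x - b^T x (Q symmetric positive definite)
        by cycling through the coordinates and minimizing each one exactly."""
        p = len(b)
        x = np.zeros(p)
        for _ in range(n_cycles):
            x_old = x.copy()
            for i in range(p):
                # df/dx_i = Q[i, :] @ x - b[i]; setting it to 0 with the other
                # coordinates fixed gives the exact coordinate-wise minimizer.
                x[i] = (b[i] - Q[i, :] @ x + Q[i, i] * x[i]) / Q[i, i]
            if np.max(np.abs(x - x_old)) < tol:  # stop when a full cycle barely moves x
                break
        return x

    # Small usage example: the result should match the solution of Qx = b.
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])
    print(coordinate_descent_quadratic(Q, b), np.linalg.solve(Q, b))

Each inner update solves the one-dimensional problem in closed form, which is what makes coordinate descent attractive in this setting.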

Coordinate descent optimization (illustration)

[Figure illustrating coordinate descent iterations. Source: Wikipedia (Nicoguaro).]

Convergence

Does this procedure always converge to a minimizer of the objective in general? NO!

[Figure: a case where coordinate descent gets stuck. Source: Wikipedia (Nicoguaro).]

Convergence (cont.)

Does coordinate descent work for the lasso? YES! We exploit the fact that the non-differentiable part of the objective is separable.

Theorem (see Tseng, 2001). Suppose $f : \mathbb{R}^p \to \mathbb{R}$,
$$f(x_1, \dots, x_p) = f_0(x_1, \dots, x_p) + \sum_{i=1}^p f_i(x_i),$$
satisfies:
1. $f_0 : \mathbb{R}^p \to \mathbb{R}$ is convex and continuously differentiable;
2. $f_i : \mathbb{R} \to \mathbb{R}$ is convex ($i = 1, \dots, p$);
3. the set $X^0 := \{x \in \mathbb{R}^p : f(x) \le f(x^{(0)})\}$ is compact;
4. $f$ is continuous on $X^0$.
Then every limit point of the sequence $(x^{(k)})_{k \ge 1}$ generated by cyclic coordinate descent is a global minimizer of $f$.

For the lasso, $f_0(\beta) = \frac{1}{2}\|y - X\beta\|_2^2$ is convex and continuously differentiable and each $f_i(\beta_i) = \lambda|\beta_i|$ is convex, so the theorem applies.

Lasso: individual step

Fix $x_j$ for $j \ne i$. We need to solve:
$$\min_{x_i} \frac{1}{2}\|y - Ax\|_2^2 + \lambda\sum_{k=1}^p |x_k| = \min_{x_i} \frac{1}{2}\sum_{l=1}^n \Big(y_l - \sum_{m=1}^p a_{lm}x_m\Big)^2 + \lambda\sum_{k=1}^p |x_k|.$$
Now, writing $A_i$ for the $i$-th column of $A$ and $A_{-i}x_{-i} := \sum_{m \ne i} A_m x_m$,
$$\frac{\partial}{\partial x_i}\, \frac{1}{2}\sum_{l=1}^n \Big(y_l - \sum_{m=1}^p a_{lm}x_m\Big)^2 = \sum_{l=1}^n \Big(y_l - \sum_{m=1}^p a_{lm}x_m\Big)(-a_{li}) = A_i^T(Ax - y) = A_i^T(A_{-i}x_{-i} - y) + (A_i^T A_i)\,x_i.$$
What about the non-differentiable part?
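
As a quick sanity check on this derivative (an addition, not from the slides), one can compare the formula $A_i^T(Ax - y)$ with a finite-difference approximation on random data:

    import numpy as np

    # Verify numerically that d/dx_i (1/2)||y - Ax||^2 equals A_i^T (Ax - y).
    rng = np.random.default_rng(0)
    n, p, i = 20, 5, 2
    A = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    x = rng.standard_normal(p)

    def smooth_part(x):
        return 0.5 * np.sum((y - A @ x) ** 2)

    eps = 1e-6
    e_i = np.zeros(p); e_i[i] = 1.0
    finite_diff = (smooth_part(x + eps * e_i) - smooth_part(x - eps * e_i)) / (2 * eps)
    analytic = A[:, i] @ (A @ x - y)
    print(finite_diff, analytic)   # the two values should agree to ~1e-6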

Digression: subdifferential calculus

Suppose $f$ is convex and differentiable. Then
$$f(y) \ge f(x) + \nabla f(x)^T (y - x) \quad \text{for all } y.$$
[Figure: Boyd & Vandenberghe, Figure 3.2.]

We say that $g$ is a subgradient of $f$ at $x$ if
$$f(y) \ge f(x) + g^T(y - x) \quad \text{for all } y.$$
[Figure: Boyd, lecture notes.]

Digression: subdifferential calculus (cont.)

We define $\partial f(x) := \{\text{all subgradients of } f \text{ at } x\}$.
- $\partial f(x)$ is a closed convex set (can be empty).
- $\partial f(x) = \{\nabla f(x)\}$ if $f$ is differentiable at $x$.
- If $\partial f(x) = \{g\}$, then $f$ is differentiable at $x$ and $\nabla f(x) = g$.

Basic properties: $\partial(\alpha f) = \alpha\,\partial f$ if $\alpha > 0$, and $\partial(f_1 + f_2) = \partial f_1 + \partial f_2$.

Example: for $f(x) = |x|$,
$$\partial f(x) = \begin{cases} \{-1\} & \text{if } x < 0 \\ [-1, 1] & \text{if } x = 0 \\ \{1\} & \text{if } x > 0. \end{cases}$$
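
A short verification of the example (not on the original slide): by the definition of a subgradient, $g \in \partial f(0)$ for $f(x) = |x|$ means $|y| \ge g\,y$ for all $y$, which holds exactly when $|g| \le 1$; hence $\partial f(0) = [-1, 1]$. By the scaling rule, $\partial(\lambda|\cdot|)(0) = [-\lambda, \lambda]$ for $\lambda > 0$, the fact used for the lasso below.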

Digression: subdifferential calculus (cont.)

Recall: if $f$ is convex and differentiable, then
$$f(x^*) = \inf_x f(x) \iff 0 = \nabla f(x^*).$$
Theorem. Let $f$ be a (not necessarily differentiable) convex function. Then
$$f(x^*) = \inf_x f(x) \iff 0 \in \partial f(x^*).$$
Proof. $f(x^*) = \inf_x f(x)$ holds iff $f(y) \ge f(x^*) + 0^T(y - x^*)$ for all $y$, which is exactly the statement that $0 \in \partial f(x^*)$. $\square$

Despite its simplicity, this is a very powerful and important result.
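
As a worked one-dimensional example (added here, not on the original slides, but it previews the lasso coordinate step), consider $h(x) = \frac{1}{2}(x - a)^2 + \lambda|x|$ with $\lambda > 0$. The condition $0 \in \partial h(x^*)$ reads $0 \in (x^* - a) + \lambda\,\partial|x^*|$. Checking the three cases $x^* > 0$, $x^* < 0$, $x^* = 0$ gives
$$x^* = \begin{cases} a - \lambda & \text{if } a > \lambda, \\ a + \lambda & \text{if } a < -\lambda, \\ 0 & \text{if } |a| \le \lambda, \end{cases}$$
i.e. $x^* = \operatorname{sgn}(a)(|a| - \lambda)_+$. This is exactly the soft-thresholding operation that reappears in the lasso coordinate update below.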

Back to the lasso

The function
$$f(x_i) := \frac{1}{2}\|y - Ax\|_2^2 + \lambda\sum_{k=1}^p |x_k|$$
(with $x_j$, $j \ne i$, held fixed) is convex. Its minimum is attained at $x_i^*$ if and only if $0 \in \partial f(x_i^*)$.

Let $g := \frac{\partial}{\partial x_i}\, \frac{1}{2}\|y - Ax\|_2^2 = A_i^T(A_{-i}x_{-i} - y) + (A_i^T A_i)\,x_i$, and write $g_\pm := g \pm \lambda$. Then
$$\partial f(x) = \begin{cases} \{g_-\} & \text{if } x_i < 0 \\ [g_-, g_+] & \text{if } x_i = 0 \\ \{g_+\} & \text{if } x_i > 0. \end{cases}$$
Now,
$$g_- = 0 \iff x_i = \frac{A_i^T(y - A_{-i}x_{-i}) + \lambda}{A_i^T A_i}.$$
This implies $0 \in \partial f(x_i^*)$ if
$$x_i^* = \frac{A_i^T(y - A_{-i}x_{-i}) + \lambda}{\|A_i\|_2^2} < 0.$$

Back to the lasso (cont.)

Similarly,
$$g_+ = 0 \iff x_i = \frac{A_i^T(y - A_{-i}x_{-i}) - \lambda}{A_i^T A_i}.$$
Therefore, $0 \in \partial f(x_i^*)$ if
$$x_i^* = \frac{A_i^T(y - A_{-i}x_{-i}) - \lambda}{\|A_i\|_2^2} > 0.$$
We have thus found a (unique) $x_i^*$ with $0 \in \partial f(x_i^*)$ whenever $A_i^T(y - A_{-i}x_{-i}) < -\lambda$ or $A_i^T(y - A_{-i}x_{-i}) > \lambda$. What happens when $-\lambda \le A_i^T(y - A_{-i}x_{-i}) \le \lambda$?

Back to the lasso (cont.)

The remaining case $-\lambda \le A_i^T(y - A_{-i}x_{-i}) \le \lambda$ is equivalent to $|A_i^T(y - A_{-i}x_{-i})| \le \lambda$. If $x_i = 0$, then $g = -A_i^T(y - A_{-i}x_{-i})$, so $|g| \le \lambda$ and hence $0 \in [g_-, g_+] = [g - \lambda, g + \lambda]$.

We have therefore shown that $0 \in \partial f(x_i^*)$ if $x_i^* = 0$ and $|A_i^T(y - A_{-i}x_{-i})| \le \lambda$.

Lasso: summary

Writing $z := A_i^T(y - A_{-i}x_{-i})$, we have shown the following:
$$0 \in \partial f(x_i^*) \quad \text{if} \quad \begin{cases} x_i^* = \dfrac{z + \lambda}{\|A_i\|_2^2} & \text{and } z < -\lambda, \\[1ex] x_i^* = \dfrac{z - \lambda}{\|A_i\|_2^2} & \text{and } z > \lambda, \\[1ex] x_i^* = 0 & \text{and } |z| \le \lambda. \end{cases}$$
Therefore, the minimum of $f$ is attained at
$$x_i^* = \begin{cases} \dfrac{z + \lambda}{\|A_i\|_2^2} & \text{if } z < -\lambda, \\[1ex] \dfrac{z - \lambda}{\|A_i\|_2^2} & \text{if } z > \lambda, \\[1ex] 0 & \text{if } -\lambda \le z \le \lambda. \end{cases}$$
In other words,
$$x_i^* = \eta^S_{\lambda/\|A_i\|_2^2}\!\left(\frac{A_i^T(y - A_{-i}x_{-i})}{A_i^T A_i}\right),$$
where $\eta^S_\epsilon$ is the soft-thresholding function.
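
A quick numerical check (an addition, not from the slides; variable names are illustrative): the closed-form coordinate minimizer above should agree with a brute-force grid minimization of the one-dimensional objective.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p, i, lam = 30, 4, 1, 0.7
    A = rng.standard_normal((n, p))
    y = rng.standard_normal(n)
    x = rng.standard_normal(p)

    r_i = y - A @ x + A[:, i] * x[i]      # y - A_{-i} x_{-i}
    z = A[:, i] @ r_i
    closed_form = np.sign(z) * max(abs(z) - lam, 0.0) / (A[:, i] @ A[:, i])

    grid = np.linspace(-5, 5, 100001)
    objective = [0.5 * np.sum((r_i - A[:, i] * t) ** 2) + lam * abs(t) for t in grid]
    brute_force = grid[int(np.argmin(objective))]
    print(closed_form, brute_force)       # the two values should be very close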

Soft-thresholding

Hard-thresholding: $\eta^H_\epsilon(x) = x\,\mathbf{1}_{|x| > \epsilon}$. Soft-thresholding: $\eta^S_\epsilon(x) = \operatorname{sgn}(x)(|x| - \epsilon)_+$, i.e.
$$\eta^S_\epsilon(x) = \begin{cases} x - \epsilon & \text{if } x > \epsilon \\ x + \epsilon & \text{if } x < -\epsilon \\ 0 & \text{if } -\epsilon \le x \le \epsilon. \end{cases}$$
[Figures: plots of hard thresholding and soft thresholding.]

Note: soft-thresholding shrinks the value until it hits zero (and then leaves it at zero).
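
In Python, soft-thresholding is one line; the function name soft_threshold below is an arbitrary choice (a small sketch, not part of the slides).

    import numpy as np

    # Vectorized soft-thresholding: sgn(x) * max(|x| - eps, 0).
    def soft_threshold(x, eps):
        return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

    print(soft_threshold(np.array([-3.0, -0.5, 0.0, 0.5, 3.0]), 1.0))
    # -> [-2. -0.  0.  0.  2.]   values inside [-1, 1] are set to 0, others shrink toward 0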

Conclusion

To solve the lasso problem using coordinate descent:
- Pick an initial point $x$.
- Cycle through the coordinates and perform the updates
$$x_i \leftarrow \eta^S_{\lambda/(A_i^T A_i)}\!\left(\frac{A_i^T(y - A_{-i}x_{-i})}{A_i^T A_i}\right).$$
- Continue until convergence (i.e., stop when the coordinates vary less than some threshold).

Exercise: Implement this algorithm in Python.
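
A possible sketch of the exercise (one attempt, not the official solution), assuming no column of $A$ is identically zero and using a simple maximum-change stopping rule:

    import numpy as np

    def soft_threshold(x, eps):
        return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

    def lasso_coordinate_descent(A, y, lam, max_cycles=1000, tol=1e-8):
        """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by cyclic coordinate descent."""
        n, p = A.shape
        x = np.zeros(p)
        col_norms_sq = np.sum(A ** 2, axis=0)      # A_i^T A_i for each column
        residual = y - A @ x                        # kept up to date as coordinates change
        for _ in range(max_cycles):
            max_change = 0.0
            for i in range(p):
                # r_i = y - A_{-i} x_{-i}: add back the current contribution of column i.
                r_i = residual + A[:, i] * x[i]
                z = A[:, i] @ r_i
                x_new = soft_threshold(z / col_norms_sq[i], lam / col_norms_sq[i])
                residual = r_i - A[:, i] * x_new    # update residual with the new x_i
                max_change = max(max_change, abs(x_new - x[i]))
                x[i] = x_new
            if max_change < tol:                    # stop when a full cycle barely moves x
                break
        return x

    # Small usage example on random data with a sparse true coefficient vector.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((50, 10))
    y = A @ np.array([2.0, -1.0] + [0.0] * 8) + 0.1 * rng.standard_normal(50)
    print(np.round(lasso_coordinate_descent(A, y, lam=5.0), 3))

As a cross-check, one could compare against sklearn.linear_model.Lasso, keeping in mind that scikit-learn scales the squared-error term by $1/(2n)$, so its alpha corresponds to $\lambda/n$ in the notation above.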
