Nonconvex? NP! (No Problem!) Ryan Tibshirani Convex Optimization 10-725/36-725
1 Nonconvex? NP! (No Problem!) Ryan Tibshirani Convex Optimization 10-725/36-725
2 Beyond the tip?
3 Some takeaway points. If possible, formulate the task in terms of convex optimization: it is typically easier to solve and easier to analyze. Nonconvex does not necessarily mean nonscientific! However, statistically, it does typically mean high(er) variance. And in more cases than you might expect, nonconvex problems can be solved exactly (to global optimality).
4 What does it mean for a problem to be nonconvex? Consider a generic convex optimization problem:

$$\min_x \; f(x) \quad \text{subject to} \quad h_i(x) \le 0, \; i = 1, \dots, m, \quad \ell_j(x) = 0, \; j = 1, \dots, r$$

Here $f$ and $h_i$, $i = 1, \dots, m$ are convex, and $\ell_j$, $j = 1, \dots, r$ are affine. A nonconvex problem is one of this form where not all of these conditions are met on the functions. But trivial modifications of convex problems can lead to nonconvex formulations... so we really just consider nonconvex problems that are not trivially equivalent to convex ones.
5 What does it mean to solve a nonconvex problem? Nonconvex problems can have local minima, i.e., there can exist a feasible $x$ such that $f(y) \ge f(x)$ for all feasible $y$ with $\|x - y\|_2 \le R$, but $x$ is still not globally optimal. (Note: we proved that this cannot happen for convex problems.) Hence by solving a nonconvex problem, we mean finding the global minimizer. We also implicitly mean doing it efficiently, i.e., in polynomial time.
6 Addendum. This is really about putting together a list of cool problems that are surprisingly tractable... hence there will be exceptions regarding nonconvexity and/or requiring exact global optima. (Also, I'm sure that there are many more examples out there that I'm missing, so I invite you to contribute to the list!)
7 Outline. Rough categories for today's problems:
- Classical/core nonconvex problems
- Eigen problems
- Graph problems
- Nonconvex proximal operators
- Discrete problems
- Infinite-dimensional problems
- Statistical problems
8 Classical/core nonconvex problems
9 Linear-fractional programs. A linear-fractional program is of the form

$$\min_{x \in \mathbb{R}^n} \; \frac{c^T x + d}{e^T x + f} \quad \text{subject to} \quad Gx \le h, \; e^T x + f > 0, \; Ax = b$$

This is nonconvex (but quasiconvex). Provided that this problem is feasible, it is in fact equivalent to the linear program

$$\min_{y \in \mathbb{R}^n, \, z \in \mathbb{R}} \; c^T y + dz \quad \text{subject to} \quad Gy - hz \le 0, \; z \ge 0, \; Ay - bz = 0, \; e^T y + fz = 1$$
10 The link between the two problems is the transformation

$$y = \frac{x}{e^T x + f}, \qquad z = \frac{1}{e^T x + f}$$

The proof of their equivalence is simple; e.g., see B&V Chapter 4. Linear-fractional problems show up in the study of solution paths for many common statistical estimation problems. E.g., the knots in the lasso path can be seen as the optimal values of linear-fractional programs; see Taylor et al. (2013), Tests in adaptive regression via the Kac-Rice formula.
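As a quick illustration (not from the slides), here is a minimal Python sketch that solves a small linear-fractional program through the LP transformation above, using scipy.optimize.linprog; the instance data is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical instance: minimize (c^T x + d) / (e^T x + f)
# subject to G x <= h (a box), with e^T x + f > 0 on the feasible set
c, d = np.array([1.0, 2.0]), 1.0
e, f = np.array([1.0, 1.0]), 2.0
G = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])  # 0 <= x <= 1
h = np.array([1.0, 1.0, 0.0, 0.0])

# Equivalent LP in (y, z): min c^T y + d z
# subject to G y - h z <= 0, e^T y + f z = 1, z >= 0
obj = np.append(c, d)
A_ub = np.hstack([G, -h[:, None]])
b_ub = np.zeros(len(h))
A_eq = np.append(e, f)[None, :]
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(None, None), (None, None), (0, None)])
y, z = res.x[:-1], res.x[-1]
print(y / z, res.fun)  # recovered x* and the optimal fractional value
```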
11 Geometric programs. A monomial is a function $f : \mathbb{R}^n_{++} \to \mathbb{R}$ of the form $f(x) = \gamma x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$ for $\gamma > 0$ and $a_1, \dots, a_n \in \mathbb{R}$. A posynomial is a sum of monomials,

$$f(x) = \sum_{k=1}^p \gamma_k x_1^{a_{k1}} x_2^{a_{k2}} \cdots x_n^{a_{kn}}$$

A geometric program is of the form

$$\min_x \; f(x) \quad \text{subject to} \quad g_i(x) \le 1, \; i = 1, \dots, m, \quad h_j(x) = 1, \; j = 1, \dots, r$$

where $f$ and $g_i$, $i = 1, \dots, m$ are posynomials and $h_j$, $j = 1, \dots, r$ are monomials. This is nonconvex.
12 This is equivalent to a convex problem, via a simple transformation. Given a monomial $f(x) = \gamma x_1^{a_1} x_2^{a_2} \cdots x_n^{a_n}$, let $y_i = \log x_i$ and rewrite it as

$$\gamma (e^{y_1})^{a_1} (e^{y_2})^{a_2} \cdots (e^{y_n})^{a_n} = e^{a^T y + b}$$

for $b = \log \gamma$. Likewise, a posynomial can be written as $\sum_{k=1}^p e^{a_k^T y + b_k}$. With this variable substitution, and after taking logs, a geometric program is equivalent to

$$\min_y \; \log \Big( \sum_{k=1}^{p_0} e^{a_{0k}^T y + b_{0k}} \Big) \quad \text{subject to} \quad \log \Big( \sum_{k=1}^{p_i} e^{a_{ik}^T y + b_{ik}} \Big) \le 0, \; i = 1, \dots, m, \quad c_j^T y + d_j = 0, \; j = 1, \dots, r$$

This is convex, recalling the convexity of soft max (log-sum-exp) functions.
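As a sketch (not from the slides), cvxpy can solve small geometric programs directly via its gp=True mode, which applies exactly this log transformation internally; the toy problem below is made up for illustration.

```python
import cvxpy as cp

# Toy GP (hypothetical data): minimize the monomial x*y subject to
# posynomial/monomial constraints
x = cp.Variable(pos=True)
y = cp.Variable(pos=True)
constraints = [x * y >= 2,       # monomial lower bound
               x + 2 * y <= 10,  # posynomial upper bound
               x == 4 * y]       # monomial equality
prob = cp.Problem(cp.Minimize(x * y), constraints)
prob.solve(gp=True)  # log-transforms to a convex problem under the hood
print(x.value, y.value, prob.value)
```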
13 Many interesting problems are geometric programs; see Boyd et al. (2007), A tutorial on geometric programming, and also Chapter 8.8 of the B&V book. (Figure: a floor planning layout, with cells $C_i$ of width $w_i$ and height $h_i$ at positions $(x_i, y_i)$, packed into a $W \times H$ bounding box.) Extension to the matrix world: Sra and Hosseini (2013), Geometric optimization on positive definite matrices with application to elliptically contoured distributions.
14 Handling convex equality constraints. Given convex $f$ and $h_i$, $i = 1, \dots, m$, the problem

$$\min_x \; f(x) \quad \text{subject to} \quad h_i(x) \le 0, \; i = 1, \dots, m, \quad \ell(x) = 0$$

is nonconvex when $\ell$ is convex but not affine. A convex relaxation of this problem is

$$\min_x \; f(x) \quad \text{subject to} \quad h_i(x) \le 0, \; i = 1, \dots, m, \quad \ell(x) \le 0$$

If we can ensure that $\ell(x^\star) = 0$ at any solution $x^\star$ of the relaxed problem, then the two are equivalent.
15 From B&V Exercises 4.6 and 4.58, e.g., consider the maximum utility problem

$$\max_{x_0, \dots, x_T, \; b_1, \dots, b_{T+1}} \; \sum_{t=0}^T \alpha_t u(x_t) \quad \text{subject to} \quad b_{t+1} = b_t + f(b_t) - x_t, \quad 0 \le x_t \le b_t, \quad t = 0, \dots, T$$

where $b_0 \ge 0$ is fixed. Interpretation: $x_t$ is the amount spent of your total available money $b_t$ at time $t$; the concave function $u$ gives utility, and the concave function $f$ measures investment return. This is not a convex problem, because of the equality constraint; but we can relax to $b_{t+1} \le b_t + f(b_t) - x_t$, $t = 0, \dots, T$, without changing the solution (think about throwing out money).
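A hedged cvxpy sketch of the relaxed problem, with the hypothetical choices $u(x) = \log x$, $f(b) = \sqrt{b}$, and $\alpha_t = 0.9^t$ (none of these come from the slides):

```python
import cvxpy as cp

# Relaxed maximum-utility problem: b_{t+1} <= b_t + f(b_t) - x_t,
# with hypothetical u(x) = log(x) and f(b) = sqrt(b)
T, alpha, b0 = 5, 0.9, 10.0
x = cp.Variable(T + 1)      # spending x_0, ..., x_T
b = cp.Variable(T + 2)      # balances b_0, ..., b_{T+1}
cons = [b[0] == b0, x >= 0, x <= b[:T + 1]]
cons += [b[t + 1] <= b[t] + cp.sqrt(b[t]) - x[t] for t in range(T + 1)]
obj = cp.Maximize(sum(alpha**t * cp.log(x[t]) for t in range(T + 1)))
cp.Problem(obj, cons).solve()
print(x.value)  # at the optimum, the balance constraints hold with equality
```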
16 Problems with two quadratic functions. Consider the problem involving two quadratics

$$\min_{x \in \mathbb{R}^n} \; x^T A_0 x + 2 b_0^T x + c_0 \quad \text{subject to} \quad x^T A_1 x + 2 b_1^T x + c_1 \le 0$$

Here $A_0, A_1$ need not be positive definite, so this is nonconvex. The dual problem can be cast as

$$\max_{u \in \mathbb{R}, \, v \in \mathbb{R}} \; u \quad \text{subject to} \quad \begin{bmatrix} A_0 + v A_1 & b_0 + v b_1 \\ (b_0 + v b_1)^T & c_0 + v c_1 - u \end{bmatrix} \succeq 0, \quad v \ge 0$$

and (as always) is convex. Furthermore, strong duality holds. See Appendix B of B&V; see also Beck and Eldar (2006), Strong duality in nonconvex quadratic optimization with two quadratic constraints.
17 Eigen problems
18 Principal component analysis. Given a matrix $Y \in \mathbb{R}^{n \times p}$, consider the nonconvex problem

$$\min_{X \in \mathbb{R}^{n \times p}} \; \|Y - X\|_F^2 \quad \text{subject to} \quad \operatorname{rank}(X) = k$$

for some fixed $k$. The solution is given by the singular value decomposition of $Y$: if $Y = UDV^T$, then $\hat{X} = U_k D_k V_k^T$, where $U_k, V_k$ are the first $k$ columns of $U, V$, and $D_k$ contains the first $k$ diagonal elements of $D$. I.e., $\hat{X}$ is the reconstruction of $Y$ from its first $k$ principal components. This is often called the Eckart-Young theorem, established in 1936, but it was probably known even earlier; see Stewart (1992), On the early history of the singular value decomposition.
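A minimal numpy sketch of this fact, with random data standing in for $Y$:

```python
import numpy as np

# Best rank-k approximation via truncated SVD (Eckart-Young)
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 20))
k = 5
U, d, Vt = np.linalg.svd(Y, full_matrices=False)
X_hat = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]  # global minimizer
print(np.linalg.norm(Y - X_hat, "fro") ** 2)   # optimal criterion value
```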
19 Fantope. Another characterization of the SVD is via the following nonconvex problem, given a symmetric matrix $S \in \mathbb{R}^{p \times p}$:

$$\min_{Z \in \mathbb{R}^{p \times p}} \; \|S - Z\|_F^2 \quad \text{subject to} \quad \operatorname{rank}(Z) = k, \; Z \text{ is a projection}$$

The solution here is $\hat{Z} = V_k V_k^T$, where the columns of $V_k \in \mathbb{R}^{p \times k}$ give the first $k$ eigenvectors of $S$. This is equivalent to a convex problem. Start by expressing the constraint set $C$ as

$$C = \big\{ Z \in \mathbb{R}^{p \times p} : \operatorname{rank}(Z) = k, \; Z \text{ is a projection} \big\} = \big\{ Z \in \mathbb{R}^{p \times p} : Z = Z^T, \; \lambda_i(Z) \in \{0, 1\} \text{ for } i = 1, \dots, p, \; \operatorname{tr}(Z) = k \big\}$$
20 Now consider the convex hull $\mathcal{F}_k = \operatorname{conv}(C)$:

$$\mathcal{F}_k = \big\{ Z \in \mathbb{R}^{p \times p} : Z = Z^T, \; \lambda_i(Z) \in [0, 1], \; i = 1, \dots, p, \; \operatorname{tr}(Z) = k \big\} = \big\{ Z \in \mathbb{R}^{p \times p} : Z = Z^T, \; 0 \preceq Z \preceq I, \; \operatorname{tr}(Z) = k \big\}$$

This is called the Fantope of order $k$. Further, the convex problem

$$\min_{Z \in \mathbb{R}^{p \times p}} \; \|S - Z\|_F^2 \quad \text{subject to} \quad Z \in \mathcal{F}_k$$

admits the same solution as the original one, i.e., $\hat{Z} = V_k V_k^T$. See Fan (1949), On a theorem of Weyl concerning eigenvalues of linear transformations, and Overton and Womersley (1992), On the sum of the largest eigenvalues of a symmetric matrix. Sparse PCA extension: Vu et al. (2013), Fantope projection and selection: near-optimal convex relaxation of sparse PCA.
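A small numpy sketch of the common solution $\hat{Z} = V_k V_k^T$, on a random symmetric matrix:

```python
import numpy as np

# Solution of both the nonconvex and the Fantope problem: Z_hat = V_k V_k^T
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 20))
S = (A + A.T) / 2                  # a symmetric matrix
k = 3
evals, evecs = np.linalg.eigh(S)   # eigenvalues in ascending order
Vk = evecs[:, -k:]                 # top-k eigenvectors
Z_hat = Vk @ Vk.T                  # rank-k projection matrix
print(np.trace(Z_hat), np.allclose(Z_hat @ Z_hat, Z_hat))  # 3.0, True
```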
21 Classical multidimensional scaling. Let $x_1, \dots, x_n \in \mathbb{R}^p$, and define similarities $S_{ij} = (x_i - \bar{x})^T (x_j - \bar{x})$. Classical multidimensional scaling solves the nonconvex problem

$$\min_{z_1, \dots, z_n \in \mathbb{R}^k} \; \sum_{i,j} \Big( S_{ij} - (z_i - \bar{z})^T (z_j - \bar{z}) \Big)^2$$

for a fixed $k$. (Figure from Hastie et al. (2009), The elements of statistical learning.)
22 Let $S$ be the similarity matrix (with entries $S_{ij} = (x_i - \bar{x})^T (x_j - \bar{x})$). The classical MDS problem has an exact solution in terms of the eigendecomposition $S = U D^2 U^T$: the solutions $\hat{z}_1, \dots, \hat{z}_n$ are the rows of $U_k D_k$, where $U_k$ is the first $k$ columns of $U$, and $D_k$ the first $k$ diagonal entries of $D$. Note: other very similar forms of MDS are not convex, and not directly solvable, e.g., least squares scaling, with $d_{ij} = \|x_i - x_j\|_2$:

$$\min_{z_1, \dots, z_n \in \mathbb{R}^k} \; \sum_{i,j} \big( d_{ij} - \|z_i - z_j\|_2 \big)^2$$

See Hastie et al. (2009), Chapter 14.
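A numpy sketch of the exact classical MDS solution, on hypothetical data:

```python
import numpy as np

# Classical MDS: rows of U_k D_k from the eigendecomposition S = U D^2 U^T
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 10))
Xc = X - X.mean(axis=0)
S = Xc @ Xc.T                       # S_ij = (x_i - xbar)^T (x_j - xbar)
k = 2
evals, evecs = np.linalg.eigh(S)
idx = np.argsort(evals)[::-1][:k]   # top-k eigenpairs
Z = evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0))  # row i is z_hat_i
print(Z.shape)  # (30, 2)
```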
23 Generalized eigenvalue problems. Given $B, W \in \mathbb{R}^{p \times p}$ with $B, W \succeq 0$, consider the nonconvex problem

$$\max_{v \in \mathbb{R}^p} \; \frac{v^T B v}{v^T W v}$$

This is a generalized eigenvalue problem, with exact solution given by the top eigenvector of $W^{-1} B$. This is important, e.g., in Fisher's discriminant analysis, where $B$ is the between-class covariance matrix and $W$ the within-class covariance matrix. See Hastie et al. (2009), Chapter 4.
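A sketch using scipy's generalized symmetric eigensolver, with random positive definite stand-ins for $B$ and $W$:

```python
import numpy as np
from scipy.linalg import eigh

# Maximize the generalized Rayleigh quotient v^T B v / v^T W v
rng = np.random.default_rng(3)
p = 5
M1, M2 = rng.standard_normal((p, p)), rng.standard_normal((p, p))
B = M1 @ M1.T + np.eye(p)       # stand-in between-class covariance
W = M2 @ M2.T + np.eye(p)       # stand-in within-class covariance
evals, evecs = eigh(B, W)       # solves B v = lambda W v, ascending order
v_star = evecs[:, -1]           # maximizer: top generalized eigenvector
print(v_star @ B @ v_star / (v_star @ W @ v_star), evals[-1])  # equal
```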
24 Graph problems
25 Min cut. Given a graph $G = (V, E)$ with $V = \{1, \dots, n\}$, two nodes $s, t \in V$, and costs $c_{ij} \ge 0$ on edges $(i, j) \in E$, the min cut problem is

$$\min_{b \in \mathbb{R}^{|E|}, \, x \in \mathbb{R}^{|V|}} \; \sum_{(i,j) \in E} b_{ij} c_{ij} \quad \text{subject to} \quad b_{ij} \ge x_i - x_j, \; b_{ij}, x_i, x_j \in \{0, 1\} \text{ for all } i, j, \quad x_s = 0, \; x_t = 1$$

Think of $b_{ij}$ as the indicator that the edge $(i, j)$ traverses the cut from $s$ to $t$, and $x_i$ as the indicator that node $i$ is grouped with $t$. This nonconvex problem can be solved exactly using max flow (by the max flow/min cut theorem).
26 A relaxation of min cut:

$$\min_{b \in \mathbb{R}^{|E|}, \, x \in \mathbb{R}^{|V|}} \; \sum_{(i,j) \in E} b_{ij} c_{ij} \quad \text{subject to} \quad b_{ij} \ge x_i - x_j \text{ for all } i, j, \quad b \ge 0, \quad x_s = 0, \; x_t = 1$$

This is an LP; it is the dual of the max flow LP (see lecture 12):

$$\max_{f \in \mathbb{R}^{|E|}} \; \sum_{(s,j) \in E} f_{sj} \quad \text{subject to} \quad 0 \le f_{ij} \le c_{ij} \text{ for all } (i,j) \in E, \quad \sum_{(i,k) \in E} f_{ik} = \sum_{(k,j) \in E} f_{kj} \text{ for all } k \in V \setminus \{s, t\}$$

The max flow min cut theorem tells us that the relaxed min cut is tight.
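A sketch using networkx, whose max-flow routines solve min cut exactly; the capacitated graph here is made up:

```python
import networkx as nx

# Min cut via max flow on a toy directed graph
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3.0)
G.add_edge("s", "b", capacity=1.0)
G.add_edge("a", "b", capacity=1.0)
G.add_edge("a", "t", capacity=2.0)
G.add_edge("b", "t", capacity=3.0)
cut_value, (s_side, t_side) = nx.minimum_cut(G, "s", "t")
print(cut_value, s_side, t_side)  # cut value 4.0, by max flow/min cut
```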
27 Shortest paths. Given a graph $G = (V, E)$ with edge costs $c_e$, $e \in E$, consider the shortest path problem between two nodes $s, t \in V$:

$$\min_{\text{paths } P = (e_1, \dots, e_r)} \; \sum_{e \in P} c_e \quad \text{subject to} \quad e_{1,1} = s, \; e_{r,2} = t, \; e_{i,2} = e_{i+1,1}, \; i = 1, \dots, r - 1$$

Dijkstra's algorithm solves this problem (and more); see Dijkstra (1959), A note on two problems in connexion with graphs. Clever implementations run in $O(|E| \log |V|)$ time; e.g., see Kleinberg and Tardos (2005), Algorithm design, Chapter 5.
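A self-contained sketch of Dijkstra's algorithm with a binary heap, which attains the $O(|E| \log |V|)$ running time just mentioned; the toy graph is hypothetical:

```python
import heapq

def dijkstra(adj, s):
    """Shortest-path distances from s; adj maps node -> [(neighbor, cost)]."""
    dist = {s: 0.0}
    pq = [(0.0, s)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry, skip
        for v, c in adj.get(u, []):
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

adj = {"s": [("a", 1.0), ("b", 4.0)],
       "a": [("b", 2.0), ("t", 6.0)],
       "b": [("t", 1.0)]}
print(dijkstra(adj, "s")["t"])  # 4.0, via s -> a -> b -> t
```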
28 Nonconvex proximal operators
29 Hard-thresholding. One of the simplest nonconvex problems: given $y \in \mathbb{R}^n$,

$$\min_{\beta \in \mathbb{R}^n} \; \sum_{i=1}^n (y_i - \beta_i)^2 + \sum_{i=1}^n \lambda_i 1\{\beta_i \ne 0\}$$

The solution is given by hard-thresholding $y$,

$$\hat\beta_i = \begin{cases} y_i & \text{if } y_i^2 > \lambda_i \\ 0 & \text{otherwise} \end{cases}, \quad i = 1, \dots, n$$

and can be seen by inspection. The special case $\lambda_i = \lambda$, $i = 1, \dots, n$ is $\min_{\beta \in \mathbb{R}^n} \|y - \beta\|_2^2 + \lambda \|\beta\|_0$. Compare to soft-thresholding, the prox operator for the $\ell_1$ penalty. Note: changing the loss to $\|y - X\beta\|_2^2$ gives best subset selection, which is NP hard for general $X$.
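The solution in code (a two-line numpy sketch):

```python
import numpy as np

def hard_threshold(y, lam):
    """Exact solution of the l0-penalized problem: keep y_i iff y_i^2 > lam."""
    return np.where(y**2 > lam, y, 0.0)

y = np.array([0.2, -1.5, 0.9, 3.0])
print(hard_threshold(y, lam=1.0))  # [ 0.  -1.5  0.   3. ]
```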
30 $\ell_0$ segmentation. Consider the nonconvex $\ell_0$ segmentation problem

$$\min_{\beta \in \mathbb{R}^n} \; \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n-1} 1\{\beta_i \ne \beta_{i+1}\}$$

This can be solved exactly using dynamic programming, in two ways: Bellman (1961), On the approximation of curves by line segments using dynamic programming, and Johnson (2013), A dynamic programming algorithm for the fused lasso and L0-segmentation. Johnson's is more efficient, Bellman's more general. Worst case is $O(n^2)$, but practical performance is more like $O(n)$.
31 Tree-leaves projection. Given a target $u \in \mathbb{R}^n$, a tree $g$ on $\mathbb{R}^n$, and a label $y \in \{0, 1\}$, consider

$$\min_{z \in \mathbb{R}^n} \; \|u - z\|_2^2 + \lambda \, 1\{g(z) \ne y\}$$

Interpretation: find $z$ close to $u$ whose label under $g$ is not unlike $y$. We can argue directly that the solution is either $\hat{z} = u$ or $\hat{z} = P_S(u)$, where $S = g^{-1}(y) = \{z : g(z) = y\}$, the set of leaves of $g$ assigned label $y$; we simply compute both options for $\hat{z}$ and compare costs. Therefore the problem reduces to computing $P_S(u)$, the projection onto a set of tree leaves, a highly nonconvex set. This appears as a subroutine of a broader algorithm for nonconvex optimization; see Carreira-Perpinan and Wang (2012), Distributed optimization of deeply nested systems.
32 The set $S$ is a union of axis-aligned boxes; projection onto any one box is fast, requiring $O(n)$ operations.
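A sketch of the brute-force scan over boxes (the faster pruned tree search is described just below); the box bounds here are hypothetical:

```python
import numpy as np

# Projection onto a union of axis-aligned boxes by brute-force scan;
# projecting onto one box is just coordinatewise clipping, O(n)
def project_box(u, lo, hi):
    return np.clip(u, lo, hi)

def project_union(u, boxes):
    projs = [project_box(u, lo, hi) for lo, hi in boxes]
    dists = [np.sum((u - p) ** 2) for p in projs]
    return projs[int(np.argmin(dists))]       # closest box wins

u = np.array([0.4, 2.5])
boxes = [(np.zeros(2), np.ones(2)),           # [0,1] x [0,1]
         (np.full(2, 2.0), np.full(2, 3.0))]  # [2,3] x [2,3]
print(project_union(u, boxes))  # [0.4, 1.0]
```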
33 To project onto $S$, we could just scan through all the boxes and take the closest. Faster: decorate each node of the tree with the labels of its leaves and a bounding box; then perform a depth-first search, pruning nodes that do not contain a leaf labeled $y$, or whose bounding box is farther away than the current closest box.
34 Discrete problems
35 Binary graph segmentation. Given $y \in \mathbb{R}^n$ and a graph $G = (V, E)$, $V = \{1, \dots, n\}$, consider binary graph segmentation:

$$\min_{\beta \in \{0,1\}^n} \; \sum_{i=1}^n (y_i - \beta_i)^2 + \sum_{(i,j) \in E} \lambda_{ij} 1\{\beta_i \ne \beta_j\}$$

Simple manipulation brings this problem to the form

$$\max_{A \subseteq \{1, \dots, n\}} \; \sum_{i \in A} a_i + \sum_{j \in A^c} b_j - \sum_{(i,j) \in E, \, |A \cap \{i,j\}| = 1} \lambda_{ij}$$

which is a segmentation problem that can be solved exactly using min cut/max flow. E.g., see Kleinberg and Tardos (2005), Algorithm design, Chapter 7.
36 E.g., apply this recursively to get a version of graph hierarchical clustering (divisive); e.g., take the graph as a 2d grid for image segmentation. (Image from ac.kr.)
37 Discrete $\ell_0$ segmentation. Now consider discrete $\ell_0$ segmentation:

$$\min_{\beta \in \{b_1, \dots, b_k\}^n} \; \sum_{i=1}^n (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n-1} 1\{\beta_i \ne \beta_{i+1}\}$$

where $\{b_1, \dots, b_k\}$ is some fixed discrete set. This can be efficiently solved using classic (discrete) dynamic programming. The key insight is that the 1-dimensional structure allows us to exactly solve and store

$$\hat\beta_1(\beta_2) = \operatorname*{argmin}_{\beta_1 \in \{b_1, \dots, b_k\}} \; \underbrace{(y_1 - \beta_1)^2 + \lambda \, 1\{\beta_1 \ne \beta_2\}}_{f_1(\beta_1, \beta_2)}, \qquad \hat\beta_2(\beta_3) = \operatorname*{argmin}_{\beta_2 \in \{b_1, \dots, b_k\}} \; f_1\big(\hat\beta_1(\beta_2), \beta_2\big) + (y_2 - \beta_2)^2 + \lambda \, 1\{\beta_2 \ne \beta_3\}, \quad \dots$$
38 Algorithm: make a forward pass over $\beta_1, \dots, \beta_{n-1}$, keeping a look-up table of $\hat\beta_1(\cdot), \dots, \hat\beta_{n-1}(\cdot)$; also keep a look-up table for the optimal partial criterion values $f_1, \dots, f_{n-1}$. Solve exactly for $\beta_n$, then make a backward pass over $\beta_{n-1}, \dots, \beta_1$, reading off the look-up table. Each table has one entry per candidate level $b_1, \dots, b_k$, and the whole algorithm requires $O(nk)$ operations.
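A sketch of this dynamic program, written with naive $O(nk^2)$ transitions (slightly simpler than the $O(nk)$ version the slide describes):

```python
import numpy as np

def l0_segment(y, b, lam):
    """DP for discrete l0 segmentation over levels b; returns fit and cost."""
    n, k = len(y), len(b)
    f = (y[0] - b) ** 2                    # f_1(beta_1), one entry per level
    back = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        # transition cost from level j to level l: f[j] + lam * 1{j != l}
        trans = f[:, None] + lam * (1 - np.eye(k))
        back[i] = np.argmin(trans, axis=0)
        f = trans[back[i], np.arange(k)] + (y[i] - b) ** 2
    beta = np.empty(n, dtype=int)          # backward pass over the table
    beta[-1] = int(np.argmin(f))
    for i in range(n - 1, 0, -1):
        beta[i - 1] = back[i, beta[i]]
    return b[beta], float(f.min())

y = np.array([0.1, 0.0, 2.1, 1.9, 2.0])
print(l0_segment(y, b=np.array([0.0, 2.0]), lam=0.5))
```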
39 Infinite-dimensional problems
40 Smoothing splines. Given pairs $(x_i, y_i) \in \mathbb{R} \times \mathbb{R}$, $i = 1, \dots, n$, smoothing splines solve

$$\min_f \; \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 + \lambda \int \big( f^{((k+1)/2)}(t) \big)^2 \, dt$$

for a fixed odd $k$. The domain of minimization here is all functions $f$ for which $\int (f^{((k+1)/2)}(t))^2 \, dt < \infty$. This is an infinite-dimensional problem, but convex (in function space). One can show that the solution $\hat{f}$ to the above problem is unique, and given by a natural spline of order $k$ with knots at $x_1, \dots, x_n$. This means we can restrict our attention to functions $f = \sum_{j=1}^n \theta_j \eta_j$, where $\eta_1, \dots, \eta_n$ are natural spline basis functions.
41 Plugging in $f = \sum_{j=1}^n \theta_j \eta_j$, we transform the smoothing spline problem into the finite-dimensional form

$$\min_{\theta \in \mathbb{R}^n} \; \|y - N\theta\|_2^2 + \lambda \theta^T \Omega \theta$$

where $N_{ij} = \eta_j(x_i)$ and $\Omega_{ij} = \int \eta_i^{((k+1)/2)}(t) \, \eta_j^{((k+1)/2)}(t) \, dt$. The solution is explicitly given by

$$\hat\theta = (N^T N + \lambda \Omega)^{-1} N^T y$$

and the fitted function is $\hat{f} = \sum_{j=1}^n \hat\theta_j \eta_j$. With a proper choice of basis functions (B-splines), the calculation of $\hat\theta$ is $O(n)$. See, e.g., Wahba (1990), Spline models for observational data; Green and Silverman (1994), Nonparametric regression and generalized linear models; Hastie et al. (2009), Chapter 5.
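A numpy sketch of the closed-form solve; here $N$ and $\Omega$ are random placeholders, whereas in practice they come from a natural spline basis at the inputs:

```python
import numpy as np

# theta_hat = (N^T N + lam * Omega)^{-1} N^T y, via a linear solve
rng = np.random.default_rng(4)
n = 50
N = rng.standard_normal((n, n))    # placeholder for N_ij = eta_j(x_i)
Omega = np.eye(n)                  # placeholder penalty Gram matrix
y = rng.standard_normal(n)
lam = 0.1
theta_hat = np.linalg.solve(N.T @ N + lam * Omega, N.T @ y)
fitted = N @ theta_hat             # fitted values at the inputs
```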
42 Locally adaptive regression splines. Given the same setup, locally adaptive regression splines solve

$$\min_f \; \sum_{i=1}^n \big( y_i - f(x_i) \big)^2 + \lambda \, \mathrm{TV}(f^{(k)})$$

for fixed $k$, even or odd. The domain is all $f$ with $\mathrm{TV}(f^{(k)}) < \infty$, and again this is infinite-dimensional but convex. Again, one can show that a solution $\hat{f}$ to the above problem is given by a spline of order $k$, but with two key differences: it can have any number of knots $\le n - k - 1$ (tuned by $\lambda$), and the knots do not necessarily coincide with the input points $x_1, \dots, x_n$. See Mammen and van de Geer (1997), Locally adaptive regression splines; in short, these are statistically more adaptive but computationally more challenging than smoothing splines.
43 Mammen and van de Geer (1997) consider restricting attention to splines with knots contained in $\{x_1, \dots, x_n\}$; this turns the problem into the finite-dimensional form

$$\min_{\theta \in \mathbb{R}^n} \; \|y - G\theta\|_2^2 + \lambda \sum_{j=k+2}^n |\theta_j|$$

where $G_{ij} = g_j(x_i)$, and $g_1, \dots, g_n$ is a basis for splines with knots at $x_1, \dots, x_n$. The fitted function is $\hat{f} = \sum_{j=1}^n \hat\theta_j g_j$. These authors prove that the solution $\hat{f}$ of this (tractable) problem and the solution $f^\star$ of the original problem differ by

$$\max_{x \in [x_1, x_n]} |\hat{f}(x) - f^\star(x)| \le d_k \, \Delta^k \, \mathrm{TV}\big( (f^\star)^{(k)} \big)$$

with $\Delta$ the maximum gap between inputs. Therefore, statistically it is reasonable to solve the finite-dimensional problem.
44 E.g., a comparison, tuned to the same overall model complexity. (Figure: a smoothing spline fit on the left, a finite-dimensional locally adaptive regression spline fit on the right.) The left fit is easier to compute, but the right is more adaptive. (Note: trend filtering estimates are asymptotically equivalent to locally adaptive regression splines, but much more efficient.)
45 Statistical problems
46 Sparse underdetermined linear systems. Suppose that $X \in \mathbb{R}^{n \times p}$ has unit-normed columns, $\|X_i\|_2 = 1$ for $i = 1, \dots, p$. Given $y$, consider the problem of finding the sparsest solution

$$\min_{\beta \in \mathbb{R}^p} \; \|\beta\|_0 \quad \text{subject to} \quad X\beta = y$$

This is nonconvex and known to be NP hard for a generic $X$. A natural convex relaxation is the $\ell_1$ basis pursuit problem:

$$\min_{\beta \in \mathbb{R}^p} \; \|\beta\|_1 \quad \text{subject to} \quad X\beta = y$$

It turns out that there is a deep connection between the two; we cite results from Donoho (2006), For most large underdetermined systems of linear equations, the minimal l1 norm solution is also the sparsest solution.
47 As $n, p$ grow large with $p > n$, there exists a threshold $\rho$ (depending on the ratio $p/n$) such that, for most matrices $X$, if we solve the $\ell_1$ problem and find a solution with: fewer than $\rho n$ nonzero components, then this is the unique solution of the $\ell_0$ problem; greater than $\rho n$ nonzero components, then there is no solution of the linear system with fewer than $\rho n$ nonzero components. (Here "most" is quantified precisely in terms of a probability over matrices $X$, constructed by drawing the columns of $X$ uniformly at random over the unit sphere in $\mathbb{R}^n$.) There is a large and fast-moving body of related literature; see Donoho et al. (2009), Message-passing algorithms for compressed sensing, for a nice review.
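A cvxpy sketch of basis pursuit on synthetic data, where the sparsity level is comfortably below the threshold, so recovery is exact:

```python
import cvxpy as cp
import numpy as np

# l1 basis pursuit: recover a sparse beta0 from y = X beta0
rng = np.random.default_rng(5)
n, p, s = 40, 100, 5
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=0)           # unit-normed columns
beta0 = np.zeros(p)
beta0[:s] = rng.standard_normal(s)
y = X @ beta0
beta = cp.Variable(p)
prob = cp.Problem(cp.Minimize(cp.norm1(beta)), [X @ beta == y])
prob.solve()
print(np.max(np.abs(beta.value - beta0)))  # essentially zero
```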
48 Nearly optimal K-means. Given data points $x_1, \dots, x_n \in \mathbb{R}^p$, the K-means problem solves

$$\min_{c_1, \dots, c_K \in \mathbb{R}^p} \; \underbrace{\frac{1}{n} \sum_{i=1}^n \min_{k=1,\dots,K} \|x_i - c_k\|_2^2}_{f(c_1, \dots, c_K)}$$

This is NP hard, and is usually approximately solved using Lloyd's algorithm, run many times with random starts. Careful choice of the starting positions makes a big impact: running Lloyd's algorithm once, from $c_1 = s_1, \dots, c_K = s_K$ for cleverly chosen random $s_1, \dots, s_K$, yields estimates $\hat{c}_1, \dots, \hat{c}_K$ satisfying

$$\mathbb{E}\big[ f(\hat{c}_1, \dots, \hat{c}_K) \big] \le 8 (\log K + 2) \min_{c_1, \dots, c_K \in \mathbb{R}^p} f(c_1, \dots, c_K)$$
49 See Arthur and Vassilvitskii (2007), k-means++: The advantages of careful seeding. In fact, their construction of $s_1, \dots, s_K$ is very simple: begin by choosing $s_1$ uniformly at random among $x_1, \dots, x_n$; compute squared distances $d_i^2 = \|x_i - s_1\|_2^2$ for all points $i$ not chosen, and choose $s_2$ by drawing from the remaining points with probability weights $d_i^2 / \sum_j d_j^2$; recompute the squared distances as $d_i^2 = \min\big\{ \|x_i - s_1\|_2^2, \, \|x_i - s_2\|_2^2 \big\}$ and choose $s_3$ according to the same recipe; and so on, until $s_1, \dots, s_K$ are chosen.
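A numpy sketch of the k-means++ seeding recipe just described:

```python
import numpy as np

def kmeanspp_seeds(X, K, rng):
    """k-means++ seeding: D^2-weighted sampling of K initial centers."""
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]               # s_1 uniform at random
    for _ in range(K - 1):
        # squared distance from each point to its nearest chosen seed
        d2 = np.min([np.sum((X - s) ** 2, axis=1) for s in seeds], axis=0)
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(seeds)

rng = np.random.default_rng(6)
X = rng.standard_normal((200, 2))
print(kmeanspp_seeds(X, K=3, rng=rng))  # then run Lloyd's from these seeds
```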
More information