Coordinate descent methods
|
|
- Octavia Lilian McCarthy
- 5 years ago
- Views:
Transcription
1 Coordinate descent methods Master Mathematics for data science and big data Olivier Fercoq November 3, 05 Contents Exact coordinate descent Coordinate gradient descent 3 3 Proximal coordinate descent 5 4 Stochastic dual coordinate ascent for support vector machines 9 Exact coordinate descent The idea of coordinate descent is to decompose a large optimisation problem into a sequence of one-dimensional optimisation problems. The algorithm was first described for the minimization of quadratic functions by Gauss and Seidel in Seidel (874). Coordinate descent methods have become unavoidable in machine learning because they are very efficient for ey problems, namely Lasso, logistic regression and support vector machines. Moreover, the decomposition into small subproblems means that only a small part of the data is processed at each iteration and this maes coordinate descent easily scalable to high dimensions. We first decompose the space of optimisation variables X into blocs X... X n = X. A classical choice when X = R n is to choose X =... = X n = R. We will denote U i the canonical injection from X i to X, that is U i is such that for all h X i, U i h = (0,..., 0,h, 0,..., 0) X. }{{}}{{} i zeros n i zeros For a function f : X... X n R, we define the following algorithm. Algorithm : Exact coordinate descent Start at x 0 X. At iteration, choose l = ( mod n) + (cyclic rule) and define x + X by { + = arg min z X l f(x () + = x(i),..., x(l ), z, x (l+),..., x (n) ) if i = l if i l
2 Figure : The successive iterates of the coordinate descent method on a D example. The function we are minimising is represented by its level sets: the bluer is the circle, the lower is the function values. Proposition (Warga (963)). If f is continuously differentiable and strictly convex and there exists x = arg min x X f(x), then the exact coordinate descent method (Alg. ) converges to x. Example (least squares). f(x) = Ax b = m j= (a j x b j) At each iteration, we need to solve in z the D equation For all x R n, so we get f x (x() (l),..., x(l ), z, x (l+),..., x (n) ) = 0 f x (l) (x) = a l (Ax b) = a l a lx (l) + a l ( a j x (j) ) a l b j l z = x (l) + = ( ) a l a l ( a j x (j) ) + a l b = x (l) a j l l (a l ( j= a j x (j) ) a l b ) Example (non-differentiable function). f(x (), x () ) = x () x () min(x (), x () ) + I [0,] (x) f is convex but not differentiable. If we nevertheless try to run exact coordinate descent, the algorithm proceeds as x () = arg min z f(z, x () 0 ) = x() 0, x() = arg min z f(x (), z) = x() 0, and so on. Thus exact coordinate descent converges in two iterations to (x () 0, x() 0 ): the algorithm is stuc on the non-differentiability point on the line {x () = x () } and does not reach the minimiser (, ). Example 3 (non-convex differentiable function). f(x (), x (), x (3) ) = (x () x () + x () x (3) + x (3) x () ) + 3 max(0, x(i) ) As shown by Powell (976), exact coordinate descent on this function started at the initial point x (0) = ( ɛ, + ɛ/, ɛ/4) has a limit cycle around the 6 corners of the cube that are not minimisers and avoids the corners that are minimisers. Exercise: Show Powell s result.
3 x 0 x = x Figure : The function in Example Example 4 (Adaboost). The Adaboost algorithm Collins et al. (00) was designed to minimise the exponential loss given by m f(x) = exp( y j h j x). j= At each iteration, we select the variable l such that l = arg max i i f(x) and we perform an exact coordinate descent step along this coordinate. This variable selection rule is called the greedy or Gauss-Southwell rule. Lie the cyclic rule, it leads to a converging algorithm but requires to compute the full gradient at each iteration. Greedy coordinate descent is interesting in the case of the exponential loss because the gradient of the function has a few very large coefficients and many negligible coefficients. Exercise: Suppose that y j {, } and h j,i {, 0, } for all j, i. Give the explicit formulas of i f(x) and of the next update x + nowing x. Coordinate gradient descent Solving a one-dimensional optimisation problems is generally easy and the solution can be approximated very well by algorithms lie the bisection method. However, for the exact coordinate descent method, one needs to solve a huge number of one-dimensional problems and the expense quicly becomes prohibitive. Moreover, why should we solve to high accuracy the -dimensional problem and destroy this solution at the next iteration? The idea of coordinate gradient descent is to perform one iteration of gradient descent in the -dimensional problem min z Xl f(x (),..., x(l ), z, x (l+),..., x (n) ) instead of solving it completely. In general, this reduces drastically the cost of each iteration while eeping the same convergence behaviour. When choosing the cyclic or greedy rule, the algorithm does converge for any convex function f that has a Lipschitz-continuous gradient and such that arg min x f(x). In fact we will assume that we actually now the coordinate-wise Lipschitz constants of the gradient of f, namely the Lipschitz constants of the functions g i,x : X i R h f(x + U i h) = f(x (),..., x (i ), + h, x (i+),..., x (n) ) 3
4 Algorithm : Coordinate gradient descent Start at x 0. At iteration, choose i + {,..., n} and define x + by { + = x(i) γ i i f(x ) if i = i + + = x(i) if i i + We will denote L i = L( g i,x ) this Lipschitz constant. Written in terms of f, this means that x X, i {,..., n}, h X i, f(x + U i h) f(x) L i U i h. Lemma. If f has a coordinate-wise Lipschitz gradient with constants L,..., L n, then x X, i {,..., n}, h X i, f(x + U i h) f(x) + i f(x), h + L i h Proof. This is Taylor s inequality applied to g i,x. Note that we do not require the function to be twice differentiable. Proposition (Bec and Tetruashvili (03)). Assume that f is convex, f is Lipschitz continuous and arg min x X f(x). If i + is chosen with the cyclic rule i + = ( mod n) + and i, γ i = L i, then the coordinate gradient descent method (Alg. ) satisfies f(x + ) f(x ) 4L max ( + n 3 L max/l min) R (x 0 ) + 8/n where R (x 0 ) = max x,y X { x y : f(y) f(x) f(x 0 )}, L max = max i L i and L min = min i L i. The proof of this result is quite technical and in fact the bound is much more pessimistic than what is observed in practice (n 3 is very large if n is large). This is due to the fact that the cyclic rule behaves particularly bad on some extreme examples. To avoid such traps, it has been suggested to randomise the coordinate selection process. Proposition 3 (Nesterov (0)). Assume that f is convex, f is Lipschitz continuous and arg min x X f(x). If i + is randomly generated, independently of i,..., i and i {,..., n}, P(i + = i) = n and γ i = L i, then the coordinate gradient descent method (Alg. ) satisfies for all x arg min x f(x) E[f(x + ) f(x )] where x L = n L i. n ( ( + n n )(f(x 0) f(x )) + ) x x 0 L Proof. This is a particular case of the method developed in the next section. Comparison with gradient descent The iteration complexity of the gradient descent method is f(x + ) f(x ) L( f) ( + ) x x 0 4
5 This means that to get an ɛ-solution (i.e. such that f(x ) f(x ) ɛ), we need at most L( f) ɛ x x 0 iterations. What is most expensive in gradient descent is the evaluation of the gradient f(x) with a cost C, so the total cost of the method is C grad = C L( f) x x 0 ɛ Neglecting the effect of randomisation, we usually have an ɛ-solution with coordinate descent in n ( ɛ ( n )(f(x 0) f(x )) + x x 0 L) iterations. The cost of one iteration of coordinate descent is of the order of the cost of evaluation one partial derivative i f(x), with a cost c, so the total cost of the method is C cd = c n ( ( ɛ n )(f(x 0) f(x )) + x x 0 ) L How do these two quantities compare? Let us consider the case where f(x) = Ax b. Computing f(x) = A (Ax b) amounts to updating the residuals r = Ax b (one matrix vector product and a sum) and computing one matrix vector product. We thus have C = O(nnz(A)). Computing i f(x) = e i A (Ax b) amounts to. updating the residuals r = Ax b: one scalar-vector product and a sum since we have r + = r + (x (i +) + x (i +) )Ae i+,. computing one vector-vector product (the i th column of A versus the residuals). Thus c = O(nnz(Ae i+ )) = O(nnz(A)/n) = C/n if the columns of A are equally sparse. f(x 0 ) f(x ) L( f) x 0 x and it may happen that f(x 0) f(x ) L( f) x 0 x L( f) = λ max (A A) and L i = a i a i with a i = Ae i. We always have L i L( f) and it may happen that L i = O(L( f)/n). To conclude, in the quadratic case, C cd C grad and we may have C cd = O(C grad /n). 3 Proximal coordinate descent We are often interested in solving problems of the type min x X F (x) = min f(x) + g(x) (3.) x X where f and g are convex so that F = f + g is convex, f has a Lipschitz continuous gradient and g may be nonsmooth but is separable. This means that for all x X = X... X n, g(x) = g i ( ). We can solve this ind of problems with the proximal coordinate descent method (Alg. 3, Tseng (00)), which is also using the coordinate-wise Lipschitz constant. For this algorithm to be practical, we need to be able to compute efficiently prox γ,g (y) = arg min x X g(x) + x y γ, the proximal operator of g (remember that x γ = n γ i ). 5
6 Algorithm 3: Proximal coordinate descent Start at x 0 X. At iteration, choose i + {,..., n} and define x + X by + = arg min x X i g i (x) + f(x ) + i f(x ), x + = x(i) + L i x x(i) if i = i + if i i + Example 5 (Simple proximal operators). Indicator of a box: if g(x) = I [a,b] (x), then prox γ,g (y) = max(a, min(x, b)). This is the projection on [a, b] (it does not depend on γ). Absolute value: if g(x) = λ x, then prox γ,g (y) = sign(y) max(0, y γλ). This is the soft-thresholding operator. We define x + = prox L,g(x L f(x )) = arg min x X g(x) + f(x ) + f(x ), x x + x x L, so that + = Lemma. For all γ R n + and x X { + if i = i + if i i + g( x + )+ f(x ), x + x + x + x γ g(x)+ f(x ), x x + x x γ x + x γ Proof. The function ψ : x g(x) + f(x ), x x + x x γ is strongly convex and its minimiser is x +. The inequality is just the strong convexity inequality of this function with respect to the norm γ and applied at x and x +. Theorem (Richtári and Taáč (04)). The proximal coordinate descent method (Alg. 3) with the random selection rule applied to Problem 3. satisfies for all x arg min x F (x) E[F (x + ) F (x )] where x L = n L i. n ( ( + n n )(F (x 0) F (x )) + ) x x 0 L Proof. By definition of the algorithm, x + x = U i+ (x (i +) + x (i +) ), so by Lemma, f(x + ) f(x ) + i+ f(x ), x (i +) + x (i +) + L i + x(i +) + x (i +) = f(x ) + f(x ), x + x + x + x L (3.) Using the notation x + = arg min x X g(x) + f(x ) + f(x ), x x + x x L, we have { + = + if i = i + if i i + 6
7 Using the conditional expectation nowing F = (i,... i ), we get E[ + F ] = P(i + = i) + + P(i + i) = n x(i) + + ( n )x(i) E[ i f(x ), + x(i) F ] = P(i + = i) i f(x ), + x(i) + P(i + i) i f(x ), x(i) = n if(x ), + x(i) E[ f(x ), x + x F ] = E[ x + x L F ] = E[g(x + ) g(x ) F ] = Combining (3.3), (3.4) and (3.5) with (3.), we get E[ i f(x ), + x(i) F ] = n f(x ), x + x (3.3) E[ L i x(i) + x(i) F ] = n x + x L (3.4) E[g i ( + ) g i( ) F ] = n (g( x +) g(x )) (3.5) [ E[g(x + ) + f(x + ) F ] E[g(x + ) F ] + E f(x ) + f(x ), x + x + ] x + x L F = ( n )g(x ) + n g( x +) + f(x ) + n f(x ), x + x + n x + x L Using Lemma with x = x, we get E[F (x + ) F ] = E[g(x + ) + f(x + ) F ] g(x ) + f(x ) n x + x g(x ) + f(x ) = F (x ) (3.6) Then, using Lemma again, with x = x, we get E[F (x + ) F ] = E[g(x + ) + f(x + ) F ] ( n )g(x ) + f(x ) + n g(x ) + n f(x ), x x + n x x L n x x + L We remar that so that E[ x x L x x + L F ] = n x x L n x x + L, E[F (x + ) F ] ( n )g(x ) + f(x ) + n g(x ) + n f(x ), x x We use the convexity of f: + x x L E[ x x + L F ]. E[F (x + ) F ] ( n )(g(x )+f(x ))+ n (g(x )+f(x ))+ x x L E[ x x + L F ]. We rearrange and we apply total expectation: E[F (x + ) F (x ) + x x + L] E[( n )(F (x ) F (x )) + x x L]. 7
8 Summing for from 0 to K- yields E[F (x K+ ) F (x )] + K E[ x x K+ L] + E[ n (F (x ) F (x )] = Using (3.6) and the fact that E[ x x K+ L ] 0, ( n )(F (x 0) F (x )) + x x 0 L]. ( + n )E[F (x K+) F (x )] ( n )(F (x 0) F (x )) + x x 0 L] We just need to divide by n + to conclude. Example 6 (Lasso). Proximal coordinate descent is widely used to solve the Lasso problem given by min x R p y Zx + λ x Here, f(x) = y Zx is differentiable while g(x) = λ x is a non-differentiable function whose proximal operator is the soft-thresholding operator. Example 7 (Multi-tas Lasso). In the multi-tas framewor, the Lasso problem can be generalised as p min x R p q Y Zx F + λ x j,:. Here, the optimisation variable is a p q matrix. One can see that the nonsmooth part of the objective is g(x) = λ p j= x j,:. This function is not separable when we consider the entries of x one by one but it is separable if we group these entries column-wise. Hence, we can consider bloc coordinate descent with p blocs of size q for the resolution of the multi-tas Lasso problem. Example 8 (l /l -regularised multinomial logistic regression). Logistic regression is famous for classification problems. One observes for each i [n] a class label c i {,..., q} and a vector of features z i R p. This information can be recast into a matrix Y R n q filled by 0 s and s: Y i, = {ci =}. A matrix B R p q is formed by q vectors encoding the hyperplanes for the linear classification. The multinomial l /l regularized regression reads: min B R p q ( q ( q Y i, zi B :, + log = = j= ( ) )) exp zi B :, + λ p B j,:, Lie the multi-tas Lasso problem, this problem can be solved with proximal coordinate descent as long as we consider blocs of variables corresponding to the columns of B rather than single variables. Exercise:. Find the proximal operator of the non-smooth function g(b) = λ p j= B j,:.. Give the expression of the partial derivatives of the smooth function ( q ( q ( ) )) f(b) = Y i, zi B :, + log exp zi B :, = = j= 8
9 3. Give an estimate of the p bloc-wise Lipschitz constants of f. 4. Write the proximal coordinate descent method for l /l -regularised multinomial logistic regression. 4 Stochastic dual coordinate ascent for support vector machines In this section, we focus on the linear Support Vector Machines (SVM) problem min C max(0, y i z w R p i w) + w where C is a positive real number, y R n and i, z i R p. Note that we consider the formulation without intercept. The objective function contains a non-smooth and non-separable term so we cannot apply coordinate descent to it. However, a dual formulation of the SVM problem is given by max α R n p ( n Z i,j y i α (i)) + α (i) I [0,C] n(α). j= The objective function of this problem does decompose into a differentiable concave function f(α) = p j= ( n Z i,jy i α (i)) + n α(i) and a nonsmooth concave and separable function g(α) = I [0,C] n(α). Stochastic Dual Coordinate Ascent (SDCA) is proximal coordinate ascent (the version of coordinate descent for concave functions) on this problem. Exercise. Write an implement of SDCA. It may be useful to maintain residuals w defined by w (j) = n Z i,jy i α (i) for all j {,..., p}. Even if we are running the algorithm in the dual, we are interested in the primal problem. The following result shows that we can recover a good primal solution from the dual solution and gives theoretical guarantees for the convergence in the primal. Theorem (Shalev-Shwartz and Zhang (03)). Let us define a primal point w = Z Diag (y) α, where (α ) 0 is generated by SDCA. The duality gap satisfies for all K n, [ K E K =K ] P (w ) D(α ) n ( ( K + n n )(D(α ) D(α 0 )) + ) α α 0 L + n K C where the primal value is P (w ) = C n max(0, y izi w ) + w, the dual value is D(α ) = Z Diag (y) α + n α(i) I [0,C] n(α ) and i, L i = yi z i. Proof. As SDCA solves the dual problem with coordinate ascent, by Theorem, E[D(α ) D(α + )] n ( ( + n n )(D(α ) D(α 0 )) + ) α α 0 L. The goal of the theorem is to upper bound E[P (w ) D(α )] by quantities involving E[D(α ) D(α + )]. Note that by wea duality, P (w ) D(α ) D(α ) D(α + ) but what we need is an inequality in the other way. For this, we will need to use the fact that (α ) 0 is generated by the coordinate ascent method. L i 9
10 Using the feasibility of α and the definition of w, we can simplify D(α + ) as D(α + ) = Z Diag (y) α + + α (i) + I [0,C] n(α +) = w + + e α + As α + = α + U i+ (ᾱ (i +) + α (i +) ), w + = w + z i+ y i+ (ᾱ (i +) + α (i +) ) and D(α + ) = w + z i+ y i+ (ᾱ (i +) + α (i +) ) + e α + ᾱ (i +) + α (i +) To simplify notations, we will write here i = i +. Note that ᾱ (i) + = arg max (y iz i Z Diag (y) α + )(a α (i) a [0,C] ) y iz i = arg max a [0,C] w + z i y i (a α (i) ) + a α (i). (a α (i) ) So let us consider φ : x C max(0, x), u φ(y i z i w ) [0, C] and s [0, ]. D(α + ) = max a [0,C] w + z i y i (a α (i) ) + e α + a α (i) w + z i y i ((su + ( s)α (i) ) α(i) ) + e α + (su + ( s)α (i) ) α(i) w + z i y i s(u α (i) ) + e α + s(u α (i) ) = w s z iy i (u α (i) ) s(u α (i) )y iz i w + e α + s(u α (i) ) = D(α ) s z iy i (u α (i) ) s(u α (i) )y iz i w + s(u α (i) ) As u φ(y i z i w ) [0, C] and φ (q) = q + I [ C,0] (q), Fenchel-Young equality leads to: φ(y i z i w ) u = uy i z i w. Hence, D(α + ) D(α ) s z iy i (u α (i) ) + sφ(y i zi w ) + sα (i) y izi w sα (i) Applying conditional expectation, we get E[D(α + ) F ] D(α ) s n Now, So that P (w ) D(α ) = C = z i y i (u α (i) ) + s n ( φ(yi zi w ) + α (i) y izi w α (i) ) max(0, y i zi w ) + w ( w + e α ) E[D(α + ) F ] D(α ) s n φ(y i zi w ) + α (i) y izi w α (i) s n z i y i (u α (i) ) + s n (P (w ) D(α )) ( z i y i )C + s n (P (w ) D(α )) 0
11 where the last inequality derives from α (i) [0, C] and u [0, C]. We apply total expectation and we sum for from K to K : s n K =K E[P (w ) D(α )] E[D(α K )] E[D(α K )] + s n C E[D(α )] E[D(α K )] + s n C c 0n K + n + s n C ( z i y i )(K K ) ( z i y i )(K K ) ( z i y i )(K K ) where c 0 = ( n )(D(α ) D(α 0 )) + α α 0 L. Choosing K = K and s = n K, we obtain, for K n (because we need s ) K K =K E[P (w ) D(α )] c 0n K + n + n C K ( z i y i ) References Bec, A. and Tetruashvili, L. (03). On the convergence of bloc coordinate descent type methods. SIAM Journal on Optimization, 3(4): Collins, M., Schapire, R. E., and Singer, Y. (00). Logistic regression, adaboost and bregman distances. Machine Learning, 48(-3): Nesterov, Y. (0). Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, (): Powell, M. (976). Some global convergence properties of a variable metric algorithm for minimization without exact line searches. Nonlinear programming, 9:53 7. Richtári, P. and Taáč, M. (04). Iteration complexity of randomized bloc-coordinate descent methods for minimizing a composite function. Mathematical Programming, 44(-): Seidel, L. (874). Ueber ein verfahren, die gleichungen, auf welche die methode der leinsten quadrate führt, sowie lineäre gleichungen überhaupt, durch successive annäherung aufzulösen:(aus den abhandl. d bayer. aademie ii. cl. xi. bd. iii. abth. Verlag der. Aademie. Shalev-Shwartz, S. and Zhang, T. (03). Stochastic dual coordinate ascent methods for regularized loss. The Journal of Machine Learning Research, 4(): Tseng, P. (00). Convergence of a bloc coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 09(3): Warga, J. (963). Minimizing certain convex functions. Journal of the Society for Industrial & Applied Mathematics, (3): Wright, S. J. (05). Coordinate descent algorithms. Mathematical Programming, 5():3 34.
Coordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationPrimal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions
Primal-dual coordinate descent A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non-Separable Functions Olivier Fercoq and Pascal Bianchi Problem Minimize the convex function
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationFAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES. Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč
FAST DISTRIBUTED COORDINATE DESCENT FOR NON-STRONGLY CONVEX LOSSES Olivier Fercoq Zheng Qu Peter Richtárik Martin Takáč School of Mathematics, University of Edinburgh, Edinburgh, EH9 3JZ, United Kingdom
More informationRandomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity
Randomized Coordinate Descent with Arbitrary Sampling: Algorithms and Complexity Zheng Qu University of Hong Kong CAM, 23-26 Aug 2016 Hong Kong based on joint work with Peter Richtarik and Dominique Cisba(University
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationPrimal-dual coordinate descent
Primal-dual coordinate descent Olivier Fercoq Joint work with P. Bianchi & W. Hachem 15 July 2015 1/28 Minimize the convex function f, g, h convex f is differentiable Problem min f (x) + g(x) + h(mx) x
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationLecture 23: November 21
10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationIteration Complexity of Feasible Descent Methods for Convex Optimization
Journal of Machine Learning Research 15 (2014) 1523-1548 Submitted 5/13; Revised 2/14; Published 4/14 Iteration Complexity of Feasible Descent Methods for Convex Optimization Po-Wei Wang Chih-Jen Lin Department
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationCS-E4830 Kernel Methods in Machine Learning
CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This
More informationNesterov s Acceleration
Nesterov s Acceleration Nesterov Accelerated Gradient min X f(x)+ (X) f -smooth. Set s 1 = 1 and = 1. Set y 0. Iterate by increasing t: g t 2 @f(y t ) s t+1 = 1+p 1+4s 2 t 2 y t = x t + s t 1 s t+1 (x
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationDISCUSSION PAPER 2011/70. Stochastic first order methods in smooth convex optimization. Olivier Devolder
011/70 Stochastic first order methods in smooth convex optimization Olivier Devolder DISCUSSION PAPER Center for Operations Research and Econometrics Voie du Roman Pays, 34 B-1348 Louvain-la-Neuve Belgium
More informationMini-Batch Primal and Dual Methods for SVMs
Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314
More informationConvex Optimization Algorithms for Machine Learning in 10 Slides
Convex Optimization Algorithms for Machine Learning in 10 Slides Presenter: Jul. 15. 2015 Outline 1 Quadratic Problem Linear System 2 Smooth Problem Newton-CG 3 Composite Problem Proximal-Newton-CD 4 Non-smooth,
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationAccelerated Proximal Gradient Methods for Convex Optimization
Accelerated Proximal Gradient Methods for Convex Optimization Paul Tseng Mathematics, University of Washington Seattle MOPTA, University of Guelph August 18, 2008 ACCELERATED PROXIMAL GRADIENT METHODS
More informationRandomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming
Randomized Block Coordinate Non-Monotone Gradient Method for a Class of Nonlinear Programming Zhaosong Lu Lin Xiao June 25, 2013 Abstract In this paper we propose a randomized block coordinate non-monotone
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationDescent methods. min x. f(x)
Gradient Descent Descent methods min x f(x) 5 / 34 Descent methods min x f(x) x k x k+1... x f(x ) = 0 5 / 34 Gradient methods Unconstrained optimization min f(x) x R n. 6 / 34 Gradient methods Unconstrained
More informationMath 273a: Optimization Subgradients of convex functions
Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions
More informationParallel Coordinate Optimization
1 / 38 Parallel Coordinate Optimization Julie Nutini MLRG - Spring Term March 6 th, 2018 2 / 38 Contours of a function F : IR 2 IR. Goal: Find the minimizer of F. Coordinate Descent in 2D Contours of a
More informationLinear Models in Machine Learning
CS540 Intro to AI Linear Models in Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu We briefly go over two linear models frequently used in machine learning: linear regression for, well, regression,
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationLasso: Algorithms and Extensions
ELE 538B: Sparsity, Structure and Inference Lasso: Algorithms and Extensions Yuxin Chen Princeton University, Spring 2017 Outline Proximal operators Proximal gradient methods for lasso and its extensions
More informationMachine Learning. Lecture 6: Support Vector Machine. Feng Li.
Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)
More informationOn the interior of the simplex, we have the Hessian of d(x), Hd(x) is diagonal with ith. µd(w) + w T c. minimize. subject to w T 1 = 1,
Math 30 Winter 05 Solution to Homework 3. Recognizing the convexity of g(x) := x log x, from Jensen s inequality we get d(x) n x + + x n n log x + + x n n where the equality is attained only at x = (/n,...,
More informationThe Frank-Wolfe Algorithm:
The Frank-Wolfe Algorithm: New Results, and Connections to Statistical Boosting Paul Grigas, Robert Freund, and Rahul Mazumder http://web.mit.edu/rfreund/www/talks.html Massachusetts Institute of Technology
More informationProximal Gradient Descent and Acceleration. Ryan Tibshirani Convex Optimization /36-725
Proximal Gradient Descent and Acceleration Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: subgradient method Consider the problem min f(x) with f convex, and dom(f) = R n. Subgradient method:
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationLecture 2: Convex Sets and Functions
Lecture 2: Convex Sets and Functions Hyang-Won Lee Dept. of Internet & Multimedia Eng. Konkuk University Lecture 2 Network Optimization, Fall 2015 1 / 22 Optimization Problems Optimization problems are
More informationSubgradient Method. Ryan Tibshirani Convex Optimization
Subgradient Method Ryan Tibshirani Convex Optimization 10-725 Consider the problem Last last time: gradient descent min x f(x) for f convex and differentiable, dom(f) = R n. Gradient descent: choose initial
More informationLecture 9: September 28
0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationCHAPTER 11. A Revision. 1. The Computers and Numbers therein
CHAPTER A Revision. The Computers and Numbers therein Traditional computer science begins with a finite alphabet. By stringing elements of the alphabet one after another, one obtains strings. A set of
More informationBlock Coordinate Descent for Regularized Multi-convex Optimization
Block Coordinate Descent for Regularized Multi-convex Optimization Yangyang Xu and Wotao Yin CAAM Department, Rice University February 15, 2013 Multi-convex optimization Model definition Applications Outline
More informationSubgradient Method. Guest Lecturer: Fatma Kilinc-Karzan. Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization /36-725
Subgradient Method Guest Lecturer: Fatma Kilinc-Karzan Instructors: Pradeep Ravikumar, Aarti Singh Convex Optimization 10-725/36-725 Adapted from slides from Ryan Tibshirani Consider the problem Recall:
More informationRelative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order
More informationLecture 8: February 9
0-725/36-725: Convex Optimiation Spring 205 Lecturer: Ryan Tibshirani Lecture 8: February 9 Scribes: Kartikeya Bhardwaj, Sangwon Hyun, Irina Caan 8 Proximal Gradient Descent In the previous lecture, we
More informationEfficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization
Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg
More informationSupport Vector Machines and Kernel Methods
2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University
More informationOptimization. Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison
Optimization Benjamin Recht University of California, Berkeley Stephen Wright University of Wisconsin-Madison optimization () cost constraints might be too much to cover in 3 hours optimization (for big
More informationRandom coordinate descent algorithms for. huge-scale optimization problems. Ion Necoara
Random coordinate descent algorithms for huge-scale optimization problems Ion Necoara Automatic Control and Systems Engineering Depart. 1 Acknowledgement Collaboration with Y. Nesterov, F. Glineur ( Univ.
More informationSmoothing Proximal Gradient Method. General Structured Sparse Regression
for General Structured Sparse Regression Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012) Gatsby Unit, Tea Talk October 25, 2013 Outline Motivation:
More informationMath 273a: Optimization Overview of First-Order Optimization Algorithms
Math 273a: Optimization Overview of First-Order Optimization Algorithms Wotao Yin Department of Mathematics, UCLA online discussions on piazza.com 1 / 9 Typical flow of numerical optimization Optimization
More informationCoordinate gradient descent methods. Ion Necoara
Coordinate gradient descent methods Ion Necoara January 2, 207 ii Contents Coordinate gradient descent methods. Motivation..................................... 5.. Coordinate minimization versus coordinate
More informationStochastic Proximal Gradient Algorithm
Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind
More informationCoordinate Descent Methods on Huge-Scale Optimization Problems
Coordinate Descent Methods on Huge-Scale Optimization Problems Zhimin Peng Optimization Group Meeting Warm up exercise? Warm up exercise? Q: Why do mathematicians, after a dinner at a Chinese restaurant,
More informationFast proximal gradient methods
L. Vandenberghe EE236C (Spring 2013-14) Fast proximal gradient methods fast proximal gradient method (FISTA) FISTA with line search FISTA as descent method Nesterov s second method 1 Fast (proximal) gradient
More informationSupport Vector Machines
Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal
More informationarxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationDual Proximal Gradient Method
Dual Proximal Gradient Method http://bicmr.pku.edu.cn/~wenzw/opt-2016-fall.html Acknowledgement: this slides is based on Prof. Lieven Vandenberghes lecture notes Outline 2/19 1 proximal gradient method
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationWHY DUALITY? Gradient descent Newton s method Quasi-newton Conjugate gradients. No constraints. Non-differentiable ???? Constrained problems? ????
DUALITY WHY DUALITY? No constraints f(x) Non-differentiable f(x) Gradient descent Newton s method Quasi-newton Conjugate gradients etc???? Constrained problems? f(x) subject to g(x) apple 0???? h(x) =0
More informationAlgorithms for Nonsmooth Optimization
Algorithms for Nonsmooth Optimization Frank E. Curtis, Lehigh University presented at Center for Optimization and Statistical Learning, Northwestern University 2 March 2018 Algorithms for Nonsmooth Optimization
More informationDistributed Optimization via Alternating Direction Method of Multipliers
Distributed Optimization via Alternating Direction Method of Multipliers Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato Stanford University ITMANET, Stanford, January 2011 Outline precursors dual decomposition
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More informationGradient methods for minimizing composite functions
Gradient methods for minimizing composite functions Yu. Nesterov May 00 Abstract In this paper we analyze several new methods for solving optimization problems with the objective function formed as a sum
More informationAccelerated Gradient Method for Multi-Task Sparse Learning Problem
Accelerated Gradient Method for Multi-Task Sparse Learning Problem Xi Chen eike Pan James T. Kwok Jaime G. Carbonell School of Computer Science, Carnegie Mellon University Pittsburgh, U.S.A {xichen, jgc}@cs.cmu.edu
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationConvex Optimization and Support Vector Machine
Convex Optimization and Support Vector Machine Problem 0. Consider a two-class classification problem. The training data is L n = {(x 1, t 1 ),..., (x n, t n )}, where each t i { 1, 1} and x i R p. We
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationDuality in Linear Programs. Lecturer: Ryan Tibshirani Convex Optimization /36-725
Duality in Linear Programs Lecturer: Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: proximal gradient descent Consider the problem x g(x) + h(x) with g, h convex, g differentiable, and
More informationLecture 25: November 27
10-725: Optimization Fall 2012 Lecture 25: November 27 Lecturer: Ryan Tibshirani Scribes: Matt Wytock, Supreeth Achar Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationGradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve
More informationLinear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction
Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the
More informationConvex Optimization. Dani Yogatama. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. February 12, 2014
Convex Optimization Dani Yogatama School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA February 12, 2014 Dani Yogatama (Carnegie Mellon University) Convex Optimization February 12,
More informationLECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE
LECTURE 25: REVIEW/EPILOGUE LECTURE OUTLINE CONVEX ANALYSIS AND DUALITY Basic concepts of convex analysis Basic concepts of convex optimization Geometric duality framework - MC/MC Constrained optimization
More informationLecture Support Vector Machine (SVM) Classifiers
Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in
More information18.657: Mathematics of Machine Learning
8.657: Mathematics of Machine Learning Lecturer: Philippe Rigollet Lecture 3 Scribe: Mina Karzand Oct., 05 Previously, we analyzed the convergence of the projected gradient descent algorithm. We proved
More informationAgenda. Fast proximal gradient methods. 1 Accelerated first-order methods. 2 Auxiliary sequences. 3 Convergence analysis. 4 Numerical examples
Agenda Fast proximal gradient methods 1 Accelerated first-order methods 2 Auxiliary sequences 3 Convergence analysis 4 Numerical examples 5 Optimality of Nesterov s scheme Last time Proximal gradient method
More informationSelected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018
Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More informationminimize x subject to (x 2)(x 4) u,
Math 6366/6367: Optimization and Variational Methods Sample Preliminary Exam Questions 1. Suppose that f : [, L] R is a C 2 -function with f () on (, L) and that you have explicit formulae for
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationMath 273a: Optimization Subgradient Methods
Math 273a: Optimization Subgradient Methods Instructor: Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com Nonsmooth convex function Recall: For ˉx R n, f(ˉx) := {g R
More informationStatistical Methods for Data Mining
Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find
More informationLecture 1: September 25
0-725: Optimization Fall 202 Lecture : September 25 Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Subhodeep Moitra Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes have
More information