Linear programming III



Review

What we have covered in the previous two classes:
- LP problem setup: linear objective function, linear constraints; there exists an extreme-point optimal solution.
- Simplex method: go through extreme points to find the optimal solution.
- Primal-dual property of the LP problem.
- Interior point algorithm: based on the primal-dual property, travel through the interior of the feasible solution space.
- Quadratic programming: based on the KKT conditions.
- LP application: quantile regression, which minimizes the asymmetric absolute deviations.

LP/QP application in statistics II: LASSO

Consider the usual regression setting with data (x_i, y_i), where x_i = (x_{i1}, ..., x_{ip}) is a vector of predictors and y_i is the response for the i-th object. The ordinary linear regression setting is: find coefficients to minimize the residual sum of squares:
$$\hat{b} = \operatorname*{argmin}_b \sum_{i=1}^{n} (y_i - x_i b)^2$$
Here b = (b_1, b_2, ..., b_p)^T is a vector of coefficients. The solution happens to be the MLE assuming a normal model:
$$y_i = x_i b + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$
This is not ideal when the number of predictors (p) is large, because
1. it requires p < n, or there must be some degrees of freedom left for the residual.
2. one often wants a small subset of predictors in the model, but OLS provides an estimated coefficient for every predictor.

The LASSO

LASSO stands for Least Absolute Shrinkage and Selection Operator; it aims at model selection when p is large (it works even when p > n). The LASSO procedure shrinks the coefficients toward 0, and eventually forces some to be exactly 0 (so predictors with 0 coefficient are selected out). The LASSO estimates are defined as:
$$\hat{b} = \operatorname*{argmin}_b \sum_{i=1}^{n} (y_i - x_i b)^2, \quad \text{s.t. } \|b\|_1 \le t$$
Here $\|b\|_1 = \sum_{j=1}^{p} |b_j|$ is the L1 norm, and t ≥ 0 is a tuning parameter controlling the strength of shrinkage. So LASSO tries to minimize the residual sum of squares, with a constraint on the sum of the absolute values of the coefficients.

NOTE: There are other types of regularized regressions. For example, regression with an L2 penalty, i.e., $\sum_j b_j^2 \le t$, is called ridge regression.

Model selection by LASSO

The feasible solution space for LASSO is a polytope (defined by the linear constraints), so the optimal solution is often at a corner point. The implication: at the optimum, many coefficients (non-basic variables) will be exactly 0, which gives variable selection. On the contrary, ridge regression usually doesn't set any coefficient exactly to 0, so it doesn't do model selection (see the small comparison below). The LASSO problem can be solved by a standard quadratic programming algorithm.
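
As a quick, minimal sketch of this contrast (it uses the glmnet package introduced a few slides below; the simulated data and the penalty level s = 0.5 are arbitrary choices, not from the slides):

## Minimal sketch: lasso vs. ridge coefficients on simulated data.
## In glmnet, alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100) + x[, 1:2] %*% c(-1, 2)   # only the first two predictors matter

lasso.fit <- glmnet(x, y, alpha = 1)
ridge.fit <- glmnet(x, y, alpha = 0)

## At a comparable penalty level, the lasso sets many coefficients exactly to 0,
## while ridge only shrinks them toward 0.
coef(lasso.fit, s = 0.5)
coef(ridge.fit, s = 0.5)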

LASSO model fitting

In LASSO, we need to solve the following optimization problem:
$$\min_b \sum_{i=1}^{n} \Bigl(y_i - \sum_j b_j x_{ij}\Bigr)^2 \quad \text{s.t. } \sum_j |b_j| \le t$$
The trick is to convert the problem into the standard QP setting, i.e., to remove the absolute value operator. The easiest way is to let $b_j = b_j^+ - b_j^-$, where $b_j^+, b_j^- \ge 0$. Then $|b_j| = b_j^+ + b_j^-$, and the problem can be written as:
$$\min \sum_{i=1}^{n} \Bigl(y_i - \sum_j b_j^+ x_{ij} + \sum_j b_j^- x_{ij}\Bigr)^2 \quad \text{s.t. } \sum_j (b_j^+ + b_j^-) \le t, \quad b_j^+, b_j^- \ge 0$$
This is a standard QP problem that can be solved by standard QP solvers; a sketch follows.
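
To make the conversion concrete, here is a minimal sketch that solves this QP with the quadprog package (an assumed solver choice; the slides do not prescribe one). It splits b into b+ and b- exactly as above; a tiny ridge is added to the quadratic term only because solve.QP requires a strictly positive definite matrix.

## Minimal sketch: LASSO as a standard QP, solved with quadprog::solve.QP.
## Variables are theta = (b+, b-), both nonnegative, with sum(theta) <= t.
library(quadprog)
set.seed(1)
n <- 50; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n) + x[, 1:2] %*% c(-1, 2)
t.budget <- 2                            # L1 budget (the tuning parameter t)

z <- cbind(x, -x)                        # design for (b+, b-): x %*% (b+ - b-) = z %*% theta
Dmat <- 2 * crossprod(z) + 1e-8 * diag(2 * p)  # small ridge so Dmat is positive definite
dvec <- drop(2 * crossprod(z, y))
## Constraints in solve.QP form A^T theta >= b0:
##   theta_j >= 0 (2p rows), and -sum(theta) >= -t (the L1 budget).
Amat <- cbind(diag(2 * p), -rep(1, 2 * p))
bvec <- c(rep(0, 2 * p), -t.budget)
sol <- solve.QP(Dmat, dvec, Amat, bvec)
b.hat <- sol$solution[1:p] - sol$solution[(p + 1):(2 * p)]
round(b.hat, 3)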

A little more on LASSO

The Lagrangian (penalized) form of the LASSO optimization problem is:
$$L(b, \lambda) = \sum_{i=1}^{n} \Bigl(y_i - \sum_j b_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p} |b_j|$$
Minimizing this is equivalent to maximizing the posterior of a hierarchical model with a double exponential (DE) prior on the b's (remember the asymmetric DE used in quantile regression?):
$$b_j \sim DE(1/\lambda), \qquad Y \mid X, b \sim N(Xb, 1)$$
The DE density function is $f(x; \tau) = \frac{1}{2\tau}\exp\bigl(-\frac{|x|}{\tau}\bigr)$.

As a side note, ridge regression is equivalent to the hierarchical model with a Normal prior on b (verify it; a sketch of the verification follows).
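
A one-line verification of that side note (a sketch, with the prior variance written as τ² and the noise variance as σ²):

% MAP estimate under a Normal prior is ridge regression.
% Model: y | X, b ~ N(Xb, sigma^2 I), prior: b_j ~ N(0, tau^2) independently.
\begin{aligned}
\log p(b \mid y, X)
  &= \log p(y \mid X, b) + \log p(b) + \text{const} \\
  &= -\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i b)^2
     \;-\; \frac{1}{2\tau^2}\sum_{j=1}^{p} b_j^2 + \text{const}.
\end{aligned}

Maximizing this over b is the same as minimizing $\sum_i (y_i - x_i b)^2 + \lambda \sum_j b_j^2$ with $\lambda = \sigma^2/\tau^2$, i.e., ridge regression. Replacing the Normal prior with the double exponential prior above gives the L1 penalty and hence the LASSO.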

LASSO in R

The glmnet package has the function glmnet:

glmnet                  package:glmnet                  R Documentation

fit a GLM with lasso or elasticnet regularization

Description:
     Fit a generalized linear model via penalized maximum likelihood. The
     regularization path is computed for the lasso or elasticnet penalty at a
     grid of values for the regularization parameter lambda. Can deal with all
     shapes of data, including very large sparse data matrices. Fits linear,
     logistic and multinomial, poisson, and Cox regression models.

Usage:
     glmnet(x, y, family=c("gaussian","binomial","poisson","multinomial","cox","mgaussian"),
            weights, offset=NULL, alpha = 1, nlambda = 100,
            lambda.min.ratio = ifelse(nobs<nvars,0.01,0.0001), lambda=NULL,
            standardize = TRUE, intercept=TRUE, thresh = 1e-07,
            dfmax = nvars + 1, pmax = min(dfmax * 2+20, nvars), exclude,
            penalty.factor = rep(1, nvars), lower.limits=-Inf, upper.limits=Inf,
            maxit=100000, type.gaussian=ifelse(nvars<500,"covariance","naive"),
            type.logistic=c("Newton","modified.Newton"), standardize.response=FALSE,
            type.multinomial=c("ungrouped","grouped"))

LASSO in R example

> x=matrix(rnorm(100*10),100,10)
> b = c(-1, 2)
> y=rnorm(100) + x[,1:2]%*%b
> fit1=glmnet(x,y)
>
> coef(fit1, s=0.05)
11 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  0.003020916
V1          -0.967153276
V2           1.809566641
V3          -0.106775004
V4           0.041574896
V5           .
V6           .
V7           0.102566050
V8           .
V9           .
V10          .

> coef(fit1, s=0.1)
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.01304181
V1          -0.92725224
V2           1.76178647
V3          -0.05743472
V4           .
V5           .
V6           .
V7           0.05953563
V8           .
V9           .
V10          .

> coef(fit1, s=0.5)
11 x 1 sparse Matrix of class "dgCMatrix"
                      1
(Intercept)  0.08689072
V1          -0.52883089
V2           1.29823139
V3           .
V4           .
V5           .
V6           ...

> plt(fit1, "lambda") #### run crss validatin > cv=cv.glmnet(x,y) > plt(cv) 10 7 5 4 2 2 10 10 7 5 5 4 3 2 2 2 0 Cefficients 1.0 0.0 0.5 1.0 1.5 5 4 3 2 1 0 Lg Lambda Mean Squared Errr 1 2 3 4 5 5 4 3 2 1 0 lg(lambda)

Support Vector Machine (SVM)

Figures for these slides are obtained from Hastie et al., The Elements of Statistical Learning.

Problem setting: we are given training data pairs (x_1, y_1), ..., (x_N, y_N). The x_i's are p-vector predictors; the y_i ∈ {-1, 1} are outcomes. Our goal: predict y based on x (find a classifier). Such a classifier is defined as a function of x, G(x). G is estimated from the training data (x, y) pairs. Once G is obtained, it can be used for future predictions. There are many ways to construct G(x), and the Support Vector Machine (SVM) is one of them. We'll first consider the simple case: G(x) is based on a linear function of x. It is often called the linear SVM or support vector classifier.

Simple case: perfectly separable case

First define a linear hyperplane by {x : f(x) = x^T b + b_0 = 0}. It is required that b is a unit vector with ||b|| = 1 for identifiability. A classification rule can be defined as G(x) = sign[x^T b + b_0]. The problem is to estimate the b's.

Consider a simple case where the two groups are perfectly separated. We want to find a border that separates the two groups. There are infinitely many borders that can perfectly separate the two groups. Which one is optimal? Conceptually, the optimal border should separate the two classes with the largest margins. We define the optimal border to be the one satisfying: (1) the distances from the closest points to the border are the same in both groups; denote this distance by M; and (2) M is maximized. M is called the margin.

Problem setup

The problem of finding the best border can be framed as the following optimization problem:
$$\max_{b,\, b_0:\, \|b\|=1} M \quad \text{s.t. } y_i(x_i^T b + b_0) \ge M, \; i = 1, \dots, N$$
This is not a typical LP/QP problem, so we do some transformations to make it look more familiar. Divide both sides of the constraint by M, and define β = b/M, β_0 = b_0/M; the constraints become y_i(x_i^T β + β_0) ≥ 1. This means that we rescale the coefficients of the border hyperplane so that the margin lines have the form x^T β + β_0 = +1 and x^T β + β_0 = -1.

Now we have ||β|| = ||b||/M = 1/M. So the objective (maximizing M) is equivalent to minimizing ||β||. After this transformation, the optimization problem can be expressed in a simpler, more familiar form:
$$\min_{\beta, \beta_0} \|\beta\| \quad \text{s.t. } y_i(x_i^T \beta + \beta_0) \ge 1, \; i = 1, \dots, N$$
This is a typical quadratic programming problem.

Illustration of the optimal border (solid line) with margins (dashed lines).

[ESL Figure 12.1, support vector classifiers: the decision boundary is the solid line x^T β + β_0 = 0, and the broken lines bound the maximal margin of width 2M = 2/||β||. In the non-separable (overlap) panel, points labeled ξ*_j are on the wrong side of their margin by an amount ξ*_j = M ξ_j; points on the correct side have ξ*_j = 0.]

Non-separable case

When the two classes are not perfectly separable, we still want to find a border with two margins, but now there will be points on the wrong side. We introduce slack variables to account for those points. Define slack variables {ξ_1, ..., ξ_N}, where ξ_i ≥ 0:
- ξ_i = 0 when the point is on the correct side of the margin.
- ξ_i > 1 when the point passes the border to the wrong side.
- 0 < ξ_i < 1 when the point is inside the margin but still on the correct side.
[See the right panel of ESL Figure 12.1 for the non-separable (overlap) case.]

Now the constraints in the original optimization problem are modified to:
$$y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \; i = 1, \dots, N$$
ξ_i can be interpreted as the proportional amount by which the prediction is on the wrong side of its margin. Another constraint, Σ_i ξ_i ≤ C, is added to bound the total amount of margin violation (and hence the number of misclassifications). Together, the optimization problem for this case is written as:
$$\min_{\beta, \beta_0} \frac{1}{2}\|\beta\|^2 \quad \text{s.t. } y_i(x_i^T \beta + \beta_0) \ge 1 - \xi_i, \quad \sum_i \xi_i \le C, \quad \xi_i \ge 0$$
Again this is a quadratic programming problem. What are the unknowns? (β, β_0, and the ξ_i's.)

Computation

The primal Lagrangian is:
$$L_P = \frac{1}{2}\|\beta\|^2 + \gamma \sum_i \xi_i - \sum_i \alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] - \sum_i \mu_i \xi_i$$
Take derivatives with respect to β, β_0, ξ_i and set them to zero to get (the stationarity conditions):
$$\beta = \sum_i \alpha_i y_i x_i, \qquad 0 = \sum_i \alpha_i y_i, \qquad \alpha_i = \gamma - \mu_i, \; \forall i$$
Plug these back into the primal Lagrangian to get the following dual objective function (verify; see the sketch below):
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} x_i^T x_{i'}$$
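
To do the "verify" step, substitute the stationarity conditions back into L_P (a sketch of the algebra):

% Plug beta = sum_i alpha_i y_i x_i, sum_i alpha_i y_i = 0, alpha_i = gamma - mu_i into L_P.
\begin{aligned}
L_P &= \tfrac{1}{2}\|\beta\|^2 + \gamma\sum_i \xi_i
      - \sum_i \alpha_i\bigl[y_i(x_i^T\beta + \beta_0) - (1-\xi_i)\bigr]
      - \sum_i \mu_i \xi_i \\
    &= \tfrac{1}{2}\sum_i\sum_{i'} \alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'}
      - \sum_i\sum_{i'} \alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'}
      - \beta_0\sum_i \alpha_i y_i
      + \sum_i \alpha_i
      + \sum_i (\gamma - \alpha_i - \mu_i)\,\xi_i \\
    &= \sum_i \alpha_i
      - \tfrac{1}{2}\sum_i\sum_{i'} \alpha_i\alpha_{i'}y_iy_{i'}x_i^Tx_{i'} \;=\; L_D,
\end{aligned}

since $\sum_i \alpha_i y_i = 0$ and $\gamma - \alpha_i - \mu_i = 0$ kill the middle terms.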

L_D needs to be maximized subject to the constraints:
$$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \gamma$$
The KKT conditions for the problem (in addition to the stationarity conditions) include the following complementary slackness and primal/dual feasibility conditions:
$$\alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] = 0$$
$$\mu_i \xi_i = 0$$
$$y_i(x_i^T \beta + \beta_0) - (1 - \xi_i) \ge 0$$
$$\alpha_i, \mu_i, \xi_i \ge 0$$
The QP problem can be solved using an interior point method based on these.

Solve for β̂_0

With the α̂_i and β̂ given, we still need β̂_0 to construct the decision boundary. One of the complementary slackness conditions is:
$$\alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] = 0$$
Any point with α̂_i > 0 and ξ̂_i = 0 (a point on the margin) can be used to solve for β̂_0. In practice we often average over those points to get a stable result for β̂_0.

The support vectors

At the optimal solution, β has the form:
$$\hat\beta = \sum_i \hat\alpha_i y_i x_i$$
This means β̂ is a linear combination of the y_i x_i, and only depends on those data points with α̂_i ≠ 0. These data points are called support vectors. According to the complementary slackness in the KKT conditions, at the optimal point we have:
$$\alpha_i \bigl[y_i(x_i^T \beta + \beta_0) - (1 - \xi_i)\bigr] = 0, \; \forall i,$$
which means α_i can be non-zero only when y_i(x_i^T β + β_0) - (1 - ξ_i) = 0. What does this result tell us?

For points with non-zero α_i:
- Points with ξ_i = 0 have y_i(x_i^T β + β_0) = 1, i.e., they lie on the margin lines.
- The other points, with y_i(x_i^T β + β_0) = 1 - ξ_i, are on the wrong side of their margin.
So only the points on the margin or on the wrong side of the margin are informative for the separating hyperplane. These points are called the support vectors, because they provide support for the decision boundary. This makes sense, because the points that are correctly separated and far away from the margin (the easy points) don't tell us anything about the classification rule (the hyperplane).

Support Vector Machine

We have discussed the support vector classifier, which uses a hyperplane to separate two groups. The Support Vector Machine enlarges the feature space to make the procedure more flexible. To be specific, we transform the input data x_i using some basis functions h_m(x), m = 1, ..., M. Now the input data become h(x_i) = (h_1(x_i), ..., h_M(x_i)). This basically transforms the data to another space, which could be nonlinear in the original space. We then find the SV classifier in the transformed space using the same procedure, i.e., find the optimal $\hat{f}(x) = h(x)^T \hat\beta + \hat\beta_0$. The decision is made by $\hat{G}(x) = \mathrm{sign}(\hat{f}(x))$.

Note: the classifier is linear in the transformed space, but nonlinear in the original one.

Choose basis functions?

Now the problem becomes the choice of basis functions, or whether we even need to choose basis functions at all. Recall that in the linear space, β has the form:
$$\beta = \sum_i \alpha_i y_i x_i$$
In the transformed space, it becomes:
$$\beta = \sum_i \alpha_i y_i h(x_i)$$
So the decision boundary is:
$$f(x) = h(x)^T \sum_i \alpha_i y_i h(x_i) + \beta_0 = \sum_i \alpha_i y_i \langle h(x), h(x_i)\rangle + \beta_0$$

Moreover, the dual objective function in the transformed space becomes:
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} \langle h(x_i), h(x_{i'})\rangle$$
What does this tell us? Both the objective function and the decision boundary in the transformed space involve only the inner products of the transformed data, not the transformation itself! So the basis functions themselves are not important, as long as we know ⟨h(x), h(x_i)⟩.

Kernel tricks

Define the kernel function K : R^p × R^p → R to represent the inner product in the transformed space:
$$K(x, x') = \langle h(x), h(x')\rangle$$
K needs to be symmetric and positive semi-definite. With the kernel trick, the decision boundary becomes:
$$f(x) = \sum_i \alpha_i y_i K(x, x_i) + \beta_0$$
Some popular choices of kernel functions are (coded up in the sketch after this list):
- Polynomial of degree d: K(x, x') = (a_0 + a_1 ⟨x, x'⟩)^d.
- Radial basis function (RBF): K(x, x') = exp{-||x - x'||^2 / c}.
- Sigmoid: K(x, x') = tanh(a_0 + a_1 ⟨x, x'⟩).
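
As a concrete illustration (a minimal sketch; the default parameter values c, a0, a1, d below are arbitrary placeholders, not from the slides), these kernels are just ordinary R functions of two vectors:

## Minimal sketches of the three kernels above, as functions of two vectors.
rbf.kernel  <- function(x, xp, c = 1) exp(-sum((x - xp)^2) / c)
poly.kernel <- function(x, xp, a0 = 1, a1 = 1, d = 2) (a0 + a1 * sum(x * xp))^d
sig.kernel  <- function(x, xp, a0 = 0, a1 = 1) tanh(a0 + a1 * sum(x * xp))

## Kernel (Gram) matrix K with K[i, j] = K(x_i, x_j), for a data matrix x (rows = observations).
kernel.matrix <- function(x, kern, ...) {
  n <- nrow(x)
  K <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] <- kern(x[i, ], x[j, ], ...)
  K
}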

Computation of SVM

With the kernel defined, the Lagrangian dual function is:
$$L_D = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_{i'} \alpha_i \alpha_{i'} y_i y_{i'} K(x_i, x_{i'})$$
Maximize L_D, with the α_i's being the unknowns, subject to the same constraints:
$$\sum_i \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le \gamma$$
This is a standard QP problem that can be solved easily (a solver sketch follows).
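
For illustration, here is a minimal sketch of this dual QP using quadprog::solve.QP (an assumed solver choice; real SVM implementations typically use specialized algorithms such as SMO instead). A tiny ridge keeps the quadratic term strictly positive definite for solve.QP, and β̂_0 is recovered from the margin points as described earlier.

## Minimal sketch: solve the kernel SVM dual with quadprog, then recover beta0.
## Assumes y in {-1, +1}, a kernel matrix K (e.g. from kernel.matrix above),
## and a cost parameter gamma.
library(quadprog)
svm.dual <- function(K, y, gamma = 1, eps = 1e-8) {
  n <- length(y)
  Dmat <- (y %*% t(y)) * K + eps * diag(n)     # small ridge so solve.QP accepts it
  dvec <- rep(1, n)
  ## Constraints A^T alpha >= b0: first the equality sum(alpha * y) = 0 (meq = 1),
  ## then alpha_i >= 0 and -alpha_i >= -gamma (i.e. alpha_i <= gamma).
  Amat <- cbind(y, diag(n), -diag(n))
  bvec <- c(0, rep(0, n), rep(-gamma, n))
  alpha <- solve.QP(Dmat, dvec, Amat, bvec, meq = 1)$solution
  ## Points with alpha strictly inside (0, gamma) sit on the margin; average over them for beta0.
  on.margin <- which(alpha > 1e-5 & alpha < gamma - 1e-5)
  beta0 <- mean(y[on.margin] - K[on.margin, , drop = FALSE] %*% (alpha * y))
  list(alpha = alpha, beta0 = beta0)
}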

The role of γ

γ controls the smoothness of the boundary. Remember that γ was introduced in the primal problem to control the total misclassification: it is the dual variable for the original constraint Σ_i ξ_i ≤ C. We can always project the original data into a higher-dimensional space so that they can be better separated by a linear classifier (in the transformed space), but:
- Large γ: fewer errors in the transformed space, a wigglier boundary in the original space.
- Small γ: more errors in the transformed space, a smoother boundary in the original space.
γ is a tuning parameter, often obtained from cross-validation.

A little more about the decision rule

Recall that the decision boundary only depends on the support vectors, i.e., the points with α_i ≠ 0. So f(x) can be written as:
$$f(x) = \sum_{i \in S} \alpha_i y_i K(x, x_i) + \beta_0,$$
where S is the set of support vectors. The kernel K(x, x') can be seen as a similarity measure between x and x'. So to classify a point x, the decision is made essentially by a weighted sum of the similarities of x to all the support vectors, as the short sketch below shows.
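
In code the decision rule is just that weighted sum (a sketch continuing the dual-QP example above; kern is one of the kernel functions sketched earlier, with its parameters baked into its default arguments):

## Classify a new point x.new by a weighted sum of its similarities to the support vectors.
svm.predict <- function(x.new, x, y, alpha, beta0, kern) {
  sv <- which(alpha > 1e-5)          # support vectors: points with nonzero alpha
  f <- sum(sapply(sv, function(i) alpha[i] * y[i] * kern(x.new, x[i, ]))) + beta0
  sign(f)
}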

An example

SVM using a degree-4 polynomial kernel. The decision boundary is projected into the 2-D space.

[ESL Figure 12.3, left panel: SVM with a degree-4 polynomial kernel in feature space; training error 0.180, test error 0.245, Bayes error 0.210. The right panel shows an SVM with a radial kernel in feature space.]

SVM in R

Several R packages include SVM functions: e1071, kernlab, klaR, svmpath, etc. The table below summarizes the R SVM functions. For more details please refer to the "Support Vector Machines in R" paper at the class website. A usage sketch with e1071 follows.

ksvm() (kernlab) — Formulations: C-SVC, ν-SVC, C-BSVC, spoc-SVC, one-SVC, ε-SVR, ν-SVR, ε-BSVR. Kernels: Gaussian, polynomial, linear, sigmoid, Laplace, Bessel, Anova, Spline. Optimizer: SMO, TRON. Model selection: hyperparameter estimation for Gaussian kernels. Data: formula, matrix.

svm() (e1071) — Formulations: C-SVC, ν-SVC, one-SVC, ε-SVR, ν-SVR. Kernels: Gaussian, polynomial, linear, sigmoid. Optimizer: SMO. Model selection: grid-search function. Data: formula, matrix.

svmlight() (klaR) — Formulations: C-SVC, ε-SVR. Kernels: Gaussian, polynomial, linear, sigmoid. Optimizer: chunking. Model selection: NA. Data: formula, matrix.

svmpath() (svmpath) — Formulations: binary C-SVC. Kernels: Gaussian, polynomial. Optimizer: NA. Model selection: NA. Data: matrix.
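
As a usage example, here is a minimal sketch with e1071 (the simulated data, kernel choice, and tuning grid are arbitrary; note that e1071's cost argument plays the role of the γ/C parameter in these slides, while its gamma argument is the RBF kernel width, a different quantity):

## Fit an RBF-kernel SVM with e1071 and tune its parameters by cross-validation.
library(e1071)
set.seed(1)
x <- matrix(rnorm(200 * 2), 200, 2)
y <- factor(ifelse(x[, 1]^2 + x[, 2]^2 > 1.5, 1, -1))   # circular class boundary

fit <- svm(x, y, kernel = "radial", cost = 1)
table(predicted = predict(fit, x), truth = y)

## Cross-validated grid search over the cost and RBF-width parameters.
tuned <- tune.svm(x, y, cost = c(0.1, 1, 10), gamma = c(0.5, 1, 2))
summary(tuned)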

Summary of SVM

Strengths of SVM:
- Flexibility.
- Scales well to high-dimensional data.
- Can control the complexity/error trade-off explicitly.
- As long as a kernel can be defined, non-traditional (non-vector) data, such as strings and trees, can be used as input.
Weakness:
- How to choose a good kernel (a low-degree polynomial or a radial basis function can be a good start).