Support Vector Machine and Its Application to Regression and Classification


BearWorks Institutional Repository
MSU Graduate Theses
Spring 2017

Support Vector Machine and Its Application to Regression and Classification
Xiaotong Hu

As with any intellectual project, the content and views expressed in this thesis may be considered objectionable by some readers. However, this student-scholar's work has been judged to have academic value by the student's thesis committee members trained in the discipline. The content and views expressed in this thesis are those of the student-scholar and are not endorsed by Missouri State University, its Graduate College, or its employees.

Follow this and additional works at: Part of the Mathematics Commons

Recommended Citation: Hu, Xiaotong, "Support Vector Machine and Its Application to Regression and Classification" (2017). MSU Graduate Theses.

This article or document was made available through BearWorks, the institutional repository of Missouri State University. The work contained in it may be protected by copyright and require permission of the copyright holder for reuse or redistribution. For more information, please contact BearWorks@library.missouristate.edu.

SUPPORT VECTOR MACHINE AND ITS APPLICATION TO REGRESSION AND CLASSIFICATION

A Master's Thesis
Presented to The Graduate College of Missouri State University

In Partial Fulfillment
Of the Requirements for the Degree
Master of Science, Mathematics

By Xiaotong Hu
April 2017

SUPPORT VECTOR MACHINE AND ITS APPLICATION TO REGRESSION AND CLASSIFICATION

Mathematics
Missouri State University, April 2017
Master of Science
Xiaotong Hu

ABSTRACT

The support vector machine is currently a hot topic in the statistical learning area and is now widely used in data classification and regression modeling. In this thesis, we introduce the basic idea of the support vector machine and its application to classification, covering both the linear and nonlinear cases, as well as the idea of support vector regression, including a comparison of loss functions and the use of kernel functions. Two real-life examples, taken from R packages, are provided for the classification and regression parts respectively, dealing with the classification of glass types and the prediction of ozone pollution.

KEYWORDS: support vector machine, data classification, regression model, hyperplane, statistical learning

This abstract is approved as to form and content

Songfeng Zheng, PhD
Chairperson, Advisory Committee
Missouri State University

SUPPORT VECTOR MACHINE AND ITS APPLICATION TO REGRESSION AND CLASSIFICATION

By Xiaotong Hu

A Master's Thesis Submitted to the Graduate College Of Missouri State University In Partial Fulfillment of the Requirements For the Degree of Master of Science, Mathematics

May, 2017

Approved:
Songfeng Zheng, PhD
Yingcai Su, PhD
George Mathew, PhD
Julie Masterson, PhD: Dean, Graduate College

TABLE OF CONTENTS

Chapter 1. Introduction
    Basic Idea of Statistical Learning Theory
    Introduction to Support Vector Machine...1
Chapter 2. Support Vector Classification
    Hyperplane
    Maximal Margin Classifier
        Classification using a Separating Hyperplane
        The Maximal Margin Classifier
    Support Vector Classifier
        Overview of Support Vector Classifier
        Computing the Support Vector Classifier
    Classification with Nonlinear Decision Boundaries
    Application of Support Vector Classification to Real World Data...13
Chapter 3. Support Vector Regression
    Linear Regression
        ε-insensitive Loss Function
        Quadratic Loss Function
        Huber Loss Function
    Nonlinear Regression
    Application of Support Vector Regression to Real World Data...32
References...39
Appendices...40
    Appendix A. Glass Data...40
    Appendix B. Ozone Data...46

LIST OF TABLES

Table 1. Different Types of Glass and their Chemical Analysis Data (Partial)...12
Table 2. Summary of Data Glass...13
Table 3. Quantities for Each Type of Glass...13
Table 4. Summary of Base Model for Data Glass...15
Table 5. Statistics of Base Model for Data Glass...16
Table 6. Summary of Adjusted Model 1 for Data Glass...17
Table 7. Statistics of Adjusted Model 1 for Data Glass...18
Table 8. Summary of Adjusted Model 2 for Data Glass...18
Table 9. Statistics of Adjusted Model 2 for Data Glass...19
Table 10. Summary of Fittest Model for Data Glass...20
Table 11. Statistics of Fittest Model for Data Glass...21
Table 12. Los Angeles daily ozone pollution in 1976 (Partial)...33
Table 13. Summary of Base Model for Data Ozone...35
Table 14. Summary of Adjusted Model 1 for Data Ozone...36
Table 15. Summary of Adjusted Model 2 for Data Ozone...37
Table 16. Summary of Fittest Model for Data Ozone...37

LIST OF FIGURES

Figure 2.1 Separating Hyperplane $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$ in 2-dimensional space...4
Figure 2.2 Maximal Margin Classifier and the margin in 2-dimensional space...5
Figure 2.3 Hyperplane change caused by a single observation...7
Figure 2.4 Soft Margin Classifier...7
Figure 2.5 Support Vector Machine with polynomial kernel of degree 3
Figure 3.1 A plot of a typical ε-insensitive loss function...24
Figure 3.2 A plot of a typical quadratic loss function...27
Figure 3.3 Huber loss and quadratic loss as a function of $y - f(x)$...29
Figure 3.4 R-square plot for 5-fold cross validation...38

CHAPTER 1. INTRODUCTION

1.1 Basic Idea of Statistical Learning Theory

Statistical learning theory is a framework for the machine learning area related to the fields of statistics and functional analysis. The goal of statistical learning is to find a best-fitting function f, representing the systematic information of the response from its predictors, which will be used for prediction and inference. To estimate this unknown function, the approach used in statistical learning relies on so-called training data, whose name comes from its usage: it trains our method how to estimate f. Every point in the training data is an input-output pair, where the input maps to an output. The statistical learning problem consists of inferring the function that maps between the input and the output, such that the learned function can be used to predict the output from future input. Statistical learning theory is widely used in regression modeling and data classification. In a regression problem, the responses are quantitative values taking a continuous range of values; in a classification problem, the responses are qualitative values, which are elements of a discrete set of labels. In the statistical learning area, the approaches to finding the function f differ in many aspects, for example, in the choice of loss function; however, they also share some similarities. In the following chapters, we will go further into both of these two areas.

1.2 Introduction to Support Vector Machine

The support vector machine is currently a hot topic in the statistical learning area. In the late 1990s, traditional neural network approaches suffered severe difficulties with generalization and with producing models, which is the main reason for the foundation of

support vector machines. The support vector machine was developed in 1995 by Vladimir Vapnik, and soon gained popularity due to its many attractive features. In statistical learning, support vector machines are supervised learning methods with associated learning algorithms that analyze datasets. The method was first introduced for solving classification problems; however, due to its many attractive features, it has recently been extended to the area of regression analysis. This thesis focuses mainly on the application of the support vector machine to classification and regression and is structured as follows. Chapter 2 introduces the basic idea of classification, including the concept of a hyperplane and different types of classifiers with their properties; it concentrates on support vector classification in both the linear and nonlinear settings, and on its application to a real-life example carried out in the computational language R. Chapter 3 focuses on support vector regression, separated into linear and nonlinear regression. Both sections introduce different types and choices of loss function and how they influence the performance of regression modeling. An example at the end of the chapter shows how support vector regression deals with a real-life problem.

CHAPTER 2. SUPPORT VECTOR CLASSIFICATION

2.1 Hyperplane

In a p-dimensional space, a hyperplane is a flat affine subspace (a subspace that need not pass through the origin) of dimension p - 1. It is defined by the equation

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p = 0 \qquad (2.1)$$

for parameters $\beta_1, \ldots, \beta_p$. In this case, the point $X = (x_1, x_2, \ldots, x_p)$ lies on the hyperplane. Now, suppose X does not satisfy 2.1, but instead

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p > 0; \qquad (2.2)$$

then we say X lies on one side of the hyperplane. On the other hand, if

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p < 0, \qquad (2.3)$$

then X lies on the other side of the hyperplane. Here we can consider that the hyperplane divides the p-dimensional space into two halves, which will be the basic idea of classification.

2.2 Maximal Margin Classifier

2.2.1 Classification using a Separating Hyperplane. Suppose we have an $n \times p$ data matrix that contains n observations in p-dimensional space,

$$x_1 = \begin{pmatrix} x_{11} \\ \vdots \\ x_{1p} \end{pmatrix}, \ldots, x_n = \begin{pmatrix} x_{n1} \\ \vdots \\ x_{np} \end{pmatrix}.$$

What we want is to separate these data into two classes. The approach is to attach labels $y_1, \ldots, y_n \in \{-1, 1\}$, where -1 represents one class and 1 represents the other. Our goal is to develop a classifier that can classify an observation using its feature measurements, and

it is based on the concept of a separating hyperplane (James, Witten, Hastie & Tibshirani, 2009). The separating hyperplane with parameters $\beta_1, \ldots, \beta_p$ has the property that

$$\begin{cases} \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0, & \text{if } y_i = 1 \\ \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0, & \text{if } y_i = -1 \end{cases} \qquad (2.4)$$

which equivalently means

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) > 0. \qquad (2.5)$$

Thus, if such a hyperplane exists, each observation in the dataset can be assigned a class based on which side of the hyperplane it is located. In Figure 2.1, we can clearly classify the observation $x_i$ based on the hyperplane $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$. That is, if $f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$ is positive, the observation is assigned to class 1; if negative, on the other hand, it is assigned to class -1.
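As a small illustration of this sign rule, the following R sketch classifies two points by the sign of the hyperplane function; the coefficients and points are hypothetical, chosen only for the example.

```r
# Classify points by which side of the hyperplane
# beta0 + beta1*x1 + beta2*x2 = 0 they fall on (coefficients are made up).
beta0 <- -1
beta  <- c(2, 3)
X <- rbind(c(1, 1), c(-2, 0.5))   # two sample points in R^2
f <- beta0 + X %*% beta           # signed value of the hyperplane function
ifelse(f > 0, 1, -1)              # the sign determines the class label
```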

Figure 2.1 Separating Hyperplane $\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$ in 2-dimensional space

2.2.2 The Maximal Margin Classifier. In general, if our data can be perfectly separated using a hyperplane, then there will exist infinitely many such hyperplanes. In order to decide which separating hyperplane to use, we use the maximal margin classifier, which is also known as the optimal separating hyperplane. The maximal margin classifier is the hyperplane that is farthest from the training observations. That is, when we compute the distance of each observation to a separating hyperplane, the minimal such distance is what we call the margin. The maximal margin classifier is the hyperplane for which the margin is the largest (James, Witten, Hastie & Tibshirani, 2009).

Figure 2.2 Maximal Margin Classifier and the margin in 2-dimensional space

In general, the maximal margin classifier is the solution to the optimization problem

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} M \qquad (2.6)$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1, \qquad (2.7)$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \geq M \quad \text{for all } i = 1, \ldots, n. \qquad (2.8)$$

Equation 2.5 guarantees that M is positive, and under the constraints 2.6 to 2.8, the perpendicular distance from the ith observation to the hyperplane is actually given by $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$. Thus, M is the margin of the hyperplane, and the optimization problem is exactly the definition of the maximal margin hyperplane.

2.3 Support Vector Classifier

2.3.1 Overview of Support Vector Classifier. Sometimes, in practical situations, the observations belonging to two classes are not separable by a hyperplane. In fact, the maximal margin hyperplane may not be desirable even if it does exist. A classifier based on the maximal margin hyperplane could change dramatically with the addition or removal of a single observation, which may result in overfitting the training data. An example is shown in Figure 2.3, where a single observation dramatically changes the hyperplane from the left-hand panel to the right-hand panel.

Figure 2.3 Hyperplane change caused by a single observation

In this case, another approach to classifying the observations, known as the soft margin classifier or support vector classifier, is needed. Unlike the maximal margin classifier, the soft margin classifier allows some observations to be on the wrong side of the margin, or even on the wrong side of the hyperplane.

Figure 2.4 Soft Margin Classifier

Figure 2.4 shows an example of the soft margin classifier. Here, observation x1 is on the wrong side of the margin but the right side of the hyperplane, and observation x2 is on the wrong side of both the margin and the hyperplane. Observations x3, x4, and x5 lie on the edge of the margin; these three observations, together with observations x1 and x2 which lie inside the soft margin, are what we call the support vectors, and we will detail this in 2.3.2. Similar to the maximal margin classifier, the soft margin classifier is the solution to the optimization problem

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} M \qquad (2.9)$$

subject to

$$\sum_{j=1}^{p} \beta_j^2 = 1, \qquad (2.10)$$

$$y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \geq M(1 - \varepsilon_i) \quad \text{for all } i = 1, \ldots, n, \qquad (2.11)$$

$$\varepsilon_i \geq 0, \quad \sum_{i=1}^{n} \varepsilon_i \leq C, \qquad (2.12)$$

where C is a nonnegative tuning parameter. As shown in Figure 2.4, M is the width of the margin, which we seek to make as large as possible. The $\varepsilon_i$ are what we call slack variables, which allow individual observations to be on the wrong side of the margin or the hyperplane. From the constraints 2.9 to 2.11, we can easily tell that if $0 < \varepsilon_i \leq 1$, the observation is on the wrong side of the margin but the right side of the hyperplane ($x_1$ in Figure 2.4 with slack variable $\varepsilon_1$); if $\varepsilon_i > 1$, the observation is on the wrong side of both the margin and the hyperplane ($x_2$ in Figure 2.4 with slack variable $\varepsilon_2$).

2.3.2 Computing the Support Vector Classifier. We can rewrite the problem 2.9 to 2.12 as

$$\max_{\beta, \beta_0, \|\beta\| = 1} M \qquad (2.13)$$

$$y_i(x_i^T \beta + \beta_0) \geq M(1 - \varepsilon_i) \quad \text{for all } i = 1, \ldots, n, \qquad (2.14)$$

$$\varepsilon_i \geq 0, \quad \sum_{i=1}^{n} \varepsilon_i \leq C, \qquad (2.15)$$

where $x_i \in \mathbb{R}^p$ and $\beta$ is a unit vector, $\|\beta\| = 1$. If we then drop the norm constraint on $\beta$ and define $M = 1/\|\beta\|$, the problem can be represented as

$$\min \|\beta\| \qquad (2.16)$$

subject to

$$y_i(x_i^T \beta + \beta_0) \geq 1 - \varepsilon_i \quad \text{for all } i = 1, \ldots, n, \qquad (2.17)$$

$$\varepsilon_i \geq 0, \quad \sum_{i=1}^{n} \varepsilon_i \leq C. \qquad (2.18)$$

The problem is quadratic with linear inequality constraints; hence, we can describe the programming solution using Lagrange multipliers (Smola & Schölkopf, 2003). The problem is equivalent to

$$\min \; \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \varepsilon_i \qquad (2.19)$$

subject to

$$y_i(x_i^T \beta + \beta_0) \geq 1 - \varepsilon_i \quad \text{for all } i = 1, \ldots, n, \qquad (2.20)$$

$$\varepsilon_i \geq 0. \qquad (2.21)$$

The Lagrange function is

$$L = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \varepsilon_i - \sum_{i=1}^{n} \alpha_i \big[ y_i(x_i^T \beta + \beta_0) - (1 - \varepsilon_i) \big] - \sum_{i=1}^{n} \mu_i \varepsilon_i. \qquad (2.22)$$

Setting the partial derivatives to zero, we get

$$\beta = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad (2.23)$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0, \qquad (2.24)$$

$$\alpha_i = C - \mu_i, \quad i = 1, \ldots, n, \qquad (2.25)$$

where the constraints are $\alpha_i, \mu_i, \varepsilon_i \geq 0$. By substituting 2.23 through 2.25 into 2.22, we obtain the dual problem

$$\max L = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j \qquad (2.26)$$

subject to

$$0 \leq \alpha_i \leq C, \qquad (2.27)$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (2.28)$$

In addition, the Karush-Kuhn-Tucker conditions that are satisfied by the solution are

$$\alpha_i \big[ y_i(x_i^T \beta + \beta_0) - (1 - \varepsilon_i) \big] = 0, \qquad (2.29)$$

$$\mu_i \varepsilon_i = 0, \qquad (2.30)$$

$$y_i(x_i^T \beta + \beta_0) - (1 - \varepsilon_i) \geq 0, \qquad (2.31)$$

for $i = 1, \ldots, n$. All the equations from 2.23 to 2.31 uniquely characterize the solution to the primal and dual problems (Smola & Schölkopf, 2003). Meanwhile, from 2.23, we get the solution for $\beta$ in the form $\beta = \sum_{i=1}^{n} \alpha_i y_i x_i$, and the observations with nonzero $\alpha_i$, for which the constraint in 2.31 is active, are what we call support vectors, referring to those observations that lie in the soft margin. Among those support

vectors, some satisfy the constraint in 2.29 while lying on the edge of the margin, and we normally use them to solve for $\beta_0$.

2.4 Classification with Nonlinear Decision Boundaries

In general, the soft margin classifier is an approach for linear boundary classification. However, in practice we sometimes need to deal with a nonlinear boundary, when a linear classifier cannot perform well. In this case, we need to enlarge the feature space using functions of the predictors, such as quadratic or cubic terms. For example, with p predictors $x_1, \ldots, x_p$, we can fit a support vector classifier using the predictors along with their quadratic forms, i.e. $x_1, x_1^2, \ldots, x_p, x_p^2$. Thus, the optimization problem shown in 2.9 to 2.12 becomes

$$\max_{\beta_0, \beta_{11}, \ldots, \beta_{p1}, \beta_{12}, \ldots, \beta_{p2}} M \qquad (2.32)$$

subject to

$$\sum_{j=1}^{p} \sum_{k=1}^{2} \beta_{jk}^2 = 1, \qquad (2.33)$$

$$y_i \Big( \beta_0 + \sum_{j=1}^{p} \beta_{j1} x_{ij} + \sum_{j=1}^{p} \beta_{j2} x_{ij}^2 \Big) \geq M(1 - \varepsilon_i) \quad \text{for all } i = 1, \ldots, n, \qquad (2.34)$$

$$\varepsilon_i \geq 0, \quad \sum_{i=1}^{n} \varepsilon_i \leq C. \qquad (2.35)$$

The solution will lead to a nonlinear decision boundary, and the resulting method is known as the support vector machine. The key idea of the support vector machine is what we call the kernel (James, Witten, Hastie & Tibshirani, 2009). A kernel is a function that quantifies the similarity

of two observations, represented as $K(x_i, x_{i'})$. For example, we can represent the linear support vector classifier by taking the inner product as the kernel, which means

$$K(x_i, x_{i'}) = \sum_{j=1}^{p} x_{ij} x_{i'j}, \qquad (2.36)$$

and the support vector classifier can be represented as

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i K(x, x_i), \qquad (2.37)$$

which is also the general form of the support vector machine when $K(x_i, x_{i'})$ is not specified. Thus, if we would like a nonlinear classifier, another form of kernel function is needed. For instance,

$$K(x_i, x_{i'}) = \Big( 1 + \sum_{j=1}^{p} x_{ij} x_{i'j} \Big)^d, \qquad (2.38)$$

which is known as a polynomial kernel of degree d. When d > 1, the support vector classifier takes a more flexible, nonlinear form.
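As a quick illustration, the polynomial kernel of Equation 2.38 can be computed directly; the function below is a minimal sketch, and the name poly_kernel is ours, not from the thesis code.

```r
# Polynomial kernel K(x, x') = (1 + <x, x'>)^d from Equation 2.38.
poly_kernel <- function(x, xp, d = 3) (1 + sum(x * xp))^d

poly_kernel(c(1, 2), c(0.5, -1), d = 3)   # similarity of two feature vectors
```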

Figure 2.5 Support Vector Machine with polynomial kernel of degree 3

Another popular choice is the radial kernel, also known as the Gaussian kernel, which will be discussed in Chapter 3, as it is more widely used in the support vector regression area.

2.5 Application of Support Vector Classification to Real World Data

The data Glass used in this chapter, taken from the UCI Repository of Machine Learning Databases, contains examples of the chemical analysis of 7 different types of glass. It is a perfect data set for classification, and our task is to forecast the type of glass on the basis of its chemical analysis.

Table 1. Different Types of Glass and their Chemical Analysis Data (Partial)

Observation | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type

This dataset contains 214 observations on 10 variables, and the following table provides a basic summary of the dataset.

Table 2. Summary of Data Glass (Min, 1st Qu., Median, Mean, 3rd Qu., Max for each of RI, Na, Mg, Al, Si, K, Ca, Ba, Fe)

Table 3. Quantities for Each Type of Glass (rows: Type; Quantity)

We are going to use the R package e1071, which is developed for support vector machines, to help us do the classification and prediction. First of all, we start by splitting the data into training and testing sets, randomly selecting 71 observations (one third of the dataset) as test data and keeping the remainder as training data. In this case, we are going to build our support vector classifier nonlinearly, and the kernel we are going to use is the radial kernel. Starting with setting the cost to 100 and the gamma of the radial kernel to 1, the support vector classifier is represented as

$$f(x) = \beta_0 + \sum_{i=1}^{n} \alpha_i \exp\big( -\gamma \|x - x_i\|^2 \big). \qquad (2.39)$$

By calling the function svm from package e1071 in R, we get the summary of our model shown below (a sketch of these steps follows the table):

Table 4. Summary of Base Model for Data Glass
Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 100
  gamma: 1
Number of support vectors, in total and per class
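The following R sketch reproduces the split-and-fit step just described, under the assumption that the Glass data frame comes from package mlbench with class column Type; the object names are ours.

```r
library(e1071)
data(Glass, package = "mlbench")    # assumed source of the Glass data

set.seed(1)                         # make the random split repeatable
test_idx <- sample(nrow(Glass), 71) # one third held out as test data
test  <- Glass[test_idx, ]
train <- Glass[-test_idx, ]

# Radial-kernel classifier with the thesis's starting parameters.
svm_base <- svm(Type ~ ., data = train,
                kernel = "radial", cost = 100, gamma = 1)
summary(svm_base)                   # support vector counts, as in Table 4
```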

The number of support vectors is one measure of how well a certain dataset fits a classification model: the lower the number, the better the fit. The next step is to predict the test values, and the way for us to visualize the accuracy of our prediction is to build the confusion matrix, which is also known as the error matrix. By calling the function confusionMatrix from package caret, we obtain the confusion matrix and the summary of this model, shown below:

Confusion Matrix (rows: predicted class; columns: true class)

Table 5. Statistics of Base Model for Data Glass
Overall Statistics:
  Accuracy:
  95% CI: (0.4782, )
  No Information Rate:
  P-Value [Acc > NIR]:
Statistics by Class (Classes 1, 2, 3, 5, 6, 7): Sensitivity, Specificity, Balanced Accuracy
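A hedged sketch of this prediction step, reusing the fitted model svm_base and the held-out set test from the split above:

```r
library(caret)

pred <- predict(svm_base, newdata = test)   # predicted glass types
confusionMatrix(pred, factor(test$Type))    # accuracy, CI, per-class statistics
```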

The accuracy quantifies the performance of a specific support vector machine classification model; balanced accuracy is calculated as the average of the proportion correct in each class individually. In our case, the overall accuracy gives the probability that, if we predict on a dataset, a certain data point is classified to its actual class. The training accuracy is determined by several factors: the cost, the type of the kernel, and the parameters of that kernel. In our case, for instance, if we change our cost from 100 to 50, the outcome becomes:

Table 6. Summary of Adjusted Model 1 for Data Glass
Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 50
  gamma: 1
Number of support vectors, in total and per class

The summary shows that the number of support vectors is lower than in the situation where we set the cost equal to 100. Normally, the number of support vectors decreases when the cost becomes lower, since the soft margin is narrowed. Also, we can get another confusion matrix from this model:

Confusion Matrix (rows: predicted class; columns: true class)

Table 7. Statistics of Adjusted Model 1 for Data Glass
Overall Statistics:
  Accuracy:
  95% CI: (0.3844, )
  No Information Rate:
  P-Value [Acc > NIR]:
Statistics by Class (Classes 1, 2, 3, 5, 6, 7): Sensitivity, Specificity, Balanced Accuracy

Another factor that can affect the training accuracy is the parameter of the kernel. In our case, gamma is the only such parameter that can affect the result. For example, if we change gamma from 1 to 0.5, the outcome becomes:

Table 8. Summary of Adjusted Model 2 for Data Glass
Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 100
  gamma: 0.5
Number of support vectors, in total and per class

As in the situation where we changed the cost from 100 to 50, the number of support vectors also decreases when we lower the parameter gamma, since it is also a factor that affects the range of the soft margin. The summary of the confusion matrix is shown below:

Confusion Matrix (rows: predicted class; columns: true class)

Table 9. Statistics of Adjusted Model 2 for Data Glass
Overall Statistics:
  Accuracy:
  95% CI: (0.4303, )
  No Information Rate:
  P-Value [Acc > NIR]:
Statistics by Class (Classes 1, 2, 3, 5, 6, 7): Sensitivity, Specificity, Balanced Accuracy

We can see that in these two cases the training accuracy is lower than in the situation where we set the cost equal to 100 and gamma equal to 1. This is common, even though in both cases the number of support vectors does decrease. This situation is called overfitting, which occurs when the model is complex, i.e., has too many parameters relative to the number of observations. Our task is to fit the model to the dataset as accurately as possible, which means we need to find a suitable combination of the cost and the gamma of the kernel. The approach commonly used is cross-validation, a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate

how accurately a predictive model will perform in practice. Here we are going to use 10-fold cross-validation, which randomly separates the original dataset into 10 subsets, and each time selects one as test data and the other nine as training data. In R, we can write a for loop to build this cross-validation. We set a sequence that lets the cost run from 100 down to 1 in 100 steps, and gamma from 0.01 to 1 in 100 steps. For each combination, the for loop lets R test our dataset 10 times, each time with a randomly selected subset of the original dataset as the test data. This means R needs to loop over the testing subgroups to find the best combination of the parameter gamma and the cost. In this test, the base case is the situation where we set gamma equal to 1 and the cost equal to 100. The result after this 10-fold cross-validation is shown below (one way to code the search follows the table):

Table 10. Summary of Fittest Model for Data Glass
Parameters:
  SVM-Type: C-classification
  SVM-Kernel: radial
  cost: 10
  gamma: 0.43
Number of support vectors, in total and per class

The results show that the highest accuracy occurs when the cost equals 10 and gamma is 0.43. We can see that the number of support vectors does not decrease, for the reason we mentioned before: a narrower soft margin can cause the overfitting situation.
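One way to run this grid search is e1071's built-in tune() helper rather than a hand-written loop; this is a sketch under that assumption, with a coarser grid than the 100 x 100 one described above so it runs quickly.

```r
# 10-fold cross-validated grid search over cost and gamma.
tuned <- tune(svm, Type ~ ., data = train,
              ranges = list(cost  = seq(1, 100, length.out = 20),
                            gamma = seq(0.01, 1, length.out = 20)),
              tunecontrol = tune.control(sampling = "cross", cross = 10))
tuned$best.parameters   # the thesis's run found cost = 10, gamma = 0.43
```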

Also, the summary of the confusion matrix is:

Confusion Matrix (rows: predicted class; columns: true class)

Table 11. Statistics of Fittest Model for Data Glass
Overall Statistics:
  Accuracy:
  95% CI: (0.5283, )
  No Information Rate:
  P-Value [Acc > NIR]: 7.251e-05
Statistics by Class (Classes 1, 2, 3, 5, 6, 7): Sensitivity, Specificity, Balanced Accuracy

The highest accuracy calculated by 10-fold cross-validation is slightly higher than our base case. In fact, in this example, the combination of the original gamma and cost is not the worst situation compared to the other cases. In general, in machine learning applications, we would expect our confusion matrix for prediction to have at least 70%-80% accuracy. From this point of view, we cannot conclude that the result of our experiment is inaccurate; however, it is not ideal, technically speaking. There are two possible reasons for this situation. The first is the complexity of our observations. The hyperplane becomes harder to set as the number of parameters increases. In this example, we have 9 predictors, and we can see that in some situations, even though the overall accuracy increases, the balanced accuracy of prediction for a specific class can decrease. The second reason is the choice of the test dataset. In our experiment, we separated the original dataset into several subsets, chose one as a test dataset, and looped over this choice. However, we do not have an unknown dataset for prediction, which means this part of the dataset is only treated as test data, not training data. This design of our experiment has a higher possibility of causing overfitting, which can be an important factor lowering the prediction accuracy.

CHAPTER 3. SUPPORT VECTOR REGRESSION

This chapter discusses the support vector machine applied to both linear and nonlinear regression problems. Our goal is to find a function f(x) that has at most ε deviation from the actually obtained targets $y_i$ for all the training data, and at the same time is

as flat as possible. In other words, the data points should lie between the two borders of the margin, which is maximized under suitable conditions to avoid the inclusion of outliers.

3.1 Linear Regression

Suppose we are given a set of data

$$D = \{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}, \quad x \in \mathbb{R}^d, \; y \in \mathbb{R}, \qquad (3.1)$$

with a linear function of the form

$$f(x) = \langle \omega, x \rangle + b. \qquad (3.2)$$

The optimal regression function is given by the minimum of the functional

$$\varphi(\omega, \xi) = \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*), \qquad (3.3)$$

where C > 0 is a pre-specified constant value determining the trade-off between the flatness of f and the amount up to which deviations larger than ε are tolerated, and $\xi, \xi^*$ are slack variables representing the lower and upper constraints on the outputs of the system.

3.1.1 ε-insensitive Loss Function. The ε-insensitive loss function is the following:

$$L_\varepsilon(y, f(x)) = |y - f(x)|_\varepsilon. \qquad (3.4)$$

Figure 3.1 A plot of a typical ε-insensitive loss function

This loss function is ideal when small amounts of error are acceptable, which also corresponds to a soft margin. Under the ε-insensitive loss function, any point within some selected range ε is considered to have no error at all, which means the ε-insensitive loss function can be represented as

$$L_\varepsilon(y) = \begin{cases} 0 & \text{for } |f(x) - y| < \varepsilon \\ |f(x) - y| - \varepsilon & \text{otherwise.} \end{cases} \qquad (3.5)$$

This error-free margin makes the loss function an ideal candidate for support vector regression. In the linear function $f(x) = \langle \omega, x \rangle + b$, $\langle \omega, x \rangle$ denotes the dot product in $\mathbb{R}^d$. As we mentioned before, one of our goals is to ensure flatness, which means seeking a small ω. One way is to minimize the norm, i.e. $\|\omega\|^2 = \langle \omega, \omega \rangle$:

$$\text{minimize } \frac{1}{2} \|\omega\|^2 \qquad (3.6)$$

subject to

$$\begin{cases} y_i - \langle \omega, x_i \rangle - b \leq \varepsilon \\ \langle \omega, x_i \rangle + b - y_i \leq \varepsilon. \end{cases} \qquad (3.7)$$

Applying the ε-insensitive loss function for minimizing the optimal regression function,

$$\text{minimize } \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \qquad (3.8)$$

subject to

$$\begin{cases} y_i - \langle \omega, x_i \rangle - b \leq \varepsilon + \xi_i \\ \langle \omega, x_i \rangle + b - y_i \leq \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \geq 0. \end{cases} \qquad (3.9)$$
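Before turning to the Lagrangian, here is a small R sketch of the ε-insensitive loss of Equation 3.5 (function and argument names are ours):

```r
# Epsilon-insensitive loss: zero inside the eps-tube, linear outside it.
eps_loss <- function(r, eps = 0.5) pmax(abs(r) - eps, 0)

curve(eps_loss(x), from = -2, to = 2)   # reproduces the shape of Figure 3.1
```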

Here, the key idea is to construct a Lagrange function (Smola & Schölkopf, 2003). We proceed as follows:

$$L = \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) - \sum_{i=1}^{\ell} \alpha_i (\varepsilon + \xi_i - y_i + \langle \omega, x_i \rangle + b) - \sum_{i=1}^{\ell} \alpha_i^* (\varepsilon + \xi_i^* + y_i - \langle \omega, x_i \rangle - b) - \sum_{i=1}^{\ell} (\eta_i \xi_i + \eta_i^* \xi_i^*). \qquad (3.10)$$

Here L is the Lagrangian and $\eta_i, \eta_i^*, \alpha_i, \alpha_i^*$ are Lagrange multipliers. Taking partial derivatives of 3.10 with respect to the primal variables $(\omega, b, \xi_i, \xi_i^*)$ and setting them to zero,

$$\partial_b L = \sum_{i=1}^{\ell} (\alpha_i^* - \alpha_i) = 0, \qquad (3.11)$$

$$\partial_\omega L = \omega - \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) x_i = 0, \qquad (3.12)$$

$$\partial_{\xi_i^{(*)}} L = C - \alpha_i^{(*)} - \eta_i^{(*)} = 0. \qquad (3.13)$$

Substituting these into L, the solution is given by

$$\max \; -\frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle + \sum_{i=1}^{\ell} \alpha_i (y_i - \varepsilon) - \sum_{i=1}^{\ell} \alpha_i^* (y_i + \varepsilon), \qquad (3.14)$$

or alternatively,

$$\min \; \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle - \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) y_i + \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) \varepsilon, \qquad (3.15)$$

with constraints

$$0 \leq \alpha_i, \alpha_i^* \leq C, \quad i = 1, \ldots, \ell, \qquad (3.16)$$

$$\sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0. \qquad (3.17)$$

Equation 3.12 can be rewritten as

$$\omega = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) x_i. \qquad (3.18)$$

Thus

$$f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \langle x_i, x \rangle + b. \qquad (3.19)$$

This is what we call the support vector expansion, where ω can be described as a linear combination of the $x_i$. The Karush-Kuhn-Tucker conditions that are satisfied by the solution include

$$\alpha_i \alpha_i^* = 0, \quad i = 1, \ldots, \ell, \qquad (3.20)$$

which is also called complementary slackness (Smola & Schölkopf, 2003). This condition shows that there can never be a pair of dual variables $\alpha_i, \alpha_i^*$ which are both simultaneously nonzero, which allows us to conclude that

$$\varepsilon - y_i + \langle \omega, x_i \rangle + b \geq 0 \text{ and } \xi_i = 0 \text{ if } \alpha_i < C, \qquad \varepsilon - y_i + \langle \omega, x_i \rangle + b \leq 0 \text{ if } \alpha_i > 0. \qquad (3.21)$$

Therefore, the support vectors are points where exactly one of the Lagrange multipliers is greater than zero. When ε = 0, we get the L1 loss function and the optimization problem is simplified:

$$\min \; \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \beta_i \beta_j \langle x_i, x_j \rangle - \sum_{i=1}^{\ell} \beta_i y_i \qquad (3.22)$$

with constraints

$$-C \leq \beta_i \leq C, \quad i = 1, \ldots, \ell, \qquad (3.23)$$

$$\sum_{i=1}^{\ell} \beta_i = 0,$$

and the regression function is given by Equation 3.2, where

$$\omega = \sum_{i=1}^{\ell} \beta_i x_i, \qquad (3.24)$$

$$b = -\frac{1}{2} \langle \omega, x_r + x_s \rangle, \qquad (3.25)$$

where $x_r$ and $x_s$ are support vectors lying on the edge of the margin. In addition, when ε = 0, the result is just median regression.

3.1.2 Quadratic Loss Function. The quadratic loss function is the following:

$$L(f(x) - y) = (f(x) - y)^2. \qquad (3.26)$$

Figure 3.2 A plot of a typical quadratic loss function

Quadratic loss is another function that is well-suited for regression problems, although it punishes outliers in the data very heavily through the squaring of the error. As a result, datasets must be filtered for outliers first, or else the fit from this loss function may not be desirable. Just as in 3.1.1, by applying the quadratic loss function, the solution is given by

$$\max \; -\frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*) - \frac{1}{2C} \sum_{i=1}^{\ell} \big( \alpha_i^2 + (\alpha_i^*)^2 \big). \qquad (3.27)$$

The corresponding optimization can be simplified by exploiting the KKT conditions, which imply $\alpha_i \alpha_i^* = 0$; writing $\beta_i = \alpha_i - \alpha_i^*$, we have $\alpha_i^2 + (\alpha_i^*)^2 = \beta_i^2$. The resulting optimization problem is

$$\min \; \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \beta_i \beta_j \langle x_i, x_j \rangle - \sum_{i=1}^{\ell} \beta_i y_i + \frac{1}{2C} \sum_{i=1}^{\ell} \beta_i^2 \qquad (3.28)$$

with constraints

$$\sum_{i=1}^{\ell} \beta_i = 0, \qquad (3.29)$$

and the regression function is given by Equation 3.2, where

$$\omega = \sum_{i=1}^{\ell} \beta_i x_i, \qquad (3.30)$$

$$b = -\frac{1}{2} \langle \omega, x_r + x_s \rangle. \qquad (3.31)$$

3.1.3 Huber Loss Function. The Huber loss function is the following:

$$L_{huber}(f(x) - y) = \begin{cases} \frac{1}{2} (f(x) - y)^2 & \text{for } |f(x) - y| < \mu \\ \mu |f(x) - y| - \frac{\mu^2}{2} & \text{otherwise.} \end{cases} \qquad (3.32)$$

Figure 3.3 Huber loss and quadratic loss as a function of $y - f(x)$

Huber proposed this loss function as a robust loss function that has optimal properties when the distribution of the data is unknown (Yeh, Huang & Lee, 2011). Compared to the quadratic loss function, the Huber loss is less sensitive to outliers in the data. As defined above, the Huber loss function is convex in a uniform neighborhood of its minimum $f(x) - y = 0$; at the boundary of this uniform neighborhood, it has a differentiable extension to an affine function at the points $f(x) - y = -\mu$ and $f(x) - y = \mu$. These properties allow it to combine much of the sensitivity of the mean-unbiased estimator (from the quadratic part) with the robustness of the median-unbiased estimator (from the absolute part).
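A small R sketch of the Huber loss in Equation 3.32 (the function name and the default threshold are ours):

```r
# Huber loss: quadratic near zero, linear in the tails.
huber_loss <- function(r, mu = 1) {
  ifelse(abs(r) < mu, 0.5 * r^2, mu * abs(r) - mu^2 / 2)
}

curve(huber_loss(x), from = -3, to = 3)   # compare with Figure 3.3
```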

By applying the Huber loss function, the solution is given by

$$\max \; -\frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x_i, x_j \rangle + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*) - \frac{1}{2C} \sum_{i=1}^{\ell} \mu \big( \alpha_i^2 + (\alpha_i^*)^2 \big). \qquad (3.23)$$

The resulting optimization problem is

$$\min \; \frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \beta_i \beta_j \langle x_i, x_j \rangle - \sum_{i=1}^{\ell} \beta_i y_i + \frac{1}{2C} \sum_{i=1}^{\ell} \beta_i^2 \mu \qquad (3.24)$$

with constraints

$$-C \leq \beta_i \leq C, \quad i = 1, \ldots, \ell, \qquad (3.25)$$

$$\sum_{i=1}^{\ell} \beta_i = 0, \qquad (3.26)$$

and the regression function is given by Equation 3.2, where

$$\omega = \sum_{i=1}^{\ell} \beta_i x_i, \qquad (3.27)$$

$$b = -\frac{1}{2} \langle \omega, x_r + x_s \rangle. \qquad (3.28)$$

3.2 Nonlinear Regression

In the nonlinear support vector regression approach, nonlinearity is achieved by mapping the $x_i$ into a high dimensional feature space, i.e. there exists a map $\Phi: \chi \to F$, where F is a space in which the standard linear SV regression is performed. The most widely adopted method is the kernel approach. As in linear SVR, we adopt the ε-insensitive loss function, and the objective function and constraints are

$$\text{minimize } \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*) \qquad (3.29)$$

subject to

$$\begin{cases} y_i - \langle \omega, \Phi(x_i) \rangle - b \leq \varepsilon + \xi_i \\ \langle \omega, \Phi(x_i) \rangle + b - y_i \leq \varepsilon + \xi_i^* \\ \xi_i, \xi_i^* \geq 0, \end{cases} \qquad (3.30)$$

where the regression hyperplane is

$$f(x) = \langle \omega, \Phi(x) \rangle + b. \qquad (3.31)$$

To solve 3.29, one can also introduce the Lagrange function as in the linear SVR situation, take partial derivatives with respect to the primal variables, and set the resulting derivatives to zero. The solution is given by

$$\max \; -\frac{1}{2} \sum_{i=1}^{\ell} \sum_{j=1}^{\ell} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j) + \sum_{i=1}^{\ell} y_i (\alpha_i - \alpha_i^*) - \varepsilon \sum_{i=1}^{\ell} (\alpha_i + \alpha_i^*) \qquad (3.33)$$

with constraints

$$0 \leq \alpha_i, \alpha_i^* \leq C, \quad i = 1, \ldots, \ell, \qquad (3.34)$$

$$\sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) = 0. \qquad (3.35)$$

Here, $K(x_i, x_j)$ is the kernel function, which represents the inner product $\langle \Phi(x_i), \Phi(x_j) \rangle$. The most widely used kernel function is the radial basis function, also called the Gaussian kernel, defined as

$$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle = \exp\big( -\gamma \|x_i - x_j\|^2 \big), \qquad (3.36)$$

where $\|x_i - x_j\|^2$ is recognized as the squared Euclidean distance between the two feature vectors, and γ is the width parameter of the Gaussian kernel.
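Computed directly in R, the Gaussian kernel of Equation 3.36 is a one-liner (names are ours):

```r
# Radial basis (Gaussian) kernel of Equation 3.36.
rbf_kernel <- function(x, xp, gamma = 1) exp(-gamma * sum((x - xp)^2))

rbf_kernel(c(1, 2), c(2, 0), gamma = 0.5)   # similarity of two feature vectors
```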

Solving 3.33 with the constraint equations, the regression function is given by

$$f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) K(x_i, x) + b, \qquad (3.37)$$

where

$$\langle \omega, x \rangle = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) K(x_i, x), \qquad (3.39)$$

$$b = -\frac{1}{2} \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \big( K(x_i, x_r) + K(x_i, x_s) \big). \qquad (3.40)$$

As with the SVC, the equality constraint may be dropped if the kernel contains a bias term, b being accommodated within the kernel function, and the regression function is then given by

$$f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) K(x_i, x). \qquad (3.41)$$

3.3 Application of Support Vector Regression to Real World Data

The data Ozone used in this chapter, taken from the Department of Statistics, UC Berkeley, contains observations of Los Angeles daily ozone pollution in 1976, along with other factors related to ozone pollution. It is a perfect data set for support vector regression, and our task is to predict the daily maximum one-hour-average ozone reading from the dataset.

Table 12. Los Angeles daily ozone pollution in 1976 (Partial)

Observation | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13

Format:
V1. Month
V2. Day of month
V3. Day of week
V4. Daily maximum one-hour-average ozone reading
V5. 500-millibar pressure height (m) measured at Vandenberg AFB
V6. Wind speed (mph) at Los Angeles International Airport (LAX)
V7. Humidity (%) at LAX
V8. Temperature (degrees F) measured at Sandburg, CA
V9. Temperature (degrees F) measured at El Monte, CA
V10. Inversion base height (feet) at LAX
V11. Pressure gradient (mm Hg) from LAX to Daggett, CA
V12. Inversion base temperature (degrees F) at LAX
V13. Visibility (miles) measured at LAX

We are again going to use the R package e1071, as we did for support vector classification, to help us construct regression models and predict. We are going to separate our data into training and testing data; however, since our task is to build a regression model, we first need to omit the missing data in the dataset, i.e., the observations with parameters shown as not available in the table. After omitting the missing data, we separate the observations into 134 observations as training data and 69 observations as testing data, since we are going to construct a one-third test, where we set one third of the dataset as testing data and the other part as training data. We are going to construct a nonlinear regression model with the ε-insensitive loss function. Also, since it is nonlinear regression, we need a kernel to help us construct the model, and here we will go with the Gaussian kernel. Starting with setting the cost to 1000 and the gamma of the radial kernel to 1e-04, and using the function svm in package e1071, we get the summary of our model shown below (a sketch of these steps follows the table):

Table 13. Summary of Base Model for Data Ozone
Parameters:
  SVM-Type: eps-regression
  SVM-Kernel: radial
  cost: 1000
  gamma: 1e-04
Number of Support Vectors:
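The following sketch reproduces this setup, under the assumption that the Ozone data comes from package mlbench (whose columns follow the V1-V13 format above); the object names are ours.

```r
library(e1071)
data(Ozone, package = "mlbench")     # assumed source; columns V1-V13 as above

ozone <- na.omit(Ozone)              # drop observations with missing values
set.seed(1)
test_idx <- sample(nrow(ozone), 69)  # 69 observations (one third) for testing
test  <- ozone[test_idx, ]
train <- ozone[-test_idx, ]

# eps-regression with a radial (Gaussian) kernel and the thesis's base parameters.
svr_base <- svm(V4 ~ ., data = train, type = "eps-regression",
                kernel = "radial", cost = 1000, gamma = 1e-4)
summary(svr_base)
```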

As with support vector classification, the number of support vectors is one measure of how well a certain dataset fits the regression model: the lower the number, the better the fit. The next step is to predict the test values, and the way for us to quantify the performance of our prediction is to calculate a pseudo R-square,

$$R^2 = 1 - \frac{\ln L(M_{full})}{\ln L(M_{intercept})}. \qquad (3.43)$$

The equation shown above is called McFadden's pseudo R-square, which we are going to use for our prediction, where $\ln L(M_{full})$ represents the log likelihood for the model with predictors, and $\ln L(M_{intercept})$ represents the log likelihood for the model without predictors (Yeh, Huang & Lee, 2011). The log likelihood of the intercept model is treated as a total sum of squares, and the log likelihood of the full model is treated as the sum of squared errors. The ratio of the likelihoods suggests the level of improvement over the intercept model offered by the full model. A likelihood falls between 0 and 1, and a smaller ratio of log likelihoods indicates that the full model is a far better fit than the intercept model. Thus, when comparing two models on the same data, McFadden's measure would be higher for the model with the greater likelihood. By computing the likelihood for both the full and intercept models, we obtained McFadden's pseudo R-square for the base model.
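e1071 does not report a likelihood directly, so one hedged way to compute Equation 3.43 is to assume Gaussian residuals and build both log likelihoods from the residual sums of squares; this sketch does exactly that, reusing svr_base, train, and test from above.

```r
# Gaussian log likelihood with the variance set to its maximum-likelihood value.
loglik_gauss <- function(resid) {
  n  <- length(resid)
  s2 <- mean(resid^2)
  -n / 2 * (log(2 * pi * s2) + 1)
}

ll_full      <- loglik_gauss(test$V4 - predict(svr_base, test))  # with predictors
ll_intercept <- loglik_gauss(test$V4 - mean(train$V4))           # intercept only
1 - ll_full / ll_intercept                                       # McFadden's pseudo R^2
```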

Since we want the R-square to be as large as possible, the next step is to change the parameters of our model, similar to what we did in Chapter 2. The only two parameters that decide the R-square are the cost and the gamma of the radial kernel. For example, if we change the cost from 1000 to 500, the summary of our model becomes:

Table 14. Summary of Adjusted Model 1 for Data Ozone
Parameters:
  SVM-Type: eps-regression
  SVM-Kernel: radial
  cost: 500
  gamma: 1e-04
Number of Support Vectors: 116

We can see that the number of support vectors decreases, which suggests the new regression model fits our dataset slightly better. Also, by computing the R-square, we obtain a value that confirms our conclusion from the summary. We can likewise change gamma, for example to 0.001, and the summary is shown below:

Table 15. Summary of Adjusted Model 2 for Data Ozone
Parameters:
  SVM-Type: eps-regression
  SVM-Kernel: radial
  cost: 1000
  gamma: 0.001
Number of Support Vectors:

And the resulting R-square is slightly lower than for the original model. Since the number of support vectors does decrease, we can consider this an overfitting situation. Our goal is to find a model that fits our dataset as accurately as possible, so the approach we are going to take is still cross-validation, as in Chapter 2. Here, we are going to use 5-fold cross-validation, writing a for loop to achieve it in R. We set a sequence that lets the cost run from 1000 down to 100 in 90 steps, and gamma up to 0.01 in 100 steps. The base case is our original model, where the cost is 1000 and gamma is 1e-04. The result after the cross-validation is shown below (one way to code the search follows the table):

Table 16. Summary of Fittest Model for Data Ozone
Parameters:
  SVM-Type: eps-regression
  SVM-Kernel: radial
  cost: 100
  gamma:
Number of Support Vectors: 102

Here, the resulting R-square is the highest. From the result, we can see that the number of support vectors decreases significantly while the R-square increases. Since the cost in this model is the lowest in our sequence, and the number of support vectors did decrease when we changed the cost from 1000 to 500, we may assume that the model fits better as the cost decreases.
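As in Chapter 2, the grid search can be expressed with e1071's tune() helper instead of an explicit loop; the grid below is an illustrative, coarser version of the one described above.

```r
# 5-fold cross-validated grid search for the regression model.
tuned_svr <- tune(svm, V4 ~ ., data = train, type = "eps-regression",
                  ranges = list(cost  = seq(100, 1000, by = 100),
                                gamma = c(1e-4, 1e-3, 1e-2)),
                  tunecontrol = tune.control(sampling = "cross", cross = 5))
tuned_svr$best.parameters   # the thesis's search selected cost = 100
```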

In fact, we can plot the R-square for this cross-validation against the parameters cost and gamma, as seen in Figure 3.4.

Figure 3.4 R-square plot for 5-fold cross validation

We can see from this plot that the highest R-square appears when the cost and gamma are both relatively small. Since, as we mentioned before, the model can become too sensitive to the dataset when we set the cost undersized, and the overfitting situation may also appear, we can consider this regression model with cost 100 and a small gamma a relatively ideal result.

REFERENCES

James, G., Witten, D., Hastie, T. & Tibshirani, R. (2009). An Introduction to Statistical Learning (2nd ed.). New York: Springer.

Smola, A.J. & Schölkopf, B. (2003). A Tutorial on Support Vector Regression. Statistics and Computing, 14.

Yeh, C.Y., Huang, C.W. & Lee, S.J. (2011). A multiple-kernel support vector regression approach for stock market price forecasting. Expert Systems with Applications, 38.

APPENDICES

Appendix A. Glass Data

Observation | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type

Appendix B. Ozone Data


More information

Research Article On the Lower Bound for the Number of Real Roots of a Random Algebraic Equation

Research Article On the Lower Bound for the Number of Real Roots of a Random Algebraic Equation Appied Mathematics and Stochastic Anaysis Voume 007, Artice ID 74191, 8 pages doi:10.1155/007/74191 Research Artice On the Lower Bound for the Number of Rea Roots of a Random Agebraic Equation Takashi

More information

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method

High Spectral Resolution Infrared Radiance Modeling Using Optimal Spectral Sampling (OSS) Method High Spectra Resoution Infrared Radiance Modeing Using Optima Spectra Samping (OSS) Method J.-L. Moncet and G. Uymin Background Optima Spectra Samping (OSS) method is a fast and accurate monochromatic

More information

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn

Automobile Prices in Market Equilibrium. Berry, Pakes and Levinsohn Automobie Prices in Market Equiibrium Berry, Pakes and Levinsohn Empirica Anaysis of demand and suppy in a differentiated products market: equiibrium in the U.S. automobie market. Oigopoistic Differentiated

More information

Unconditional security of differential phase shift quantum key distribution

Unconditional security of differential phase shift quantum key distribution Unconditiona security of differentia phase shift quantum key distribution Kai Wen, Yoshihisa Yamamoto Ginzton Lab and Dept of Eectrica Engineering Stanford University Basic idea of DPS-QKD Protoco. Aice

More information

Discrete Techniques. Chapter Introduction

Discrete Techniques. Chapter Introduction Chapter 3 Discrete Techniques 3. Introduction In the previous two chapters we introduced Fourier transforms of continuous functions of the periodic and non-periodic (finite energy) type, as we as various

More information

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio

A unified framework for Regularization Networks and Support Vector Machines. Theodoros Evgeniou, Massimiliano Pontil, Tomaso Poggio MASSACHUSETTS INSTITUTE OF TECHNOLOGY ARTIFICIAL INTELLIGENCE LABORATORY and CENTER FOR BIOLOGICAL AND COMPUTATIONAL LEARNING DEPARTMENT OF BRAIN AND COGNITIVE SCIENCES A.I. Memo No. 1654 March23, 1999

More information

BP neural network-based sports performance prediction model applied research

BP neural network-based sports performance prediction model applied research Avaiabe onine www.jocpr.com Journa of Chemica and Pharmaceutica Research, 204, 6(7:93-936 Research Artice ISSN : 0975-7384 CODEN(USA : JCPRC5 BP neura networ-based sports performance prediction mode appied

More information

Target Location Estimation in Wireless Sensor Networks Using Binary Data

Target Location Estimation in Wireless Sensor Networks Using Binary Data Target Location stimation in Wireess Sensor Networks Using Binary Data Ruixin Niu and Pramod K. Varshney Department of ectrica ngineering and Computer Science Link Ha Syracuse University Syracuse, NY 344

More information

STA 216 Project: Spline Approach to Discrete Survival Analysis

STA 216 Project: Spline Approach to Discrete Survival Analysis : Spine Approach to Discrete Surviva Anaysis November 4, 005 1 Introduction Athough continuous surviva anaysis differs much from the discrete surviva anaysis, there is certain ink between the two modeing

More information

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ).

Bourgain s Theorem. Computational and Metric Geometry. Instructor: Yury Makarychev. d(s 1, s 2 ). Bourgain s Theorem Computationa and Metric Geometry Instructor: Yury Makarychev 1 Notation Given a metric space (X, d) and S X, the distance from x X to S equas d(x, S) = inf d(x, s). s S The distance

More information

CONGRUENCES. 1. History

CONGRUENCES. 1. History CONGRUENCES HAO BILLY LEE Abstract. These are notes I created for a seminar tak, foowing the papers of On the -adic Representations and Congruences for Coefficients of Moduar Forms by Swinnerton-Dyer and

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/P10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of temperature on the rate of photosynthesis

More information

Convolutional Networks 2: Training, deep convolutional networks

Convolutional Networks 2: Training, deep convolutional networks Convoutiona Networks 2: Training, deep convoutiona networks Hakan Bien Machine Learning Practica MLP Lecture 8 30 October / 6 November 2018 MLP Lecture 8 / 30 October / 6 November 2018 Convoutiona Networks

More information

$, (2.1) n="# #. (2.2)

$, (2.1) n=# #. (2.2) Chapter. Eectrostatic II Notes: Most of the materia presented in this chapter is taken from Jackson, Chap.,, and 4, and Di Bartoo, Chap... Mathematica Considerations.. The Fourier series and the Fourier

More information

Mixed Volume Computation, A Revisit

Mixed Volume Computation, A Revisit Mixed Voume Computation, A Revisit Tsung-Lin Lee, Tien-Yien Li October 31, 2007 Abstract The superiority of the dynamic enumeration of a mixed ces suggested by T Mizutani et a for the mixed voume computation

More information

Distribution Systems Voltage Profile Improvement with Series FACTS Devices Using Line Flow-Based Equations

Distribution Systems Voltage Profile Improvement with Series FACTS Devices Using Line Flow-Based Equations 16th NATIONAL POWER SYSTEMS CONFERENCE, 15th-17th DECEMBER, 010 386 Distribution Systems otage Profie Improvement with Series FACTS Devices Using Line Fow-Based Equations K. enkateswararao, P. K. Agarwa

More information

Chemical Kinetics Part 2. Chapter 16

Chemical Kinetics Part 2. Chapter 16 Chemica Kinetics Part 2 Chapter 16 Integrated Rate Laws The rate aw we have discussed thus far is the differentia rate aw. Let us consider the very simpe reaction: a A à products The differentia rate reates

More information

General Certificate of Education Advanced Level Examination June 2010

General Certificate of Education Advanced Level Examination June 2010 Genera Certificate of Education Advanced Leve Examination June 2010 Human Bioogy HBI6T/Q10/task Unit 6T A2 Investigative Skis Assignment Task Sheet The effect of using one or two eyes on the perception

More information

4 1-D Boundary Value Problems Heat Equation

4 1-D Boundary Value Problems Heat Equation 4 -D Boundary Vaue Probems Heat Equation The main purpose of this chapter is to study boundary vaue probems for the heat equation on a finite rod a x b. u t (x, t = ku xx (x, t, a < x < b, t > u(x, = ϕ(x

More information

Data Mining Technology for Failure Prognostic of Avionics

Data Mining Technology for Failure Prognostic of Avionics IEEE Transactions on Aerospace and Eectronic Systems. Voume 38, #, pp.388-403, 00. Data Mining Technoogy for Faiure Prognostic of Avionics V.A. Skormin, Binghamton University, Binghamton, NY, 1390, USA

More information

Discriminant Analysis: A Unified Approach

Discriminant Analysis: A Unified Approach Discriminant Anaysis: A Unified Approach Peng Zhang & Jing Peng Tuane University Eectrica Engineering & Computer Science Department New Oreans, LA 708 {zhangp,jp}@eecs.tuane.edu Norbert Riede Tuane University

More information

Statistics for Applications. Chapter 7: Regression 1/43

Statistics for Applications. Chapter 7: Regression 1/43 Statistics for Appications Chapter 7: Regression 1/43 Heuristics of the inear regression (1) Consider a coud of i.i.d. random points (X i,y i ),i =1,...,n : 2/43 Heuristics of the inear regression (2)

More information

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c)

A Simple and Efficient Algorithm of 3-D Single-Source Localization with Uniform Cross Array Bing Xue 1 2 a) * Guangyou Fang 1 2 b and Yicai Ji 1 2 c) A Simpe Efficient Agorithm of 3-D Singe-Source Locaization with Uniform Cross Array Bing Xue a * Guangyou Fang b Yicai Ji c Key Laboratory of Eectromagnetic Radiation Sensing Technoogy, Institute of Eectronics,

More information

Optimal Control of Assembly Systems with Multiple Stages and Multiple Demand Classes 1

Optimal Control of Assembly Systems with Multiple Stages and Multiple Demand Classes 1 Optima Contro of Assemby Systems with Mutipe Stages and Mutipe Demand Casses Saif Benjaafar Mohsen EHafsi 2 Chung-Yee Lee 3 Weihua Zhou 3 Industria & Systems Engineering, Department of Mechanica Engineering,

More information

SydU STAT3014 (2015) Second semester Dr. J. Chan 18

SydU STAT3014 (2015) Second semester Dr. J. Chan 18 STAT3014/3914 Appied Stat.-Samping C-Stratified rand. sampe Stratified Random Samping.1 Introduction Description The popuation of size N is divided into mutuay excusive and exhaustive subpopuations caed

More information

(This is a sample cover image for this issue. The actual cover is not yet available at this time.)

(This is a sample cover image for this issue. The actual cover is not yet available at this time.) (This is a sampe cover image for this issue The actua cover is not yet avaiabe at this time) This artice appeared in a journa pubished by Esevier The attached copy is furnished to the author for interna

More information

More Scattering: the Partial Wave Expansion

More Scattering: the Partial Wave Expansion More Scattering: the Partia Wave Expansion Michae Fower /7/8 Pane Waves and Partia Waves We are considering the soution to Schrödinger s equation for scattering of an incoming pane wave in the z-direction

More information

u(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0

u(x) s.t. px w x 0 Denote the solution to this problem by ˆx(p, x). In order to obtain ˆx we may simply solve the standard problem max x 0 Bocconi University PhD in Economics - Microeconomics I Prof M Messner Probem Set 4 - Soution Probem : If an individua has an endowment instead of a monetary income his weath depends on price eves In particuar,

More information

Asynchronous Control for Coupled Markov Decision Systems

Asynchronous Control for Coupled Markov Decision Systems INFORMATION THEORY WORKSHOP (ITW) 22 Asynchronous Contro for Couped Marov Decision Systems Michae J. Neey University of Southern Caifornia Abstract This paper considers optima contro for a coection of

More information

Lecture Note 3: Stationary Iterative Methods

Lecture Note 3: Stationary Iterative Methods MATH 5330: Computationa Methods of Linear Agebra Lecture Note 3: Stationary Iterative Methods Xianyi Zeng Department of Mathematica Sciences, UTEP Stationary Iterative Methods The Gaussian eimination (or

More information

Quantum Mechanical Models of Vibration and Rotation of Molecules Chapter 18

Quantum Mechanical Models of Vibration and Rotation of Molecules Chapter 18 Quantum Mechanica Modes of Vibration and Rotation of Moecues Chapter 18 Moecuar Energy Transationa Vibrationa Rotationa Eectronic Moecuar Motions Vibrations of Moecues: Mode approximates moecues to atoms

More information

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees

Improving the Accuracy of Boolean Tomography by Exploiting Path Congestion Degrees Improving the Accuracy of Booean Tomography by Expoiting Path Congestion Degrees Zhiyong Zhang, Gaoei Fei, Fucai Yu, Guangmin Hu Schoo of Communication and Information Engineering, University of Eectronic

More information

Stochastic Variational Inference with Gradient Linearization

Stochastic Variational Inference with Gradient Linearization Stochastic Variationa Inference with Gradient Linearization Suppementa Materia Tobias Pötz * Anne S Wannenwetsch Stefan Roth Department of Computer Science, TU Darmstadt Preface In this suppementa materia,

More information

Some Measures for Asymmetry of Distributions

Some Measures for Asymmetry of Distributions Some Measures for Asymmetry of Distributions Georgi N. Boshnakov First version: 31 January 2006 Research Report No. 5, 2006, Probabiity and Statistics Group Schoo of Mathematics, The University of Manchester

More information

Problem set 6 The Perron Frobenius theorem.

Problem set 6 The Perron Frobenius theorem. Probem set 6 The Perron Frobenius theorem. Math 22a4 Oct 2 204, Due Oct.28 In a future probem set I want to discuss some criteria which aow us to concude that that the ground state of a sef-adjoint operator

More information

Traffic data collection

Traffic data collection Chapter 32 Traffic data coection 32.1 Overview Unike many other discipines of the engineering, the situations that are interesting to a traffic engineer cannot be reproduced in a aboratory. Even if road

More information

(1 ) = 1 for some 2 (0; 1); (1 + ) = 0 for some > 0:

(1 ) = 1 for some 2 (0; 1); (1 + ) = 0 for some > 0: Answers, na. Economics 4 Fa, 2009. Christiano.. The typica househod can engage in two types of activities producing current output and studying at home. Athough time spent on studying at home sacrices

More information

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron

A Solution to the 4-bit Parity Problem with a Single Quaternary Neuron Neura Information Processing - Letters and Reviews Vo. 5, No. 2, November 2004 LETTER A Soution to the 4-bit Parity Probem with a Singe Quaternary Neuron Tohru Nitta Nationa Institute of Advanced Industria

More information

arxiv:hep-ph/ v1 15 Jan 2001

arxiv:hep-ph/ v1 15 Jan 2001 BOSE-EINSTEIN CORRELATIONS IN CASCADE PROCESSES AND NON-EXTENSIVE STATISTICS O.V.UTYUZH AND G.WILK The Andrzej So tan Institute for Nucear Studies; Hoża 69; 00-689 Warsaw, Poand E-mai: utyuzh@fuw.edu.p

More information

https://doi.org/ /epjconf/

https://doi.org/ /epjconf/ HOW TO APPLY THE OPTIMAL ESTIMATION METHOD TO YOUR LIDAR MEASUREMENTS FOR IMPROVED RETRIEVALS OF TEMPERATURE AND COMPOSITION R. J. Sica 1,2,*, A. Haefee 2,1, A. Jaai 1, S. Gamage 1 and G. Farhani 1 1 Department

More information

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM

DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM DIGITAL FILTER DESIGN OF IIR FILTERS USING REAL VALUED GENETIC ALGORITHM MIKAEL NILSSON, MATTIAS DAHL AND INGVAR CLAESSON Bekinge Institute of Technoogy Department of Teecommunications and Signa Processing

More information

Math 1600 Lecture 5, Section 2, 15 Sep 2014

Math 1600 Lecture 5, Section 2, 15 Sep 2014 1 of 6 Math 1600 Lecture 5, Section 2, 15 Sep 2014 Announcements: Continue reading Section 1.3 and aso the Exporation on cross products for next cass. Work through recommended homework questions. Quiz

More information