Smoothing Spline ANOVA Models III. The Multicategory Support Vector Machine and the Polychotomous Penalized Likelihood Estimate

1 Smoothing Spline ANOVA Models III. The Multicategory Support Vector Machine and the Polychotomous Penalized Likelihood Estimate. Grace Wahba. Based on work of Yoonkyung Lee; joint works with Yoonkyung Lee, Yi Lin and Steven A. Ackerman (MSVM), and work of Xiwu Lin (Polychotomous Penalized Likelihood). wahba, yklee, yilin @stat.wisc.edu; xiwu.2.lin@spbhrd.com (GlaxoSmithKline). [References up on the authors' websites.] Joint Statistical Meetings, San Francisco, CA, August 2003.

2 Abstract. We describe two modern methods for statistical model building and classification: penalized likelihood methods and support vector machines (SVMs). Both are obtained as solutions to optimization problems in reproducing kernel Hilbert spaces (RKHS). A training set is given, and an algorithm for classifying future observations is built from it. The ($k$-category) multichotomous penalized likelihood method returns a vector of probabilities $(p_1(t), \dots, p_k(t))$, where $t$ is the attribute vector of the object to be classified. The multicategory support vector machine returns a classifier vector $(f_1(t), \dots, f_k(t))$ satisfying $\sum_l f_l(t) = 0$, where $\max_l f_l(t)$ identifies the category. Two-category SVMs are very well known, while the multicategory SVM (MSVM) described here, which includes modifications for unequal misclassification costs and unrepresentative training sets, is new.

3 We describe applications of each method: for penalized likelihood, estimating the 10-year probability of death due to several causes, as a function of several risk factors observed in a demographic study; and for MSVMs, classifying radiance profiles from the MODIS instrument as clear, water cloud or ice cloud. Some computational and tuning issues are noted.

4 OUTLINE
0. Review of the regularization problem in RKHS.
1. Optimal classification and the Neyman-Pearson Lemma.
2. Penalized likelihood estimation, two classes.
3. The Support Vector Machine, two classes.
4. Penalized likelihood and the SVM compared.
5. Multichotomous penalized likelihood estimates.
6. Multicategory support vector machines (MSVMs).
7. Tuning the estimates.
8. Application to cloud classification from MODIS data.
9. Closing remarks.

5 References
V. Vapnik et al. on SVMs.
[Lee02] Y. Lee. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. Ph.D. thesis, UW-Madison; also TR 1063.
[LeeLinWahba02] Y. Lee, Y. Lin and G. Wahba. Multicategory Support Vector Machines, Theory, and Application to the Classification of Microarray Data and Satellite Radiance Data. TR 1064, submitted.
[YLin02] Y. Lin. Support vector machines and the Bayes rule in classification. Data Mining and Knowledge Discovery, 6:259-275, 2002.
[LWA03] Y. Lee, G. Wahba and S. Ackerman. Classification of Satellite Radiance Data by Multicategory Support Vector Machines. TR 1075, submitted.

6 [XLin98] X. Lin. Smoothing Spline Analysis of Variance for Polychotomous Response Data. Ph.D. thesis, Department of Statistics, University of Wisconsin, Madison WI, 1998. Also TR 1003, available via G. Wahba's home page.
[Wahba02] G. Wahba. Soft and hard classification by reproducing kernel Hilbert space methods. Proc. Natl. Acad. Sci. USA, 99:16524-16530, 2002.
[Wahba99] G. Wahba. Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV. In Advances in Kernel Methods - Support Vector Learning, Schölkopf, Burges and Smola (eds.), MIT Press, 1999.
[XiangWahba96] D. Xiang and G. Wahba. A Generalized Approximate Cross Validation for Smoothing Splines with Non-Gaussian Data. Statistica Sinica, 6:675-692, 1996.

7 0. Regularization Problems in RKHS (from Lecture 1). To set notation: the canonical regularization problem in RKHS we discussed in the first lecture was: given $\{y_i, t_i\}$, $y_i \in S$, $t_i \in T$, and $\{\phi_1, \dots, \phi_M\}$, $M$ special functions defined on $T$, find $f$ of the form $f = \sum_{\nu=1}^M d_\nu \phi_\nu + h$, with $h \in H_K$, to minimize $$I_\lambda\{f, y\} = \frac{1}{n}\sum_{i=1}^n C(y_i, f(t_i)) + \lambda \|h\|^2_{H_K}.$$ Today $y_i$ will just be a class label, $y_i \in S = \{1, 2, \dots, k\}$ ($k$ classes).
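To make the notation concrete, here is a minimal numpy sketch (not code from the talk) of the canonical problem in the special case of squared-error loss $C(y, f) = (y - f)^2$ with no null-space functions ($M = 0$): the representer theorem gives $f(t) = \sum_i c_i K(t, t_i)$, and the coefficients solve a linear system. The Gaussian kernel and the toy data are illustrative assumptions.

```python
# Minimal sketch of the RKHS regularization problem for C(y, f) = (y - f)^2
# and M = 0: by the representer theorem the minimizer is
#   f(t) = sum_i c_i K(t, t_i),  with  (K + n*lambda*I) c = y.
import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    return np.exp(-(s - t) ** 2 / (2 * sigma ** 2))

def fit_regularized(t_train, y, lam, sigma=1.0):
    n = len(y)
    K = gaussian_kernel(t_train[:, None], t_train[None, :], sigma)
    c = np.linalg.solve(K + n * lam * np.eye(n), y)   # coefficients c_i
    return lambda t: gaussian_kernel(t[:, None], t_train[None, :], sigma) @ c

rng = np.random.default_rng(0)
t_train = np.sort(rng.uniform(0, 1, 50))
y = np.sin(2 * np.pi * t_train) + 0.1 * rng.standard_normal(50)
f = fit_regularized(t_train, y, lam=1e-3)
print(f(np.array([0.25, 0.5, 0.75])))
```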

8 1. Optimal Classification and the Neyman-Pearson Lemma. Let $h_A(\cdot)$, $h_B(\cdot)$ be the densities of $t$ for class A and class B. NOTATION: $\pi_A$ = prob. the next observation ($Y$) is an A; $\pi_B = 1 - \pi_A$ = prob. the next observation is a B; $$p(t) = \mathrm{prob}\{Y = A \mid t\} = \frac{\pi_A h_A(t)}{\pi_A h_A(t) + \pi_B h_B(t)}.$$

9 1. Optimal Classification and the Neyman-Pearson Lemma (cont.). Let $c_A$ = cost to falsely call a B an A, and $c_B$ = cost to falsely call an A a B. Bayes classification rule: let $\phi(t) : t \to \{A, B\}$. The optimum (Bayes) classifier (Neyman-Pearson Lemma) minimizes the expected cost: $$\phi_{\mathrm{OPT}}(t) = \begin{cases} A & \text{if } \dfrac{p(t)}{1 - p(t)} > \dfrac{c_A}{c_B}, \\ B & \text{otherwise.} \end{cases}$$
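A one-function sketch of this rule; `p_t`, `c_A` and `c_B` are illustrative names, not from the slides.

```python
# Bayes rule from the slide: classify as A exactly when the posterior odds
# p(t)/(1 - p(t)) exceed the cost ratio c_A/c_B.
def bayes_classify(p_t, c_A=1.0, c_B=1.0):
    """p_t = P(Y = A | t); returns 'A' or 'B', minimizing expected cost."""
    return "A" if p_t / (1.0 - p_t) > c_A / c_B else "B"

print(bayes_classify(0.6))            # 'A' with equal costs
print(bayes_classify(0.6, c_A=2.0))   # 'B': calling a B an A is now expensive
```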

10 2. Penalized likelihood estimation, two classes. Let $f(t) = \log[p(t)/(1 - p(t))]$, the log odds ratio. Assume (for simplicity only) $c_A/c_B = 1$. Then the optimal classifier is: $f(t) > 0$ (equivalently, $p(t) - \frac{1}{2} > 0$) $\Rightarrow$ A; $f(t) < 0$ (equivalently, $p(t) - \frac{1}{2} < 0$) $\Rightarrow$ B. To estimate $f$: assume (again for simplicity only) that the relative frequency of A's in the training set is about the same as in the general population.

11 2. Penalized likelihood estimation, two classes (cont.). Code the data as $y_i = 1$ if A, $y_i = 0$ if B (important). The probability distribution function (likelihood) for $y$ given $p$ is $$L(y, p) = p^y (1 - p)^{1 - y} = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0, \end{cases}$$ and the negative log likelihood is $$-\log L = -\log[p^y (1 - p)^{1 - y}] = -y \log p - (1 - y) \log(1 - p).$$ Substituting $p = e^f/(1 + e^f)$ gives $$-\log L(y, f) = -yf + \log(1 + e^f).$$ The penalized log likelihood estimate of $f$ is obtained by setting $C(y_i, f(t_i)) = -y_i f(t_i) + \log(1 + e^{f(t_i)})$ in the optimization problem $I_\lambda(f, y)$.
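A minimal sketch of this estimate, assuming the representer form $f = Kc$ so that $\|h\|^2_{H_K} = c'Kc$. The closing remarks mention Newton-Raphson for the actual computation; plain gradient descent is used here only to keep the sketch short, and the toy data are made up.

```python
# Gradient descent on the penalized negative log likelihood
#   (1/n) sum_i [ -y_i f(t_i) + log(1 + e^{f(t_i)}) ] + lambda c' K c,
# with f = K c and y coded 1 = A, 0 = B as on the slide.
import numpy as np

def fit_penalized_logistic(K, y, lam, lr=0.5, iters=2000):
    n = len(y)
    c = np.zeros(n)
    for _ in range(iters):
        f = K @ c
        p = 1.0 / (1.0 + np.exp(-f))              # p = e^f / (1 + e^f)
        grad = K @ ((p - y) / n) + 2 * lam * (K @ c)
        c -= lr * grad
    return c

# toy data: class A clusters near t = 1, class B near t = -1
rng = np.random.default_rng(1)
t = np.concatenate([rng.normal(1, 0.5, 30), rng.normal(-1, 0.5, 30)])
y = np.concatenate([np.ones(30), np.zeros(30)])
K = np.exp(-(t[:, None] - t[None, :]) ** 2)
c = fit_penalized_logistic(K, y, lam=1e-3)
p_hat = 1 / (1 + np.exp(-(K @ c)))                # estimated probabilities
print("est. probabilities at a few points:", np.round(p_hat[:3], 2))
```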

12 3. The Support Vector Machine, two classes. Code $y = +1$ for A, $y = -1$ for B (note the different coding). Find $f(t) = d + h(t)$ with $h \in H_K$ to minimize $$\frac{1}{n}\sum_{i=1}^n (1 - y_i f(t_i))_+ + \lambda \|h\|^2_{H_K} \quad (*)$$ where $(\tau)_+ = \tau$ if $\tau > 0$, $= 0$ otherwise. Then $$f_\lambda(t) = d + \sum_{i=1}^n c_i K(t, t_i). \quad (**)$$ Substitute (**) into (*); choose $\lambda$ and, given $\lambda$, find $c$ and $d$. The classifier is: $f_\lambda(t) > 0 \Rightarrow$ A, $f_\lambda(t) < 0 \Rightarrow$ B.
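A small subgradient-descent sketch of this objective. The talk's closing remarks note that mathematical programming is used in practice; the crude iterative solver and toy data here are my own simplifications, for illustration only.

```python
# Subgradient descent on the two-class SVM objective
#   (1/n) sum_i (1 - y_i f(t_i))_+ + lambda c' K c,
# with f(t) = d + sum_i c_i K(t, t_i); not an optimized solver.
import numpy as np

def fit_svm(K, y, lam, lr=0.1, iters=3000):
    n = len(y)
    c, d = np.zeros(n), 0.0
    for _ in range(iters):
        f = d + K @ c
        active = (1 - y * f) > 0          # points inside the hinge
        g_f = -(y * active) / n           # subgradient of hinge term wrt f
        c -= lr * (K @ g_f + 2 * lam * (K @ c))
        d -= lr * g_f.sum()
    return c, d

rng = np.random.default_rng(2)
t = np.concatenate([rng.normal(1, 0.5, 30), rng.normal(-1, 0.5, 30)])
y = np.concatenate([np.ones(30), -np.ones(30)])   # +1 = A, -1 = B
K = np.exp(-(t[:, None] - t[None, :]) ** 2)
c, d = fit_svm(K, y, lam=1e-3)
pred = np.sign(d + K @ c)
print("training accuracy:", (pred == y).mean())
```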

13 4. Penalized likelihood estimation and the SVM compared. Let us relabel $y$ in the likelihood as $\tilde y_i = +1$ if A, $-1$ if B. Then $$-yf + \log(1 + e^f) \;\to\; \log(1 + e^{-\tilde y f}).$$ Figure 1 compares $\log(1 + e^{-yf})$, $(1 - yf)_+$ and $[-yf]_*$, where $[\tau]_* = 1$ if $\tau > 0$, $= 0$ otherwise. $[-yf]_*$ is the misclassification counter. $(1 - yf)_+$ is known as the hinge function.

14 [Figure 1: plot of $[-\tau]_*$, $(1-\tau)_+$ and $\log_2(1 + e^{-\tau})$ against $\tau$.] Figure 1. Let $C(y_i, f(t_i)) = c(y_i f(t_i)) = c(\tau)$. Comparison of $c(\tau) = [-\tau]_*$, $(1 - \tau)_+$ and $\log_2(1 + e^{-\tau})$. Any strictly convex function that goes through $1$ at $\tau = 0$ will be an upper bound on the misclassification counter $[-\tau]_*$, and will be a looser bound than some SVM (hinge) function $(1 - \theta\tau)_+$. There are many other large margin classifiers (see [Wahba02]).
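A short matplotlib sketch reproducing the comparison in Figure 1 (my reconstruction of the plot, not the original figure code).

```python
# The three losses of Figure 1 as functions of tau = y f(t): the
# misclassification counter [-tau]_*, the hinge (1 - tau)_+, and
# log2(1 + e^{-tau}), scaled so it passes through 1 at tau = 0.
import numpy as np
import matplotlib.pyplot as plt

tau = np.linspace(-3, 3, 601)
counter = (tau < 0).astype(float)        # [-tau]_* : 1 iff y f(t) < 0
hinge = np.maximum(1 - tau, 0)           # (1 - tau)_+
nll = np.log2(1 + np.exp(-tau))          # log2(1 + e^{-tau})

plt.plot(tau, counter, label=r"$[-\tau]_*$")
plt.plot(tau, hinge, label=r"$(1-\tau)_+$")
plt.plot(tau, nll, label=r"$\log_2(1+e^{-\tau})$")
plt.xlabel(r"$\tau = y f(t)$"); plt.legend(); plt.show()
```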

15 4. Penalized likelihood and the SVM compared (cont.). The penalized log likelihood estimate is tuned by a criterion which chooses $\lambda$ to minimize a proxy for $$R(\lambda) = E\,\frac{1}{n}\sum_{i=1}^n \left[ -y_i^{new} f_\lambda(t_i) + \log(1 + e^{f_\lambda(t_i)}) \right]$$ [XiangWahba96]. $R(\lambda)$ is the expected distance (negative log likelihood) for a new observation at the same $t_i$; $f_{\lambda_{opt}}$ estimates the log odds ratio $\log[p/(1-p)]$. We say the SVM classifier is optimally tuned if we have a criterion which chooses $\lambda$ to minimize a proxy for $$R(\lambda) = E\,\frac{1}{n}\sum_{i=1}^n (1 - y_i^{new} f_\lambda(t_i))_+.$$ That is, it chooses $\lambda$ to minimize a proxy for an upper bound on the misclassification rate [LeeLinWahba02][Wahba99].

16 4. Penalized likelihood and the SVM compared (cont.). What is the SVM estimating? Lemma (Yi Lin [YLin02]): The minimizer of $E(1 - y^{new} f(t))_+$ is $\mathrm{sign}\, f(t)$ ($= \mathrm{sign}(p(t) - \frac{1}{2}) = \mathrm{sign}(2p(t) - 1)$), where $f(t) = \log[p(t)/(1 - p(t))]$. So the SVM, the solution of the problem: find $f_\lambda = d + h$ which minimizes $$\frac{1}{n}\sum_{i=1}^n (1 - y_i f(t_i))_+ + \lambda \|h\|^2_{H_K},$$ where $\lambda$ is chosen to minimize (a proxy for) $R(\lambda)$, is estimating $\mathrm{sign}\, f(t)$ - not $f(t)$ itself, but just what you need to minimize the misclassification rate.

17 4. Penalized likelihood and the SVM compared (cont.). [Figure: truth, penalized likelihood estimate and SVM estimate plotted against $t$.] 300 Bernoulli random variables were generated at equally spaced $t$ from $p(t) = 0.4\sin(0.4\pi t) + 0.5$. Solid line: $2p(t) - 1$. Dotted line: $2p_\lambda - 1$, where $p_\lambda$ is the (optimally tuned) penalized likelihood estimate of $p$. Dashed line: $f_\lambda^{svm}$, the (optimally tuned) SVM. Observe $f_\lambda^{svm} \approx \pm 1$; thus $p_\lambda$ is estimating $p(t)$, whereas $f_\lambda^{svm}$ is estimating $\mathrm{sign}(2p - 1) = \mathrm{sign}(p - 1/2) = \mathrm{sign}\, f$. (Based on a Gaussian kernel $K$; plot by Yoonkyung Lee.)
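A hedged sketch of this experiment, reusing the `fit_penalized_logistic` and `fit_svm` sketches above. Unlike the slide, $\lambda$ and the kernel width are hand-picked here rather than optimally tuned, so the qualitative behavior (the SVM fit saturating near $\pm 1$) is all that should be expected to match.

```python
# 300 Bernoulli observations at equally spaced t with
# p(t) = 0.4 sin(0.4*pi*t) + 0.5; the penalized likelihood fit tracks p(t),
# while the SVM fit hovers near sign(2p(t) - 1).
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 5, 300)
p = 0.4 * np.sin(0.4 * np.pi * t) + 0.5
y01 = rng.binomial(1, p)                      # 1/0 coding for the likelihood
ypm = 2.0 * y01 - 1.0                         # +1/-1 coding for the SVM
K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * 0.3 ** 2))

c_pl = fit_penalized_logistic(K, y01.astype(float), lam=1e-4)
p_hat = 1 / (1 + np.exp(-(K @ c_pl)))         # estimates p(t)
c_sv, d_sv = fit_svm(K, ypm, lam=1e-4)
f_svm = d_sv + K @ c_sv                       # saturates near +/- 1
print(np.round(p_hat[:5], 2), np.round(f_svm[:5], 2))
```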

18 5. Multichotomous penalized likelihood [XLin98]. $k + 1$ categories, $k > 1$. Let $p_j(t)$ be the probability that a subject with attribute vector $t$ is in category $j$, with $\sum_{j=0}^k p_j(t) = 1$. From [XLin98]: let $$f_j(t) = \log[p_j(t)/p_0(t)], \quad j = 1, \dots, k.$$ Then $$p_j(t) = \frac{e^{f_j(t)}}{1 + \sum_{j=1}^k e^{f_j(t)}}, \quad j = 1, \dots, k, \qquad p_0(t) = \frac{1}{1 + \sum_{j=1}^k e^{f_j(t)}}.$$ Coding: $y_i = (y_{i1}, \dots, y_{ik})$, with $y_{ij} = 1$ if the $i$th subject is in category $j$ and $0$ otherwise.

19 5. Multichotomous penalized likelihood (cont.). Letting $f = (f_1, \dots, f_k)$, the negative log likelihood can be written as $$-\log L(y, f) = \sum_{i=1}^n \left\{ -\sum_{j=1}^k y_{ij} f_j(t_i) + \log\Big(1 + \sum_{j=1}^k e^{f_j(t_i)}\Big) \right\},$$ where $$f_j = \sum_{\nu=1}^M d_{\nu j} \phi_\nu + h_j.$$ $\lambda \|h\|^2_{H_K}$ becomes $\sum_{j=1}^k \lambda_j \|h_j\|^2_{H_K}$, and the optimization problem becomes: minimize $$I_\lambda(y, f) = -\log L(y, f) + \sum_{j=1}^k \lambda_j \|h_j\|^2_{H_K}.$$
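An illustrative numpy sketch of the likelihood pieces just defined: recovering $(p_0, \dots, p_k)$ from the $f_j = \log p_j/p_0$, and evaluating $-\log L(y, f)$. The array names `F` and `Y` and the toy numbers are assumptions for the example.

```python
# F is an (n, k) array of f_j(t_i); Y is the 0/1 coding from the slide
# (a subject in category 0 has an all-zero row).
import numpy as np

def probs_from_f(F):
    expF = np.exp(F)                               # shape (n, k)
    denom = 1.0 + expF.sum(axis=1, keepdims=True)
    p0 = 1.0 / denom
    return np.hstack([p0, expF / denom])           # columns p_0, p_1, ..., p_k

def neg_log_lik(F, Y):
    # -log L = sum_i [ -sum_j y_ij f_j(t_i) + log(1 + sum_j e^{f_j(t_i)}) ]
    return np.sum(-np.sum(Y * F, axis=1) + np.log1p(np.exp(F).sum(axis=1)))

F = np.array([[0.2, -1.0], [1.5, 0.3]])            # n = 2 subjects, k = 2
Y = np.array([[1, 0], [0, 1]])                     # category 1, then category 2
P = probs_from_f(F)
print(np.round(P, 3), "row sums:", P.sum(axis=1))  # each row sums to 1
print("neg. log likelihood:", round(neg_log_lik(F, Y), 3))
```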

20 5. Multichotomous penalized likelihood (cont.). 10-year risk of mortality as a function of $t = (x_1, x_2, x_3)$ = age, glycosylated hemoglobin, and systolic blood pressure [from XLin98]. [Figure: estimated probability vs. age, with $x_2$ and $x_3$ set at their medians.] The differences between adjacent curves (from bottom to top) are the probabilities $p_j(t)$ for 0: alive, 1: diabetes, 2: heart attack, 3: other causes. $$f^j(x_1, x_2, x_3) = \mu_j + f_1^j(x_1) + f_2^j(x_2) + f_3^j(x_3) + f_{23}^j(x_2, x_3)$$ (a Smoothing Spline ANOVA model).

21 6. Multicategory support vector machines (MSVMs). From [LeeLinWahba02], [LWA03], and earlier reports. $k > 2$ categories. Coding: $y_i = (y_{i1}, \dots, y_{ik})$ with $\sum_{j=1}^k y_{ij} = 0$; in particular, $y_{ij} = 1$ if the $i$th subject is in category $j$ and $y_{ij} = -\frac{1}{k-1}$ otherwise. For example, $y_i = (1, -\frac{1}{k-1}, \dots, -\frac{1}{k-1})$ indicates $y_i$ is from category 1. The MSVM produces $f(t) = (f_1(t), \dots, f_k(t))$, with each $f_j = d_j + h_j$, $h_j \in H_K$, required to satisfy the sum-to-zero constraint $\sum_{j=1}^k f_j(t) = 0$ for all $t$ in $T$. The largest component of $f$ indicates the classification.

22 6. Multicategory support vector machines (MSVMs) (cont.). Let $L_{jr} = 1$ for $j \ne r$ and $0$ otherwise. The MSVM is defined as the vector of functions $f_\lambda = (f_\lambda^1, \dots, f_\lambda^k)$, with each $h_j$ in $H_K$ satisfying the sum-to-zero constraint, which minimizes $$\frac{1}{n}\sum_{i=1}^n \sum_{r=1}^k L_{cat(i)r}\,\big(f_r(t_i) - y_{ir}\big)_+ + \lambda \sum_{j=1}^k \|h_j\|^2_{H_K},$$ equivalently $$\frac{1}{n}\sum_{i=1}^n \sum_{r \ne cat(i)} \Big(f_r(t_i) + \frac{1}{k-1}\Big)_+ + \lambda \sum_{j=1}^k \|h_j\|^2_{H_K},$$ where $cat(i)$ is the category of $y_i$. The $k = 2$ case reduces to the usual 2-category SVM. The target for the MSVM is $f(t) = (f_1(t), \dots, f_k(t))$ with $f_j(t) = 1$ if $p_j(t)$ is bigger than the other $p_l(t)$, and $f_j(t) = -\frac{1}{k-1}$ otherwise.
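A small sketch of the standard MSVM data-fit term under the paper's coding; the toy classifier values `F` are made-up numbers whose rows satisfy the sum-to-zero constraint.

```python
# For subject i in category cat(i), the standard MSVM hinge sums
# (f_r(t_i) + 1/(k-1))_+ over the wrong classes r != cat(i).
import numpy as np

def msvm_hinge(F, cat, k):
    """F: (n, k) classifier values with rows summing to 0; cat: true classes."""
    n = len(cat)
    loss = 0.0
    for i in range(n):
        for r in range(k):
            if r != cat[i]:
                loss += max(F[i, r] + 1.0 / (k - 1), 0.0)
    return loss / n

k = 3
# one point at the target coding for class 0, and one ambiguous point
F = np.array([[1.0, -0.5, -0.5],
              [0.2,  0.1, -0.3]])
cat = np.array([0, 1])
print("MSVM hinge loss:", round(msvm_hinge(F, cat, k), 3))   # 0.0 + 0.9 over 2
```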

23 6. Multicategory support vector machines (MSVMs) (cont.). [Figure: probabilities $p_1(x)$, $p_2(x)$, $p_3(x)$ and target $f_j$'s vs. $x$ for a three-category SVM demonstration (Gaussian kernel).] [Figure: four panels of estimated $f_1$, $f_2$, $f_3$ vs. $x$.] The left panel gives the estimated $f_1$, $f_2$ and $f_3$ with $\lambda$ and $\sigma$ optimally tuned (i.e., with knowledge of the right answer). In the second panel from the left, both $\lambda$ and $\sigma$ were chosen by 5-fold cross validation in the MSVM, and in the third panel they were chosen by GACV. In the rightmost panel the classification is carried out by a one-vs-rest method.

24 6. Multicategory support vector machines (MSVMs) (cont.). The nonstandard MSVM: more generally, suppose the sample is not representative and the misclassification costs are not equal. Let $$L_{jr} = (\pi_j/\pi_j^s)\, C_{jr}, \quad j \ne r,$$ where $C_{jr}$ is the cost of misclassifying a $j$ as an $r$, $C_{rr} = 0$, $\pi_j$ is the prior probability of category $j$, and $\pi_j^s$ is the fraction of samples from category $j$ in the training set. Then the nonstandard MSVM has as its target the Bayes rule, which is to choose the $j$ which minimizes $\sum_{l=1}^k C_{lj}\, p_l(x)$.
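A sketch of the two nonstandard ingredients: building $L_{jr} = (\pi_j/\pi_j^s) C_{jr}$, and evaluating the target Bayes rule. The priors, sample fractions, and cost matrix below are made-up numbers.

```python
import numpy as np

pi = np.array([0.7, 0.2, 0.1])        # population priors pi_j
pi_s = np.array([1/3, 1/3, 1/3])      # training-set fractions pi_j^s
C = np.array([[0.0, 1.0, 1.0],        # C[j, r] = cost of calling a j an r
              [4.0, 0.0, 1.0],        # e.g. misclassifying a 1 as a 0 costs 4
              [1.0, 1.0, 0.0]])       # diagonal C_rr = 0

L = (pi / pi_s)[:, None] * C          # L_jr = (pi_j / pi_j^s) C_jr
print(np.round(L, 2))

def bayes_rule(p_x, C):
    """p_x: vector of p_l(x); returns argmin_j of sum_l C[l, j] p_l(x)."""
    return int(np.argmin(C.T @ p_x))

print("Bayes choice at p(x) = (0.5, 0.3, 0.2):",
      bayes_rule(np.array([0.5, 0.3, 0.2]), C))
```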

25 7. Tuning the estimates: GACV (generalized approximate cross validation). Penalized likelihood: [XiangWahba96][XLin98]; SVM: [Wahba99]; MSVM: [Lee02][LeeLinWahba02]. Leaving-out-one: $$V_O(\lambda) = \frac{1}{n}\sum_{i=1}^n C\big(y_i, f_\lambda^{[i]}(t_i)\big),$$ where $f_\lambda^{[i]}$ is the estimate without the $i$th data point. $$GACV(\lambda) = \frac{1}{n}\sum_{i=1}^n C\big(y_i, f_\lambda(t_i)\big) + D(y, f_\lambda),$$ where $$D(y, f_\lambda) \approx \frac{1}{n}\sum_{i=1}^n \Big\{ C\big(y_i, f_\lambda^{[i]}(t_i)\big) - C\big(y_i, f_\lambda(t_i)\big) \Big\}$$ is obtained by a tailored perturbation argument. It is easy to compute for the SVM; randomized trace techniques are used to estimate the perturbation in the likelihood case.
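For intuition, here is a brute-force computation of the leave-one-out score $V_O(\lambda)$ that GACV approximates without $n$ refits. This self-contained sketch uses the penalized-likelihood loss $C(y, f) = -yf + \log(1 + e^f)$, a compact gradient-descent fitter, and made-up toy data; it is $O(n)$ refits and illustrative only.

```python
import numpy as np

def fit(K, y, lam, lr=0.5, iters=1000):
    # same penalized-likelihood fit as the earlier sketch, compacted
    c = np.zeros(len(y))
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(K @ c)))
        c -= lr * (K @ ((p - y) / len(y)) + 2 * lam * (K @ c))
    return c

def V_O(t, y, lam, sigma=0.5):
    n = len(y)
    K = np.exp(-(t[:, None] - t[None, :]) ** 2 / (2 * sigma ** 2))
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        c = fit(K[np.ix_(keep, keep)], y[keep], lam)
        f_i = K[i, keep] @ c                 # f_lambda^{[i]}(t_i)
        total += -y[i] * f_i + np.log1p(np.exp(f_i))
    return total / n

rng = np.random.default_rng(4)
t = np.sort(rng.uniform(0, 1, 40))
y = rng.binomial(1, 0.5 + 0.4 * np.sin(2 * np.pi * t)).astype(float)
for lam in (1e-1, 1e-2, 1e-3):               # pick the lambda minimizing V_O
    print(lam, round(V_O(t, y, lam), 4))
```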

26 8. The classification of upwelling MODIS radiance data as clear sky, water clouds or ice clouds. From [LWA03]: classification of 12 channels of upwelling radiance data from the satellite-borne MODIS instrument. MODIS is a key part of the Earth Observing System (EOS). Classify each vertical profile as coming from clear sky, water clouds, or ice clouds. Next page: 744 simulated radiance profiles (81 clear - blue, 202 water clouds - green, 461 ice clouds - purple); 10 sample profiles each from clear, water and ice:

27 [Figure: sample profiles of Reflectance vs. wavelength (microns) and Brightness Temperature vs. wavelength (microns), for clear, liquid cloud and ice cloud.]

28 [Figure: pairwise plots over BT channel 31, BT channel 32, R channel 1/R channel 2, R channel 2, and $\log_{10}$(R channel 5/R channel 6).] Pairwise plots of three different variables (including composite variables). (Purple = ice clouds, green = water clouds, blue = clear.)

29 [Figure: $\log_{10}$(R channel 5/R channel 6) vs. R channel 2.] Classification boundaries on the 374-profile test set determined by the MSVM using 370 training examples and two variables, one of them composite. Y. K. Lee won the student poster prize in the American Meteorological Society Satellite Meteorology and Oceanography session with this work.

30 [Figure: $\log_{10}$(R channel 5/R channel 6) vs. R channel 2.] Classification boundaries determined by the nonstandard MSVM when the cost of misclassifying clear as cloud is 4 times higher than the other types of misclassification.

31 9. Closing Remarks. We have examined two-class and multi-class penalized likelihood estimates and support vector machines, both of which can be obtained as solutions of optimization problems in an RKHS. Non-representative samples and unequal misclassification costs can be handled. Newton-Raphson iteration and mathematical programming are used to solve the optimization problems; the downhill simplex method works well for searching over multiple $\lambda$'s. Convergence theory for various penalized likelihood models has been around a long time; convergence theory for the SVM (to $\mathrm{sign}\, f$) and the MSVM (to its target) is younger. Penalized likelihood estimates and SVMs are just two of the many examples of optimization problems related to regularization in RKHS, which have many useful scientific applications.
