1. Classification
Ji Zhu
445C West Hall
jizhu@umich.edu

2. Examples of Classification
- Predicting tumor cells as benign or malignant.
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.

3. Classification: Definition
Given a collection of data points:
- Each data point contains a set of variables; one of the variables is the class (categorical, qualitative).
- Find a model for the class variable as a function of the values of the other variables.
- Goal: previously unseen data points should be assigned a class as accurately as possible.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

4. Illustrating Classification [figure]

5. Mathematical Setup
- Class label Y; input variables X = (X_1, X_2, ..., X_p).
- Y takes values in a finite, unordered set (survived/died, cancer class of a tissue sample, ...).
- Two-class: Y ∈ {c_1, c_2}. Multi-class: Y ∈ {c_1, c_2, ..., c_K}.
- We have training data, which are observations (examples, instances) of these measurements.

6. Objectives
On the basis of the training data we would like to:
- Produce a classifier Ĉ(x) that accurately predicts unseen test cases.
- Understand which inputs affect the output, and how.

7. Optimal Classifier
(X, Y) have a joint probability distribution. Choose Ĉ(x) to have small misclassification error
R(C) = Pr(C(X) ≠ Y)
The Bayes optimal classifier is
C*(x) = arg min_C R(C) = arg max_k Pr(Y = c_k | X = x)

8. Generative Methods
Estimate f(x | Y = c_k), then use the Bayes rule
Pr(Y = c_k | X = x) ∝ f(x | Y = c_k) Pr(Y = c_k)
- Linear discriminant analysis (LDA)
- Quadratic discriminant analysis (QDA)
- Naive Bayes

9. [Figure: data in two classes plotted against X_1 and X_2]

10. Discriminative Methods
Estimate Pr(Y = c_k | X = x) directly:
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

11. Linear Discriminant Analysis
Let π_k be the prior probability of class k, and let f_k(x) be the class-conditional density of X in class k. The posterior probability is
Pr(Y = c_k | X = x) = f_k(x) π_k / Σ_{k'=1}^K f_{k'}(x) π_{k'}

12. We model each class density as multivariate Gaussian N(µ_k, Σ_k):
f_k(x) = (2π)^{-p/2} |Σ_k|^{-1/2} exp(-(1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k))
Assume Σ_k = Σ for all k. For each k, the discriminant function is
δ_k(x) = x^T Σ^{-1} µ_k - (1/2) µ_k^T Σ^{-1} µ_k + log π_k
and the decision rule is
Ĉ(x) = arg max_k δ_k(x)

13. Remarks
Classify x to the class with the closest centroid, using the squared Mahalanobis distance. Special case: Σ = I (then the Euclidean distance is used):
δ_k(x) = -(1/2) ||x - µ_k||² + log π_k

14. Comparing class k and class k', the log-ratio is
log [Pr(Y = c_k | X = x) / Pr(Y = c_{k'} | X = x)]
  = x^T Σ^{-1} (µ_k - µ_{k'}) + log(π_k / π_{k'}) - (1/2)(µ_k + µ_{k'})^T Σ^{-1} (µ_k - µ_{k'})
This is a linear decision boundary, with directional vector Σ^{-1}(µ_k - µ_{k'}); it is generally not in the direction of (µ_k - µ_{k'}).

15, 16. [Figures]

17. Parameter Estimation of LDA
In practice, we estimate the parameters from the training data:
- π̂_k = n_k / n, where n_k is the number of observations in class k.
- µ̂_k = Σ_{y_i = c_k} x_i / n_k
- The pooled covariance Σ̂ = Σ_{k=1}^K Σ_{y_i = c_k} (x_i - µ̂_k)(x_i - µ̂_k)^T / (n - K)
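
A minimal NumPy sketch of these estimates and the resulting decision rule; the function names (lda_fit, lda_predict) and the variables X, y are illustrative, not from the slides.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate the priors, class means, and pooled covariance."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == c) for c in classes])        # pi_k = n_k / n
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # mu_k
    Sigma = np.zeros((p, p))
    for c, mu in zip(classes, means):
        R = X[y == c] - mu
        Sigma += R.T @ R
    Sigma /= n - K                                               # pooled covariance
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    delta = X @ Sinv @ means.T - 0.5 * np.sum(means @ Sinv * means, axis=1) + np.log(priors)
    return classes[np.argmax(delta, axis=1)]
```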

18. Quadratic Discriminant Analysis
The Σ_k's are allowed to be different. The discriminant function is
δ_k(x) = -(1/2) log|Σ_k| - (1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k) + log π_k = x^T W_k x + x^T w_k + b_k
The decision boundary between class k and class k' is a quadratic function
{x : x^T (W_k - W_{k'}) x + x^T (w_k - w_{k'}) + (b_k - b_{k'}) = 0}
There are more parameters in QDA than in LDA, especially when p is large.

19. Both LDA and QDA perform well on many real classification problems.

20. Naive Bayes Classifier
Assume independence among the input variables when the class is given:
f(x_1, ..., x_p | Y = c_k) = f(x_1 | c_k) f(x_2 | c_k) ... f(x_p | c_k)
Estimate f(x_j | c_k) for all j and c_k. A new point is classified to c_k if ∏_{j=1}^p f(x_j | c_k) · π_k is maximal. This is a strong assumption, but naive Bayes often classifies well when p is large.
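
A minimal sketch of a Gaussian naive Bayes classifier, assuming each f(x_j | c_k) is modeled as a univariate normal (one common choice; the slide does not prescribe it). Pure NumPy; the names nb_fit and nb_predict are illustrative.

```python
import numpy as np

def nb_fit(X, y):
    """Per-class priors and per-feature means/sds under the independence assumption."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    sd = np.array([X[y == c].std(axis=0) for c in classes])
    return classes, priors, mu, sd

def nb_predict(X, classes, priors, mu, sd):
    """Classify to the class maximizing sum_j log f(x_j | c_k) + log pi_k."""
    logpost = []
    for k in range(len(classes)):
        ll = -0.5 * np.log(2 * np.pi * sd[k] ** 2) - (X - mu[k]) ** 2 / (2 * sd[k] ** 2)
        logpost.append(ll.sum(axis=1) + np.log(priors[k]))
    return classes[np.argmax(np.stack(logpost, axis=1), axis=1)]
```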

21. Generative methods vs. discriminative methods [figure]

22. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

23. Logistic Regression
Two-class case: Y ∈ {c_1, c_2} (K = 2). Use the logit transformation
log [Pr(Y = c_1 | X = x) / Pr(Y = c_2 | X = x)] = β_0 + β^T x
The probabilities are
Pr(Y = c_1 | X = x) = e^{β_0 + β^T x} / (1 + e^{β_0 + β^T x})
Pr(Y = c_2 | X = x) = 1 / (1 + e^{β_0 + β^T x})
This ensures the probabilities lie in [0, 1].

24. Fitting the Logistic Regression Model
Maximum likelihood estimation. Denote θ = (β_0, β), and let x denote (1, x). The conditional log-likelihood of Y given X is
l(θ) = Σ_{i=1}^n log Pr(Y = y_i | X = x_i; θ)

25. Code c_1 as 1 and c_2 as 0; then
l(θ) = Σ_{i=1}^n [y_i log Pr(c_1 | x_i; θ) + (1 - y_i) log Pr(c_2 | x_i; θ)]
     = Σ_{i=1}^n [y_i θ^T x_i - log(1 + e^{θ^T x_i})]

26. Partial Derivative (Score Equations)
∂l(θ)/∂θ = Σ_{i=1}^n x_i (y_i - p(x_i; θ)) = 0
where p(x_i; θ) = e^{θ^T x_i} / (1 + e^{θ^T x_i}). There are (p + 1) equations, nonlinear in θ.

27. Newton-Raphson Algorithm
The second-derivative (Hessian) matrix is
∂²l(θ) / ∂θ ∂θ^T = -Σ_{i=1}^n x_i x_i^T p(x_i; θ) [1 - p(x_i; θ)]
1. Choose an initial value θ^0.
2. Update θ by
θ^new = θ^old - [∂²l(θ)/∂θ∂θ^T]^{-1} ∂l(θ)/∂θ

28. Iteratively Reweighted Least Squares
Using vector and matrix notation, let W = diag[p(x_i; θ^old)(1 - p(x_i; θ^old))]. Then
∂l(θ)/∂θ = X^T (y - p)
∂²l(θ)/∂θ∂θ^T = -X^T W X

29. The Newton-Raphson step is
θ^new = θ^old + (X^T W X)^{-1} X^T (y - p) = (X^T W X)^{-1} X^T W z
where z is the adjusted response
z = X θ^old + W^{-1} (y - p),  i.e.,  z_i = x_i^T θ^old + (y_i - p_i) / [p_i (1 - p_i)]

30. In each iteration, we solve the weighted least squares problem
θ^new = arg min_θ (z - Xθ)^T W (z - Xθ)
θ = 0 can be used as a starting point.
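
A minimal IRLS sketch for the two-class model, assuming a 0/1 response y and a design matrix X whose first column is ones; it omits safeguards against separation (where the weights approach zero).

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    theta = np.zeros(X.shape[1])             # theta = 0 as the starting point
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        W = p * (1.0 - p)                    # diagonal of W, stored as a vector
        z = X @ theta + (y - p) / W          # adjusted response
        # weighted least squares: theta = (X^T W X)^{-1} X^T W z
        theta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return theta
```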

31. Inference
If the model is correct, θ̂ is consistent. Using the central limit theorem, the distribution of θ̂ converges to N(θ, (X^T W X)^{-1}).

32. Multi-class Case
Use class K as a reference:
log [Pr(Y = c_1 | X = x) / Pr(Y = c_K | X = x)] = β_{10} + β_1^T x
log [Pr(Y = c_2 | X = x) / Pr(Y = c_K | X = x)] = β_{20} + β_2^T x
...
log [Pr(Y = c_{K-1} | X = x) / Pr(Y = c_K | X = x)] = β_{(K-1)0} + β_{K-1}^T x
This is multinomial logistic regression.

33. Logistic Regression vs LDA
For LDA, the log-posterior odds between class k and class K are linear:
log [Pr(Y = c_k | X = x) / Pr(Y = c_K | X = x)]
  = x^T Σ^{-1} (µ_k - µ_K) + log(π_k / π_K) - (1/2)(µ_k + µ_K)^T Σ^{-1} (µ_k - µ_K)
  = α_{k0} + α_k^T x
The logistic model has linear logits by construction:
log [Pr(Y = c_k | X = x) / Pr(Y = c_K | X = x)] = β_{k0} + β_k^T x
The same form. Are they the same estimator?

34. Where the Linearity Comes From
For LDA, the linearity is a consequence of the Gaussian assumption for the class densities and the assumption of a common covariance matrix. For logistic regression, the linearity comes by construction. The difference lies in the way the linear coefficients are estimated.

35. Common Component
The joint density of (X, Y) is
Pr(X, Y = c_k) = Pr(X) Pr(Y = c_k | X)
where Pr(X) is the marginal density of the input X. For both LDA and logistic regression, the second term Pr(Y = c_k | X) has the same linear-logit form
Pr(Y = c_k | X = x) = exp(θ_{k0} + θ_k^T x) / (1 + Σ_{k'=1}^{K-1} exp(θ_{k'0} + θ_{k'}^T x))

36. Which Model is More General?
However, they make different assumptions about Pr(X). The logistic model leaves the marginal density of X arbitrary and unspecified. The LDA model assumes a Gaussian mixture density
Pr(x) = Σ_{k=1}^K π_k φ(x; µ_k, Σ)
The logistic model makes fewer assumptions about the data, and is more general.

37. Parameter Estimation
Logistic regression: maximize the conditional likelihood, the multinomial likelihood with probabilities Pr(Y = c_k | X). The marginal density Pr(X) is ignored (it is treated fully nonparametrically, using the empirical distribution function, which places mass 1/n at each observation).

38. LDA: maximize the full likelihood, based on the joint density
Pr(x, Y = c_k) = φ(x; µ_k, Σ) π_k
The marginal density does play a role.

39. Remarks
- Efficiency: LDA is easier to compute than logistic regression. If the true f_k(x)'s are Gaussian, LDA is better; logistic regression may lose around 30% efficiency asymptotically in the error rate (Efron 1975).
- Robustness: LDA uses all the data points to estimate the covariance matrix (more information, but not robust against outliers); logistic regression down-weights points far from the decision boundary (more robust).

40. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

41. K-nearest Neighbor Method

42. Code Y = 1 if Red and Y = -1 if Green. A natural way to classify a new point x_0 is to look at its neighbors and take a vote:
f̂(x_0) = (1/K) Σ_{x_i ∈ N_K(x_0)} y_i
where N_K(x_0) contains the K closest points to x_0 in the training data (the K-nearest neighborhood).

43. If there is a clear dominance of one of the classes in the neighborhood of an observation x_0, then it is likely that the observation itself belongs to that class, too. Thus the classification rule is majority voting among the members of N_K(x_0):
Ĉ(x_0) = Red if f̂(x_0) > 0, Green if f̂(x_0) < 0
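
A minimal sketch of this voting rule, using the ±1 coding above and Euclidean distance; the function name knn_predict is illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=15):
    """y_train in {+1, -1}; classify x0 by the sign of the neighborhood average."""
    d = np.linalg.norm(X_train - x0, axis=1)  # distances to x0
    nbrs = np.argsort(d)[:K]                  # the K closest training points
    f_hat = y_train[nbrs].mean()              # (1/K) * sum of neighbor labels
    return 1 if f_hat > 0 else -1
```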

44. [Figure]

45. Oracle
Oracle: the data in each class are generated from a mixture of Gaussians. The density for each class is an equal mixture of 10 Gaussians. For the Green class, its 10 means were generated from a N((1, 0)^T, I) distribution (and considered fixed). For the Red class, the 10 means were generated from a N((0, 1)^T, I). The within-cluster variances were 1/5.

46. [Figure]

47. K-NN tries to implement conditional expectations directly, by approximating expectations with sample averages and relaxing the notion of conditioning at a point to conditioning in a region close to the target point. In theory, when n → ∞ and K → ∞ such that K/n → 0, the K-nearest-neighbor estimate is consistent:
f̂(x) → f(x) = E(Y | X = x)

48. Degrees of Freedom for K-NN
How many parameters does K-nearest neighbors use to describe the fit? One, the value of K? More realistically, K-nearest neighbors uses an effective number of parameters of n/K. K controls the model complexity: the smaller K, the more complex the model. In general n/K > p, so K-NN is more flexible than linear models.

49. [Figure]

50. How do we choose the optimal K? Can we minimize the training error? No: when K = 1, the training error is zero (overfitting). Choose K to minimize the misclassification error: generate an independent test set, and use the test error to estimate the misclassification error.

51. [Figure]

52. Model Selection
Suppose the data arise from a model Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ_ε². Let Γ = {(x_i, y_i), i = 1, ..., n} and ŷ_0 = (1/K) Σ_{l=1}^K y_(l), where the subscript (l) indicates the sequence of nearest neighbors to x_0. Then the expected prediction error at x_0 is
EPE(x_0) = E_{y_0 | x_0} E_Γ (y_0 - ŷ_0)² = σ_ε² + (f(x_0) - E_Γ(ŷ_0))² + Var_Γ(ŷ_0)

53. For simplicity, assume the x_i's in the sample are fixed (nonrandom). Then
E_Γ(y_(l)) = f(x_(l)),   Var_Γ(y_(l)) = σ_ε²
EPE(x_0) = σ_ε² + (f(x_0) - (1/K) Σ_{l=1}^K f(x_(l)))² + σ_ε²/K
The first term is an irreducible error. The second and third terms make up the mean squared error (MSE) at x_0.

54. Bias-Variance Tradeoff
The squared bias term tends to increase with K: for small K, the closest neighbors have values f(x_(l)) similar to f(x_0); for large K, points farther away are counted as neighbors. The variance term decreases as the inverse of K as K increases. Bias-variance tradeoff: as the model complexity increases, the variance tends to increase and the squared bias tends to decrease. We choose the model complexity to minimize the test error.
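
A small simulation of this decomposition at a fixed target point; the true function f, the noise level, and the one-dimensional design are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)            # assumed true regression function
sigma, n, x0 = 0.3, 100, 0.5
x = np.sort(rng.uniform(0, 2, n))      # fixed design points
order = np.argsort(np.abs(x - x0))     # neighbors of x0, nearest first

for K in (1, 5, 20, 50):
    bias2 = (f(x0) - f(x[order[:K]]).mean()) ** 2  # squared bias term
    var = sigma ** 2 / K                           # variance term
    print(f"K={K:3d}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"EPE={sigma ** 2 + bias2 + var:.4f}")
```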

55. [Figure]

56. Objectives: Model Assessment
1. Choose the value of a tuning parameter for a technique.
2. Estimate the prediction performance of a given model.
For both of these purposes, the best approach is to run the procedure on an independent test set, if one is available. If possible, one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2).

57. Cross-Validation
Often there is insufficient data to create a separate validation or test set; setting some data aside for validation is possible, but it affects the accuracy of the training estimates. In this instance, V-fold cross-validation is useful.

58. [Diagram: Train | Train | Test | Train | Train]
1. Divide the data into V disjoint subsets.
2. Use subsets 2, ..., V as training data and subset 1 as validation data. Compute the prediction error (PE) on subset 1.
3. Repeat for each subset.
4. Average the results.
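
A minimal V-fold cross-validation sketch for choosing K in K-NN, reusing the knn_predict function sketched earlier; the fold assignment by random permutation is one simple choice.

```python
import numpy as np

def cv_error(X, y, K, V=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V      # assign each point to one of V folds
    errs = []
    for v in range(V):
        tr, te = folds != v, folds == v
        pred = np.array([knn_predict(X[tr], y[tr], x0, K) for x0 in X[te]])
        errs.append(np.mean(pred != y[te]))  # misclassification rate on fold v
    return np.mean(errs)                     # average over the V folds
```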

59. Curse of Dimensionality
K-nearest neighbors can fail in high dimensions, because it becomes difficult to gather K observations close to a target point x_0:
- near neighborhoods tend to be spatially large, and estimates are biased;
- reducing the spatial size of the neighborhood means reducing K, and the variance of the estimate increases.

60. Illustrating Example
Suppose the points are uniformly distributed in a p-dimensional unit hypercube. To construct a hypercube neighborhood of x_0 that captures a fraction ρ of the observations, what is the edge length of this cube? Since the volume of the cube satisfies l^p = ρ, we have l = ρ^{1/p}.
- When p = 1: if ρ = 0.01, l = 0.01, and if ρ = 0.1, l = 0.1.
- When p = 10: if ρ = 0.01, l = 0.63, and if ρ = 0.1, l = 0.80.
When p = 10, in order to capture 10% of the data, we must cover 80% of the range of each input.
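
The edge-length calculation l = ρ^{1/p} as a quick check (0.1^{1/10} ≈ 0.794, which the slide rounds to 0.80):

```python
for p in (1, 10):
    for rho in (0.01, 0.1):
        print(f"p={p:2d}  rho={rho:.2f}  edge length l = {rho ** (1 / p):.3f}")
```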

61. [Figure]

62. Local methods are no longer local when the dimension p increases. The sampling density is proportional to n^{1/p}; if 100 points are sufficient to estimate a function in R^1, then 100^10 points are needed to achieve similar accuracy in R^10.

63. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

64. Constrained Optimization
Constrained optimization has the form
min Q(θ)  subject to  θ ∈ S ⊆ R^d
- Q(θ): objective function
- S: feasible set
Convex optimization: both the objective function and the feasible set are convex.

65. Lagrange Multiplier
Consider
min Q(θ)  subject to  R(θ) = 0
S = {θ : R(θ) = 0} is a (d - 1)-dimensional surface in R^d. For every θ such that R(θ) = 0, ∇R(θ) is orthogonal to the surface. If θ* is a local minimum, then ∇Q is orthogonal to the surface at θ*.

66. Conclusion: at a local minimum, there exists λ ∈ R such that
∇Q(θ*) = λ ∇R(θ*)
This leads us to introduce the Lagrangian
L(θ, λ) = Q(θ) - λ R(θ)
where λ is the Lagrange multiplier. We have argued that a local minimum corresponds to a stationary point of the Lagrangian. Furthermore, we can reverse our logic to deduce that a stationary point of the Lagrangian is a local optimum.

67. Now consider (the primal problem)
min Q(θ)  subject to  R(θ) ≥ 0
Suppose θ* is a local minimum. There are two cases:
- Inactive constraint: R(θ*) > 0. Then ∇Q(θ*) = 0, a stationary point of L(θ, λ) with λ = 0.
- Active constraint: R(θ*) = 0. Same as the equality constraint, except we require λ > 0.

68. In either case, we have λ R(θ*) = 0. Therefore, a local minimum satisfies the Karush-Kuhn-Tucker conditions:
∇L(θ*) = ∇Q(θ*) - λ ∇R(θ*) = 0
λ R(θ*) = 0
λ ≥ 0
Often the KKT conditions may be used to transform the primal problem into an equivalent dual problem, where the variables being optimized are the Lagrange multipliers.

69. Outline
- Maximum margin classifier
- Kernel trick
- SVM & function estimation

70. Maximum Margin Classifier
[Figure: separating hyperplane β_0 + x^T β = 0 with margin m on each side]
max_{β, β_0} m   subject to   y_i(β_0 + x_i^T β) ≥ m
Maximize the minimum distance. Need the constraint ||β|| = 1. Vapnik (1995).

71. Signed Distance to Hyperplanes
[Figure: hyperplane β_0 + x^T β = 0 with a point x and a point x_0 on the plane]
The hyperplane is defined by {x : β_0 + x^T β = 0}. For any point x_0 in the hyperplane, x_0^T β = -β_0. The signed distance of a point x to the plane is ⟨β/||β||, x - x_0⟩, where x_0 is any point in the plane.

72. Equivalently: Quadratic Programming
min_{β_0, β} (1/2) ||β||²   subject to   y_i(β_0 + x_i^T β) ≥ 1,  i = 1, ..., n
The Lagrange primal is
L_P = (1/2) ||β||² - Σ_{i=1}^n α_i [y_i(β_0 + x_i^T β) - 1]
where α_i ≥ 0.

73. Setting the derivatives to zero, we get
∂/∂β:  β = Σ_{i=1}^n α_i y_i x_i
∂/∂β_0:  0 = Σ_{i=1}^n α_i y_i

74. Substituting into the Lagrange primal, we obtain the Lagrange dual
L_D = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} x_i^T x_{i'}
We maximize L_D subject to α_i ≥ 0 and Σ_{i=1}^n α_i y_i = 0.

75. Minimize L_P with respect to the primal variables β_0, β. Maximize L_D with respect to the dual variables α_i. Maximizing the dual is often a simpler convex QP than the primal.

76. Support Vectors
The Karush-Kuhn-Tucker conditions include
α̂_i [y_i(β̂_0 + x_i^T β̂) - 1] = 0
These imply:
- If y_i f̂(x_i) > 1, then α̂_i = 0.
- If α̂_i > 0, then y_i f̂(x_i) = 1, or in other words, x_i is on the boundary of the slab.
The solution β̂ is defined in terms of a linear combination of the support points.

77. Overlapping Classes
[Figure: hyperplane β_0 + x^T β = 0 with margin m and slack variables ξ_1, ..., ξ_5, where ξ*_i = m ξ_i]
max_{β, β_0, ||β|| = 1} m   subject to   y_i(β_0 + x_i^T β) ≥ m(1 - ξ_i),   ξ_i ≥ 0,   Σ_i ξ_i ≤ B
ξ_i: slack variables; B: tuning parameter.

78. Equivalently: Quadratic Programming
min_{β_0, β, ξ} (1/2) ||β||² + C Σ_{i=1}^n ξ_i   subject to   y_i(β_0 + x_i^T β) ≥ 1 - ξ_i,  ξ_i ≥ 0
The Lagrange primal is
L_P = (1/2) ||β||² + C Σ_{i=1}^n ξ_i - Σ_{i=1}^n α_i [y_i(β_0 + x_i^T β) - (1 - ξ_i)] - Σ_{i=1}^n γ_i ξ_i
where α_i, γ_i ≥ 0.

79. Setting the derivatives to zero, we get
∂/∂β:  β = Σ_{i=1}^n α_i y_i x_i
∂/∂β_0:  0 = Σ_{i=1}^n α_i y_i
∂/∂ξ_i:  α_i = C - γ_i

80. Substituting into the Lagrange primal, we obtain the Lagrange dual
L_D = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} ⟨x_i, x_{i'}⟩
We maximize L_D subject to 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0.

81. Support Vectors
The Karush-Kuhn-Tucker conditions include
α̂_i [y_i(β̂_0 + x_i^T β̂) - (1 - ξ_i)] = 0,    γ_i ξ_i = 0
These imply:
- y_i f̂(x_i) > 1 ⟹ α̂_i = 0
- y_i f̂(x_i) < 1 ⟹ α̂_i = C
- y_i f̂(x_i) = 1 ⟹ 0 ≤ α̂_i ≤ C

82. Solution
The solution is expressed in terms of the fitted Lagrange multipliers α̂_i:
β̂ = Σ_{i=1}^n α̂_i y_i x_i
Some fraction of the α̂_i are exactly zero (from the KKT conditions); the x_i for which α̂_i > 0 are called the support points S.
f̂(x) = β̂_0 + x^T β̂ = β̂_0 + Σ_{i ∈ S} α̂_i y_i ⟨x, x_i⟩
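
A numerical check of the support-point expansion using scikit-learn's SVC (assuming scikit-learn is available; the simulated data are arbitrary): β̂ is recovered from the fitted dual coefficients α̂_i y_i over the support set.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
beta_hat = svm.dual_coef_ @ svm.support_vectors_  # sum_{i in S} alpha_i y_i x_i
print(np.allclose(beta_hat, svm.coef_))           # True: matches the fitted beta
```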

83. Example: Bayes Optimal Classifier
Mixture of Gaussians (as in the oracle example). Red class: 10 centers µ_k from N((0, 1)^T, I); then randomly pick one center, and generate a data point from N(µ_k, I/5). The Green class is similar, with centers from N((1, 0)^T, I). Bayes error: 0.21.

84. Linear SVMs
[Figures: two linear SVM fits for different values of C (one with C = 0.01); one panel reports training error 0.26, test error 0.30, Bayes error 0.21]
The resulting classifier is sign(β̂_0 + x^T β̂).

85. Outline
- Maximum margin classifier
- Kernel trick
- SVM & function estimation

86. Flexible Classifiers
Enlarge the input space via a basis expansion (p → q):
h(x) = (h_1(x), h_2(x), ..., h_q(x))
The Lagrange dual and the solution become
L_D = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} ⟨h(x_i), h(x_{i'})⟩
and
f̂(x) = β̂_0 + Σ_{i ∈ S} α̂_i y_i ⟨h(x), h(x_i)⟩

87. Example
2nd-degree polynomial in R². We choose:
h_1(x) = 1,  h_2(x) = √2 x_1,  h_3(x) = √2 x_2,  h_4(x) = x_1²,  h_5(x) = x_2²,  h_6(x) = √2 x_1 x_2

88. Kernels
L_D and the constraints involve h(x) only through inner products
K(x, x') = ⟨h(x), h(x')⟩
Given a suitable kernel function K(x, x'), we don't need h(x) at all:
f̂(x) = β̂_0 + Σ_{i ∈ S} α̂_i y_i K(x, x_i)

89. Example (cont'd)
If we choose
K(x, x') = (1 + ⟨x, x'⟩)²
then
K(x, x') = (1 + x_1 x_1' + x_2 x_2')²
         = 1 + 2 x_1 x_1' + 2 x_2 x_2' + (x_1 x_1')² + (x_2 x_2')² + 2 x_1 x_1' x_2 x_2'
         = ⟨h(x), h(x')⟩
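
A quick numeric check that the degree-2 polynomial kernel equals the inner product of the basis expansion h(x) from the example (the two test points are arbitrary):

```python
import numpy as np

def h(x):
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, x[1] ** 2, s * x[0] * x[1]])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
K = (1 + x @ xp) ** 2                # kernel evaluation; h(x) is never needed
print(np.isclose(K, h(x) @ h(xp)))   # True: K(x, x') = <h(x), h(x')>
```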

90. Popular Kernels
- dth-degree polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d
- Radial basis: K(x, x') = exp(-||x - x'||² / σ²)
K(x, x') is a symmetric, positive (semi-)definite function: for every n = 1, 2, ... and every set of real numbers {a_1, a_2, ..., a_n} and points x_1, x_2, ..., x_n, we have Σ_{i,i'=1}^n a_i a_{i'} K(x_i, x_{i'}) ≥ 0.

91. Nonlinear SVMs
[Figures: SVM with a degree-4 polynomial kernel and SVM with a radial kernel in feature space; Bayes error 0.210 (training and test errors shown in the original figures)]

92. Outline
- Maximum margin classifier
- Kernel trick
- SVM & function estimation

93. SVM via Loss + Penalty
[Figure: binomial log-likelihood and support vector (hinge) losses as functions of y f(x)]
With f(x) = β_0 + x^T β, consider
min_{β_0, β} Σ_{i=1}^n [1 - y_i f(x_i)]_+ + (λ/2) ||β||²
The solution is identical to the SVM solution, with λ = 1/C.

94. SVM and Function Estimation
The SVM with a general kernel K(·, ·) minimizes
Σ_{i=1}^n [1 - y_i f(x_i)]_+ + (λ/2) ||f||²_{H_K}
with f ∈ H_K, where H_K is the reproducing kernel Hilbert space (RKHS) of functions generated by the kernel K(·, ·).

95. RKHS
The function space H_K is generated by a positive (semi-)definite function K(x, x'). Eigen-expansion (Mercer's theorem):
K(x, x') = Σ_{j=1}^∞ γ_j φ_j(x) φ_j(x')
where γ_j ≥ 0 and Σ_{j=1}^∞ γ_j² < ∞.

96. Define H_K to be the set of functions of the form
f(x) = Σ_{j=1}^∞ θ_j φ_j(x)
and define the inner product
⟨Σ_{j=1}^∞ θ_j φ_j(x), Σ_{j'=1}^∞ δ_{j'} φ_{j'}(x)⟩_{H_K} := Σ_{j=1}^∞ θ_j δ_j / γ_j
Then the squared norm of f is
||f(x)||²_{H_K} = Σ_{j=1}^∞ θ_j² / γ_j
which is generally viewed as a roughness penalty.

97. The Representer Theorem
More generally, we can optimize
min_{f ∈ H_K} [Σ_{i=1}^n L(y_i, f(x_i)) + (λ/2) ||f||²_{H_K}]
The solution has the finite form (Wahba 1990)
f̂(x) = Σ_{i=1}^n α̂_i K(x, x_i)
a finite expansion in the representers K(x, x_i).

98. Loss Functions
SVM: L[y, f(x)] = (1 - y f(x))_+, called the hinge loss. It estimates the classifier (threshold)
sign(Pr(Y = 1 | x) - Pr(Y = -1 | x))

99. Binomial deviance: L[y, f(x)] = log(1 + e^{-y f(x)}), the (negative) binomial log-likelihood. It estimates the logit
log [Pr(Y = 1 | x) / Pr(Y = -1 | x)]
Why not the squared error loss?
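
A small tabulation of the two losses as functions of the margin y f(x), to make the comparison concrete:

```python
import numpy as np

yf = np.linspace(-2, 2, 9)             # margin values y * f(x)
hinge = np.maximum(0.0, 1.0 - yf)      # (1 - y f)_+
deviance = np.log1p(np.exp(-yf))       # log(1 + e^{-y f})
for m, hl, dl in zip(yf, hinge, deviance):
    print(f"y*f = {m:+.1f}   hinge = {hl:.3f}   deviance = {dl:.3f}")
```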

100. Kernel Logistic Regression
Replace (1 - y f)_+ with log(1 + e^{-y f}), the binomial deviance.
- Similar classification performance to the SVM.
- Provides estimates of the class probabilities.
- Natural generalization to the multi-class case.

101. KLR vs SVM
[Figures: kernel logistic regression and SVM, both with radial kernels in feature space; Bayes error 0.210 (training and test errors shown in the original figures)]

102. Remarks
- The SVM can be viewed as regularized fitting with a particular loss function: the hinge loss.
- The hinge loss allows for compression in terms of basis functions, from n down to some fraction of n.
- Regularized logistic regression gives a very similar fit, using the binomial deviance as the loss.

103. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

104. Example of a Classification Tree [figure]

105-109. Classify Test Data [sequence of figures]
