1. Classification
Ji Zhu
445C West Hall
jizhu@umich.edu

2. Examples of Classification
- Predicting tumor cells as benign or malignant.
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.

3. Classification: Definition
Given a collection of data points:
- Each data point contains a set of variables; one of the variables is the class (categorical, qualitative).
- Find a model for the class variable as a function of the values of the other variables.
- Goal: previously unseen data points should be assigned a class as accurately as possible.
Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

4. Illustrating Classification [figure]

5. Mathematical Setup
- Class label Y; input variables X = (X_1, X_2, ..., X_p).
- Y takes values in a finite, unordered set (survived/died, cancer class of a tissue sample, ...).
- Two-class: Y ∈ {c_1, c_2}. Multi-class: Y ∈ {c_1, c_2, ..., c_K}.
- We have training data, which are observations (examples, instances) of these measurements.

6. Objectives
On the basis of the training data we would like to:
- Produce a classifier Ĉ(x) that accurately predicts unseen test cases.
- Understand which inputs affect the output, and how.

7. Optimal Classifier
(X, Y) have a joint probability distribution. Choose Ĉ(x) to have small misclassification error
R(C) = Pr(C(X) ≠ Y)
The Bayes optimal classifier is
C*(x) = arg min_C R(C) = arg max_k Pr(Y = c_k | X = x)

8. Generative Methods
Estimate f(x | Y = c_k), then use the Bayes rule
Pr(Y = c_k | X = x) ∝ f(x | Y = c_k) Pr(Y = c_k)
- Linear discriminant analysis (LDA)
- Quadratic discriminant analysis (QDA)
- Naive Bayes

9. [Figure: data in two classes plotted against X_1 and X_2]

10. Discriminative Methods
Estimate Pr(Y = c_k | X = x) directly:
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

11. Linear Discriminant Analysis
Let π_k be the prior probability of class k, and let f_k(x) be the class-conditional density of X in class k. The posterior probability is
Pr(Y = c_k | X = x) = f_k(x) π_k / Σ_{k'=1}^K f_{k'}(x) π_{k'}

12. We model each class density as multivariate Gaussian N(µ_k, Σ_k):
f_k(x) = (2π)^{-p/2} |Σ_k|^{-1/2} exp(-(1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k))
Assume Σ_k = Σ for all k. For each k, the discriminant function is
δ_k(x) = x^T Σ^{-1} µ_k - (1/2) µ_k^T Σ^{-1} µ_k + log π_k
and the decision rule is
Ĉ(x) = arg max_k δ_k(x)

13. Remarks
Classify x to the class with the closest centroid, using the squared Mahalanobis distance. Special case: Σ = I (then the Euclidean distance is used):
δ_k(x) = -(1/2) ||x - µ_k||² + log π_k

14. Comparing class k and class k', the log-ratio is
log [Pr(Y = c_k | X = x) / Pr(Y = c_{k'} | X = x)]
  = x^T Σ^{-1} (µ_k - µ_{k'}) + log(π_k / π_{k'}) - (1/2)(µ_k + µ_{k'})^T Σ^{-1} (µ_k - µ_{k'})
This is a linear decision boundary, with directional vector Σ^{-1}(µ_k - µ_{k'}); it is generally not in the direction of (µ_k - µ_{k'}).

15, 16. [Figures]

17. Parameter Estimation of LDA
In practice, we estimate the parameters from the training data:
- π̂_k = n_k / n, where n_k is the number of observations in class k.
- µ̂_k = Σ_{y_i = c_k} x_i / n_k
- The pooled covariance Σ̂ = Σ_{k=1}^K Σ_{y_i = c_k} (x_i - µ̂_k)(x_i - µ̂_k)^T / (n - K)
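
A minimal NumPy sketch of these estimates and the resulting decision rule; the function names (lda_fit, lda_predict) and the variables X, y are illustrative, not from the slides.

```python
import numpy as np

def lda_fit(X, y):
    """Estimate the priors, class means, and pooled covariance."""
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == c) for c in classes])        # pi_k = n_k / n
    means = np.array([X[y == c].mean(axis=0) for c in classes])  # mu_k
    Sigma = np.zeros((p, p))
    for c, mu in zip(classes, means):
        R = X[y == c] - mu
        Sigma += R.T @ R
    Sigma /= n - K                                               # pooled covariance
    return classes, priors, means, Sigma

def lda_predict(X, classes, priors, means, Sigma):
    """delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k + log pi_k."""
    Sinv = np.linalg.inv(Sigma)
    delta = X @ Sinv @ means.T - 0.5 * np.sum(means @ Sinv * means, axis=1) + np.log(priors)
    return classes[np.argmax(delta, axis=1)]
```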

18. Quadratic Discriminant Analysis
The Σ_k's are allowed to be different. The discriminant function is
δ_k(x) = -(1/2) log|Σ_k| - (1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k) + log π_k = x^T W_k x + x^T w_k + b_k
The decision boundary between class k and class k' is a quadratic function
{x : x^T (W_k - W_{k'}) x + x^T (w_k - w_{k'}) + (b_k - b_{k'}) = 0}
There are more parameters in QDA than in LDA, especially when p is large.

19. Both LDA and QDA perform well on many real classification problems.

20. Naive Bayes Classifier
Assume independence among the input variables when the class is given:
f(x_1, ..., x_p | Y = c_k) = f(x_1 | c_k) f(x_2 | c_k) ... f(x_p | c_k)
Estimate f(x_j | c_k) for all j and c_k. A new point is classified to c_k if ∏_{j=1}^p f(x_j | c_k) · π_k is maximal. This is a strong assumption, but naive Bayes often classifies well when p is large.
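
A minimal sketch of a Gaussian naive Bayes classifier, assuming each f(x_j | c_k) is modeled as a univariate normal (one common choice; the slide does not prescribe it). Pure NumPy; the names nb_fit and nb_predict are illustrative.

```python
import numpy as np

def nb_fit(X, y):
    """Per-class priors and per-feature means/sds under the independence assumption."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    mu = np.array([X[y == c].mean(axis=0) for c in classes])
    sd = np.array([X[y == c].std(axis=0) for c in classes])
    return classes, priors, mu, sd

def nb_predict(X, classes, priors, mu, sd):
    """Classify to the class maximizing sum_j log f(x_j | c_k) + log pi_k."""
    logpost = []
    for k in range(len(classes)):
        ll = -0.5 * np.log(2 * np.pi * sd[k] ** 2) - (X - mu[k]) ** 2 / (2 * sd[k] ** 2)
        logpost.append(ll.sum(axis=1) + np.log(priors[k]))
    return classes[np.argmax(np.stack(logpost, axis=1), axis=1)]
```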

21. Generative methods vs. discriminative methods [figure]

22. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

23. Logistic Regression
Two-class case: Y ∈ {c_1, c_2} (K = 2). Use the logit transformation
log [Pr(Y = c_1 | X = x) / Pr(Y = c_2 | X = x)] = β_0 + β^T x
The probabilities are
Pr(Y = c_1 | X = x) = e^{β_0 + β^T x} / (1 + e^{β_0 + β^T x})
Pr(Y = c_2 | X = x) = 1 / (1 + e^{β_0 + β^T x})
This ensures the probabilities lie in [0, 1].

24. Fitting the Logistic Regression Model
Maximum likelihood estimation. Denote θ = (β_0, β), and let x denote (1, x). The conditional log-likelihood of Y given X is
l(θ) = Σ_{i=1}^n log Pr(Y = y_i | X = x_i; θ)

25. Code c_1 as 1 and c_2 as 0; then
l(θ) = Σ_{i=1}^n [y_i log Pr(c_1 | x_i; θ) + (1 - y_i) log Pr(c_2 | x_i; θ)]
     = Σ_{i=1}^n [y_i θ^T x_i - log(1 + e^{θ^T x_i})]

26. Partial Derivative (Score Equations)
∂l(θ)/∂θ = Σ_{i=1}^n x_i (y_i - p(x_i; θ)) = 0
where p(x_i; θ) = e^{θ^T x_i} / (1 + e^{θ^T x_i}). There are (p + 1) equations, nonlinear in θ.

27. Newton-Raphson Algorithm
The second-derivative (Hessian) matrix is
∂²l(θ) / ∂θ ∂θ^T = -Σ_{i=1}^n x_i x_i^T p(x_i; θ) [1 - p(x_i; θ)]
1. Choose an initial value θ^0.
2. Update θ by
θ^new = θ^old - [∂²l(θ)/∂θ∂θ^T]^{-1} ∂l(θ)/∂θ

28. Iteratively Reweighted Least Squares
Using vector and matrix notation, let W = diag[p(x_i; θ^old)(1 - p(x_i; θ^old))]. Then
∂l(θ)/∂θ = X^T (y - p)
∂²l(θ)/∂θ∂θ^T = -X^T W X

29. The Newton-Raphson step is
θ^new = θ^old + (X^T W X)^{-1} X^T (y - p) = (X^T W X)^{-1} X^T W z
where z is the adjusted response
z = X θ^old + W^{-1} (y - p),  i.e.,  z_i = x_i^T θ^old + (y_i - p_i) / [p_i (1 - p_i)]

30. In each iteration, we solve the weighted least squares problem
θ^new = arg min_θ (z - Xθ)^T W (z - Xθ)
θ = 0 can be used as a starting point.
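
A minimal IRLS sketch for the two-class model, assuming a 0/1 response y and a design matrix X whose first column is ones; it omits safeguards against separation (where the weights approach zero).

```python
import numpy as np

def irls_logistic(X, y, n_iter=25):
    theta = np.zeros(X.shape[1])             # theta = 0 as the starting point
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        W = p * (1.0 - p)                    # diagonal of W, stored as a vector
        z = X @ theta + (y - p) / W          # adjusted response
        # weighted least squares: theta = (X^T W X)^{-1} X^T W z
        theta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return theta
```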

31. Inference
If the model is correct, θ̂ is consistent. Using the central limit theorem, the distribution of θ̂ converges to N(θ, (X^T W X)^{-1}).

32. Multi-class Case
Use class K as a reference:
log [Pr(Y = c_1 | X = x) / Pr(Y = c_K | X = x)] = β_{10} + β_1^T x
log [Pr(Y = c_2 | X = x) / Pr(Y = c_K | X = x)] = β_{20} + β_2^T x
...
log [Pr(Y = c_{K-1} | X = x) / Pr(Y = c_K | X = x)] = β_{(K-1)0} + β_{K-1}^T x
This is multinomial logistic regression.

33. Logistic Regression vs LDA
For LDA, the log-posterior odds between class k and class K are linear:
log [Pr(Y = c_k | X = x) / Pr(Y = c_K | X = x)]
  = x^T Σ^{-1} (µ_k - µ_K) + log(π_k / π_K) - (1/2)(µ_k + µ_K)^T Σ^{-1} (µ_k - µ_K)
  = α_{k0} + α_k^T x
The logistic model has linear logits by construction:
log [Pr(Y = c_k | X = x) / Pr(Y = c_K | X = x)] = β_{k0} + β_k^T x
The same form. Are they the same estimator?

34. Where the Linearity Comes From
For LDA, the linearity is a consequence of the Gaussian assumption for the class densities and the assumption of a common covariance matrix. For logistic regression, the linearity comes by construction. The difference lies in the way the linear coefficients are estimated.

35. Common Component
The joint density of (X, Y) is
Pr(X, Y = c_k) = Pr(X) Pr(Y = c_k | X)
where Pr(X) is the marginal density of the input X. For both LDA and logistic regression, the second term Pr(Y = c_k | X) has the same linear-logit form
Pr(Y = c_k | X = x) = exp(θ_{k0} + θ_k^T x) / (1 + Σ_{k'=1}^{K-1} exp(θ_{k'0} + θ_{k'}^T x))

36. Which Model is More General?
However, they make different assumptions about Pr(X). The logistic model leaves the marginal density of X arbitrary and unspecified. The LDA model assumes a Gaussian mixture density
Pr(x) = Σ_{k=1}^K π_k φ(x; µ_k, Σ)
The logistic model makes fewer assumptions about the data, and is more general.

37. Parameter Estimation
Logistic regression: maximize the conditional likelihood, the multinomial likelihood with probabilities Pr(Y = c_k | X). The marginal density Pr(X) is ignored (it is treated fully nonparametrically, using the empirical distribution function, which places mass 1/n at each observation).

38. LDA: maximize the full likelihood, based on the joint density
Pr(x, Y = c_k) = φ(x; µ_k, Σ) π_k
The marginal density does play a role.

39. Remarks
- Efficiency: LDA is easier to compute than logistic regression. If the true f_k(x)'s are Gaussian, LDA is better; logistic regression may lose around 30% efficiency asymptotically in the error rate (Efron 1975).
- Robustness: LDA uses all the data points to estimate the covariance matrix (more information, but not robust against outliers); logistic regression down-weights points far from the decision boundary (more robust).

40. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

41. K-nearest Neighbor Method

42. Code Y = 1 if Red and Y = -1 if Green. A natural way to classify a new point x_0 is to look at its neighbors and take a vote:
f̂(x_0) = (1/K) Σ_{x_i ∈ N_K(x_0)} y_i
where N_K(x_0) contains the K closest points to x_0 in the training data (the K-nearest neighborhood).

43. If there is a clear dominance of one of the classes in the neighborhood of an observation x_0, then it is likely that the observation itself belongs to that class, too. Thus the classification rule is majority voting among the members of N_K(x_0):
Ĉ(x_0) = Red if f̂(x_0) > 0, Green if f̂(x_0) < 0
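
A minimal sketch of this voting rule, using the ±1 coding above and Euclidean distance; the function name knn_predict is illustrative.

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K=15):
    """y_train in {+1, -1}; classify x0 by the sign of the neighborhood average."""
    d = np.linalg.norm(X_train - x0, axis=1)  # distances to x0
    nbrs = np.argsort(d)[:K]                  # the K closest training points
    f_hat = y_train[nbrs].mean()              # (1/K) * sum of neighbor labels
    return 1 if f_hat > 0 else -1
```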

44. [Figure]

45. Oracle
Oracle: the data in each class are generated from a mixture of Gaussians. The density for each class is an equal mixture of 10 Gaussians. For the Green class, its 10 means were generated from a N((1, 0)^T, I) distribution (and considered fixed). For the Red class, the 10 means were generated from a N((0, 1)^T, I). The within-cluster variances were 1/5.

46. [Figure]

47. K-NN tries to implement conditional expectations directly, by approximating expectations with sample averages and relaxing the notion of conditioning at a point to conditioning in a region close to the target point. In theory, when n → ∞ and K → ∞ such that K/n → 0, the K-nearest-neighbor estimate is consistent:
f̂(x) → f(x) = E(Y | X = x)

48. Degrees of Freedom for K-NN
How many parameters does K-nearest neighbors use to describe the fit? One, the value of K? More realistically, K-nearest neighbors uses an effective number of parameters of n/K. K controls the model complexity: the smaller K, the more complex the model. In general n/K > p, so K-NN is more flexible than linear models.

49. [Figure]

50. How do we choose the optimal K? Can we minimize the training error? No: when K = 1, the training error is zero (overfitting). Choose K to minimize the misclassification error: generate an independent test set, and use the test error to estimate the misclassification error.

51. [Figure]

52. Model Selection
Suppose the data arise from a model Y = f(X) + ε, with E(ε) = 0 and Var(ε) = σ_ε². Let Γ = {(x_i, y_i), i = 1, ..., n} and ŷ_0 = (1/K) Σ_{l=1}^K y_(l), where the subscript (l) indicates the sequence of nearest neighbors to x_0. Then the expected prediction error at x_0 is
EPE(x_0) = E_{y_0 | x_0} E_Γ (y_0 - ŷ_0)² = σ_ε² + (f(x_0) - E_Γ(ŷ_0))² + Var_Γ(ŷ_0)

53. For simplicity, assume the x_i's in the sample are fixed (nonrandom). Then
E_Γ(y_(l)) = f(x_(l)),   Var_Γ(y_(l)) = σ_ε²
EPE(x_0) = σ_ε² + (f(x_0) - (1/K) Σ_{l=1}^K f(x_(l)))² + σ_ε²/K
The first term is an irreducible error. The second and third terms make up the mean squared error (MSE) at x_0.

54. Bias-Variance Tradeoff
The squared bias term tends to increase with K: for small K, the closest neighbors have values f(x_(l)) similar to f(x_0); for large K, points farther away are counted as neighbors. The variance term decreases as the inverse of K as K increases. Bias-variance tradeoff: as the model complexity increases, the variance tends to increase and the squared bias tends to decrease. We choose the model complexity to minimize the test error.
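
A small simulation of this decomposition at a fixed target point; the true function f, the noise level, and the one-dimensional design are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * x)            # assumed true regression function
sigma, n, x0 = 0.3, 100, 0.5
x = np.sort(rng.uniform(0, 2, n))      # fixed design points
order = np.argsort(np.abs(x - x0))     # neighbors of x0, nearest first

for K in (1, 5, 20, 50):
    bias2 = (f(x0) - f(x[order[:K]]).mean()) ** 2  # squared bias term
    var = sigma ** 2 / K                           # variance term
    print(f"K={K:3d}  bias^2={bias2:.4f}  var={var:.4f}  "
          f"EPE={sigma ** 2 + bias2 + var:.4f}")
```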

55. [Figure]

56. Objectives: Model Assessment
1. Choose the value of a tuning parameter for a technique.
2. Estimate the prediction performance of a given model.
For both of these purposes, the best approach is to run the procedure on an independent test set, if one is available. If possible, one should use different test data for (1) and (2) above: a validation set for (1) and a test set for (2).

57. Cross-Validation
Often there is insufficient data to create a separate validation or test set; setting some data aside for validation is possible, but it affects the accuracy of the training estimates. In this instance, V-fold cross-validation is useful.

58. [Diagram: Train | Train | Test | Train | Train]
1. Divide the data into V disjoint subsets.
2. Use subsets 2, ..., V as training data and subset 1 as validation data. Compute the prediction error (PE) on subset 1.
3. Repeat for each subset.
4. Average the results.
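
A minimal V-fold cross-validation sketch for choosing K in K-NN, reusing the knn_predict function sketched earlier; the fold assignment by random permutation is one simple choice.

```python
import numpy as np

def cv_error(X, y, K, V=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % V      # assign each point to one of V folds
    errs = []
    for v in range(V):
        tr, te = folds != v, folds == v
        pred = np.array([knn_predict(X[tr], y[tr], x0, K) for x0 in X[te]])
        errs.append(np.mean(pred != y[te]))  # misclassification rate on fold v
    return np.mean(errs)                     # average over the V folds
```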

59. Curse of Dimensionality
K-nearest neighbors can fail in high dimensions, because it becomes difficult to gather K observations close to a target point x_0:
- near neighborhoods tend to be spatially large, and estimates are biased;
- reducing the spatial size of the neighborhood means reducing K, and the variance of the estimate increases.

60. Illustrating Example
Suppose the points are uniformly distributed in a p-dimensional unit hypercube. To construct a hypercube neighborhood of x_0 that captures a fraction ρ of the observations, what is the edge length of this cube? Since the volume of the cube satisfies l^p = ρ, we have l = ρ^{1/p}.
- When p = 1: if ρ = 0.01, l = 0.01, and if ρ = 0.1, l = 0.1.
- When p = 10: if ρ = 0.01, l = 0.63, and if ρ = 0.1, l = 0.80.
When p = 10, in order to capture 10% of the data, we must cover 80% of the range of each input.
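
The edge-length calculation l = ρ^{1/p} as a quick check (0.1^{1/10} ≈ 0.794, which the slide rounds to 0.80):

```python
for p in (1, 10):
    for rho in (0.01, 0.1):
        print(f"p={p:2d}  rho={rho:.2f}  edge length l = {rho ** (1 / p):.3f}")
```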

61. [Figure]

62. Local methods are no longer local when the dimension p increases. The sampling density is proportional to n^{1/p}; if 100 points are sufficient to estimate a function in R^1, then 100^10 points are needed to achieve similar accuracy in R^10.

63. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

64. Constrained Optimization
Constrained optimization has the form
min Q(θ)  subject to  θ ∈ S ⊆ R^d
- Q(θ): objective function
- S: feasible set
Convex optimization: both the objective function and the feasible set are convex.

65. Lagrange Multiplier
Consider
min Q(θ)  subject to  R(θ) = 0
S = {θ : R(θ) = 0} is a (d - 1)-dimensional surface in R^d. For every θ such that R(θ) = 0, ∇R(θ) is orthogonal to the surface. If θ* is a local minimum, then ∇Q is orthogonal to the surface at θ*.

66. Conclusion: at a local minimum, there exists λ ∈ R such that
∇Q(θ*) = λ ∇R(θ*)
This leads us to introduce the Lagrangian
L(θ, λ) = Q(θ) - λ R(θ)
where λ is the Lagrange multiplier. We have argued that a local minimum corresponds to a stationary point of the Lagrangian. Furthermore, we can reverse our logic to deduce that a stationary point of the Lagrangian is a local optimum.

67. Now consider (the primal problem)
min Q(θ)  subject to  R(θ) ≥ 0
Suppose θ* is a local minimum. There are two cases:
- Inactive constraint: R(θ*) > 0. Then ∇Q(θ*) = 0, a stationary point of L(θ, λ) with λ = 0.
- Active constraint: R(θ*) = 0. Same as the equality constraint, except we require λ > 0.

68. In either case, we have λ R(θ*) = 0. Therefore, a local minimum satisfies the Karush-Kuhn-Tucker conditions:
∇L(θ*) = ∇Q(θ*) - λ ∇R(θ*) = 0
λ R(θ*) = 0
λ ≥ 0
Often the KKT conditions may be used to transform the primal problem into an equivalent dual problem, where the variables being optimized are the Lagrange multipliers.

69. Outline
- Maximum margin classifier
- Kernel trick
- SVM & function estimation

70. Maximum Margin Classifier
[Figure: separating hyperplane β_0 + x^T β = 0 with margin m on each side]
max_{β, β_0} m   subject to   y_i(β_0 + x_i^T β) ≥ m
Maximize the minimum distance. Need the constraint ||β|| = 1. Vapnik (1995).

71. Signed Distance to Hyperplanes
[Figure: hyperplane β_0 + x^T β = 0 with a point x and a point x_0 on the plane]
The hyperplane is defined by {x : β_0 + x^T β = 0}. For any point x_0 in the hyperplane, x_0^T β = -β_0. The signed distance of a point x to the plane is ⟨β/||β||, x - x_0⟩, where x_0 is any point in the plane.

72. Equivalently: Quadratic Programming
min_{β_0, β} (1/2) ||β||²   subject to   y_i(β_0 + x_i^T β) ≥ 1,  i = 1, ..., n
The Lagrange primal is
L_P = (1/2) ||β||² - Σ_{i=1}^n α_i [y_i(β_0 + x_i^T β) - 1]
where α_i ≥ 0.

73. Setting the derivatives to zero, we get
∂/∂β:  β = Σ_{i=1}^n α_i y_i x_i
∂/∂β_0:  0 = Σ_{i=1}^n α_i y_i

74. Substituting into the Lagrange primal, we obtain the Lagrange dual
L_D = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} x_i^T x_{i'}
We maximize L_D subject to α_i ≥ 0 and Σ_{i=1}^n α_i y_i = 0.

75. Minimize L_P with respect to the primal variables β_0, β. Maximize L_D with respect to the dual variables α_i. Maximizing the dual is often a simpler convex QP than the primal.

76. Support Vectors
The Karush-Kuhn-Tucker conditions include
α̂_i [y_i(β̂_0 + x_i^T β̂) - 1] = 0
These imply:
- If y_i f̂(x_i) > 1, then α̂_i = 0.
- If α̂_i > 0, then y_i f̂(x_i) = 1, or in other words, x_i is on the boundary of the slab.
The solution β̂ is defined in terms of a linear combination of the support points.

77. Overlapping Classes
[Figure: hyperplane β_0 + x^T β = 0 with margin m and slack variables ξ_1, ..., ξ_5, where ξ*_i = m ξ_i]
max_{β, β_0, ||β|| = 1} m   subject to   y_i(β_0 + x_i^T β) ≥ m(1 - ξ_i),   ξ_i ≥ 0,   Σ_i ξ_i ≤ B
ξ_i: slack variables; B: tuning parameter.

78. Equivalently: Quadratic Programming
min_{β_0, β, ξ} (1/2) ||β||² + C Σ_{i=1}^n ξ_i   subject to   y_i(β_0 + x_i^T β) ≥ 1 - ξ_i,  ξ_i ≥ 0
The Lagrange primal is
L_P = (1/2) ||β||² + C Σ_{i=1}^n ξ_i - Σ_{i=1}^n α_i [y_i(β_0 + x_i^T β) - (1 - ξ_i)] - Σ_{i=1}^n γ_i ξ_i
where α_i, γ_i ≥ 0.

79. Setting the derivatives to zero, we get
∂/∂β:  β = Σ_{i=1}^n α_i y_i x_i
∂/∂β_0:  0 = Σ_{i=1}^n α_i y_i
∂/∂ξ_i:  α_i = C - γ_i

80. Substituting into the Lagrange primal, we obtain the Lagrange dual
L_D = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} ⟨x_i, x_{i'}⟩
We maximize L_D subject to 0 ≤ α_i ≤ C and Σ_{i=1}^n α_i y_i = 0.

81. Support Vectors
The Karush-Kuhn-Tucker conditions include
α̂_i [y_i(β̂_0 + x_i^T β̂) - (1 - ξ_i)] = 0,    γ_i ξ_i = 0
These imply:
- y_i f̂(x_i) > 1 ⟹ α̂_i = 0
- y_i f̂(x_i) < 1 ⟹ α̂_i = C
- y_i f̂(x_i) = 1 ⟹ 0 ≤ α̂_i ≤ C

82. Solution
The solution is expressed in terms of the fitted Lagrange multipliers α̂_i:
β̂ = Σ_{i=1}^n α̂_i y_i x_i
Some fraction of the α̂_i are exactly zero (from the KKT conditions); the x_i for which α̂_i > 0 are called the support points S.
f̂(x) = β̂_0 + x^T β̂ = β̂_0 + Σ_{i ∈ S} α̂_i y_i ⟨x, x_i⟩
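
A numerical check of the support-point expansion using scikit-learn's SVC (assuming scikit-learn is available; the simulated data are arbitrary): β̂ is recovered from the fitted dual coefficients α̂_i y_i over the support set.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1.0).fit(X, y)
beta_hat = svm.dual_coef_ @ svm.support_vectors_  # sum_{i in S} alpha_i y_i x_i
print(np.allclose(beta_hat, svm.coef_))           # True: matches the fitted beta
```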

83. Example: Bayes Optimal Classifier
Mixture of Gaussians (as in the oracle example). Red class: 10 centers µ_k from N((0, 1)^T, I); then randomly pick one center, and generate a data point from N(µ_k, I/5). The Green class is similar, with centers from N((1, 0)^T, I). Bayes error: 0.21.

84. Linear SVMs
[Figures: two linear SVM fits for different values of C (one with C = 0.01); one panel reports training error 0.26, test error 0.30, Bayes error 0.21]
The resulting classifier is sign(β̂_0 + x^T β̂).

85. Outline
- Maximum margin classifier
- Kernel trick
- SVM & function estimation

86. Flexible Classifiers
Enlarge the input space via a basis expansion (p → q):
h(x) = (h_1(x), h_2(x), ..., h_q(x))
The Lagrange dual and the solution become
L_D = Σ_{i=1}^n α_i - (1/2) Σ_{i=1}^n Σ_{i'=1}^n α_i α_{i'} y_i y_{i'} ⟨h(x_i), h(x_{i'})⟩
and
f̂(x) = β̂_0 + Σ_{i ∈ S} α̂_i y_i ⟨h(x), h(x_i)⟩

87. Example
2nd-degree polynomial in R². We choose:
h_1(x) = 1,  h_2(x) = √2 x_1,  h_3(x) = √2 x_2,  h_4(x) = x_1²,  h_5(x) = x_2²,  h_6(x) = √2 x_1 x_2

88. Kernels
L_D and the constraints involve h(x) only through inner products
K(x, x') = ⟨h(x), h(x')⟩
Given a suitable kernel function K(x, x'), we don't need h(x) at all:
f̂(x) = β̂_0 + Σ_{i ∈ S} α̂_i y_i K(x, x_i)

89. Example (cont'd)
If we choose
K(x, x') = (1 + ⟨x, x'⟩)²
then
K(x, x') = (1 + x_1 x_1' + x_2 x_2')²
         = 1 + 2 x_1 x_1' + 2 x_2 x_2' + (x_1 x_1')² + (x_2 x_2')² + 2 x_1 x_1' x_2 x_2'
         = ⟨h(x), h(x')⟩
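
A quick numeric check that the degree-2 polynomial kernel equals the inner product of the basis expansion h(x) from the example (the two test points are arbitrary):

```python
import numpy as np

def h(x):
    s = np.sqrt(2.0)
    return np.array([1.0, s * x[0], s * x[1], x[0] ** 2, x[1] ** 2, s * x[0] * x[1]])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
K = (1 + x @ xp) ** 2                # kernel evaluation; h(x) is never needed
print(np.isclose(K, h(x) @ h(xp)))   # True: K(x, x') = <h(x), h(x')>
```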

90. Popular Kernels
- dth-degree polynomial: K(x, x') = (1 + ⟨x, x'⟩)^d
- Radial basis: K(x, x') = exp(-||x - x'||² / σ²)
K(x, x') is a symmetric, positive (semi-)definite function: for every n = 1, 2, ... and every set of real numbers {a_1, a_2, ..., a_n} and points x_1, x_2, ..., x_n, we have Σ_{i,i'=1}^n a_i a_{i'} K(x_i, x_{i'}) ≥ 0.

91. Nonlinear SVMs
[Figures: SVM with a degree-4 polynomial kernel and SVM with a radial kernel in feature space; Bayes error 0.210 (training and test errors shown in the original figures)]

92. Outline
- Maximum margin classifier
- Kernel trick
- SVM & function estimation

93. SVM via Loss + Penalty
[Figure: binomial log-likelihood and support vector (hinge) losses as functions of y f(x)]
With f(x) = β_0 + x^T β, consider
min_{β_0, β} Σ_{i=1}^n [1 - y_i f(x_i)]_+ + (λ/2) ||β||²
The solution is identical to the SVM solution, with λ = 1/C.

94. SVM and Function Estimation
The SVM with a general kernel K(·, ·) minimizes
Σ_{i=1}^n [1 - y_i f(x_i)]_+ + (λ/2) ||f||²_{H_K}
with f ∈ H_K, where H_K is the reproducing kernel Hilbert space (RKHS) of functions generated by the kernel K(·, ·).

95. RKHS
The function space H_K is generated by a positive (semi-)definite function K(x, x'). Eigen-expansion (Mercer's theorem):
K(x, x') = Σ_{j=1}^∞ γ_j φ_j(x) φ_j(x')
where γ_j ≥ 0 and Σ_{j=1}^∞ γ_j² < ∞.

96. Define H_K to be the set of functions of the form
f(x) = Σ_{j=1}^∞ θ_j φ_j(x)
and define the inner product
⟨Σ_{j=1}^∞ θ_j φ_j(x), Σ_{j'=1}^∞ δ_{j'} φ_{j'}(x)⟩_{H_K} := Σ_{j=1}^∞ θ_j δ_j / γ_j
Then the squared norm of f is
||f(x)||²_{H_K} = Σ_{j=1}^∞ θ_j² / γ_j
which is generally viewed as a roughness penalty.

97. The Representer Theorem
More generally, we can optimize
min_{f ∈ H_K} [Σ_{i=1}^n L(y_i, f(x_i)) + (λ/2) ||f||²_{H_K}]
The solution has the finite form (Wahba 1990)
f̂(x) = Σ_{i=1}^n α̂_i K(x, x_i)
a finite expansion in the representers K(x, x_i).

98. Loss Functions
SVM: L[y, f(x)] = (1 - y f(x))_+, called the hinge loss. It estimates the classifier (threshold)
sign(Pr(Y = 1 | x) - Pr(Y = -1 | x))

99. Binomial deviance: L[y, f(x)] = log(1 + e^{-y f(x)}), the (negative) binomial log-likelihood. It estimates the logit
log [Pr(Y = 1 | x) / Pr(Y = -1 | x)]
Why not the squared error loss?
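
A small tabulation of the two losses as functions of the margin y f(x), to make the comparison concrete:

```python
import numpy as np

yf = np.linspace(-2, 2, 9)             # margin values y * f(x)
hinge = np.maximum(0.0, 1.0 - yf)      # (1 - y f)_+
deviance = np.log1p(np.exp(-yf))       # log(1 + e^{-y f})
for m, hl, dl in zip(yf, hinge, deviance):
    print(f"y*f = {m:+.1f}   hinge = {hl:.3f}   deviance = {dl:.3f}")
```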

100. Kernel Logistic Regression
Replace (1 - y f)_+ with log(1 + e^{-y f}), the binomial deviance.
- Similar classification performance to the SVM.
- Provides estimates of the class probabilities.
- Natural generalization to the multi-class case.

101. KLR vs SVM
[Figures: kernel logistic regression and SVM, both with radial kernels in feature space; Bayes error 0.210 (training and test errors shown in the original figures)]

102. Remarks
- The SVM can be viewed as regularized fitting with a particular loss function: the hinge loss.
- The hinge loss allows for compression in terms of basis functions, from n down to some fraction of n.
- Regularized logistic regression gives a very similar fit, using the binomial deviance as the loss.

103. Discriminative Methods
- Logistic regression
- K-nearest neighbors (KNN)
- Support vector machines (SVM)
- Classification trees (CART)
- Ensemble methods: boosting, random forests

104. Example of a Classification Tree [figure]

105-109. Classify Test Data [sequence of figures]
