
Machine Learning in Computer Vision
MVA 2011 - Lecture Notes
Iasonas Kokkinos
Ecole Centrale Paris
iasonas.kokkinos@ecp.fr
October 8, 2011

Chapter 1
Introduction

This is an informal set of notes for the first part of the class. They are by no means complete, and do not substitute attending class. My main intent is to give you a skimmed version of the class slides, which you will hopefully find easier to consult.

These notes were compiled based on several standard textbooks, including:

T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer
V. Vapnik, The Nature of Statistical Learning Theory, Springer
C. Bishop, Pattern Recognition and Machine Learning, Springer

Apart from these, helpful resources can be found online; in particular Andrew Ng has a highly-recommended set of handouts, while the more theoretically inclined can consult Robert Schapire's course and Peter Bartlett's course ( bartlett/courses/281b-sp08/ ) for more extensive introductions to learning theory, boosting and SVMs.

As this is the first edition of these lecture notes, there will inevitably be inconsistencies in notation with the slides, typos, etc.; your feedback will be most valuable in improving them for the next year.

Chapter 2
Linear Regression

Assumed form of the discriminant function:

f(x, w) = \sum_{j=1}^{M} w_j x_j + w_0 = \sum_{j=0}^{M} w_j x_j, \quad x_0 = 1    (2.1)

In the more general case:

f(x, w) = \sum_{j=0}^{M} w_j B_j(x)    (2.2)

where B_j(x) can be a nonlinear function of x.

We want to minimize the empirical loss (training loss) of our classifier, quantified in terms of the following criterion:

L(w) = \sum_i l(y^i, f(x^i, w)) = \sum_i (y^i - f(x^i; w))^2 = \sum_i \left( y^i - \sum_j w_j x^i_j \right)^2

We introduce vector/matrix notation:

y = \begin{pmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{pmatrix}, \qquad X = \begin{pmatrix} x^1_1 & x^1_2 & \ldots & x^1_M \\ \vdots & & & \vdots \\ x^N_1 & x^N_2 & \ldots & x^N_M \end{pmatrix}, \qquad y : N \times 1, \quad X : N \times M,

and the cost becomes:

L(w) = (y - Xw)^T (y - Xw) = e^T e,

where e = y - Xw is called the residual vector. This is why the criterion is also known as the residual-sum-of-squares cost. Optimizing the training criterion gives:

\frac{\partial L(w)}{\partial w} = -2 X^T (y - Xw) = 0 \;\Rightarrow\; X^T y = X^T X \hat{w} \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y    (2.3)

\hat{y} = X \hat{w} = X (X^T X)^{-1} X^T y    (2.4)

2.1 Ridge regression

What happens when the regressors (the columns of X) are correlated? As an extreme case, consider x^i_1 = x^i_2, \forall i. Then the matrix X^T X becomes ill-conditioned and the corresponding coefficients can take on arbitrary values: a large positive value of one coefficient can be canceled by a large negative value of the other. We therefore regularize our problem by introducing a penalty C(w) on the expansion coefficients:

L(w, \lambda) = e^T e + \lambda C(w)    (2.5)

where \lambda determines the tradeoff between the data term e^T e and the regularizer. For the case of ridge regression, we penalize the \ell_2 norm of the weight vector, i.e. \|w\|_2^2 = \sum_{k=1}^{K} w_k^2 = w^T w. The cost function becomes:

L_{ridge}(w) = e^T e + \lambda w^T w = y^T y - 2 w^T X^T y + w^T (X^T X + \lambda I) w    (2.6)

The condition for an optimum of L_{ridge}(w) gives:

w = (X^T X + \lambda I)^{-1} X^T y    (2.7)

So the estimated labels will be:

\hat{y} = X w    (2.8)

2.1.1 Lasso

Instead of the \ell_2 norm we now penalize the \ell_1 norm of the vector w:

E_{Lasso}(w, \lambda) = (y - Xw)^T (y - Xw) + \lambda \sum_k |w_k|    (2.9)

The derivative of E_{Lasso} w.r.t. w_i at any point will be given by:

\frac{\partial E_{Lasso}(w)}{\partial w_i} = \lambda \, \mathrm{sign}(w_i) - 2 X_i^T (y - Xw),

where X_i denotes the i-th column of X.
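To make the closed-form estimators (2.3) and (2.7) concrete, here is a minimal numpy sketch; the synthetic data, the near-duplicate column used to illustrate ill-conditioning and the value of λ are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points, M features, with two nearly identical columns
# to illustrate the ill-conditioning discussed above (assumed setup).
N, M = 100, 5
X = rng.normal(size=(N, M))
X[:, 1] = X[:, 0] + 1e-6 * rng.normal(size=N)   # correlated regressors
w_true = rng.normal(size=M)
y = X @ w_true + 0.1 * rng.normal(size=N)

# Ordinary least squares, eq. (2.3): w = (X^T X)^{-1} X^T y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression, eq. (2.7): w = (X^T X + lambda I)^{-1} X^T y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

print("OLS coefficients:  ", w_ols)    # the first two can blow up with opposite signs
print("Ridge coefficients:", w_ridge)  # regularization keeps them small
```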

If we use for instance gradient descent to minimize E_{Lasso}, the rate at which w_i will be updated will not be affected by the magnitude of w_i, but only by its sign. Note that this is in contrast to the ridge regression case:

\frac{\partial E_{Ridge}(w)}{\partial w_i} = 2 \lambda w_i - 2 X_i^T (y - Xw),

where small-in-magnitude values of w_i can survive, as they become more slowly suppressed as they decrease. This difference, due to the different norms used, leads to sparser solutions in Lasso.

2.2 Selection of λ

If we optimize w.r.t. \lambda directly, \lambda = 0 is the obvious, but trivial solution. This solution will suffer from ill-posedness: as mentioned earlier, the matrix X^T X may be poorly conditioned for correlated X. So the parameters of the model may exhibit high variance, depending on the specific training set used. What \lambda does is regularize our problem, i.e. impose some constraints that render it well-posed. We first see why this is so.

2.2.1 Bias-variance decomposition of the error

Assume the actual generation process is

y = g(x) + \epsilon.    (2.10)

Our model: \hat{y} = \hat{g}(x) = f_{\hat{w}}(x), with \hat{w} = h(\lambda, S) estimated from the training set S.    (2.11)

Expected generalization error at x_0:

Err(x_0) = E[(y - \hat{g}(x_0))^2] = E[(y - g(x_0) + g(x_0) - \hat{g}(x_0))^2]
         = E[((y - g(x_0)) + (g(x_0) - E[\hat{g}(x_0)]) + (E[\hat{g}(x_0)] - \hat{g}(x_0)))^2]
         = \sigma^2 + \underbrace{(g(x_0) - E[\hat{g}(x_0)])^2}_{Bias^2} + \underbrace{E[(E[\hat{g}(x_0)] - \hat{g}(x_0))^2]}_{Variance}    (2.12)

Bias is a decreasing function of model complexity; variance is an increasing one.

2.2.2 Cross Validation

Setting \lambda allows us to control the tradeoff between bias and variance. Consider two extreme cases:

\lambda = 0: no regularization - ill-posed problem, and high variance.
\lambda \to \infty: constant model - high bias.

The best is somewhere in between. Use cross-validation to estimate \lambda, as in the sketch below.
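A minimal sketch of K-fold cross-validation for selecting λ, assuming the ridge closed form (2.7); the fold count, the λ grid and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate, eq. (2.7)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, n_folds=5, seed=0):
    """Average held-out squared error of ridge regression over n_folds folds."""
    N = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(N), n_folds)
    errs = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(N), val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errs.append(np.mean((y[val_idx] - X[val_idx] @ w) ** 2))
    return np.mean(errs)

# Synthetic data (illustration only): noisy linear targets.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=60)

# Pick lambda on a logarithmic grid by minimizing the cross-validation error.
lambdas = np.logspace(-4, 2, 20)
best_lam = min(lambdas, key=lambda lam: cv_error(X, y, lam))
print("selected lambda:", best_lam)
```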

2.3 SVD-based interpretations (optional reading)

Consider the following Singular Value Decomposition¹ for the N \times M matrix X:

X_{N \times M} = U_{N \times M} \, D_{M \times M} \, V^T_{M \times M}, \qquad D = \mathrm{diag}(d_1, d_2, \ldots, d_{\min(M,N)}, \underbrace{0, \ldots, 0}_{\max(M-N, 0)}), \quad d_i \geq d_{i+1}, \; d_i \geq 0,

while the columns of U and V are orthogonal and unit-norm. We remind that N is the number of training points and M the dimensionality of the features.

A first thing to note is that if we have zero-mean ("centered") data, X^T X will be the covariance matrix of the data (up to a factor of 1/N). Moreover we have:

X^T X = V D^T U^T U D V^T = V D^T D V^T = V D^2 V^T    (2.14)

so V will contain the eigenvectors of the covariance matrix of the data, and D^2 its eigenvalues. Further, we can use the SVD to interpret certain quantities obtained in the previous sections. First, for (2.4), which gives the estimated output y in linear regression, we have:

\hat{y} = X (X^T X)^{-1} X^T y = U D V^T (V D^T D V^T)^{-1} V D^T U^T y = U (U^T y)

We can interpret this expression as follows:

U: its columns form an orthonormal basis for an M-dimensional subspace of R^N.
C_h = U^T h: gives us the expansion coefficients for the projection of a vector h on this basis.
\hat{h} = U C_h: is the reconstruction of h on the basis U.

We therefore have that U^T y gives the projection coefficients for y on U, so \hat{y} = U U^T y gives us the projection of y on the basis U.

For the case of ridge regression, using the expression in (2.8) we get:

\hat{y} = X (X^T X + \lambda I)^{-1} X^T y = U D (D^2 + \lambda I)^{-1} D U^T y = \sum_{j=1}^{M} u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y.

¹ This version of the SVD is slightly non-typical; it is the one used in Hastie, Tibshirani and Friedman. The typical SVD, for N > M, is X_{N \times M} = U_{N \times N} D_{N \times M} V^T_{M \times M}, with D = [\mathrm{diag}(d_1, d_2, \ldots, d_M); \; 0_{(N-M) \times M}], \; d_i \geq d_{i+1} \geq 0.    (2.13)

Hence the name "shrinkage": the expansion coefficients for the reconstruction of y on the subspace spanned by U are shrunk by \frac{d_j^2}{d_j^2 + \lambda} < 1 w.r.t. the values used in linear regression.
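The shrinkage interpretation can be checked numerically; below is a small sketch (with arbitrary synthetic data) verifying that ŷ = U U^T y reproduces the linear-regression fit and that ridge regression shrinks the coefficients on U by d_j^2 / (d_j^2 + λ).

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 4
X = rng.normal(size=(N, M))
y = rng.normal(size=N)
lam = 2.0

# Thin SVD: X = U D V^T with U (N x M), d (M,), V^T (M x M)
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Linear regression fit: y_hat = U U^T y
y_hat_ols = U @ (U.T @ y)
assert np.allclose(y_hat_ols, X @ np.linalg.solve(X.T @ X, X.T @ y))

# Ridge fit: the coefficients of y on U are shrunk by d_j^2 / (d_j^2 + lambda)
shrink = d**2 / (d**2 + lam)
y_hat_ridge = U @ (shrink * (U.T @ y))
assert np.allclose(
    y_hat_ridge,
    X @ np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y),
)
print("shrinkage factors:", shrink)
```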

Chapter 3
Logistic Regression

3.1 Probabilistic transition from linear to logistic regression

Consider that the labels are generated according to the linear model plus Gaussian noise assumption:

P(y^i | x^i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y^i - w^T x^i)^2}{2\sigma^2} \right).    (3.1)

We consider as a merit function the likelihood of the data:

C(w) = P(y | X, w) = \prod_i P(y^i | x^i, w)

or, more conveniently, the log-likelihood:

\log C(w) = \log P(y | X) = \sum_i \log P(y^i | x^i) = c - \frac{1}{2\sigma^2} \sum_i (y^i - w^T x^i)^2 = c - \frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) = c - \frac{1}{2\sigma^2} e^T e    (3.2)

So minimizing the least-squares criterion amounts to maximizing the log-likelihood of the labels under the model in (3.1). But this model makes sense only for continuous observations. In classification the observations are discrete (labels). We therefore consider using a Bernoulli distribution instead:

P(y = 1 | x, w) = f(x, w), \qquad P(y = 0 | x, w) = 1 - f(x, w).

We can compactly express these two equations as:

P(y | x) = (f(x, w))^y (1 - f(x, w))^{1-y}    (3.3)

We can now write the (log-)likelihood of the data as:

C(w) = P(y | X, w) = \prod_i P(y^i | x^i, w)
\log C(w) = \log P(y | X, w) = \sum_i \log P(y^i | x^i, w) = \sum_i y^i \log f(x^i, w) + (1 - y^i) \log(1 - f(x^i, w))    (3.4)

Now consider using the following expression for f(x, w):

f(x, w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}    (3.5)

g(a) = \frac{1}{1 + \exp(-a)} is called the sigmoidal function and has several interesting properties, the most useful ones being:

\frac{dg}{da} = g(a)(1 - g(a)), \qquad 1 - g(a) = g(-a)

Plugging (3.5) into (3.4), this is the criterion we should be maximizing:

\log C(w) = \log P(y | X, w) = \sum_i y^i \log g(w^T x^i) + (1 - y^i) \log(1 - g(w^T x^i))    (3.6)

A simple approach to maximize it is to use gradient ascent:

w_k^{t+1} = w_k^t + \alpha \left. \frac{\partial \log C(w)}{\partial w_k} \right|_{w^t}    (3.7)

Using the expression in (3.6), we have:

\frac{\partial \log C(w)}{\partial w_k} = \sum_i \left[ \frac{y^i}{g(w^T x^i)} - \frac{1 - y^i}{1 - g(w^T x^i)} \right] \frac{\partial g(w^T x^i)}{\partial w_k}
 = \sum_i \left[ \frac{y^i}{g(w^T x^i)} - \frac{1 - y^i}{1 - g(w^T x^i)} \right] g(w^T x^i)(1 - g(w^T x^i)) \frac{\partial w^T x^i}{\partial w_k}
 = \sum_i \left[ y^i (1 - g(w^T x^i)) - (1 - y^i) g(w^T x^i) \right] x^i_k
 = \sum_i \left[ y^i - g(w^T x^i) \right] x^i_k
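A minimal sketch of the gradient-ascent update (3.7), with the gradient Σ_i (y^i − g(w^T x^i)) x^i_k computed for all coordinates at once; the step size, iteration count, 1/N scaling of the gradient and the synthetic data are implementation choices, not part of the derivation above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_gradient_ascent(X, y, alpha=0.1, n_iters=1000):
    """Maximize the log-likelihood (3.6) by gradient ascent, eq. (3.7).

    X: N x M design matrix, y: N-vector of {0, 1} labels.
    """
    N, M = X.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        g = sigmoid(X @ w)          # g^i = g(w^T x^i)
        grad = X.T @ (y - g)        # d log C / dw_k = sum_i (y^i - g^i) x^i_k
        w = w + alpha * grad / N    # scaling by N keeps alpha well-behaved
    return w

# Illustration on synthetic data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = logistic_gradient_ascent(X, y)
print("training accuracy:", np.mean((sigmoid(X @ w) >= 0.5) == y))
```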

We can write this more compactly using vector notation; with the N \times 1 vector g with elements

g^i = g(w^T x^i)    (3.8)

and the N-dimensional k-th column X_k of the N \times M matrix X, we have:

\frac{\partial \log C(w)}{\partial w_k} = [y - g]^T X_k = X_k^T [y - g]

Stacking the partial derivatives together we form the Jacobian:

J(w) = \frac{d \log C(w)}{dw} = \begin{pmatrix} X_1^T [y - g] \\ \vdots \\ X_M^T [y - g] \end{pmatrix} = X^T [y - g]

The gradient-ascent update for w will be:

w^{t+1} = w^t + \alpha J(w^t)

A more elaborate, but more efficient technique is the Newton-Raphson method. Consider first the Newton method for finding the roots of a 1-D function f(\theta):

\theta \leftarrow \theta - \frac{f(\theta)}{f'(\theta)}

If we want to maximize the function, we need to find the roots of its derivative f'; so we apply the Newton method to find the roots of f'(\theta):

\theta \leftarrow \theta - \frac{f'(\theta)}{f''(\theta)}

In our case we want to maximize a multidimensional function, which leads us to the Newton-Raphson method:

w \leftarrow w - (H(w))^{-1} J(w).    (3.9)

We already saw how to compute the Jacobian. For the (k, j)-th element of the Hessian we have:

H(w)(k, j) = \frac{\partial^2 \log C(w)}{\partial w_k \partial w_j} = \frac{\partial}{\partial w_j} \sum_{i=1}^{N} \left[ y^i - g(w^T x^i) \right] x^i_k = -\sum_i g(w^T x^i)(1 - g(w^T x^i)) x^i_j x^i_k,

i.e. H = -X^T R X, where R_{i,i} = g(w^T x^i)(1 - g(w^T x^i)).    (3.10)

This Newton-Raphson update is called Iteratively Reweighted Least Squares (IRLS) in statistics; to see why this is so, we rewrite the update as:

w \leftarrow w + (X^T R X)^{-1} X^T (y - g)
  = (X^T R X)^{-1} \left[ (X^T R X) w + X^T (y - g) \right]
  = (X^T R X)^{-1} X^T R \left[ X w + R^{-1} (y - g) \right]
  = (X^T R X)^{-1} X^T R z

where we have introduced the "adjusted response":

z = X w + R^{-1} (y - g), \qquad z^i = \underbrace{w^T x^i}_{\text{previous value}} + \frac{1}{g^i (1 - g^i)} (y^i - g^i).    (3.11)

Using this new vector z, we can interpret the vector w as being the solution to a weighted least-squares problem:

w = \arg\min_w \sum_{i=1}^{N} R_{i,i} (z^i - w^T x^i)^2    (3.12)

Note that apart from z, also the matrix R is updated at each iteration; this explains the "iteratively re-" prefix in the IRLS name.

3.2 Log-loss

We simplify the expression for our training criterion to see what logistic regression does. We therefore introduce:

y^{\pm} = 2 y^b - 1    (3.13)
y^b \in \{0, 1\}    (3.14)
y^{\pm} \in \{-1, 1\}    (3.15)
y^b = \frac{y^{\pm} + 1}{2}    (3.16)

where y^b is the label we have been using so far and y^{\pm} is a new label that makes certain expressions easier. In specific we have:

P(y = 1 | f(x)) = \frac{1}{1 + \exp(-f(x))}    (3.17)
P(y = -1 | f(x)) = 1 - P(y = 1 | f(x))    (3.18)
                 = \frac{1}{\exp(f(x)) + 1}    (3.19)
P(y^i | f(x^i)) = \frac{1}{1 + \exp(-y^i f(x^i))}    (3.20)

So we can write:

L(w) = -\log C(w) = \sum_i \log(1 + \exp(-y^i_{\pm} f(x^i)))    (3.21)

The quantity

l(y^i, x^i) = \log(1 + \exp(-y^i_{\pm} f(x^i)))    (3.22)

is called the log-loss.
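Complementing the gradient-ascent sketch, here is a minimal Newton-Raphson / IRLS implementation of the update (3.9)-(3.10) from the previous section; the fixed iteration count is an arbitrary choice, and in practice one would monitor convergence and guard against a singular Hessian.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_newton_raphson(X, y, n_iters=10):
    """Fit logistic regression by Newton-Raphson / IRLS, eqs. (3.9)-(3.12).

    X: N x M design matrix, y: N-vector of {0, 1} labels.
    """
    N, M = X.shape
    w = np.zeros(M)
    for _ in range(n_iters):
        g = sigmoid(X @ w)                    # g^i = g(w^T x^i)
        r = g * (1.0 - g)                     # diagonal of R
        J = X.T @ (y - g)                     # Jacobian X^T (y - g)
        H = X.T @ (X * r[:, None])            # X^T R X
        w = w + np.linalg.solve(H, J)         # w <- w + (X^T R X)^{-1} X^T (y - g)
    return w

# Usage mirrors the gradient-ascent sketch above: w = logistic_newton_raphson(X, y)
```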

Chapter 4
Support Vector Machines

4.1 Large margin classifiers

In linear classification we are concerned with the construction of a separating hyperplane: w^T x = 0. For instance, logistic regression provides an estimate of

P(y = 1 | x; w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}

and provides a decision rule using:

\hat{y}^i = \begin{cases} 1 & \text{if } P(y = 1 | x; w) \geq 0.5 \\ 0 & \text{if } P(y = 1 | x; w) < 0.5 \end{cases}    (4.1)

Intuitively, a confident decision is made when |w^T x| is large. So we would like to have:

w^T x^i \gg 0 \quad \text{if } y^i = 1
w^T x^i \ll 0 \quad \text{if } y^i = -1

or, combined:

y^i (w^T x^i) \gg 0.

Given a training set, we want to find a decision boundary that allows us to make correct and confident predictions on all training examples. For points near the decision boundary, even though we may be correct, small changes in the training set could cause different classifications; so our classifier will not be confident. If a point is far from the separating hyperplane then we are more confident in our predictions. In other words, we would like all points to be far from (and on the correct side of) the decision plane.

4.1.1 Notation

Instead of

h(x) = g(w^T x)    (4.2)

we now consider classifiers of the form:

h(x) = g(w^T x + b)    (4.3)

(b = w_0), where g(z) = 1 if z > 0, and g(z) = -1 if z < 0. Unlike logistic regression, where g gives a probability estimate, here the classifier directly outputs a decision.

4.1.2 Margins

The confidence scores described above are called functional margins: scaling w, b by two doubles the margin, without essentially changing the classifier. We therefore now turn to geometric margins, which are defined in terms of the unit-norm vector \frac{w}{\|w\|}. We first express any point as:

x = \bar{x} + \gamma \frac{w}{\|w\|}

where \bar{x} is the projection of x on the decision plane (w^T \bar{x} + b = 0), \frac{w}{\|w\|} is a unit-norm vector perpendicular to the decision plane, and \gamma is thus the distance of x from the decision plane. Solving for \gamma we have:

w^T x = w^T \bar{x} + \gamma \frac{w^T w}{\|w\|} = -b + \gamma \|w\| \;\Rightarrow\; \gamma = \frac{w^T x + b}{\|w\|}

The above analysis works for the case where y = 1. If y = -1 we would like w^T x + b to be negative; we therefore write:

\gamma^i = y^i \left( \frac{w^T x^i + b}{\|w\|} \right)

For a training set \{x^i, y^i\} and a classifier w we define its geometric margin as:

\gamma = \min_i \gamma^i    (4.4)

which indicates the worst performance of the classifier on the training set. A small numerical illustration is given in the sketch below.
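A tiny numerical illustration (with made-up points and a made-up hyperplane) of functional vs. geometric margins: rescaling (w, b) changes the former but not the latter.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Geometric margins gamma^i = y^i (w^T x^i + b) / ||w||, and their minimum, eq. (4.4)."""
    gammas = y * (X @ w + b) / np.linalg.norm(w)
    return gammas, gammas.min()

# Toy example (assumed data): three 2-D points and a candidate hyperplane.
X = np.array([[2.0, 2.0], [1.5, -0.5], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0])
w, b = np.array([1.0, 1.0]), -0.5

gammas, gamma = geometric_margin(w, b, X, y)
print("per-point margins:", gammas, " classifier margin:", gamma)

# Rescaling (w, b) changes the functional margins y^i (w^T x^i + b)
# but leaves the geometric margins unchanged.
print(np.isclose(geometric_margin(2 * w, 2 * b, X, y)[1], gamma))
```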

4.1.3 Introducing margins in classification

Our task is to classify all points with maximal geometric margin. If we consider the case where the points are separable, we have the following optimization problem:

\max_{\gamma, w, b} \gamma \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq \gamma, \; i = 1 \ldots M, \qquad \|w\| = 1

In words: maximize \gamma, where all points have at least \gamma geometric margin. It is hard to incorporate the \|w\| = 1 constraint (it is non-convex; for two-dimensional inputs, w should lie on a circle). We therefore divide both sides by \|w\|:

\max_{\gamma, w, b} \frac{\gamma}{\|w\|} \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq \gamma, \; i = 1 \ldots M

From the constraint, since w is no longer constrained to have unit norm, we realize that \gamma is now a functional margin. Suppose we find a combination of w, b that solves the optimization problem. We get a one-dimensional family of solutions by scaling w, b, \gamma by a common constant. We discard this ambiguity by considering only those values of w, b for which \gamma = 1, which gives us the following problem:

\max_{w, b} \frac{1}{\|w\|} \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq 1, \; i = 1 \ldots M

Equivalently:

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq 1, \; i = 1 \ldots M

So we end up with a quadratic optimization problem, involving a convex (quadratic) function of w and linear constraints on w.

4.2 Duality

Consider a general constrained optimization problem:

\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \; i = 1 \ldots l, \qquad g_i(w) \leq 0, \; i = 1 \ldots m

This is equivalent to the unconstrained optimization:

\min_w f_{uc}(w) = f(w) + \sum_{i=1}^{l} I_0(h_i(w)) + \sum_{i=1}^{m} I_+(g_i(w))    (4.5)

where I_0, I_+ are hard constraints:

I_0(x) = \begin{cases} 0, & x = 0 \\ \infty, & x \neq 0 \end{cases}, \qquad I_+(x) = \begin{cases} 0, & x \leq 0 \\ \infty, & x > 0 \end{cases}

We form a lower bound of this function by replacing each hard constraint with a soft one:

I_0(h_i(w)) \to \lambda_i h_i(w), \qquad I_+(g_i(w)) \to \mu_i g_i(w), \; \mu_i \geq 0,

where \lambda, \mu are penalties for violating the constraints. This gives us the Lagrangian:

L(w, \lambda, \mu) = f(w) + \sum_{i=1}^{l} \lambda_i h_i(w) + \sum_{i=1}^{m} \mu_i g_i(w), \quad \mu_i \geq 0

We observe that

f_{uc}(w) = \max_{\lambda, \mu : \mu_i \geq 0} L(w, \lambda, \mu).    (4.6)

Therefore

f(w^*) = \min_w \max_{\lambda, \mu : \mu_i \geq 0} L(w, \lambda, \mu).    (4.7)

Now consider the Lagrange dual function:

\theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu).    (4.8)

\theta(\lambda, \mu) is always a lower bound on the optimal value p^* of the original problem. If the original problem has a (feasible) solution w^*, then:

L(w^*, \lambda, \mu) = f(w^*) + \sum_{i=1}^{l} \lambda_i h_i(w^*) + \sum_{i=1}^{m} \mu_i g_i(w^*)
                     = f(w^*) + \sum_i \lambda_i \cdot 0 + \underbrace{\sum_i \mu_i g_i(w^*)}_{\leq 0}
                     \leq f(w^*).

So

\theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu) \leq L(w^*, \lambda, \mu) \leq f(w^*).    (4.9)

So we have a function \theta(\lambda, \mu) that lower-bounds the minimum cost of our original problem. We want to make this bound as tight as possible, i.e. maximize it.

New task (dual problem):

\max_{\lambda, \mu} \theta(\lambda, \mu) \quad \text{s.t.} \quad \mu_i \geq 0 \; \forall i    (4.10)

In general,

d^* = \max_{\lambda, \mu : \mu_i \geq 0} \theta(\lambda, \mu) = \max_{\lambda, \mu : \mu_i \geq 0} \min_w L(w, \lambda, \mu) \leq \min_w \max_{\lambda, \mu : \mu_i \geq 0} L(w, \lambda, \mu) = \min_w f_{uc}(w) = p^*

If f(w) and the g_i(w) are convex, the h_i(w) are affine, and the primal and dual problems are feasible (under a mild constraint qualification, e.g. Slater's condition), then the bound is tight, i.e. d^* = p^*, which means that there exist w^*, \lambda^*, \mu^* meeting all constraints while satisfying:

f(w^*) = \theta(\lambda^*, \mu^*)    (4.11)

At that point we have:

f(w^*) = \theta(\lambda^*, \mu^*) = \inf_w \left[ f(w) + \sum_{i=1}^{l} \lambda^*_i h_i(w) + \sum_{i=1}^{m} \mu^*_i g_i(w) \right]
       \leq f(w^*) + \sum_{i=1}^{l} \lambda^*_i h_i(w^*) + \sum_{i=1}^{m} \mu^*_i g_i(w^*)
       \leq f(w^*)

Therefore: \mu^*_i g_i(w^*) = 0, \; \forall i ("Complementary Slackness").

4.2.1 Karush-Kuhn-Tucker Conditions

L(w^*, \lambda^*, \mu^*) is a saddle point (minimum for w, maximum for \lambda, \mu). From the minimum w.r.t. w we have that

\nabla f(w^*) + \sum_{i=1}^{l} \lambda^*_i \nabla h_i(w^*) + \sum_{i=1}^{m} \mu^*_i \nabla g_i(w^*) = 0

From the maximum w.r.t. \lambda, \mu we have that the constraints are satisfied. Combining with the complementary slackness conditions we have the KKT conditions:

\nabla f(w^*) + \sum_{i=1}^{l} \lambda^*_i \nabla h_i(w^*) + \sum_{i=1}^{m} \mu^*_i \nabla g_i(w^*) = 0
g_i(w^*) \leq 0
\mu^*_i g_i(w^*) = 0
\mu^*_i \geq 0

4.2.2 Dual Problem for Large-Margin Training

The primal problem is the objective we set up for max-margin training:

\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad -y^i (w^T x^i + b) + 1 \leq 0, \; i = 1 \ldots M

Its Lagrangian will be:

L(w, b, \mu) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{M} \mu_i \left[ y^i (w^T x^i + b) - 1 \right]    (4.12)

At an optimum we have a minimum w.r.t. w, which gives:

\nabla_w L = w - \sum_i \mu_i y^i x^i = 0 \;\Rightarrow\; w = \sum_i \mu_i y^i x^i,

and a minimum w.r.t. b, which gives:

\frac{\partial L}{\partial b} = -\sum_i \mu_i y^i = 0 \;\Rightarrow\; \sum_i \mu_i y^i = 0.

Plugging the optimal values into the Lagrangian we get:

\theta(\mu) = L(w, b, \mu) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{M} \mu_i \left[ y^i (w^T x^i + b) - 1 \right]
 = \frac{1}{2} \left( \sum_i \mu_i y^i x^i \right)^T \left( \sum_j \mu_j y^j x^j \right) - \sum_i \mu_i \left[ y^i \left( \left( \sum_j \mu_j y^j x^j \right)^T x^i + b \right) - 1 \right]
 = \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j \mu_i \mu_j y^i y^j (x^i)^T (x^j) - b \sum_i \mu_i y^i,

where the last term vanishes since \sum_i \mu_i y^i = 0.

We have thereby eliminated w from the problem. For reasons that will become clear later, we write (x^i)^T (x^j) = \langle x^i, x^j \rangle (inner product). The new optimization problem becomes:

\max_\mu \theta(\mu) = \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j \mu_i \mu_j y^i y^j \langle x^i, x^j \rangle
\text{s.t.} \quad \mu_i \geq 0 \; \forall i, \qquad \sum_i \mu_i y^i = 0

If we solve the problem above w.r.t. \mu, we can use \mu to determine w. One important note is that, as

w = \sum_i \mu_i y^i x^i,    (4.13)

the final decision boundary orientation is a linear combination of the training points. Moreover, from complementary slackness we have:

\mu_i g_i(w) = 0,    (4.14)

where:

g_i(w) = -y^i (w^T x^i + b) + 1 \leq 0 \;\Leftrightarrow\; y^i (w^T x^i + b) \geq 1.    (4.15)

Therefore if \mu_i \neq 0 then y^i (w^T x^i + b) = 1. As w is formed from the set of points for which \mu_i \neq 0, this means that only the points that lie on the margin influence the solution. These points are called Support Vectors. From the set S of support vectors we can determine the best b:

y^i (w^T x^i + b) = 1, \; i \in S \;\Rightarrow\; b = \frac{1}{N_S} \sum_{i \in S} \left( y^i - w^T x^i \right)    (4.16)

4.2.3 Non-separable data

To deal with the case where the data cannot be separated, we introduce positive slack variables \xi^i, i = 1 \ldots M, in the constraints:

w^T x^i + b \geq 1 - \xi^i, \quad y^i = 1
w^T x^i + b \leq -1 + \xi^i, \quad y^i = -1

or equivalently:

y^i (w^T x^i + b) \geq 1 - \xi^i

If \xi^i > 1 an error occurs, and \sum_i \xi^i is an upper bound on the number of errors. We consider a new problem:

\min_{w, b} \frac{1}{2} \|w\|^2 + C \sum_i \xi^i
\text{s.t.} \quad y^i (w^T x^i + b) \geq 1 - \xi^i, \qquad \xi^i \geq 0    (4.17)

where C determines the tradeoff between the large margin (low \|w\|) and the low-error constraint. The Lagrangian of this problem is:

L(w, b, \xi, \mu, \nu) = \frac{1}{2} w^T w + C \sum_i \xi^i - \sum_i \mu_i \left[ y^i (w^T x^i + b) - 1 + \xi^i \right] - \sum_i \nu_i \xi^i,

and its dual becomes:

\max_\mu \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j y^i y^j \mu_i \mu_j \langle x^i, x^j \rangle
\text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_i \mu_i y^i = 0.

The KKT conditions are similar; an interesting change is that we now have:

C - \mu_i - \nu_i = 0    (4.18)
y^i (\langle x^i, w \rangle + b) - 1 + \xi^i \geq 0    (4.19)
\mu_i \left[ y^i (\langle x^i, w \rangle + b) - 1 + \xi^i \right] = 0    (4.20)
\nu_i \xi^i = 0    (4.21)
\xi^i \geq 0    (4.22)
\mu_i \geq 0    (4.23)
\nu_i \geq 0    (4.24)

which gives:

y^i (\langle x^i, w \rangle + b) > 1 : \quad \xi^i = 0 and, by (4.20), \mu_i = 0
y^i (\langle x^i, w \rangle + b) < 1 : \quad \xi^i > 0 by (4.19), so by (4.21) \nu_i = 0 and by (4.18) \mu_i = C
y^i (\langle x^i, w \rangle + b) = 1 : \quad \mu_i \in [0, C]

In particular, for the slack variables we obtain:

\mu_i \neq 0 \;\Rightarrow\; \text{(4.20) gives} \; \xi^i = 1 - y^i (\langle x^i, w \rangle + b)    (4.25)
\mu_i = 0 \;\Rightarrow\; \text{(4.18) gives} \; \nu_i = C, \; \text{so (4.21) gives} \; \xi^i = 0    (4.26)
\xi^i \geq 0    (4.27)
\Rightarrow\; \xi^i = \max(0, 1 - y^i (\langle x^i, w \rangle + b))    (4.28)

4.2.4 Kernels

Up to now, the decision was based on the sign of

y = w^T x = \sum_i w_i x_i    (4.29)

What if we used nonlinear transformations of the data? As a concrete example, consider

y = w^T \phi(x) = \sum_{k=1}^{K} w_k \phi_k(x), \qquad x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \phi(x) = \begin{pmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{pmatrix}    (4.30)

Using such features would allow us to go from linear to quadratic decision boundaries, defined by:

\sum_i w_i \phi_i(x) = 0    (4.31)

We observe that, in the linear case, for any new point x, the decision will have the form:

y = \mathrm{sign}(\langle w, x \rangle + b) = \mathrm{sign}\left( \sum_i \mu_i y^i \langle x^i, x \rangle + b \right)    (4.32)

So all we need to know in order to compute the response at a new x is its inner product with the Support Vectors. So if we plug in our new features, we guess that we should get a classifier of the form:

y = \mathrm{sign}\left( \sum_i \mu_i y^i \langle \phi(x^i), \phi(x) \rangle + b \right)    (4.33)

Actually, everything in the original algorithm was written in terms of inner products between input vectors. So we can write everything in terms of inner products of our new features \phi; in specific, the dual optimization problem becomes:

\max_\mu \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j y^i y^j \mu_i \mu_j \langle \phi(x^i), \phi(x^j) \rangle
\text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_i \mu_i y^i = 0

If we now define a kernel between two points x and y as

K(x, y) = \langle \phi(x), \phi(y) \rangle

and replace \langle x, y \rangle with K(x, y), we can write the same algorithm:

y = \mathrm{sign}\left( \sum_i \mu_i y^i K(x^i, x) + b \right)    (4.34)

\max_\mu \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j y^i y^j \mu_i \mu_j K(x^i, x^j)
\text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_i \mu_i y^i = 0

As a concrete example, consider the mapping:

x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad \phi(x) = \begin{pmatrix} x_1^2 \\ \sqrt{2} x_1 x_2 \\ x_2^2 \end{pmatrix}

Then

\langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2 = K(x, y)

In the same way, we can use K(x, y) = (x^T y + c)^d to implement a mapping into an \binom{N+d}{d}-dimensional space.
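As a sketch of how the kernelized dual above can be solved, the snippet below uses scipy's general-purpose SLSQP solver on a toy problem with the polynomial kernel (x^T y + c)^d; a dedicated QP/SMO solver would be used in practice, and the data, C and kernel parameters are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy 2-class problem (labels in {-1, +1}); a circle separates the classes,
# so a degree-2 polynomial kernel suffices.
N = 40
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1.0, -1.0)

def kernel(A, B, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (x^T y + c)^d."""
    return (A @ B.T + c) ** d

C = 1.0
K = kernel(X, X)
Q = (y[:, None] * y[None, :]) * K    # Q_ij = y^i y^j K(x^i, x^j)

# Dual: max sum_i mu_i - 1/2 mu^T Q mu  s.t.  0 <= mu_i <= C,  sum_i mu_i y^i = 0
res = minimize(
    fun=lambda mu: 0.5 * mu @ Q @ mu - mu.sum(),
    x0=np.zeros(N),
    jac=lambda mu: Q @ mu - 1.0,
    bounds=[(0.0, C)] * N,
    constraints=[{"type": "eq", "fun": lambda mu: mu @ y}],
    method="SLSQP",
)
mu = res.x

# b from the margin support vectors (0 < mu_i < C), cf. eq. (4.16).
sv = (mu > 1e-6) & (mu < C - 1e-6)
b = np.mean(y[sv] - (mu * y) @ K[:, sv])

# Decision rule (4.34): sign( sum_i mu_i y^i K(x^i, x) + b )
pred = np.sign((mu * y) @ K + b)
print("training accuracy:", np.mean(pred == y))
```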

Another case:

K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)    (4.35)

It can be shown that this amounts to an inner product in an infinite-dimensional feature space. K can be a distance-like measure among inputs, even though we may not have a straightforward feature-space interpretation; this allows us to apply SVMs to strings, graphs, documents, etc. Most importantly, it allows us to work in very high-dimensional spaces without necessarily finding and/or computing the mappings.

Under what conditions does a kernel function represent an inner product in some space? We want to have K_{ij} = K(x^i, x^j) = K(x^j, x^i) = K_{ji}, where K is the Gram matrix; for m points this should hold for all i = 1 \ldots m, j = 1 \ldots m. Consider now the quantity:

z^T K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i \phi^T(x^i) \phi(x^j) z_j = \sum_i \sum_j \sum_k z_i \phi_k(x^i) \phi_k(x^j) z_j = \sum_k \sum_i \sum_j z_i \phi_k(x^i) \phi_k(x^j) z_j = \sum_k \left( \sum_i z_i \phi_k(x^i) \right)^2 \geq 0

K is therefore positive semi-definite and symmetric.

Mercer kernel: consider a kernel K : R^N \times R^N \to R; K represents an inner product if and only if K is symmetric positive semi-definite.

4.3 Hinge vs. Log-loss

Consider the SVM training criterion:

L(w) = \sum_{i=1}^{m} [\xi^i]_+ + \frac{1}{2C} w^T w = \sum_{i=1}^{m} [1 - y^i (w^T x^i)]_+ + \lambda w^T w    (4.36)

where [1 - y^i t^i]_+ (with t^i = w^T x^i) is the hinge loss.

Compared to regularized logistic regression,

L(w) = \sum_{i=1}^{m} \log(1 + \exp(-y^i (w^T x^i))) + \lambda w^T w,

the hinge loss is more abrupt: as soon as y^i (w^T x^i) > 1, the i-th point is declared well-classified and no longer contributes to the solution (i.e. it is not a support vector).
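To make the comparison concrete, here is a small sketch tabulating the two losses as a function of the margin y^i (w^T x^i); the grid of margin values is arbitrary.

```python
import numpy as np

# Compare the two losses as a function of the margin y * w^T x.
margins = np.linspace(-2, 3, 11)
hinge = np.maximum(0.0, 1.0 - margins)          # [1 - y w^T x]_+
logloss = np.log(1.0 + np.exp(-margins))        # log(1 + exp(-y w^T x))

for m, h, l in zip(margins, hinge, logloss):
    print(f"margin {m:5.1f}   hinge {h:5.2f}   log-loss {l:5.2f}")
# The hinge loss is exactly zero once the margin exceeds 1,
# while the log-loss keeps decreasing but never reaches zero.
```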

Chapter 5
Adaboost

In many cases it is easy to find some rules of thumb that are often correct, but hard to find a single highly accurate prediction rule. Boosting is a general method for converting a set of rules of thumb into a highly accurate prediction rule; assuming that a given weak learning algorithm can consistently find classifiers ("rules of thumb") at least slightly better than random, say of 51% accuracy (in a two-class setting), then given sufficient data a boosting algorithm can provably construct a single classifier with very high accuracy.

5.1 AdaBoost

Adaboost is a boosting algorithm that provides justifiable answers to the following questions. First, how to choose examples on each round? Adaboost concentrates on the hardest examples, i.e. those most often misclassified by the previous rules of thumb. Second, how to combine rules of thumb into a single prediction rule? Adaboost proposes to take a weighted majority vote of these rules of thumb.

In specific, we consider that we are provided with a training set: (x^i, y^i), x^i \in X, y^i \in \{-1, 1\}, i = 1, \ldots, N. Adaboost maintains a probability density on these data which is initially uniform, and on the way places larger weight on harder examples. At each round a weak classifier h_t ("rule of thumb") is identified that has the best performance on the weighted data, and is used to update (i) the classifier's expression and (ii) the probability density on the data.

In specific, Adaboost can be described in terms of the following algorithm (a runnable sketch follows below):

Initially, set D_1(i) = 1/N, \; \forall i.

For t = 1 \ldots T:

  Find the weak classifier h_t : X \to \{-1, 1\} with smallest weighted error

  \epsilon_t = \frac{\sum_{i=1}^{N} D_t^i [y^i \neq h_t(x^i)]}{\sum_i D_t^i}    (5.1)

  Set \alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)

  Update the distribution:

  D_{t+1}^i = \frac{D_t^i}{Z_t} \times \begin{cases} \exp(-\alpha_t), & \text{if } y^i = h_t(x^i) \\ \exp(\alpha_t), & \text{if } y^i \neq h_t(x^i) \end{cases} = \frac{D_t^i}{Z_t} \exp(-\alpha_t y^i h_t(x^i)),

  where, to guarantee that D_{t+1} remains a proper density, we normalize by:

  Z_t = \sum_i D_t^i \exp(-\alpha_t y^i h_t(x^i))

Combine the classifier results (weighted vote):

f(x) = \sum_t \alpha_t h_t(x)

Output the final classifier:

H(x) = \mathrm{sign}(f(x))    (5.2)
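Below is a minimal sketch of this algorithm using decision stumps (thresholds on single features) as weak learners; the toy XOR-like data, the number of rounds and the clamping of ε_t away from zero are illustrative implementation choices.

```python
import numpy as np

def stump_predict(X, feat, thresh, sign):
    """Decision stump h(x) = sign * ( +1 if x[feat] > thresh else -1 )."""
    return sign * np.where(X[:, feat] > thresh, 1.0, -1.0)

def adaboost(X, y, T=20):
    """AdaBoost with decision stumps, following eqs. (5.1)-(5.2)."""
    N, M = X.shape
    D = np.full(N, 1.0 / N)          # initial distribution D_1(i) = 1/N
    stumps, alphas = [], []
    for t in range(T):
        # Find the weak classifier (stump) with smallest weighted error eps_t.
        best = None
        for feat in range(M):
            for thresh in np.unique(X[:, feat]):
                for sign in (+1.0, -1.0):
                    pred = stump_predict(X, feat, thresh, sign)
                    eps = np.sum(D * (pred != y))
                    if best is None or eps < best[0]:
                        best = (eps, feat, thresh, sign)
        eps, feat, thresh, sign = best
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # a_t (eps clamped for safety)
        pred = stump_predict(X, feat, thresh, sign)
        D = D * np.exp(-alpha * y * pred)                  # up-weight the mistakes
        D = D / D.sum()                                    # normalize by Z_t
        stumps.append((feat, thresh, sign))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    """Final classifier H(x) = sign( sum_t a_t h_t(x) ), eq. (5.2)."""
    f = sum(a * stump_predict(X, *s) for s, a in zip(stumps, alphas))
    return np.sign(f)

# Toy problem: label is the XOR-like sign of the product of two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] * X[:, 1])
stumps, alphas = adaboost(X, y, T=40)
print("training error:", np.mean(adaboost_predict(X, stumps, alphas) != y))
```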

5.1.1 Training Error Analysis

We first introduce the edge of the weak learner at the t-th round as:

\gamma_t = \frac{1}{2} - \epsilon_t

We will prove that:

\underbrace{\frac{1}{N} \sum_i [y^i \neq H(x^i)]}_{\text{Final Training Error}} \leq \prod_{t=1}^{T} \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_{t=1}^{T} \sqrt{1 - 4 \gamma_t^2} \leq \exp\left( -2 \sum_{t=1}^{T} \gamma_t^2 \right)

The challenge lies in proving the first inequality; the second comes from the definition of the edge, while the third comes from the fact that 1 - 2x \leq \exp(-2x) for x \in [0, 1/2]. We have required that the weak learner is always correct more than half of the time on the training data, i.e. that \gamma_t > 0 \; \forall t, so if \gamma_t > \gamma > 0 \; \forall t we will then have:

\frac{1}{N} \sum_i [y^i \neq H(x^i)] \leq \exp(-2 T \gamma^2)    (5.3)

which means that the number of training errors will be exponentially decreasing in time.

First, by recursion, we have that:

D_{t+1}^i = \frac{D_t^i}{Z_t} \exp(-\alpha_t y^i h_t(x^i))
          = \frac{D_{t-1}^i}{Z_{t-1} Z_t} \exp\left( -\left[ \alpha_{t-1} y^i h_{t-1}(x^i) + \alpha_t y^i h_t(x^i) \right] \right)
          = \ldots
          = \frac{1}{N} \frac{\exp\left( -y^i \sum_{k=1}^{t} \alpha_k h_k(x^i) \right)}{\prod_{k=1}^{t} Z_k}    (5.4)

or:

D_{T+1}^i = \frac{1}{N} \frac{\exp(-y^i f(x^i))}{\prod_{k=1}^{T} Z_k}    (5.5)

which expresses the distribution on the data in terms of the classifier's score on them. We use this result to obtain:

\frac{1}{N} \sum_i [y^i \neq H(x^i)] \overset{H(x) = \mathrm{sign} f(x)}{=} \frac{1}{N} \sum_i \begin{cases} 1 & y^i f(x^i) \leq 0 \\ 0 & y^i f(x^i) > 0 \end{cases} \leq \frac{1}{N} \sum_i \exp(-y^i f(x^i)) = \sum_i D_{T+1}^i \prod_{k=1}^{T} Z_k = \prod_{k=1}^{T} Z_k

which relates the number of errors to the normalizing constants Z_k at each round.

We set out to prove that:

\underbrace{\frac{1}{N} \sum_i [y^i \neq H(x^i)]}_{\text{Final Training Error}} \leq \prod_{t=1}^{T} \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right]    (5.6)

so, given the last result, all we need to prove is that:

2 \sqrt{\epsilon_t (1 - \epsilon_t)} = Z_t

The key to this is the choice we made for \alpha:

\alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)    (5.7)

In specific, we have:

Z_t = \sum_i D_t^i \exp(-\alpha_t y^i h_t(x^i))
    = \sum_{i : y^i = h_t(x^i)} D_t^i \exp(-\alpha_t) + \sum_{i : y^i \neq h_t(x^i)} D_t^i \exp(\alpha_t)
    = (1 - \epsilon_t) \exp(-\alpha_t) + \epsilon_t \exp(\alpha_t)
    = (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} + \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}}
    = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}

This concludes the proof.

5.1.2 Adaboost as coordinate descent

We demonstrated that boosting optimizes the following upper bound to the misclassification cost:

\frac{1}{N} \sum_i [y^i \neq H(x^i)] \leq \frac{1}{N} \sum_i \exp(-y^i f(x^i))    (5.8)
 = \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right]    (5.9)

Actually, the choice of \alpha_t is such that Z_t is minimized at each iteration:

\frac{\partial}{\partial \alpha_t} \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right] = 0
\Rightarrow \epsilon_t \exp(\alpha_t) - (1 - \epsilon_t) \exp(-\alpha_t) = 0
\Rightarrow \epsilon_t \exp(2 \alpha_t) = 1 - \epsilon_t
\Rightarrow \alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)

which gives

\epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}    (5.10)

So choosing the best (in the sense of minimizing Z_t, and thence (5.8)) value of \alpha tells us that Z_t will equal 2 \sqrt{\epsilon_t (1 - \epsilon_t)}, where \epsilon_t < 1/2 is the weighted error of the weak learner. Moreover, \epsilon_t (1 - \epsilon_t) is increasing in \epsilon_t on [0, 1/2], so finding the weak learner with the minimal \epsilon_t guarantees that we most quickly decrease (5.8).

We can conceptually describe what this amounts to as follows: we have at our disposal the responses of all weak learners (an infinite number of functions in general), and consider combining them with a weight vector, initially set to zero. In this setting, the Adaboost algorithm can be interpreted as coordinate descent (an optimization technique that changes a single dimension of the optimized vector at each iteration): we pick the dimension (weak learner, h_t) and the step (voting strength, \alpha_t) that result in the sharpest decrease in the cost. So we can see Adaboost as a greedy, but particularly effective, optimization procedure.

5.1.3 Adaboost as building an additive model

A similar interpretation of Adaboost is in terms of fitting a linear model by finding at each step the optimal perturbation of a previously constructed decision rule. In specific, consider that we have f(x) at round t and we want to find how to optimally perturb it by:

f'(x) = f(x) + \alpha h(x)    (5.11)

where h(x) is a weak learner and \alpha is the perturbation scale. To consider the impact that this perturbation will have on the cost we use a Taylor expansion as follows:

\frac{1}{Z} \sum_i \exp(-y^i f'(x^i)) = \frac{1}{Z} \sum_i \exp(-y^i [f(x^i) + \alpha h(x^i)])
 \approx \frac{1}{Z} \sum_i \exp(-y^i f(x^i)) \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} (y^i)^2 h^2(x^i) \right]
 = \frac{1}{Z} \sum_i \exp(-y^i f(x^i)) \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right]
 = \sum_i \underbrace{\frac{\exp(-y^i f(x^i))}{Z}}_{D^i} \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right]

where Z = \sum_{i=1}^{N} \exp(-y^i f(x^i)). So our goal is to minimize:

\sum_i D^i \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right]    (5.12)

Finding the optimal h amounts to solving:

h = \arg\max_h \sum_i D^i y^i h(x^i)    (5.13)

and since \sum_i D^i y^i h(x^i) = 1 - 2\epsilon, picking the classifier with minimal \epsilon amounts to finding the optimal h. When optimizing w.r.t. \alpha we get:

\alpha = \arg\min_\alpha \sum_{i=1}^{N} D^i \exp(-\alpha y^i h(x^i)) = \arg\min_\alpha \left[ \epsilon \exp(\alpha) + (1 - \epsilon) \exp(-\alpha) \right]    (5.14)

So, once more, we get the Adaboost algorithm.

5.1.4 Adaboost as a method for estimating the posterior

Consider the expectation of the exponential loss:

E_{X,Y}[\exp(-y f(x))] = \int_x \int_y \exp(-y f(x)) P(x, y) \, dy \, dx = \int_x \left[ \int_y \exp(-y f(x)) P(y | x) \, dy \right] P(x) \, dx
 = \int_x \left[ P(y = 1 | x) \exp(-f(x)) + P(y = -1 | x) \exp(f(x)) \right] P(x) \, dx

If a function f(x) were to minimize this cost, at each point x it should satisfy

f(x) = \arg\min_g \left[ P(y = 1 | x) \exp(-g) + P(y = -1 | x) \exp(g) \right]

or

f(x) = \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}, \quad \text{i.e.} \quad P(y | x) = \frac{1}{1 + \exp(-2 y f(x))}.

In Adaboost we work with functions that are expressed as sums of weak learners. So the optimal f(x) will not necessarily be given by the expression \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}, but rather will be an approximation to it formed by a linear combination of weak learners.

5.2 Learning Theory

For pattern recognition we are interested in small-sample analysis and in bounds on the worst-case performance.

Hoeffding inequality: let Z_1, \ldots, Z_m be i.i.d., drawn from a Bernoulli distribution: P(Z_i) = \phi^{Z_i} (1 - \phi)^{1 - Z_i}. Consider their mean: \hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i. Then for any \gamma > 0,

P(|\phi - \hat{\phi}| > \gamma) \leq 2 \exp(-2 \gamma^2 m)

On a training set \{x^i, y^i\}, i = 1 \ldots m, consider the training error for a hypothesis h:

\hat{e}(h) = \frac{1}{m} \sum_{i=1}^{m} [h(x^i) \neq y^i]

and the generalization error:

e(h) = P_{(x,y) \sim D}([h(x) \neq y])

How different will e be from \hat{e} if the training set is drawn from D?

Consider a Hypothesis Class: the set of all classifiers considered, e.g. for linear classification, all linear classifiers. Consider that we have a finite class:

H = \{h_1, \ldots, h_K\}    (5.15)

e.g. a small set of decision stumps for a few thresholds; there are no parameters to estimate. Take any of these classifiers, h_i. Consider the Bernoulli variable Z generated by sampling (x, y) \sim D and indicating whether h_i(x) \neq y. We can write the empirical error in terms of this Bernoulli variable:

\hat{e}(h_i) = \frac{1}{m} \sum_{j=1}^{m} Z_j

The mean of this variable is e(h_i), i.e. the generalization error of the i-th classifier. We then have from the Hoeffding inequality:

P(|\hat{e}(h_i) - e(h_i)| > \gamma) \leq 2 \exp(-2 \gamma^2 m)

For large m, the training error will be close to the generalization error. Consider all possible classifiers:

P(\exists h \in H : |\hat{e}(h) - e(h)| > \gamma) \leq \sum_{k=1}^{K} P(|\hat{e}(h_k) - e(h_k)| > \gamma) \leq \sum_{k=1}^{K} 2 \exp(-2 \gamma^2 m) = 2 K \exp(-2 \gamma^2 m),

where we have used the Union Bound: P(A_1 \cup A_2 \cup \ldots \cup A_k) \leq P(A_1) + \ldots + P(A_k).

Consider that we have m training points, K hypotheses, and want to find a \gamma such that P(|\hat{e} - e| > \gamma) \leq \delta. Plugging into the above equation we have \gamma = \sqrt{\frac{1}{2m} \log \frac{2K}{\delta}}. So with probability 1 - \delta we have, for all h \in H:

|\hat{e}(h) - e(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2K}{\delta}}    (5.16)

What does this mean? Consider h^* to be the best possible hypothesis in terms of generalization error and \hat{h} to be the best in terms of training error. We then have

e(\hat{h}) \leq \hat{e}(\hat{h}) + \gamma \leq \hat{e}(h^*) + \gamma \leq e(h^*) + 2\gamma    (5.17)

The generalization error of the chosen hypothesis is not much larger than that of the optimal one.

Bias/Variance: if we have a small hypothesis set, we may have a large e(\hat{h}), but it will not differ much from e(h^*). If we have a large hypothesis set, e(\hat{h}) becomes smaller, but \gamma increases.
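A small Monte Carlo sketch of the Hoeffding inequality used above; the Bernoulli parameter, sample size, γ and number of trials are arbitrary illustrative choices.

```python
import numpy as np

# Empirically check the Hoeffding bound P(|phi_hat - phi| > gamma) <= 2 exp(-2 gamma^2 m)
# for Bernoulli variables.
rng = np.random.default_rng(0)
phi, m, gamma, n_trials = 0.3, 200, 0.05, 100_000

Z = rng.random((n_trials, m)) < phi       # n_trials independent samples of size m
phi_hat = Z.mean(axis=1)                  # empirical means
freq = np.mean(np.abs(phi_hat - phi) > gamma)
bound = 2 * np.exp(-2 * gamma**2 * m)

print(f"empirical P(|phi_hat - phi| > {gamma}) = {freq:.4f}")
print(f"Hoeffding bound                        = {bound:.4f}")
```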

Machine Learning. Support Vector Machines. Manfred Huber

Machine Learning. Support Vector Machines. Manfred Huber Machine Learning Support Vector Machines Manfred Huber 2015 1 Support Vector Machines Both logistic regression and linear discriminant analysis learn a linear discriminant function to separate the data

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2016 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2015 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Support Vector Machine (SVM) and Kernel Methods

Support Vector Machine (SVM) and Kernel Methods Support Vector Machine (SVM) and Kernel Methods CE-717: Machine Learning Sharif University of Technology Fall 2014 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Sridhar Mahadevan mahadeva@cs.umass.edu University of Massachusetts Sridhar Mahadevan: CMPSCI 689 p. 1/32 Margin Classifiers margin b = 0 Sridhar Mahadevan: CMPSCI 689 p.

More information

Support Vector Machines and Kernel Methods

Support Vector Machines and Kernel Methods 2018 CS420 Machine Learning, Lecture 3 Hangout from Prof. Andrew Ng. http://cs229.stanford.edu/notes/cs229-notes3.pdf Support Vector Machines and Kernel Methods Weinan Zhang Shanghai Jiao Tong University

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

ICS-E4030 Kernel Methods in Machine Learning

ICS-E4030 Kernel Methods in Machine Learning ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This

More information

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Support Vector Machine (SVM) & Kernel CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Linear classifier Which classifier? x 2 x 1 2 Linear classifier Margin concept x 2

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

CS-E4830 Kernel Methods in Machine Learning

CS-E4830 Kernel Methods in Machine Learning CS-E4830 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 27. September, 2017 Juho Rousu 27. September, 2017 1 / 45 Convex optimization Convex optimisation This

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized

More information

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training

More information

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs

Support Vector Machines for Classification and Regression. 1 Linearly Separable Data: Hard Margin SVMs E0 270 Machine Learning Lecture 5 (Jan 22, 203) Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in

More information

Lecture Support Vector Machine (SVM) Classifiers

Lecture Support Vector Machine (SVM) Classifiers Introduction to Machine Learning Lecturer: Amir Globerson Lecture 6 Fall Semester Scribe: Yishay Mansour 6.1 Support Vector Machine (SVM) Classifiers Classification is one of the most important tasks in

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique

More information

Support Vector Machines for Classification and Regression

Support Vector Machines for Classification and Regression CIS 520: Machine Learning Oct 04, 207 Support Vector Machines for Classification and Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may

More information

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1396 1 / 31 Table of contents 1 Introduction

More information

Support Vector Machine

Support Vector Machine Andrea Passerini passerini@disi.unitn.it Machine Learning Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)

More information

Support Vector Machines, Kernel SVM

Support Vector Machines, Kernel SVM Support Vector Machines, Kernel SVM Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 27, 2017 1 / 40 Outline 1 Administration 2 Review of last lecture 3 SVM

More information

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction

Linear vs Non-linear classifier. CS789: Machine Learning and Neural Network. Introduction Linear vs Non-linear classifier CS789: Machine Learning and Neural Network Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Linear classifier is in the

More information

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015

EE613 Machine Learning for Engineers. Kernel methods Support Vector Machines. jean-marc odobez 2015 EE613 Machine Learning for Engineers Kernel methods Support Vector Machines jean-marc odobez 2015 overview Kernel methods introductions and main elements defining kernels Kernelization of k-nn, K-Means,

More information

Support Vector Machines for Classification: A Statistical Portrait

Support Vector Machines for Classification: A Statistical Portrait Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,

More information

CS798: Selected topics in Machine Learning

CS798: Selected topics in Machine Learning CS798: Selected topics in Machine Learning Support Vector Machine Jakramate Bootkrajang Department of Computer Science Chiang Mai University Jakramate Bootkrajang CS798: Selected topics in Machine Learning

More information

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input

More information

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines

Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation

More information

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines

Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Non-Bayesian Classifiers Part II: Linear Discriminants and Support Vector Machines Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2018 CS 551, Fall

More information

Announcements - Homework

Announcements - Homework Announcements - Homework Homework 1 is graded, please collect at end of lecture Homework 2 due today Homework 3 out soon (watch email) Ques 1 midterm review HW1 score distribution 40 HW1 total score 35

More information

Machine Learning. Lecture 6: Support Vector Machine. Feng Li.

Machine Learning. Lecture 6: Support Vector Machine. Feng Li. Machine Learning Lecture 6: Support Vector Machine Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Warm Up 2 / 80 Warm Up (Contd.)

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Regression with Numerical Optimization. Logistic

Regression with Numerical Optimization. Logistic CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204

More information

Homework 4. Convex Optimization /36-725

Homework 4. Convex Optimization /36-725 Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

CSC 411 Lecture 17: Support Vector Machine

CSC 411 Lecture 17: Support Vector Machine CSC 411 Lecture 17: Support Vector Machine Ethan Fetaya, James Lucas and Emad Andrews University of Toronto CSC411 Lec17 1 / 1 Today Max-margin classification SVM Hard SVM Duality Soft SVM CSC411 Lec17

More information

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.

Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods

More information

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington

Support Vector Machines. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington Support Vector Machines CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 A Linearly Separable Problem Consider the binary classification

More information

L5 Support Vector Classification

L5 Support Vector Classification L5 Support Vector Classification Support Vector Machine Problem definition Geometrical picture Optimization problem Optimization Problem Hard margin Convexity Dual problem Soft margin problem Alexander

More information

Machine Learning A Geometric Approach

Machine Learning A Geometric Approach Machine Learning A Geometric Approach CIML book Chap 7.7 Linear Classification: Support Vector Machines (SVM) Professor Liang Huang some slides from Alex Smola (CMU) Linear Separator Ham Spam From Perceptron

More information

Support Vector Machines

Support Vector Machines EE 17/7AT: Optimization Models in Engineering Section 11/1 - April 014 Support Vector Machines Lecturer: Arturo Fernandez Scribe: Arturo Fernandez 1 Support Vector Machines Revisited 1.1 Strictly) Separable

More information

Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research

Introduction to Machine Learning Lecture 11. Mehryar Mohri Courant Institute and Google Research Introduction to Machine Learning Lecture 11 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Boosting Mehryar Mohri - Introduction to Machine Learning page 2 Boosting Ideas Main idea:

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

(Kernels +) Support Vector Machines

(Kernels +) Support Vector Machines (Kernels +) Support Vector Machines Machine Learning Torsten Möller Reading Chapter 5 of Machine Learning An Algorithmic Perspective by Marsland Chapter 6+7 of Pattern Recognition and Machine Learning

More information

Support Vector Machines

Support Vector Machines CS229 Lecture notes Andrew Ng Part V Support Vector Machines This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe is indeed the best)

More information

Lecture 4: Types of errors. Bayesian regression models. Logistic regression

Lecture 4: Types of errors. Bayesian regression models. Logistic regression Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture

More information

Lecture Notes on Support Vector Machine

Lecture Notes on Support Vector Machine Lecture Notes on Support Vector Machine Feng Li fli@sdu.edu.cn Shandong University, China 1 Hyperplane and Margin In a n-dimensional space, a hyper plane is defined by ω T x + b = 0 (1) where ω R n is

More information

Introduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem

Introduction to Machine Learning Spring 2018 Note Duality. 1.1 Primal and Dual Problem CS 189 Introduction to Machine Learning Spring 2018 Note 22 1 Duality As we have seen in our discussion of kernels, ridge regression can be viewed in two ways: (1) an optimization problem over the weights

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Support vector machines (SVMs) are one of the central concepts in all of machine learning. They are simply a combination of two ideas: linear classification via maximum (or optimal

More information

LECTURE 7 Support vector machines

LECTURE 7 Support vector machines LECTURE 7 Support vector machines SVMs have been used in a multitude of applications and are one of the most popular machine learning algorithms. We will derive the SVM algorithm from two perspectives:

More information

Review: Support vector machines. Machine learning techniques and image analysis

Review: Support vector machines. Machine learning techniques and image analysis Review: Support vector machines Review: Support vector machines Margin optimization min (w,w 0 ) 1 2 w 2 subject to y i (w 0 + w T x i ) 1 0, i = 1,..., n. Review: Support vector machines Margin optimization

More information

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods

Foundation of Intelligent Systems, Part I. SVM s & Kernel Methods Foundation of Intelligent Systems, Part I SVM s & Kernel Methods mcuturi@i.kyoto-u.ac.jp FIS - 2013 1 Support Vector Machines The linearly-separable case FIS - 2013 2 A criterion to select a linear classifier:

More information

Introduction to Support Vector Machines

Introduction to Support Vector Machines Introduction to Support Vector Machines Shivani Agarwal Support Vector Machines (SVMs) Algorithm for learning linear classifiers Motivated by idea of maximizing margin Efficient extension to non-linear

More information

Support Vector Machines: Maximum Margin Classifiers

Support Vector Machines: Maximum Margin Classifiers Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 16, 2008 Piotr Mirowski Based on slides by Sumit Chopra and Fu-Jie Huang 1 Outline What is behind

More information

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

More information

Least Squares Regression

Least Squares Regression E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute

More information

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning

Kernel Machines. Pradeep Ravikumar Co-instructor: Manuela Veloso. Machine Learning Kernel Machines Pradeep Ravikumar Co-instructor: Manuela Veloso Machine Learning 10-701 SVM linearly separable case n training points (x 1,, x n ) d features x j is a d-dimensional vector Primal problem:

More information

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes

About this class. Maximizing the Margin. Maximum margin classifiers. Picture of large and small margin hyperplanes About this class Maximum margin classifiers SVMs: geometric derivation of the primal problem Statement of the dual problem The kernel trick SVMs as the solution to a regularization problem Maximizing the

More information

Support Vector Machine (continued)

Support Vector Machine (continued) Support Vector Machine continued) Overlapping class distribution: In practice the class-conditional distributions may overlap, so that the training data points are no longer linearly separable. We need

More information

Machine Learning And Applications: Supervised Learning-SVM

Machine Learning And Applications: Supervised Learning-SVM Machine Learning And Applications: Supervised Learning-SVM Raphaël Bournhonesque École Normale Supérieure de Lyon, Lyon, France raphael.bournhonesque@ens-lyon.fr 1 Supervised vs unsupervised learning Machine

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel

> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation

More information

Introduction to Logistic Regression and Support Vector Machine

Introduction to Logistic Regression and Support Vector Machine Introduction to Logistic Regression and Support Vector Machine guest lecturer: Ming-Wei Chang CS 446 Fall, 2009 () / 25 Fall, 2009 / 25 Before we start () 2 / 25 Fall, 2009 2 / 25 Before we start Feel

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Ryan M. Rifkin Google, Inc. 2008 Plan Regularization derivation of SVMs Geometric derivation of SVMs Optimality, Duality and Large Scale SVMs The Regularization Setting (Again)

More information

Support Vector Machines

Support Vector Machines Support Vector Machines Bingyu Wang, Virgil Pavlu March 30, 2015 based on notes by Andrew Ng. 1 What s SVM The original SVM algorithm was invented by Vladimir N. Vapnik 1 and the current standard incarnation

More information

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask

Machine Learning and Data Mining. Support Vector Machines. Kalev Kask Machine Learning and Data Mining Support Vector Machines Kalev Kask Linear classifiers Which decision boundary is better? Both have zero training error (perfect training accuracy) But, one of them seems

More information

Least Squares Regression

Least Squares Regression CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the

More information

10-701/ Machine Learning - Midterm Exam, Fall 2010

10-701/ Machine Learning - Midterm Exam, Fall 2010 10-701/15-781 Machine Learning - Midterm Exam, Fall 2010 Aarti Singh Carnegie Mellon University 1. Personal info: Name: Andrew account: E-mail address: 2. There should be 15 numbered pages in this exam

More information

Midterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam.

Midterm. Introduction to Machine Learning. CS 189 Spring You have 1 hour 20 minutes for the exam. CS 189 Spring 2013 Introduction to Machine Learning Midterm You have 1 hour 20 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. Please use non-programmable calculators

More information

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang

Machine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution

More information

COMP 875 Announcements

COMP 875 Announcements Announcements Tentative presentation order is out Announcements Tentative presentation order is out Remember: Monday before the week of the presentation you must send me the final paper list (for posting

More information

Jeff Howbert Introduction to Machine Learning Winter

Jeff Howbert Introduction to Machine Learning Winter Classification / Regression Support Vector Machines Jeff Howbert Introduction to Machine Learning Winter 2012 1 Topics SVM classifiers for linearly separable classes SVM classifiers for non-linearly separable

More information

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22

Outline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems

More information

Support vector machines

Support vector machines Support vector machines Guillaume Obozinski Ecole des Ponts - ParisTech SOCN course 2014 SVM, kernel methods and multiclass 1/23 Outline 1 Constrained optimization, Lagrangian duality and KKT 2 Support

More information

Convex Optimization M2

Convex Optimization M2 Convex Optimization M2 Lecture 8 A. d Aspremont. Convex Optimization M2. 1/57 Applications A. d Aspremont. Convex Optimization M2. 2/57 Outline Geometrical problems Approximation problems Combinatorial

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Lecture 2 Machine Learning Review

Lecture 2 Machine Learning Review Lecture 2 Machine Learning Review CMSC 35246: Deep Learning Shubhendu Trivedi & Risi Kondor University of Chicago March 29, 2017 Things we will look at today Formal Setup for Supervised Learning Things

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron

Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer

More information

AdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague

AdaBoost. Lecturer: Authors: Center for Machine Perception Czech Technical University, Prague AdaBoost Lecturer: Jan Šochman Authors: Jan Šochman, Jiří Matas Center for Machine Perception Czech Technical University, Prague http://cmp.felk.cvut.cz Motivation Presentation 2/17 AdaBoost with trees

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

CIS 520: Machine Learning Oct 09, Kernel Methods

CIS 520: Machine Learning Oct 09, Kernel Methods CIS 520: Machine Learning Oct 09, 207 Kernel Methods Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture They may or may not cover all the material discussed

More information

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016

Lecture 7. Logistic Regression. Luigi Freda. ALCOR Lab DIAG University of Rome La Sapienza. December 11, 2016 Lecture 7 Logistic Regression Luigi Freda ALCOR Lab DIAG University of Rome La Sapienza December 11, 2016 Luigi Freda ( La Sapienza University) Lecture 7 December 11, 2016 1 / 39 Outline 1 Intro Logistic

More information

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University

Lecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations

More information

Machine Learning 2017

Machine Learning 2017 Machine Learning 2017 Volker Roth Department of Mathematics & Computer Science University of Basel 21st March 2017 Volker Roth (University of Basel) Machine Learning 2017 21st March 2017 1 / 41 Section

More information

Polyhedral Computation. Linear Classifiers & the SVM

Polyhedral Computation. Linear Classifiers & the SVM Polyhedral Computation Linear Classifiers & the SVM mcuturi@i.kyoto-u.ac.jp Nov 26 2010 1 Statistical Inference Statistical: useful to study random systems... Mutations, environmental changes etc. life

More information

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013

Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 3 April 5, 2013 Due: April 19, 2013 A. Kernels 1. Let X be a finite set. Show that the kernel

More information

Machine Learning Lecture 5

Machine Learning Lecture 5 Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory

More information

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation

Machine Learning - MT & 5. Basis Expansion, Regularization, Validation Machine Learning - MT 2016 4 & 5. Basis Expansion, Regularization, Validation Varun Kanade University of Oxford October 19 & 24, 2016 Outline Basis function expansion to capture non-linear relationships

More information

Pattern Recognition 2018 Support Vector Machines

Pattern Recognition 2018 Support Vector Machines Pattern Recognition 2018 Support Vector Machines Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Pattern Recognition 1 / 48 Support Vector Machines Ad Feelders ( Universiteit Utrecht

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines

CS6375: Machine Learning Gautam Kunapuli. Support Vector Machines Gautam Kunapuli Example: Text Categorization Example: Develop a model to classify news stories into various categories based on their content. sports politics Use the bag-of-words representation for this

More information

SVM and Kernel machine

SVM and Kernel machine SVM and Kernel machine Stéphane Canu stephane.canu@litislab.eu Escuela de Ciencias Informáticas 20 July 3, 20 Road map Linear SVM Linear classification The margin Linear SVM: the problem Optimization in

More information

Discriminative Models

Discriminative Models No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models

More information