Machine Learning in Computer Vision
MVA 20 - Lecture Notes

Iasonas Kokkinos
Ecole Centrale Paris
iasonas.kokkinos@ecp.fr

October 8, 20
Chapter 1 Introduction

This is an informal set of notes for the first part of the class. They are by no means complete, and do not substitute for attending class. My main intent is to give you a skimmed version of the class slides, which you will hopefully find easier to consult.

These notes were compiled based on several standard textbooks, including:

- T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer 2009.
- V. Vapnik, The Nature of Statistical Learning Theory, Springer 2000.
- C. Bishop, Pattern Recognition and Machine Learning, Springer 2006.

Apart from these, helpful resources can be found online. In particular, Andrew Ng has a highly recommended set of handouts here: http://cs229.stanford.edu/materials.html while the more theoretically inclined can consult Robert Schapire's course: http://www.cs.princeton.edu/courses/archive/spring08/cos5/schedule.html and Peter Bartlett's course: http://www.cs.berkeley.edu/ bartlett/courses/28b-sp08/ for more extensive introductions to learning theory, boosting and SVMs.

As this is the first edition of these lecture notes, there will inevitably be inconsistencies in notation with the slides, typos, etc. Your feedback will be most valuable in improving them for next year.
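Chapter 2 Linear Regression

Assumed form of discriminant function:

$$f(x, w) = \sum_{j=1}^{M} w_j x_j + w_0 = \sum_{j=0}^{M} w_j x_j, \quad x_0 = 1 \tag{2.1}$$

In the more general case:

$$f(x, w) = \sum_{j=0}^{M} w_j B_j(x) \tag{2.2}$$

where $B_j(x)$ can be a nonlinear function of $x$.

We want to minimize the empirical loss (training loss) of our classifier, quantified in terms of the following criterion:

$$L(w) = \sum_{i=1}^{N} l(y^i, f(x^i, w)) = \sum_{i=1}^{N} \big(y^i - f(x^i; w)\big)^2 = \sum_{i=1}^{N} \Big(y^i - \sum_{j} w_j x^i_j\Big)^2$$

We introduce vector/matrix notation:

$$y = \begin{bmatrix} y^1 \\ y^2 \\ \vdots \\ y^N \end{bmatrix}_{N \times 1}, \qquad X = \begin{bmatrix} x^1_1 & x^1_2 & \dots & x^1_M \\ \vdots & \vdots & & \vdots \\ x^N_1 & x^N_2 & \dots & x^N_M \end{bmatrix}_{N \times M}$$

and the cost becomes:

$$L(w) = (y - Xw)^T (y - Xw) = e^T e,$$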
where $e = y - Xw$ is called the residual vector. This is why the criterion is also known as the residual-sum-of-squares cost. Optimizing the training criterion gives:

$$\frac{\partial L(w)}{\partial w} = -2X^T(y - Xw) = 0 \;\Rightarrow\; X^T y = X^T X \hat{w} \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y \tag{2.3}$$

$$\hat{y} = X\hat{w} = X (X^T X)^{-1} X^T y \tag{2.4}$$

2.1 Ridge regression

What happens when the input features are correlated? As an extreme case, consider $x^i_1 = x^i_2, \forall i$. Then the matrix $X^T X$ becomes ill-conditioned (in this extreme case, singular) and the corresponding coefficients can take on arbitrary values: a large positive value of one coefficient can be canceled by a large negative one.

Therefore we regularize our problem by introducing a penalty $C(w)$ on the expansion coefficients:

$$L(w, \lambda) = e^T e + \lambda C(w) \tag{2.5}$$

where $\lambda$ determines the tradeoff between the data term $e^T e$ and the regularizer. For the case of ridge regression, we penalize the $l_2$ norm of the weight vector, i.e. $\|w\|_2^2 = \sum_{k=1}^{K} w_k^2 = w^T w$. The cost function becomes:

$$L_{ridge}(w) = e^T e + \lambda w^T w = y^T y - 2 w^T X^T y + w^T \left(X^T X + \lambda I\right) w \tag{2.6}$$

The condition for an optimum of $L_{ridge}(w)$ gives:

$$w^* = \left(X^T X + \lambda I\right)^{-1} X^T y \tag{2.7}$$

So the estimated labels will be:

$$\hat{y} = X w^* \tag{2.8}$$

2.1.1 Lasso

Instead of the $l_2$ norm we now penalize the $l_1$ norm of the vector $w$:

$$E_{Lasso}(w, \lambda) = (y - Xw)^T (y - Xw) + \lambda \sum_{k=1}^{K} |w_k| \tag{2.9}$$

The derivative of $E_{Lasso}$ w.r.t. $w_i$ at any point with $w_i \neq 0$ will be given by:

$$\frac{\partial E_{Lasso}(w)}{\partial w_i} = \lambda\, \mathrm{sign}(w_i) + 2 (y - Xw)^T \frac{\partial (y - Xw)}{\partial w_i} = \lambda\, \mathrm{sign}(w_i) - 2 (y - Xw)^T X_i,$$

where $X_i$ denotes the $i$-th column of $X$.
If we use, for instance, gradient descent to minimize $E_{Lasso}$, the contribution of the penalty to the update of $w_i$ will not depend on the magnitude of $w_i$, but only on its sign. Note that this is in contrast to the ridge regression case:

$$\frac{\partial E_{Ridge}(w)}{\partial w_i} = 2\lambda w_i + 2 (y - Xw)^T \frac{\partial (y - Xw)}{\partial w_i},$$

where small-in-magnitude values of $w_i$ can survive, as they become more slowly suppressed as they decrease. This difference, due to the different norms used, leads to sparser solutions in Lasso.

2.2 Selection of λ

If we optimize the training criterion w.r.t. $\lambda$ directly, $\lambda = 0$ is the obvious, but trivial solution. This solution will suffer from ill-posedness: as mentioned earlier, the matrix $X^T X$ may be poorly conditioned for correlated $X$. So, the parameters of the model may exhibit high variance, depending on the specific training set used. What $\lambda$ does is regularize our problem, i.e. impose some constraints that render it well-posed. We first see why this is so.

2.2.1 Bias-variance decomposition of error

Assume the actual generation process is

$$y = g(x) + \epsilon \tag{2.10}$$

with zero-mean noise $\epsilon$ of variance $\sigma^2$. Our model is

$$\hat{y} = \hat{g}(x) = f_{\hat{w}}(x), \quad \hat{w} = h(\lambda, S), \tag{2.11}$$

where $S$ is the training set. The expected generalization error at $x_0$ is:

$$\begin{aligned}
Err(x_0) &= E[(y - \hat{g}(x_0))^2] \\
&= E[(y - g(x_0) + g(x_0) - \hat{g}(x_0))^2] \\
&= E[((y - g(x_0)) + (g(x_0) - E[\hat{g}(x_0)]) + (E[\hat{g}(x_0)] - \hat{g}(x_0)))^2] \\
&= \sigma^2 + \underbrace{(g(x_0) - E[\hat{g}(x_0)])^2}_{Bias^2} + \underbrace{E[(E[\hat{g}(x_0)] - \hat{g}(x_0))^2]}_{Variance}
\end{aligned} \tag{2.12}$$

The cross terms vanish because the noise is zero-mean and independent of the training set. Bias is a decreasing function of model complexity. Variance is increasing.

2.2.2 Cross Validation

Setting $\lambda$ allows us to control the tradeoff between bias and variance. Consider two extreme cases:

- $\lambda = 0$: no regularization; ill-posed problem and high variance.
- $\lambda = \infty$: constant model; high bias.

The best is somewhere in between. Use cross-validation to estimate $\lambda$.
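The closed-form ridge estimate (2.7) and the cross-validation recipe above can be sketched as follows. This is a minimal illustration assuming NumPy; the toy data, the grid of λ values and the helper names `ridge_fit` and `cv_error` are hypothetical choices made for this example, not part of the class material.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate w = (X^T X + lam I)^{-1} X^T y, as in (2.7)."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Mean squared validation error of ridge regression over n_folds folds."""
    indices = np.arange(X.shape[0])
    errors = []
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        w = ridge_fit(X[train], y[train], lam)
        errors.append(np.mean((y[held_out] - X[held_out] @ w) ** 2))
    return float(np.mean(errors))

# Hypothetical toy data: two nearly identical (highly correlated) features.
rng = np.random.default_rng(0)
x1 = rng.normal(size=40)
X = np.column_stack([x1, x1 + 1e-3 * rng.normal(size=40)])
y = x1 + 0.1 * rng.normal(size=40)

lambdas = [1e-6, 1e-3, 1e-1, 1.0, 10.0]   # candidate grid, chosen arbitrarily
cv_errors = [cv_error(X, y, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]
```

The value picked as `best_lambda` depends on the data and on the grid; in practice the grid is usually logarithmic and wider than the one assumed here.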
2.3 SVD-based interpretations (optional reading)

Consider the following Singular Value Decomposition for the $N \times M$ matrix $X$:

$$\underbrace{X}_{N \times M} = \underbrace{U}_{N \times M}\, \underbrace{D}_{M \times M}\, \underbrace{V^T}_{M \times M}, \qquad D = \mathrm{diag}(d_1, d_2, \dots, d_{\min(M,N)}, \underbrace{0, \dots, 0}_{\max(M-N,0)}), \quad d_i \geq d_{i+1}, \; d_i \geq 0 \tag{2.13}$$

while the columns of $U$ and $V$ are orthogonal and unit-norm. We remind that $N$ is the number of training points and $M$ the dimensionality of the features.

Note: this version of the SVD is slightly non-typical; it is the one used in Hastie, Tibshirani and Friedman. The default is for $D$ to be $N \times M$. Typical SVD, for $N > M$:

$$\underbrace{X}_{N \times M} = \underbrace{U}_{N \times N}\, \underbrace{D}_{N \times M}\, \underbrace{V^T}_{M \times M}, \qquad D = \begin{bmatrix} \mathrm{diag}(d_1, d_2, \dots, d_M) \\ (0)_{(N-M) \times M} \end{bmatrix}, \quad d_i \geq d_{i+1}, \; d_i \geq 0$$

A first thing to note is that if we have zero-mean ("centered") data, $X^T X$ will be the covariance matrix of the data (up to a factor $1/N$). Moreover we have:

$$X^T X = V D^T U^T U D V^T = V D^T D V^T = V D^2 V^T \tag{2.14}$$

so $V$ will contain the eigenvectors of the covariance matrix of the data, with eigenvalues $d_j^2$.

Further, we can use the SVD to interpret certain quantities obtained in the previous sections. First, for (2.4), which gives the estimated output $\hat{y}$ in linear regression, we have (assuming all $d_j > 0$):

$$\hat{y} = X \left(X^T X\right)^{-1} X^T y = U D V^T \left(V D^T U^T U D V^T\right)^{-1} V D^T U^T y = U \left(U^T y\right)$$

We can interpret this expression as follows:

- $U$: forms an orthonormal basis for a subspace of $R^N$.
- $C_h = U^T h$: gives us the expansion coefficients for the projection of a vector $h$ on this basis.
- $\hat{h} = U C_h$: is the reconstruction of $h$ on basis $U$.

We therefore have that $U^T y$ gives the projection coefficients for $y$ on $U$. So $\hat{y} = U U^T y$ gives us the projection of $y$ on basis $U$.

For the case of ridge regression, using the expression in (2.8) we get:

$$\hat{y} = X \left(X^T X + \lambda I\right)^{-1} X^T y = U D \left(D^2 + \lambda I\right)^{-1} D U^T y = \sum_{j=1}^{M} u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y$$
Hence the name "shrinkage": the expansion coefficients for the reconstruction of $y$ on the subspace spanned by $U$ are shrunk by $\frac{d_j^2}{d_j^2 + \lambda} < 1$ w.r.t. the values used in linear regression.
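The shrinkage interpretation can be checked numerically. The sketch below assumes NumPy and uses random, hypothetical data; it compares the ridge prediction from the normal equations with the one obtained by shrinking the coefficients $u_j^T y$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, lam = 50, 4, 2.5          # hypothetical sizes and regularization strength
X = rng.normal(size=(N, M))
y = rng.normal(size=N)

# Thin SVD: U is N x M, d holds the M singular values, Vt is M x M.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge prediction from the normal equations, as in (2.7)-(2.8).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)
y_hat_direct = X @ w_ridge

# Same prediction from the SVD: shrink each coefficient u_j^T y
# by the factor d_j^2 / (d_j^2 + lam).
shrink = d**2 / (d**2 + lam)
y_hat_svd = U @ (shrink * (U.T @ y))

# Without regularization all factors equal 1 and we recover the projection U U^T y.
y_hat_ls = U @ (U.T @ y)
```

Directions with small singular values $d_j$ are shrunk the most, which is why ridge regression mainly suppresses the poorly determined directions of the data.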
Chapter 3 Logistic Regression

3.1 Probabilistic transition from linear to logistic regression

Consider that the labels are generated according to the "linear model + Gaussian noise" assumption:

$$P(y^i | x^i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\left(y^i - w^T x^i\right)^2}{2\sigma^2}\right). \tag{3.1}$$

We consider as a merit function the likelihood of the data:

$$C(w) = P(y | X, w) = \prod_{i} P(y^i | x^i, w)$$

or, more conveniently, the log-likelihood:

$$\begin{aligned}
\log C(w) = \log P(y | X) &= \sum_{i} \log P(y^i | x^i) \\
&= c - \frac{1}{2\sigma^2} \sum_{i} (y^i - w^T x^i)^2 \\
&= c - \frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) = c - \frac{1}{2\sigma^2} e^T e
\end{aligned} \tag{3.2}$$

So minimizing the least squares criterion amounts to maximizing the log-likelihood of the labels under the model in (3.1).

But this model makes sense only for continuous observations. In classification the observations are discrete (labels). We therefore consider using a Bernoulli distribution instead:

$$P(y = 1 | x, w) = f(x, w), \qquad P(y = 0 | x, w) = 1 - f(x, w).$$
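We can compactly express these two equations as:

$$P(y | x) = \left(f(x, w)\right)^{y} \left(1 - f(x, w)\right)^{1 - y} \tag{3.3}$$

We can now write the (log-)likelihood of the data as:

$$C(w) = P(y | X, w) = \prod_{i=1}^{N} P(y^i | x^i, w)$$

$$\log C(w) = \log P(y | X, w) = \sum_{i=1}^{N} \log P(y^i | x^i, w) = \sum_{i=1}^{N} y^i \log f(x^i, w) + (1 - y^i) \log(1 - f(x^i, w)) \tag{3.4}$$

Now consider using the following expression for $f(x, w)$:

$$f(x, w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)} \tag{3.5}$$

$g(a) = \frac{1}{1 + \exp(-a)}$ is called the sigmoidal function and has several interesting properties, the most useful ones being:

$$\frac{dg}{da} = g(a)(1 - g(a)), \qquad 1 - g(a) = g(-a)$$

Plugging (3.5) into (3.4), this is the criterion we should be maximizing:

$$\log C(w) = \log P(y | X, w) = \sum_{i=1}^{N} y^i \log g(w^T x^i) + (1 - y^i) \log(1 - g(w^T x^i)) \tag{3.6}$$

A simple approach to maximize it is to use gradient ascent:

$$w_k^{t+1} = w_k^t + \alpha \left.\frac{\partial \log C(w)}{\partial w_k}\right|_{w^t} \tag{3.7}$$

Using the expression in (3.6), we have:

$$\begin{aligned}
\frac{\partial \log C(w)}{\partial w_k} &= \sum_{i=1}^{N} \left[ y^i \frac{1}{g(w^T x^i)} \frac{\partial g(w^T x^i)}{\partial w_k} + (1 - y^i) \frac{1}{1 - g(w^T x^i)} \frac{\partial (1 - g(w^T x^i))}{\partial w_k} \right] \\
&= \sum_{i=1}^{N} \left[ y^i \frac{1}{g(w^T x^i)} - (1 - y^i) \frac{1}{1 - g(w^T x^i)} \right] g(w^T x^i)(1 - g(w^T x^i)) \frac{\partial w^T x^i}{\partial w_k} \\
&= \sum_{i=1}^{N} \left[ y^i (1 - g(w^T x^i)) - (1 - y^i) g(w^T x^i) \right] x^i_k \\
&= \sum_{i=1}^{N} \left[ y^i - g(w^T x^i) \right] x^i_k
\end{aligned}$$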
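We can write this more compactly by using vector notation; using the $N \times 1$ vector $g$ with elements:

$$g_i = g(w^T x^i) \tag{3.8}$$

and the $N$-dimensional $k$-th column $X_k$ of the $N \times M$ matrix $X$, we have:

$$\frac{\partial \log C(w)}{\partial w_k} = [y - g]^T X_k = X_k^T [y - g]$$

Stacking the partial derivatives together we form the Jacobian:

$$J(w) = \frac{d \log C(w)}{dw} = \begin{bmatrix} X_1^T [y - g] \\ \vdots \\ X_M^T [y - g] \end{bmatrix} = X^T [y - g]$$

The gradient-ascent update for $w$ will be:

$$w^{t+1} = w^t + \alpha J(w^t)$$

A more elaborate, but more efficient technique is the Newton-Raphson method. Consider first the Newton method for finding the roots of a 1-D function $f(\theta)$:

$$\theta' = \theta - \frac{f(\theta)}{f'(\theta)}$$

If we want to maximize the function, we need to find roots of its derivative $f'$; so we apply the Newton method to find the roots of $f'(\theta)$:

$$\theta' = \theta - \frac{f'(\theta)}{f''(\theta)}$$

In our case we want to maximize a multidimensional function, which leads us to the Newton-Raphson method:

$$w' = w - \left(H(w)\right)^{-1} J(w). \tag{3.9}$$

We already saw how to compute the Jacobian. For the $(k, j)$-th element of the Hessian we have

$$\begin{aligned}
H(w)(k, j) = \frac{\partial^2 \log C(w)}{\partial w_k \partial w_j} &= \frac{\partial}{\partial w_j} \sum_{i=1}^{N} \left[ y^i - g(w^T x^i) \right] x^i_k \\
&= -\sum_{i=1}^{N} \frac{\partial g(w^T x^i)}{\partial w_j} x^i_k \\
&= -\sum_{i=1}^{N} g(w^T x^i)(1 - g(w^T x^i))\, x^i_j x^i_k
\end{aligned}$$

so that

$$H(w) = -X^T R X, \quad \text{where } R \text{ is diagonal with } R_{i,i} = g(w^T x^i)(1 - g(w^T x^i)) \tag{3.10}$$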
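The gradient $X^T[y - g]$ and the Hessian $-X^T R X$ derived above can be plugged directly into the update (3.9). The following is a minimal sketch assuming NumPy; the synthetic data, the number of iterations and the function name `newton_logistic` are assumptions made for illustration, and no safeguards (step control, handling of separable data) are included.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def newton_logistic(X, y, n_iter=10):
    """Maximize the log-likelihood (3.6) with the Newton-Raphson update (3.9)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = sigmoid(X @ w)                 # g_i = g(w^T x^i), eq. (3.8)
        J = X.T @ (y - g)                  # gradient X^T [y - g]
        R = g * (1.0 - g)                  # diagonal entries of R, eq. (3.10)
        H = -(X.T * R) @ X                 # Hessian -X^T R X
        w = w - np.linalg.solve(H, J)      # w' = w - H^{-1} J
    return w

# Hypothetical, non-separable toy data with a constant feature x_0 = 1.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])
y = (rng.uniform(size=200) < sigmoid(0.5 + 2.0 * x)).astype(float)

w_hat = newton_logistic(X, y)
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ w_hat)))
```

At the maximum the gradient vanishes, so `grad_norm` serves as a simple convergence check. On linearly separable data the entries of `R` tend to zero and the linear solve becomes ill-conditioned, which is one motivation for regularization.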
This Newton-Raphson update is called Iteratively Reweighted Least Squares (IRLS) in Statistics; to see why this is so, we rewrite the update as:

$$\begin{aligned}
w' &= w + \left(X^T R X\right)^{-1} X^T (y - g) \\
&= \left(X^T R X\right)^{-1} \left[ \left(X^T R X\right) w + X^T (y - g) \right] \\
&= \left(X^T R X\right)^{-1} X^T R \left[ X w + R^{-1} (y - g) \right] \\
&= \left(X^T R X\right)^{-1} X^T R z
\end{aligned}$$

where we have introduced the "adjusted response":

$$z = Xw + R^{-1}(y - g), \qquad z_i = \underbrace{(x^i)^T w}_{\text{previous value}} + \frac{1}{g_i (1 - g_i)} (y^i - g_i). \tag{3.11}$$

Using this new vector $z$, we can interpret the vector $w'$ as being the solution to a weighted least squares problem:

$$w' = \arg\min_{w} \sum_{i=1}^{N} R_{i,i} \left(z_i - w^T x^i\right)^2 \tag{3.12}$$

Note that apart from $z$, also the matrix $R$ is updated at each iteration; this explains the "iteratively re-" prefix in the IRLS name.

3.2 Log-loss

We simplify the expression for our training criterion to see what logistic regression does. We therefore introduce:

$$y_{\pm} = 2 y_b - 1 \tag{3.13}$$
$$y_b \in \{0, 1\} \tag{3.14}$$
$$y_{\pm} \in \{-1, 1\} \tag{3.15}$$
$$y_b = \frac{y_{\pm} + 1}{2}, \tag{3.16}$$

where $y_b$ is the label we have been using so far and $y_{\pm}$ is a new label that makes certain expressions easier. Specifically, we have:

$$P(y = 1 | f(x)) = \frac{1}{1 + \exp(-f(x))} \tag{3.17}$$
$$P(y = -1 | f(x)) = 1 - P(y = 1 | f(x)) \tag{3.18}$$
$$= \frac{1}{\exp(f(x)) + 1} \tag{3.19}$$
$$P(y^i | f(x^i)) = \frac{1}{1 + \exp(-y^i f(x^i))} \tag{3.20}$$
So we can write:

$$L(w) = -\log C(w) = \sum_{i=1}^{N} \log\left(1 + \exp(-y^i_{\pm} f(x^i))\right) \tag{3.21}$$

The quantity

$$l(y^i, x^i) = \log\left(1 + \exp(-y^i_{\pm} f(x^i))\right) \tag{3.22}$$

is called the log-loss.
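As a quick sanity check, the log-loss (3.22) written with $\pm 1$ labels should coincide, point by point, with the negative Bernoulli log-likelihood of (3.6) written with $\{0, 1\}$ labels. A small sketch assuming NumPy, with arbitrary hypothetical scores and labels:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

f = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])   # hypothetical scores f(x^i) = w^T x^i
y_b = np.array([0.0, 1.0, 1.0, 0.0, 1.0])   # labels in {0, 1}
y_pm = 2.0 * y_b - 1.0                      # labels in {-1, +1}, eq. (3.13)

# Negative Bernoulli log-likelihood per point, from (3.6).
nll = -(y_b * np.log(sigmoid(f)) + (1.0 - y_b) * np.log(1.0 - sigmoid(f)))

# Log-loss per point, eq. (3.22).
log_loss = np.log(1.0 + np.exp(-y_pm * f))
```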
Chapter 4 Support Vector Machines

4.1 Large Margin classifiers

In linear classification, we are concerned with the construction of a separating hyperplane: $w^T x = 0$. For instance, logistic regression provides an estimate of:

$$P(y = 1 | x; w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}$$

and provides a decision rule using:

$$\hat{y}^i = \begin{cases} 1 & \text{if } P(y = 1 | x; w) \geq 0.5 \\ 0 & \text{if } P(y = 1 | x; w) < 0.5 \end{cases} \tag{4.1}$$

Intuitively, a confident decision is made when $|w^T x|$ is large. So we would like to have:

$$(w^T x^i) \gg 0 \text{ if } y^i = 1, \qquad (w^T x^i) \ll 0 \text{ if } y^i = -1,$$

or, combined:

$$y^i (w^T x^i) \gg 0.$$

Given a training set, we want to find a decision boundary that allows us to make correct and confident predictions on all training examples.

For points near the decision boundary, even though we may be correct, small changes in the training set could cause different classifications; so our classifier will not be confident. If a point is far from the separating hyperplane then we are more confident in our predictions. In other words, we would like to have all points being far from (and on the correct side of) the decision plane.
4.1.1 Notation

Instead of

$$h(x) = g(w^T x) \tag{4.2}$$

we now consider classifiers of the form:

$$h(x) = g(w^T x + b) \tag{4.3}$$

($b \leftrightarrow w_0$), where $g(z) = 1$ if $z > 0$, and $g(z) = -1$ if $z < 0$. Unlike logistic regression, where $g$ gives a probability estimate, here the classifier directly outputs a decision.

4.1.2 Margins

The confidence scores described above are called functional margins: scaling $w, b$ by two doubles the margin, without essentially changing the classifier. We therefore now turn to geometric margins, which are defined in terms of the unit-norm vector $\frac{w}{\|w\|}$. We first express any point as:

$$x = x_{\perp} + \gamma \frac{w}{\|w\|}$$

where $x_{\perp}$ is the projection of $x$ on the decision plane, $w^T x_{\perp} + b = 0$, $\frac{w}{\|w\|}$ is a unit-norm vector perpendicular to the decision plane, and $\gamma$ is thus the distance of $x$ from the decision plane. Solving for $\gamma$ we have:

$$w^T x = w^T x_{\perp} + \gamma \frac{w^T w}{\|w\|} = -b + \gamma \|w\| \;\Rightarrow\; \gamma = \frac{w^T x + b}{\|w\|} = \left(\frac{w}{\|w\|}\right)^T x + \frac{b}{\|w\|}$$

The above analysis works for the case where $y = 1$. If $y = -1$, we would like $w^T x + b$ to be negative; we therefore write:

$$\gamma^i = y^i \left( \left(\frac{w}{\|w\|}\right)^T x^i + \frac{b}{\|w\|} \right)$$

For a training set $\{x^i, y^i\}$ and a classifier $w$ we define its geometric margin as:

$$\gamma = \min_i \gamma^i \tag{4.4}$$

which indicates the worst performance of the classifier on the training set.
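The difference between functional and geometric margins is easy to see numerically: rescaling $(w, b)$ changes the former but not the latter. A short sketch assuming NumPy, on a hypothetical four-point training set:

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """gamma = min_i y^i (w^T x^i + b) / ||w||, following (4.4)."""
    return float(np.min(y * (X @ w + b)) / np.linalg.norm(w))

# Hypothetical separable toy set with labels in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([1.0, 1.0]), 0.0

functional = float(np.min(y * (X @ w + b)))                   # functional margin
functional_scaled = float(np.min(y * (X @ (2 * w) + 2 * b)))  # doubles with (w, b)
gamma = geometric_margin(w, b, X, y)                          # geometric margin
gamma_scaled = geometric_margin(2 * w, 2 * b, X, y)           # unchanged by scaling
```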
4.1.3 Introducing margins in classification

Our task is to classify all points with maximal geometric margin. If we consider the case where the points are separable, we have the following optimization problem:

$$\max_{\gamma, w, b} \gamma \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq \gamma, \; i = 1 \dots M, \qquad \|w\| = 1$$

In words: maximize $\gamma$, where all points have at least $\gamma$ geometric margin.

It is hard to incorporate the $\|w\| = 1$ constraint (it is non-convex; for two-dimensional inputs, $w$ should lie on a circle). We therefore divide both sides by $\|w\|$:

$$\max_{\gamma, w, b} \frac{\gamma}{\|w\|} \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq \gamma, \; i = 1 \dots M$$

From the constraint, since $w$ is no longer constrained to have unit norm, we realize that $\gamma$ is a functional margin.

Suppose we find a combination of $w, b$ that solves the optimization problem. We get a one-dimensional family of solutions by scaling $w, b, \gamma$ by a common constant. We discard this ambiguity by considering only those values of $w, b$ for which $\gamma = 1$, which gives us the following problem:

$$\max_{w, b} \frac{1}{\|w\|} \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq 1, \; i = 1 \dots M$$

Equivalently:

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq 1, \; i = 1 \dots M$$

So we end up with a quadratic optimization problem, involving a convex (quadratic) function of $w$ and linear constraints on $w$.

4.2 Duality

Consider a general constrained optimization problem:

$$\min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \; i = 1 \dots l, \qquad g_i(w) \leq 0, \; i = 1 \dots m$$

This is equivalent to unconstrained optimization:

$$\min_w f_{uc}(w) = f(w) + \sum_{i=1}^{l} I_0(h_i(w)) + \sum_{i=1}^{m} I_+(g_i(w)) \tag{4.5}$$
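where $I_0, I_+$ are hard constraints:

$$I_0(x) = \begin{cases} 0, & x = 0 \\ \infty, & x \neq 0 \end{cases}, \qquad I_+(x) = \begin{cases} 0, & x \leq 0 \\ \infty, & x > 0 \end{cases}$$

We form a lower bound of this function, by replacing each hard constraint with a soft one:

$$I_0(x_i) \rightarrow \lambda_i x_i, \qquad I_+(x_i) \rightarrow \mu_i x_i, \; \mu_i \geq 0$$

$\lambda, \mu$: penalties for violating the constraints. This gives us the Lagrangian:

$$L(w, \lambda, \mu) = f(w) + \sum_{i=1}^{l} \lambda_i h_i(w) + \sum_{i=1}^{m} \mu_i g_i(w), \quad \mu_i \geq 0$$

We observe that

$$f_{uc}(w) = \max_{\lambda, \mu: \mu_i \geq 0} L(w, \lambda, \mu). \tag{4.6}$$

Therefore

$$f(w^*) = \min_w \max_{\lambda, \mu: \mu_i \geq 0} L(w, \lambda, \mu). \tag{4.7}$$

Now consider the Lagrange dual function:

$$\theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu). \tag{4.8}$$

$\theta(\lambda, \mu)$ is always a lower bound on the optimal value $p^*$ of the original problem. If the original problem has a (feasible) solution $w^*$, then:

$$\begin{aligned}
L(w^*, \lambda, \mu) &= f(w^*) + \sum_{i=1}^{l} \lambda_i h_i(w^*) + \sum_{i=1}^{m} \mu_i g_i(w^*) \\
&\overset{w^*: \text{feasible}}{=} f(w^*) + \sum_{i=1}^{l} \lambda_i \cdot 0 + \sum_{i=1}^{m} \mu_i \underbrace{g_i(w^*)}_{\leq 0} \\
&\overset{\mu_i \geq 0}{\leq} f(w^*).
\end{aligned}$$

So

$$\theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu) \leq L(w^*, \lambda, \mu) \leq f(w^*). \tag{4.9}$$

So we have a function $\theta(\lambda, \mu)$ that lower bounds the minimum cost of our original problem. We want to make this bound as tight as possible, i.e. maximize it.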
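The lower-bound property (4.9) can be illustrated on a one-dimensional toy problem that is not taken from the class material: minimize $f(w) = w^2$ subject to $g(w) = 1 - w \leq 0$, whose optimum is $w^* = 1$ with value $p^* = 1$. The Lagrangian $w^2 + \mu(1 - w)$ is minimized over $w$ at $w = \mu/2$, which gives $\theta(\mu) = \mu - \mu^2/4$. A sketch assuming NumPy:

```python
import numpy as np

# Hypothetical toy problem: minimize f(w) = w^2 subject to g(w) = 1 - w <= 0.
# The primal optimum is w* = 1 with value p* = 1.
p_star = 1.0

def theta(mu):
    """Dual function: inf_w [w^2 + mu (1 - w)], attained at w = mu / 2."""
    w = mu / 2.0
    return w**2 + mu * (1.0 - w)

mus = np.linspace(0.0, 5.0, 501)       # grid of multipliers mu >= 0
theta_values = theta(mus)              # never exceeds p_star
best_mu = float(mus[np.argmax(theta_values)])
```

On this grid every value of `theta_values` stays below `p_star`, and the largest one is reached at `best_mu` close to 2, where the bound becomes tight for this convex problem.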
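New task (dual problem):

$$\max_{\lambda, \mu} \theta(\lambda, \mu) \quad \text{s.t.} \quad \mu_i \geq 0 \; \forall i \tag{4.10}$$

In general,

$$d^* = \max_{\lambda, \mu: \mu_i \geq 0} \theta(\lambda, \mu) = \max_{\lambda, \mu: \mu_i \geq 0} \min_w L(w, \lambda, \mu) \leq \min_w \max_{\lambda, \mu: \mu_i \geq 0} L(w, \lambda, \mu) = \min_w f_{uc}(w) = p^*$$

If $f(w)$ and $g_i(w)$ are convex functions, $h_i(w)$ are affine, and the primal and dual problems are feasible, then (under mild regularity conditions) the bound is tight, i.e. $d^* = p^*$, which means that there exist $w^*, \lambda^*, \mu^*$ meeting all constraints while satisfying:

$$f(w^*) = \theta(\lambda^*, \mu^*) \tag{4.11}$$

At that point we have:

$$\begin{aligned}
f(w^*) = \theta(\lambda^*, \mu^*) &= \inf_w \left[ f(w) + \sum_{i=1}^{l} \lambda^*_i h_i(w) + \sum_{i=1}^{m} \mu^*_i g_i(w) \right] \\
&\leq f(w^*) + \sum_{i=1}^{l} \lambda^*_i h_i(w^*) + \sum_{i=1}^{m} \mu^*_i g_i(w^*) \\
&\leq f(w^*)
\end{aligned}$$

Both inequalities must therefore hold with equality, and since every term $\mu^*_i g_i(w^*)$ is non-positive:

$$\mu^*_i g_i(w^*) = 0, \; \forall i \quad \text{("Complementary Slackness")}.$$

4.2.1 Karush-Kuhn-Tucker Conditions

$L(w^*, \lambda^*, \mu^*)$ is a saddle point (minimum for $w$, maximum for $\lambda, \mu$). From the minimum w.r.t. $w$ we have that

$$\nabla f(w^*) + \sum_{i=1}^{l} \lambda^*_i \nabla h_i(w^*) + \sum_{i=1}^{m} \mu^*_i \nabla g_i(w^*) = 0$$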
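From the maximum w.r.t. $\lambda, \mu$ we have that the constraints are satisfied. Combining with the complementary slackness conditions we have the KKT conditions:

$$\begin{aligned}
h_i(w^*) &= 0 \\
g_i(w^*) &\leq 0 \\
\mu^*_i g_i(w^*) &= 0 \\
\mu^*_i &\geq 0 \\
\nabla f(w^*) + \sum_{i=1}^{l} \lambda^*_i \nabla h_i(w^*) + \sum_{i=1}^{m} \mu^*_i \nabla g_i(w^*) &= 0
\end{aligned}$$

4.2.2 Dual Problem for Large-Margin training

The primal problem is the objective we set up for max-margin training:

$$\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad -y^i (w^T x^i + b) + 1 \leq 0, \; i = 1 \dots M$$

Its Lagrangian will be:

$$L(w, b, \mu) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{M} \mu_i \left[ y^i (w^T x^i + b) - 1 \right] \tag{4.12}$$

At an optimum we have a minimum w.r.t. $w$, which gives:

$$0 = w^* - \sum_{i=1}^{M} \mu_i y^i x^i \;\Rightarrow\; w^* = \sum_{i=1}^{M} \mu_i y^i x^i,$$

and a minimum w.r.t. $b$, which gives:

$$0 = \sum_{i=1}^{M} \mu_i y^i$$

Plugging the optimal values into the Lagrangian we get:

$$\begin{aligned}
\theta(\mu) = L(w^*, b^*, \mu) &= \frac{1}{2} \|w^*\|^2 - \sum_{i=1}^{M} \mu_i \left[ y^i ((w^*)^T x^i + b) - 1 \right] \\
&= \frac{1}{2} \Big( \sum_{i=1}^{M} \mu_i y^i x^i \Big)^T \Big( \sum_{j=1}^{M} \mu_j y^j x^j \Big) - \sum_{i=1}^{M} \mu_i \Big[ y^i \Big( \Big( \sum_{j=1}^{M} \mu_j y^j x^j \Big)^T x^i + b \Big) - 1 \Big] \\
&= \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \mu_i \mu_j y^i y^j (x^i)^T (x^j) - b \sum_{i=1}^{M} \mu_i y^i
\end{aligned}$$

where the last term vanishes because $\sum_i \mu_i y^i = 0$.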
We thereby eliminated $w$ from the problem. For reasons that will become clear later, we write $(x^i)^T (x^j) = \langle x^i, x^j \rangle$ (inner product). The new optimization problem becomes:

$$\max_{\mu} \theta(\mu) = \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \mu_i \mu_j y^i y^j \langle x^i, x^j \rangle \quad \text{s.t.} \quad \mu_i \geq 0 \; \forall i, \qquad \sum_{i=1}^{M} \mu_i y^i = 0$$

If we solve the problem above w.r.t. $\mu$, we can use $\mu$ to determine $w$. One important note is that as

$$w^* = \sum_{i=1}^{M} \mu_i y^i x^i, \tag{4.13}$$

the final decision boundary orientation is a linear combination of the training points. Moreover, from complementary slackness we have:

$$\mu_i g_i(x) = 0, \tag{4.14}$$

where:

$$g_i(x) = 1 - y^i (w^T x^i + b) \leq 0 \;\Leftrightarrow\; y^i (w^T x^i + b) \geq 1 \tag{4.15}$$

Therefore if $\mu_i \neq 0$ then $y^i (w^T x^i + b) = 1$. As $w$ is formed from the set of points for which $\mu_i \neq 0$, this means that only the points that lie on the margin influence the solution. These points are called Support Vectors.

From the set $S$ of support vectors we can determine the best $b^*$:

$$y^i (w^T x^i + b) = 1, \; \forall i \in S \;\Rightarrow\; b^* = \frac{1}{N_S} \sum_{i \in S} \left( y^i - w^T x^i \right) \tag{4.16}$$

4.2.3 Non-separable data

To deal with the case where the data cannot be separated, we introduce positive slack variables $\xi_i$, $i = 1 \dots M$, in the constraints:

$$w^T x^i + b \geq 1 - \xi_i, \quad y^i = 1$$
$$w^T x^i + b \leq -1 + \xi_i, \quad y^i = -1$$

or equivalently:

$$y^i (w^T x^i + b) \geq 1 - \xi_i$$
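If $\xi_i > 1$ an error occurs, and $\sum_i \xi_i$ is an upper bound on the number of errors. We consider a new problem:

$$\min_{w, b, \xi} \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i \quad \text{s.t.} \quad y^i (w^T x^i + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0 \tag{4.17}$$

where $C$ determines the tradeoff between the large margin (low $\|w\|$) and the low error constraint. The Lagrangian of this problem is:

$$L(w, b, \xi, \mu, \nu) = \frac{1}{2} w^T w + C \sum_{i=1}^{M} \xi_i - \sum_{i=1}^{M} \mu_i \left[ y^i (w^T x^i + b) - 1 + \xi_i \right] - \sum_{i=1}^{M} \nu_i \xi_i,$$

and its dual becomes:

$$\max_{\mu} \sum_{i=1}^{M} \mu_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} y^i y^j \mu_i \mu_j \langle x^i, x^j \rangle \quad \text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_{i=1}^{M} \mu_i y^i = 0.$$

The KKT conditions are similar; an interesting change is that we now have:

$$C - \mu_i - \nu_i = 0 \tag{4.18}$$
$$y^i (\langle x^i, w \rangle + b) - 1 + \xi_i \geq 0 \tag{4.19}$$
$$\mu_i \left[ y^i (\langle x^i, w \rangle + b) - 1 + \xi_i \right] = 0 \tag{4.20}$$
$$\nu_i \xi_i = 0 \tag{4.21}$$
$$\xi_i \geq 0 \tag{4.22}$$
$$\mu_i \geq 0 \tag{4.23}$$
$$\nu_i \geq 0 \tag{4.24}$$

which gives:

- $y^i (\langle x^i, w \rangle + b) > 1$: from (4.20) and $\xi_i \geq 0$, $\mu_i = 0$.
- $y^i (\langle x^i, w \rangle + b) < 1$: from (4.19), $\xi_i > 0$; from (4.21), $\nu_i = 0$; from (4.18), $\mu_i = C$.
- $y^i (\langle x^i, w \rangle + b) = 1$: from (4.20), $\mu_i \xi_i = 0$; from (4.18), $\mu_i \in [0, C]$.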
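Conversely, for the slack variables we have:

$$\mu_i \neq 0 \overset{(4.20)}{\Rightarrow} \xi_i = 1 - y^i (\langle x^i, w \rangle + b) \tag{4.25}$$
$$\mu_i = 0 \overset{(4.18)}{\Rightarrow} \nu_i = C \overset{(4.21)}{\Rightarrow} \xi_i = 0 \tag{4.26}$$
$$\xi_i \geq 0 \tag{4.27}$$
$$\Rightarrow \xi_i = \max(0, 1 - y^i (\langle x^i, w \rangle + b)) \tag{4.28}$$

4.2.4 Kernels

Up to now, the decision was based on the sign of

$$y = w^T x = \sum_{i=1}^{N} w_i x_i \tag{4.29}$$

What if we used nonlinear transformations of the data?

$$y = w^T \phi(x) = \sum_{k=1}^{K} w_k \phi_k(x) \tag{4.30}$$

As a concrete example, consider

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}$$

Using such features would allow us to go from linear to quadratic decision boundaries, defined by:

$$\sum_{i} w_i \phi_i(x) = 0 \tag{4.31}$$

We observe that, in the linear case, for any new point $x$, the decision will have the form:

$$y = \mathrm{sign}(\langle w, x \rangle + b) = \mathrm{sign}\Big( \sum_{i} \mu_i y^i \langle x^i, x \rangle + b \Big) \tag{4.32}$$

So all we need to know in order to compute the response at a new $x$ is its inner product with the Support Vectors. So if we plug in our new features, we guess that we should get a classifier of the form:

$$y = \mathrm{sign}\Big( \sum_{i} \mu_i y^i \langle \phi(x^i), \phi(x) \rangle + b \Big) \tag{4.33}$$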
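Actually everything in the original algorithm was written in terms of inner products between input vectors. So we can write everything in terms of inner products of our new features $\phi$; specifically, the dual optimization problem becomes:

$$\max_{\mu} \sum_{i} \mu_i - \frac{1}{2} \sum_{i} \sum_{j} y^i y^j \mu_i \mu_j \langle \phi(x^i), \phi(x^j) \rangle \quad \text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_{i} \mu_i y^i = 0$$

If now we define a Kernel between two points $x$ and $y$ as:

$$K(x, y) = \langle \phi(x), \phi(y) \rangle$$

and replace $\langle x, y \rangle$ with $K(x, y)$, we can write the same algorithm:

$$y = \mathrm{sign}\Big( \sum_{i} \mu_i y^i K(x^i, x) + b \Big) \tag{4.34}$$

$$\max_{\mu} \sum_{i} \mu_i - \frac{1}{2} \sum_{i} \sum_{j} y^i y^j \mu_i \mu_j K(x^i, x^j) \quad \text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_{i} \mu_i y^i = 0$$

As a concrete example, consider the mapping:

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2}\, x_1 x_2 \end{bmatrix}$$

Then

$$\langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2 = K(x, y)$$

In the same way, we can use $K(x, y) = (x^T y + c)^d$ to implement a mapping into a $\binom{N + d}{d}$-dimensional space.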
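The identity $\langle \phi(x), \phi(y) \rangle = (x^T y)^2$ can be verified directly. The sketch below assumes NumPy; the two test points and the helper names `phi` and `poly_kernel` are hypothetical.

```python
import numpy as np
from math import comb

def phi(x):
    """Explicit feature map [x_1^2, x_2^2, sqrt(2) x_1 x_2] for 2-D inputs."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2.0) * x[0] * x[1]])

def poly_kernel(x, y, c=0.0, d=2):
    """Polynomial kernel K(x, y) = (x^T y + c)^d."""
    return (x @ y + c) ** d

x = np.array([1.0, 2.0])     # hypothetical test points
y = np.array([3.0, -1.0])

explicit = float(phi(x) @ phi(y))      # inner product computed in feature space
implicit = float(poly_kernel(x, y))    # same value, without ever computing phi

# Dimension of the feature space for (x^T y + c)^d with N = 2 inputs and d = 2.
feature_dim = comb(2 + 2, 2)
```

The kernel evaluation costs one inner product in the input space, while the explicit map grows combinatorially with the input dimension and the degree.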
Another case:

$$K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right) \tag{4.35}$$

It can be shown that this amounts to an inner product in an infinite-dimensional feature space.

$K$ can be a similarity measure among inputs, even though we may not have a straightforward feature-space interpretation; it allows us to apply SVMs to strings, graphs, documents, etc. Most importantly, it allows us to work in very high dimensional spaces without necessarily finding and/or computing the mappings.

Under what conditions does a kernel function represent an inner product in some space? We want to have $K_{ij} = K(x^i, x^j) = K(x^j, x^i) = K_{ji}$. $K$: Gram matrix. For $m$ points this should be valid $\forall i = 1 \dots m, j = 1 \dots m$. Consider now the quantity:

$$\begin{aligned}
z^T K z &= \sum_{i} \sum_{j} z_i K_{ij} z_j \\
&= \sum_{i} \sum_{j} z_i \phi^T(x^i) \phi(x^j) z_j \\
&= \sum_{i} \sum_{j} \sum_{k} z_i \phi_k(x^i) \phi_k(x^j) z_j \\
&= \sum_{k} \sum_{i} \sum_{j} z_i \phi_k(x^i) \phi_k(x^j) z_j \\
&= \sum_{k} \Big( \sum_{i} z_i \phi_k(x^i) \Big)^2 \geq 0
\end{aligned}$$

$K$ is therefore positive semi-definite and symmetric.

Mercer Kernel: consider a Kernel $K: R^N \times R^N \rightarrow R$; $K$ represents an inner product if and only if $K$ is symmetric positive semi-definite.

4.3 Hinge vs. Log-loss

Consider the SVM training criterion:

$$L(w) = \sum_{i=1}^{m} [\xi_i] + \frac{1}{2C} w^T w = \sum_{i=1}^{m} [1 - y^i (w^T x^i)]_+ + \lambda w^T w \tag{4.36}$$

$[1 - y^i t_i]_+$: hinge loss.
Compared to regularized logistic regression,

$$L(w) = \sum_{i=1}^{m} \log\left(1 + \exp(-y^i (w^T x^i))\right) + \lambda w^T w,$$

the hinge loss is more abrupt: as soon as $y^i (w^T x^i) > 1$, the $i$-th point is declared well-classified and no longer contributes to the solution (i.e. is not a support vector).
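The contrast between the two losses is easy to tabulate. The sketch below assumes NumPy and evaluates both on a few hypothetical margin values $y^i w^T x^i$:

```python
import numpy as np

# Hypothetical values of the margin y^i w^T x^i.
margins = np.array([-2.0, 0.0, 0.5, 1.0, 2.0, 5.0])

hinge = np.maximum(0.0, 1.0 - margins)     # [1 - y^i w^T x^i]_+
log_loss = np.log1p(np.exp(-margins))      # log(1 + exp(-y^i w^T x^i))
```

The hinge loss is exactly zero for every margin of at least 1, whereas the log-loss stays strictly positive and only decays towards zero, so under logistic regression every point keeps a (small) influence on the solution.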
Chapter 5 Adaboost

In many cases it is easy to find some "rules of thumb" that are often correct, but hard to find a single highly accurate prediction rule. Boosting is a general method for converting a set of rules of thumb into highly accurate prediction rules. Assuming that a given "weak" learning algorithm can consistently find classifiers ("rules of thumb") at least slightly better than random, say, with accuracy slightly above 50% (in a two-class setting), then given sufficient data a boosting algorithm can provably construct a single classifier with very high accuracy.

5.1 AdaBoost

Adaboost is a boosting algorithm that provides justifiable answers to the following questions.

First, how to choose examples on each round? Adaboost concentrates on the "hardest" examples, i.e. those most often misclassified by previous rules of thumb.

Second, how to combine rules of thumb into a single prediction rule? Adaboost proposes to take a weighted majority vote of these rules of thumb.

Specifically, we consider that we are provided with a training set: $(x^i, y^i)$, $x^i \in X$, $y^i \in \{-1, 1\}$, $i = 1, \dots, N$. Adaboost maintains a probability density on these data which is initially uniform, and on the way places larger weight on harder examples. At each round a weak classifier $h_t$ ("rule of thumb") is identified that has the best performance on the weighted data, and is used to update (i) the classifier's expression and (ii) the probability density on the data.
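Specifically, Adaboost can be described in terms of the following algorithm:

- Initially, set $D_1^i = \frac{1}{N}, \forall i$.
- For $t = 1 \dots T$:
  - Find the weak classifier $h_t: X \rightarrow \{-1, 1\}$ with the smallest weighted error

    $$\epsilon_t = \sum_{i=1}^{N} D_t^i [y^i \neq h_t(x^i)] \tag{5.1}$$

    where $[\cdot]$ equals 1 if its argument is true and 0 otherwise.
  - Set $\alpha_t = \frac{1}{2} \log \left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$.
  - Update the distribution:

    $$D_{t+1}^i = \frac{D_t^i}{Z_t} \times \begin{cases} \exp(-\alpha_t), & \text{if } y^i = h_t(x^i) \\ \exp(\alpha_t), & \text{if } y^i \neq h_t(x^i) \end{cases} = \frac{D_t^i}{Z_t} \exp(-\alpha_t y^i h_t(x^i)),$$

    where to guarantee that $D_{t+1}$ remains a proper density we normalize by:

    $$Z_t = \sum_{i} D_t^i \exp\left(-\alpha_t y^i h_t(x^i)\right)$$
- Combine the classifier results (weighted vote): $f(x) = \sum_{t} \alpha_t h_t(x)$.
- Output the final classifier:

  $$H(x) = \mathrm{sign}(f(x)) \tag{5.2}$$

5.1.1 Training Error Analysis

We first introduce the edge $\gamma_t$ of the weak learner at the $t$-th round via:

$$\epsilon_t = \frac{1}{2} - \underbrace{\gamma_t}_{\text{Edge}}$$

We will prove that:

$$\underbrace{\frac{1}{N} \sum_{i=1}^{N} [y^i \neq H(x^i)]}_{\text{Final Training Error}} \leq \prod_{t=1}^{T} \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] = \prod_{t=1}^{T} \sqrt{1 - 4\gamma_t^2} \leq \exp\Big( -2 \sum_{t=1}^{T} \gamma_t^2 \Big)$$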
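The challenge lies in proving the first inequality; the second relation comes from the definition of the edge, while the third from the fact that $\sqrt{1 - 2x} \leq \exp(-x)$, $x \in [0, \frac{1}{2}]$.

We have required that the weak learner is always correct more than half of the time on the training data, i.e. that $\gamma_t > 0 \; \forall t$, so if $\forall t \; \gamma_t > \gamma > 0$ we will then have:

$$\frac{1}{N} \sum_{i=1}^{N} [y^i \neq H(x^i)] \leq \exp(-2T\gamma^2) \tag{5.3}$$

which means that the number of training errors will be exponentially decreasing in time.

First, by recursing, we have that:

$$\begin{aligned}
D_{t+1}^i &= \frac{D_t^i}{Z_t} \exp(-\alpha_t y^i h_t(x^i)) \\
&= \frac{D_{t-1}^i}{Z_{t-1}} \exp(-\alpha_{t-1} y^i h_{t-1}(x^i)) \frac{\exp(-\alpha_t y^i h_t(x^i))}{Z_t} \\
&= \frac{D_{t-1}^i \exp\left( -\left[ \alpha_t y^i h_t(x^i) + \alpha_{t-1} y^i h_{t-1}(x^i) \right] \right)}{Z_t Z_{t-1}} \\
&= \dots \\
&= \frac{1}{N} \frac{\exp\left( -y^i \sum_{k=1}^{t} \alpha_k h_k(x^i) \right)}{\prod_{k=1}^{t} Z_k}
\end{aligned} \tag{5.4}$$

or:

$$D_{T+1}^i = \frac{1}{N} \frac{\exp\left(-y^i f(x^i)\right)}{\prod_{k=1}^{T} Z_k} \tag{5.5}$$

which expresses the distribution on the data in terms of the classifier's score on them. We use this result to obtain that:

$$\begin{aligned}
\frac{1}{N} \sum_{i=1}^{N} [y^i \neq H(x^i)] &\overset{H(x) = \mathrm{sign} f(x)}{=} \frac{1}{N} \sum_{i=1}^{N} \begin{cases} 1, & y^i f(x^i) \leq 0 \\ 0, & y^i f(x^i) > 0 \end{cases} \\
&\leq \frac{1}{N} \sum_{i=1}^{N} \exp(-y^i f(x^i)) \\
&= \sum_{i=1}^{N} D_{T+1}^i \prod_{k=1}^{T} Z_k = \prod_{k=1}^{T} Z_k
\end{aligned}$$

which relates the number of errors with the normalizing constants $Z_k$ at each round.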
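We set out to prove that:

$$\underbrace{\frac{1}{N} \sum_{i=1}^{N} [y^i \neq H(x^i)]}_{\text{Final Training Error}} \leq \prod_{t=1}^{T} \left[ 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \right] \tag{5.6}$$

so, given the last result, all we need to prove is that:

$$Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}$$

The key to this is the choice we made for $\alpha$:

$$\alpha = \frac{1}{2} \log\left( \frac{1 - \epsilon}{\epsilon} \right) \tag{5.7}$$

Specifically, we have:

$$\begin{aligned}
Z_t &= \sum_{i} D_t^i \exp(-\alpha_t y^i h_t(x^i)) \\
&= \sum_{i: y^i \neq h_t(x^i)} D_t^i \exp(-\alpha_t y^i h_t(x^i)) + \sum_{i: y^i = h_t(x^i)} D_t^i \exp(-\alpha_t y^i h_t(x^i)) \\
&= \sum_{i: y^i \neq h_t(x^i)} D_t^i \exp(\alpha_t) + \sum_{i: y^i = h_t(x^i)} D_t^i \exp(-\alpha_t) \\
&= \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \\
&= \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} + (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} \\
&= 2 \sqrt{\epsilon_t (1 - \epsilon_t)}
\end{aligned}$$

This concludes the proof.

5.1.2 Adaboost as coordinate descent

We demonstrated that boosting optimizes the following upper bound to the misclassification cost:

$$\frac{1}{N} \sum_{i=1}^{N} [y^i \neq H(x^i)] \leq \frac{1}{N} \sum_{i=1}^{N} \exp(-y^i f(x^i)) \tag{5.8}$$

$$= \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right] \tag{5.9}$$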
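The algorithm and the bound (5.8)-(5.9) can be checked on a small example. The following is a minimal sketch assuming NumPy, with axis-aligned decision stumps as weak learners; the stump search, the toy data and all function names are illustrative assumptions, and the clipping of $\epsilon_t$ is an added numerical safeguard that is not part of the algorithm as stated.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Decision stump: sign * (+1 if x_j <= thr else -1)."""
    return sign * np.where(X[:, j] <= thr, 1.0, -1.0)

def best_stump(X, y, D):
    """Exhaustively pick the stump with the smallest weighted error, as in (5.1)."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                err = float(np.sum(D[stump_predict(X, j, thr, sign) != y]))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T):
    N = X.shape[0]
    D = np.full(N, 1.0 / N)                    # D_1^i = 1/N
    f = np.zeros(N)                            # running score f(x^i) on the training set
    Zs = []
    for _ in range(T):
        eps, j, thr, sign = best_stump(X, y, D)
        eps = min(max(eps, 1e-12), 1 - 1e-12)  # guard against log(0); an added safeguard
        alpha = 0.5 * np.log((1.0 - eps) / eps)
        h = stump_predict(X, j, thr, sign)
        D = D * np.exp(-alpha * y * h)         # unnormalized update
        Z = D.sum()                            # normalizer Z_t
        D = D / Z
        f += alpha * h
        Zs.append(Z)
    return f, np.array(Zs)

# Hypothetical toy data: label is +1 inside a disc, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = np.where(np.sum(X**2, axis=1) < 0.5, 1.0, -1.0)

f, Zs = adaboost(X, y, T=20)
train_error = float(np.mean(np.sign(f) != y))   # left-hand side of (5.8)
exp_loss = float(np.mean(np.exp(-y * f)))       # right-hand side of (5.8)
bound = float(np.prod(Zs))                      # product of the Z_t, as in (5.9)
```

Up to floating-point error, `exp_loss` equals `bound`, and `train_error` never exceeds either of them, in line with the derivation above.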
Actually the choice of $\alpha_t$ is such that $Z_t$ is minimized at each iteration:

$$\begin{aligned}
\frac{\partial \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right]}{\partial \alpha_t} &= 0 \\
\epsilon_t \exp(\alpha_t) - (1 - \epsilon_t) \exp(-\alpha_t) &= 0 \\
\epsilon_t \exp(2\alpha_t) &= (1 - \epsilon_t) \\
\alpha_t &= \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
\end{aligned}$$

which gives

$$\epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) = 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \tag{5.10}$$

So choosing the best value of $\alpha$ (in the sense of minimizing $Z_t$, and thence (5.8)) tells us that $Z_t$ will equal $2 \sqrt{\epsilon_t (1 - \epsilon_t)}$, where $\epsilon_t < 1/2$ is the weighted error of the weak learner. Moreover $\epsilon_t (1 - \epsilon_t)$ is increasing in $\epsilon_t$ in $[0, 1/2]$, so finding the weak learner with the minimal $\epsilon_t$ guarantees that we most quickly decrease (5.8).

We can conceptually describe what this amounts to as follows: we have at our disposal the responses of all weak learners (an infinite number of functions in general), and consider combining them with a weight vector, initially set to zero. In this setting, the Adaboost algorithm can be interpreted as coordinate descent: we pick the dimension (weak learner, $h_t$) and step (voting strength, $\alpha_t$) that result in the sharpest decrease in the cost. (Coordinate descent is an optimization technique that changes a single dimension of an optimized vector at each iteration.) So we can see Adaboost as a greedy, but particularly effective optimization procedure.

5.1.3 Adaboost as building an additive model

A similar interpretation of Adaboost is in terms of fitting a linear model by finding at each step the optimal perturbation of a previously constructed decision rule. Specifically, consider that we have $f(x)$ at round $t$ and we want to find how to optimally perturb it by:

$$f'(x) = f(x) + \alpha h(x) \tag{5.11}$$
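where $h(x)$ is a weak learner and $\alpha$ is the perturbation scale. To consider the impact that this perturbation will have on the cost we use a Taylor expansion as follows:

$$\begin{aligned}
\sum_{i=1}^{N} \exp(-y^i f'(x^i)) &= \sum_{i=1}^{N} \exp(-y^i [f(x^i) + \alpha h(x^i)]) \\
&\simeq \sum_{i=1}^{N} \exp(-y^i f(x^i)) \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} (y^i)^2 h^2(x^i) \right] \\
&= \sum_{i=1}^{N} \exp(-y^i f(x^i)) \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right] \\
&= Z \sum_{i=1}^{N} \underbrace{\frac{\exp(-y^i f(x^i))}{Z}}_{D^i} \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right]
\end{aligned}$$

where $Z = \sum_{i=1}^{N} \exp(-y^i f(x^i))$ and we used $(y^i)^2 h^2(x^i) = 1$. So our goal is to minimize:

$$\sum_{i=1}^{N} D^i \left[ 1 - \alpha y^i h(x^i) + \frac{\alpha^2}{2} \right] \tag{5.12}$$

Finding the optimal $h$ amounts to solving:

$$h^* = \arg\max_h \sum_{i=1}^{N} D^i y^i h(x^i) = \arg\min_h \epsilon \tag{5.13}$$

since $\sum_i D^i y^i h(x^i) = (1 - \epsilon) - \epsilon = 1 - 2\epsilon$, where $\epsilon$ is the weighted error of $h$. So picking the classifier with minimal $\epsilon$ amounts to finding the optimal $h$. And when optimizing w.r.t. $\alpha$ we get:

$$\alpha^* = \arg\min_{\alpha} \sum_{i=1}^{N} D^i \exp(-\alpha y^i h(x^i)) = \arg\min_{\alpha} \; \epsilon \exp(\alpha) + (1 - \epsilon) \exp(-\alpha) \tag{5.14}$$

So, once more, we get the Adaboost algorithm.

5.1.4 Adaboost as a method for estimating the posterior

Consider the expectation of the exponential loss:

$$\begin{aligned}
E_{X,Y} \exp(-y f(x)) &= \int_x \sum_y \exp(-y f(x)) P(x, y)\, dx \\
&= \int_x \Big[ \sum_y \exp(-y f(x)) P(y | x) \Big] P(x)\, dx \\
&= \int_x \left[ P(y = 1 | x) \exp(-f(x)) + P(y = -1 | x) \exp(f(x)) \right] P(x)\, dx
\end{aligned}$$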
If a function $f(x)$ were to minimize this cost, at a point $x$ it should satisfy

$$f(x) = \arg\min_g \left[ P(y = 1 | x) \exp(-g) + P(y = -1 | x) \exp(g) \right] = \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}$$

or

$$P(y | x) = \frac{1}{1 + \exp(-2 y f(x))}$$

In Adaboost we work with functions that are expressed as sums of weak learners. So the optimal $f(x)$ will not necessarily be given by the expression $\frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}$, but rather will be an approximation to it formed by a linear combination of weak learners.

5.2 Learning Theory

For pattern recognition: small sample analysis; bounds on worst performance.

Hoeffding inequality: let $Z_1, \dots, Z_m$ be iid, drawn from a Bernoulli distribution: $P(Z_i) = \phi^{Z_i} (1 - \phi)^{1 - Z_i}$. Consider their mean: $\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i$. Then for any $\gamma > 0$,

$$P(|\phi - \hat{\phi}| > \gamma) \leq 2 \exp(-2\gamma^2 m)$$

On a training set $\{x^i, y^i\}$, $i = 1 \dots m$, consider the training error for hypothesis $h$:

$$\hat{e}(h) = \frac{1}{m} \sum_{i=1}^{m} [h(x^i) \neq y^i]$$

and the generalization error:

$$e(h) = P_{(x, y) \sim D}([h(x) \neq y])$$

How different will $e$ be from $\hat{e}$ if the training set is drawn from $D$?

Consider a Hypothesis Class: the set of all classifiers considered. E.g. for linear classification, all linear classifiers.

Consider that we have a finite class:

$$H = \{h_1, \dots, h_K\} \tag{5.15}$$

e.g. a small set of decision stumps for a few thresholds. No parameters to estimate.

Take any of these classifiers, $h_i$. Consider the Bernoulli variable $Z$ generated by sampling $(x, y) \sim D$ and indicating whether $h_i(x) \neq y$. We can write the empirical error in terms of this Bernoulli variable:

$$\hat{e}(h_i) = \frac{1}{m} \sum_{j=1}^{m} Z_j$$
The mean of this variable is $e(h_i)$, i.e. the generalization error for the $i$-th classifier. We then have from the Hoeffding inequality:

$$P(|\hat{e}(h_i) - e(h_i)| > \gamma) \leq 2 \exp(-2\gamma^2 m)$$

For large $m$, the training error will be close to the generalization error. Consider all possible classifiers:

$$P(\exists h \in H: |\hat{e}(h) - e(h)| > \gamma) \leq \sum_{k=1}^{K} P(|\hat{e}(h_k) - e(h_k)| > \gamma) \leq \sum_{k=1}^{K} 2 \exp(-2\gamma^2 m) = 2K \exp(-2\gamma^2 m)$$

Union Bound: $P(A_1 \cup A_2 \cup \dots \cup A_k) \leq P(A_1) + \dots + P(A_k)$.

Consider we have $m$ training points, $k$ hypotheses and want to find a $\gamma$ such that $P(|\hat{e} - e| > \gamma) \leq \delta$. Plugging in the above equation we have $\gamma = \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}$. So with probability $1 - \delta$ we have for all $h \in H$:

$$|\hat{e}(h) - e(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}} \tag{5.16}$$

What does this mean? Consider $h^*$ to be the best possible hypothesis in terms of generalization error, and $\hat{h}$ to be the best in terms of training error. We then have

$$e(\hat{h}) \leq \hat{e}(\hat{h}) + \gamma \leq \hat{e}(h^*) + \gamma \leq e(h^*) + 2\gamma \tag{5.17}$$

The generalization error of the chosen hypothesis is not much larger than that of the optimal one.

Bias/Variance: if we have a small hypothesis set, we may have a large $e(h^*)$, but $e(\hat{h})$ will not differ much from it. If we have a large hypothesis set, $e(h^*)$ becomes smaller, but $\gamma$ increases.
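To get a feeling for the numbers in (5.16), the bound can be evaluated directly. The sketch below uses only the Python standard library; the function names and the chosen values of $m$, $k$ and $\delta$ are hypothetical.

```python
import math

def uniform_deviation(m, k, delta):
    """gamma = sqrt(log(2k / delta) / (2m)), the bound in (5.16)."""
    return math.sqrt(math.log(2.0 * k / delta) / (2.0 * m))

def samples_needed(gamma, k, delta):
    """Smallest m with 2 k exp(-2 gamma^2 m) <= delta."""
    return math.ceil(math.log(2.0 * k / delta) / (2.0 * gamma**2))

# Hypothetical numbers: 1000 hypotheses, 95% confidence.
gamma_small = uniform_deviation(m=1_000, k=1_000, delta=0.05)
gamma_large = uniform_deviation(m=100_000, k=1_000, delta=0.05)
m_required = samples_needed(gamma=0.05, k=1_000, delta=0.05)
```

Since $\gamma$ depends on $k$ only through $\log k$, enlarging the hypothesis class is cheap in terms of the bound, while $\gamma$ decreases only as $1/\sqrt{m}$ in the number of training points.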