
Machine Learning in Computer Vision, MVA 20 - Lecture Notes
Iasonas Kokkinos, Ecole Centrale Paris
iasonas.kokkinos@ecp.fr
October 8, 20

Chapter 1 Introduction

This is an informal set of notes for the first part of the class. They are by no means complete, and are not a substitute for attending class. My main intent is to give you a skimmed version of the class slides, which you will hopefully find easier to consult. These notes were compiled based on several standard textbooks, including:

T. Hastie, R. Tibshirani and J. Friedman, Elements of Statistical Learning, Springer 2009.
V. Vapnik, The Nature of Statistical Learning, Springer 2000.
C. Bishop, Pattern Recognition and Machine Learning, Springer 2006.

Apart from these, helpful resources can be found online; in particular Andrew Ng has a highly recommended set of handouts at http://cs229.stanford.edu/materials.html, while the more theoretically inclined can consult Robert Schapire's course, http://www.cs.princeton.edu/courses/archive/spring08/cos5/schedule.html, and Peter Bartlett's course, http://www.cs.berkeley.edu/~bartlett/courses/28b-sp08/, for more extensive introductions to learning theory, boosting and SVMs.

As this is the first edition of these lecture notes, there will inevitably be inconsistencies in notation with the slides, typos, etc.; your feedback will be most valuable in improving them for next year.

Chapter 2 Linear Regression

Assumed form of the discriminant function:
$$ f(x, w) = \sum_{j=1}^{M} w_j x_j + w_0 = \sum_{j=0}^{M} w_j x_j, \quad x_0 = 1 \qquad (2.1) $$
In the more general case:
$$ f(x, w) = \sum_{j} w_j B_j(x), \qquad (2.2) $$
where $B_j(x)$ can be a nonlinear function of $x$.

We want to minimize the empirical loss (training loss) of our classifier, quantified in terms of the following criterion:
$$ L(w) = \sum_i l(y_i, f(x_i, w)) = \sum_i (y_i - f(x_i; w))^2 = \sum_i \Big( y_i - \sum_j w_j x^i_j \Big)^2. $$
We introduce vector/matrix notation:
$$ y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \in \mathbb{R}^{N}, \qquad X = \begin{pmatrix} x^1_1 & x^1_2 & \ldots & x^1_M \\ \vdots & & & \vdots \\ x^N_1 & x^N_2 & \ldots & x^N_M \end{pmatrix} \in \mathbb{R}^{N \times M}, $$
and the cost becomes:
$$ L(w) = (y - Xw)^T (y - Xw) = e^T e, $$

where $e = y - Xw$ is called the residual vector. This is why the criterion is also known as the residual-sum-of-squares cost. Optimizing the training criterion gives:
$$ \frac{\partial L(w)}{\partial w} = -2 X^T (y - Xw) = 0 \;\Rightarrow\; X^T y = X^T X \hat{w} \;\Rightarrow\; \hat{w} = (X^T X)^{-1} X^T y \qquad (2.3) $$
$$ \hat{y} = X \hat{w} = X (X^T X)^{-1} X^T y \qquad (2.4) $$

2.1 Ridge regression

What happens when the regression coefficients are correlated? As an extreme case, consider $x^i_1 = x^i_2 \;\forall i$. Then the matrix $X^T X$ becomes ill-conditioned and the corresponding coefficients can take on arbitrary values: a large positive value of one coefficient can be canceled by a large negative one. We therefore regularize our problem by introducing a penalty $C(w)$ on the expansion coefficients:
$$ L(w, \lambda) = e^T e + \lambda C(w), \qquad (2.5) $$
where $\lambda$ determines the tradeoff between the data term $e^T e$ and the regularizer.

For the case of ridge regression we penalize the $\ell_2$ norm of the weight vector, i.e. $\|w\|_2^2 = \sum_{k=1}^{K} w_k^2 = w^T w$. The cost function becomes:
$$ L_{ridge}(w) = e^T e + \lambda w^T w = y^T y - 2 w^T X^T y + w^T (X^T X + \lambda I) w. \qquad (2.6) $$
The condition for an optimum of $L(w)$ gives:
$$ w = (X^T X + \lambda I)^{-1} X^T y, \qquad (2.7) $$
so the estimated labels will be:
$$ \hat{y} = X w. \qquad (2.8) $$

2.1.1 Lasso

Instead of the $\ell_2$ norm we now penalize the $\ell_1$ norm of the vector $w$:
$$ E_{Lasso}(w, \lambda) = (y - Xw)^T (y - Xw) + \lambda \sum_k |w_k|. \qquad (2.9) $$
The derivative of $E_{Lasso}$ w.r.t. $w_i$ at any point will be given by:
$$ \frac{\partial E_{Lasso}(w)}{\partial w_i} = \lambda \, \mathrm{sign}(w_i) - 2 (y - Xw)^T X_i, $$
where $X_i$ denotes the $i$-th column of $X$. If we use for instance gradient descent to minimize $E_{Lasso}$, the rate at which $w_i$ is updated is not affected by the magnitude of $w_i$, but only by its sign. Note that this is in contrast to the ridge regression case:
$$ \frac{\partial E_{Ridge}(w)}{\partial w_i} = 2 \lambda w_i - 2 (y - Xw)^T X_i, $$
where small-in-magnitude values of $w_i$ can survive, as they are suppressed more and more slowly as they decrease. This difference, due to the different norms used, leads to sparser solutions in Lasso.
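A minimal sketch of the estimators above, assuming NumPy; the synthetic data, the value of $\lambda$, and the step size of the Lasso loop are my own illustrative choices. OLS and ridge use the closed forms (2.3) and (2.7); Lasso has no closed form, so a simple (sub)gradient update on (2.9) is shown instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N points, M features (first column of ones plays the role of x_0 = 1).
N, M = 50, 5
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])
w_true = rng.normal(size=M)
y = X @ w_true + 0.1 * rng.normal(size=N)

# Ordinary least squares: w_hat = (X^T X)^{-1} X^T y   (eq. 2.3)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: w = (X^T X + lambda*I)^{-1} X^T y   (eq. 2.7)
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Lasso: subgradient descent on (2.9); note the regularizer contributes only sign(w_i).
w_lasso, step = np.zeros(M), 1e-3
for _ in range(5000):
    grad = -2 * X.T @ (y - X @ w_lasso) + lam * np.sign(w_lasso)
    w_lasso -= step * grad

print(w_ols, w_ridge, w_lasso, sep="\n")
```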

2.2 Selection of λ

If we optimize w.r.t. $\lambda$ directly, $\lambda = 0$ is the obvious, but trivial, solution. This solution will suffer from ill-posedness: as mentioned earlier, the matrix $X^T X$ may be poorly conditioned for correlated $X$, so the parameters of the model may exhibit high variance, depending on the specific training set used. What $\lambda$ does is regularize our problem, i.e. impose constraints that render it well-posed. We first see why this is so.

2.2.1 Bias-variance decomposition of the error

Assume the actual generation process is
$$ y = g(x) + \epsilon. \qquad (2.10) $$
Our model is $\hat{g}(x) = f_{\hat{w}}(x)$, with $\hat{w} = h(\lambda, S)$ estimated from the training set $S$. \qquad (2.11)
The expected generalization error at a point $x_0$ is:
$$ \mathrm{Err}(x_0) = E[(y - \hat{g}(x_0))^2] = E\big[\big( (y - g(x_0)) + (g(x_0) - E[\hat{g}(x_0)]) + (E[\hat{g}(x_0)] - \hat{g}(x_0)) \big)^2\big] $$
$$ = \sigma^2 + \underbrace{(g(x_0) - E[\hat{g}(x_0)])^2}_{\text{Bias}^2} + \underbrace{E[(E[\hat{g}(x_0)] - \hat{g}(x_0))^2]}_{\text{Variance}}. \qquad (2.12) $$
Bias is a decreasing function of model complexity; variance is an increasing one.

2.2.2 Cross validation

Setting $\lambda$ allows us to control the tradeoff between bias and variance. Consider the two extreme cases:
$\lambda = 0$: no regularization; ill-posed problem and high variance.
$\lambda \to \infty$: constant model; high bias.
The best is somewhere in between. Use cross-validation to estimate $\lambda$.
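A sketch of how cross-validation could be used to pick $\lambda$ for ridge regression; the synthetic data, the fold count, and the grid of candidate values are my own choices, not something prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 60, 8
X = rng.normal(size=(N, M))
y = X @ rng.normal(size=M) + 0.5 * rng.normal(size=N)

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate, w = (X^T X + lam*I)^{-1} X^T y  (eq. 2.7)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# 5-fold split, fixed once so that every lambda is scored on the same partitions.
folds = np.array_split(rng.permutation(N), 5)

def cv_error(lam):
    errs = []
    for fold in folds:
        train = np.setdiff1d(np.arange(N), fold)
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
    return np.mean(errs)

lambdas = np.logspace(-3, 3, 13)
best_lam = min(lambdas, key=cv_error)   # lambda with lowest held-out squared error
print("selected lambda:", best_lam)
```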

2.3 SVD-based interpretations (optional reading)

Consider the following singular value decomposition of the $N \times M$ matrix $X$:
$$ \underbrace{X}_{N \times M} = \underbrace{U}_{N \times M} \underbrace{D}_{M \times M} \underbrace{V^T}_{M \times M}, \qquad D = \mathrm{diag}(d_1, d_2, \ldots, d_{\min(M,N)}, \underbrace{0, \ldots, 0}_{\max(M-N, 0)}), \quad d_i \geq d_{i+1} \geq 0, $$
while the columns of $U$ and $V$ are orthogonal and unit-norm. We recall that $N$ is the number of training points and $M$ the dimensionality of the features. (This version of the SVD is slightly non-typical; it is the one used in Hastie, Tibshirani and Friedman. The more common convention takes $D$ to be $N \times M$: for $N > M$, $X = U D V^T$ with $U$ of size $N \times N$, $V^T$ of size $M \times M$, and $D = [\mathrm{diag}(d_1, d_2, \ldots); (0)]$, $d_i \geq d_{i+1} \geq 0$. (2.13))

A first thing to note is that if we have zero-mean ("centered") data, $X^T X$ is the (scaled) covariance matrix of the data. Moreover we have:
$$ X^T X = V D^T U^T U D V^T = V D^T D V^T = V D^2 V^T, \qquad (2.14) $$
so the columns of $V$ are the eigenvectors of the covariance matrix of the data (and $D^2$ contains its eigenvalues).

Further, we can use the SVD to interpret certain quantities obtained in the previous sections. First, for (2.4), which gives the estimated output $\hat{y}$ in linear regression, we have:
$$ \hat{y} = X (X^T X)^{-1} X^T y = U D V^T (V D^2 V^T)^{-1} V D U^T y = U (U^T y). $$
We can interpret this expression as follows:
The columns of $U$ form an orthonormal basis for an (at most) $M$-dimensional subspace of $\mathbb{R}^N$, the column space of $X$.
$c_h = U^T h$ gives the expansion coefficients for the projection of a vector $h$ on this basis.
$\hat{h} = U c_h$ is the reconstruction of $h$ on the basis $U$.
We therefore have that $U^T y$ gives the projection coefficients of $y$ on $U$, so $\hat{y} = U U^T y$ is the projection of $y$ onto the basis $U$.

For the case of ridge regression, using the expression in (2.8) we get:
$$ \hat{y} = X (X^T X + \lambda I)^{-1} X^T y = U D (D^2 + \lambda I)^{-1} D U^T y = \sum_{j=1}^{M} u_j \frac{d_j^2}{d_j^2 + \lambda} u_j^T y. $$

Hence the name "shrinkage": the expansion coefficients for the reconstruction of $y$ on the subspace spanned by $U$ are shrunk by a factor $\frac{d_j^2}{d_j^2 + \lambda} < 1$ with respect to the values used in linear regression.
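A quick numerical check of this interpretation, assuming NumPy and random data of my own choosing: the ridge fitted values computed directly from (2.8) coincide with the SVD expression in which each projection coefficient $u_j^T y$ is multiplied by $d_j^2 / (d_j^2 + \lambda)$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 30, 6
X = rng.normal(size=(N, M))
y = rng.normal(size=N)
lam = 2.0

# Thin SVD: X = U D V^T with U of size N x M (the convention used in the text).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Ridge fitted values directly: y_hat = X (X^T X + lam*I)^{-1} X^T y   (eq. 2.8)
y_hat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

# Same thing via shrinkage of the projection coefficients:
# y_hat = sum_j u_j * d_j^2 / (d_j^2 + lam) * (u_j^T y)
shrink = d**2 / (d**2 + lam)
y_hat_svd = U @ (shrink * (U.T @ y))

print(np.allclose(y_hat_direct, y_hat_svd))  # True
```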

Chapter 3 Logistic Regression

3.1 Probabilistic transition from linear to logistic regression

Consider that the labels are generated according to the linear model plus Gaussian noise assumption:
$$ P(y_i | x_i, w) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right). \qquad (3.1) $$
We consider as a merit function the likelihood of the data:
$$ C(w) = P(y | X, w) = \prod_i P(y_i | x_i, w), $$
or, more conveniently, the log-likelihood:
$$ \log C(w) = \log P(y | X) = \sum_i \log P(y_i | x_i) = c - \frac{1}{2\sigma^2} \sum_i (y_i - w^T x_i)^2 = c - \frac{1}{2\sigma^2} (y - Xw)^T (y - Xw) = c - \frac{1}{2\sigma^2} e^T e. \qquad (3.2) $$
So minimizing the least squares criterion amounts to maximizing the log-likelihood of the labels under the model in (3.1). But this model makes sense only for continuous observations. In classification the observations are discrete (labels). We therefore consider using a Bernoulli distribution instead:
$$ P(y = 1 | x, w) = f(x, w), \qquad P(y = 0 | x, w) = 1 - f(x, w). $$

We can compactly express these two equations as:
$$ P(y | x) = (f(x, w))^{y} (1 - f(x, w))^{1 - y}. \qquad (3.3) $$
We can now write the (log-)likelihood of the data as:
$$ C(w) = P(y | X, w) = \prod_i P(y_i | x_i, w), \qquad \log C(w) = \sum_i \log P(y_i | x_i, w) = \sum_i y_i \log f(x_i, w) + (1 - y_i) \log(1 - f(x_i, w)). \qquad (3.4) $$
Now consider using the following expression for $f(x, w)$:
$$ f(x, w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)}. \qquad (3.5) $$
$g(a) = \frac{1}{1 + \exp(-a)}$ is called the sigmoidal function and has several interesting properties, the most useful ones being:
$$ \frac{dg}{da} = g(a)(1 - g(a)), \qquad 1 - g(a) = g(-a). $$
Plugging (3.5) into (3.4), this is the criterion we should be maximizing:
$$ \log C(w) = \log P(y | X, w) = \sum_i y_i \log g(w^T x_i) + (1 - y_i) \log(1 - g(w^T x_i)). \qquad (3.6) $$
A simple approach to maximize it is to use gradient ascent:
$$ w_k^{t+1} = w_k^t + \alpha \left. \frac{\partial \log C(w)}{\partial w_k} \right|_{w^t}. \qquad (3.7) $$
Using the expression in (3.6), we have:
$$ \frac{\partial \log C(w)}{\partial w_k} = \sum_i \left[ \frac{y_i}{g(w^T x_i)} - \frac{1 - y_i}{1 - g(w^T x_i)} \right] \frac{\partial g(w^T x_i)}{\partial w_k} = \sum_i \left[ \frac{y_i}{g(w^T x_i)} - \frac{1 - y_i}{1 - g(w^T x_i)} \right] g(w^T x_i)(1 - g(w^T x_i)) \frac{\partial w^T x_i}{\partial w_k} $$
$$ = \sum_i \left[ y_i (1 - g(w^T x_i)) - (1 - y_i) g(w^T x_i) \right] x^i_k = \sum_i \left[ y_i - g(w^T x_i) \right] x^i_k. $$

We can write this more compactly using vector notation: with the $N$-dimensional vector $g$ with elements
$$ g_i = g(w^T x_i) \qquad (3.8) $$
and the $N$-dimensional $k$-th column $X_k$ of the $N \times M$ matrix $X$, we have:
$$ \frac{\partial \log C(w)}{\partial w_k} = [y - g]^T X_k = X_k^T [y - g]. $$
Stacking the partial derivatives together we form the Jacobian:
$$ J(w) = \frac{d \log C(w)}{d w} = \begin{pmatrix} X_1^T [y - g] \\ \vdots \\ X_M^T [y - g] \end{pmatrix} = X^T [y - g]. $$
The gradient-ascent update for $w$ will be $w^{t+1} = w^t + \alpha J(w^t)$.

A more elaborate, but more efficient, technique is the Newton-Raphson method. Consider first Newton's method for finding the roots of a 1-D function $f(\theta)$:
$$ \theta \leftarrow \theta - \frac{f(\theta)}{f'(\theta)}. $$
If we want to maximize the function, we need to find roots of its derivative $f'$; so we apply Newton's method to $f'(\theta)$:
$$ \theta \leftarrow \theta - \frac{f'(\theta)}{f''(\theta)}. $$
In our case we want to maximize a multidimensional function, which leads us to the Newton-Raphson method:
$$ w \leftarrow w - (H(w))^{-1} J(w). \qquad (3.9) $$
We already saw how to compute the Jacobian. For the $(k, j)$-th element of the Hessian we have:
$$ H(w)(k, j) = \frac{\partial^2 \log C(w)}{\partial w_k \partial w_j} = \sum_{i=1}^{N} \frac{\partial}{\partial w_j} \left[ y_i - g(w^T x_i) \right] x^i_k = - \sum_i g(w^T x_i)(1 - g(w^T x_i)) x^i_j x^i_k, $$
i.e. $H = -X^T R X$, where $R_{i,i} = g(w^T x_i)(1 - g(w^T x_i))$. \qquad (3.10)

This Newton-Raphson update is called Iteratively Reweighted Least Squares (IRLS) in statistics; to see why this is so, we rewrite the update as:
$$ w \leftarrow w + (X^T R X)^{-1} X^T (y - g) = (X^T R X)^{-1} \left[ (X^T R X) w + X^T (y - g) \right] = (X^T R X)^{-1} X^T R \left[ X w + R^{-1} (y - g) \right] = (X^T R X)^{-1} X^T R z, $$
where we have introduced the "adjusted response":
$$ z = X w + R^{-1} (y - g), \qquad z_i = \underbrace{x_i^T w}_{\text{previous value}} + \frac{1}{g_i (1 - g_i)} (y_i - g_i). \qquad (3.11) $$
Using this new vector $z$, we can interpret the updated $w$ as the solution of a weighted least squares problem:
$$ w = \arg\min_w \sum_{i=1}^{N} R_{i,i} (z_i - x_i^T w)^2. \qquad (3.12) $$
Note that, apart from $z$, the matrix $R$ is also updated at each iteration; this explains the "iteratively re-" prefix in the IRLS name.
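A minimal IRLS sketch for logistic regression following (3.9)-(3.11), with labels in {0, 1}. The synthetic data, the number of Newton steps, and the clipping of $R$ away from zero are my own choices, added only so the example runs robustly.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, M - 1))])
w_true = np.array([-0.5, 2.0, -1.0])
# Labels drawn from the Bernoulli model P(y=1|x,w) = g(w^T x), encoded in {0, 1}.
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-X @ w_true))).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(M)
for _ in range(10):
    g = sigmoid(X @ w)                              # g_i = g(w^T x_i)       (eq. 3.8)
    R = np.clip(g * (1.0 - g), 1e-10, None)         # diagonal of R          (eq. 3.10)
    z = X @ w + (y - g) / R                         # adjusted response      (eq. 3.11)
    # Weighted least squares step: w = (X^T R X)^{-1} X^T R z
    w = np.linalg.solve(X.T @ (R[:, None] * X), X.T @ (R * z))

print(w)  # should be close to w_true
```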

3.2 Log-loss

We now simplify the expression for our training criterion to see what logistic regression does. We introduce:
$$ y_\pm = 2 y_b - 1, \qquad (3.13) $$
$$ y_b \in \{0, 1\}, \qquad (3.14) $$
$$ y_\pm \in \{-1, 1\}, \qquad (3.15) $$
$$ y_b = \frac{y_\pm + 1}{2}, \qquad (3.16) $$
where $y_b$ is the label we have been using so far and $y_\pm$ is a new label that makes certain expressions easier. In specific we have:
$$ P(y = 1 | f(x)) = \frac{1}{1 + \exp(-f(x))} \qquad (3.17) $$
$$ P(y = -1 | f(x)) = 1 - P(y = 1 | f(x)) \qquad (3.18) $$
$$ = \frac{1}{\exp(f(x)) + 1} \qquad (3.19) $$
$$ P(y_i | f(x_i)) = \frac{1}{1 + \exp(-y_i f(x_i))} \qquad (3.20) $$
So we can write:
$$ L(w) = -\log C(w) = \sum_i \log(1 + \exp(-y_\pm^i f(x_i))). \qquad (3.21) $$
The quantity
$$ l(y_i, x_i) = \log(1 + \exp(-y_\pm^i f(x_i))) \qquad (3.22) $$
is called the log-loss.
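A small check of the equivalence between the two label conventions, with toy numbers of my own choosing: the negative log-likelihood (3.4) computed with {0, 1} labels equals the log-loss (3.21) computed with {-1, +1} labels.

```python
import numpy as np

rng = np.random.default_rng(3)
scores = rng.normal(size=5)            # f(x_i) = w^T x_i for five hypothetical points
y_b = np.array([1, 0, 1, 1, 0])        # labels in {0, 1}
y_pm = 2 * y_b - 1                     # same labels in {-1, +1}   (eq. 3.13)

g = 1.0 / (1.0 + np.exp(-scores))      # sigmoid

# Negative log-likelihood, {0,1} form (eq. 3.4 with a minus sign):
nll = -np.sum(y_b * np.log(g) + (1 - y_b) * np.log(1 - g))

# Log-loss, {-1,+1} form (eq. 3.21):
logloss = np.sum(np.log1p(np.exp(-y_pm * scores)))

print(np.allclose(nll, logloss))  # True
```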

Chapter 4 Support Vector Machines

4.1 Large margin classifiers

In linear classification we are concerned with the construction of a separating hyperplane, $w^T x = 0$. For instance, logistic regression provides an estimate of
$$ P(y = 1 | x; w) = g(w^T x) = \frac{1}{1 + \exp(-w^T x)} $$
and provides a decision rule using:
$$ \hat{y}_i = \begin{cases} 1 & \text{if } P(y = 1 | x; w) \geq 0.5 \\ 0 & \text{if } P(y = 1 | x; w) < 0.5 \end{cases} \qquad (4.1) $$
Intuitively, a confident decision is made when $|w^T x|$ is large. So we would like to have
$$ w^T x_i \gg 0 \text{ if } y_i = 1, \qquad w^T x_i \ll 0 \text{ if } y_i = -1, $$
or, combined: $y_i (w^T x_i) \gg 0$.

Given a training set, we want to find a decision boundary that allows us to make correct and confident predictions on all training examples. For points near the decision boundary, even though we may be correct, small changes in the training set could cause different classifications; so our classifier will not be confident. If a point is far from the separating hyperplane, then we are more confident in our prediction. In other words, we would like all points to be far from (and on the correct side of) the decision plane.

4.1.1 Notation

Instead of
$$ h(x) = g(w^T x) \qquad (4.2) $$
we now consider classifiers of the form
$$ h(x) = g(w^T x + b) \qquad (4.3) $$
(i.e. $b = w_0$), where $g(z) = 1$ if $z > 0$ and $g(z) = -1$ if $z < 0$. Unlike logistic regression, where $g$ gives a probability estimate, here the classifier directly outputs a decision.

4.1.2 Margins

The confidence scores described above are called functional margins: scaling $w, b$ by two doubles the margin, without essentially changing the classifier. We therefore now turn to geometric margins, which are defined in terms of the unit-norm vector $\frac{w}{\|w\|}$. We first express any point as
$$ x = x_\perp + \gamma \frac{w}{\|w\|}, $$
where $x_\perp$ is the projection of $x$ on the decision plane, $w^T x_\perp + b = 0$, $\frac{w}{\|w\|}$ is a unit-norm vector perpendicular to the decision plane, and $\gamma$ is thus the distance of $x$ from the decision plane. Solving for $\gamma$ we have:
$$ w^T x = w^T x_\perp + \gamma \frac{w^T w}{\|w\|} = -b + \gamma \|w\| \;\Rightarrow\; \gamma = \frac{w^T x + b}{\|w\|} = \left( \frac{w}{\|w\|} \right)^T x + \frac{b}{\|w\|}. $$
The above analysis works for the case where $y = 1$. If $y = -1$, we would like $w^T x + b$ to be negative, so we write:
$$ \gamma_i = y_i \left( \left( \frac{w}{\|w\|} \right)^T x_i + \frac{b}{\|w\|} \right). $$
For a training set $\{x_i, y_i\}$ and a classifier $w$ we define its geometric margin as
$$ \gamma = \min_i \gamma_i, \qquad (4.4) $$
which indicates the worst performance of the classifier on the training set.
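To make the functional/geometric distinction concrete, here is a small numerical check on a made-up 2-D dataset and candidate hyperplane (nothing in this snippet comes from the notes themselves): rescaling $(w, b)$ changes the functional margins but leaves the geometric margins unchanged.

```python
import numpy as np

# Hypothetical 2-D training set and a candidate separating hyperplane w^T x + b = 0.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

functional = y * (X @ w + b)                 # functional margins (scale with w, b)
geometric = functional / np.linalg.norm(w)   # geometric margins; eq. (4.4) takes their min

print("geometric margin of the classifier:", geometric.min())
# Rescaling (w, b) changes the functional margins but not the geometric ones:
print(np.allclose(geometric, y * (X @ (2 * w) + 2 * b) / np.linalg.norm(2 * w)))  # True
```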

4.1.3 Introducing margins in classification

Our task is to classify all points with maximal geometric margin. If we consider the case where the points are separable, we have the following optimization problem:
$$ \max_{\gamma, w, b} \; \gamma \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq \gamma, \; i = 1 \ldots M, \qquad \|w\| = 1. $$
In words: maximize $\gamma$, such that all points have at least $\gamma$ geometric margin. It is hard to incorporate the $\|w\| = 1$ constraint (it is non-convex; for two-dimensional inputs, $w$ should lie on a circle). We therefore divide both sides by $\|w\|$:
$$ \max_{\gamma, w, b} \; \frac{\gamma}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq \gamma, \; i = 1 \ldots M. $$
From the constraint, since $w$ is no longer constrained to have unit norm, we realize that $\gamma$ is now a functional margin. Suppose we find a combination of $w, b$ that solves this optimization problem; we get a one-dimensional family of solutions by scaling $w, b, \gamma$ by a common constant. We remove this ambiguity by considering only those values of $w, b$ for which $\gamma = 1$, which gives us the following problem:
$$ \max_{w, b} \; \frac{1}{\|w\|} \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1, \; i = 1 \ldots M, $$
or equivalently:
$$ \min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1, \; i = 1 \ldots M. $$
So we end up with a quadratic optimization problem, involving a convex (quadratic) function of $w$ and linear constraints on $w$.

4.2 Duality

Consider a general constrained optimization problem:
$$ \min_w f(w) \quad \text{s.t.} \quad h_i(w) = 0, \; i = 1 \ldots l, \qquad g_i(w) \leq 0, \; i = 1 \ldots m. $$
This is equivalent to the unconstrained optimization:
$$ \min_w f_{uc}(w) = f(w) + \sum_{i=1}^{l} I_0(h_i(w)) + \sum_{i=1}^{m} I_+(g_i(w)), \qquad (4.5) $$

where $I_0, I_+$ are hard constraint penalties:
$$ I_0(x) = \begin{cases} 0, & x = 0 \\ \infty, & x \neq 0 \end{cases} \qquad I_+(x) = \begin{cases} 0, & x \leq 0 \\ \infty, & x > 0 \end{cases} $$
We form a lower bound of this function by replacing each hard constraint with a soft one, where $\lambda, \mu$ are penalties for violating the constraints:
$$ I_0(x_i) \geq \lambda_i x_i, \qquad I_+(x_i) \geq \mu_i x_i, \; \mu_i \geq 0. $$
This gives us the Lagrangian:
$$ L(w, \lambda, \mu) = f(w) + \sum_{i=1}^{l} \lambda_i h_i(w) + \sum_{i=1}^{m} \mu_i g_i(w), \qquad \mu_i \geq 0. $$
We observe that
$$ f_{uc}(w) = \max_{\lambda, \mu : \mu_i \geq 0} L(w, \lambda, \mu). \qquad (4.6) $$
Therefore
$$ f(w^*) = \min_w \max_{\lambda, \mu : \mu_i \geq 0} L(w, \lambda, \mu). \qquad (4.7) $$
Now consider the Lagrange dual function:
$$ \theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu). \qquad (4.8) $$
$\theta(\lambda, \mu)$ is always a lower bound on the optimal value $p^*$ of the original problem. If the original problem has a (feasible) solution $w^*$, then:
$$ L(w^*, \lambda, \mu) = f(w^*) + \sum_{i=1}^{l} \lambda_i h_i(w^*) + \sum_{i=1}^{m} \mu_i g_i(w^*) = f(w^*) + \sum_i \lambda_i \cdot 0 + \underbrace{\sum_i \mu_i g_i(w^*)}_{\leq 0} \leq f(w^*). $$
So
$$ \theta(\lambda, \mu) = \inf_w L(w, \lambda, \mu) \leq L(w^*, \lambda, \mu) \leq f(w^*). \qquad (4.9) $$
So we have a function $\theta(\lambda, \mu)$ that lower-bounds the minimum cost of our original problem. We want to make this bound as tight as possible, i.e. maximize it.

New task (the dual problem):
$$ \max_{\lambda, \mu} \; \theta(\lambda, \mu) \quad \text{s.t.} \quad \mu_i \geq 0 \;\; \forall i. \qquad (4.10) $$
In general,
$$ d^* = \max_{\lambda, \mu : \mu_i \geq 0} \theta(\lambda, \mu) = \max_{\lambda, \mu : \mu_i \geq 0} \min_w L(w, \lambda, \mu) \leq \min_w \max_{\lambda, \mu : \mu_i \geq 0} L(w, \lambda, \mu) = \min_w f_{uc}(w) = p^*. $$
If $f(w)$, $h_i(w)$, $g_i(w)$ are convex functions, and the primal and dual problems are feasible, then the bound is tight, i.e. $d^* = p^*$, which means that there exist $w^*, \lambda^*, \mu^*$ meeting all constraints while satisfying
$$ f(w^*) = \theta(\lambda^*, \mu^*). \qquad (4.11) $$
At that point we have:
$$ f(w^*) = \theta(\lambda^*, \mu^*) = \inf_w \Big[ f(w) + \sum_{i=1}^{l} \lambda_i^* h_i(w) + \sum_{i=1}^{m} \mu_i^* g_i(w) \Big] \leq f(w^*) + \sum_i \lambda_i^* h_i(w^*) + \sum_i \mu_i^* g_i(w^*) \leq f(w^*). $$
Therefore $\mu_i^* g_i(w^*) = 0 \;\forall i$ ("complementary slackness").

4.2.1 Karush-Kuhn-Tucker conditions

$L(w^*, \lambda^*, \mu^*)$ is a saddle point (a minimum with respect to $w$, a maximum with respect to $\lambda, \mu$). From the minimum w.r.t. $w$ we have that
$$ \nabla f(w^*) + \sum_{i=1}^{l} \lambda_i^* \nabla h_i(w^*) + \sum_{i=1}^{m} \mu_i^* \nabla g_i(w^*) = 0. $$

From the maximum w.r.t. $\lambda, \mu$ we have that the constraints are satisfied. Combining with the complementary slackness conditions we obtain the KKT conditions:
$$ \nabla f(w^*) + \sum_{i=1}^{l} \lambda_i^* \nabla h_i(w^*) + \sum_{i=1}^{m} \mu_i^* \nabla g_i(w^*) = 0, \qquad h_i(w^*) = 0, \qquad g_i(w^*) \leq 0, \qquad \mu_i^* g_i(w^*) = 0, \qquad \mu_i^* \geq 0. $$

4.2.2 Dual problem for large-margin training

The primal problem is the objective we set up for max-margin training:
$$ \min_{w, b} \; \frac{1}{2} \|w\|^2 \quad \text{s.t.} \quad -y_i (w^T x_i + b) + 1 \leq 0, \; i = 1 \ldots M. $$
Its Lagrangian will be:
$$ L(w, b, \mu) = \frac{1}{2} \|w\|^2 - \sum_i \mu_i \left[ y_i (w^T x_i + b) - 1 \right]. \qquad (4.12) $$
At an optimum we have a minimum w.r.t. $w$, which gives
$$ 0 = w - \sum_i \mu_i y_i x_i \;\Rightarrow\; w = \sum_i \mu_i y_i x_i, $$
and a minimum w.r.t. $b$, which gives
$$ 0 = \sum_i \mu_i y_i. $$
Plugging the optimal values into the Lagrangian we get:
$$ \theta(\mu) = L(w, b, \mu) = \frac{1}{2} \Big( \sum_i \mu_i y_i x_i \Big)^T \Big( \sum_j \mu_j y_j x_j \Big) - \sum_i \mu_i \Big[ y_i \Big( \big( \sum_j \mu_j y_j x_j \big)^T x_i + b \Big) - 1 \Big] $$
$$ = \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j \mu_i \mu_j y_i y_j (x_i)^T (x_j) - b \sum_i \mu_i y_i. $$

We have thereby eliminated $w$ from the problem. For reasons that will become clear later, we write $(x_i)^T (x_j) = \langle x_i, x_j \rangle$ (inner product). The new optimization problem becomes:
$$ \max_\mu \; \theta(\mu) = \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j \mu_i \mu_j y_i y_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \mu_i \geq 0 \;\forall i, \qquad \sum_i \mu_i y_i = 0. $$
If we solve the problem above w.r.t. $\mu$, we can use $\mu$ to determine $w$. One important note is that, as
$$ w = \sum_i \mu_i y_i x_i, \qquad (4.13) $$
the final decision boundary orientation is a linear combination of the training points. Moreover, from complementary slackness we have
$$ \mu_i g_i(x) = 0, \qquad (4.14) $$
where
$$ g_i(x) = -y_i (w^T x_i + b) + 1 \leq 0 \;\Leftrightarrow\; y_i (w^T x_i + b) \geq 1. \qquad (4.15) $$
Therefore, if $\mu_i \neq 0$ then $y_i (w^T x_i + b) = 1$. As $w$ is formed from the set of points for which $\mu_i \neq 0$, this means that only the points that lie on the margin influence the solution. These points are called Support Vectors. From the set $S$ of support vectors we can determine the best $b^*$:
$$ y_i (w^T x_i + b) = 1, \; i \in S \;\Rightarrow\; b^* = \frac{1}{N_S} \sum_{i \in S} (y_i - w^T x_i). \qquad (4.16) $$
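The dual above is a small quadratic program, so it can be handed to a generic constrained optimizer. The following sketch does this with scipy.optimize.minimize on a tiny, made-up separable dataset, then recovers $w$ from (4.13) and $b^*$ from (4.16); the dataset, the support-vector threshold, and the variable names are my own illustrative choices, not part of the notes.

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable set (hypothetical).
X = np.array([[2.0, 2.0], [2.0, 0.0], [0.0, 0.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = (y[:, None] * X) @ (y[:, None] * X).T     # K_ij = y_i y_j <x_i, x_j>

def neg_theta(mu):
    """Negative dual objective, so that minimizing it maximizes theta(mu)."""
    return 0.5 * mu @ K @ mu - mu.sum()

cons = ({"type": "eq", "fun": lambda mu: mu @ y},)     # sum_i mu_i y_i = 0
bounds = [(0.0, None)] * len(y)                        # mu_i >= 0 (hard margin)
res = minimize(neg_theta, x0=np.zeros(len(y)), bounds=bounds, constraints=cons)
mu = res.x

w = (mu * y) @ X                                       # w = sum_i mu_i y_i x_i   (eq. 4.13)
sv = mu > 1e-6                                         # support vectors
b = np.mean(y[sv] - X[sv] @ w)                         # b* from eq. (4.16)
print("w =", w, " b =", b, " support vectors:", np.where(sv)[0])
```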

4.2.3 Non-separable data

To deal with the case where the data cannot be separated, we introduce positive slack variables $\xi_i$, $i = 1 \ldots M$, in the constraints:
$$ w^T x_i + b \geq 1 - \xi_i \text{ for } y_i = 1, \qquad w^T x_i + b \leq -1 + \xi_i \text{ for } y_i = -1, $$
or equivalently:
$$ y_i (w^T x_i + b) \geq 1 - \xi_i. $$
If $\xi_i > 1$ an error occurs, so $\sum_i \xi_i$ is an upper bound on the number of errors. We consider a new problem:
$$ \min_{w, b} \; \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad y_i (w^T x_i + b) \geq 1 - \xi_i, \qquad \xi_i \geq 0, \qquad (4.17) $$
where $C$ determines the tradeoff between the large margin (low $\|w\|$) and the low-error constraint. The Lagrangian of this problem is:
$$ L(w, b, \xi, \mu, \nu) = \frac{1}{2} w^T w + C \sum_i \xi_i - \sum_i \mu_i \left[ y_i (w^T x_i + b) - 1 + \xi_i \right] - \sum_i \nu_i \xi_i, $$
and its dual becomes:
$$ \max_\mu \; \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j y_i y_j \mu_i \mu_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_i \mu_i y_i = 0. $$
The KKT conditions are similar; an interesting change is that we now have:
$$ C - \mu_i - \nu_i = 0 \qquad (4.18) $$
$$ y_i (\langle x_i, w \rangle + b) - 1 + \xi_i \geq 0 \qquad (4.19) $$
$$ \mu_i \left[ y_i (\langle x_i, w \rangle + b) - 1 + \xi_i \right] = 0 \qquad (4.20) $$
$$ \nu_i \xi_i = 0 \qquad (4.21) $$
$$ \xi_i \geq 0, \qquad \mu_i \geq 0, \qquad \nu_i \geq 0 \qquad (4.22)-(4.24) $$
These give the following case analysis:
If $y_i (\langle x_i, w \rangle + b) > 1$: from (4.20), $\mu_i = 0$; then from (4.18), $\nu_i = C$; and from (4.21), $\xi_i = 0$.
If $y_i (\langle x_i, w \rangle + b) < 1$: then $\xi_i > 0$ by (4.19); from (4.21), $\nu_i = 0$; and from (4.18), $\mu_i = C$.
If $y_i (\langle x_i, w \rangle + b) = 1$: then $\mu_i \in [0, C]$.
Moreover,
$$ \mu_i \neq 0 \;\Rightarrow\; \xi_i = 1 - y_i (\langle x_i, w \rangle + b) \quad \text{(from (4.20))} \qquad (4.25) $$
$$ \mu_i = 0 \;\Rightarrow\; \nu_i = C \;\Rightarrow\; \xi_i = 0 \quad \text{(from (4.18), (4.21))} \qquad (4.26) $$
and, since $\xi_i \geq 0$ (4.27), the slacks can be written as
$$ \xi_i = \max(0, 1 - y_i (\langle x_i, w \rangle + b)). \qquad (4.28) $$

4.2.4 Kernels

Up to now the decision was based on the sign of
$$ y = w^T x = \sum_i w_i x_i. \qquad (4.29) $$
What if we used nonlinear transformations of the data?
$$ y = w^T \phi(x) = \sum_{k=1}^{K} w_k \phi_k(x). \qquad (4.30) $$
As a concrete example, consider
$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \phi(x) = \begin{bmatrix} x_1 \\ x_2 \\ x_1^2 \\ x_2^2 \\ x_1 x_2 \end{bmatrix}. $$
Using such features would allow us to go from linear to quadratic decision boundaries, defined by
$$ \sum_i w_i \phi_i(x) = 0. \qquad (4.31) $$
We observe that, in the linear case, for any new point $x$ the decision will have the form
$$ y = \mathrm{sign}(\langle w, x \rangle + b) \;\stackrel{w = \sum_i \mu_i y_i x_i}{=}\; \mathrm{sign}\Big( \sum_i \mu_i y_i \langle x_i, x \rangle + b \Big). \qquad (4.32) $$
So all we need to know in order to compute the response at a new $x$ is its inner product with the support vectors. So if we plug in our new features, we guess that we should get a classifier of the form
$$ y = \mathrm{sign}\Big( \sum_i \mu_i y_i \langle \phi(x_i), \phi(x) \rangle + b \Big). \qquad (4.33) $$

Actually, everything in the original algorithm was written in terms of inner products between input vectors. So we can write everything in terms of inner products of our new features $\phi$; in specific, the dual optimization problem becomes:
$$ \max_\mu \; \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j y_i y_j \mu_i \mu_j \langle \phi(x_i), \phi(x_j) \rangle \quad \text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_i \mu_i y_i = 0. $$
If now we define a kernel between two points $x$ and $y$ as
$$ K(x, y) = \langle \phi(x), \phi(y) \rangle $$
and replace $\langle x, y \rangle$ with $K(x, y)$, we can write the same algorithm:
$$ y = \mathrm{sign}\Big( \sum_i \mu_i y_i K(x_i, x) + b \Big), \qquad (4.34) $$
$$ \max_\mu \; \sum_i \mu_i - \frac{1}{2} \sum_i \sum_j y_i y_j \mu_i \mu_j K(x_i, x_j) \quad \text{s.t.} \quad 0 \leq \mu_i \leq C, \qquad \sum_i \mu_i y_i = 0. $$
As a concrete example, consider the mapping
$$ x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad \phi(x) = \begin{bmatrix} x_1^2 \\ x_2^2 \\ \sqrt{2} \, x_1 x_2 \end{bmatrix}. $$
Then
$$ \langle \phi(x), \phi(y) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (x_1 y_1 + x_2 y_2)^2 = (x^T y)^2 = K(x, y). $$
In the same way, we can use $K(x, y) = (x^T y + c)^d$ to implement a mapping into an $\binom{N + d}{d}$-dimensional space.
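A quick numerical sanity check of the degree-2 example above (random test points of my own choosing, assuming NumPy): the explicit feature map and the kernel $(x^T y)^2$ give identical inner products.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 monomial kernel used in the text."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(4)
x, z = rng.normal(size=2), rng.normal(size=2)

lhs = phi(x) @ phi(z)          # <phi(x), phi(z)>
rhs = (x @ z) ** 2             # K(x, z) = (x^T z)^2
print(np.allclose(lhs, rhs))   # True
```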

Another case:
$$ K(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right). \qquad (4.35) $$
It can be shown that this amounts to an inner product in an infinite-dimensional feature space. $K$ can be a distance-like measure among inputs, even though we may not have a straightforward feature-space interpretation; it allows us to apply SVMs to strings, graphs, documents, etc. Most importantly, it allows us to work in very high-dimensional spaces without necessarily finding and/or computing the mappings.

Under what conditions does a kernel function represent an inner product in some space? We want to have $K_{ij} = K(x_i, x_j) = K(x_j, x_i) = K_{ji}$, where $K$ is the Gram matrix; for $m$ points this should hold for all $i = 1 \ldots m$, $j = 1 \ldots m$. Consider now the quantity:
$$ z^T K z = \sum_i \sum_j z_i K_{ij} z_j = \sum_i \sum_j z_i \phi^T(x_i) \phi(x_j) z_j = \sum_k \sum_i \sum_j z_i \phi_k(x_i) \phi_k(x_j) z_j = \sum_k \Big( \sum_i z_i \phi_k(x_i) \Big)^2 \geq 0. $$
$K$ is therefore positive semi-definite and symmetric.

Mercer kernel: consider a kernel $K : \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$; $K$ represents an inner product if and only if $K$ is symmetric positive semi-definite (i.e. every Gram matrix it generates is symmetric PSD).
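The finite-sample version of Mercer's condition is easy to check numerically: any Gram matrix built from the Gaussian kernel (4.35) should be symmetric with non-negative eigenvalues. A small sketch with random data and my own choice of $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 3))
sigma = 1.5

# Gaussian (RBF) kernel Gram matrix: K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))  (eq. 4.35)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / (2 * sigma**2))

# Mercer's condition in finite-sample form: K is symmetric positive semi-definite.
print(np.allclose(K, K.T))                         # symmetric
print(np.min(np.linalg.eigvalsh(K)) >= -1e-10)     # eigenvalues (numerically) non-negative
```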

4.3 Hinge vs. log-loss

Consider the SVM training criterion:
$$ L(w) = \sum_{i=1}^{m} [\xi_i]_+ + \frac{1}{2C} w^T w = \sum_{i=1}^{m} [1 - y_i (w^T x_i)]_+ + \lambda w^T w, \qquad (4.36) $$
with $\lambda = \frac{1}{2C}$, where $[1 - y_i t_i]_+$ is the hinge loss. Compared to regularized logistic regression,
$$ L(w) = \sum_{i=1}^{m} \log(1 + \exp(-y_i (w^T x_i))) + \lambda w^T w, $$
the hinge loss is more abrupt: as soon as $y_i (w^T x_i) > 1$, the $i$-th point is declared well-classified and no longer contributes to the solution (i.e. it is not a support vector).
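To see how abrupt the hinge loss is compared to the log-loss, one can simply tabulate both as a function of the margin $y_i w^T x_i$; a small sketch with arbitrary margin values of my own choosing:

```python
import numpy as np

# Margins y_i * w^T x_i, from badly misclassified to confidently correct.
margins = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0, 5.0])

hinge = np.maximum(0.0, 1.0 - margins)        # [1 - y f(x)]_+
log_loss = np.log1p(np.exp(-margins))         # log(1 + exp(-y f(x)))

for m, h, l in zip(margins, hinge, log_loss):
    print(f"margin {m:+.1f}:  hinge {h:.3f}   log-loss {l:.3f}")
# The hinge loss is exactly zero once the margin exceeds 1; the log-loss only decays towards zero.
```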

Chapter 5 Adaboost

In many cases it is easy to find rules of thumb that are often correct, but hard to find a single highly accurate prediction rule. Boosting is a general method for converting a set of rules of thumb into a highly accurate prediction rule: assuming that a given weak learning algorithm can consistently find classifiers (rules of thumb) at least slightly better than random, say with accuracy just above 50% in a two-class setting, then given sufficient data a boosting algorithm can provably construct a single classifier with very high accuracy.

5.1 AdaBoost

Adaboost is a boosting algorithm that provides justifiable answers to the following questions. First, how should the examples be weighted on each round? Adaboost concentrates on the hardest examples, i.e. those most often misclassified by the previous rules of thumb. Second, how should the rules of thumb be combined into a single prediction rule? Adaboost takes a weighted majority vote of these rules of thumb.

In specific, we consider that we are provided with a training set $(x_i, y_i)$, $x_i \in X$, $y_i \in \{-1, 1\}$, $i = 1, \ldots, N$. Adaboost maintains a probability density on these data, which is initially uniform, and along the way places larger weight on harder examples. At each round a weak classifier $h_t$ (a "rule of thumb") is identified that has the best performance on the weighted data, and it is used to update (i) the classifier's expression and (ii) the probability density on the data.

In specific, Adaboost can be described in terms of the following algorithm:

Initially, set $D_1(i) = \frac{1}{N}$, $i = 1 \ldots N$.
For $t = 1 \ldots T$:
Find the weak classifier $h_t : X \to \{-1, 1\}$ with smallest weighted error
$$ \epsilon_t = \frac{\sum_{i=1}^{N} D_t^i [y_i \neq h_t(x_i)]}{\sum_i D_t^i}. \qquad (5.1) $$
Set $\alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right)$.
Update the distribution:
$$ D_{t+1}^i = \frac{D_t^i}{Z_t} \times \begin{cases} \exp(-\alpha_t), & \text{if } y_i = h_t(x_i) \\ \exp(\alpha_t), & \text{if } y_i \neq h_t(x_i) \end{cases} = \frac{D_t^i}{Z_t} \exp(-\alpha_t y_i h_t(x_i)), $$
where, to guarantee that $D_{t+1}$ remains a proper density, we normalize by
$$ Z_t = \sum_i D_t^i \exp(-\alpha_t y_i h_t(x_i)). $$
Combine the classifier results (weighted vote): $f(x) = \sum_t \alpha_t h_t(x)$.
Output the final classifier:
$$ H(x) = \mathrm{sign}(f(x)). \qquad (5.2) $$
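A minimal implementation sketch of the algorithm above, using axis-aligned decision stumps as the weak learners; the synthetic dataset, the number of rounds, the stump search, and the clipping of $\epsilon_t$ are my own illustrative choices, not something prescribed by the notes.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200
X = rng.normal(size=(N, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)          # hypothetical labels in {-1, +1}

def best_stump(X, y, D):
    """Weak learner: threshold on one coordinate, minimizing the weighted error (eq. 5.1)."""
    best = None
    for j in range(X.shape[1]):
        for thr in X[:, j]:
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                eps = np.sum(D * (pred != y))
                if best is None or eps < best[0]:
                    best = (eps, j, thr, sign)
    return best

D = np.full(N, 1.0 / N)                              # initial uniform weights D_1(i) = 1/N
stumps, alphas = [], []
for t in range(20):
    eps, j, thr, sign = best_stump(X, y, D)
    eps = np.clip(eps, 1e-10, 1 - 1e-10)             # guard against eps = 0 or 1
    alpha = 0.5 * np.log((1 - eps) / eps)            # alpha_t = 1/2 log((1 - eps_t)/eps_t)
    pred = np.where(X[:, j] > thr, sign, -sign)
    D = D * np.exp(-alpha * y * pred)                # up-weight mistakes, down-weight correct points
    D /= D.sum()                                     # normalize by Z_t
    stumps.append((j, thr, sign)); alphas.append(alpha)

# Final classifier H(x) = sign(sum_t alpha_t h_t(x))   (eq. 5.2)
F = sum(a * np.where(X[:, j] > thr, s, -s) for a, (j, thr, s) in zip(alphas, stumps))
print("training error:", np.mean(np.sign(F) != y))
```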

5.1.1 Training error analysis

We first introduce the "edge" of the weak learner at the $t$-th round as
$$ \gamma_t = \frac{1}{2} - \epsilon_t. $$
We will prove that:
$$ \underbrace{\frac{1}{N} \sum_i [y_i \neq H(x_i)]}_{\text{Final Training Error}} \;\leq\; \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)} \;=\; \prod_{t=1}^{T} \sqrt{1 - 4 \gamma_t^2} \;\leq\; \exp\Big( -2 \sum_{t=1}^{T} \gamma_t^2 \Big). $$
The challenge lies in proving the first inequality; the second comes from the definition of the edge, while the third follows from the fact that $1 - x \leq \exp(-x)$. We have required that the weak learner is always correct more than half of the time on the training data, i.e. that $\gamma_t > 0 \;\forall t$; so if $\gamma_t > \gamma > 0$ for all $t$, we will then have
$$ \frac{1}{N} \sum_i [y_i \neq H(x_i)] \leq \exp(-2 T \gamma^2), \qquad (5.3) $$
which means that the number of training errors decreases exponentially with the number of rounds.

First, by recursing, we have that:
$$ D_{t+1}^i = \frac{D_t^i}{Z_t} \exp(-\alpha_t y_i h_t(x_i)) = \frac{D_{t-1}^i}{Z_{t-1} Z_t} \exp\big( -[\alpha_{t-1} y_i h_{t-1}(x_i) + \alpha_t y_i h_t(x_i)] \big) = \ldots = \frac{\exp\big( -y_i \sum_{k=1}^{t} \alpha_k h_k(x_i) \big)}{N \prod_{k=1}^{t} Z_k}, \qquad (5.4) $$
or:
$$ D_{T+1}^i = \frac{\exp(-y_i f(x_i))}{N \prod_{k=1}^{T} Z_k}, \qquad (5.5) $$
which expresses the distribution on the data in terms of the classifier's score on them. We use this result to obtain:
$$ \frac{1}{N} \sum_i [y_i \neq H(x_i)] \;\stackrel{H(x) = \mathrm{sign} f(x)}{=}\; \frac{1}{N} \sum_i \begin{cases} 1, & y_i f(x_i) \leq 0 \\ 0, & y_i f(x_i) > 0 \end{cases} \;\leq\; \frac{1}{N} \sum_i \exp(-y_i f(x_i)) \;=\; \sum_i D_{T+1}^i \prod_k Z_k \;=\; \prod_k Z_k, $$
which relates the number of errors to the normalizing constants $Z_k$ at each round.

We set out to prove that
$$ \underbrace{\frac{1}{N} \sum_i [y_i \neq H(x_i)]}_{\text{Final Training Error}} \leq \prod_{t=1}^{T} 2 \sqrt{\epsilon_t (1 - \epsilon_t)}, \qquad (5.6) $$
so, given the last result, all we need to prove is that
$$ Z_t = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}. $$
The key to this is the choice we made for $\alpha$:
$$ \alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right). \qquad (5.7) $$
In specific, we have:
$$ Z_t = \sum_i D_t^i \exp(-\alpha_t y_i h_t(x_i)) = \sum_{i : y_i \neq h_t(x_i)} D_t^i \exp(\alpha_t) + \sum_{i : y_i = h_t(x_i)} D_t^i \exp(-\alpha_t) $$
$$ = \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) = \epsilon_t \sqrt{\frac{1 - \epsilon_t}{\epsilon_t}} + (1 - \epsilon_t) \sqrt{\frac{\epsilon_t}{1 - \epsilon_t}} = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}. $$
This concludes the proof.

5.1.2 Adaboost as coordinate descent

We demonstrated that boosting optimizes the following upper bound on the misclassification cost:
$$ \frac{1}{N} \sum_i [y_i \neq H(x_i)] \leq \frac{1}{N} \sum_i \exp(-y_i f(x_i)) \qquad (5.8) $$
$$ = \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right]. \qquad (5.9) $$

Actually, the choice of $\alpha_t$ is such that $Z_t$ is minimized at each iteration:
$$ \frac{\partial}{\partial \alpha_t} \left[ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) \right] = 0 \;\Rightarrow\; \epsilon_t \exp(\alpha_t) - (1 - \epsilon_t) \exp(-\alpha_t) = 0 \;\Rightarrow\; \exp(2 \alpha_t) = \frac{1 - \epsilon_t}{\epsilon_t} \;\Rightarrow\; \alpha_t = \frac{1}{2} \log\left( \frac{1 - \epsilon_t}{\epsilon_t} \right), $$
which gives
$$ \epsilon_t \exp(\alpha_t) + (1 - \epsilon_t) \exp(-\alpha_t) = 2 \sqrt{\epsilon_t (1 - \epsilon_t)}. \qquad (5.10) $$
So choosing the best (in the sense of minimizing $Z_t$, and thence (5.8)) value of $\alpha$ tells us that $Z_t$ will equal $2 \sqrt{\epsilon_t (1 - \epsilon_t)}$, where $\epsilon_t < 1/2$ is the weighted error of the weak learner. Moreover, $\epsilon_t (1 - \epsilon_t)$ is increasing in $\epsilon_t$ on $[0, 1/2]$, so finding the weak learner with minimal $\epsilon_t$ guarantees that we decrease (5.8) as quickly as possible.

We can conceptually describe what this amounts to as follows: we have at our disposal the responses of all weak learners (an infinite number of functions in general), and consider combining them with a weight vector, initially set to zero. In this setting, the Adaboost algorithm can be interpreted as coordinate descent (an optimization technique that changes a single dimension of the optimized vector at each iteration): we pick the dimension (weak learner $h_t$) and step (voting strength $\alpha_t$) that result in the sharpest decrease of the cost. So we can see Adaboost as a greedy, but particularly effective, optimization procedure.

5.1.3 Adaboost as building an additive model

A similar interpretation of Adaboost is in terms of fitting an additive model by finding, at each step, the optimal perturbation of a previously constructed decision rule. In specific, consider that we have $f(x)$ at round $t$ and we want to find how to optimally perturb it by
$$ f'(x) = f(x) + \alpha h(x), \qquad (5.11) $$

where $h(x)$ is a weak learner and $\alpha$ is the perturbation scale. To consider the impact that this perturbation will have on the cost we use a Taylor expansion as follows:
$$ \frac{1}{Z} \sum_i \exp(-y_i f'(x_i)) = \frac{1}{Z} \sum_i \exp(-y_i [f(x_i) + \alpha h(x_i)]) \simeq \frac{1}{Z} \sum_i \exp(-y_i f(x_i)) \left[ 1 - \alpha y_i h(x_i) + \frac{\alpha^2}{2} (y_i)^2 h^2(x_i) \right] $$
$$ = \frac{1}{Z} \sum_i \exp(-y_i f(x_i)) \left[ 1 - \alpha y_i h(x_i) + \frac{\alpha^2}{2} \right] = \sum_i \underbrace{\frac{\exp(-y_i f(x_i))}{Z}}_{D_i} \left[ 1 - \alpha y_i h(x_i) + \frac{\alpha^2}{2} \right], $$
where $Z = \sum_i \exp(-y_i f(x_i))$. So our goal is to minimize
$$ \sum_i D_i \left[ 1 - \alpha y_i h(x_i) + \frac{\alpha^2}{2} \right]. \qquad (5.12) $$
Finding the optimal $h$ amounts to solving
$$ h = \arg\max_h \sum_i D_i y_i h(x_i) = \arg\max_h (1 - 2\epsilon), \qquad (5.13) $$
so picking the classifier with minimal $\epsilon$ amounts to finding the optimal $h$. And when optimizing w.r.t. $\alpha$ we get:
$$ \alpha = \arg\min_\alpha \sum_{i=1}^{N} D_i \exp(-\alpha y_i h(x_i)) = \arg\min_\alpha \; \epsilon \exp(\alpha) + (1 - \epsilon) \exp(-\alpha). \qquad (5.14) $$
So, once more, we recover the Adaboost algorithm.

5.1.4 Adaboost as a method for estimating the posterior

Consider the expectation of the exponential loss:
$$ E_{X,Y}[\exp(-y f(x))] = \int_x \sum_y \exp(-y f(x)) P(x, y) \, dx = \int_x \Big[ \sum_y \exp(-y f(x)) P(y | x) \Big] P(x) \, dx $$
$$ = \int_x \left[ P(y = 1 | x) \exp(-f(x)) + P(y = -1 | x) \exp(f(x)) \right] P(x) \, dx. $$

If a function $f(x)$ were to minimize this cost, at each point $x$ it should satisfy
$$ f(x) = \arg\min_g \; P(y = 1 | x) \exp(-g) + P(y = -1 | x) \exp(g), $$
or
$$ f(x) = \frac{1}{2} \log \frac{P(y = 1 | x)}{P(y = -1 | x)}, \qquad \text{i.e.} \qquad P(y | x) = \frac{1}{1 + \exp(-2 y f(x))}. $$
In Adaboost we work with functions that are expressed as sums of weak learners. So the optimal $f(x)$ will not necessarily be given by the expression $\frac{1}{2} \log \frac{P(y=1|x)}{P(y=-1|x)}$, but will rather be an approximation to it, formed by a linear combination of weak learners.

5.2 Learning theory

For pattern recognition we are interested in small-sample analysis and bounds on worst-case performance.

Hoeffding inequality: let $Z_1, \ldots, Z_m$ be i.i.d., drawn from a Bernoulli distribution: $P(Z_i) = \phi^{Z_i} (1 - \phi)^{1 - Z_i}$. Consider their mean $\hat{\phi} = \frac{1}{m} \sum_{i=1}^{m} Z_i$. Then for any $\gamma > 0$,
$$ P(|\phi - \hat{\phi}| > \gamma) \leq 2 \exp(-2 \gamma^2 m). $$
On a training set $\{x_i, y_i\}$, $i = 1 \ldots m$, consider the training error of a hypothesis $h$,
$$ \hat{e}(h) = \frac{1}{m} \sum_{i=1}^{m} [h(x_i) \neq y_i], $$
and the generalization error
$$ e(h) = P_{(x,y) \sim D}(h(x) \neq y). $$
How different will $e$ be from $\hat{e}$ if the training set is drawn from $D$?

Consider a hypothesis class: the set of all classifiers considered, e.g. for linear classification, all linear classifiers. Consider that we have a finite class
$$ H = \{h_1, \ldots, h_K\}, \qquad (5.15) $$
e.g. a small set of decision stumps for a few thresholds; there are no parameters to estimate. Take any of these classifiers, $h_i$. Consider the Bernoulli variable $Z$ generated by sampling $(x, y) \sim D$ and indicating whether $h_i(x) \neq y$. We can write the empirical error in terms of this Bernoulli variable:
$$ \hat{e}(h_i) = \frac{1}{m} \sum_{j=1}^{m} Z_j. $$

The mean of this variable is $e(h_i)$, i.e. the generalization error of the $i$-th classifier. We then have, from the Hoeffding inequality:
$$ P(|\hat{e}(h_i) - e(h_i)| > \gamma) \leq 2 \exp(-2 \gamma^2 m). $$
For large $m$, the training error will be close to the generalization error. Consider now all possible classifiers in the class; using the union bound, $P(A_1 \cup A_2 \cup \ldots \cup A_k) \leq P(A_1) + \ldots + P(A_k)$, we get:
$$ P(\exists h \in H : |\hat{e}(h) - e(h)| > \gamma) \leq \sum_{k=1}^{K} P(|\hat{e}(h_k) - e(h_k)| > \gamma) \leq K \cdot 2 \exp(-2 \gamma^2 m). $$
Consider that we have $m$ training points and $k$ hypotheses, and that we want to find a $\gamma$ such that $P(\exists h : |\hat{e}(h) - e(h)| > \gamma) \leq \delta$. Plugging into the above equation we have
$$ \gamma = \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}. $$
So with probability $1 - \delta$ we have, for all $h \in H$:
$$ |\hat{e}(h) - e(h)| \leq \sqrt{\frac{1}{2m} \log \frac{2k}{\delta}}. \qquad (5.16) $$
What does this mean? Consider $h^*$ to be the best possible hypothesis in terms of generalization error, and $\hat{h}$ to be the best in terms of training error. We then have
$$ e(\hat{h}) \leq \hat{e}(\hat{h}) + \gamma \leq \hat{e}(h^*) + \gamma \leq e(h^*) + 2\gamma. \qquad (5.17) $$
The generalization error of the chosen hypothesis is not much larger than that of the optimal one.

Bias/variance: if we have a small hypothesis set, we may have a large $e(\hat{h})$, but it will not differ much from $e(h^*)$. If we have a large hypothesis set, $e(h^*)$ becomes smaller, but $\gamma$ increases.
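As a small numerical illustration of (5.16) (my own example, not from the notes): the uniform deviation $\gamma$ needed for a given confidence shrinks like $1/\sqrt{m}$ and grows only logarithmically in the number of hypotheses $k$.

```python
import numpy as np

def uniform_deviation(m, k, delta):
    """gamma such that, w.p. 1 - delta, |e_hat(h) - e(h)| <= gamma for all k hypotheses (eq. 5.16)."""
    return np.sqrt(np.log(2 * k / delta) / (2 * m))

for m in (100, 1000, 10000):
    print(m, uniform_deviation(m, k=1000, delta=0.05))
# gamma decreases as 1/sqrt(m); doubling k only adds log(2) inside the square root.
```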