Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant Functions Statistical Learning Theory & SVMs Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de Generative Models (4 weeks) Bayesian Networks Markov Random Fields leibe@vision.rwth-aachen.de Many slides adapted from B. Schiele 3 Topics of This Lecture Recap: Linear Discriminant Functions Recap: Generalized Linear Discriminants Logistic Regression Probabilistic discriminative models Logistic sigmoid (logit function) Cross-entropy error Gradient descent Iteratively Reweighted Least Squares Note on error functions Statistical Learning Theory Generalization and overfitting Empirical and actual risk VC dimension Empirical Risk Minimization Structural Risk Minimization 4 Basic idea Directly encode decision boundary Minimize misclassification probability directly. Linear discriminant functions y(x) = w T x + w 0 weight vector bias (= threshold) w, w 0 define a hyperplane in R D. If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable. y = 0 y < 0 y > 0 5 Recap: Extension to Nonlinear Basis Fcts. Recap: Basis Functions Generalization Transform vector x with M nonlinear basis functions Á j (x): MX y k (x) = w Á j (x) + w k0 Advantages j=1 Transformation allows non-linear decision boundaries. By choosing the right Á j, every continuous function can (in principle) be approximated with arbitrary accuracy. Generally, we consider models of the following form where Á j (x) are known as basis functions. In the simplest case, we use linear basis functions: Á d (x) = x d. Other popular basis functions Disadvantage The error function can in general no longer be minimized in closed form. Minimization with Gradient Descent 6 Polynomial Gaussian Sigmoid 7 1
Gradient Descent Recap: Gradient Descent Iterative minimization Start with an initial guess for the parameter values w (0). Move towards a (local) minimum by following the gradient. Basic strategies Batch learning @E(w) @w w( ) Example: Quadratic error function E(w) = (y(x n ; w) t n ) 2 Sequential updating leads to delta rule (=LMS rule) (y k(x n ; w) t kn ) Á j (x n ) Sequential updating where E(w) = @E n(w) @w E n (w) w( ) 8 where ± kná j (x n ) ± kn = y k (x n ; w) t kn Simply feed back the input data point, weighted by the classification error. 9 Recap: Gradient Descent Recap: Classification as Dim. Reduction Cases with differentiable, non-linear activation function 0 1 MX y k (x) = g(a k ) = g @ w ki Á j (x n ) A j=0 bad separation good separation Gradient descent (again with quadratic error function) @E n (w) @w = @g(a k) @w (y k (x n ; w) t kn ) Á j (x n ) ± kná j (x n ) ± kn = @g(a k) @w (y k (x n ; w) t kn ) 10 Classification as dimensionality reduction Interpret linear classification as a projection onto a lower-dim. space. y = w T x Learning problem: Try to find the projection vector w that maximizes class separation. 11 Image source: C.M. Bishop, 2006 Recap: Fisher s Linear Discriminant Analysis Topics of This Lecture Class 1 x x w Slide adapted from Ales Leonardis Class 2 Maximize distance between classes Minimize distance within a class Criterion: S B between-class scatter matrix S W within-class scatter matrix The optimal solution for w can be obtained as: w / S 1 W (m 2 m 1 ) Classification function: y J(w) = wt S Bw w T S W w where w 0 = w T m 12 Recap: Generalized Linear Discriminants Logistic Regression Probabilistic discriminative models Logistic sigmoid (logit function) Cross-entropy error Gradient descent Iteratively Reweighted Least Squares Note on error functions Statistical Learning Theory Generalization and overfitting Empirical and actual risk VC dimension Empirical Risk Minimization Structural Risk Minimization 17 2
Probabilistic Discriminative Models Probabilistic Discriminative Models We have seen that we can write p(c 1 jx) = ¾(a) 1 = 1 + exp( a) We can obtain the familiar probabilistic model by setting a = ln p(xjc 1)p(C 1 ) p(xjc 2 )p(c 2 ) Or we can use generalized linear discriminant models or a = w T x a = w T Á(x) logistic sigmoid function 18 In the following, we will consider models of the form with This model is called logistic regression. Why should we do this? What advantage does such a model have compared to modeling the probabilities? p(ájc 1 )p(c 1 ) p(c 1 já) = p(ájc 1 )p(c 1 ) + p(ájc 2 )p(c 2 ) Any ideas? p(c 1 já) = y(á) = ¾(w T Á) p(c 2 já) = 1 p(c 1 já) 19 Comparison Logistic Sigmoid Let s look at the number of parameters Assume we have an M-dimensional feature space Á. And assume we represent p(á C k ) and p(c k ) by Gaussians. How many parameters do we need? For the means: 2M For the covariances: M(M+1)/2 Together with the class priors, this gives M(M+5)/2+1 parameters! How many parameters do we need for logistic regression? p(c 1 já) = y(á) = ¾(w T Á) Properties Definition: Inverse: ¾(a) = Symmetry property: 1 1 + exp( a) µ ¾ a = ln 1 ¾ ¾( a) = 1 ¾(a) logit function Just the values of w M parameters. For large M, logistic regression has clear advantages! Derivative: d¾ = ¾(1 ¾) da 20 21 Logistic Regression Gradient of the Error Function y n = ¾(w T Á n ) Let s consider a data set {Á n,t n } with n = 1,,N, where Á n = Á(x n ) and t n 2 f0; 1g, t = (t 1 ; : : : ; t N ) T. With y n = p(c 1 Á n ), we can write the likelihood as NY p(tjw) = yn tn f1 y ng 1 tn Define the error function as the negative log-likelihood E(w) = ln p(tjw) = ft n ln y n + (1 t n ) ln(1 y n )g This is the so-called cross-entropy error function. 22 dy n Error function dw = y n(1 y n )Á n E(w) = ft n ln y n + (1 t n ) ln(1 y n )g Gradient ( d dw re(w) = t y d n dw n + (1 t n ) (1 y ) n) y n (1 y n ) ½ y n (1 y n ) = t n Á y n (1 t n ) y ¾ n(1 y n ) n (1 y n ) Á n = f(t n t n y n y n + t n y n )Á n g = (y n t n )Á n 23 3
Gradient of the Error Function A More Efficient Iterative Method Gradient for logistic regression re(w) = (y n t n )Á n Second-order Newton-Raphson gradient descent scheme H 1 re(w) where H = rre(w) is the Hessian matrix, i.e. the matrix of second derivatives. Does this look familiar to you? This is the same result as for the Delta (=LMS) rule (y k(x n ; w) t kn )Á j (x n ) Properties Local quadratic approximation to the log-likelihood. Faster convergence. We can use this to derive a sequential estimation algorithm. However, this will be quite slow 24 25 Newton-Raphson for Least-Squares Estimation Newton-Raphson for Logistic Regression Let s first apply Newton-Raphson to the least-squares error function: E(w) = 1 w T 2 Á 2 n t n re(w) = H = rre(w) = w T Á n t n Án = T w T t 2 Á n Á T n = T where 6 = 4 Resulting update scheme: ( T ) 1 ( T w ( ) T t) Á T 1.. Á T N = ( T ) 1 T t Closed-form solution! 3 7 5 26 Now, let s try Newton-Raphson on the cross-entropy error function: E(w) = ft n ln y n + (1 t n ) ln(1 y n )g re(w) = H = rre(w) = dy n dw = yn(1 yn)á n (y n t n )Á n = T (y t) y n (1 y n )Á n Á T n = T R where R is an N N diagonal matrix with R nn = y n (1 y n ). The Hessian is no longer constant, but depends on w through the weighting matrix R. 27 Iteratively Reweighted Least Squares Summary: Logistic Regression Update equations ( T R ) 1 T (y t) o = ( T R ) n 1 T R w ( ) T (y t) = ( T R ) 1 T Rz with z = w ( ) R 1 (y t) Again very similar form (normal equations) But now with non-constant weighing matrix R (depends on w). Need to apply normal equations iteratively. Iteratively Reweighted Least-Squares (IRLS) 28 Properties Directly represent posterior distribution p(á C k ) Requires fewer parameters than modeling the likelihood + prior. Very often used in statistics. It can be shown that the cross-entropy error function is concave Optimization leads to unique minimum But no closed-form solution exists Iterative optimization (IRLS) Both online and batch optimizations exist There is a multi-class version described in (Bishop Ch.4.3.4). Caveat Logistic regression tends to systematically overestimate odds ratios when the sample size is less than ~500. 29 4
Topics of This Lecture A Note on Error Functions Recap: Generalized Linear Discriminants Logistic Regression Probabilistic discriminative models Logistic sigmoid (logit function) Cross-entropy error Gradient descent Iteratively Reweighted Least Squares Note on error functions Statistical Learning Theory Generalization and overfitting Empirical and actual risk VC dimension Empirical Risk Minimization Structural Risk Minimization 30 Not differentiable! Ideal misclassification error function (black) This is what we want to approximate, Unfortunately, it is not differentiable. The gradient is zero for misclassified points. Ideal misclassification error z n = t n y(x n ) We cannot minimize it by gradient descent. 31 Image source: Bishop, 2006 A Note on Error Functions A Note on Error Functions Ideal misclassification error Squared error Ideal misclassification error Squared error Squared error (sigmoid) Sensitive to outliers! But: Zero gradient here! Penalizes too correct data points! Sensitivity to outliers fixed! Too correct data points fixed! z n = t n y(x n ) z n = t n y(x n ) Squared error used in Least-Squares Classification Very popular, leads to closed-form solutions. However, sensitive to outliers due to squared penalty. Penalizes too correct data points Generally does not lead to good classifiers. 32 Image source: Bishop, 2006 Squared error with sigmoid activation function (tanh) Fixes the problems with outliers and too correct data points. But: zero gradient for confidently misclassified data points. Will give better performance than original squared error, but still does not fix all problems. 33 Image source: Bishop, 2006 A Note on Error Functions Topics of This Lecture Robust to outliers! Ideal misclassification error Squared error Cross-entropy error Recap: Generalized Linear Discriminants Logistic Regression Probabilistic discriminative models Logistic sigmoid (logit function) Cross-entropy error Gradient descent Iteratively Reweighted Least Squares Cross-Entropy Error Minimizer of this error is given by posterior class probabilities. Concave error function, unique minimum exists. Robust to outliers, error increases only roughly linearly z n = t n y(x n ) But no closed-form solution, requires iterative estimation. 34 Image source: Bishop, 2006 Note on error functions Statistical Learning Theory Generalization and overfitting Empirical and actual risk VC dimension Empirical Risk Minimization Structural Risk Minimization 36 5
Generalization and Overfitting Example: Linearly Separable Data Goal: predict class labels of new observations Train classification model on limited training set. The further we optimize the model parameters, the more the training error will decrease. However, at some point the test error will go up again. Overfitting to the training set! test error training error 37 Image source: B. Schiele Overfitting is often a problem with linearly separable data Which of the many possible decision boundaries is correct? All of them have zero error on the training set However, they will most likely result in different predictions on novel test data. Different generalization performance How to select the classifier with the best generalization performance???? 38 A Broader View on Statistical Learning Statistical Learning Theory Formal treatment: Statistical Learning Theory Supervised learning (from the learning machine s view) Supervised learning Environment: assumed stationary. I.e. the data x have an unknown but fixed probability density p X (x) Selection of a specific function f(x; ) Given: training examples f(x i ; y i )g N i=1 Goal: the desired response y shall be approximated optimally. Teacher: specifies for each data point x the desired classification y (where y may be subject to noise). y = g(x; º) with noise º Measuring the optimality Loss function L(y; f(x; )) Learning machine: represented by class of functions, which produce for each x an output y: y = f(x; ) with parameters Example: quadratic loss L(y; f(x; )) = (y f(x; )) 2 41 42 Risk Risk Measuring the optimality Measure the optimality by the risk (= expected loss). Difficulty: how should the risk be estimated? However, what we re really interested in is Actual risk (= ZExpected risk) R( ) = L(y; f(x; ))dp X;Y (x; y) Practical way Empirical risk (measured on the training/validation set) R emp ( ) = 1 N L(y i ; f(x i ; )) i=1 P X;Y (x; y) is the probability distribution of (x,y). P X;Y (x; y) is fixed, but typically unknown. In general, we can t compute the actual risk directly! Example: quadratic loss function R emp ( ) = 1 N (y i f(x i ; )) 2 i=1 The expected risk is the expectation of the error on all data. I.e., it is the expected value of the generalization error. 43 44 6
Summary: Risk Actual risk Advantage: measure for the generalization ability Disadvantage: in general, we don t know P X;Y (x; y) Empirical risk Disadvantage: no direct measure of the generalization ability Advantage: does not depend on P X;Y (x; y) We typically know learning algorithms which minimize the empirical risk. Strong interest in connection between both types of risk 45 Statistical Learning Theory Idea Compute an upper bound on the actual risk based on the empirical risk where N: number of training examples p * : probability that the bound is correct h: capacity of the learning machine ( VC-dimension ) Side note: R( ) R emp ( ) + ²(N; p ; h) (This idea of specifying a bound that only holds with a certain probability is explored in a branch of learning theory called Probably Approximately Correct or PAC Learning). 48 VC Dimension Vapnik-Chervonenkis dimension Measure for the capacity of a learning machine. VC Dimension Interpretation as a two-player game Opponent s turn: He says a number N. Formal definition: If a given set of ` points can be labeled in all possible 2` ways, and for each labeling, a member of the set {f( )} can be found which correctly assigns those labels, we say that the set of points is shattered by the set of functions. The VC dimension for the set of functions {f( )} is defined as the maximum number of training points that can be shattered by {f( )}. Our turn: We specify a set of N points {x 1,,x N }. Opponent s turn: He gives us a labeling {x 1,,x N }2 {0,1} N Our turn: We specify a function f( ) which correctly classifies all N points. If we can do that for all 2 N possible labelings, then the VC dimension is at least N. 49 50 VC Dimension VC Dimension Example The VC dimension of all oriented lines in R 2 is 3. 1. Shattering 3 points with an oriented line: Intuitive feeling (unfortunately wrong) The VC dimension has a direct connection with the number of parameters. Counterexample f(x; ) = g(sin( x)) ( 1; x > 0 g(x) = 1; x 0 2. More difficult to show: it is not possible to shatter 4 points (XOR) More general: the VC dimension of all hyperplanes in R n is n+1. 51 Image source: C. Burges, 1998 Just a single parameter. Infinite VC dimension Proof: Choose x i = 10 i ; i = 1; : : : ; ` Ã! `X (1 y i )10 i = ¼ 1 + 2 i=1 52 7
Upper Bound on the Risk Upper Bound on the Risk Important result (Vapnik 1979, 1995) r h(log(2n=h) + 1) log( =4) With probability (1- ), the following bound holds R( ) R emp ( ) + N This bound is independent of P X;Y (x; y)! VC confidence Typically, we cannot compute the left-hand side (the actual risk) If we know h (the VC dimension), we can however easily compute the risk bound R( ) R emp ( ) + ²(N; p ; h) ²(N; p ; h) R emp ( ) 53 54 Structural Risk Minimization References and Further Reading How can we implement this? R( ) R emp ( ) + ²(N; p ; h) More information on SVMs can be found in Chapter 7.1 of Bishop s book. Classic approach ²(N; p ; h) R emp ( ) Keep constant and minimize. ²(N; p ; h) be kept constant by controlling the model can parameters. Christopher M. Bishop Pattern Recognition and Machine Learning Springer, 2006 Support Vector Machines (SVMs) R emp ( ) ²(N; p ; h) Keep constant and minimize. In fact: R emp ( ) = 0 for separable data. ²(N; p ; h) Control by adapting the VC dimension (controlling the capacity of the classifier). 56 Additional information about Statistical Learning Theory and a more in-depth introduction to SVMs are available in the following tutorial: C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, Vol. 2(2), pp. 121-167 1998. 57 8