Machine Learning 2017
Volker Roth
Department of Mathematics & Computer Science, University of Basel
21st March 2017
Section 5: Classification
Outline
Goal: understanding the problem settings in classification and regression estimation.
Focus on classification:
- Understanding the concept of Bayesian decision theory.
- Understanding the notion of discriminant functions.
- Understanding generalized linear discriminant functions and separating hyperplanes.
- Understanding some simple classification algorithms.
Classification Example
Sorting incoming fish on a conveyor according to species using optical sensing.
Features:
- Length
- Brightness
- Width
- Shape of fins
Classification
Given an environment characterized by p(x, y):
- k classes c_1, ..., c_k; y is a class indicator variable.
- A supervisor classifies x into the k classes according to P(c | x).
Goal: construct a function that classifies in the same manner as the supervisor.
Methods: Linear Discriminant Analysis, Nearest Neighbor, Support Vector Machines, etc.
Bayesian Decision Theory
Assign an observed x ∈ R^d to one of k classes.
A classifier is a mapping that assigns labels to observations, f_α : x ↦ {1, ..., k}.
For any observation x there is a set of k possible actions α_i, i.e. k different assignments of labels.
The loss L incurred for taking action α_i when the true label is j is given by a loss matrix L_ij = L(α_i | c = j).
A natural 0-1 loss function is obtained by simply counting misclassifications:
L_ij = 1 − δ_ij, where δ_ij = 1 if i = j and 0 otherwise.
Bayesian Decision Theory (cont'd)
A classifier is trained on a set of observed pairs {(x_1, c_1), ..., (x_n, c_n)} drawn i.i.d. from p(x, c) = p(c | x) p(x).
The probability that a given x is a member of class c_j is obtained via the Bayes rule:
P(c_j | x) = p(x | c = j) P(c = j) / p(x), where p(x) = Σ_{j=1}^k p(x | c = j) P(c = j).
Given an observation x, the expected loss associated with choosing action α_i (the conditional risk) is
R(f_{α_i} | x) = Σ_{j=1}^k L_ij P(c_j | x), which for 0-1 loss equals Σ_{j ≠ i} P(c_j | x) = 1 − P(c_i | x).
Bayesian Decision Theory (cont'd)
Goal: minimize the overall risk
R(f_α) = ∫_{R^d} R(f_α(x) | x) p(x) dx.
If f_α(x) minimizes the conditional risk R(f_α(x) | x) for every x, the overall risk is minimized as well.
This is achieved by the Bayes optimal classifier, which chooses the mapping
f*(x) = argmin_i Σ_{j=1}^k L_ij P(c = j | x).
For 0-1 loss this reduces to classifying x to the class with the highest posterior probability:
f*(x) = argmax_i P(c = i | x).
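The Bayes rule above is easy to state in code. The following is a minimal sketch (the function name is ours, not part of the lecture): given a vector of posteriors and a loss matrix, it picks the action with the smallest conditional risk; with 0-1 loss it reduces to the argmax of the posterior.

```python
import numpy as np

def bayes_decision(posteriors, loss):
    """Pick the action i minimizing the conditional risk
    R(alpha_i | x) = sum_j L_ij P(c_j | x)."""
    risks = loss @ posteriors          # one expected loss per action
    return int(np.argmin(risks))

# With 0-1 loss L_ij = 1 - delta_ij, the rule reduces to the argmax posterior.
posteriors = np.array([0.2, 0.5, 0.3])
loss01 = 1.0 - np.eye(3)
print(bayes_decision(posteriors, loss01))   # -> 1, the class with highest posterior
```

With an asymmetric loss matrix the same code can prefer a class with lower posterior, which is exactly the point of carrying the loss matrix through the derivation.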
Bayesian Decision Theory (cont'd)
Simplification: only 2 classes, so c is a Binomial RV.
The Bayes optimal classifier is defined by the zero crossings of the Bayes (optimal) discriminant function
G(x) = P(c_1 | x) − P(c_2 | x), or g(x) = log [P(c_1 | x) / P(c_2 | x)].
Link to regression: use the encoding {+1, −1} for the two possible states c_{1,2} of c.
The conditional expectation of c | x equals the Bayes discriminant function:
E[c | x] = Σ_{c ∈ {+1, −1}} c P(c | x) = P(c_1 | x) − P(c_2 | x) = G(x).
Classification can be viewed as a (local) approximation of G(x) = E[c | x] near its zero crossings.
Linear Discriminant Functions
Problem: direct approximation of G would require knowledge of the Bayes optimal discriminant.
We have to define a hypothesis class from which a function is chosen by some inference mechanism.
One such class is the set of linear discriminant functions g(x; w) = w_0 + w^T x.
Decide c_1 if g(x; w) > 0 and c_2 if g(x; w) < 0.
The equation g(x; w) = 0 defines the decision surface.
Linearity of g(x; w) ⇒ hyperplane: w is orthogonal to any vector lying in the plane.
The hyperplane divides the feature space into half-spaces R_1 ("positive side") and R_2 ("negative side").
Decision Hyperplanes
g(x; w) defines the distance r from x to the hyperplane H: write x = x_p + r w/||w||, where x_p is the projection of x onto H.
Since g(x_p) = 0, it follows that g(x) = r ||w||, so r = g(x)/||w||.
[Figure: geometry of the hyperplane H with normal vector w, offset w_0/||w||, and the half-spaces R_1 and R_2.]
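As a small illustration of r = g(x)/||w|| (the helper name is ours), the signed distance can be computed directly:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance from x to the hyperplane g(x; w) = w0 + w^T x = 0.
    Positive on the R_1 side (g > 0), negative on the R_2 side."""
    return (w0 + w @ x) / np.linalg.norm(w)

w = np.array([3.0, 4.0])            # normal vector, ||w|| = 5
w0 = -5.0
x = np.array([3.0, 4.0])            # g(x) = 9 + 16 - 5 = 20
print(signed_distance(x, w, w0))    # -> 4.0
```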
Generalized Linear Discriminant Functions
Use basis functions {b_1(x), ..., b_m(x)}, where each b_i(x) : R^d → R, and
g(x; w) = w_0 + w_1 b_1(x) + ... + w_m b_m(x) =: w^T y
(note that we have redefined y here in order to be consistent with the following figure).
[Figure 5.5: The mapping y = (1, x, x^2)^T takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting y-space into regions corresponding to two categories, and this in turn gives a non-simply connected decision region in the one-dimensional x-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
Generalized Linear Discriminant Functions
Use basis functions {b_1(x), ..., b_m(x)}, where each b_i(x) : R^d → R, and
g(x; w) = w_0 + w_1 b_1(x) + ... + w_m b_m(x) =: w^T y.
[Figure 5.6: The two-dimensional input space x is mapped through a polynomial function to y. Here the mapping is y_1 = x_1, y_2 = x_2 and y_3 = α x_1 x_2. A linear discriminant in y-space yields a nonlinear decision region in x-space. From: Duda, Hart & Stork, Pattern Classification.]
Separable Case
Consider a sample {y_i, c_i}_{i=1}^n.
If there exists f(y; w) = y^T w which is positive for all examples in class 1 and negative for all examples in class 2, we say that the sample is linearly separable.
"Normalization": replace all samples labeled c_2 by their negatives ⇒ simply write y^T w > 0 for all samples.
Each sample places a constraint on the possible location of w ⇒ solution region.
[Figure 5.8: Four training samples (black for ω_1, red for ω_2) and the solution region in feature space. Left: the raw data with a separating plane; right: the normalized data with a "separating" plane. From: Duda, Hart & Stork, Pattern Classification.]
Separable Case: margin
Different solution vectors may have different margins b: y^T w ≥ b > 0.
Intuitively, large margins are good.
[Figure 5.9: The effect of the margin on the solution region. Left: no margin (b = 0), equivalent to the case shown in Fig. 5.8. Right: a margin b > 0 shrinks the solution region by b/||y_i|| along each constraint. From: Duda, Hart & Stork, Pattern Classification.]
Gradient Descent
Solve y^T w > 0 by defining J(w) such that a minimizer of J is a solution.
Start with an initial w(1), and choose the next value by moving in the direction of steepest descent:
w(k+1) = w(k) − η(k) ∇J(w(k)).
Alternatively, use a second order expansion (Newton):
w(k+1) = w(k) − H^{-1} ∇J(w(k)).
[Figure: descent trajectory on the surface of J.]
Minimizing the Perceptron Criterion
Solve y^T w > 0 by defining J(w) such that a minimizer of J is a solution.
Most obvious choice: the number of misclassifications, but this is not differentiable.
Alternative choice: J_p(w) = Σ_{y ∈ M} (−y^T w), where M(w) is the set of samples misclassified by w.
Since y^T w < 0 for all y ∈ M, J_p is non-negative, and zero only if w is a solution.
Gradient: ∇J_p(w) = Σ_{y ∈ M} (−y), so the update is
w(k+1) = w(k) + η(k) Σ_{y ∈ M} y.
This defines the Batch Perceptron algorithm.
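A minimal sketch of the batch perceptron under the "normalized" convention above (class-2 samples already negated, so a solution satisfies Y w > 0 componentwise); the names are ours:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: repeatedly add the sum of the misclassified
    (normalized) samples until Y @ w > 0 for all rows."""
    w = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        M = Y[Y @ w <= 0]          # currently misclassified samples
        if len(M) == 0:
            return w               # converged: all constraints satisfied
        w = w + eta * M.sum(axis=0)
    return w                       # may not have converged (non-separable data)

# Augmented samples y = (1, x); class-2 rows negated.
Y = np.array([[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [-1.0, 2.0]])
w = batch_perceptron(Y)
print(np.all(Y @ w > 0))           # -> True for this separable toy set
```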
Minimizing the Perceptron Criterion (2)
[Figure: the perceptron criterion J_p(a) plotted over weight space, together with the samples y_1, y_2, y_3 and the solution region.]
Fixed-Increment Single Sample Perceptron
Fix the learning rate η(k) = 1. Sequential single-sample updates: use superscripts y^1, y^2, ... for misclassified samples y ∈ M. The ordering is irrelevant.
Simple algorithm:
w(1) arbitrary; w(k+1) = w(k) + y^k, k ≥ 1.
Perceptron Convergence Theorem: If the samples are linearly separable, the sequence of weight vectors given by the Fixed-Increment Single Sample Perceptron algorithm will terminate at a solution vector.
Proof: exercises.
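The fixed-increment single-sample variant is equally short. A sketch with our own loop structure, cycling through the data in epochs:

```python
import numpy as np

def single_sample_perceptron(Y, max_epochs=1000):
    """Fixed-increment single-sample perceptron (eta = 1): cycle through
    the normalized samples; whenever y^T w <= 0, update w <- w + y."""
    w = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        updates = 0
        for y in Y:
            if y @ w <= 0:
                w = w + y
                updates += 1
        if updates == 0:           # a full pass without mistakes: done
            return w
    return w

Y = np.array([[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [-1.0, 2.0]])
w = single_sample_perceptron(Y)
print(np.all(Y @ w > 0))           # -> True, as the convergence theorem guarantees
```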
Issues
A number of problems with the perceptron algorithm:
- When the data are separable, there are many solutions, and which one is found depends on the starting values. In particular, no separation margin can be guaranteed (however, modified versions exist...).
- The number of steps can be very large.
- When the data are not separable, the algorithm will not necessarily converge, and cycles may occur. The cycles can be long and therefore hard to detect.
- The method is technical in nature, with no (obvious) connection to learning theory.
Informative (or Generative) vs Discriminative
Recall: classification can be viewed as a (local) approximation of G(x) = E[c | x] near its zero crossings.
Two main strategies for solving this problem:
- Informative: informative classifiers model the class densities. The likelihood of each class is examined, and classification is done by assigning to the most likely class.
- Discriminative: these classifiers model the class boundaries or the class membership probabilities directly. No attempt is made to model the underlying class conditional densities.
Informative Classifiers
Central idea: model the conditional class densities p(x | c).
Use the Bayes rule to compute the posteriors:
P(c_j | x) = p(x | c = j) P(c = j) / p(x).
Assuming a parametrized class conditional density p_{θ_j}(x | c = j) and collecting all model parameters in a vector θ, these parameters are usually estimated by maximizing the full log likelihood:
θ̂_MLE = argmax_θ Σ_{i=1}^n log p_θ(x_i, c_i) = argmax_θ Σ_{i=1}^n [log p_θ(x_i | c_i) + log P_θ(c_i)].
Informative Classifiers: LDA
In Linear Discriminant Analysis (LDA), a Gaussian model is used in which all classes share a common covariance matrix Σ:
p_θ(x | c = j) = N(x; μ_j, Σ).
The resulting discriminant functions are linear:
g(x) = log [P(c = 1) N(x; μ_1, Σ) / (P(c = 2) N(x; μ_2, Σ))]
     = log(π_1/π_2) − (1/2)(μ_1 + μ_2)^T Σ^{-1}(μ_1 − μ_2)   [ = w_0 ]
       + (μ_1 − μ_2)^T Σ^{-1} x                              [ = w^T x ]
     = w_0 + w^T x.
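The LDA weights can be read off directly from the Gaussian parameters. A sketch (the helper name is ours), assuming known means, shared covariance, and priors:

```python
import numpy as np

def lda_weights(mu1, mu2, Sigma, pi1, pi2):
    """w0, w of the linear discriminant g(x) = w0 + w^T x for two
    Gaussians with shared covariance Sigma and priors pi1, pi2."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = np.log(pi1 / pi2) - 0.5 * (mu1 + mu2) @ Sinv @ (mu1 - mu2)
    return w0, w

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
w0, w = lda_weights(mu1, mu2, np.eye(2), 0.5, 0.5)
print(w0, w)    # -> 0.0 [2. 0.]: the boundary is the mid-hyperplane x_1 = 0
```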
Informative Classifiers: LDA
The discriminant function g(x) measures differences in the Mahalanobis distances from x to the k-th class mean:
D(x, μ_k) = (x − μ_k)^T Σ^{-1} (x − μ_k).
Bayes-optimal decision rule: measure the Mahalanobis distance to each class and assign x to the class with the (prior-corrected) smallest distance.
Connection to the LDA algorithm: let Σ_w be an estimate of the shared covariance matrix Σ, and m_j an estimate of μ_j. LDA finds the weight vector
w_LDA = Σ_w^{-1} (m_1 − m_2).
This w_LDA asymptotically coincides with the Bayes-optimal w (if the Gaussian model is correct!).
LDA: Algorithmic view
[Figure 3.5: Projection of the same set of samples onto two different lines in the directions marked w. The figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
Question: How should we select the projection vector w that optimally discriminates between the different classes?
LDA: Algorithmic view
Samples: {(x_1, c_1), ..., (x_n, c_n)}. Partition the sample set into class-specific subsets X_c.
Projected samples: x̃_i = w^T x_i, 1 ≤ i ≤ n.
What is a good measure for the discrimination of the projected points?
Sample averages: m_c = (1/n_c) Σ_{x ∈ X_c} x, with n_c = |X_c|.
Sample averages of the projected points: m̃_c = (1/n_c) Σ_{x ∈ X_c} w^T x = w^T m_c.
Distance of the averages in the 2-class case: m̃_1 − m̃_2 = w^T (m_1 − m_2).
LDA: Algorithmic view
Scatter of class c: Σ_c = Σ_{x ∈ X_c} (x − m_c)(x − m_c)^T.
The scatter matrix is an unscaled estimate of the covariance matrix.
"Within scatter": Σ_W = Σ_1 + Σ_2 (unscaled average).
Scatter of the projected data:
Σ̃_c = Σ_{x ∈ X_c} (w^T x − m̃_c)^2 = Σ_{x ∈ X_c} w^T (x − m_c)(x − m_c)^T w = w^T Σ_c w.
For 2 classes: Σ̃_1 + Σ̃_2 = w^T Σ_W w.
LDA: Algorithmic view
J(w) = (m̃_1 − m̃_2)^2 / (Σ̃_1 + Σ̃_2) = w^T Σ_B w / (w^T Σ_W w), with Σ_B := (m_1 − m_2)(m_1 − m_2)^T.
Idea: the optimal w can be found by maximizing J(w)!
d/dw J(w) = d/dw [(w^T Σ_W w)^{-1} w^T Σ_B w]
          = −2 Σ_W w (w^T Σ_W w)^{-2} w^T Σ_B w + 2 Σ_B w (w^T Σ_W w)^{-1} = 0
⇒ Σ_B w = [w^T Σ_B w / (w^T Σ_W w)] Σ_W w.
LDA: Algorithmic view
Generalized eigenvalue problem: let Σ_W be non-singular. Then
Σ_W^{-1} Σ_B w = λ w, with λ = w^T Σ_B w / (w^T Σ_W w).
Since Σ_B w = (m_1 − m_2)(m_1 − m_2)^T w always points in the direction of (m_1 − m_2), the (unscaled) solution is
ŵ = Σ_W^{-1} (m_1 − m_2).
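Given data, the same solution is obtained from the class means and the within scatter. A minimal sketch (the function name and the toy data are ours):

```python
import numpy as np

def fisher_lda(X1, X2):
    """w_hat = Sigma_W^{-1} (m1 - m2), with Sigma_W = Sigma_1 + Sigma_2
    the within-class scatter of the two sample sets."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    return np.linalg.solve(S1 + S2, m1 - m2)

X1 = np.array([[2.0, 0.0], [3.0, 0.0], [2.0, 1.0], [3.0, 1.0]])
X2 = np.array([[0.0, 2.0], [1.0, 2.0], [0.0, 3.0], [1.0, 3.0]])
w = fisher_lda(X1, X2)
print(w)    # -> [ 1. -1.]: the projected class means are well separated
```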
Discriminative classifiers
Discriminative classifiers focus directly on the discriminant function.
In general, they are more flexible with regard to the class conditional densities they are capable of modeling.
Bayes formula:
g(x) = log [P(c = 1 | x) / P(c = 2 | x)] = log [p(x | c = 1) P(c = 1) / (p(x | c = 2) P(c = 2))].
They can model any conditional probabilities that are exponential "tilts" of each other:
p(x | c = 1) = e^{g(x)} p(x | c = 2) P(c = 2)/P(c = 1).
Discriminative classifiers
Since the discriminative model directly focuses on the posteriors, it is natural to estimate the model parameters θ in p_θ(x | c) by maximizing the conditional log likelihood:
θ̂_DISCR = argmax_θ Σ_{i=1}^n log p_θ(c_i | x_i).
The discriminative model makes no use of the information contained in the marginal density p(x). Thus, if the density model is correct, useful information is ignored and the informative approach may give better results.
However, if the class models are incorrect, it may be better not to use them.
Logistic Regression (LOGREG)
Logistic regression uses a linear discriminant function, i.e. g(x) = w^T x + w_0.
For the special case p(x | c ∈ {1, 0}) = N(x; μ_{1,0}, Σ), LDA and LOGREG coincide:
p(x | c = 1) = N(x; μ_1, Σ) = e^{w^T x + w_0} N(x; μ_0, Σ) π_0/π_1
⇒ g(x) = w_0 + w^T x = log [π_1 N(x; μ_1, Σ) / (π_0 N(x; μ_0, Σ))].
Logistic Regression (LOGREG)
Two-class problem with a Binomial RV c taking values in {0, 1} ⇒ it suffices to represent P(1 | x), since P(0 | x) = 1 − P(1 | x).
Success probability of the Binomial RV: π(x) := P(1 | x).
[Figure: basketball example.]
Logistic Regression (LOGREG)
LOGREG: log [P(c = 1 | x) / P(c = 0 | x)] = log [π(x) / (1 − π(x))] = w^T x + w_0.
Thus, the conditional expectation of c | x satisfies the logistic regression model:
π(x) := E[c | x] = exp{w^T x + w_0} / (1 + exp{w^T x + w_0}) = σ(w^T x + w_0).
The logistic squashing function σ(z) = e^z/(1 + e^z) = 1/(1 + e^{−z}) turns linear predictions into probabilities.
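The squashing function in code, as a trivial sketch:

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                    # -> 0.5: a linear score of 0 maps to probability 1/2
print(sigmoid(np.array([-5.0, 5.0])))  # close to 0 and 1 for large |z|
```

Note the symmetry σ(z) + σ(−z) = 1, which is exactly P(1 | x) + P(0 | x) = 1.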
Logistic Regression (LOGREG)
Assume that w_0 is absorbed in w by using the augmented input x → (1; x).
Estimate w by maximizing the conditional likelihood
ŵ_DISCR = argmax_w Π_{i=1}^n p(c_i | x_i; w) = argmax_w Π_{i=1}^n π(x_i; w)^{c_i} (1 − π(x_i; w))^{1 − c_i},
or by maximizing the corresponding log likelihood l:
l(w) = Σ_{i=1}^n [c_i log π(x_i; w) + (1 − c_i) log(1 − π(x_i; w))].
The score functions are defined as
s(w) = ∇_w l(w) = Σ_{i=1}^n x_i (c_i − π_i).
Logistic Regression (LOGREG)
Since the π_i depend non-linearly on w, the equation system s(w) = 0 usually cannot be solved analytically, and iterative techniques must be applied.
Newton's method updates the parameter estimates w at the r-th step by
w^{(r+1)} = w^{(r)} − {H^{(r)}}^{-1} s^{(r)},
where H^{(r)} is the Hessian of l, evaluated at w^{(r)}:
H^{(r)} = (∂²l / ∂w ∂w^T)|_{w = w^{(r)}} = −Σ_{i=1}^n π_i^{(r)} (1 − π_i^{(r)}) x_i x_i^T.
Logistic Regression (LOGREG)
The Newton updates can be restated as an Iterated Re-weighted Least Squares (IRLS) problem:
- The Hessian equals −X^T W X, with W = diag{π_1(1 − π_1), ..., π_n(1 − π_n)}.
- The gradient of l (the scores) can be written as s = X^T W e, where e is a vector with entries e_j = (c_j − π_j)/W_jj.
- With q^{(r)} := X w^{(r)} + e, the updates read
  (X^T W^{(r)} X) w^{(r+1)} = X^T W^{(r)} q^{(r)}.
These are the normal equations of a LS problem with input matrix (W^{(r)})^{1/2} X and r.h.s. (W^{(r)})^{1/2} q^{(r)}.
The values W_ii are functions of the actual w^{(r)}, so iteration is needed.
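The Newton/IRLS iteration can be sketched in a few lines. This is our own minimal implementation, with no safeguards such as step-size control or a check for separable data (where the MLE diverges):

```python
import numpy as np

def logreg_newton(X, c, n_iter=25):
    """Fit logistic regression by Newton's method (equivalent to IRLS).
    X: (n, d) design matrix with the intercept column absorbed;
    c: (n,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ w)))
        Wdiag = pi * (1.0 - pi)              # diagonal of the weight matrix W
        s = X.T @ (c - pi)                   # score vector
        H = -(X.T * Wdiag) @ X               # Hessian of the log likelihood
        w = w - np.linalg.solve(H, s)        # Newton step
    return w

# Overlapping 1-d toy data (the labels mix at x = -1 and x = 1,
# so the data are not separable and the MLE is finite).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0],
              [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
c = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = logreg_newton(X, c)
print(w[1] > 0)    # -> True: the fitted slope is positive
```

The same fixed point can equivalently be reached by solving the weighted least-squares normal equations above at every step.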
Logistic Regression (LOGREG)
A simple binary classification problem in R^2, solved with LOGREG using polynomial basis functions.
Loss functions
LOGREG maximizes the log likelihood
l(w) = Σ_{i=1}^n [c_i log π(x_i; w) + (1 − c_i) log(1 − π(x_i; w))],
where π = 1/(1 + e^{−z}) and z = w^T x.
This is the same as minimizing
−l(w) = Σ_{i=1}^n [−c_i log π(x_i; w) − (1 − c_i) log(1 − π(x_i; w))]
      = Σ_{i=1}^n [log(1 + e^{−z_i}) + (1 − c_i) z_i]
      =: Σ_{i=1}^n Loss(c_i, z_i).
Loss function: Loss(c, z) = (1 − c) z + log(1 + e^{−z}).
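Numerically, the two forms of the objective agree. A quick sketch:

```python
import numpy as np

def logistic_loss(c, z):
    """Loss(c, z) = (1 - c) z + log(1 + exp(-z)) for a label c in {0, 1}
    and linear score z = w^T x."""
    return (1.0 - c) * z + np.log1p(np.exp(-z))

# Check against the negative log likelihood -[c log(pi) + (1-c) log(1-pi)].
z = 0.7
pi = 1.0 / (1.0 + np.exp(-z))
for c in (0.0, 1.0):
    nll = -(c * np.log(pi) + (1.0 - c) * np.log(1.0 - pi))
    assert abs(logistic_loss(c, z) - nll) < 1e-12
print("both forms agree")
```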
Loss functions
[Figure: the loss Loss(c, z) plotted as a function of the linear score z, for the two labels c = 0 and c = 1.]
Summing up
- Classification can be viewed as a (local) approximation of G(x) = E[c | x] near its zero crossings.
- We have to define a hypothesis class from which a function is chosen by some inference mechanism.
- One such class is the set of linear discriminant functions g(x; w) = w_0 + w^T x; this is easily generalized to arbitrary basis functions.
- Two main strategies for solving this problem: informative (use the marginal density p(x)) and discriminative (ignore p(x)).
- Two simple models: LDA and LOGREG.