Machine Learning Lecture 5


Machine Learning Lecture 5
Linear Discriminant Functions
26.10.2017
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de
leibe@vision.rwth-aachen.de

Course Outline
Fundamentals
  Bayes Decision Theory
  Probability Density Estimation
Classification Approaches
  Linear Discriminants
  Support Vector Machines
  Ensemble Methods & Boosting
  Randomized Trees, Forests & Ferns
Deep Learning
  Foundations
  Convolutional Neural Networks
  Recurrent Neural Networks

Recap: Mixture of Gaussians (MoG)
Generative model
  $p(j) = \pi_j$: weight of mixture component $j$
  $p(x|\theta_j)$: mixture component
  Mixture density: $p(x|\theta) = \sum_{j=1}^{M} p(x|\theta_j)\, p(j)$
Slide credit: Bernt Schiele

Recap: Estimating MoGs – Iterative Strategy
Assuming we knew the values of the hidden variable (assignments $h(j=1|x_n)$ and $h(j=2|x_n)$ assumed known), we can do ML estimation for each Gaussian separately:
  ML for Gaussian #1: $\mu_1 = \frac{\sum_{n=1}^{N} h(j=1|x_n)\, x_n}{\sum_{n=1}^{N} h(j=1|x_n)}$
  ML for Gaussian #2: $\mu_2 = \frac{\sum_{n=1}^{N} h(j=2|x_n)\, x_n}{\sum_{n=1}^{N} h(j=2|x_n)}$
Slide credit: Bernt Schiele

Recap: Estimating MoGs – Iterative Strategy
Assuming we knew the mixture components (assumed known), we can evaluate the posteriors $p(j=1|x)$ and $p(j=2|x)$ and assign each point by the Bayes decision rule: decide $j = 1$ if $p(j=1|x_n) > p(j=2|x_n)$.
Slide credit: Bernt Schiele

Recap: K-Means Clustering
Iterative procedure (see the sketch below):
  1. Initialization: pick K arbitrary centroids (cluster means).
  2. Assign each sample to the closest centroid.
  3. Adjust the centroids to be the means of the samples assigned to them.
  4. Go to step 2 (until no change).
The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum: the final result depends on the initialization.
Slide credit: Bernt Schiele
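As a concrete illustration of the procedure above, here is a minimal NumPy sketch (not from the lecture; the function name kmeans, the random initialization from data points, and the Euclidean distance are illustrative assumptions):

```python
# Minimal K-Means sketch; kmeans() and its argument names are illustrative.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K arbitrary data points as centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. Adjust centroids to be the means of their assigned samples
        new_centroids = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                  else centroids[k] for k in range(K)])
        # 4. Repeat until no change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
centroids, assign = kmeans(X, K=2)
```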

Recap: EM Algorithm
Expectation-Maximization (EM) Algorithm
E-Step: softly assign samples to mixture components
  $\gamma_j(x_n) \leftarrow \frac{\pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}, \quad \forall j = 1,\dots,K, \quad n = 1,\dots,N$
M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments
  $\hat{N}_j \leftarrow \sum_{n=1}^{N} \gamma_j(x_n)$  (soft number of samples labeled $j$)
  $\hat{\pi}_j^{\text{new}} \leftarrow \frac{\hat{N}_j}{N}$
  $\hat{\mu}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, x_n$
  $\hat{\Sigma}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\text{new}})(x_n - \hat{\mu}_j^{\text{new}})^T$
Slide adapted from Bernt Schiele
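To make the E-/M-step formulas concrete, here is a minimal NumPy/SciPy sketch of EM for a GMM (not from the lecture; names such as em_gmm and gamma, the initialization, and the small regularization term added to the covariances are illustrative assumptions):

```python
# Minimal EM-for-GMM sketch following the E-/M-step formulas above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                      # mixture weights pi_j
    mu = X[rng.choice(N, size=K, replace=False)]  # initial means mu_j
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E-step: soft assignments gamma_j(x_n)
        resp = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(K)], axis=1)          # shape (N, K)
        gamma = resp / resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, Sigma from the soft assignments
        Nj = gamma.sum(axis=0)                                 # soft counts N_j
        pi = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 3.0])
pi, mu, Sigma = em_gmm(X, K=2)
```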

Topics of This Lecture
Linear discriminant functions
  Definition
  Extension to multiple classes
Least-squares classification
  Derivation
  Shortcomings
Generalized linear models
  Connection to neural networks
Generalized linear discriminants & gradient descent

Discriminant Functions
Bayesian Decision Theory
  $p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)}$
  Model conditional probability densities $p(x|C_k)$ and priors $p(C_k)$.
  Compute posteriors $p(C_k|x)$ (using Bayes' rule).
  Minimize the probability of misclassification by maximizing $p(C_k|x)$.
New approach
  Directly encode the decision boundary, without explicit modeling of probability densities.
  Minimize the misclassification probability directly.
Slide credit: Bernt Schiele

Recap: Discriminant Functions
Formulate classification in terms of comparisons
  Discriminant functions $y_1(x), \dots, y_K(x)$
  Classify $x$ as class $C_k$ if $y_k(x) > y_j(x)\ \forall j \neq k$
Examples (Bayes Decision Theory)
  $y_k(x) = p(C_k|x)$
  $y_k(x) = p(x|C_k)\, p(C_k)$
  $y_k(x) = \log p(x|C_k) + \log p(C_k)$
Slide credit: Bernt Schiele

Discriminant Functions
Example: 2 classes
  $y_1(x) > y_2(x) \;\Leftrightarrow\; y_1(x) - y_2(x) > 0 \;\Leftrightarrow\; y(x) > 0$
Decision functions (from Bayes Decision Theory)
  $y(x) = p(C_1|x) - p(C_2|x)$
  $y(x) = \ln \frac{p(x|C_1)}{p(x|C_2)} + \ln \frac{p(C_1)}{p(C_2)}$
Slide credit: Bernt Schiele

Learning Discriminant Functions
General classification problem
  Goal: take a new input $x$ and assign it to one of $K$ classes $C_k$.
  Given: training set $X = \{x_1, \dots, x_N\}$ with target values $T = \{t_1, \dots, t_N\}$.
  Learn a discriminant function $y(x)$ to perform the classification.
2-class problem
  Binary target values: $t_n \in \{0, 1\}$
K-class problem
  1-of-K coding scheme, e.g. $t_n = (0, 1, 0, 0, 0)^T$

Linear Discriminant Functions
2-class problem
  $y(x) > 0$: decide for class $C_1$, else for class $C_2$.
In the following, we focus on linear discriminant functions
  $y(x) = w^T x + w_0$
  with weight vector $w$ and bias $w_0$ (= threshold).
If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable.
Slide credit: Bernt Schiele

Linear Discriminant Functions
Decision boundary
  $y(x) = w^T x + w_0$;  $y(x) = 0$ defines a hyperplane separating the regions $y > 0$ and $y < 0$.
  Normal vector: $w$
  Offset from the origin: $\frac{w_0}{\|w\|}$
Slide credit: Bernt Schiele

Linear Discriminant Functions
Notation
  $D$: number of dimensions
  $x = (x_1, x_2, \dots, x_D)^T$,  $w = (w_1, w_2, \dots, w_D)^T$
  $y(x) = w^T x + w_0 = \sum_{i=1}^{D} w_i x_i + w_0 = \sum_{i=0}^{D} w_i x_i$  with $x_0 = 1$ constant
Slide credit: Bernt Schiele
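As a small illustration of this notation, the following sketch (with made-up numbers) shows that absorbing the bias via the constant $x_0 = 1$ gives the same value as computing $w^T x + w_0$ explicitly:

```python
# Sketch of evaluating y(x) = w^T x + w_0 with x_0 = 1 absorbed into the weights.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (D = 2)
w0 = 0.5                    # bias
x = np.array([1.5, 3.0])

y_explicit = w @ x + w0                              # w^T x + w_0

w_tilde = np.concatenate(([w0], w))                  # (w_0, w_1, ..., w_D)
x_tilde = np.concatenate(([1.0], x))                 # x_0 = 1 prepended
y_homogeneous = w_tilde @ x_tilde                    # same value

assert np.isclose(y_explicit, y_homogeneous)
print("decide C1" if y_explicit > 0 else "decide C2")
```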

Extension to Multiple Classes
Two simple strategies
  One-vs-all classifiers
  One-vs-one classifiers
How many classifiers do we need in both cases?
What difficulties do you see for those strategies?
Image source: C.M. Bishop, 2006

Extension to Multiple Classes
Problem
  Both strategies result in regions for which the pure classification result ($y_k > 0$) is ambiguous.
  In the one-vs-all case, it is still possible to classify those inputs based on the continuous classifier outputs $y_k > y_j\ \forall j \neq k$.
Solution
  We can avoid those difficulties by taking $K$ linear functions of the form $y_k(x) = w_k^T x + w_{k0}$ and defining the decision boundaries directly by deciding for $C_k$ iff $y_k > y_j\ \forall j \neq k$.
  This corresponds to a 1-of-K coding scheme $t_n = (0, 1, 0, \dots, 0, 0)^T$.
Image source: C.M. Bishop, 2006

Extension to Multiple Classes
K-class discriminant
  Combination of K linear functions: $y_k(x) = w_k^T x + w_{k0}$
  Resulting decision hyperplanes: $(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$
  It can be shown that the decision regions of such a discriminant are always singly connected and convex.
  This makes linear discriminant models particularly suitable for problems for which the conditional densities $p(x|C_i)$ are unimodal.
Image source: C.M. Bishop, 2006
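A tiny sketch of the resulting K-class decision rule, using made-up weights: evaluate all $y_k(x)$ and decide for the class with the largest value.

```python
# Decide for the class C_k with y_k(x) > y_j(x) for all j != k; W, w0, x are made up.
import numpy as np

W = np.array([[ 1.0,  0.5],
              [-0.5,  1.0],
              [ 0.2, -1.0]])          # one weight vector w_k per row (K = 3, D = 2)
w0 = np.array([0.1, -0.2, 0.3])       # biases w_k0

x = np.array([0.4, 1.2])
y = W @ x + w0                        # all K discriminant values y_k(x)
k_star = int(np.argmax(y))
print("decide class C%d" % (k_star + 1), "scores:", y)
```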

Topics of This Lecture
Linear discriminant functions
  Definition
  Extension to multiple classes
Least-squares classification
  Derivation
  Shortcomings
Generalized linear models
  Connection to neural networks
Generalized linear discriminants & gradient descent

General Classification Problem
Classification problem
  Let's consider K classes described by linear models
    $y_k(x) = w_k^T x + w_{k0}, \quad k = 1, \dots, K$
  We can group those together using vector notation
    $y(x) = \widetilde{W}^T \tilde{x}$
  where
    $\widetilde{W} = [\tilde{w}_1, \dots, \tilde{w}_K] = \begin{bmatrix} w_{10} & \dots & w_{K0} \\ w_{11} & \dots & w_{K1} \\ \vdots & \ddots & \vdots \\ w_{1D} & \dots & w_{KD} \end{bmatrix}$
  The output will again be in 1-of-K notation, so we can directly compare it to the target value $t = [t_1, \dots, t_K]^T$.

General Classification Problem
Classification problem
  For the entire dataset, we can write
    $Y(\widetilde{X}) = \widetilde{X} \widetilde{W}$
  and compare this to the target matrix $T$, where
    $\widetilde{W} = [\tilde{w}_1, \dots, \tilde{w}_K], \quad \widetilde{X} = \begin{bmatrix} \tilde{x}_1^T \\ \vdots \\ \tilde{x}_N^T \end{bmatrix}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}$
  Result of the comparison: $\widetilde{X} \widetilde{W} - T$
  Goal: choose $\widetilde{W}$ such that this is minimal!

Least-Squares Classification
Simplest approach
  Directly try to minimize the sum-of-squares error. We could write this as
    $E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k(x_n; w) - t_{kn} \right)^2$
  But let's stick with the matrix notation for now. (The result will be simpler to express, and we'll learn some nice matrix algebra rules along the way...)

Least-Squares Classification
Multi-class case
  Let's formulate the sum-of-squares error in matrix notation
    $E_D(\widetilde{W}) = \frac{1}{2} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}$
  using $\sum_{i,j} a_{ij}^2 = \mathrm{Tr}\{A^T A\}$.
  Taking the derivative yields
    $\frac{\partial}{\partial \widetilde{W}} E_D(\widetilde{W}) = \frac{1}{2} \frac{\partial}{\partial \widetilde{W}} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}$
    $= \frac{1}{2} \frac{\partial}{\partial (\widetilde{X}\widetilde{W} - T)} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\} \cdot \frac{\partial (\widetilde{X}\widetilde{W} - T)}{\partial \widetilde{W}}$   (chain rule: $\frac{\partial Z}{\partial X} = \frac{\partial Z}{\partial Y} \frac{\partial Y}{\partial X}$)
    $= \widetilde{X}^T (\widetilde{X}\widetilde{W} - T)$   (using $\frac{\partial}{\partial A} \mathrm{Tr}\{A^T A\} = 2A$)

Least-Squares Classification
Minimizing the sum-of-squares error
  $\frac{\partial}{\partial \widetilde{W}} E_D(\widetilde{W}) = \widetilde{X}^T (\widetilde{X}\widetilde{W} - T) \overset{!}{=} 0$
  $\widetilde{X}^T \widetilde{X} \widetilde{W} = \widetilde{X}^T T$
  $\widetilde{W} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T T = \widetilde{X}^{\dagger} T$   (pseudo-inverse)
  We then obtain the discriminant function as
    $y(x) = \widetilde{W}^T \tilde{x} = T^T \left( \widetilde{X}^{\dagger} \right)^T \tilde{x}$
  Exact, closed-form solution for the discriminant function parameters.
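A minimal sketch of this closed-form solution on synthetic data (the data generation and variable names are illustrative; np.linalg.pinv computes the pseudo-inverse $\widetilde{X}^{\dagger}$):

```python
# Least-squares classification via the pseudo-inverse solution W~ = X~^+ T.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 150, 2, 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(N // K, D))
               for c in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                        # 1-of-K target matrix (N x K)

X_tilde = np.hstack([np.ones((N, 1)), X])    # prepend x_0 = 1
W_tilde = np.linalg.pinv(X_tilde) @ T        # closed-form solution

y = X_tilde @ W_tilde                        # discriminant outputs for all points
pred = y.argmax(axis=1)                      # decide for the largest y_k
print("training accuracy:", (pred == labels).mean())
```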

Problems with Least Squares
Least-squares is very sensitive to outliers!
The error function penalizes predictions that are "too correct".
Image source: C.M. Bishop, 2006

Problems with Least-Squares
Another example: 3 classes (red, green, blue)
  Linearly separable problem.
  Least-squares solution: most green points are misclassified!
Deeper reason for the failure
  Least-squares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution.
  However, our binary target vectors have a distribution that is clearly non-Gaussian!
  Least-squares is the wrong probabilistic tool in this case!
Image source: C.M. Bishop, 2006

Topics of This Lecture
Linear discriminant functions
  Definition
  Extension to multiple classes
Least-squares classification
  Derivation
  Shortcomings
Generalized linear models
  Connection to neural networks
Generalized linear discriminants & gradient descent

Generalized Linear Models
Linear model
  $y(x) = w^T x + w_0$
Generalized linear model
  $y(x) = g(w^T x + w_0)$
  $g(\cdot)$ is called an activation function and may be nonlinear.
  The decision surfaces correspond to $y(x) = \text{const.} \;\Leftrightarrow\; w^T x + w_0 = \text{const.}$
  If $g$ is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of $x$.

Generalized Linear Models
Consider 2 classes:
  $p(C_1|x) = \frac{p(x|C_1)\, p(C_1)}{p(x|C_1)\, p(C_1) + p(x|C_2)\, p(C_2)} = \frac{1}{1 + \frac{p(x|C_2)\, p(C_2)}{p(x|C_1)\, p(C_1)}} = \frac{1}{1 + \exp(-a)} \equiv g(a)$
  with $a = \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)}$
Slide credit: Bernt Schiele

Logistic Sigmoid Activation Function
  $g(a) = \frac{1}{1 + \exp(-a)}$
Example: two normal distributions with identical covariance (figure: class-conditional densities $p(x|a)$ and $p(x|b)$, and the resulting posteriors $p(a|x)$ and $p(b|x)$).
Slide credit: Bernt Schiele

Normalized Exponential
General case of K > 2 classes:
  $p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{\sum_j p(x|C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
  with $a_k = \ln p(x|C_k)\, p(C_k)$
This is known as the normalized exponential or softmax function.
It can be regarded as a multiclass generalization of the logistic sigmoid.
Slide credit: Bernt Schiele
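A short sketch of the logistic sigmoid and the softmax function; the max-subtraction for numerical stability is a common implementation detail, not something discussed on the slides:

```python
# Logistic sigmoid and softmax (normalized exponential).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = np.asarray(a, dtype=float)
    a = a - a.max()                      # stability: softmax is shift-invariant
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 0.5, -1.0])           # a_k = ln p(x|C_k) p(C_k), up to a constant
print(softmax(a))                        # posteriors p(C_k|x), sum to 1

# For K = 2, softmax reduces to the sigmoid of the difference a_1 - a_2:
a2 = np.array([1.3, -0.4])
assert np.isclose(softmax(a2)[0], sigmoid(a2[0] - a2[1]))
```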

Relationship to Neural Networks
2-class case with $x_0 = 1$ constant
  (Figure: the linear discriminant drawn as a neural network, a "single-layer perceptron".)
Slide credit: Bernt Schiele

Relationship to Neural Networks
Multi-class case with $x_0 = 1$ constant
  (Figure: a multi-class perceptron with K outputs.)
Slide credit: Bernt Schiele

Logistic Discrimination
If we use the logistic sigmoid activation function
  $g(a) = \frac{1}{1 + \exp(-a)}$
  $y(x) = g(w^T x + w_0)$
then we can interpret the $y(x)$ as posterior probabilities!
Slide adapted from Bernt Schiele

Other Motivation for Nonlinearity
Recall least-squares classification
  One of the problems was that data points that are "too correct" have a strong influence on the decision surface under a squared-error criterion:
    $E(w) = \sum_{n=1}^{N} \left( y(x_n; w) - t_n \right)^2$
  Reason: the output of $y(x_n; w)$ can grow arbitrarily large for some $x_n$:
    $y(x; w) = w^T x + w_0$
  By choosing a suitable nonlinearity (e.g. a sigmoid), we can limit those influences:
    $y(x; w) = g(w^T x + w_0)$

Discussion: Generalized Linear Models
Advantages
  The nonlinearity gives us more flexibility.
  Can be used to limit the effect of outliers.
  Choice of a sigmoid leads to a nice probabilistic interpretation.
Disadvantage
  Least-squares minimization in general no longer leads to a closed-form analytical solution.
  Need to apply iterative methods, e.g. gradient descent.

Linear Separability
Up to now: restrictive assumption
  Only consider linear decision boundaries.
Classical counterexample: XOR (figure: the four XOR points in the $(x_1, x_2)$ plane, not separable by a single line).
Slide credit: Bernt Schiele

Generalized Linear Discriminants
Generalization
  Transform the vector x with M nonlinear basis functions $\phi_j(x)$:
    $y_k(x) = \sum_{j=1}^{M} w_{kj}\, \phi_j(x) + w_{k0}$
Purpose of the $\phi_j(x)$
  Allow nonlinear decision boundaries (see the XOR sketch below).
  By choosing the right $\phi_j$, every continuous function can (in principle) be approximated with arbitrary accuracy.
Notation
  $y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x)$  with $\phi_0(x) = 1$
Slide credit: Bernt Schiele
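To illustrate why basis functions help with problems like XOR, here is a small sketch (the particular choice $\phi_3(x) = x_1 x_2$ is just one illustrative basis function, not prescribed by the lecture):

```python
# A nonlinear basis function makes XOR linearly separable in feature space.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
t = np.array([0, 1, 1, 0], dtype=float)                        # XOR targets

def phi(x):
    # phi_0 = 1 (bias), phi_1 = x1, phi_2 = x2, phi_3 = x1 * x2
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Phi = np.array([phi(x) for x in X])            # design matrix (N x (M+1))
w = np.linalg.pinv(Phi) @ t                    # least-squares weights

y = Phi @ w                                    # discriminant values
print(np.round(y, 3))                          # [0, 1, 1, 0]: a threshold at 0.5 separates XOR
```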

Generalized Linear Discriminants
Model
  $y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x) = y_k(x; w)$
  K functions (outputs) $y_k(x; w)$
Learning in Neural Networks
  Single-layer networks: the $\phi_j$ are fixed, only the weights w are learned.
  Multi-layer networks: both the w and the $\phi_j$ are learned.
  We will take a closer look at neural networks from Lecture 11 on. For now, let's first consider generalized linear discriminants in general.
Slide credit: Bernt Schiele

Gradient Descent
Learning the weights w:
  N training data points: $X = \{x_1, \dots, x_N\}$
  K outputs of decision functions: $y_k(x_n; w)$
  Target vector for each data point: $T = \{t_1, \dots, t_N\}$
  Error function (least-squares error) of the linear model:
    $E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k(x_n; w) - t_{kn} \right)^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
Slide credit: Bernt Schiele

Gradient Descent
Problem
  The error function can in general no longer be minimized in closed form.
Idea (Gradient Descent)
  Iterative minimization: start with an initial guess $w_{kj}^{(0)}$ for the parameter values, then move towards a (local) minimum by following the gradient:
    $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
  $\eta$: learning rate
  This simple scheme corresponds to a first-order Taylor expansion (there are more complex procedures available).

Gradient Descent – Basic Strategies
Batch learning
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
  $\eta$: learning rate
  Compute the gradient $\frac{\partial E(w)}{\partial w_{kj}}$ based on all training data.
Slide credit: Bernt Schiele

Gradient Descent – Basic Strategies
Sequential updating
  $E(w) = \sum_{n=1}^{N} E_n(w)$
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E_n(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
  $\eta$: learning rate
  Compute the gradient $\frac{\partial E_n(w)}{\partial w_{kj}}$ based on a single data point at a time.
Slide credit: Bernt Schiele

Gradient Descent
Error function
  $E(w) = \sum_{n=1}^{N} E_n(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
  $E_n(w) = \frac{1}{2} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
  $\frac{\partial E_n(w)}{\partial w_{kj}} = \left( \sum_{\tilde{j}=1}^{M} w_{k\tilde{j}}\, \phi_{\tilde{j}}(x_n) - t_{kn} \right) \phi_j(x_n) = \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n)$
Slide credit: Bernt Schiele

Gradient Descent
Delta rule (= LMS rule)
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n) = w_{kj}^{(\tau)} - \eta\, \delta_{kn}\, \phi_j(x_n)$
  where $\delta_{kn} = y_k(x_n; w) - t_{kn}$
  Simply feed back the input data point, weighted by the classification error (see the sketch below).
Slide credit: Bernt Schiele
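A minimal sketch of sequential gradient descent with the delta rule on synthetic 2-class data (the identity basis functions with $\phi_0 = 1$, the learning rate, and all names are illustrative assumptions):

```python
# Sequential (per-sample) delta-rule training of a generalized linear discriminant.
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 2
X = np.vstack([rng.normal([0, 0], 0.7, (N // K, 2)),
               rng.normal([2, 2], 0.7, (N // K, 2))])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                                   # 1-of-K targets

def phi(x):                                             # phi_0(x) = 1, identity otherwise
    return np.concatenate(([1.0], x))

M = 3                                                   # number of basis outputs incl. phi_0
W = np.zeros((K, M))                                    # weights w_kj
eta = 0.05                                              # learning rate

for epoch in range(20):
    for n in rng.permutation(N):                        # one data point at a time
        f = phi(X[n])
        y = W @ f                                       # y_k(x_n; w)
        delta = y - T[n]                                # delta_kn = y_k - t_kn
        W -= eta * np.outer(delta, f)                   # delta rule update

pred = np.array([W @ phi(x) for x in X]).argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```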

Gradient Descent
Case with a differentiable, nonlinear activation function
  $y_k(x) = g(a_k) = g\!\left( \sum_{j=0}^{M} w_{kj}\, \phi_j(x_n) \right)$
Gradient descent
  $\frac{\partial E_n(w)}{\partial w_{kj}} = g'(a_k) \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n)$
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\, \delta_{kn}\, \phi_j(x_n)$  with  $\delta_{kn} = g'(a_k) \left( y_k(x_n; w) - t_{kn} \right)$
Slide credit: Bernt Schiele
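For comparison, a sketch of the same sequential update with a logistic sigmoid activation, where the error signal is additionally weighted by $g'(a) = g(a)(1 - g(a))$; data and hyperparameters are again made up:

```python
# Sequential delta-rule update with a sigmoid activation (single output, 2 classes).
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.7, (100, 2)),
               rng.normal([2, 2], 0.7, (100, 2))])
t = np.repeat([0.0, 1.0], 100)                          # binary targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)                                         # (w_0, w_1, w_2), phi_0 = 1
eta = 0.1

for epoch in range(50):
    for n in rng.permutation(len(X)):
        f = np.concatenate(([1.0], X[n]))               # phi(x_n)
        a = w @ f                                       # a = sum_j w_j phi_j(x_n)
        y = sigmoid(a)                                  # y(x_n; w) = g(a)
        delta = y * (1.0 - y) * (y - t[n])              # g'(a) * (y - t_n)
        w -= eta * delta * f                            # weighted delta rule update

pred = (sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w) > 0.5).astype(float)
print("training accuracy:", (pred == t).mean())
```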

Summary: Generalized Linear Discriminants
Properties
  General class of decision functions.
  The nonlinearity $g(\cdot)$ and the basis functions $\phi_j$ allow us to address linearly non-separable problems.
  We have shown a simple sequential learning approach for parameter estimation using gradient descent.
  Better second-order gradient descent approaches are available (e.g. Newton-Raphson).
Limitations / Caveats
  The flexibility of the model is limited by the curse of dimensionality: $g(\cdot)$ and the $\phi_j$ often introduce additional parameters, so models are either limited to lower-dimensional input spaces or need to share parameters.
  The linearly separable case often leads to overfitting: several possible parameter choices minimize the training error.

References and Further Reading
More information on linear discriminant functions can be found in Chapter 4 of Bishop's book (in particular Section 4.1).
  Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.