Machine Learning Lecture 5


Machine Learning Lecture 5
Linear Discriminant Functions
26.10.2017
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de
leibe@vision.rwth-aachen.de

Course Outline
Fundamentals
  Bayes Decision Theory
  Probability Density Estimation
Classification Approaches
  Linear Discriminants
  Support Vector Machines
  Ensemble Methods & Boosting
  Randomized Trees, Forests & Ferns
Deep Learning
  Foundations
  Convolutional Neural Networks
  Recurrent Neural Networks

Recap: Mixture of Gaussians (MoG)
Generative model
  $p(j) = \pi_j$: weight of mixture component $j$
  $p(x|\theta_j)$: mixture component
  Mixture density: $p(x|\theta) = \sum_{j=1}^{M} p(x|\theta_j)\, p(j)$
Slide credit: Bernt Schiele

Recap: Estimating MoGs – Iterative Strategy
Assuming we knew the values of the hidden variable (assignments $h(j=1|x_n)$ and $h(j=2|x_n)$ assumed known), we can do ML estimation for each Gaussian separately:
  ML for Gaussian #1: $\mu_1 = \frac{\sum_{n=1}^{N} h(j=1|x_n)\, x_n}{\sum_{n=1}^{N} h(j=1|x_n)}$
  ML for Gaussian #2: $\mu_2 = \frac{\sum_{n=1}^{N} h(j=2|x_n)\, x_n}{\sum_{n=1}^{N} h(j=2|x_n)}$
Slide credit: Bernt Schiele

Recap: Estimating MoGs – Iterative Strategy
Assuming we knew the mixture components (assumed known), we can evaluate the posteriors $p(j=1|x)$ and $p(j=2|x)$ and assign each point by the Bayes decision rule: decide $j = 1$ if $p(j=1|x_n) > p(j=2|x_n)$.
Slide credit: Bernt Schiele

Recap: K-Means Clustering
Iterative procedure (see the sketch below):
  1. Initialization: pick K arbitrary centroids (cluster means).
  2. Assign each sample to the closest centroid.
  3. Adjust the centroids to be the means of the samples assigned to them.
  4. Go to step 2 (until no change).
The algorithm is guaranteed to converge after a finite number of iterations, but only to a local optimum: the final result depends on the initialization.
Slide credit: Bernt Schiele
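As a concrete illustration of the procedure above, here is a minimal NumPy sketch (not from the lecture; the function name kmeans, the random initialization from data points, and the Euclidean distance are illustrative assumptions):

```python
# Minimal K-Means sketch; kmeans() and its argument names are illustrative.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K arbitrary data points as centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each sample to the closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # 3. Adjust centroids to be the means of their assigned samples
        new_centroids = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                                  else centroids[k] for k in range(K)])
        # 4. Repeat until no change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 4.0])
centroids, assign = kmeans(X, K=2)
```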

Recap: EM Algorithm
Expectation-Maximization (EM) Algorithm
E-Step: softly assign samples to mixture components
  $\gamma_j(x_n) \leftarrow \frac{\pi_j \mathcal{N}(x_n|\mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k)}, \quad \forall j = 1,\dots,K, \quad n = 1,\dots,N$
M-Step: re-estimate the parameters (separately for each mixture component) based on the soft assignments
  $\hat{N}_j \leftarrow \sum_{n=1}^{N} \gamma_j(x_n)$  (soft number of samples labeled $j$)
  $\hat{\pi}_j^{\text{new}} \leftarrow \frac{\hat{N}_j}{N}$
  $\hat{\mu}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, x_n$
  $\hat{\Sigma}_j^{\text{new}} \leftarrow \frac{1}{\hat{N}_j} \sum_{n=1}^{N} \gamma_j(x_n)\, (x_n - \hat{\mu}_j^{\text{new}})(x_n - \hat{\mu}_j^{\text{new}})^T$
Slide adapted from Bernt Schiele
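To make the E-/M-step formulas concrete, here is a minimal NumPy/SciPy sketch of EM for a GMM (not from the lecture; names such as em_gmm and gamma, the initialization, and the small regularization term added to the covariances are illustrative assumptions):

```python
# Minimal EM-for-GMM sketch following the E-/M-step formulas above.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                      # mixture weights pi_j
    mu = X[rng.choice(N, size=K, replace=False)]  # initial means mu_j
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)
    for _ in range(n_iters):
        # E-step: soft assignments gamma_j(x_n)
        resp = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(K)], axis=1)          # shape (N, K)
        gamma = resp / resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi, mu, Sigma from the soft assignments
        Nj = gamma.sum(axis=0)                                 # soft counts N_j
        pi = Nj / N
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return pi, mu, Sigma

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 3.0])
pi, mu, Sigma = em_gmm(X, K=2)
```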

Topics of This Lecture
Linear discriminant functions
  Definition
  Extension to multiple classes
Least-squares classification
  Derivation
  Shortcomings
Generalized linear models
  Connection to neural networks
Generalized linear discriminants & gradient descent

Discriminant Functions
Bayesian Decision Theory
  $p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)}$
  Model conditional probability densities $p(x|C_k)$ and priors $p(C_k)$.
  Compute posteriors $p(C_k|x)$ (using Bayes' rule).
  Minimize the probability of misclassification by maximizing $p(C_k|x)$.
New approach
  Directly encode the decision boundary, without explicit modeling of probability densities.
  Minimize the misclassification probability directly.
Slide credit: Bernt Schiele

Recap: Discriminant Functions
Formulate classification in terms of comparisons
  Discriminant functions $y_1(x), \dots, y_K(x)$
  Classify $x$ as class $C_k$ if $y_k(x) > y_j(x)\ \forall j \neq k$
Examples (Bayes Decision Theory)
  $y_k(x) = p(C_k|x)$
  $y_k(x) = p(x|C_k)\, p(C_k)$
  $y_k(x) = \log p(x|C_k) + \log p(C_k)$
Slide credit: Bernt Schiele

Discriminant Functions
Example: 2 classes
  $y_1(x) > y_2(x) \;\Leftrightarrow\; y_1(x) - y_2(x) > 0 \;\Leftrightarrow\; y(x) > 0$
Decision functions (from Bayes Decision Theory)
  $y(x) = p(C_1|x) - p(C_2|x)$
  $y(x) = \ln \frac{p(x|C_1)}{p(x|C_2)} + \ln \frac{p(C_1)}{p(C_2)}$
Slide credit: Bernt Schiele

Learning Discriminant Functions
General classification problem
  Goal: take a new input $x$ and assign it to one of $K$ classes $C_k$.
  Given: training set $X = \{x_1, \dots, x_N\}$ with target values $T = \{t_1, \dots, t_N\}$.
  Learn a discriminant function $y(x)$ to perform the classification.
2-class problem
  Binary target values: $t_n \in \{0, 1\}$
K-class problem
  1-of-K coding scheme, e.g. $t_n = (0, 1, 0, 0, 0)^T$

Linear Discriminant Functions
2-class problem
  $y(x) > 0$: decide for class $C_1$, else for class $C_2$.
In the following, we focus on linear discriminant functions
  $y(x) = w^T x + w_0$
  with weight vector $w$ and bias $w_0$ (= threshold).
If a data set can be perfectly classified by a linear discriminant, then we call it linearly separable.
Slide credit: Bernt Schiele

Linear Discriminant Functions
Decision boundary
  $y(x) = w^T x + w_0$;  $y(x) = 0$ defines a hyperplane separating the regions $y > 0$ and $y < 0$.
  Normal vector: $w$
  Offset from the origin: $\frac{w_0}{\|w\|}$
Slide credit: Bernt Schiele

Linear Discriminant Functions
Notation
  $D$: number of dimensions
  $x = (x_1, x_2, \dots, x_D)^T$,  $w = (w_1, w_2, \dots, w_D)^T$
  $y(x) = w^T x + w_0 = \sum_{i=1}^{D} w_i x_i + w_0 = \sum_{i=0}^{D} w_i x_i$  with $x_0 = 1$ constant
Slide credit: Bernt Schiele
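As a small illustration of this notation, the following sketch (with made-up numbers) shows that absorbing the bias via the constant $x_0 = 1$ gives the same value as computing $w^T x + w_0$ explicitly:

```python
# Sketch of evaluating y(x) = w^T x + w_0 with x_0 = 1 absorbed into the weights.
import numpy as np

w = np.array([2.0, -1.0])   # weight vector (D = 2)
w0 = 0.5                    # bias
x = np.array([1.5, 3.0])

y_explicit = w @ x + w0                              # w^T x + w_0

w_tilde = np.concatenate(([w0], w))                  # (w_0, w_1, ..., w_D)
x_tilde = np.concatenate(([1.0], x))                 # x_0 = 1 prepended
y_homogeneous = w_tilde @ x_tilde                    # same value

assert np.isclose(y_explicit, y_homogeneous)
print("decide C1" if y_explicit > 0 else "decide C2")
```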

Extension to Multiple Classes
Two simple strategies
  One-vs-all classifiers
  One-vs-one classifiers
How many classifiers do we need in both cases?
What difficulties do you see for those strategies?
Image source: C.M. Bishop, 2006

Extension to Multiple Classes
Problem
  Both strategies result in regions for which the pure classification result ($y_k > 0$) is ambiguous.
  In the one-vs-all case, it is still possible to classify those inputs based on the continuous classifier outputs $y_k > y_j\ \forall j \neq k$.
Solution
  We can avoid those difficulties by taking $K$ linear functions of the form $y_k(x) = w_k^T x + w_{k0}$ and defining the decision boundaries directly by deciding for $C_k$ iff $y_k > y_j\ \forall j \neq k$.
  This corresponds to a 1-of-K coding scheme $t_n = (0, 1, 0, \dots, 0, 0)^T$.
Image source: C.M. Bishop, 2006

Extension to Multiple Classes
K-class discriminant
  Combination of K linear functions: $y_k(x) = w_k^T x + w_{k0}$
  Resulting decision hyperplanes: $(w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0$
  It can be shown that the decision regions of such a discriminant are always singly connected and convex.
  This makes linear discriminant models particularly suitable for problems for which the conditional densities $p(x|C_i)$ are unimodal.
Image source: C.M. Bishop, 2006
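A tiny sketch of the resulting K-class decision rule, using made-up weights: evaluate all $y_k(x)$ and decide for the class with the largest value.

```python
# Decide for the class C_k with y_k(x) > y_j(x) for all j != k; W, w0, x are made up.
import numpy as np

W = np.array([[ 1.0,  0.5],
              [-0.5,  1.0],
              [ 0.2, -1.0]])          # one weight vector w_k per row (K = 3, D = 2)
w0 = np.array([0.1, -0.2, 0.3])       # biases w_k0

x = np.array([0.4, 1.2])
y = W @ x + w0                        # all K discriminant values y_k(x)
k_star = int(np.argmax(y))
print("decide class C%d" % (k_star + 1), "scores:", y)
```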

Topics of This Lecture
Linear discriminant functions
  Definition
  Extension to multiple classes
Least-squares classification
  Derivation
  Shortcomings
Generalized linear models
  Connection to neural networks
Generalized linear discriminants & gradient descent

General Classification Problem
Classification problem
  Let's consider K classes described by linear models
    $y_k(x) = w_k^T x + w_{k0}, \quad k = 1, \dots, K$
  We can group those together using vector notation
    $y(x) = \widetilde{W}^T \tilde{x}$
  where
    $\widetilde{W} = [\tilde{w}_1, \dots, \tilde{w}_K] = \begin{bmatrix} w_{10} & \dots & w_{K0} \\ w_{11} & \dots & w_{K1} \\ \vdots & \ddots & \vdots \\ w_{1D} & \dots & w_{KD} \end{bmatrix}$
  The output will again be in 1-of-K notation, so we can directly compare it to the target value $t = [t_1, \dots, t_K]^T$.

General Classification Problem
Classification problem
  For the entire dataset, we can write
    $Y(\widetilde{X}) = \widetilde{X} \widetilde{W}$
  and compare this to the target matrix $T$, where
    $\widetilde{W} = [\tilde{w}_1, \dots, \tilde{w}_K], \quad \widetilde{X} = \begin{bmatrix} \tilde{x}_1^T \\ \vdots \\ \tilde{x}_N^T \end{bmatrix}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}$
  Result of the comparison: $\widetilde{X} \widetilde{W} - T$
  Goal: choose $\widetilde{W}$ such that this is minimal!

Least-Squares Classification
Simplest approach
  Directly try to minimize the sum-of-squares error. We could write this as
    $E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k(x_n; w) - t_{kn} \right)^2$
  But let's stick with the matrix notation for now. (The result will be simpler to express, and we'll learn some nice matrix algebra rules along the way...)

Least-Squares Classification
Multi-class case
  Let's formulate the sum-of-squares error in matrix notation
    $E_D(\widetilde{W}) = \frac{1}{2} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}$
  using $\sum_{i,j} a_{ij}^2 = \mathrm{Tr}\{A^T A\}$.
  Taking the derivative yields
    $\frac{\partial}{\partial \widetilde{W}} E_D(\widetilde{W}) = \frac{1}{2} \frac{\partial}{\partial \widetilde{W}} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\}$
    $= \frac{1}{2} \frac{\partial}{\partial (\widetilde{X}\widetilde{W} - T)} \mathrm{Tr}\left\{ (\widetilde{X}\widetilde{W} - T)^T (\widetilde{X}\widetilde{W} - T) \right\} \cdot \frac{\partial (\widetilde{X}\widetilde{W} - T)}{\partial \widetilde{W}}$   (chain rule: $\frac{\partial Z}{\partial X} = \frac{\partial Z}{\partial Y} \frac{\partial Y}{\partial X}$)
    $= \widetilde{X}^T (\widetilde{X}\widetilde{W} - T)$   (using $\frac{\partial}{\partial A} \mathrm{Tr}\{A^T A\} = 2A$)

Least-Squares Classification
Minimizing the sum-of-squares error
  $\frac{\partial}{\partial \widetilde{W}} E_D(\widetilde{W}) = \widetilde{X}^T (\widetilde{X}\widetilde{W} - T) \overset{!}{=} 0$
  $\widetilde{X}^T \widetilde{X} \widetilde{W} = \widetilde{X}^T T$
  $\widetilde{W} = (\widetilde{X}^T \widetilde{X})^{-1} \widetilde{X}^T T = \widetilde{X}^{\dagger} T$   (pseudo-inverse)
  We then obtain the discriminant function as
    $y(x) = \widetilde{W}^T \tilde{x} = T^T \left( \widetilde{X}^{\dagger} \right)^T \tilde{x}$
  Exact, closed-form solution for the discriminant function parameters.
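A minimal sketch of this closed-form solution on synthetic data (the data generation and variable names are illustrative; np.linalg.pinv computes the pseudo-inverse $\widetilde{X}^{\dagger}$):

```python
# Least-squares classification via the pseudo-inverse solution W~ = X~^+ T.
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 150, 2, 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(N // K, D))
               for c in ([0, 0], [3, 0], [0, 3])])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                        # 1-of-K target matrix (N x K)

X_tilde = np.hstack([np.ones((N, 1)), X])    # prepend x_0 = 1
W_tilde = np.linalg.pinv(X_tilde) @ T        # closed-form solution

y = X_tilde @ W_tilde                        # discriminant outputs for all points
pred = y.argmax(axis=1)                      # decide for the largest y_k
print("training accuracy:", (pred == labels).mean())
```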

Problems with Least Squares
Least-squares is very sensitive to outliers!
The error function penalizes predictions that are "too correct".
Image source: C.M. Bishop, 2006

Problems with Least-Squares
Another example: 3 classes (red, green, blue)
  Linearly separable problem.
  Least-squares solution: most green points are misclassified!
Deeper reason for the failure
  Least-squares corresponds to Maximum Likelihood under the assumption of a Gaussian conditional distribution.
  However, our binary target vectors have a distribution that is clearly non-Gaussian!
  Least-squares is the wrong probabilistic tool in this case!
Image source: C.M. Bishop, 2006

Topics of This Lecture
Linear discriminant functions
  Definition
  Extension to multiple classes
Least-squares classification
  Derivation
  Shortcomings
Generalized linear models
  Connection to neural networks
Generalized linear discriminants & gradient descent

Generalized Linear Models
Linear model
  $y(x) = w^T x + w_0$
Generalized linear model
  $y(x) = g(w^T x + w_0)$
  $g(\cdot)$ is called an activation function and may be nonlinear.
  The decision surfaces correspond to $y(x) = \text{const.} \;\Leftrightarrow\; w^T x + w_0 = \text{const.}$
  If $g$ is monotonic (which is typically the case), the resulting decision boundaries are still linear functions of $x$.

Generalized Linear Models
Consider 2 classes:
  $p(C_1|x) = \frac{p(x|C_1)\, p(C_1)}{p(x|C_1)\, p(C_1) + p(x|C_2)\, p(C_2)} = \frac{1}{1 + \frac{p(x|C_2)\, p(C_2)}{p(x|C_1)\, p(C_1)}} = \frac{1}{1 + \exp(-a)} \equiv g(a)$
  with $a = \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)}$
Slide credit: Bernt Schiele

Logistic Sigmoid Activation Function
  $g(a) = \frac{1}{1 + \exp(-a)}$
Example: two normal distributions with identical covariance (figure: class-conditional densities $p(x|a)$ and $p(x|b)$, and the resulting posteriors $p(a|x)$ and $p(b|x)$).
Slide credit: Bernt Schiele

Normalized Exponential
General case of K > 2 classes:
  $p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{\sum_j p(x|C_j)\, p(C_j)} = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$
  with $a_k = \ln p(x|C_k)\, p(C_k)$
This is known as the normalized exponential or softmax function.
It can be regarded as a multiclass generalization of the logistic sigmoid.
Slide credit: Bernt Schiele
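A short sketch of the logistic sigmoid and the softmax function; the max-subtraction for numerical stability is a common implementation detail, not something discussed on the slides:

```python
# Logistic sigmoid and softmax (normalized exponential).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    a = np.asarray(a, dtype=float)
    a = a - a.max()                      # stability: softmax is shift-invariant
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 0.5, -1.0])           # a_k = ln p(x|C_k) p(C_k), up to a constant
print(softmax(a))                        # posteriors p(C_k|x), sum to 1

# For K = 2, softmax reduces to the sigmoid of the difference a_1 - a_2:
a2 = np.array([1.3, -0.4])
assert np.isclose(softmax(a2)[0], sigmoid(a2[0] - a2[1]))
```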

Relationship to Neural Networks
2-class case with $x_0 = 1$ constant
  (Figure: the linear discriminant drawn as a neural network, a "single-layer perceptron".)
Slide credit: Bernt Schiele

Relationship to Neural Networks
Multi-class case with $x_0 = 1$ constant
  (Figure: a multi-class perceptron with K outputs.)
Slide credit: Bernt Schiele

Logistic Discrimination
If we use the logistic sigmoid activation function
  $g(a) = \frac{1}{1 + \exp(-a)}$
  $y(x) = g(w^T x + w_0)$
then we can interpret the $y(x)$ as posterior probabilities!
Slide adapted from Bernt Schiele

Other Motivation for Nonlinearity
Recall least-squares classification
  One of the problems was that data points that are "too correct" have a strong influence on the decision surface under a squared-error criterion:
    $E(w) = \sum_{n=1}^{N} \left( y(x_n; w) - t_n \right)^2$
  Reason: the output of $y(x_n; w)$ can grow arbitrarily large for some $x_n$:
    $y(x; w) = w^T x + w_0$
  By choosing a suitable nonlinearity (e.g. a sigmoid), we can limit those influences:
    $y(x; w) = g(w^T x + w_0)$

Discussion: Generalized Linear Models
Advantages
  The nonlinearity gives us more flexibility.
  Can be used to limit the effect of outliers.
  Choice of a sigmoid leads to a nice probabilistic interpretation.
Disadvantage
  Least-squares minimization in general no longer leads to a closed-form analytical solution.
  Need to apply iterative methods, e.g. gradient descent.

Linear Separability
Up to now: restrictive assumption
  Only consider linear decision boundaries.
Classical counterexample: XOR (figure: the four XOR points in the $(x_1, x_2)$ plane, not separable by a single line).
Slide credit: Bernt Schiele

Generalized Linear Discriminants
Generalization
  Transform the vector x with M nonlinear basis functions $\phi_j(x)$:
    $y_k(x) = \sum_{j=1}^{M} w_{kj}\, \phi_j(x) + w_{k0}$
Purpose of the $\phi_j(x)$
  Allow nonlinear decision boundaries (see the XOR sketch below).
  By choosing the right $\phi_j$, every continuous function can (in principle) be approximated with arbitrary accuracy.
Notation
  $y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x)$  with $\phi_0(x) = 1$
Slide credit: Bernt Schiele
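To illustrate why basis functions help with problems like XOR, here is a small sketch (the particular choice $\phi_3(x) = x_1 x_2$ is just one illustrative basis function, not prescribed by the lecture):

```python
# A nonlinear basis function makes XOR linearly separable in feature space.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # XOR inputs
t = np.array([0, 1, 1, 0], dtype=float)                        # XOR targets

def phi(x):
    # phi_0 = 1 (bias), phi_1 = x1, phi_2 = x2, phi_3 = x1 * x2
    return np.array([1.0, x[0], x[1], x[0] * x[1]])

Phi = np.array([phi(x) for x in X])            # design matrix (N x (M+1))
w = np.linalg.pinv(Phi) @ t                    # least-squares weights

y = Phi @ w                                    # discriminant values
print(np.round(y, 3))                          # [0, 1, 1, 0]: a threshold at 0.5 separates XOR
```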

Generalized Linear Discriminants
Model
  $y_k(x) = \sum_{j=0}^{M} w_{kj}\, \phi_j(x) = y_k(x; w)$
  K functions (outputs) $y_k(x; w)$
Learning in Neural Networks
  Single-layer networks: the $\phi_j$ are fixed, only the weights w are learned.
  Multi-layer networks: both the w and the $\phi_j$ are learned.
  We will take a closer look at neural networks from Lecture 11 on. For now, let's first consider generalized linear discriminants in general.
Slide credit: Bernt Schiele

Gradient Descent
Learning the weights w:
  N training data points: $X = \{x_1, \dots, x_N\}$
  K outputs of decision functions: $y_k(x_n; w)$
  Target vector for each data point: $T = \{t_1, \dots, t_N\}$
  Error function (least-squares error) of the linear model:
    $E(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( y_k(x_n; w) - t_{kn} \right)^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
Slide credit: Bernt Schiele

Gradient Descent
Problem
  The error function can in general no longer be minimized in closed form.
Idea (Gradient Descent)
  Iterative minimization: start with an initial guess $w_{kj}^{(0)}$ for the parameter values, then move towards a (local) minimum by following the gradient:
    $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
  $\eta$: learning rate
  This simple scheme corresponds to a first-order Taylor expansion (there are more complex procedures available).

Gradient Descent – Basic Strategies
Batch learning
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
  $\eta$: learning rate
  Compute the gradient $\frac{\partial E(w)}{\partial w_{kj}}$ based on all training data.
Slide credit: Bernt Schiele

Gradient Descent – Basic Strategies
Sequential updating
  $E(w) = \sum_{n=1}^{N} E_n(w)$
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left. \frac{\partial E_n(w)}{\partial w_{kj}} \right|_{w^{(\tau)}}$
  $\eta$: learning rate
  Compute the gradient $\frac{\partial E_n(w)}{\partial w_{kj}}$ based on a single data point at a time.
Slide credit: Bernt Schiele

Gradient Descent
Error function
  $E(w) = \sum_{n=1}^{N} E_n(w) = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
  $E_n(w) = \frac{1}{2} \sum_{k=1}^{K} \left( \sum_{j=1}^{M} w_{kj}\, \phi_j(x_n) - t_{kn} \right)^2$
  $\frac{\partial E_n(w)}{\partial w_{kj}} = \left( \sum_{\tilde{j}=1}^{M} w_{k\tilde{j}}\, \phi_{\tilde{j}}(x_n) - t_{kn} \right) \phi_j(x_n) = \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n)$
Slide credit: Bernt Schiele

Gradient Descent
Delta rule (= LMS rule)
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n) = w_{kj}^{(\tau)} - \eta\, \delta_{kn}\, \phi_j(x_n)$
  where $\delta_{kn} = y_k(x_n; w) - t_{kn}$
  Simply feed back the input data point, weighted by the classification error (see the sketch below).
Slide credit: Bernt Schiele
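A minimal sketch of sequential gradient descent with the delta rule on synthetic 2-class data (the identity basis functions with $\phi_0 = 1$, the learning rate, and all names are illustrative assumptions):

```python
# Sequential (per-sample) delta-rule training of a generalized linear discriminant.
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 2
X = np.vstack([rng.normal([0, 0], 0.7, (N // K, 2)),
               rng.normal([2, 2], 0.7, (N // K, 2))])
labels = np.repeat(np.arange(K), N // K)
T = np.eye(K)[labels]                                   # 1-of-K targets

def phi(x):                                             # phi_0(x) = 1, identity otherwise
    return np.concatenate(([1.0], x))

M = 3                                                   # number of basis outputs incl. phi_0
W = np.zeros((K, M))                                    # weights w_kj
eta = 0.05                                              # learning rate

for epoch in range(20):
    for n in rng.permutation(N):                        # one data point at a time
        f = phi(X[n])
        y = W @ f                                       # y_k(x_n; w)
        delta = y - T[n]                                # delta_kn = y_k - t_kn
        W -= eta * np.outer(delta, f)                   # delta rule update

pred = np.array([W @ phi(x) for x in X]).argmax(axis=1)
print("training accuracy:", (pred == labels).mean())
```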

Gradient Descent
Case with a differentiable, nonlinear activation function
  $y_k(x) = g(a_k) = g\!\left( \sum_{j=0}^{M} w_{kj}\, \phi_j(x_n) \right)$
Gradient descent
  $\frac{\partial E_n(w)}{\partial w_{kj}} = g'(a_k) \left( y_k(x_n; w) - t_{kn} \right) \phi_j(x_n)$
  $w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta\, \delta_{kn}\, \phi_j(x_n)$  with  $\delta_{kn} = g'(a_k) \left( y_k(x_n; w) - t_{kn} \right)$
Slide credit: Bernt Schiele
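For comparison, a sketch of the same sequential update with a logistic sigmoid activation, where the error signal is additionally weighted by $g'(a) = g(a)(1 - g(a))$; data and hyperparameters are again made up:

```python
# Sequential delta-rule update with a sigmoid activation (single output, 2 classes).
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.7, (100, 2)),
               rng.normal([2, 2], 0.7, (100, 2))])
t = np.repeat([0.0, 1.0], 100)                          # binary targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(3)                                         # (w_0, w_1, w_2), phi_0 = 1
eta = 0.1

for epoch in range(50):
    for n in rng.permutation(len(X)):
        f = np.concatenate(([1.0], X[n]))               # phi(x_n)
        a = w @ f                                       # a = sum_j w_j phi_j(x_n)
        y = sigmoid(a)                                  # y(x_n; w) = g(a)
        delta = y * (1.0 - y) * (y - t[n])              # g'(a) * (y - t_n)
        w -= eta * delta * f                            # weighted delta rule update

pred = (sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w) > 0.5).astype(float)
print("training accuracy:", (pred == t).mean())
```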

Summary: Generalized Linear Discriminants
Properties
  General class of decision functions.
  The nonlinearity $g(\cdot)$ and the basis functions $\phi_j$ allow us to address linearly non-separable problems.
  We have shown a simple sequential learning approach for parameter estimation using gradient descent.
  Better second-order gradient descent approaches are available (e.g. Newton-Raphson).
Limitations / Caveats
  The flexibility of the model is limited by the curse of dimensionality: $g(\cdot)$ and the $\phi_j$ often introduce additional parameters, so models are either limited to lower-dimensional input spaces or need to share parameters.
  The linearly separable case often leads to overfitting: several possible parameter choices minimize the training error.

References and Further Reading
More information on linear discriminant functions can be found in Chapter 4 of Bishop's book (in particular Section 4.1).
  Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.