Machine Learning Lecture 7

Course Outline
Machine Learning Lecture 7, 23.05.2016
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de | leibe@vision.rwth-aachen.de
Many slides adapted from B. Schiele

- Fundamentals (2 weeks): Bayes Decision Theory, Probability Density Estimation, Statistical Learning Theory
- Discriminative Approaches (5 weeks): Linear Discriminant Functions, Statistical Learning Theory & SVMs, Ensemble Methods & Boosting, Randomized Trees, Forests & Ferns
- Generative Models (4 weeks): Bayesian Networks, Markov Random Fields

Topics of This Lecture
- Recap: Linear Discriminant Functions
- Recap: Generalized Linear Discriminants
- Logistic Regression: probabilistic discriminative models, logistic sigmoid (logit function), cross-entropy error, gradient descent, Iteratively Reweighted Least Squares, note on error functions
- Statistical Learning Theory: generalization and overfitting, empirical and actual risk, VC dimension, Empirical Risk Minimization, Structural Risk Minimization

Recap: Linear Discriminant Functions
- Basic idea: directly encode the decision boundary and minimize the misclassification probability directly.
- Linear discriminant functions: y(x) = w^T x + w_0, with weight vector w and bias w_0 (= threshold). Together, w and w_0 define a hyperplane in R^D.
- The decision boundary is y(x) = 0; the two half-spaces y(x) > 0 and y(x) < 0 correspond to the two classes.
- If a data set can be perfectly classified by a linear discriminant, we call it linearly separable.

Recap: Extension to Nonlinear Basis Functions
- Generalization: transform the vector x with M nonlinear basis functions \phi_j(x) and consider models of the form (see the sketch below)
      y_k(x) = \sum_{j=1}^{M} w_{kj} \phi_j(x) + w_{k0},
  where the \phi_j(x) are known as basis functions.
- In the simplest case, we use linear basis functions: \phi_d(x) = x_d. Other popular basis functions: polynomial, Gaussian, sigmoid.
- Advantages: the transformation allows non-linear decision boundaries, and by choosing the right \phi_j, every continuous function can (in principle) be approximated with arbitrary accuracy.
- Disadvantage: the error function can in general no longer be minimized in closed form. Minimization is done with gradient descent.
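To make the recap concrete, here is a minimal sketch (not part of the original slides) of evaluating a generalized linear discriminant y(x) = \sum_j w_j \phi_j(x) + w_0 with Gaussian basis functions; the centers, width s, and weights below are illustrative assumptions.

```python
import numpy as np

def gaussian_basis(X, centers, s=1.0):
    """Map inputs X (N x D) to features phi_j(x) = exp(-||x - c_j||^2 / (2 s^2))."""
    # Pairwise squared distances between data points and basis-function centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * s ** 2))

def linear_discriminant(X, w, w0, centers, s=1.0):
    """Evaluate y(x) = w^T phi(x) + w0; classify as class 1 if y(x) > 0."""
    Phi = gaussian_basis(X, centers, s)
    return Phi @ w + w0

# Toy usage with made-up centers and weights
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
centers = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
w = np.array([0.5, -1.0, 2.0])
y = linear_discriminant(X, w, w0=-0.1, centers=centers)
print(np.where(y > 0, 1, 0))   # predicted class labels
```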

Recap: Gradient Descent
- Iterative minimization: start with an initial guess for the parameter values w^{(0)}, then move towards a (local) minimum by following the gradient.
- Basic strategies:
  - Batch learning: w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \, \partial E(w) / \partial w_{kj} evaluated at w^{(\tau)}
  - Sequential updating: w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \, \partial E_n(w) / \partial w_{kj} evaluated at w^{(\tau)}, where E(w) = \sum_n E_n(w).
- Example: quadratic error function E(w) = \sum_n (y(x_n; w) - t_n)^2.
- Sequential updating then leads to the delta rule (= LMS rule):
      w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta (y_k(x_n; w) - t_{kn}) \phi_j(x_n) = w_{kj}^{(\tau)} - \eta \, \delta_{kn} \phi_j(x_n),
  with \delta_{kn} = y_k(x_n; w) - t_{kn}.
- Interpretation: simply feed back the input data point, weighted by the classification error.

Recap: Gradient Descent with a Nonlinear Activation Function
- Case of a differentiable, non-linear activation function g:
      y_k(x) = g(a_k) = g( \sum_{j=0}^{M} w_{kj} \phi_j(x) )
- Gradient descent (again with the quadratic error function):
      \partial E_n(w) / \partial w_{kj} = \frac{\partial g(a_k)}{\partial a_k} (y_k(x_n; w) - t_{kn}) \phi_j(x_n) = \delta_{kn} \phi_j(x_n),
  with \delta_{kn} = \frac{\partial g(a_k)}{\partial a_k} (y_k(x_n; w) - t_{kn}).

Recap: Classification as Dimensionality Reduction
- Interpret linear classification as a projection onto a lower-dimensional space: y = w^T x.
- Learning problem: try to find the projection vector w that maximizes class separation.
- [Figure: 1-D projections of two classes for two choices of w, showing bad vs. good separation. Image source: C.M. Bishop, 2006]

Recap: Fisher's Linear Discriminant Analysis
- Goal: maximize the distance between classes while minimizing the distance within each class.
- Criterion: J(w) = (w^T S_B w) / (w^T S_W w), where S_B is the between-class scatter matrix and S_W the within-class scatter matrix.
- The optimal solution for w can be obtained as w \propto S_W^{-1} (m_2 - m_1); a sketch of this computation follows below.
- Classification function: y(x) = w^T x + w_0, with w_0 = -w^T m.
- [Figure: projection direction w separating Class 1 and Class 2. Slide adapted from Ales Leonardis]

Topics of This Lecture (next up): Logistic Regression: probabilistic discriminative models, the logistic sigmoid (logit function), the cross-entropy error, gradient descent, Iteratively Reweighted Least Squares, and a note on error functions.
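Fisher's criterion above reduces to a few lines of linear algebra. The following is a minimal sketch under the assumption of two classes stored as row-wise arrays; all variable names are my own.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher's linear discriminant: w proportional to S_W^{-1} (m2 - m1).

    X1, X2: arrays of shape (N1, D) and (N2, D) holding the two classes.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrix S_W = sum of the per-class scatter matrices
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# Projection y = w^T x; thresholding at w^T m (overall mean m) yields a classifier.
```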

Probabilistic Discriminative Models
- We have seen that we can write the posterior as a logistic sigmoid,
      p(C_1 | x) = \sigma(a) = \frac{1}{1 + \exp(-a)},
  by setting
      a = \ln \frac{p(x | C_1) p(C_1)}{p(x | C_2) p(C_2)}.
- Equivalently, in a feature space \phi,
      p(C_1 | \phi) = \frac{p(\phi | C_1) p(C_1)}{p(\phi | C_1) p(C_1) + p(\phi | C_2) p(C_2)}.
- Or we can use generalized linear discriminant models: a = w^T x, or a = w^T \phi(x).
- In the following, we consider models of the form
      p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi),   p(C_2 | \phi) = 1 - p(C_1 | \phi).
  This model is called logistic regression.
- Why should we do this? What advantage does such a model have compared to modeling the class-conditional probabilities? Any ideas?

Comparison: Number of Parameters
- Assume we have an M-dimensional feature space \phi and we represent p(\phi | C_k) and p(C_k) by Gaussians (with a shared covariance matrix). How many parameters do we need?
  - For the means: 2M
  - For the covariance: M(M+1)/2
  - Together with the class prior, this gives M(M+5)/2 + 1 parameters.
- How many parameters does logistic regression, p(C_1 | \phi) = y(\phi) = \sigma(w^T \phi), need? Just the values of w, i.e. M parameters.
- For large M, logistic regression has clear advantages!

Logistic Sigmoid: Properties
- Definition: \sigma(a) = \frac{1}{1 + \exp(-a)}
- Inverse (logit function): a = \ln \frac{\sigma}{1 - \sigma}
- Symmetry property: \sigma(-a) = 1 - \sigma(a)
- Derivative: \frac{d\sigma}{da} = \sigma(1 - \sigma)

Logistic Regression
- Consider a data set {\phi_n, t_n} with n = 1, ..., N, where \phi_n = \phi(x_n), t_n \in {0, 1}, and t = (t_1, ..., t_N)^T.
- With y_n = p(C_1 | \phi_n), we can write the likelihood as
      p(t | w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}.
- Define the error function as the negative log-likelihood:
      E(w) = -\ln p(t | w) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}.
  This is the so-called cross-entropy error function.

Gradient of the Error Function
- With y_n = \sigma(w^T \phi_n), the derivative of the sigmoid gives dy_n/dw = y_n (1 - y_n) \phi_n.
- Gradient:
      \nabla E(w) = -\sum_n \left\{ t_n \frac{y_n(1 - y_n)}{y_n} - (1 - t_n) \frac{y_n(1 - y_n)}{1 - y_n} \right\} \phi_n
                  = -\sum_n \{ t_n(1 - y_n) - (1 - t_n) y_n \} \phi_n
                  = \sum_n (y_n - t_n) \phi_n.
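The sigmoid properties and the gradient \nabla E(w) = \sum_n (y_n - t_n) \phi_n translate directly into code. This is a minimal sketch assuming a design matrix Phi of shape (N, M) and targets t in {0, 1}; the learning rate and iteration count are arbitrary choices, not values from the lecture.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t, eps=1e-12):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))

def gradient(w, Phi, t):
    """grad E(w) = sum_n (y_n - t_n) phi_n = Phi^T (y - t)."""
    y = sigmoid(Phi @ w)
    return Phi.T @ (y - t)

def gradient_descent(Phi, t, eta=0.1, iters=1000):
    """Plain (batch) gradient descent on the cross-entropy error."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        w -= eta * gradient(w, Phi, t)
    return w
```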

Gradient of the Error Function (cont'd)
- Gradient for logistic regression: \nabla E(w) = \sum_n (y_n - t_n) \phi_n.
- Does this look familiar? This is the same result as for the delta (= LMS) rule:
      \Delta w_{kj} = -\eta (y_k(x_n; w) - t_{kn}) \phi_j(x_n).
- We can use this to derive a sequential estimation algorithm. However, this will be quite slow.

A More Efficient Iterative Method
- Second-order Newton-Raphson gradient descent scheme:
      w^{(\tau+1)} = w^{(\tau)} - H^{-1} \nabla E(w),
  where H = \nabla\nabla E(w) is the Hessian matrix, i.e. the matrix of second derivatives.
- Properties: local quadratic approximation to the log-likelihood; faster convergence.

Newton-Raphson for Least-Squares Estimation
- Let's first apply Newton-Raphson to the least-squares error function:
      E(w) = \frac{1}{2} \sum_n (w^T \phi_n - t_n)^2
      \nabla E(w) = \sum_n (w^T \phi_n - t_n) \phi_n = \Phi^T \Phi w - \Phi^T t
      H = \nabla\nabla E(w) = \sum_n \phi_n \phi_n^T = \Phi^T \Phi,
  where \Phi = (\phi_1^T; ...; \phi_N^T) is the N x M design matrix.
- Resulting update scheme:
      w^{(\tau+1)} = w^{(\tau)} - (\Phi^T \Phi)^{-1} (\Phi^T \Phi w^{(\tau)} - \Phi^T t) = (\Phi^T \Phi)^{-1} \Phi^T t.
  Closed-form solution!

Newton-Raphson for Logistic Regression
- Now apply Newton-Raphson to the cross-entropy error function:
      E(w) = -\sum_n \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \},   with dy_n/dw = y_n(1 - y_n)\phi_n
      \nabla E(w) = \sum_n (y_n - t_n) \phi_n = \Phi^T (y - t)
      H = \nabla\nabla E(w) = \sum_n y_n (1 - y_n) \phi_n \phi_n^T = \Phi^T R \Phi,
  where R is an N x N diagonal matrix with R_{nn} = y_n (1 - y_n).
- The Hessian is no longer constant, but depends on w through the weighting matrix R.

Iteratively Reweighted Least Squares
- Update equations:
      w^{(\tau+1)} = w^{(\tau)} - (\Phi^T R \Phi)^{-1} \Phi^T (y - t)
                   = (\Phi^T R \Phi)^{-1} \{ \Phi^T R \Phi w^{(\tau)} - \Phi^T (y - t) \}
                   = (\Phi^T R \Phi)^{-1} \Phi^T R z,   with z = \Phi w^{(\tau)} - R^{-1} (y - t).
- Again a very similar form (normal equations), but now with a non-constant weighting matrix R that depends on w.
- We therefore need to apply the normal equations iteratively: Iteratively Reweighted Least Squares (IRLS). A short sketch in code follows below.

Summary: Logistic Regression
- Properties:
  - Directly represents the posterior distribution p(C_k | \phi).
  - Requires fewer parameters than modeling the likelihood + prior; very often used in statistics.
  - It can be shown that the cross-entropy error function is convex, so optimization leads to a unique minimum.
  - But no closed-form solution exists: iterative optimization (IRLS); both online and batch optimizations exist.
  - There is a multi-class version, described in Bishop, Ch. 4.3.4.
- Caveat: logistic regression tends to systematically overestimate odds ratios when the sample size is less than ~500.
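A compact sketch of the IRLS update w^{(\tau+1)} = (\Phi^T R \Phi)^{-1} \Phi^T R z. The small ridge term added for numerical stability is my own safeguard, not part of the lecture's derivation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, iters=20, ridge=1e-8):
    """Iteratively Reweighted Least Squares for logistic regression.

    Phi: (N, M) design matrix, t: (N,) targets in {0, 1}.
    """
    N, M = Phi.shape
    w = np.zeros(M)
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        R = y * (1.0 - y)                              # diagonal of the weighting matrix
        z = Phi @ w - (y - t) / np.maximum(R, ridge)   # working targets
        A = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(M)
        w = np.linalg.solve(A, Phi.T @ (R * z))        # weighted normal equations
    return w
```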

A Note on Error Functions
- All error functions below are considered as functions of z_n = t_n y(x_n), with targets t_n \in {-1, +1}. [Image source: Bishop, 2006]
- Ideal misclassification error (black curve):
  - This is what we would like to approximate, but unfortunately it is not differentiable.
  - The gradient is zero for misclassified points, so we cannot minimize it by gradient descent.
- Squared error:
  - Used in least-squares classification. Very popular, leads to closed-form solutions.
  - However, it is sensitive to outliers due to the squared penalty, and it penalizes "too correct" data points (z_n > 1).
  - Generally does not lead to good classifiers.
- Squared error with sigmoid (tanh) activation:
  - Fixes the sensitivity to outliers and the penalty on "too correct" data points.
  - But: zero gradient for confidently misclassified data points.
  - Gives better performance than the original squared error, but still does not fix all problems.
- Cross-entropy error:
  - The minimizer of this error is given by the posterior class probabilities.
  - Convex error function, so a unique minimum exists.
  - Robust to outliers: the error increases only roughly linearly for negative z_n.
  - But no closed-form solution; requires iterative estimation.
- A small numerical comparison of these error functions follows below.

Topics of This Lecture (next up): Statistical Learning Theory: generalization and overfitting, empirical and actual risk, VC dimension, Empirical Risk Minimization, Structural Risk Minimization.
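To see these qualitative differences numerically, one can evaluate the error functions over a range of z_n = t_n y(x_n) (targets in {-1, +1}). This is only an illustrative sketch; the rescaling of the cross-entropy curve so that it passes through (0, 1) is a presentation choice.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)                     # z_n = t_n * y(x_n)

misclassification = (z < 0).astype(float)         # ideal 0/1 error (not differentiable)
squared = (z - 1.0) ** 2                          # squared error, also penalizes z > 1
cross_entropy = np.log1p(np.exp(-z)) / np.log(2)  # logistic loss, rescaled through (0, 1)

for row in zip(z, misclassification, squared, cross_entropy):
    print("z=%+.1f  0/1=%.0f  squared=%.2f  cross-entropy=%.2f" % row)
```

The printout shows the behavior described above: the squared error grows quadratically for strongly misclassified points (outlier sensitivity) and also for z > 1 ("too correct" points), while the cross-entropy error grows only roughly linearly for negative z.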

Generalization and Overfitting
- Goal: predict the class labels of new observations.
- We train the classification model on a limited training set. The further we optimize the model parameters, the more the training error decreases. However, at some point the test error goes up again: the model overfits to the training set.
- [Figure: training error and test error curves. Image source: B. Schiele]

Example: Linearly Separable Data
- Overfitting is often a problem with linearly separable data.
- Which of the many possible decision boundaries is correct? All of them have zero error on the training set, yet they will most likely result in different predictions on novel test data, i.e. different generalization performance.
- How do we select the classifier with the best generalization performance?

A Broader View: Statistical Learning Theory
- Formal treatment of supervised learning (from the learning machine's point of view):
  - Given: training examples {(x_i, y_i)}_{i=1}^{N}.
  - Environment: assumed stationary, i.e. the data x have an unknown but fixed probability density p_X(x).
  - Teacher: specifies for each data point x the desired classification y (where y may be subject to noise): y = g(x; \nu) with noise \nu.
  - Learning machine: represented by a class of functions that produce an output y for each x: y = f(x; \alpha) with parameters \alpha.
- Goal: the desired response y shall be approximated optimally by selecting a specific function f(x; \alpha).
- Measuring optimality: a loss function L(y, f(x; \alpha)); example: quadratic loss L(y, f(x; \alpha)) = (y - f(x; \alpha))^2.

Risk
- Measure the optimality by the risk (= expected loss). Difficulty: how should the risk be estimated?
- Practical way: the empirical risk, measured on the training/validation set (see the sketch below):
      R_emp(\alpha) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i; \alpha)),
  e.g. with the quadratic loss R_emp(\alpha) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i; \alpha))^2.
- However, what we are really interested in is the actual risk (= expected risk):
      R(\alpha) = \int L(y, f(x; \alpha)) \, dP_{X,Y}(x, y),
  where P_{X,Y}(x, y) is the joint probability distribution of (x, y). It is fixed, but typically unknown, so in general we cannot compute the actual risk directly.
- The expected risk is the expectation of the error over all data, i.e. the expected value of the generalization error.
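A small sketch of the empirical risk R_emp(\alpha) = (1/N) \sum_i L(y_i, f(x_i; \alpha)) with the quadratic loss; the model f and its parameters below are placeholders chosen for illustration.

```python
import numpy as np

def quadratic_loss(y, y_hat):
    """L(y, f(x; alpha)) = (y - f(x; alpha))^2."""
    return (y - y_hat) ** 2

def empirical_risk(f, alpha, X, y, loss=quadratic_loss):
    """R_emp(alpha) = (1/N) * sum_i L(y_i, f(x_i; alpha))."""
    predictions = np.array([f(x, alpha) for x in X])
    return np.mean(loss(y, predictions))

# Example with a placeholder linear model f(x; alpha) = alpha^T x
f = lambda x, alpha: float(alpha @ x)
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1.0, -0.5, 2.0])
print(empirical_risk(f, np.array([1.0, 0.0]), X, y))
```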

Summary: Risk
- Actual risk: it is a measure of the generalization ability (advantage), but in general we do not know P_{X,Y}(x, y) (disadvantage).
- Empirical risk: it does not depend on P_{X,Y}(x, y) (advantage), but it is no direct measure of the generalization ability (disadvantage).
- We typically know learning algorithms that minimize the empirical risk, hence the strong interest in the connection between the two types of risk.

Statistical Learning Theory
- Idea: compute an upper bound on the actual risk based on the empirical risk:
      R(\alpha) \le R_emp(\alpha) + \epsilon(N, p^*, h),
  where N is the number of training examples, p^* the probability that the bound is correct, and h the capacity of the learning machine ("VC dimension").
- Side note: this idea of specifying a bound that only holds with a certain probability is explored in a branch of learning theory called Probably Approximately Correct (PAC) learning.

VC Dimension
- The Vapnik-Chervonenkis dimension is a measure for the capacity of a learning machine.
- Formal definition: if a given set of \ell points can be labeled in all possible 2^\ell ways, and for each labeling a member of the function set {f(\alpha)} can be found which correctly assigns those labels, we say that the set of points is shattered by the set of functions. The VC dimension of {f(\alpha)} is defined as the maximum number of training points that can be shattered by {f(\alpha)}.
- Interpretation as a two-player game:
  - Opponent's turn: he says a number N.
  - Our turn: we specify a set of N points {x_1, ..., x_N}.
  - Opponent's turn: he gives us a labeling (y_1, ..., y_N) \in {0, 1}^N.
  - Our turn: we specify a function f(\alpha) which correctly classifies all N points.
  - If we can do that for all 2^N possible labelings, then the VC dimension is at least N.
- Example: the VC dimension of all oriented lines in R^2 is 3 (a brute-force check is sketched below).
  1. Shattering 3 points with an oriented line is possible. [Image source: C. Burges, 1998]
  2. More difficult to show: it is not possible to shatter 4 points (XOR configuration).
  More generally: the VC dimension of all hyperplanes in R^n is n+1.
- Intuitive feeling (unfortunately wrong): the VC dimension has a direct connection with the number of parameters.
- Counterexample: f(x; \alpha) = g(\sin(\alpha x)) with g(x) = 1 for x > 0 and g(x) = -1 for x \le 0. Just a single parameter \alpha, yet the VC dimension is infinite.
  Proof sketch: choose x_i = 10^{-i}, i = 1, ..., \ell. Then for any labeling y_1, ..., y_\ell \in {-1, +1}, the parameter
      \alpha = \pi \left( 1 + \sum_{i=1}^{\ell} \frac{(1 - y_i) 10^i}{2} \right)
  assigns the correct labels.
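As an illustration of shattering (not from the lecture), the brute-force check below tries every labeling of a point set and asks whether an oriented line w^T x + w_0 can realize it. It uses the perceptron algorithm with an iteration cap as a separability test, so it is a heuristic sketch rather than a proof.

```python
import itertools
import numpy as np

def separable(X, y, epochs=500):
    """Heuristic: run the perceptron; if it reaches zero errors, the labeling
    y (in {-1, +1}) is realizable by an oriented line w^T x + w0."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias coordinate
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xb, y):
            if yi * (w @ xi) <= 0:
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False

def shattered(X):
    """True if every one of the 2^N labelings of X is linearly separable."""
    return all(separable(X, np.array(lab))
               for lab in itertools.product([-1, 1], repeat=len(X)))

print(shattered(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))      # 3 points: True
print(shattered(np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)))   # 4 points (XOR): False
```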

Upper Bound on the Risk
- Important result (Vapnik 1979, 1995): with probability (1 - \eta), the following bound holds:
      R(\alpha) \le R_emp(\alpha) + \sqrt{ \frac{h (\log(2N/h) + 1) - \log(\eta/4)}{N} },
  where the second term is called the VC confidence; it is the \epsilon(N, p^*, h) from above, with p^* = 1 - \eta.
- This bound is independent of P_{X,Y}(x, y)!
- Typically, we cannot compute the left-hand side (the actual risk). But if we know h (the VC dimension), we can easily compute the risk bound
      R(\alpha) \le R_emp(\alpha) + \epsilon(N, p^*, h)
  (see the sketch below).

Structural Risk Minimization
- How can we implement this bound, R(\alpha) \le R_emp(\alpha) + \epsilon(N, p^*, h)?
- Classic approach: keep \epsilon(N, p^*, h) constant and minimize R_emp(\alpha); \epsilon(N, p^*, h) can be kept constant by controlling the model parameters.
- Support Vector Machines (SVMs): keep R_emp(\alpha) constant and minimize \epsilon(N, p^*, h). In fact, R_emp(\alpha) = 0 for separable data. \epsilon(N, p^*, h) is controlled by adapting the VC dimension (controlling the "capacity" of the classifier).

References and Further Reading
- More information on SVMs can be found in Chapter 7.1 of Bishop's book: Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
- Additional information about Statistical Learning Theory and a more in-depth introduction to SVMs are available in the following tutorial: C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, Vol. 2(2), pp. 121-167, 1998.
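The VC confidence term and the resulting risk bound can be computed directly. The values of N, h, \eta, and R_emp in the example are made up for illustration.

```python
import numpy as np

def vc_confidence(N, h, eta):
    """sqrt( (h * (log(2N/h) + 1) - log(eta/4)) / N ), valid with probability 1 - eta."""
    return np.sqrt((h * (np.log(2.0 * N / h) + 1.0) - np.log(eta / 4.0)) / N)

def risk_bound(R_emp, N, h, eta=0.05):
    """Upper bound R(alpha) <= R_emp(alpha) + VC confidence, with probability 1 - eta."""
    return R_emp + vc_confidence(N, h, eta)

# Example: 10,000 training points, VC dimension 3 (oriented lines in R^2),
# 5% empirical error, bound holding with 95% probability.
print(risk_bound(R_emp=0.05, N=10_000, h=3, eta=0.05))
```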