Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP and: Computer vision: models, learning and inference. 2011 Simon J.D. Prince

Classification

Example: Gender Classification. Incremental logistic regression with 300 arc tan basis functions. Results: 87.5% correct (humans: 95%). (Computer vision: models, learning and inference, 2011, Simon J.D. Prince)

Regression vs. Classification. Regression: $x \in [-1, 1]$, $t \in [-1, 1]$. Classification (two classes): $x \in [-1, 1]$, $t \in \{0, 1\}$.

Regression vs. Classification. Linear regression model prediction ($y$ real): $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$. Classification: we wish the model to predict $y$ in the range $(0, 1)$ (posterior probabilities), so $y(\mathbf{x}) = f(\mathbf{w}^T\mathbf{x} + w_0)$, where $f(\cdot)$ is a nonlinear activation function (generalized linear models). The decision surfaces are $\mathbf{w}^T\mathbf{x} + w_0 = \text{constant}$, so they remain linear in $\mathbf{x}$ even if $f(\cdot)$ is nonlinear.

Decision Theory. Inference step: determine either $p(\mathbf{x}, C_k)$ or $p(C_k \mid \mathbf{x})$. Decision step: for given $\mathbf{x}$, determine the optimal $t$.

Minimum Misclassification Rate

Minimum Misclassification Rate. We are free to choose the decision rule that assigns each point $\mathbf{x}$ to one of the two classes; this defines the decision regions $\mathcal{R}_k$. The probability of a mistake is $p(\text{mistake}) = \int_{\mathcal{R}_1} p(\mathbf{x}, C_2)\,d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, C_1)\,d\mathbf{x}$. To minimize it, minimize the integrand at each $\mathbf{x}$; since $p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\,p(\mathbf{x})$, assign $\mathbf{x}$ to the class for which the posterior $p(C_k \mid \mathbf{x})$ is larger.

Minimum Expected Loss. Example: classify medical images as cancer or normal. A loss matrix $L_{kj}$ specifies the cost of making decision $j$ when the true class is $k$; misclassifying a cancer as normal is far costlier than the reverse.

Minimum Expected Loss. The decision regions $\mathcal{R}_j$ are chosen to minimize the expected loss $\mathbb{E}[L] = \sum_k \sum_j \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}$, i.e., assign each $\mathbf{x}$ to the class $j$ that minimizes $\sum_k L_{kj}\, p(C_k \mid \mathbf{x})$.
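As a minimal sketch (not from the slides), the decision rule can be written directly in code: given posteriors $p(C_k \mid \mathbf{x})$ and a loss matrix $L_{kj}$ (cost of deciding class $j$ when class $k$ is true), pick the decision that minimizes the expected loss. The loss values below are illustrative only.

```python
import numpy as np

# Illustrative loss matrix L[k, j]: cost of deciding class j when class k is true.
# Rows/columns: 0 = cancer, 1 = normal (values made up for illustration).
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

def decide(posterior, loss=L):
    """Return the class index j minimizing the expected loss sum_k L[k, j] * p(C_k | x)."""
    expected_loss = posterior @ loss          # one expected loss per possible decision
    return int(np.argmin(expected_loss))

# Even a modest posterior probability of cancer triggers the 'cancer' decision,
# because missing a cancer (L[0, 1] = 1000) is much costlier than a false alarm.
print(decide(np.array([0.05, 0.95])))  # -> 0 (decide 'cancer')
```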

Reject Option

Three strategies:
1. Model the class-conditional density $p(\mathbf{x} \mid C_k)$ for each class $C_k$ and the prior $p(C_k)$, then use Bayes' theorem: $p(C_k \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_k)\,p(C_k)}{p(\mathbf{x})}$. Equivalently, model the joint distribution $p(\mathbf{x}, C_k)$. (Models of the distribution of both input and output are generative models.)
2. First solve the inference problem of determining the posterior class probabilities $p(C_k \mid \mathbf{x})$ directly, and then use decision theory to assign each new $\mathbf{x}$ to one of the classes (discriminative models).
3. Find a discriminant function that directly maps $\mathbf{x}$ to a class label.

Class-conditional density vs. posterior. [Figure: left, the class-conditional densities $p(x \mid C_1)$ and $p(x \mid C_2)$; right, the corresponding posterior probabilities $p(C_1 \mid x)$ and $p(C_2 \mid x)$, for $x \in [0, 1]$.]

Why Separate Inference and Decision? Minimizing risk (the loss matrix may change over time); the reject option; compensating for unbalanced class priors; combining models.

Several dimensions

Several dimensions. $y(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + w_0$, with weight vector $\mathbf{w}$ and bias $w_0$ (the negative of the bias is sometimes called a threshold). Assign $\mathbf{x}$ to $C_1$ if $y(\mathbf{x}) \geq 0$, and to $C_2$ otherwise. [Figure: the decision surface $y(\mathbf{x}) = 0$ is orthogonal to $\mathbf{w}$ and separates region $\mathcal{R}_1$ ($y > 0$) from $\mathcal{R}_2$ ($y < 0$).]

Perceptron 1. A linear discriminant model due to Rosenblatt (1962): $y(\mathbf{x}) = f(\mathbf{w}^T\phi(\mathbf{x}))$ with $f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$ and feature vector $\phi(\mathbf{x})$.

Perceptron 2. Perceptron criterion: $E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^T\phi_n t_n$, where $\mathcal{M}$ is the set of misclassified patterns. Learning (stochastic gradient descent): $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\,\phi_n t_n$.
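A minimal NumPy sketch of this update rule (the choice of identity features $\phi(\mathbf{x}) = (1, \mathbf{x})$, $\eta = 1$, and the epoch limit are my own assumptions, not from the slides):

```python
import numpy as np

def train_perceptron(X, t, n_epochs=100, eta=1.0):
    """Perceptron learning: X is (N, D), t contains labels in {-1, +1}.
    Uses phi(x) = (1, x) so the bias is absorbed into the weight vector."""
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a 1 to every input
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:               # misclassified pattern (n in M)
                w += eta * phi_n * t_n               # w <- w + eta * phi_n * t_n
                updated = True
        if not updated:                              # no misclassifications: converged
            break
    return w
```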

Perceptron 3. [Figure 4.7: convergence of the perceptron learning algorithm, showing the decision boundary after successive updates on a two-class data set.]

Fisher's linear discriminant 1. Project the data down to one dimension: $y = \mathbf{w}^T\mathbf{x}$. But how should $\mathbf{w}$ be chosen?

Fisher's linear discriminant 2. Define the class means $\mathbf{m}_1 = \frac{1}{N_1}\sum_{n \in C_1}\mathbf{x}_n$ and $\mathbf{m}_2 = \frac{1}{N_2}\sum_{n \in C_2}\mathbf{x}_n$. A first idea: maximize the separation of the projected class means, $m_2 - m_1 = \mathbf{w}^T(\mathbf{m}_2 - \mathbf{m}_1)$. [Figure 4.6]

Fisher's linear discriminant 3. There is still a problem with this approach: two classes that are well separated in the original space $(x_1, x_2)$ can have considerable overlap when projected onto the line joining their means, because of the strongly nondiagonal covariances of the class distributions. Fisher's idea: maximize a function that gives a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. Instead, consider the ratio of the projected class-mean separation to the within-class variance, $J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$, with $s_k^2 = \sum_{n \in C_k}(y_n - m_k)^2$ and $y_n = \mathbf{w}^T\mathbf{x}_n$; the total within-class variance is $s_1^2 + s_2^2$. This is called the Fisher criterion: maximize it!

Fisher's linear discriminant 4. Rewrite the Fisher criterion $J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$ as $J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$, with between-class covariance $\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T$ and within-class covariance $\mathbf{S}_W = \sum_{n \in C_1}(\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^T + \sum_{n \in C_2}(\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^T$.

Fisher's linear discriminant 5. Differentiating $J(\mathbf{w}) = \dfrac{\mathbf{w}^T\mathbf{S}_B\mathbf{w}}{\mathbf{w}^T\mathbf{S}_W\mathbf{w}}$ with respect to $\mathbf{w}$ and setting the result to zero gives $(\mathbf{w}^T\mathbf{S}_B\mathbf{w})\,\mathbf{S}_W\mathbf{w} = (\mathbf{w}^T\mathbf{S}_W\mathbf{w})\,\mathbf{S}_B\mathbf{w}$. Since $\mathbf{S}_B\mathbf{w}$ is proportional to $(\mathbf{m}_2 - \mathbf{m}_1)$, we obtain Fisher's linear discriminant: $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$.
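A small NumPy sketch of this result (function and variable names are my own): compute the class means, the within-class scatter $\mathbf{S}_W$, and the direction $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.
    X1, X2 are arrays of shape (N1, D) and (N2, D)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class covariance S_W: sum over both classes of (x_n - m_k)(x_n - m_k)^T
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)     # w proportional to S_W^{-1} (m2 - m1)
    return w / np.linalg.norm(w)          # normalize; only the direction matters
```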

Fisher's linear discriminant 6. Summary: the Fisher criterion $J(\mathbf{w}) = \dfrac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$ is maximized by $\mathbf{w} \propto \mathbf{S}_W^{-1}(\mathbf{m}_2 - \mathbf{m}_1)$. [Figure 4.6: projection onto the line joining the class means (considerable overlap) vs. projection onto the Fisher direction (good separation).]

Least squares for classification fails. [Figure: the least-squares decision boundary is strongly affected by outlying points, whereas logistic regression is not.] Use logistic regression instead!

Probabilistic generative models. The posterior probability for class $C_1$ can be written $p(C_1 \mid \mathbf{x}) = \dfrac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_1)\,p(C_1) + p(\mathbf{x} \mid C_2)\,p(C_2)} = \dfrac{1}{1 + \exp(-a)} = \sigma(a)$, with $a = \ln\dfrac{p(\mathbf{x} \mid C_1)\,p(C_1)}{p(\mathbf{x} \mid C_2)\,p(C_2)}$ and the logistic sigmoid function $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$.

Logistic sigmoid function. [Figure: plot of $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$ for $a \in [-5, 5]$.]

Gaussian class-conditional densities (different means, but equal covariances): $p(\mathbf{x} \mid C_k) = \dfrac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}_k)\right\}$. This yields $p(C_1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + w_0)$ with $\mathbf{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $w_0 = -\tfrac{1}{2}\boldsymbol{\mu}_1^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_1 + \tfrac{1}{2}\boldsymbol{\mu}_2^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_2 + \ln\dfrac{p(C_1)}{p(C_2)}$: the argument of the logistic sigmoid is a linear function of $\mathbf{x}$.
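A sketch of this generative route in code, under the assumptions above (Gaussian class-conditionals with a shared covariance, means and covariance estimated from data, priors from class frequencies). Names are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_shared_cov_gaussians(X1, X2):
    """Fit mu1, mu2, a shared Sigma, and priors; return (w, w0) so that
    p(C1 | x) = sigmoid(w @ x + w0)."""
    N1, N2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Pooled (shared) maximum-likelihood covariance estimate
    Sigma = ((X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)) / (N1 + N2)
    prior1, prior2 = N1 / (N1 + N2), N2 / (N1 + N2)
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (mu1 - mu2)
    w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(prior1 / prior2)
    return w, w0

# Posterior for a new point x:  p(C1 | x) = sigmoid(w @ x + w0)
```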

[Figure: class-conditional densities for two classes (left) and the resulting posterior probability $p(C_1 \mid \mathbf{x})$ (right).]

Probabilistic discriminative models: logistic regression, a model of classification. The posterior probability of class $C_1$ can be written as a logistic sigmoid acting on a linear function of the feature vector $\phi$: $p(C_1 \mid \phi) = y(\phi) = \sigma(\mathbf{w}^T\phi)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$ is the logistic sigmoid, and $p(C_2 \mid \phi) = 1 - p(C_1 \mid \phi)$. For an $M$-dimensional feature space this model has $M$ adjustable parameters (versus $M(M+5)/2 + 1$ for the generative model).
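For reference, the generative count follows from a standard tally (not spelled out on the slide): the two class means contribute $2M$ parameters, the shared covariance contributes $M(M+1)/2$, and the class prior contributes $1$, giving $2M + \tfrac{M(M+1)}{2} + 1 = \tfrac{M(M+5)}{2} + 1$, whereas logistic regression needs only the $M$ weights.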

Maximum likelihood logistic regression (1). For a data set $\{\phi_n, t_n\}$ with $t_n \in \{0, 1\}$ and $\phi_n = \phi(\mathbf{x}_n)$, $n = 1, \dots, N$, the likelihood is $p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n}\{1 - y_n\}^{1 - t_n}$, with $\mathbf{t} = (t_1, \dots, t_N)^T$ and $y_n = p(C_1 \mid \phi_n)$. Taking the negative logarithm gives the (cross-entropy) error function $E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w}) = -\sum_{n=1}^{N}\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\}$, with $y_n = \sigma(\mathbf{w}^T\phi_n)$.

Maximum likelihood logistic regression (2). The gradient of the error function $E(\mathbf{w}) = -\sum_{n=1}^{N}\{t_n \ln y_n + (1 - t_n)\ln(1 - y_n)\}$ with respect to $\mathbf{w}$ is $\nabla E(\mathbf{w}) = \sum_{n=1}^{N}(y_n - t_n)\,\phi_n$. Stochastic gradient update rule: $\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_n$, where $\tau$ is the iteration number and $\eta$ is a learning-rate parameter.
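A minimal sketch of this stochastic gradient update (the features $\phi(\mathbf{x}) = (1, \mathbf{x})$, fixed learning rate, epoch count, and random visiting order are my own assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_sgd(X, t, eta=0.1, n_epochs=100, seed=0):
    """Logistic regression by stochastic gradient descent.
    X is (N, D), t contains labels in {0, 1}; uses phi(x) = (1, x)."""
    rng = np.random.default_rng(seed)
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(Phi)):       # visit patterns in random order
            y_n = sigmoid(w @ Phi[n])             # y_n = sigma(w^T phi_n)
            w -= eta * (y_n - t[n]) * Phi[n]      # grad E_n = (y_n - t_n) phi_n
    return w
```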

Example. [Figure: data in the original space $(x_1, x_2)$ that are not linearly separable become linearly separable in the transformed feature space $(\phi_1, \phi_2)$, so a linear boundary in $\phi$-space corresponds to a nonlinear boundary in $\mathbf{x}$-space.]

Logistic regression

Two parameters. Learning by standard methods (ML, MAP, Bayesian). Inference: just evaluate $\Pr(w \mid \mathbf{x})$.

Neater Notation. To make the notation easier to handle, we attach a 1 to the start of every data vector and attach the offset to the start of the parameter vector $\boldsymbol{\phi}$. New model: the activation is simply $\boldsymbol{\phi}^T\mathbf{x}$, with no separate offset term.
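A tiny illustration of this bookkeeping trick (a sketch with made-up values, not Prince's code):

```python
import numpy as np

x = np.array([0.4, -1.2, 3.0])                 # original data vector
phi0, phi = 0.5, np.array([1.0, 2.0, -0.7])    # offset and weights (illustrative values)

x_aug = np.concatenate([[1.0], x])             # attach a 1 to the start of the data vector
phi_aug = np.concatenate([[phi0], phi])        # attach the offset to the start of the parameters

assert np.isclose(phi_aug @ x_aug, phi0 + phi @ x)   # same activation, neater notation
```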

Logistic regression