Logistic Regression. Pattern Recognition 2016, Sandro Schönborn, Department of Mathematics and Computer Science (GRAVIS), University of Basel

Two Worlds: Probabilistic & Algorithmic. We have seen two conceptual approaches to classification. The Bayes classifier is a probabilistic classifier with a generative setup based on class density models (data → class density estimation (learning) → classification rule → decision); examples are Bayes (Gauss) and Naïve Bayes. Direct classifiers find the best parameters (e.g. w) with respect to a specific loss function measuring misclassification (data → classification function (learning) → decision); examples are the Perceptron, SVM, trees and ANNs. Can we have a probabilistic classifier with a modelling focus on classification?

Advantages of Both Worlds. The posterior distribution has advantages over a bare classification label: with asymmetric risks we need the classification probability, and the classification certainty indicates whether a decision is unsure. The algorithmic approach with direct learning also has advantages: the modelling power is focused on correct classification where it counts, and the decision line is easier to interpret.

Discriminative Probabilistic Classifier. (Figures: Bishop PRML.) Three modelling views side by side: the Bayes classifier models the class densities p(x|C_1) and p(x|C_2); the discriminative probabilistic classifier models the posterior P(C_2|x) ∝ p(x|C_2) P(C_2) directly; the linear classifier models the discriminant g(x) = w^T x + w_0.

Towards a Direct Probabilistic Classifier. Idea 1: directly learn a posterior distribution. For classification with the Bayes classifier, the posterior distribution is what matters. We can directly estimate a model of this distribution (we called this a discriminative classifier in the Naïve Bayes lecture). We know from Naïve Bayes that we can probably expect good performance from the posterior model. Idea 2: extend linear classification with a probabilistic interpretation. The linear classifier outputs a distance to the decision plane. We can use this value and interpret it probabilistically: the further away, the more certain.

Logistic Regression. Logistic regression implements both ideas: it is a model of the posterior class distribution for classification and can be interpreted as a probabilistic linear classifier. But it is a fully probabilistic model, not merely a post-processing of a linear classifier. It extends the hyperplane decision idea to the Bayes world: a direct model of the posterior for classification; a probabilistic model (classification according to a probability distribution); a discriminative model (it models the posterior rather than likelihood and prior); a linear model for classification. It is simple and accessible (we can understand it), and we can study its relation to other linear classifiers, e.g. the SVM.

History of Logistic Regression. Logistic regression is a very old method of statistical analysis and is in widespread use, especially in the traditional statistics community (not machine learning). It is more often used to study and identify explanatory factors than to do individual prediction (statistical analysis vs. the prediction focus of modern machine learning). Many medical studies of risk factors etc. are based on logistic regression.

Repetition: Linear Classifier. Linear classification rule (figure: Bishop PRML):
g(x) = w^T x + w_0
Decide class 1 if g(x) ≥ 0 and class 2 if g(x) < 0. The decision boundary is a hyperplane.
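
A minimal Python/NumPy sketch of this rule; the weights w, the offset w0 and the test points are made-up values for illustration, not taken from the slides:

    import numpy as np

    def linear_decision(x, w, w0):
        """Classify by the sign of g(x) = w^T x + w0: class 1 if g(x) >= 0, class 2 otherwise."""
        g = np.dot(w, x) + w0
        return 1 if g >= 0 else 2

    # Hypothetical parameters: decision boundary x1 + x2 = 1
    w, w0 = np.array([1.0, 1.0]), -1.0
    print(linear_decision(np.array([2.0, 0.5]), w, w0))  # 1 (on the positive side)
    print(linear_decision(np.array([0.2, 0.3]), w, w0))  # 2 (on the negative side)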

Repetition: Posterior Distribution. Classification with the posterior distribution (Bayes), based on class densities and a prior (figure: Bishop PRML):
P(C_1 | x) = p(x | C_1) P(C_1) / ( p(x | C_1) P(C_1) + p(x | C_2) P(C_2) )
P(C_2 | x) = p(x | C_2) P(C_2) / ( p(x | C_1) P(C_1) + p(x | C_2) P(C_2) )

Combination: Discriminative Classifier. (Figure: Bishop PRML, showing the decision boundary.) Probabilistic interpretation of the classification output: ~distance to the separation plane.

Notation Changes. We work with two classes: data with (numerical) feature vectors x and labels y ∈ {0, 1}. We no longer use the Bayes notation with ω; we will need the explicit label value of y in our models later. Classification goal: infer the best class for a given feature point,
y* = argmax_{y ∈ {0,1}} P(y | x)
All our modelling focuses only on the posterior of having class 1, P(y = 1 | x). Obtaining the other is trivial: P(y = 0 | x) = 1 − P(y = 1 | x).

Parametric Posterior Model. We need a model for the posterior distribution, depending on the feature vector (of course) and neatly parameterized:
P(y = 1 | x, θ) = f(x; θ)
The linear classifier is a good starting point; we know its parametrization very well:
g(x; w, w_0) = w^T x + w_0
We thus model the posterior as a function of the linear classifier:
P(y = 1 | x, w, w_0) = f(w^T x + w_0)
The posterior follows from the classification result: a scaled distance to the decision plane.

Logistic Function. To use the unbounded distance to the decision plane in a probabilistic setup, we need to map it into the interval [0, 1]. This is very similar to the activation function in neural nets. The logistic function σ(x) squashes a value x ∈ ℝ to (0, 1):
σ(x) = 1 / (1 + e^(−x))
The logistic function is a smooth, soft threshold: σ(x) → 1 as x → ∞, σ(x) → 0 as x → −∞, and σ(0) = 1/2.
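
A small Python/NumPy sketch of the logistic function, checking the three properties just listed (the name sigmoid is an illustrative choice):

    import numpy as np

    def sigmoid(a):
        """Logistic function sigma(a) = 1 / (1 + exp(-a))."""
        return 1.0 / (1.0 + np.exp(-a))

    print(sigmoid(0.0))    # 0.5
    print(sigmoid(10.0))   # close to 1 for large positive arguments
    print(sigmoid(-10.0))  # close to 0 for large negative arguments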

The Logistic Regression Posterior. We model the posterior distribution for classification in a two-class setting by applying the logistic function to the linear classifier:
P(y = 1 | x) = σ(g(x))
P(y = 1 | x, w, w_0) = σ(w^T x + w_0) = 1 / (1 + e^(−(w^T x + w_0)))
This is a location-dependent model of the posterior distribution, parametrized by a linear hyperplane classifier.

Linear Classifier. The logistic regression posterior leads to a linear classifier:
P(y = 1 | x, w, w_0) = 1 / (1 + exp(−(w^T x + w_0)))
P(y = 0 | x, w, w_0) = 1 − P(y = 1 | x, w, w_0)
Classify y = 1 if P(y = 1 | x, w, w_0) > 1/2, and y = 0 otherwise. The classification boundary is at P(y = 1 | x, w, w_0) = 1/2:
1 / (1 + exp(−(w^T x + w_0))) = 1/2  ⟺  w^T x + w_0 = 0
The classification boundary is a hyperplane.

Interpretation: Logit. Is the choice of the logistic function justified? Yes: the logit, the log of the odds ratio, is a linear function of our data,
ln( p / (1 − p) ) = ln( P(y = 1 | x) / P(y = 0 | x) ) = w^T x + w_0
But other choices are valid too; they lead to models other than logistic regression, e.g. probit regression (Generalized Linear Models, GLM, with E[y] = f^(−1)(w^T x + w_0)). The linear function (~distance from the decision plane) directly expresses our classification certainty, measured by the odds ratio: doubling the distance squares the odds, e.g. 3:2 becomes 9:4.
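
A quick numerical check of the odds interpretation in Python/NumPy, with a made-up score a standing in for w^T x + w_0: doubling the score squares the odds, turning 3:2 into 9:4:

    import numpy as np

    def odds(a):
        """Odds P(y=1|x) / P(y=0|x) for a linear score a = w^T x + w0; equals exp(a)."""
        p = 1.0 / (1.0 + np.exp(-a))
        return p / (1.0 - p)

    a = np.log(3.0 / 2.0)   # score whose odds are 3:2
    print(odds(a))          # 1.5  (3:2)
    print(odds(2.0 * a))    # 2.25 (9:4): doubling the distance squares the odds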

Training a Posterior Distribution Model. The posterior model for classification requires training; logistic regression is not just a post-processing of a linear classifier. Learning good parameter values needs to be done with respect to the probabilistic meaning of the posterior distribution. In the probabilistic setting, learning is usually estimation. We now have a slightly different situation than with Bayes: we do not need class densities but a good posterior distribution. We will use Maximum Likelihood and Maximum-A-Posteriori estimates of our parameters w, w_0. Later we will see that this also corresponds to a cost function for obtaining w, w_0.

Maximum Likelihood Learning. The Maximum Likelihood principle can be adapted to fit the posterior distribution (the discriminative case): we choose the parameters which maximize the posterior distribution over the training set X with labels Y,
(w, w_0) = argmax_{w, w_0} P(Y | X; w, w_0) = (iid) argmax_{w, w_0} ∏_i P(y_i | x_i; w, w_0)
with the per-sample posterior
P(y | x; w, w_0) = P(y = 1 | x; w, w_0)^y · P(y = 0 | x; w, w_0)^(1−y)

Maximum Likelihood Learning (II). Useful shorthand: p_i = P(y = 1 | x_i, w, w_0) = 1 / (1 + exp(−(w^T x_i + w_0))). Then
log P(Y | X; w, w_0) = Σ_i [ y_i ln p_i + (1 − y_i) ln(1 − p_i) ]
= Σ_i [ y_i ln( p_i / (1 − p_i) ) + ln(1 − p_i) ]
= Σ_i [ y_i (w^T x_i + w_0) − ln(1 + exp(w^T x_i + w_0)) ]
using ln( p_i / (1 − p_i) ) = w^T x_i + w_0.
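
A minimal Python/NumPy sketch of this log-likelihood, written in the last (simplified) form; the data set and parameters are invented for illustration:

    import numpy as np

    def log_likelihood(X, y, w, w0):
        """log P(Y|X; w, w0) = sum_i [ y_i * a_i - ln(1 + exp(a_i)) ] with a_i = w^T x_i + w0."""
        a = X @ w + w0
        return np.sum(y * a - np.logaddexp(0.0, a))  # logaddexp(0, a) = ln(1 + exp(a))

    # Toy data, invented for illustration
    X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
    y = np.array([0, 1, 1])
    print(log_likelihood(X, y, w=np.array([1.0, -1.0]), w0=0.0))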

Derivative of a Dot Product. Gradient operator:
∇_w = ( ∂/∂w_1, ∂/∂w_2, ..., ∂/∂w_d )
∇_w (w^T x) = ( ∂/∂w_1 (w^T x), ∂/∂w_2 (w^T x), ..., ∂/∂w_d (w^T x) )
Componentwise:
∂/∂w_i (w^T x) = ∂/∂w_i Σ_j w_j x_j = x_i
Final derivative:
∇_w (w^T x) = ( x_1, x_2, ..., x_d ) = x^T

Maximum Likelihood (III). Setting the derivatives to zero:
∇_w ln P(Y | X) = Σ_i [ y_i − σ(w^T x_i + w_0) ] x_i^T ≝ 0
∂/∂w_0 ln P(Y | X) = Σ_i y_i − Σ_i σ(w^T x_i + w_0) ≝ 0
There is no closed-form solution: these are non-linear equations, so an iterative solution is necessary. Will it converge? Yes: ln P is concave and has a unique maximum.
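
A small Python/NumPy sketch of these two gradient expressions; the toy data are invented, and the two returned values correspond to the derivative with respect to w and to w_0:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def gradient(X, y, w, w0):
        """Gradients of ln P(Y|X): sum_i (y_i - sigma(w^T x_i + w0)) x_i and sum_i (y_i - sigma(...))."""
        r = y - sigmoid(X @ w + w0)   # residuals y_i - p_i
        return X.T @ r, np.sum(r)

    X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
    y = np.array([0, 1, 1])
    grad_w, grad_w0 = gradient(X, y, np.zeros(2), 0.0)
    print(grad_w, grad_w0)   # both vanish at the maximum-likelihood solution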

Hessian: Concave Likelihood.
H = ∇_w ∇_w^T ln P(Y | X)
We use an old trick to keep it simple: augment w ← (w_0, w) and x ← (1, x). Then
∇_w ∇_w^T ln P(Y | X) = −Σ_i x_i x_i^T σ(w^T x_i) (1 − σ(w^T x_i)) = −X^T S X
The Hessian is negative definite: the sample covariance matrix Σ_i x_i x_i^T is positive definite and σ(w^T x_i)(1 − σ(w^T x_i)) is always positive. The optimization problem is therefore convex (when minimizing −ln P) and thus has an optimal solution which can be calculated iteratively.

Iterative Reweighted Least Squares. The concave log P(Y | X) can be maximized iteratively with the Newton-Raphson algorithm (Iterative Reweighted Least Squares):
w^(t+1) ← w^(t) − H^(−1) ∇_w ln P(Y | X; w)
Derivatives and evaluations are always taken with respect to the current w. The method results in an iteration of reweighted least-squares steps:
w^(t+1) = (X^T S X)^(−1) X^T S z,  with  z = X w + S^(−1) (Y − p)
This is weighted least squares with z as target: (X^T S X)^(−1) X^T S z. The adjusted responses z are updated in every iteration, and p = [p_1, p_2, ..., p_N]^T is the vector of responses.
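
A compact IRLS sketch in Python/NumPy. It absorbs the bias by appending a constant 1 to every feature vector (as in the Hessian trick above), and the toy data set is invented and deliberately not linearly separable so that the iteration converges:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def irls(X, y, n_iter=10):
        """Newton-Raphson / IRLS for logistic regression on labels y in {0, 1}."""
        Xa = np.hstack([X, np.ones((X.shape[0], 1))])   # augmented design matrix, last column = bias
        w = np.zeros(Xa.shape[1])
        for _ in range(n_iter):
            p = sigmoid(Xa @ w)                          # current responses p_i
            s = p * (1.0 - p)                            # diagonal of the weight matrix S
            z = Xa @ w + (y - p) / s                     # adjusted responses
            XtS = (Xa * s[:, None]).T                    # X^T S
            w = np.linalg.solve(XtS @ Xa, XtS @ z)       # weighted least-squares step
        return w[:-1], w[-1]                             # (w, w0)

    X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
    y = np.array([0, 0, 1, 0, 1, 1])                     # overlapping classes: MLE stays finite
    print(irls(X, y))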

Example: Logistic Regression. (Figure: posterior contours at p = 0.125, 0.25, 0.75, 0.875; the solid line is the classification boundary (p = 0.5), the dashed lines are the p = 0.25 and p = 0.75 contours.) The result is probabilistic: a posterior for the classification everywhere. The posterior probability decays/increases with the distance to the decision boundary.

Linearly Separable. Maximum Likelihood learning is problematic in the linearly separable case: w diverges in length, which leads to classification with infinite certainty. The classification is still right, but the posterior estimate is not.

Prior Assumptions. Infinitely certain classification is likely an estimation artefact: we do not have enough training samples, and maximum likelihood estimation leads to problematic results. Solution: a MAP estimate with prior assumptions on w,
P(w) = N(w | 0, σ² I)
so smaller ‖w‖ are preferred (shrinkage). The likelihood model is unchanged, P(y | x, w, w_0) = p^y (1 − p)^(1−y):
(w, w_0) = argmax_{w, w_0} P(Y | X; w, w_0) P(w) = argmax_{w, w_0} P(w) ∏_i P(y_i | x_i, w, w_0)

MAP Learning.
ln [ P(w) ∏_i P(y_i | x_i, w, w_0) ] = Σ_i [ y_i (w^T x_i + w_0) − ln(1 + exp(w^T x_i + w_0)) ] − 1/(2σ²) ‖w‖²
We need ∇_w ‖w‖² = 2 w^T. Setting the gradient to zero:
∇_w ln [ P(Y | X) P(w) ] = Σ_i [ y_i − σ(w^T x_i + w_0) ] x_i^T − (1/σ²) w^T ≝ 0
Iterative solution: Newton-Raphson. The prior enforces a regularization.
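
A sketch of the regularized (MAP) gradient in Python/NumPy. As a simple stand-in for the Newton-Raphson iteration used on the slides, it maximizes the log-posterior by plain gradient ascent; the step size, iteration count, prior variance and toy data are arbitrary illustrative choices:

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def map_gradient(X, y, w, w0, sigma2):
        """Gradient of the log-posterior: likelihood term minus the prior term w / sigma^2 (w0 not penalized)."""
        r = y - sigmoid(X @ w + w0)
        return X.T @ r - w / sigma2, np.sum(r)

    def map_fit(X, y, sigma2=1.0, lr=0.1, n_iter=2000):
        """Gradient ascent on the MAP objective (the slides use Newton-Raphson instead)."""
        w, w0 = np.zeros(X.shape[1]), 0.0
        for _ in range(n_iter):
            gw, g0 = map_gradient(X, y, w, w0, sigma2)
            w, w0 = w + lr * gw, w0 + lr * g0
        return w, w0

    # Linearly separable toy data: the prior keeps ||w|| finite
    X = np.array([[0.0], [1.0], [3.0], [4.0]])
    y = np.array([0, 0, 1, 1])
    print(map_fit(X, y))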

Bayesian Logistic Regression. Idea: in the separable case, there are many perfect linear classifiers which all separate the data. Average the classification result and accuracy over all of these classifiers. This is the optimal way to deal with missing knowledge in the Bayes sense. (Figures: Bishop PRML.)

Multiclass Logistic Regression. Logistic regression can easily be extended to a multiclass setting with K classes, y ∈ {1, ..., K}. Idea: use a soft-max posterior,
P(y = k | x) = exp(w_k^T x) / ( 1 + Σ_{j=1}^{K−1} exp(w_j^T x) )   for k = 1, ..., K−1
P(y = K | x) = 1 / ( 1 + Σ_{j=1}^{K−1} exp(w_j^T x) )
This is an extension of the sigmoid function to many classes and reduces to the standard sigmoid for K = 2.
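
A small Python/NumPy sketch of this soft-max posterior, with class K as the reference class carrying an implicit score of 0; the weight matrix W is a made-up example for K = 3 classes:

    import numpy as np

    def multiclass_posterior(x, W):
        """Soft-max posterior with K-1 weight vectors (shape (K-1, d)); returns all K class probabilities."""
        scores = np.concatenate([W @ x, [0.0]])  # implicit score 0 for the reference class K
        scores -= scores.max()                   # for numerical stability
        e = np.exp(scores)
        return e / e.sum()

    W = np.array([[1.0, -1.0],                   # hypothetical w_1
                  [0.5,  0.5]])                  # hypothetical w_2
    print(multiclass_posterior(np.array([1.0, 2.0]), W))  # three probabilities summing to 1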

Logistic Regression and Neural Nets. The standard single neuron with the logistic activation is logistic regression if it is trained with the same cost function (cross-entropy); training with least squares results in a different classifier, see the next slide. Multiclass logistic regression with soft-max corresponds to what is called a soft-max layer in ANNs; it is the standard multiclass output in most ANN architectures. (Diagram: a single neuron with inputs x_1, ..., x_d, weights w, a summation Σ and logistic activation σ, computing P(y = 1 | x, w, w_0) = σ(w^T x + w_0).)

Linear Classifiers. All linear classifiers share a common classification rule, g(x) = w^T x + w_0. They differ in learning: defining and finding a good (w, w_0). Learning is mainly driven by the choice of the misclassification cost. With y ∈ {−1, 1} and f(x) = w^T x + w_0 (figure: Bishop PRML, losses as a function of y f(x)):
Perceptron: L(y, x) = −y f(x) (on misclassified points)
SVM: hinge loss L(y, x) = [1 − y f(x)]_+
Logistic regression: L(y, x) = −ln P(y | x) = ln(1 + exp(−y f(x)))
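
The three losses can be compared numerically as functions of the margin y f(x); this Python/NumPy sketch evaluates them at a few arbitrary margin values:

    import numpy as np

    def perceptron_loss(m):
        """Perceptron criterion max(0, -m) for the margin m = y * f(x), y in {-1, +1}."""
        return np.maximum(0.0, -m)

    def hinge_loss(m):
        """SVM hinge loss [1 - m]_+."""
        return np.maximum(0.0, 1.0 - m)

    def logistic_loss(m):
        """Logistic regression loss ln(1 + exp(-m)) = -ln P(y|x)."""
        return np.logaddexp(0.0, -m)

    margins = np.array([-2.0, 0.0, 0.5, 2.0])
    for name, loss in [("perceptron", perceptron_loss), ("hinge", hinge_loss), ("logistic", logistic_loss)]:
        print(name, loss(margins))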

Non-Linear Extension. Logistic regression is often extended to non-linear cases. One extension adds transformed features: combination terms x_i x_j and monomial terms x_i², e.g. x_1 x_2, x_1², x_2². This is a standard procedure in medicine: inspect the resulting w to find important factors and interactions x_i x_j (and it comes with statistical information). The usage of kernels is also possible: training and classification can be formulated with dot products of data points, and the scalar products can be replaced by kernel expansions with the kernel trick.
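
A minimal sketch of such a feature expansion for two-dimensional inputs; the particular set of added terms (x_1 x_2, x_1², x_2²) is one illustrative choice:

    import numpy as np

    def expand_features(X):
        """Augment 2-D features with an interaction term and monomials: (x1, x2, x1*x2, x1^2, x2^2)."""
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1, x2, x1 * x2, x1 ** 2, x2 ** 2])

    X = np.array([[1.0, 2.0], [3.0, 0.5]])
    print(expand_features(X))   # feed these expanded features to the same linear logistic regression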

Kernel Logistic Regression. The equations of logistic regression can be reformulated with dot products:
w^T x = Σ_{i=1}^{N} α_i x_i^T x → Σ_{i=1}^{N} α_i k(x_i, x)
P(y = 1 | x) = σ( Σ_{i=1}^{N} α_i k(x_i, x) )
There are no support vectors: kernel evaluations are needed with all training points. The IVM (import vector machine) is an extension with only sparse support points. Ji Zhu & Trevor Hastie (2005), "Kernel Logistic Regression and the Import Vector Machine", Journal of Computational and Graphical Statistics, 14:1, 185-205, DOI: 10.1198/106186005X25619.
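
A sketch of the kernelized posterior in Python/NumPy; the RBF kernel, the training points and the coefficients alpha are hypothetical stand-ins, since the slide does not show here how alpha is obtained by (regularized) training:

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        """Gaussian RBF kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def klr_posterior(x, X_train, alpha, gamma=1.0):
        """Kernel logistic regression posterior P(y=1|x) = sigma(sum_i alpha_i k(x_i, x))."""
        a = sum(a_i * rbf_kernel(x_i, x, gamma) for a_i, x_i in zip(alpha, X_train))
        return 1.0 / (1.0 + np.exp(-a))

    # Hypothetical training points and coefficients, for illustration only
    X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
    alpha = np.array([-1.0, 0.5, 1.5])
    print(klr_posterior(np.array([1.0, 0.5]), X_train, alpha))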

Discriminative vs. Generative. Comparison of logistic regression to Naïve Bayes: Ng, Andrew Y., and Michael I. Jordan, "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes", Advances in NIPS 14, 2001. Conclusion: logistic regression has a lower asymptotic error, while Naïve Bayes can reach its (higher) asymptotic error faster. As a general over-simplification (dangerous!): use a generative model with little data (more knowledge) and a discriminative model with a lot of training data (more learning).

Summary: Logistic Regression. A probabilistic, discriminative classifier: a linear model of the posterior distribution for classification,
P(y = 1 | x, w, w_0) = 1 / (1 + exp(−(w^T x + w_0)))
Model: the log of the odds ratio is a linear function of the data; the ~distance to the hyperplane is interpreted as a classification probability. Training by iterative maximum likelihood: iterative reweighted least squares, well-defined, no closed-form solution. Regularization prevents infinite certainty for separable data. Wide application: a standard method in statistics.