Linear Classification


Linear Classification Lili MOU moull12@sei.pku.edu.cn http://sei.pku.edu.cn/ moull12 23 April 2015

Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression


Prologue Linear classification is easy: my good old friend, logistic regression, always serves as a baseline method in various applications. Through a systematic study, however, we can grasp the main idea behind a range of machine learning techniques. This seminar also precedes our future discussion on GP classification. References: The materials basically follow Chapter 4 of Pattern Recognition and Machine Learning; figures are taken (with or without modification) without further acknowledgment. See also A. Webb et al., Statistical Pattern Recognition.

The Linear Classification Problem: The Road Map
Discriminant functions
Probabilistic generative models
Probabilistic discriminative models
Bayesian logistic regression

On Non-Linearity
Generalized linearity: non-linear transformation by basis functions; feature engineering/selection
Kernels: transforming to a Hilbert space, fully characterized by a predefined inner-product operation
  SVM (a discriminant function)
  Gaussian processes (non-parametric Bayes)
  k-NN (a relaxed similarity measure; a non-parametric discriminant)
Composition of non-linear activation functions
  Neural networks (finite feature learning, equivalent to learning a sophisticated kernel that maps input data to another Euclidean space)
My questions: Can neural networks compose infinite-dimensional features (with kernels)? A sparse kernel network? Or is it necessary? Or indeed, neural networks ARE non-parametric!

Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression

The Decision Boundary
Let $x = (x_1, \dots, x_n)^T$ be a data sample in $\mathbb{R}^n$.
Let $y \in \{0, 1\}$ be the label of $x$ (a two-class problem).
A linear classifier always takes the form
$$y = \begin{cases} 1, & \text{if } w^T x + w_0 > 0 \\ 0, & \text{otherwise} \end{cases}$$
The decision boundary $w^T x + w_0 = 0$ is an $(n-1)$-dimensional hyperplane in $\mathbb{R}^n$.
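As a concrete illustration of the decision rule above, here is a minimal NumPy sketch; the weights, bias, and sample points are made-up values for illustration, not taken from the lecture.

```python
import numpy as np

def linear_classify(x, w, w0):
    """Return 1 if w^T x + w0 > 0, else 0."""
    return int(w @ x + w0 > 0)

# Toy example with made-up parameters: the boundary x1 + x2 - 1 = 0 in R^2.
w = np.array([1.0, 1.0])
w0 = -1.0
print(linear_classify(np.array([0.8, 0.5]), w, w0))  # 1: above the boundary
print(linear_classify(np.array([0.1, 0.2]), w, w0))  # 0: below the boundary
```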

Multiclass Problem
Let $K$ be the number of classes.
One-versus-the-rest: $K - 1$ classifiers
One-versus-one + voting: $K(K-1)/2$ classifiers
Scoring: $y_k(x) = w_k^T x + w_{k0}$

Linear Regression as Classification
Let the target value be 1-of-$K$ coded ($K$: number of classes).
The scoring function is $y_k(x) = w_k^T x + w_{k0}$.
The least-squares objective function:
$$J = \sum_{i=1}^{m} \sum_{k=1}^{K} \left( y_k(x^{(i)}) - t_k^{(i)} \right)^2$$
The problem of being "too correct": the target value is too far away from a Gaussian distribution.
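A minimal sketch of least-squares classification with 1-of-K targets; the data is synthetic (made-up Gaussian blobs), and the closed-form fit uses the pseudo-inverse with a bias feature absorbed into the weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-class data in R^2 (illustrative only).
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
T = np.zeros((100, 2))                     # 1-of-K coded targets
T[:50, 0] = 1
T[50:, 1] = 1

Xb = np.hstack([np.ones((100, 1)), X])     # prepend a bias feature
W = np.linalg.pinv(Xb) @ T                 # least-squares solution, shape (3, K)

scores = Xb @ W                            # y_k(x) for each sample
pred = scores.argmax(axis=1)               # assign to the largest score
labels = np.r_[np.zeros(50), np.ones(50)]
print("training accuracy:", (pred == labels).mean())
```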

Perceptron Learning Algorithm
Non-linear hard squashing function:
$$y = \begin{cases} +1, & \text{if } w^T x + w_0 > 0 \\ -1, & \text{otherwise} \end{cases}$$
Perceptron learning algorithm: for each misclassified sample $x^{(i)}$,
$$w^{(\tau+1)} = w^{(\tau)} + \eta\, x^{(i)} t^{(i)}$$
For its convergence theorem, please refer to Neural Networks and Machine Learning.
Pros and cons:
+ Cares about misclassified samples only
- Does not work with linearly inseparable data
- Does not generalize to multi-class problems
+ Introduces a non-linear activation function
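A sketch of the perceptron update loop under the ±1 label convention above; the data is synthetic and chosen to be linearly separable, and the learning rate and epoch cap are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Linearly separable toy data with labels in {-1, +1} (made-up).
X = np.vstack([rng.normal([0, 0], 0.5, (30, 2)), rng.normal([3, 3], 0.5, (30, 2))])
t = np.r_[-np.ones(30), np.ones(30)]
Xb = np.hstack([np.ones((60, 1)), X])   # absorb w0 into w via a bias feature

w = np.zeros(3)
eta = 1.0
for epoch in range(100):
    mistakes = 0
    for xi, ti in zip(Xb, t):
        if ti * (w @ xi) <= 0:          # misclassified (or on the boundary)
            w += eta * ti * xi          # perceptron update
            mistakes += 1
    if mistakes == 0:                   # converged: all samples correct
        break
print("epochs used:", epoch + 1, "weights:", w)
```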

Fisher's Linear Discriminant
Heuristics:
Attempt #1: Maximize the class separation.
Attempt #2 (the Fisher criterion): Maximize the ratio of the between-class variance to the within-class variance, after projecting onto a one-dimensional space perpendicular to the decision boundary.

Formalization of Fisher's Linear Discriminant
Class means:
$$m_1 = \frac{1}{N_1} \sum_{i \in C_1} x^{(i)}, \qquad m_2 = \frac{1}{N_2} \sum_{i \in C_2} x^{(i)}$$
Between-class separation (projected mean difference): $w^T(m_2 - m_1)$
Within-class variance, after the projection $y^{(i)} = w^T x^{(i)}$:
$$s_k^2 = \sum_{i \in C_k} \left( y^{(i)} - w^T m_k \right)^2$$
The cost function:
$$J(w) = \frac{\left( w^T (m_2 - m_1) \right)^2}{s_1^2 + s_2^2}$$

Solution to Fisher's Linear Discriminant
$$J(w) = \frac{\left( w^T (m_2 - m_1) \right)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}$$
where
$$S_B = (m_2 - m_1)(m_2 - m_1)^T$$
$$S_W = \sum_{i \in C_1} (x^{(i)} - m_1)(x^{(i)} - m_1)^T + \sum_{i \in C_2} (x^{(i)} - m_2)(x^{(i)} - m_2)^T$$
Differentiating the cost function and setting it to 0, we obtain
$$(w^T S_B w)\, S_W w = (w^T S_W w)\, S_B w$$
Thus
$$w \propto S_W^{-1} (m_2 - m_1)$$
See Pattern Recognition and Machine Learning for multi-class scenarios.
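A sketch of the Fisher direction $w \propto S_W^{-1}(m_2 - m_1)$ on synthetic data; the class means, covariance, and the simple midpoint threshold used for classification below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 100)   # class 1
X2 = rng.multivariate_normal([2, 2], [[1, 0.8], [0.8, 1]], 100)   # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)   # within-class scatter
w = np.linalg.solve(SW, m2 - m1)                          # Fisher direction

# Project onto w and classify with a midpoint threshold (one common choice).
thresh = w @ (m1 + m2) / 2
acc1 = (X1 @ w <= thresh).mean()
acc2 = (X2 @ w > thresh).mean()
print("class-wise training accuracy:", acc1, acc2)
```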

Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression

A Probabilistic Model
A data point $(x, t)$ is generated according to the following story:
Step 1: Choose a box $C_k$, i.e., $t = k$, with probability $p(C_k)$.
Step 2: Draw a ticket $x$ from box $k$ with probability $p(x \mid C_k)$.
The Bayesian network: $t \rightarrow x$.
Posterior probability:
$$p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_1)\, p(C_1) + p(x \mid C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)$$
where
$$a = \ln \frac{p(x \mid C_1)\, p(C_1)}{p(x \mid C_2)\, p(C_2)} = \ln \frac{p(C_1 \mid x)}{p(C_2 \mid x)}$$

Multi-class Scenarios
Softmax:
$$p(C_k \mid x) = \frac{p(x \mid C_k)\, p(C_k)}{\sum_j p(x \mid C_j)\, p(C_j)}$$
Predicting the class label:
$$\hat{t} = \arg\max_k\, p(C_k \mid x)$$
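A small, numerically stable softmax sketch; the log-scores $a_k = \ln p(x \mid C_k) p(C_k)$ below are made-up numbers standing in for a real generative model.

```python
import numpy as np

def softmax(a):
    """Posterior p(C_k|x) from log-scores a_k = ln p(x|C_k) p(C_k)."""
    a = a - a.max()              # shift for numerical stability
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, 1.0, -1.0])   # hypothetical log-scores for 3 classes
post = softmax(a)
print(post, "predicted class:", post.argmax())
```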

Generative Models vs. Discriminative Models
Training objectives:
$$\max_{\Theta}\; p(x, t; \Theta), \quad \text{i.e.,} \quad \max_{\Theta}\; p(x \mid t; \Theta)\, p(t; \Theta)$$
versus
$$\max_{\Theta}\; p(t \mid x; \Theta)$$

Gaussian Inputs
Assumption:
$$p(x \mid C_k) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}$$
Consider a binary classification problem:
$$p(C_1 \mid x) = \sigma(w^T x + w_0)$$
where
$$w = \Sigma^{-1} (\mu_1 - \mu_2), \qquad w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}$$
Parameter estimation: maximum likelihood estimation.
See also Ch. 2.3, 2.4 in Statistical Pattern Recognition.
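A sketch of the shared-covariance Gaussian generative classifier on synthetic data, with the parameters estimated by maximum likelihood and $w$, $w_0$ computed from the formulas above; all data and means are made-up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 200)   # class 1
X2 = rng.multivariate_normal([2, 1], [[1, 0.3], [0.3, 1]], 200)   # class 2

# Maximum likelihood estimates.
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
p1 = len(X1) / (len(X1) + len(X2))
S = (np.cov(X1.T, bias=True) * len(X1) + np.cov(X2.T, bias=True) * len(X2)) \
    / (len(X1) + len(X2))                                          # pooled covariance

Sinv = np.linalg.inv(S)
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / (1 - p1))

sigmoid = lambda a: 1 / (1 + np.exp(-a))
print("p(C1|x) near mu1:", sigmoid(w @ np.array([0.1, 0.0]) + w0))
print("p(C1|x) near mu2:", sigmoid(w @ np.array([2.0, 1.0]) + w0))
```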

Gaussian Classifiers
Shared covariance matrix $\Sigma$: linear decision boundary
Different $\Sigma_k$: quadratic decision boundary
Warning: Don't use Gaussian classifiers

Discrete Features
Consider binary features $x = (x_1, \dots, x_n)^T$, where $x_i \in \{0, 1\}$.
In general, $p(x \mid C_k)$ has $2^n - 1$ independent free parameters.
Naïve Bayes assumption: the $x_i$ are conditionally independent given the class,
$$p(x \mid C_k) = \prod_{i=1}^{n} p(x_i \mid C_k) = \prod_{i=1}^{n} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}$$
The posterior class distributions are obtained from
$$a_k(x) = \sum_{i=1}^{n} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k)$$
via the softmax (the sigmoid in the two-class case). Linear in the features $x$!
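A sketch of the naïve-Bayes activation $a_k(x)$ for binary features; the Bernoulli parameters $\mu_{ki}$, the priors, and the small clipping constant guarding the logs are all illustrative assumptions.

```python
import numpy as np

def nb_log_scores(x, mu, prior, eps=1e-9):
    """a_k(x) = sum_i [x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki)] + ln p(C_k)."""
    mu = np.clip(mu, eps, 1 - eps)                 # avoid log(0)
    return x @ np.log(mu.T) + (1 - x) @ np.log(1 - mu.T) + np.log(prior)

mu = np.array([[0.8, 0.1, 0.6],     # class 1: P(x_i = 1 | C_1), made-up values
               [0.2, 0.7, 0.5]])    # class 2
prior = np.array([0.5, 0.5])
x = np.array([1, 0, 1])

a = nb_log_scores(x, mu, prior)
post = np.exp(a - a.max())          # softmax of the linear scores
post /= post.sum()
print("a_k(x):", a, "p(C_k|x):", post)
```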

Exponential Family
Let the class-conditional distributions take the form
$$p(x \mid \lambda_k) = h(x)\, g(\lambda_k) \exp\left\{ \lambda_k^T u(x) \right\}$$
Binary classification (with $u(x) = x$):
$$a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)$$
Multi-class problem:
$$a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)$$
Examples: normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises, and von Mises-Fisher. Provided some parameters are fixed, the Pareto, binomial, multinomial, and negative binomial distributions are also in the exponential family.

Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression

Logistic Regression
Under rather general assumptions, the posterior class distribution takes the form
$$y(x) = p(C_1 \mid x) = \sigma(w^T x + w_0)$$
The idea is to optimize the above posterior distribution directly. Fewer parameters, better performance.
Consider a binary classification problem with an $n$-dimensional feature space:
Logistic regression: $n$ parameters
A Gaussian classifier: means: $2n$; covariance matrix (shared): $n(n+1)/2$; class prior: $1$
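A sketch of maximum-likelihood logistic regression trained by plain gradient descent on synthetic data; the learning rate and iteration count are arbitrary choices, and PRML's IRLS (iterative reweighted least squares) would converge in far fewer steps.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1, (100, 2)), rng.normal([2, 2], 1, (100, 2))])
t = np.r_[np.zeros(100), np.ones(100)]
Phi = np.hstack([np.ones((200, 1)), X])          # bias + identity features

sigmoid = lambda a: 1 / (1 + np.exp(-a))
w = np.zeros(3)
lr = 0.1
for _ in range(2000):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (y - t) / len(t)              # gradient of the mean negative log-likelihood
    w -= lr * grad

pred = (sigmoid(Phi @ w) > 0.5).astype(float)
print("weights:", w, "training accuracy:", (pred == t).mean())
```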

Probit Regression
What if we substitute $\sigma$ with other squashing functions?
Probit function:
$$\Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta$$
Similar performance compared with logistic regression.
More sensitive to outliers (why? consider the decay rate of the tails).
Useful later.
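A quick numeric comparison of the two squashing functions, assuming SciPy's standard normal CDF as $\Phi$; rescaling by $\lambda = \sqrt{\pi/8}$ matches the slopes at the origin, as used later in the predictive-density approximation.

```python
import numpy as np
from scipy.stats import norm

a = np.linspace(-4, 4, 9)
sigmoid = 1 / (1 + np.exp(-a))
lam = np.sqrt(np.pi / 8)
probit = norm.cdf(lam * a)          # rescaled probit Phi(lambda * a)

for ai, s, p in zip(a, sigmoid, probit):
    print(f"a={ai:+.1f}  sigma(a)={s:.4f}  Phi(lambda a)={p:.4f}")
```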

Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression

Bayesian Learning
Let $t \in \{0, 1\}^m$ denote the labels of the training data $\phi^{(1)}, \dots, \phi^{(m)}$.
Prior: $p(w)$, which is a $64,000,000 question.
Likelihood:
$$p(t \mid w) = \prod_{i=1}^{m} p(t^{(i)} \mid w) = \prod_{i=1}^{m} \sigma\left(w^T \phi^{(i)}\right)^{t^{(i)}} \left(1 - \sigma\left(w^T \phi^{(i)}\right)\right)^{1 - t^{(i)}}$$
Posterior:
$$p(w \mid t) = \frac{p(w)\, p(t \mid w)}{p(t)} \propto p(w) \prod_{i=1}^{m} \sigma(\cdot)^{t^{(i)}} (1 - \sigma(\cdot))^{1 - t^{(i)}}$$
Predictive density (for a new input $\phi^*$):
$$p(t^* = 1 \mid t) = \int p(t^* = 1 \mid w)\, p(w \mid t)\, dw \propto \int \sigma\left(w^T \phi^*\right) p(w) \prod_{i=1}^{m} \sigma(\cdot)^{t^{(i)}} (1 - \sigma(\cdot))^{1 - t^{(i)}}\, dw$$

Intractability
The posterior is intractable due to the normalizing factor.
The predictive density is intractable due to the integral.
We have to resort to approximations:
Sampling methods: stochastic, usually asymptotically correct, hard to scale
Deterministic methods: do things wrongly and hope they work

Laplace Approximation
Fit a Gaussian at a mode.
The standard deviation is chosen such that the second-order derivative of the log probability matches (the first-order derivative is always 0 at a mode).
Scale-free in representing the unnormalized measure.
Real variables only.
Only local properties are captured; what about multi-modal distributions?

Fitting a Gaussian
Let $p(z) = \frac{1}{Z} f(z)$ be a true distribution, where $Z = \int f(z)\, dz$.
Step 1: Find a mode $z_0$ of $p(z)$ (by gradient methods, say), satisfying
$$\left.\frac{df(z)}{dz}\right|_{z = z_0} = 0$$
Step 2: Consider a Taylor expansion of $\ln f(z)$ at $z_0$:
$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} A (z - z_0)^2, \quad \text{where } A = -\left.\frac{d^2}{dz^2} \ln f(z)\right|_{z = z_0}$$
Taking the exponential,
$$f(z) \simeq f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
Step 3: Normalize to a Gaussian distribution:
$$q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$
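A sketch of the three steps on a toy unnormalized density; the gamma-like $f(z) = z^3 e^{-z}$ is chosen only for illustration, the mode is found numerically, and $A$ is approximated by a finite difference of $\ln f$ at the mode.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy unnormalized density f(z) = z^3 * exp(-z) on z > 0 (illustrative choice).
ln_f = lambda z: 3 * np.log(z) - z

# Step 1: find the mode z0 (maximize ln f).
z0 = minimize_scalar(lambda z: -ln_f(z), bounds=(1e-6, 50), method="bounded").x

# Step 2: A = -d^2/dz^2 ln f(z) at z0, via a central finite difference.
h = 1e-4
A = -(ln_f(z0 + h) - 2 * ln_f(z0) + ln_f(z0 - h)) / h**2

# Step 3: the Laplace approximation is q(z) = N(z | z0, 1/A).
print(f"mode z0 = {z0:.4f} (exact: 3), variance 1/A = {1/A:.4f} (exact: 3)")
```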

Laplace Approximation for Multivariate Distributions
To approximate $p(z) = \frac{1}{Z} f(z)$, where $z \in \mathbb{R}^m$, we expand at a mode $z_0$:
$$\ln f(z) \simeq \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0)$$
where $A = -\left.\nabla\nabla \ln f(z)\right|_{z = z_0}$, the negative Hessian of $\ln f(z)$, serves as the precision matrix of a Gaussian distribution.
Taking the exponential, we obtain
$$f(z) \simeq f(z_0) \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\}$$
Normalizing it as a distribution, we have
$$q(z) = \frac{|A|^{1/2}}{(2\pi)^{m/2}} \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\} = \mathcal{N}(z \mid z_0, A^{-1})$$

Bayesian Logistic Regression: A Revisit in Earnest
Prior: Gaussian, which is natural¹
$$p(w) = \mathcal{N}(w \mid m_0, S_0)$$
Posterior: $p(w \mid t) \propto p(w)\, p(t \mid w)$, so
$$\ln p(w \mid t) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) \;[\text{prior}]\; + \sum_{i=1}^{m} \left\{ t^{(i)} \ln y^{(i)} + \left(1 - t^{(i)}\right) \ln\left(1 - y^{(i)}\right) \right\} \;[\text{likelihood}]\; + \text{const}$$
where $y^{(i)} = \sigma(w^T \phi^{(i)})$.
$$S_N^{-1} = -\nabla\nabla \ln p(w \mid t) = S_0^{-1} + \sum_{i=1}^{m} y^{(i)} \left(1 - y^{(i)}\right) \phi^{(i)} \phi^{(i)T}$$
Hence, the Laplace approximation to the posterior is
$$q(w) = \mathcal{N}(w \mid w_{\text{MAP}}, S_N)$$
¹ Mathematicians always choose priors for the sake of convenience rather than approaching God.
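A sketch of the Laplace approximation to the logistic-regression posterior, assuming a zero-mean isotropic prior $\mathcal{N}(0, \alpha^{-1} I)$ with an arbitrary $\alpha$; $w_{\text{MAP}}$ is found here by simple (slow) gradient ascent on the log-posterior, and $S_N$ follows the formula above.

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
t = np.r_[np.zeros(50), np.ones(50)]
Phi = np.hstack([np.ones((100, 1)), X])

sigmoid = lambda a: 1 / (1 + np.exp(-a))
alpha = 1.0                                   # assumed prior precision, S0 = alpha^{-1} I
S0_inv = alpha * np.eye(3)

# Find w_MAP by gradient ascent on ln p(w|t) (adequate for this small example).
w = np.zeros(3)
for _ in range(5000):
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (t - y) - S0_inv @ w       # gradient of the log-posterior
    w += 0.05 * grad / len(t)
w_map = w

# Posterior precision S_N^{-1} = S0^{-1} + sum_i y_i (1 - y_i) phi_i phi_i^T
y = sigmoid(Phi @ w_map)
SN_inv = S0_inv + (Phi * (y * (1 - y))[:, None]).T @ Phi
SN = np.linalg.inv(SN_inv)
print("w_MAP:", w_map)
print("posterior covariance S_N:\n", SN)
```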

Predictive Density
$$p(C_1 \mid \phi^*, t) = \int p(C_1 \mid \phi^*, w)\, p(w \mid t)\, dw \simeq \int \sigma(w^T \phi^*)\, q(w)\, dw$$
$$p(C_2 \mid \phi^*, t) = 1 - p(C_1 \mid \phi^*, t)$$
Plan:
Change it to a univariate integral.
Substitute the sigmoid with a probit function, which can be convolved with a normal analytically:
$$\int \sigma(\cdot)\, \mathcal{N}(\cdot)\, da \approx \int \Phi(\cdot)\, \mathcal{N}(\cdot)\, da = \Phi(\cdot) \approx \sigma(\cdot)$$

The Dirac Delta Function
Let $\delta$ be the Dirac delta function, loosely thought of as a function such that
$$\delta(x) = \begin{cases} +\infty, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases}$$
(a Gaussian distribution peaked at 0 with standard deviation tending to 0).
The $\delta$ function satisfies
$$\int \delta(a - x)\, f(a)\, da = f(x)$$
and, in particular,
$$\int \delta(a - x)\, da = 1$$

Deriving the Predictive Density
$$p(C_1 \mid \phi^*) \simeq \int \sigma(w^T \phi^*)\, q(w)\, dw \quad [\text{Laplace approx.}]$$
$$\sigma(w^T \phi^*) = \int \delta(a - w^T \phi^*)\, \sigma(a)\, da \quad [\text{def. of } \delta]$$
Hence
$$p(C_1 \mid \phi^*) \simeq \int\!\!\int \delta(a - w^T \phi^*)\, \sigma(a)\, da\; q(w)\, dw = \int \sigma(a) \left[ \int \delta(a - w^T \phi^*)\, q(w)\, dw \right] da = \int \sigma(a)\, p(a)\, da$$
where $p(a) = \int \delta(a - w^T \phi^*)\, q(w)\, dw$.
We now argue that $p(a)$ is Gaussian.

Deriving the Predictive Density (2)
$$p(a) = \int \delta(a - w^T \phi^*)\, q(w)\, dw$$
That is, $p(a)$ is obtained by integrating $q(w)$ over the hyperplane $\{w : w^T \phi^* = a\}$; in a rotated coordinate system, the directions orthogonal to $\phi^*$ are integrated out, so $p(a)$ is a marginal of a Gaussian and hence Gaussian.
We can also verify that
$$\int p(a)\, da = \int\!\!\int \delta(a - w^T \phi^*)\, q(w)\, dw\, da = \int q(w) \left[ \int \delta(a - w^T \phi^*)\, da \right] dw = \int q(w)\, dw = 1$$

Deriving the Predictive Density (3)
$$\mu_a = \mathbb{E}[a] = \int p(a)\, a\, da = \int q(w)\, w^T \phi^*\, dw = w_{\text{MAP}}^T \phi^*$$
$$\sigma_a^2 = \operatorname{var}[a] = \int p(a) \left\{ a^2 - \mathbb{E}[a]^2 \right\} da = \int q(w) \left\{ (w^T \phi^*)^2 - (m_N^T \phi^*)^2 \right\} dw = \phi^{*T} S_N \phi^*$$
Thus
$$p(C_1 \mid \phi^*, t) \simeq \int \sigma(a)\, p(a)\, da = \int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da \approx \int \Phi(\lambda a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da = \Phi\left( \frac{\mu_a}{(\lambda^{-2} + \sigma_a^2)^{1/2}} \right) \approx \sigma\left( \kappa(\sigma_a^2)\, \mu_a \right)$$
Here $\lambda^2 = \pi/8$ and $\kappa(\sigma_a^2) = (1 + \pi \sigma_a^2 / 8)^{-1/2}$, chosen such that the rescaled probit function has the same slope as the sigmoid at the origin.
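A sketch of the final approximation $\sigma(\kappa(\sigma_a^2)\,\mu_a)$, checked against a plain Monte Carlo estimate of $\int \sigma(a)\,\mathcal{N}(a \mid \mu_a, \sigma_a^2)\,da$; the values of $\mu_a$ and $\sigma_a^2$ are made-up stand-ins for $w_{\text{MAP}}^T\phi^*$ and $\phi^{*T} S_N \phi^*$.

```python
import numpy as np

rng = np.random.default_rng(6)
sigmoid = lambda a: 1 / (1 + np.exp(-a))

mu_a, var_a = 1.5, 2.0                       # hypothetical predictive mean / variance

# Probit-based approximation from the slide.
kappa = (1 + np.pi * var_a / 8) ** -0.5
approx = sigmoid(kappa * mu_a)

# Monte Carlo estimate of the exact integral.
samples = rng.normal(mu_a, np.sqrt(var_a), 200_000)
mc = sigmoid(samples).mean()

print(f"approximation: {approx:.4f}   Monte Carlo: {mc:.4f}")
```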

Why $\int \Phi(\cdot)\, \phi(\cdot)\, dw = \Phi(\cdot)$
Let $X \sim \mathcal{N}(a, b^2)$ and $Y \sim \mathcal{N}(c, d^2)$ be independent. Then
$$\Pr\{X \le Y\} = \int \Pr\{X \le Y \mid Y = w\}\, \mathcal{N}(w \mid c, d^2)\, dw = \int \Pr\{X \le w\}\, \mathcal{N}(w \mid c, d^2)\, dw = \int \Phi\left( \frac{w - a}{b} \right) \mathcal{N}(w \mid c, d^2)\, dw$$
By noticing that
$$X - Y \sim \mathcal{N}(a - c,\; b^2 + d^2)$$
we have
$$\Pr\{X \le Y\} = \Pr\{X - Y \le 0\} = \Phi\left( \frac{c - a}{\sqrt{b^2 + d^2}} \right)$$

Take-Home Messages
Discriminant functions
Probabilistic generative models (don't use them)
Probabilistic discriminative models
Bayesian logistic regression
  Sampling methods; deterministic approximations
  Laplace approximation (fitting a Gaussian at a mode)
  Substitute the sigmoid function with a probit function: analytical solutions
  The decision boundary is the same with equal prior probabilities

Thanks for Listening!