Linear Classification
Lili MOU
moull12@sei.pku.edu.cn
http://sei.pku.edu.cn/moull12
23 April 2015
Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression
Prologue
Linear classification is easy: my good old friend, logistic regression, always serves as a baseline method in various applications. Through a systematic study, however, we can grasp the main ideas behind a range of machine learning techniques. This seminar also precedes our future discussion on GP classification.
References: The materials basically follow Chapter 4 of Pattern Recognition and Machine Learning. Figures are taken (w/ or w/o modification) without further acknowledgment. See also A. Webb et al., Statistical Pattern Recognition.
The Linear Classification Problem The Road Map Discriminant functions Probabilistic generative models Probabilistic discriminative models Bayesian logistic regression
On Non-Linearity
Generalized linearity: non-linear transformation by basis functions
  Feature engineering/selection
Kernels: transforming to a Hilbert space, fully characterized by a predefined inner-product operation
  SVM (a discriminant function)
  Gaussian processes (non-parametric Bayes)
  k-NN (slacked similarity measure, non-parametric discriminant)
Composition of non-linear activation functions
  Neural networks (finite feature learning, equivalent to learning a sophisticated kernel mapping input data to another Euclidean space)
My questions
  Can neural networks compose infinite-dimensional features (with kernels)? A sparse kernel network? Or is it necessary? Or indeed, neural networks ARE non-parametric!
Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression
The Decision Boundary
Let x = (x_1, \dots, x_n)^T be a data sample in \mathbb{R}^n
Let y \in \{0, 1\} be the label of x (a two-class problem)
A linear classifier always takes the form
    y = \begin{cases} 1, & \text{if } w^T x + w_0 > 0 \\ 0, & \text{otherwise} \end{cases}
The decision boundary w^T x + w_0 = 0 is a hyperplane of dimension n - 1
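A minimal sketch of this decision rule in Python; the weight vector, bias, and sample below are made-up illustrative values, not taken from the slides.

    import numpy as np

    # Illustrative parameters of a linear classifier in R^2.
    w = np.array([2.0, -1.0])    # weight vector w
    w0 = 0.5                     # bias w_0
    x = np.array([1.0, 0.3])     # a data sample x

    # Predict class 1 iff w^T x + w_0 > 0, class 0 otherwise.
    y = 1 if w @ x + w0 > 0 else 0
    print(y)                     # -> 1 for these values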
Multiclass Problem
Let K be the number of classes
  one-versus-the-rest (K - 1 classifiers)
  one-versus-one + voting (K(K - 1)/2 classifiers)
  scoring: y_k(x) = w_k^T x + w_{k0}
Linear Regression as Classification
Let the target value be 1-of-K coded. (K: number of classes)
The scoring function is
    y_k(x) = w_k^T x + w_{k0}
The least-squares objective function
    J = \sum_{i=1}^{m} \sum_{k=1}^{K} \left( y_k(x^{(i)}) - t_k^{(i)} \right)^2
The problem of being "too correct": the target values are too far away from a Gaussian distribution.
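A sketch of the least-squares classifier under 1-of-K coding; the pseudo-inverse solves the normal equations for the objective J above. The matrix names and function names are illustrative, not from the slides.

    import numpy as np

    def fit_least_squares(X, T):
        """X: (m, n) inputs; T: (m, K) one-of-K coded targets."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append 1 for the bias w_k0
        W = np.linalg.pinv(Xb) @ T                     # minimizes J = sum_i sum_k (y_k(x_i) - t_ki)^2
        return W                                       # shape (n + 1, K)

    def predict(W, X):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return (Xb @ W).argmax(axis=1)                 # class with the largest score y_k(x)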
Perceptron Learning Algorithm
Non-linear hard squashing function
    y = \begin{cases} +1, & \text{if } w^T x + w_0 > 0 \\ -1, & \text{otherwise} \end{cases}
Perceptron learning algorithm: for each misclassified sample x^{(i)},
    w^{(\tau+1)} = w^{(\tau)} + \eta\, x^{(i)} t^{(i)}
For its convergence theorem, please refer to Neural Networks and Machine Learning.
Pros and cons
  + Cares about misclassified samples only
  - Does not work with linearly inseparable data
  - Does not generalize to multi-class problems
  + Introduces a non-linear activation function
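A minimal sketch of the perceptron update above, assuming targets coded as t in {-1, +1} and a fixed learning rate eta; all names and the epoch cap are illustrative.

    import numpy as np

    def perceptron(X, t, eta=1.0, max_epochs=100):
        """X: (m, n) inputs; t: (m,) targets in {-1, +1}."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb w_0 into w
        w = np.zeros(Xb.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for x_i, t_i in zip(Xb, t):
                if t_i * (w @ x_i) <= 0:               # x_i is misclassified
                    w += eta * x_i * t_i               # w <- w + eta * x_i * t_i
                    mistakes += 1
            if mistakes == 0:                          # converged (only on separable data)
                break
        return w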
Fisher's Linear Discriminant
Heuristics:
  Attempt #1: Maximize the class separation
  Attempt #2 (the Fisher criterion): Maximize the ratio of the between-class variance to the within-class variance (after projecting to a one-dimensional space, perpendicular to the decision boundary)
Formalization of Fisher's Linear Discriminant
Class means
    m_1 = \frac{1}{N_1} \sum_{i \in C_1} x^{(i)}, \qquad m_2 = \frac{1}{N_2} \sum_{i \in C_2} x^{(i)}
Within-class variance (after projection)
    s_k^2 = \sum_{i \in C_k} \left( y^{(i)} - \tilde{m}_k \right)^2, \quad \text{where } y^{(i)} = w^T x^{(i)} \text{ and } \tilde{m}_k = w^T m_k
The cost function (the Fisher criterion)
    J(w) = \frac{(\tilde{m}_2 - \tilde{m}_1)^2}{s_1^2 + s_2^2}
Solution to Fisher's Linear Discriminant
    J(w) = \frac{\left( w^T (m_2 - m_1) \right)^2}{s_1^2 + s_2^2} = \frac{w^T S_B w}{w^T S_W w}
where
    S_B = (m_2 - m_1)(m_2 - m_1)^T
    S_W = \sum_{i \in C_1} (x^{(i)} - m_1)(x^{(i)} - m_1)^T + \sum_{i \in C_2} (x^{(i)} - m_2)(x^{(i)} - m_2)^T
Differentiating the cost function and setting it to 0, we obtain
    (w^T S_B w)\, S_W w = (w^T S_W w)\, S_B w
Thus
    w \propto S_W^{-1} (m_2 - m_1)
See Pattern Recognition and Machine Learning for multi-class scenarios.
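A sketch of the two-class solution, assuming X1 and X2 hold the samples of C_1 and C_2 as rows; the names are illustrative.

    import numpy as np

    def fisher_direction(X1, X2):
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        # Within-class scatter S_W = sum_k sum_{i in C_k} (x_i - m_k)(x_i - m_k)^T
        S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
        w = np.linalg.solve(S_W, m2 - m1)   # w proportional to S_W^{-1} (m_2 - m_1)
        return w / np.linalg.norm(w)        # only the direction matters

A classifier is then obtained by projecting onto w and thresholding the projected value.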
Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression
A Probabilistic Model
A data point (x, t) is generated according to the following story:
  Step 1: Choose a box C_k, i.e., t = k, with probability p(C_k)
  Step 2: Draw a ticket x from box k with probability p(x | C_k)
The Bayesian network: t \to x
Posterior probability
    p(C_1 | x) = \frac{p(x | C_1)\, p(C_1)}{p(x | C_1)\, p(C_1) + p(x | C_2)\, p(C_2)} = \frac{1}{1 + \exp(-a)} = \sigma(a)
where
    a = \ln \frac{p(x | C_1)\, p(C_1)}{p(x | C_2)\, p(C_2)} = \ln \frac{p(C_1 | x)}{p(C_2 | x)}
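A sketch of this posterior computation, assuming the class-conditional densities are available as functions and the priors as numbers; all names are illustrative.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def posterior_c1(x, p_x_c1, p_x_c2, prior_c1, prior_c2):
        # a = ln[ p(x|C1) p(C1) ] - ln[ p(x|C2) p(C2) ], the log odds
        a = np.log(p_x_c1(x) * prior_c1) - np.log(p_x_c2(x) * prior_c2)
        return sigmoid(a)   # p(C1 | x) = sigma(a)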
Multi-class Scenarios
Softmax
    p(C_k | x) = \frac{p(x | C_k)\, p(C_k)}{\sum_j p(x | C_j)\, p(C_j)}
Predicting the class label
    \hat{t} = \arg\max_k\, p(C_k | x)
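A sketch of the softmax posterior, assuming a_k = ln p(x | C_k) + ln p(C_k) has already been computed per class; the names are illustrative.

    import numpy as np

    def softmax_posterior(a):
        """a: (K,) array with a_k = ln p(x|C_k) + ln p(C_k)."""
        a = a - a.max()          # subtract the max for numerical stability
        p = np.exp(a)
        return p / p.sum()       # p(C_k | x)

    def predict(a):
        return int(np.argmax(softmax_posterior(a)))   # t_hat = argmax_k p(C_k | x)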
Generative Models vs. Discriminative Models
Training objectives
    \max_\Theta\; p(x, t; \Theta), \quad \text{i.e.,} \quad \max_\Theta\; p(x | t; \Theta)\, p(t; \Theta)
versus
    \max_\Theta\; p(t | x; \Theta)
Gaussian Inputs
Assumption
    p(x | C_k) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k) \right\}
Consider a binary classification problem:
    p(C_1 | x) = \sigma(w^T x + w_0)
where
    w = \Sigma^{-1} (\mu_1 - \mu_2)
    w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}
Parameter estimation: maximum likelihood estimation
See also Ch. 2.3, 2.4 in Statistical Pattern Recognition.
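A sketch that turns the shared-covariance Gaussian parameters into w and w_0 as above, assuming mu1, mu2, Sigma, and the priors are already estimated (e.g. by maximum likelihood); the names are illustrative.

    import numpy as np

    def gaussian_classifier_params(mu1, mu2, Sigma, prior1, prior2):
        Sigma_inv = np.linalg.inv(Sigma)
        w = Sigma_inv @ (mu1 - mu2)                 # w = Sigma^{-1} (mu_1 - mu_2)
        w0 = (-0.5 * mu1 @ Sigma_inv @ mu1
              + 0.5 * mu2 @ Sigma_inv @ mu2
              + np.log(prior1 / prior2))            # w_0 as on this slide
        return w, w0                                # then p(C1 | x) = sigma(w^T x + w_0)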
Gaussian Classifiers
Shared covariance matrix \Sigma: linear decision boundary
Different \Sigma_k: quadratic decision boundary
Warning: Don't use Gaussian classifiers
Discrete Features
Consider binary features x = (x_1, \dots, x_n)^T, where x_i \in \{0, 1\}.
p(x | C_k) has 2^n - 1 independent free parameters.
Naïve Bayes assumption: the features x_i are conditionally independent given the class
    p(x | C_k) = \prod_{i=1}^{n} p(x_i | C_k) = \prod_{i=1}^{n} \mu_{ki}^{x_i} (1 - \mu_{ki})^{1 - x_i}
Posterior class distributions
    p(C_k | x) = \sigma(a_k(x)), \quad \text{where } a_k(x) = \sum_{i=1}^{n} \left\{ x_i \ln \mu_{ki} + (1 - x_i) \ln(1 - \mu_{ki}) \right\} + \ln p(C_k)
Linear in the features x!
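A sketch of the linear score a_k(x) for binary features, assuming mu[k, i] = p(x_i = 1 | C_k) and the class priors are given (e.g. estimated by counting); the names are illustrative.

    import numpy as np

    def naive_bayes_scores(x, mu, priors):
        """x: (n,) binary features; mu: (K, n) Bernoulli parameters; priors: (K,)."""
        # a_k(x) = sum_i [ x_i ln mu_ki + (1 - x_i) ln(1 - mu_ki) ] + ln p(C_k)
        return x @ np.log(mu).T + (1 - x) @ np.log(1 - mu).T + np.log(priors)

Note that the score is indeed linear in x, as the slide claims.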
Exponential Family
Let the class-conditional distributions take the form
    p(x | \lambda_k) = h(x)\, g(\lambda_k) \exp\left\{ \lambda_k^T u(x) \right\}
Binary classification (with u(x) = x)
    a(x) = (\lambda_1 - \lambda_2)^T x + \ln g(\lambda_1) - \ln g(\lambda_2) + \ln p(C_1) - \ln p(C_2)
Multi-class problem
    a_k(x) = \lambda_k^T x + \ln g(\lambda_k) + \ln p(C_k)
Examples: Normal, exponential, log-normal, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, geometric, inverse Gaussian, von Mises, and von Mises-Fisher. Provided some parameters are fixed, the Pareto, binomial, multinomial, and negative binomial distributions are also in the exponential family.
Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression
Logistic Regression
Under rather general assumptions, the posterior class distribution takes the form
    y(x) = p(C_1 | x) = \sigma(w^T x + w_0)
The idea is to optimize the above posterior distribution directly.
Fewer parameters, better performance. Consider a binary classification problem with an n-dimensional feature space:
  Logistic regression: n parameters
  A Gaussian classifier:
    + Means: 2n
    + Covariance matrix (shared): n(n + 1)/2
    + Class prior: 1
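A minimal sketch of fitting logistic regression by gradient descent on the negative log-likelihood; the learning rate, iteration count, and names are illustrative choices (a Newton/IRLS update is a common alternative).

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def fit_logistic(X, t, eta=0.1, iters=1000):
        """X: (m, n) inputs; t: (m,) labels in {0, 1}."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # absorb w_0
        w = np.zeros(Xb.shape[1])
        for _ in range(iters):
            y = sigmoid(Xb @ w)
            grad = Xb.T @ (y - t) / Xb.shape[0]        # gradient of the cross-entropy error
            w -= eta * grad
        return w                                       # p(C1 | x) = sigmoid(w[:-1] @ x + w[-1])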
Probit Regression
What if we substitute \sigma with other squashing functions?
Probit function
    \Phi(a) = \int_{-\infty}^{a} \mathcal{N}(\theta \mid 0, 1)\, d\theta
Similar performance compared with logistic regression
More prone to outliers (Why? Consider the decay rate of the tails.)
Useful later
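A quick numerical comparison of the two squashing functions; SciPy is assumed to be available and the value of a is arbitrary.

    import numpy as np
    from scipy.stats import norm

    a = 1.5
    print(1.0 / (1.0 + np.exp(-a)))   # logistic sigmoid sigma(a)
    print(norm.cdf(a))                # probit Phi(a), the standard normal CDF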
Outline Introduction Discriminant Functions Probabilistic Generative Models Probabilistic Discriminative Models Bayesian Logistic Regression
Bayesian Learning
Let t \in \{0, 1\}^m denote the labels of the training data \phi^{(1)}, \dots, \phi^{(m)}
Prior p(w), which is a $64,000,000 question
Likelihood
    p(t | w) = \prod_{i=1}^{m} p(t^{(i)} | w) = \prod_{i=1}^{m} \sigma\!\left( w^T \phi^{(i)} \right)^{t^{(i)}} \left( 1 - \sigma\!\left( w^T \phi^{(i)} \right) \right)^{1 - t^{(i)}}
Posterior
    p(w | t) = \frac{p(w)\, p(t | w)}{p(t)} \propto p(w) \prod_{i=1}^{m} \sigma(\cdot)^{t^{(i)}} (1 - \sigma(\cdot))^{1 - t^{(i)}}
Predictive density (for a new input \phi^*)
    p(t^* | t) = \int p(t^* | w)\, p(w | t)\, dw \propto \int \sigma\!\left( w^T \phi^* \right) p(w) \prod_{i=1}^{m} \sigma(\cdot)^{t^{(i)}} (1 - \sigma(\cdot))^{1 - t^{(i)}}\, dw
Intractability Posterior is intractable due to the normalizing factor. Predictive density is intractable due to the integral. We have to resort to approximations Sampling methods Stochastic, usually asymptotically correct, hard to scale Deterministic methods Do things wrongly and hope they work
Laplace Approximation
Fit a Gaussian at a mode
  The standard deviation is chosen such that the second-order derivative of the log probability matches
  (The first-order derivative is always 0 at a mode)
Scale-free in representing the unnormalized measure
Real variables only
Only local properties are captured; what about multi-modal distributions?
Fitting a Gaussian
Let p(z) = \frac{1}{Z} f(z) be the true distribution, where Z = \int f(z)\, dz
Step 1: Find a mode z_0 of p(z), by gradient methods, say, satisfying
    \left. \frac{df(z)}{dz} \right|_{z = z_0} = 0
Step 2: Consider a Taylor expansion of \ln f(z) at z_0
    \ln f(z) \approx \ln f(z_0) - \frac{1}{2} A (z - z_0)^2, \quad \text{where } A = -\left. \frac{d^2}{dz^2} \ln f(z) \right|_{z = z_0}
Taking the exponential,
    f(z) \approx f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}
Step 3: Normalize to a Gaussian distribution
    q(z) = \left( \frac{A}{2\pi} \right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}
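A sketch of the three steps for a univariate target, assuming the unnormalized log-density ln f(z) is available as a Python function; the SciPy optimizer, the search bounds, and the finite-difference second derivative are illustrative choices.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def laplace_1d(log_f, lo=-10.0, hi=10.0):
        # Step 1: find a mode z0 by maximizing ln f(z).
        res = minimize_scalar(lambda z: -log_f(z), bounds=(lo, hi), method="bounded")
        z0 = res.x
        # Step 2: A = -(d^2/dz^2) ln f(z) at z0, via a finite-difference estimate.
        h = 1e-4
        A = -(log_f(z0 + h) - 2.0 * log_f(z0) + log_f(z0 - h)) / h**2
        # Step 3: q(z) = N(z | z0, 1/A).
        return z0, 1.0 / A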
Laplace Approximation for Multivariate Distributions
To approximate p(z) = \frac{1}{Z} f(z), where z \in \mathbb{R}^m:
We expand at a mode z_0
    \ln f(z) \approx \ln f(z_0) - \frac{1}{2} (z - z_0)^T A (z - z_0)
where A = -\nabla \nabla \ln f(z) \big|_{z = z_0}, the negative Hessian of \ln f(z), serving as the precision matrix of a Gaussian distribution
Taking the exponential, we obtain
    f(z) \approx f(z_0) \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\}
Normalizing it as a distribution, we then have
    q(z) = \frac{|A|^{1/2}}{(2\pi)^{m/2}} \exp\left\{ -\frac{1}{2} (z - z_0)^T A (z - z_0) \right\} = \mathcal{N}(z \mid z_0, A^{-1})
Bayesian Logistic Regression: A Revisit in Earnest
Prior: Gaussian, which is natural [1]
    p(w) = \mathcal{N}(w \mid m_0, S_0)
Posterior: p(w | t) \propto p(w)\, p(t | w), so with y^{(i)} = \sigma(w^T \phi^{(i)}),
    \ln p(w | t) = -\frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0)   [prior]
                 + \sum_{i=1}^{m} \left\{ t^{(i)} \ln y^{(i)} + \left( 1 - t^{(i)} \right) \ln\left( 1 - y^{(i)} \right) \right\}   [likelihood]
                 + \text{const}
    S_N^{-1} = -\nabla \nabla \ln p(w | t) = S_0^{-1} + \sum_{i=1}^{m} y^{(i)} \left( 1 - y^{(i)} \right) \phi^{(i)} \phi^{(i)T}
Hence, the Laplace approximation to the posterior is
    q(w) = \mathcal{N}(w \mid w_{\mathrm{MAP}}, S_N)
[1] Mathematicians always choose priors for the sake of convenience rather than approaching God.
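A sketch of computing w_MAP and S_N, assuming a Gaussian prior N(m0, S0), a design matrix Phi whose rows are the phi^(i), and labels t in {0, 1}; the Newton iteration and all names are illustrative.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def laplace_posterior(Phi, t, m0, S0, iters=50):
        """Phi: (m, d) design matrix; t: (m,) labels in {0, 1}; prior N(m0, S0)."""
        S0_inv = np.linalg.inv(S0)
        w = m0.copy()
        for _ in range(iters):                                   # Newton steps towards the mode w_MAP
            y = sigmoid(Phi @ w)
            grad = S0_inv @ (w - m0) + Phi.T @ (y - t)           # -(d/dw) ln p(w | t)
            H = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])  # negative Hessian at the current w
            w = w - np.linalg.solve(H, grad)
        y = sigmoid(Phi @ w)                                     # re-evaluate S_N^{-1} at w_MAP
        S_N_inv = S0_inv + Phi.T @ (Phi * (y * (1 - y))[:, None])
        return w, np.linalg.inv(S_N_inv)                         # q(w) = N(w | w_MAP, S_N)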
Predictive Density
    p(C_1 | \phi^*, t) = \int p(C_1 | \phi^*, w)\, p(w | t)\, dw \approx \int \sigma(w^T \phi^*)\, q(w)\, dw
    p(C_2 | \phi^*, t) = 1 - p(C_1 | \phi^*, t)
Plan
  Change it to a univariate integral
  Substitute the sigmoid with a probit function, which can then be convolved with a normal analytically:
    \int \sigma(\cdot)\, \mathcal{N}(\cdot)\, da \approx \int \Phi(\cdot)\, \mathcal{N}(\cdot)\, da = \Phi(\cdot) \approx \sigma(\cdot)
The Dirac Delta Function
Let \delta be the Dirac delta function, loosely thought of as a function such that
    \delta(x) = \begin{cases} +\infty, & \text{if } x = 0 \\ 0, & \text{otherwise} \end{cases}
(a Gaussian distribution peaked at 0 with standard deviation \to 0)
The \delta function satisfies
    \int \delta(a - x)\, f(a)\, da = f(x)
and specifically
    \int \delta(a - x)\, da = 1
Deriving the Predictive Density
    p(C_1 | \phi^*) \approx \int \sigma(w^T \phi^*)\, q(w)\, dw   [Laplace approx.]
    \sigma(w^T \phi^*) = \int \delta(a - w^T \phi^*)\, \sigma(a)\, da   [def. of \delta]
Hence
    p(C_1 | \phi^*) \approx \int \left[ \int \delta(a - w^T \phi^*)\, \sigma(a)\, da \right] q(w)\, dw
                  = \int \sigma(a) \left[ \int \delta(a - w^T \phi^*)\, q(w)\, dw \right] da
                  = \int \sigma(a)\, p(a)\, da
where p(a) = \int \delta(a - w^T \phi^*)\, q(w)\, dw.
We now argue that p(a) is Gaussian.
Deriving the Predictive Density (2)
    p(a) = \int \delta(a - w^T \phi^*)\, q(w)\, dw_1\, dw_2 \cdots dw_n = \int q(\bar{w})\, dw_2 \cdots dw_n
where \bar{w} is such that \bar{w}^T \phi^* = a.
We can also verify that
    \int p(a)\, da = \int\!\!\int \delta(a - w^T \phi^*)\, q(w)\, dw\, da = \int q(w) \left[ \int \delta(a - w^T \phi^*)\, da \right] dw = \int q(w)\, dw = 1
Deriving the Predictive Density (3)
    \mu_a = \mathbb{E}[a] = \int p(a)\, a\, da = \int q(w)\, w^T \phi^*\, dw = w_{\mathrm{MAP}}^T \phi^*
    \sigma_a^2 = \mathrm{var}[a] = \int p(a) \left\{ a^2 - \mathbb{E}[a]^2 \right\} da = \int q(w) \left\{ (w^T \phi^*)^2 - (w_{\mathrm{MAP}}^T \phi^*)^2 \right\} dw = \phi^{*T} S_N \phi^*
Thus
    p(C_1 | t) \approx \int \sigma(a)\, p(a)\, da = \int \sigma(a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da
              \approx \int \Phi(\lambda a)\, \mathcal{N}(a \mid \mu_a, \sigma_a^2)\, da = \Phi\!\left( \frac{\mu_a}{(\lambda^{-2} + \sigma_a^2)^{1/2}} \right) \approx \sigma\!\left( \kappa(\sigma_a^2)\, \mu_a \right)
where \lambda^2 = \pi / 8 and \kappa(\sigma_a^2) = (1 + \pi \sigma_a^2 / 8)^{-1/2}, chosen such that the rescaled probit function has the same slope as the sigmoid at the origin.
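A sketch of the resulting approximation, assuming w_map and S_N come from the Laplace fit above and phi_star is the feature vector of a new input; the names are illustrative.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def predictive_c1(phi_star, w_map, S_N):
        mu_a = w_map @ phi_star                       # mu_a = w_MAP^T phi*
        sigma2_a = phi_star @ S_N @ phi_star          # sigma_a^2 = phi*^T S_N phi*
        kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
        return sigmoid(kappa * mu_a)                  # p(C1 | phi*, t) approx sigma(kappa * mu_a)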
The Identity \int \Phi(\cdot)\, \varphi(\cdot) = \Phi(\cdot)
Let X \sim \mathcal{N}(a, b^2) and Y \sim \mathcal{N}(c, d^2) be independent. Then
    \Pr\{X \le Y\} = \int \Pr\{X \le Y \mid Y = w\}\, \mathcal{N}(w \mid c, d^2)\, dw
                   = \int \Pr\{X \le w\}\, \mathcal{N}(w \mid c, d^2)\, dw
                   = \int \Phi\!\left( \frac{w - a}{b} \right) \mathcal{N}(w \mid c, d^2)\, dw
By noticing that
    X - Y \sim \mathcal{N}(a - c,\; b^2 + d^2)
we have
    \Pr\{X \le Y\} = \Pr\{X - Y \le 0\} = \Phi\!\left( \frac{c - a}{\sqrt{b^2 + d^2}} \right)
Take-Home Messages
Discriminant functions
Probabilistic generative models (don't use them)
Probabilistic discriminative models
Bayesian logistic regression
  Sampling methods
  Deterministic approximations
    Laplace approximation (fitting a Gaussian at a mode)
    Substituting the sigmoid function with a probit function yields analytical solutions
    The decision boundary is the same with equal prior probabilities
Thanks for Listening!