Classification. Sandro Cumani. Politecnico di Torino


Outline
Generative model: Gaussian classifier
(Linear) discriminative model: logistic regression
(Non-linear) discriminative model: neural networks


Gaussian Classifier
We want to model the data distribution $P(x|c)$.
This allows computing class posterior probabilities using Bayes' rule:
$$P(c|x) = \frac{P(x|c)P(c)}{\sum_{c'} P(x|c')P(c')}$$
How do we model $P(x|c)$?
Simplest distribution: multivariate Gaussian distribution.
One mean and one covariance matrix per class.
In some cases it's useful to tie covariance parameters across classes.

Gaussian Distribution
Let $X$ denote a Random Variable (R.V.), and $x$ a sample of $X$.
Let $X$ be distributed according to a univariate Gaussian distribution, $X \sim \mathcal{N}(m, \sigma^2)$.
The probability density function for $X$ is [4]
$$P_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2}\frac{(x-m)^2}{\sigma^2}}$$
where $m$ is the distribution mean and $\sigma^2$ is the distribution variance.
[4] With an abuse of notation we will use the symbol $P$ to denote both probabilities (for discrete R.V.s) and densities (for continuous R.V.s).

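Gaussian Distribution
[Figure: univariate Gaussian densities plotted for x between -6 and 6, for several choices of $m$ and $\sigma^2$, including ($m = 0$, $\sigma^2 = 4$) and ($m = 1$, $\sigma^2 = 0.25$).]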

Multivariate Gaussian Distribution
Let $X$ be a random vector $X = [X_1, \dots, X_N]^T$ where the $X_i$ are independent, identically distributed R.V.s following a standard normal distribution, $X_i \sim \mathcal{N}(0, 1)$.
The distribution of $X$ is given by the joint distribution of $\{X_1, \dots, X_N\}$ or, equivalently,
$$P_X(x) = \prod_{i=1}^{N} P_{X_i}(x_i)$$
$$P_X(x) = (2\pi)^{-\frac{N}{2}} e^{-\frac{1}{2} x^T x}$$
$X$ is said to follow a standard multivariate normal distribution, $X \sim \mathcal{N}(0, I)$.

Multivariate Gaussian Distribution
In general, $X$ is said to follow a multivariate normal distribution with mean $\mu$ and covariance matrix $\Sigma$ if $X$ can be rewritten as a linear transformation of a standard multivariate normal distributed random vector $Y$:
$$X = AY + \mu$$
where $Y \sim \mathcal{N}(0, I)$ and $\Sigma = AA^T$. We will write $X \sim \mathcal{N}(\mu, \Sigma)$.
The p.d.f. of $X$ is given by
$$P_X(x) = (2\pi)^{-\frac{N}{2}} |\Sigma|^{-\frac{1}{2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}$$
Often, rather than working with a covariance matrix, it's easier to work with its inverse, the precision matrix $\Lambda = \Sigma^{-1}$.
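The construction $X = AY + \mu$ can be checked numerically. The sketch below is only an illustration: the values of `mu` and `Sigma`, the sample size, and the choice of the Cholesky factor for $A$ are assumptions (any $A$ with $AA^T = \Sigma$ would do).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) mean and covariance.
mu = np.array([1.0, -2.0])
Sigma = np.array([[1.5, 0.7],
                  [0.7, 1.0]])

# Any A with A @ A.T == Sigma works; the Cholesky factor is one convenient choice.
A = np.linalg.cholesky(Sigma)

# Each column of Y is one sample of a standard normal vector, Y ~ N(0, I).
Y = rng.standard_normal((2, 100_000))

# Linear transformation X = A Y + mu, applied to every column.
X = A @ Y + mu[:, None]

emp_mean = X.mean(axis=1)
emp_cov = np.cov(X)  # rows are variables, columns are samples
```

With this many samples, `emp_mean` and `emp_cov` should be close to `mu` and `Sigma`.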

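Multivariate Gaussian Distribution
It's usually more practical to work with the logarithm of the p.d.f. (to avoid numerical problems).
Let's have a look at the log-pdf of $X \sim \mathcal{N}(\mu, \Lambda^{-1})$:
$$\log P_X(x) = \frac{1}{2}\log|\Lambda| - \frac{1}{2}(x-\mu)^T \Lambda (x-\mu) - \frac{N}{2}\log(2\pi)$$
We can notice that it's a negative definite quadratic form in $x$, thus its level sets are ellipsoids (ellipses in 2D).
The directions of the axes are given by the eigenvectors of the covariance matrix, and their lengths are proportional to the square roots of the corresponding eigenvalues.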

Multivariate Gaussian Distribution
[Figure: density contour plots of three 2D Gaussians with $\mu = 0$ and, respectively, $\Sigma = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}$, $\Sigma = \begin{bmatrix} 1.5 & 0.7 \\ 0.7 & 1 \end{bmatrix}$; both axes range from -2 to 2.]

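Multivariate Gaussian Distribution
Let $X \sim \mathcal{N}(m, \Lambda^{-1})$.
The covariance matrix can be decomposed as $\Lambda^{-1} = UDU^T$, with $U$ orthogonal and $D$ diagonal.
Consider the R.V. $Z = U^T(X - m)$: we have $Z \sim \mathcal{N}(0, D)$, i.e., $Z$ is a random vector whose components are independent univariate normal distributed R.V.s $Z_i \sim \mathcal{N}(0, d_i)$.
If the eigenvalues $d_i$ are sorted in decreasing order, the first $M$ components of $Z = U^T(X - m)$ correspond to the $M$ directions of highest variance: we recovered PCA.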

Maximum Likelihood Estimate
We assume that our data are independent samples of a R.V. with multivariate Gaussian distribution.
The log-pdf of our dataset $X = \{x_1, \dots, x_K\}$ given model $M$ is therefore
$$\log P(X|M) = \sum_{i=1}^{K} \log P(x_i|M)$$
$P(X|M)$ is called the likelihood function.
The Maximum Likelihood Estimate (MLE) is the set of parameters of the model $M$ that maximizes the (log-)likelihood.
MLE finds the parameters under which the dataset is most likely to be generated.

Maximum Likelihood Estimate
Let's apply MLE to a multivariate Gaussian distribution.
The log-likelihood we want to maximize is given by
$$\log P(X|\mu, \Lambda^{-1}) = \sum_{i=1}^{K} \left[ \frac{1}{2}\log|\Lambda| - \frac{1}{2}(x_i-\mu)^T \Lambda (x_i-\mu) \right] + k$$
The solution is obtained by setting the derivatives equal to zero and solving for $\mu$ and $\Lambda$:
$$\mu_{ML} = \frac{1}{K}\sum_i x_i$$
$$\Lambda^{-1}_{ML} = \frac{1}{K}\sum_i (x_i - \mu_{ML})(x_i - \mu_{ML})^T$$
i.e., the empirical mean and covariance matrix of the data.
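As a small illustration of the ML solution (empirical mean and covariance), the following sketch uses NumPy; the function name `gaussian_mle` and the toy dataset are assumptions made for this example.

```python
import numpy as np

def gaussian_mle(X):
    """ML estimate for a multivariate Gaussian.

    X has shape (K, N): K samples, N features.
    Returns the empirical mean and the empirical covariance matrix.
    """
    K = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu                 # centered data
    Sigma = (Xc.T @ Xc) / K     # the ML estimate divides by K, not K - 1
    return mu, Sigma

# Toy dataset (assumed): the four corners of a square.
X = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [0.0, 2.0],
              [2.0, 2.0]])
mu_ml, Sigma_ml = gaussian_mle(X)
```

For this toy dataset the mean is the center of the square and the covariance is the identity matrix.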

Gaussian Classifier
For classification purposes, we can assume that the data of each class $c$ is generated by a R.V. $X_c \sim \mathcal{N}(\mu_c, \Lambda_c^{-1})$.
MLE allows us to estimate the set of parameters $\Pi = \{\mu_c, \Lambda_c^{-1}\}$.
For each test sample, we compute the class-conditional likelihood $P(x|c)$ as
$$P(x|c) = P(x|c, \Pi, M) = \mathcal{N}(x; \mu_c, \Lambda_c^{-1})$$

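Gaussian Classifier
Binary classification: we can compute the log-likelihood ratio
$$l = \log\frac{P(x|c_1)}{P(x|c_2)} = \log\frac{\mathcal{N}(x; \mu_1, \Lambda_1^{-1})}{\mathcal{N}(x; \mu_2, \Lambda_2^{-1})}$$
The decision function is quadratic in $x$:
$$l(x) = x^T A x + x^T b + c$$
with
$$A = -\frac{1}{2}(\Lambda_1 - \Lambda_2)$$
$$b = \Lambda_1\mu_1 - \Lambda_2\mu_2$$
$$c = -\frac{1}{2}\left(\mu_1^T\Lambda_1\mu_1 - \mu_2^T\Lambda_2\mu_2\right) + \frac{1}{2}\left(\log|\Lambda_1| - \log|\Lambda_2|\right)$$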
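The quadratic form of the log-likelihood ratio can be verified numerically against the difference of the two Gaussian log-densities. This is a minimal sketch: the helper names and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at x, written in terms of the precision matrix."""
    N = mu.shape[0]
    Lam = np.linalg.inv(Sigma)
    _, logdet_Lam = np.linalg.slogdet(Lam)
    d = x - mu
    return 0.5 * logdet_Lam - 0.5 * d @ Lam @ d - 0.5 * N * np.log(2 * np.pi)

def quadratic_llr_params(mu1, Sigma1, mu2, Sigma2):
    """Coefficients A, b, c of l(x) = x^T A x + x^T b + c."""
    L1, L2 = np.linalg.inv(Sigma1), np.linalg.inv(Sigma2)
    A = -0.5 * (L1 - L2)
    b = L1 @ mu1 - L2 @ mu2
    c = (-0.5 * (mu1 @ L1 @ mu1 - mu2 @ L2 @ mu2)
         + 0.5 * (np.linalg.slogdet(L1)[1] - np.linalg.slogdet(L2)[1]))
    return A, b, c

# Illustrative (assumed) class parameters and test point.
mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]])
mu2, Sigma2 = np.array([2.0, 1.0]), np.array([[1.5, 0.7], [0.7, 1.0]])
x = np.array([0.5, -1.0])

A, b, c = quadratic_llr_params(mu1, Sigma1, mu2, Sigma2)
llr_quadratic = x @ A @ x + x @ b + c
llr_direct = gaussian_logpdf(x, mu1, Sigma1) - gaussian_logpdf(x, mu2, Sigma2)
```

The two values agree up to floating-point error, since the $-\frac{N}{2}\log(2\pi)$ terms cancel in the ratio.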

Gaussian Classifier
A binary 2D example
[Figure: 2D plot of a binary example with labelled score contours (levels from -24 to 40).]

Gaussian Classifier
For some datasets it's convenient to assume that the covariance matrices of the different classes are tied, for example with high-dimensional data or with a small number of samples.
In this case, the ML solution is given by
$$\Sigma = \frac{1}{K}\sum_c \sum_{i \in c} (x_{c,i} - \mu_c)(x_{c,i} - \mu_c)^T$$

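Gaussian Classifier
The binary log-likelihood ratio becomes
$$l = \log\frac{P(x|c_1)}{P(x|c_2)} = \log\frac{\mathcal{N}(x; \mu_1, \Lambda^{-1})}{\mathcal{N}(x; \mu_2, \Lambda^{-1})}$$
The decision function is now linear in $x$:
$$l(x) = x^T b + c$$
with
$$b = \Lambda(\mu_1 - \mu_2)$$
$$c = -\frac{1}{2}\left(\mu_1^T\Lambda\mu_1 - \mu_2^T\Lambda\mu_2\right)$$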
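A possible sketch of the tied-covariance model follows: the shared covariance pools the within-class scatter over all classes, and the resulting log-likelihood ratio is linear in $x$. The function names and the synthetic two-class data are assumptions for this example.

```python
import numpy as np

def tied_gaussian_fit(X, labels):
    """Class means and shared (tied) ML covariance. X has shape (K, N)."""
    K, N = X.shape
    means = {}
    Sigma = np.zeros((N, N))
    for c in np.unique(labels):
        Xc = X[labels == c]
        means[int(c)] = Xc.mean(axis=0)
        D = Xc - means[int(c)]
        Sigma += D.T @ D            # within-class scatter of class c
    return means, Sigma / K         # normalize by the total number of samples

def linear_llr(x, mu1, mu2, Sigma):
    """l(x) = x^T b + c for two Gaussians sharing the covariance Sigma."""
    Lam = np.linalg.inv(Sigma)
    b = Lam @ (mu1 - mu2)
    c = -0.5 * (mu1 @ Lam @ mu1 - mu2 @ Lam @ mu2)
    return x @ b + c

# Synthetic (assumed) data: two blobs with the same spread.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((50, 2)) + [2.0, 0.0],
               rng.standard_normal((50, 2)) + [-2.0, 0.0]])
labels = np.array([1] * 50 + [2] * 50)

means, Sigma = tied_gaussian_fit(X, labels)
midpoint = 0.5 * (means[1] + means[2])
llr_mid = linear_llr(midpoint, means[1], means[2], Sigma)
llr_at_mu1 = linear_llr(means[1], means[1], means[2], Sigma)
```

The score is zero at the midpoint between the two means and positive at $\mu_1$, as expected for a linear boundary that separates the two class means.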

Gaussian Classifier
A binary 2D example
[Figure: 2D plot of a binary example with labelled score contours (levels from -10 to 15).]

Gaussian Classifier
Consider a classifier based on the Mahalanobis distance, which assigns to $x$ the label of the class for which $\|x - \mu_c\|_W$ is minimum.
A corresponding scoring function would then be
$$f(x) = \|x - \mu_2\|_W^2 - \|x - \mu_1\|_W^2$$
Observe that
$$f(x) = x^T W x - 2x^T W\mu_2 + \mu_2^T W \mu_2 - x^T W x + 2x^T W \mu_1 - \mu_1^T W \mu_1$$
$$= 2x^T W(\mu_1 - \mu_2) + \mu_2^T W \mu_2 - \mu_1^T W \mu_1$$
If we set $W = \Lambda$, $f(x)$ provides the same decision boundaries as the log-likelihood ratio of the Gaussian model with tied covariances!

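Gaussian Classifier
The model is also closely related to LDA.
Remember that two-class LDA looks for the direction which maximizes the generalized Rayleigh quotient
$$\frac{w^T S_B w}{w^T S_W w}$$
with
$$S_W = K\Lambda^{-1}, \qquad S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$$
We have seen that we can solve the problem by applying the following transformations (the scalar factor $K$ does not change the maximizing direction and is dropped):
$$\hat{x} = \Lambda^{\frac{1}{2}} x, \qquad \hat{S}_W = I, \qquad \hat{S}_B = \Lambda^{\frac{1}{2}}(\mu_2 - \mu_1)(\mu_2 - \mu_1)^T\Lambda^{\frac{1}{2}}$$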

Gaussian Classifier
Since $v = \Lambda^{\frac{1}{2}}(\mu_2 - \mu_1)$ is just a vector, the leading eigenvector of $\hat{S}_B = vv^T$ is
$$\nu = \frac{v}{\|v\|}$$
Thus, projection over the LDA subspace is, up to a scaling factor, given by
$$w^T x = k\, x^T \Lambda(\mu_2 - \mu_1)$$
This corresponds to the classification rule of the Gaussian model with tied covariances!
Indeed, a limitation of LDA is that it assumes that all classes have the same within-class covariance matrix.

Gaussian Classifier
Multiclass problems: we learn class-specific model parameters.
This allows computing class-conditional likelihoods
$$P(x|c) = \mathcal{N}(x; \mu_c, \Sigma_c)$$
If we are interested in closed-set class posteriors, we can apply Bayes' rule to compute posterior probabilities:
$$P(c|x) = \frac{P(x|c)P(c)}{P(x)} = \frac{\pi_c\,\mathcal{N}(x; \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'}\,\mathcal{N}(x; \mu_{c'}, \Sigma_{c'})}$$
We assign to the test sample the label of the class that has the highest posterior probability $P(c|x)$.
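The posterior computation can be sketched as below. Working in the log domain with a log-sum-exp for the denominator is a common numerical precaution rather than something prescribed by the slides; the function names, class parameters and priors are assumptions.

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """Log-density of N(mu, Sigma) at x."""
    N = mu.shape[0]
    d = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(Sigma, d))

def class_posteriors(x, mus, Sigmas, priors):
    """P(c|x) for every class via Bayes' rule, computed in the log domain."""
    log_joint = np.array([gaussian_logpdf(x, m, S) + np.log(p)
                          for m, S, p in zip(mus, Sigmas, priors)])
    shift = log_joint.max()  # subtracting the max avoids underflow in exp
    log_evidence = shift + np.log(np.exp(log_joint - shift).sum())
    return np.exp(log_joint - log_evidence)

# Three illustrative (assumed) classes with uniform priors.
mus = [np.array([0.0, 0.0]), np.array([3.0, 0.0]), np.array([0.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2), np.eye(2)]
priors = [1 / 3, 1 / 3, 1 / 3]

post = class_posteriors(np.array([2.9, 0.1]), mus, Sigmas, priors)
predicted = int(np.argmax(post))  # index of the class with highest posterior
```

The test point lies next to the second mean, so the second class (index 1) receives the highest posterior.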

Gaussian Classifier
MNIST error rates for the Gaussian classifier

                        M     PCA     PCA+LDA   AEC (MLP)
Tied covariances        100   12.3%   -         11.8%
                        50    12.6%   -         12.0%
                        9     23.7%   12.3%     13.2%
Non-tied covariances    100   4.3%    -         3.6%
                        50    3.6%    -         3.5%
                        9     12.2%   10.2%     6.4%

Logistic Regression
For a 2-class problem, the Gaussian model with tied covariances provides log-likelihood ratios that are linear functions of our data [5]:
$$l(x) = \log\frac{P(x|c_1)}{P(x|c_2)} = w^T x$$
Assuming uniform priors [6], it follows that
$$\frac{P(x|c_1)}{P(x|c_2)} = \frac{P(c_1|x)}{P(c_2|x)} = e^{w^T x}$$
$$P(c_1|x, w) = e^{w^T x} P(c_2|x, w) = e^{w^T x}\left(1 - P(c_1|x, w)\right)$$
[5] We omit the bias term here. In general, bias can be accounted for by replacing $x$ with $[x^T, 1]^T$.
[6] Non-uniform priors can be accounted for using a bias term.

Logistic Regression
Therefore
$$P(c_1|x, w) = \frac{e^{w^T x}}{1 + e^{w^T x}} = \frac{1}{1 + e^{-w^T x}} = \sigma(w^T x)$$
where
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
is called the sigmoid function (a special case of the logistic function).

Logistic Regression
Sigmoid function:
[Figure: plot of $\sigma(x)$ for x between -8 and 8.]
Some properties of $\sigma(x)$ that will be useful later:
$$1 - \sigma(x) = \sigma(-x)$$
$$\frac{d\sigma(x)}{dx} = \sigma(x)\left(1 - \sigma(x)\right)$$

Logistic Regression
We assume that the label for class $c_1$ is 1, and the label for class $c_2$ is 0.
Let
$$y_i = P(c_1|x_i, w) = \sigma(w^T x_i)$$
It follows that
$$P(c_2|x_i, w) = 1 - y_i = \sigma(-w^T x_i)$$
Let $t_i \in \{0, 1\}$ denote the training label associated to $x_i$.

Logistic Regression
The likelihood of our label set is
$$P(t|x, w) = \prod_i P(t_i|x_i, w)$$
where
$$P(t_i|x_i, w) = \begin{cases} y_i & \text{if } t_i = 1 \\ 1 - y_i & \text{if } t_i = 0 \end{cases}$$
i.e., each $t_i$ is generated according to a Bernoulli distribution with parameter $p = y_i$.

Logistic Regression
In a compact form, the likelihood can be expressed as
$$P(t|x, w) = \prod_i y_i^{t_i}(1 - y_i)^{1 - t_i}$$
In the log domain,
$$\log P(t|x, w) = \sum_i \left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
The negative expression
$$E(w) = -\sum_i \left[t_i \log y_i + (1 - t_i)\log(1 - y_i)\right]$$
is also called binary cross entropy.

Logistic Regression
The cross entropy can be interpreted as a type of error function.
It measures the distance between the predicted labels for the training set and the actual labels.
As for the Gaussian model, we are interested in maximizing the likelihood $\log P(t|x, w)$ or, equivalently, in minimizing the cross entropy $E(w)$.

Logistic Regression
Note that, if we set $z_i = 2t_i - 1$, i.e.
$$z_i = \begin{cases} 1 & \text{if } t_i = 1 \\ -1 & \text{if } t_i = 0 \end{cases}$$
then $E(w)$ can be rewritten in compact form as
$$E(w) = -\sum_i \log\sigma(z_i w^T x_i) = \sum_i \log\left(1 + e^{-z_i w^T x_i}\right)$$
The function $l(s) = \log(1 + e^{-s})$ is called logistic loss.
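The equivalence between the binary cross entropy written with labels $t_i \in \{0, 1\}$ and the logistic loss written with $z_i = 2t_i - 1$ can be checked numerically. The data and weights below are random placeholders assumed only for this check.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))   # 20 samples, 3 features (assumed)
w = rng.standard_normal(3)         # arbitrary weight vector
t = rng.integers(0, 2, size=20)    # labels in {0, 1}

# Binary cross entropy in terms of y_i = sigma(w^T x_i) and t_i.
y = sigmoid(X @ w)
cross_entropy = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Same quantity as a sum of logistic losses with z_i in {-1, +1}.
z = 2 * t - 1
# np.logaddexp(0, a) computes log(1 + exp(a)) in a numerically stable way.
logistic_loss = np.sum(np.logaddexp(0.0, -z * (X @ w)))
```

Both expressions give the same value up to floating-point error.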

Logistic Regression
Logistic loss
[Figure: plot of the logistic loss $l(s)$ for s between -8 and 8.]

Logistic Regression
Logistic regression can be interpreted as an instance of a broad class of optimization problems which aim at minimizing an empirical risk function over our training data.
Generalized risk minimization problem:
$$\min_w \sum_i l(w, x_i, z_i)$$
where $l$ is called the loss (or cost) function.
In general, some regularization is used to avoid overfitting:
$$\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{K}\sum_i l(w, x_i, z_i)$$

Logistic Regression
Regularization can also be applied to Logistic Regression [7].
This is necessary when classes are separable, to prevent the norm of $w$ from growing indefinitely!
Regularized logistic regression:
$$\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{K}\sum_i \log\left(1 + e^{-z_i w^T x_i}\right)$$
$\lambda$ is a hyperparameter of the model and, as usual, its optimal value should be selected by means of a validation set.
[7] Once we add regularization, the model is not invariant to linear transformations anymore. It is therefore useful to preprocess our data (e.g. whitening) for the regularizer to be effective.

Logistic Regression
The optimal value for $w$ cannot be expressed in closed form.
It can be shown that the regularized logistic regression objective function is convex.
We can resort to numerical optimization approaches.
In this course we will use the L-BFGS algorithm [8].
L-BFGS libraries are available for a large number of programming languages (including Python).
[8] Details of L-BFGS can be found in: J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed. Springer, 2006.

Logistic Regression
In order to run the numerical solver, we need to compute the gradient of the objective function with respect to $w$, $\nabla_w E(w)$.
The derivative of the cost function $l(s, z_i) = \log(1 + e^{-z_i s})$ with respect to $s$ is
$$\frac{dl(s)}{ds} = -\frac{z_i}{1 + e^{z_i s}}$$
Thus, with $s = w^T x_i$,
$$\nabla_w l(w^T x_i, z_i) = -\frac{z_i}{1 + e^{z_i w^T x_i}}\, x_i$$
Notice that, if we use $t_i$ in place of $z_i$, we also have
$$\nabla_w l(w^T x_i, z_i) = (y_i - t_i)\, x_i$$
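A sketch of the regularized objective $\frac{\lambda}{2}\|w\|^2 + \frac{1}{K}\sum_i \log(1 + e^{-z_i w^T x_i})$ together with its gradient is shown below, with a finite-difference check of the gradient. The function name `logreg_objective`, the random data and the value of `lam` are assumptions for illustration.

```python
import numpy as np

def logreg_objective(w, X, t, lam):
    """Regularized logistic regression objective and its gradient.

    X: (K, D) data matrix, t: (K,) labels in {0, 1}, lam: regularization weight.
    """
    K = X.shape[0]
    z = 2 * t - 1
    s = X @ w
    loss = 0.5 * lam * (w @ w) + np.logaddexp(0.0, -z * s).mean()
    y = 1.0 / (1.0 + np.exp(-s))           # y_i = sigma(w^T x_i)
    grad = lam * w + X.T @ (y - t) / K      # per-sample gradient is (y_i - t_i) x_i
    return loss, grad

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
t = rng.integers(0, 2, size=30)
w = rng.standard_normal(4)
lam = 0.1

loss, grad = logreg_objective(w, X, t, lam)

# Central finite differences as an independent check of the analytic gradient.
eps = 1e-6
num_grad = np.zeros_like(w)
for d in range(w.size):
    e = np.zeros_like(w)
    e[d] = eps
    num_grad[d] = (logreg_objective(w + e, X, t, lam)[0]
                   - logreg_objective(w - e, X, t, lam)[0]) / (2 * eps)
```

A function that returns the pair (value, gradient) in this way has the shape expected by `scipy.optimize.fmin_l_bfgs_b` when `fprime` is omitted and `approx_grad` is left at its default, so it could be passed to that solver with `args=(X, t, lam)`.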

Logistic Regression
[Figure: binary 2D example with labelled score contours, in two panels: $\lambda = 0$ (levels from -10 to 15) and $\lambda = 1$ (levels from -2 to 3).]

Multiclass Logistic Regression
As in the binary case, we assume that
$$\log P(x|c) = w_c^T x + k$$
We have one hyperplane $w_c$ per class.
Assuming uniform priors, class posteriors are then given by
$$P(c|x) \propto P(x|c) \propto e^{w_c^T x}$$
Since $\sum_c P(c|x) = 1$, it follows that
$$P(c|x) = \frac{e^{w_c^T x}}{\sum_{c'} e^{w_{c'}^T x}}$$
The function
$$f_i(s) = \frac{e^{s_i}}{\sum_j e^{s_j}}$$
is called softmax.
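A minimal softmax sketch follows. Subtracting the maximum score before exponentiating is a standard numerical safeguard (an addition of this example, not part of the slides); it leaves the result unchanged because softmax is invariant to adding a constant to all scores.

```python
import numpy as np

def softmax(s):
    """Softmax over the last axis of s."""
    s = s - s.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])          # illustrative class scores w_c^T x
probs = softmax(scores)
probs_shifted = softmax(scores + 1000.0)    # same posteriors, no overflow
```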

Multiclass Logistic Regression
We assume that class labels are $c_i \in \{0, \dots, N-1\}$.
We adopt a 1-of-N coding scheme for class labels.
For each data point $x_i$ we define the label vector $t_i$ with components
$$t_{ij} = \begin{cases} 1 & \text{if } c_i = j \\ 0 & \text{otherwise} \end{cases}$$
i.e., $t_i$ is a vector whose elements are all zero except the element whose position corresponds to the class label.

Multiclass Logistic Regression
Let $W$ denote the set of all hyperplanes, $W = \{w_1, \dots, w_N\}$.
The log-likelihood for vectors $t_i$ is given by
$$\log P(\{t_i\}|\{x_i\}, W) = \sum_i \log P(t_i|x_i, W) = \sum_i \sum_j t_{ij}\log P(c_j|x_i, W) = \sum_i \sum_j t_{ij}\log y_{ij}$$
where
$$y_{ij} = \frac{e^{w_j^T x_i}}{\sum_k e^{w_k^T x_i}}$$
This objective function is also known as the negative cross entropy between class labels and predictions.

Multiclass Logistic Regression
Multiclass logistic regression corresponds to a multinomial model of the class labels, where the multinomial parameters are obtained (through the softmax) from the scores $\Pi = [w_1^T x, \dots, w_N^T x]$.
As for the binary case, we estimate $W$ so as to maximize the likelihood of the training labels.
Compared to the binary case, the model is overparametrized (i.e., we can add a constant vector to all terms $w_i$ without changing the model).
In particular, for a 2-class problem, if we subtract $w_2$ from both $w_1$ and $w_2$, we recover exactly the binary logistic regression objective with $w = w_1 - w_2$.

Multiclass Logistic Regression
Finally, as for the binary case, we can cast the problem as the minimization of a loss function.
We rewrite the objective in terms of class labels $c$ as
$$-\log P(\{t_i\}|\{x_i\}, W) = -\sum_i \sum_j t_{ij}\log y_{ij} = -\sum_i \log\frac{e^{w_{c_i}^T x_i}}{\sum_c e^{w_c^T x_i}} = \sum_i \left[\log\left(\sum_c e^{w_c^T x_i}\right) - w_{c_i}^T x_i\right]$$
This is also called softmax loss.

Multiclass Logistic Regression
Adding a regularizer, the multiclass logistic regression objective function is
$$\min_{w_1, \dots, w_N} \lambda\,\Omega(w_1, \dots, w_N) + \frac{1}{K}\sum_i \left[\log\left(\sum_c e^{w_c^T x_i}\right) - w_{c_i}^T x_i\right]$$
Different regularizers can be used, for example
$$\Omega(w_1, \dots, w_N) = \frac{1}{2}\sum_i \|w_i\|^2$$

Multiclass Logistic Regression
The empirical loss and its gradients are
$$l(w_1, \dots, w_N, x_i) = \log\left(\sum_c e^{w_c^T x_i}\right) - w_{c_i}^T x_i$$
$$\nabla_{w_k} l(w_1, \dots, w_N, x_i) = \left(\frac{e^{w_k^T x_i}}{\sum_c e^{w_c^T x_i}} - \delta_{k, c_i}\right) x_i$$
In terms of $y_{ij}$ and $t_{ij}$:
$$l(w_1, \dots, w_N, x_i) = -\sum_j t_{ij}\log y_{ij}$$
$$\nabla_{w_k} l(w_1, \dots, w_N, x_i) = (y_{ik} - t_{ik})\, x_i$$
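The per-sample softmax loss $\log\left(\sum_c e^{w_c^T x}\right) - w_{c_i}^T x$ and its gradient $(y_k - t_k)\,x$ can be sketched and checked with finite differences as follows. Stacking the vectors $w_c$ as the rows of a matrix `W`, as well as the function name and the random inputs, are assumptions of this example.

```python
import numpy as np

def softmax_loss_and_grad(W, x, c):
    """Softmax loss for one sample and its gradient with respect to W.

    W: (N_classes, D) matrix whose rows are the w_c, x: (D,) sample, c: integer label.
    """
    s = W @ x
    shift = s.max()
    log_sum = shift + np.log(np.exp(s - shift).sum())   # stable log-sum-exp
    loss = log_sum - s[c]
    y = np.exp(s - log_sum)                              # softmax outputs
    t = np.zeros_like(y)
    t[c] = 1.0                                           # 1-of-N label vector
    grad = np.outer(y - t, x)                            # row k is (y_k - t_k) x
    return loss, grad

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
c = 2

loss, grad = softmax_loss_and_grad(W, x, c)

# Central finite differences over every entry of W.
eps = 1e-6
num_grad = np.zeros_like(W)
for k in range(W.shape[0]):
    for d in range(W.shape[1]):
        E = np.zeros_like(W)
        E[k, d] = eps
        num_grad[k, d] = (softmax_loss_and_grad(W + E, x, c)[0]
                          - softmax_loss_and_grad(W - E, x, c)[0]) / (2 * eps)
```

The rows of the gradient sum to the zero vector, which reflects the overparametrization of the model: adding the same vector to every $w_c$ leaves the loss unchanged.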

Multiclass Logistic Regression
MNIST error rates for Logistic Regression

DimRed      λ = 0   λ = 0.00001   λ = 0.001   λ = 0.1   Tied Gau
RAW [768]   8.0%    7.4%          7.9%        12.9%     -
PCA [50]    8.8%    8.8%          8.9%        13.3%     12.3%
PCA [100]   7.8%    7.8%          8.2%        12.9%     12.6%
AEC [50]    9.1%    9.2%          9.2%        11.9%     12.0%
AEC [100]   7.8%    7.8%          8.2%        12.1%     11.8%

Multiclass Logistic Regression
Linear logistic regression on MNIST performs better than our tied covariance Gaussian classifier; however, it's far worse than our non-linear (non-tied covariance) Gaussian classifier.
Remember that, for LR, we assumed that
$$\log P(x|c) = w_c^T x + k$$
which has the same form as the Gaussian classifier with tied covariances.
For the Gaussian classifier with non-tied covariances we have
$$\log P(x|c) = -\frac{1}{2}x^T\Lambda_c x + x^T\Lambda_c\mu_c - \frac{1}{2}\mu_c^T\Lambda_c\mu_c + k$$
which we can rewrite as
$$\log P(x|c) = \langle xx^T, A\rangle + x^T v + k$$

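Multiclass Logistic Regression
The log-pdf $\log P(x|c)$ can be expressed as a linear function in an expanded feature space:
$$\phi(x) = \begin{bmatrix} \mathrm{vec}(xx^T) \\ x \\ 1 \end{bmatrix}, \qquad w = \begin{bmatrix} -\frac{1}{2}\mathrm{vec}(\Lambda_c) \\ \Lambda_c\mu_c \\ -\frac{1}{2}\mu_c^T\Lambda_c\mu_c + k \end{bmatrix}$$
$$\log P(x|c) = w^T\phi(x) + k$$
We can use LR with data points $\phi(x)$ to directly estimate $w$.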
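The quadratic expansion $\phi(x) = [\mathrm{vec}(xx^T), x, 1]$ can be sketched as below, together with a check that the Gaussian-derived weight vector reproduces the quadratic form $-\frac{1}{2}(x-\mu_c)^T\Lambda_c(x-\mu_c)$ up to the constant. The function names and the numerical values are assumptions for this example.

```python
import numpy as np

def quadratic_features(x):
    """Expanded feature vector [vec(x x^T), x, 1]."""
    return np.concatenate([np.outer(x, x).ravel(), x, [1.0]])

def gaussian_weights(mu, Lam, k=0.0):
    """Weights [-vec(Lam)/2, Lam mu, -mu^T Lam mu / 2 + k] for one class."""
    return np.concatenate([-0.5 * Lam.ravel(),
                           Lam @ mu,
                           [-0.5 * mu @ Lam @ mu + k]])

# Illustrative (assumed) class parameters and data point.
mu = np.array([1.0, -0.5])
Lam = np.linalg.inv(np.array([[1.5, 0.7],
                              [0.7, 1.0]]))
x = np.array([0.3, 2.0])

phi = quadratic_features(x)
w = gaussian_weights(mu, Lam)

linear_score = w @ phi                              # linear in the expanded space
quadratic_score = -0.5 * (x - mu) @ Lam @ (x - mu)  # quadratic in the original space
```

For an input of dimension $D$ the expanded vector has $D^2 + D + 1$ components, which illustrates how quickly the expanded space grows.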

Multiclass Logistic Regression
In general, we can consider a transformation $\phi(x)$ of our feature space such that our classes are linearly separable in the expanded feature space.
The simple expansion we have just seen produces quadratic separation surfaces (cf. the Gaussian model).
The dimensionality of the expanded feature space can grow very quickly.
MNIST error rates for LR with quadratic feature expansion

DimRed     λ = 0   λ = 1e-5   λ = 1e-3   λ = 1e-1   Gaussian
PCA [50]   2.3%    1.9%       1.7%       3.1%       3.6%
AEC [50]   2.3%    2.0%       2.0%       2.2%       3.5%

Neural Networks for classification
Neural Networks (NN) provide a method to estimate the function $\phi$.
A Neural Network can be interpreted as a non-linear parametric function $\phi(x, \Pi)$.
The parameters of the non-linear transformation are learned from the data.
The function is represented by means of a directed graph.
Each node is associated to a function that operates on the input nodes and provides the node output.

Neural Networks
A neural network node
[Figure: a single node with inputs $x_1, x_2, x_3, x_4$, function $f(x, \Pi)$ and output $y = f(x, \Pi)$.]

Neural Networks
Function $f$ is usually represented as a composition of an affine projection and a scalar non-linearity:
$$f(x, \Pi) = h(w^T x + b)$$
Several possible functions have been proposed for the non-linearity:
Sigmoid function: $h(x) = \frac{1}{1 + e^{-x}}$
Hyperbolic tangent: $h(x) = \tanh(x)$
Rectified linear: $h(x) = \max(0, x)$
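A single node $f(x, \Pi) = h(w^T x + b)$ with the three non-linearities listed above might be sketched as follows; the function names and the numerical values of `x`, `w` and `b` are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def relu(a):
    return np.maximum(0.0, a)

def node(x, w, b, h):
    """Affine projection w^T x + b followed by the scalar non-linearity h."""
    return h(w @ x + b)

# Illustrative (assumed) inputs and parameters: here w^T x + b = -0.25.
x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, -0.5, 0.25, -0.25])
b = 0.5

out_sigmoid = node(x, w, b, sigmoid)
out_tanh = node(x, w, b, np.tanh)
out_relu = node(x, w, b, relu)
```

Since the pre-activation is negative, the ReLU output is zero, while the sigmoid and tanh outputs are below 0.5 and below 0 respectively.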

Neural Networks
[Figure: plots of the three activation functions for x between -4 and 4: sigmoid $h(x) = \sigma(x)$, hyperbolic tangent $h(x) = \tanh(x)$, and ReLU $h(x) = \max(x, 0)$.]

Feed-forward Neural Networks
Units are organized in layers.
[Figure: a layered network with inputs $x_1, x_2, x_3, \dots, x_N$ and outputs $y_1, \dots, y_M$.]
Connections are defined between layers (no loops).

Feed-forward network (revisited)
Training a neural network requires optimizing its parameters (in our case, the parameters of the affine transformations) with respect to our objective function.
This requires computing the gradients of our objective function with respect to the network parameters.
We can exploit the representation of the network in terms of composition of functions and apply the chain rule to compute our gradients.
An effective approach to compute these gradients is the back-propagation algorithm.

Neural Networks for classification
Training is usually performed using Stochastic Gradient Descent over batches.
Given a randomly sampled batch, we compute the gradient of the objective function.
We update the weights using a gradient descent approach:
$$w \leftarrow w - \alpha_t \nabla_w l(\{x\}_{BATCH})$$
$\alpha_t$ is called the learning rate.
Convergence is guaranteed if
$$\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty$$
More sophisticated approaches have been recently introduced.
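The update rule $w \leftarrow w - \alpha_t \nabla_w l$ over random batches can be illustrated on a simple model. The sketch below applies it to binary logistic regression rather than to a full network (an assumption made to keep the example short), with the schedule $\alpha_t = \alpha_0 / t$, whose sum diverges while the sum of its squares converges. All names, data and hyperparameters are illustrative.

```python
import numpy as np

def mean_logistic_loss(w, X, t):
    z = 2 * t - 1
    return np.logaddexp(0.0, -z * (X @ w)).mean()

def sgd_logreg(X, t, alpha0=0.5, n_epochs=20, batch_size=16, seed=0):
    """Mini-batch SGD for binary logistic regression with alpha_t = alpha0 / t."""
    rng = np.random.default_rng(seed)
    K, D = X.shape
    w = np.zeros(D)
    step = 0
    for _ in range(n_epochs):
        perm = rng.permutation(K)                  # reshuffle at every epoch
        for start in range(0, K, batch_size):
            idx = perm[start:start + batch_size]   # randomly sampled batch
            y = 1.0 / (1.0 + np.exp(-(X[idx] @ w)))
            grad = X[idx].T @ (y - t[idx]) / len(idx)
            step += 1
            alpha_t = alpha0 / step                # decaying learning rate
            w -= alpha_t * grad
    return w

# Synthetic (assumed) two-class data with a constant feature acting as bias.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((100, 2)) + [2.0, 0.0],
               rng.standard_normal((100, 2)) + [-2.0, 0.0]])
X = np.hstack([X, np.ones((200, 1))])
t = np.array([1] * 100 + [0] * 100)

w_sgd = sgd_logreg(X, t)
initial_loss = mean_logistic_loss(np.zeros(3), X, t)   # equals log(2) at w = 0
final_loss = mean_logistic_loss(w_sgd, X, t)
```

In practice this $1/t$ schedule decays quickly, which is one reason other schedules and optimizers are often preferred.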

Neural Networks for classification
We can combine the neural network with the logistic regression model.
The neural network preprocesses our data, which are then classified by means of LR.
Recall that, for binary classification, we can write our objective function as
$$l(x) = -\left[t\log y + (1 - t)\log(1 - y)\right]$$
where $y = \sigma(w^T x)$.
Adding the NN component we have
$$y = \sigma(w^T f_n(x))$$

Neural Networks for classification
The model is equivalent to a NN $n_1$ with an additional sigmoid activation layer, and loss function
$$l(x) = -\left[t\log f_{n_1}(x) + (1 - t)\log(1 - f_{n_1}(x))\right]$$
For binary classification we can thus build a network with an extra, single-node, sigmoid layer and train the network to minimize the objective function
$$-\sum_i \left[t_i\log f(x_i) + (1 - t_i)\log(1 - f(x_i))\right]$$
This objective is called binary cross entropy.

Neural Networks for classification
A binary 2D example
[Figure: 2D plot of a binary example with labelled score contours (levels from -5 to 5).]

Neural Networks for classification
Overfitting can be much more dramatic than for linear logistic regression.
[Figure: 2D plot of a binary example with irregular labelled score contours (levels from -8 to 8).]

Neural Networks for classification
Different regularization strategies can be adopted:
L2 weight regularization
Dropout
Early stopping (computing the error on a validation set)

Neural Networks for classification
L2 weight regularization
[Figure: 2D plot of a binary example with labelled score contours (levels from -3 to 4.5).]

Neural Networks for classification
For multiclass problems we consider the multiclass logistic regression objective
$$l(x) = -\sum_j t_j\log y_j$$
where
$$y_j = \frac{e^{w_j^T f_{NET}(x)}}{\sum_k e^{w_k^T f_{NET}(x)}}$$
As for the binary case, we can interpret this model as a NN $NET_1$ with an additional softmax layer.
The network has an output node for each class.
Training targets are represented using a 1-of-N code.

Multiclass Logistic Regression
MNIST error rates for Neural Nets [9]

Model                             No Reg.       L2 (λ = 1e-5)   Dropout (p = 0.5)
MLP (Tanh) 512 512 512            1.9% [1.8%]   2.0% [1.7%]     1.5% [1.5%]
MLP (ReLU) 512 512 512            1.6% [1.5%]   1.7% [1.6%]     1.7% [1.4%]
MLP (ReLU) 1024 1024 1024 1024    1.6% [1.5%]   1.6% [1.4%]     1.4% [1.4%]
ConvNet (ReLU)                    1.1% [1.0%]   1.1% [1.0%]     0.9% [0.8%]

[9] The training set was split into development (90% of the data) and validation (10% of the data) sets to select the best performing model. The performance of the model with the lowest error rate on the test set is shown in brackets.