Classification. Sandro Cumani. Politecnico di Torino


Outline

- Generative model: Gaussian classifier
- (Linear) discriminative model: logistic regression
- (Non-linear) discriminative model: neural networks

Gaussian Classifier

We want to model the data distribution P(x|c). This allows computing class posterior probabilities using Bayes' rule:

P(c|x) = \frac{P(x|c)\,P(c)}{\sum_{c'} P(x|c')\,P(c')}

How do we model P(x|c)? The simplest choice is the multivariate Gaussian distribution, with one mean and one covariance matrix per class. In some cases it is useful to tie the covariance parameters across classes.

Gaussian Distribution

Let X denote a Random Variable (R.V.), and x a sample of X. Let X be distributed according to a univariate Gaussian distribution, X ~ N(m, \sigma^2). The probability density function for X is [4]

P_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-m)^2}{2\sigma^2}}

where m is the distribution mean and \sigma^2 is the distribution variance.

[4] With an abuse of notation we will use the symbol P to denote both probabilities (for discrete R.V.s) and densities (for continuous R.V.s).

Gaussian Distribution

[Figure: univariate Gaussian densities for different values of the mean m and variance \sigma^2.]

Multivariate Gaussian Distribution

Let X be a random vector X = [X_1, ..., X_N]^T where the X_i are independent, identically distributed R.V.s following a standard normal distribution, X_i ~ N(0, 1). The distribution of X is given by the joint distribution of {X_1, ..., X_N}:

P_X(x) = \prod_{i=1}^{N} P_{X_i}(x_i)

or, equivalently,

P_X(x) = (2\pi)^{-\frac{N}{2}} e^{-\frac{1}{2} x^T x}

X is said to follow a standard multivariate normal distribution, X ~ N(0, I).

Multivariate Gaussian Distribution

In general, X is said to follow a multivariate normal distribution with mean \mu and covariance matrix \Sigma if X can be rewritten as a linear transformation of a standard multivariate normally distributed random vector Y:

X = A Y + \mu, with Y ~ N(0, I) and \Sigma = A A^T

We will write X ~ N(\mu, \Sigma). The p.d.f. of X is given by

P_X(x) = (2\pi)^{-\frac{N}{2}} |\Sigma|^{-\frac{1}{2}} e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}

Often, rather than working with the covariance matrix, it is easier to work with its inverse, the precision matrix \Lambda = \Sigma^{-1}.

Multivariate Gaussian Distribution

It is usually more practical to work with the logarithm of the p.d.f. (to avoid numerical problems). Let us have a look at the log-pdf of X ~ N(\mu, \Lambda^{-1}):

\log P_X(x) = \frac{1}{2}\log|\Lambda| - \frac{1}{2}(x-\mu)^T \Lambda (x-\mu) - \frac{N}{2}\log(2\pi)

We can notice that it is a negative definite quadratic form in x, thus the level sets are ellipses. The axis directions are given by the eigenvectors of the covariance matrix, and their lengths by the corresponding eigenvalues.
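
As an illustration of the formula above, here is a minimal numpy sketch of the Gaussian log-density; the function name and variables are illustrative and not part of the course material.

```python
import numpy as np

def logpdf_gaussian(x, mu, C):
    """Log-density of N(mu, C) at x, computed via the precision matrix."""
    N = x.shape[0]
    Lam = np.linalg.inv(C)                    # precision matrix Lambda
    _, logdet = np.linalg.slogdet(Lam)        # log |Lambda|
    diff = x - mu
    return 0.5 * logdet - 0.5 * diff @ Lam @ diff - 0.5 * N * np.log(2 * np.pi)

# Standard bivariate normal evaluated at the origin: -log(2*pi), about -1.8379
print(logpdf_gaussian(np.zeros(2), np.zeros(2), np.eye(2)))
```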

Multivariate Gaussian Distribution

[Figure: density contours for \mu = 0 with \Sigma = [[1, 0], [0, 1]], \Sigma = [[2, 0], [0, 1]], and \Sigma = [[1.5, 0.7], [0.7, 1]].]

Multivariate Gaussian Distribution

Let X ~ N(m, \Lambda^{-1}). The covariance matrix can be decomposed as \Lambda^{-1} = U D U^T. Consider the R.V. Z = U^T(X - m): Z ~ N(0, D), i.e., Z is a random vector whose components are independent univariate normally distributed R.V.s, Z_i ~ N(0, d_i). The first m components of Z = U^T(X - m) (with the eigenvalues in D sorted in decreasing order) correspond to the directions of highest variance: we have recovered PCA.

Maximum Likelihood Estimate

We assume that our data are independent samples of a R.V. with multivariate Gaussian distribution. The log-pdf of our dataset X = {x_1, ..., x_K} given model M is therefore

\log P(X|M) = \sum_{i=1}^{K} \log P(x_i|M)

P(X|M) is called the likelihood function. The Maximum Likelihood Estimate (MLE) consists of the parameters of the model M which maximize the (log-)likelihood, i.e., MLE finds the parameters under which the dataset is most likely to be generated.

Maximum Likelihood Estimate

Let us apply MLE to a multivariate Gaussian distribution. The log-likelihood we want to maximize is given by

\log P(X|\mu, \Lambda^{-1}) = \sum_{i=1}^{K} \left[ \frac{1}{2}\log|\Lambda| - \frac{1}{2}(x_i - \mu)^T \Lambda (x_i - \mu) \right] + k

The solution is obtained by setting the derivatives with respect to \mu and \Lambda equal to zero and solving:

\mu_{ML} = \frac{1}{K} \sum_i x_i
\Lambda_{ML}^{-1} = \frac{1}{K} \sum_i (x_i - \mu_{ML})(x_i - \mu_{ML})^T

i.e., the empirical mean and covariance matrix of the data.
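
The ML estimates above are just the empirical mean and covariance; a possible numpy sketch (the D x K data layout and names are assumptions, not taken from the slides):

```python
import numpy as np

def mle_gaussian(X):
    """ML mean and (1/K-normalized) covariance for a D x K data matrix X."""
    mu = X.mean(axis=1, keepdims=True)
    centered = X - mu
    C = centered @ centered.T / X.shape[1]
    return mu, C
```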

Gaussian Classifier

For classification purposes, we can assume that the data of each class c is generated by a R.V. X_c ~ N(\mu_c, \Lambda_c^{-1}). MLE allows us to estimate the set of parameters \Pi = \{\mu_c, \Lambda_c^{-1}\}. For each test sample, we compute the class-conditional likelihood P(x|c) as

P(x|c) = P(x|c, \Pi, M) = N(x; \mu_c, \Lambda_c^{-1})

Gaussian Classifier

Binary classification: we can compute the log-likelihood ratio

l = \log \frac{P(x|c_1)}{P(x|c_2)} = \log \frac{N(x; \mu_1, \Lambda_1^{-1})}{N(x; \mu_2, \Lambda_2^{-1})}

The decision function is quadratic in x:

l(x) = x^T A x + x^T b + c

with

A = -\frac{1}{2}(\Lambda_1 - \Lambda_2)
b = \Lambda_1 \mu_1 - \Lambda_2 \mu_2
c = -\frac{1}{2}\left(\mu_1^T \Lambda_1 \mu_1 - \mu_2^T \Lambda_2 \mu_2\right) + \frac{1}{2}\left(\log|\Lambda_1| - \log|\Lambda_2|\right)
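
A sketch of how the coefficients A, b, c could be computed from the per-class ML estimates; the helper names are hypothetical, numpy is assumed:

```python
import numpy as np

def quadratic_llr_params(mu1, C1, mu2, C2):
    """Coefficients of l(x) = x^T A x + x^T b + c for two Gaussian classes."""
    L1, L2 = np.linalg.inv(C1), np.linalg.inv(C2)
    A = -0.5 * (L1 - L2)
    b = L1 @ mu1 - L2 @ mu2
    c = (-0.5 * (mu1 @ L1 @ mu1 - mu2 @ L2 @ mu2)
         + 0.5 * (np.linalg.slogdet(L1)[1] - np.linalg.slogdet(L2)[1]))
    return A, b, c

def quadratic_llr(x, A, b, c):
    """Evaluate the quadratic log-likelihood ratio at a sample x."""
    return x @ A @ x + x @ b + c
```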

Gaussian Classifier

A binary 2D example. [Figure: contour levels of the quadratic log-likelihood ratio l(x).]

Gaussian Classifier

For some datasets it is convenient to assume that the covariance matrices of the different classes are tied, e.g. for large-dimensional data or a small number of samples. In this case, the ML solution is given by

\Sigma = \frac{1}{K} \sum_c \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T
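
A possible implementation of the tied ML covariance estimate (the D x K data layout and label encoding are assumptions):

```python
import numpy as np

def tied_covariance(X, labels):
    """Tied ML covariance: within-class scatter averaged over all K samples."""
    D, K = X.shape
    Sigma = np.zeros((D, D))
    for c in np.unique(labels):
        Xc = X[:, labels == c]
        mu_c = Xc.mean(axis=1, keepdims=True)
        Sigma += (Xc - mu_c) @ (Xc - mu_c).T
    return Sigma / K
```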

Gaussian Classifier

The binary log-likelihood ratio becomes

l = \log \frac{P(x|c_1)}{P(x|c_2)} = \log \frac{N(x; \mu_1, \Lambda^{-1})}{N(x; \mu_2, \Lambda^{-1})}

The decision function is now linear in x:

l(x) = x^T b + c

with

b = \Lambda(\mu_1 - \mu_2)
c = -\frac{1}{2}\left(\mu_1^T \Lambda \mu_1 - \mu_2^T \Lambda \mu_2\right)

Gaussian Classifier

A binary 2D example. [Figure: contour levels of the linear log-likelihood ratio l(x).]

Gaussian Classifier

Consider a classifier based on the Mahalanobis distance, which assigns x to the class c for which \|x - \mu_c\|_W is minimum. A corresponding scoring function would then be

f(x) = \|x - \mu_2\|_W^2 - \|x - \mu_1\|_W^2

Observe that

f(x) = x^T W x - 2 x^T W \mu_2 + \mu_2^T W \mu_2 - x^T W x + 2 x^T W \mu_1 - \mu_1^T W \mu_1
     = 2 x^T W (\mu_1 - \mu_2) + \mu_2^T W \mu_2 - \mu_1^T W \mu_1

If we set W = \Lambda, f(x) provides the same decision boundaries as the log-likelihood ratio of the Gaussian model with tied covariances!

Gaussian Classifier

The model is also closely related to LDA. Remember that two-class LDA looks for the direction w which maximizes the generalized Rayleigh quotient

\frac{w^T S_B w}{w^T S_W w}

with

S_W = K\,\Lambda^{-1}
S_B = (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T

We have seen that we can solve the problem by applying the following transformation:

x' = \Lambda^{\frac{1}{2}} x \;\Rightarrow\; S_W' \propto I, \quad S_B' = \Lambda^{\frac{1}{2}} (\mu_2 - \mu_1)(\mu_2 - \mu_1)^T \Lambda^{\frac{1}{2}}

Gaussian Classifier

Since v = \Lambda^{\frac{1}{2}}(\mu_2 - \mu_1) is just a vector, the leading eigenvector of S_B' is \nu = \frac{v}{\|v\|}. Thus, the projection onto the LDA subspace is, up to a scaling factor, given by

w^T x = k \, x^T \Lambda (\mu_2 - \mu_1)

This corresponds to the classification rule of the Gaussian model with tied covariances! Indeed, a limitation of LDA is that it assumes that all classes have the same within-class covariance matrix.

Gaussian Classifier

Multiclass problems: we learn class-specific model parameters. This allows computing the class-conditional likelihoods

P(x|c) = N(x; \mu_c, \Sigma_c)

If we are interested in closed-set class posteriors, we can apply Bayes' rule to compute posterior probabilities:

P(c|x) = \frac{P(x|c)\,P(c)}{P(x)} = \frac{\pi_c\, N(x; \mu_c, \Sigma_c)}{\sum_{c'} \pi_{c'}\, N(x; \mu_{c'}, \Sigma_{c'})}

We assign to the test sample the label of the class that has the highest posterior probability P(c|x).
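
A sketch of the closed-set posterior computation using scipy's Gaussian density; the function and argument names are illustrative, not from the course material:

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def class_posteriors(x, mus, covs, priors):
    """Closed-set posteriors P(c|x) for a multiclass Gaussian classifier."""
    log_joint = np.array([
        multivariate_normal.logpdf(x, mean=mu, cov=C) + np.log(p)   # log P(x|c) + log P(c)
        for mu, C, p in zip(mus, covs, priors)
    ])
    return np.exp(log_joint - logsumexp(log_joint))                 # Bayes' rule in the log domain

# The predicted label is np.argmax(class_posteriors(x, mus, covs, priors)).
```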

Gaussian Classifier

MNIST error rates for the Gaussian classifier:

Tied covariances
M     PCA      PCA+LDA   AEC (MLP)
100   12.3%    -         11.8%
50    12.6%    -         12.0%
9     23.7%    12.3%     13.2%

Non-tied covariances
M     PCA      PCA+LDA   AEC (MLP)
100   4.3%     -         3.6%
50    3.6%     -         3.5%
9     12.2%    10.2%     6.4%

Logistic Regression

For a 2-class problem, the Gaussian model with tied covariances provides likelihood ratios that are linear functions of our data [5]. Assuming uniform priors [6],

l(x) = \log \frac{P(x|c_1)}{P(x|c_2)} = w^T x

It follows that

\frac{P(x|c_1)}{P(x|c_2)} = \frac{P(c_1|x)}{P(c_2|x)} = e^{w^T x}

P(c_1|x, w) = e^{w^T x}\, P(c_2|x, w) = e^{w^T x}\left(1 - P(c_1|x, w)\right)

[5] We omit the bias term here. In general, the bias can be accounted for by replacing x with [x^T, 1]^T.
[6] Non-uniform priors can be accounted for using a bias term.

Logistic Regression

Therefore

P(c_1|x, w) = \frac{e^{w^T x}}{1 + e^{w^T x}} = \frac{1}{1 + e^{-w^T x}} = \sigma(w^T x)

where

\sigma(x) = \frac{1}{1 + e^{-x}}

is called the sigmoid function (a special case of the logistic function).

Logistic Regression

Sigmoid function: [Figure: plot of \sigma(x).]

Some properties of \sigma(x) that will come in useful later:

1 - \sigma(x) = \sigma(-x)
\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))

Logistic Regression

We assume that the label for class c_1 is 1, and the label for class c_2 is 0. Let

y_i = P(c_1|x_i, w) = \sigma(w^T x_i)

It follows that

P(c_2|x_i, w) = 1 - y_i = \sigma(-w^T x_i)

Let t_i \in \{0, 1\} denote the training label associated with x_i.

Logistic Regression

The likelihood of our label set is

P(t|x, w) = \prod_i P(t_i|x_i, w)

where

P(t_i|x_i, w) = y_i if t_i = 1, and 1 - y_i if t_i = 0

i.e., each t_i is generated according to a Bernoulli distribution with parameter p = y_i.

Logistic Regression

In compact form, the likelihood can be expressed as

P(t|x, w) = \prod_i y_i^{t_i}\,(1 - y_i)^{1 - t_i}

In the log domain,

\log P(t|x, w) = \sum_i \left[ t_i \log y_i + (1 - t_i)\log(1 - y_i) \right]

The negative expression

E(w) = -\sum_i \left[ t_i \log y_i + (1 - t_i)\log(1 - y_i) \right]

is also called the binary cross-entropy.

Logistic Regression

The cross-entropy can be interpreted as a type of error function: it measures the distance between the predicted labels for the training set and the actual labels. As for the Gaussian model, we are interested in maximizing the likelihood \log P(t|x, w) or, equivalently, in minimizing the cross-entropy E(w).

Logistic Regression

Note that, if we set z_i = 2t_i - 1, i.e.

z_i = 1 if t_i = 1, and z_i = -1 if t_i = 0

then E(w) can be rewritten in compact form as

E(w) = -\sum_i \log \sigma(z_i\, w^T x_i) = \sum_i \log\left(1 + e^{-z_i w^T x_i}\right)

The function l(s) = \log(1 + e^{-s}) is called the logistic loss.
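
The logistic-loss form of E(w) is easy to evaluate in a numerically stable way; a minimal sketch, assuming a D x K data matrix and labels z in {-1, +1}:

```python
import numpy as np

def logreg_objective(w, X, z, lam=0.0):
    """Regularized binary logistic regression objective (average logistic loss)."""
    s = w @ X                                  # scores w^T x_i
    loss = np.logaddexp(0, -z * s).mean()      # stable log(1 + exp(-z_i * s_i))
    return 0.5 * lam * np.dot(w, w) + loss
```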

Logistic Regression

Logistic loss: [Figure: plot of l(s) = \log(1 + e^{-s}).]

Logistic Regression

Logistic regression can be interpreted as an instance of a broad class of optimization problems which aim at minimizing an empirical risk function over our training data. Generalized risk minimization problem:

\min_w \sum_i l(w, x_i, z_i)

where l is called the loss (or cost) function. In general, some regularization is used to avoid overfitting:

\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{K}\sum_i l(w, x_i, z_i)

Logistic Regression

Regularization can also be applied to logistic regression [7]. This is necessary when the classes are separable, to keep the norm of w from growing indefinitely! Regularized logistic regression:

\min_w \frac{\lambda}{2}\|w\|^2 + \frac{1}{K}\sum_i \log\left(1 + e^{-z_i w^T x_i}\right)

\lambda is a hyperparameter of the model and, as usual, its optimal value should be selected by means of a validation set.

[7] Once we add regularization, the model is no longer invariant to linear transformations. It is therefore useful to preprocess our data (e.g. whitening) for the regularizer to be effective.

Logistic Regression

The optimal value for w cannot be expressed in closed form. It can be shown that the regularized logistic regression objective function is convex, so we can resort to numerical optimization approaches. In this course we will use the L-BFGS algorithm [8]. L-BFGS libraries are available for a wide range of programming languages (including Python).

[8] Details of L-BFGS can be found in: J. Nocedal and S. J. Wright, Numerical Optimization, 2nd ed., Springer, 2006.

Logistic Regression

In order to run the numerical solver, we need to compute the gradient of the objective function with respect to w, \nabla_w E(w). The derivative of the loss l(s, z_i) = \log(1 + e^{-z_i s}) with respect to s is

\frac{dl(s, z_i)}{ds} = \frac{-z_i}{1 + e^{z_i s}}

Thus

\nabla_w l(w^T x_i, z_i) = \frac{-z_i}{1 + e^{z_i w^T x_i}}\, x_i

Notice that, if we use t_i in place of z_i, we also have

\nabla_w l(w^T x_i, z_i) = (y_i - t_i)\, x_i
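
Putting the objective and its gradient together, here is a sketch of how the model could be trained with scipy's L-BFGS solver; the data layout and names are assumptions, not the course code:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def logreg_obj_grad(w, X, z, lam):
    """Return (objective, gradient) for regularized binary logistic regression.

    X: D x K data matrix, z: labels in {-1, +1}, lam: regularization weight.
    """
    s = w @ X
    obj = 0.5 * lam * np.dot(w, w) + np.logaddexp(0, -z * s).mean()
    G = -z / (1.0 + np.exp(z * s))             # dl/ds for each sample
    grad = lam * w + (X * G).mean(axis=1)      # chain rule: dl/dw = (dl/ds) * x
    return obj, grad

# Hypothetical usage, with D the feature dimension:
# w_opt, f_min, info = fmin_l_bfgs_b(logreg_obj_grad, np.zeros(D), args=(X, z, 1e-3))
```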

Logistic Regression

[Figure: 2D example of the decision function contours for \lambda = 0 and \lambda = 1.]

Multiclass Logistic Regression

As in the binary case, we assume that

\log P(x|c) = w_c^T x + k

We have one hyperplane w_c per class. Assuming uniform priors, the class posteriors are then given by

P(c|x) \propto P(x|c) \propto e^{w_c^T x}

Since \sum_c P(c|x) = 1, it follows that

P(c|x) = \frac{e^{w_c^T x}}{\sum_{c'} e^{w_{c'}^T x}}

The function f_i(s) = \frac{e^{s_i}}{\sum_j e^{s_j}} is called the softmax.

Multiclass Logistic Regression

We assume that class labels are c_i \in \{0, ..., N-1\}. We adopt a 1-over-N coding scheme for the class labels: for each data point x_i we define the label vector t_i with components

t_{ij} = 1 if c_i = j, and 0 otherwise

i.e., t_i is a vector whose elements are all zero, except the element whose position corresponds to the class label.

Multiclass Logistic Regression

Let W denote the set of all hyperplanes, W = \{w_1, ..., w_N\}. The likelihood for the vectors t_i is given by

\log P(\{t_i\}|\{x_i\}, W) = \sum_i \log P(t_i|x_i, W) = \sum_i \sum_j t_{ij} \log P(c_j|x_i, W) = \sum_i \sum_j t_{ij} \log y_{ij}

where

y_{ij} = \frac{e^{w_j^T x_i}}{\sum_k e^{w_k^T x_i}}

This objective function is also known as the negative cross-entropy between class labels and predictions.

Multiclass Logistic Regression

Multiclass logistic regression corresponds to a multinomial model of the class labels, where the multinomial parameters are obtained (through the softmax) from the scores [w_1^T x, ..., w_N^T x]. As for the binary case, we estimate W so as to maximize the likelihood of the training labels. Compared to the binary case, the model is over-parametrized (i.e., we can add a constant vector to all terms w_i without changing the model). In particular, for a 2-class problem, if we subtract w_2 from both w_1 and w_2, we recover exactly the binary logistic regression objective.

Multiclass Logistic Regression

Finally, as for the binary case, we can cast the problem as the minimization of a loss function. We rewrite the objective in terms of the class labels c_i:

-\log P(\{t_i\}|\{x_i\}, W) = -\sum_i \sum_j t_{ij} \log y_{ij} = -\sum_i \log \frac{e^{w_{c_i}^T x_i}}{\sum_c e^{w_c^T x_i}} = \sum_i \left[ \log\left(\sum_c e^{w_c^T x_i}\right) - w_{c_i}^T x_i \right]

This is also called the softmax loss.

Multiclass Logistic Regression

Adding a regularizer, the multiclass logistic regression objective function is

\min_{w_1, ..., w_N} \lambda\,\Omega(w_1, ..., w_N) + \frac{1}{K}\sum_i \left[ \log\left(\sum_c e^{w_c^T x_i}\right) - w_{c_i}^T x_i \right]

Different regularizers can be used, for example

\Omega(w_1, ..., w_N) = \frac{1}{2}\sum_i \|w_i\|^2

Multiclass Logistic Regression

The empirical loss and its gradients are

l(w_1, ..., w_N, x_i) = \log\left(\sum_c e^{w_c^T x_i}\right) - w_{c_i}^T x_i

\nabla_{w_k} l(w_1, ..., w_N, x_i) = \left(\frac{e^{w_k^T x_i}}{\sum_c e^{w_c^T x_i}} - \delta_{k, c_i}\right) x_i

In terms of y_{ij} and t_{ij}:

l(w_1, ..., w_N, x_i) = -\sum_j t_{ij} \log y_{ij}
\nabla_{w_k} l(w_1, ..., w_N, x_i) = (y_{ik} - t_{ik})\, x_i
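
A vectorized sketch of the softmax loss and its gradient; the conventions (W is a D x N matrix of hyperplanes, X is D x K, labels are integer class indices) are assumptions:

```python
import numpy as np
from scipy.special import logsumexp

def softmax_loss_grad(W, X, labels, lam=0.0):
    """Average softmax loss and gradient for multiclass logistic regression."""
    K = X.shape[1]
    S = W.T @ X                                  # N x K matrix of scores w_c^T x_i
    logZ = logsumexp(S, axis=0)                  # log sum_c exp(w_c^T x_i)
    loss = (logZ - S[labels, np.arange(K)]).mean() + 0.5 * lam * np.sum(W * W)
    Y = np.exp(S - logZ)                         # posteriors y_ij
    T = np.zeros_like(Y)
    T[labels, np.arange(K)] = 1.0                # one-hot targets t_ij
    grad = X @ (Y - T).T / K + lam * W           # average of (y_ik - t_ik) x_i per class
    return loss, grad
```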

Multiclass Logistic Regression

MNIST error rates for logistic regression:

DimRed       λ = 0   λ = 0.00001   λ = 0.001   λ = 0.1   Tied Gau
RAW [768]    8.0%    7.4%          7.9%        12.9%     -
PCA [50]     8.8%    8.8%          8.9%        13.3%     12.3%
PCA [100]    7.8%    7.8%          8.2%        12.9%     12.6%
AEC [50]     9.1%    9.2%          9.2%        11.9%     12.0%
AEC [100]    7.8%    7.8%          8.2%        12.1%     11.8%

Multiclass Logistic Regression

Linear logistic regression on MNIST performs better than our tied-covariance Gaussian classifier, but it is far worse than our non-linear Gaussian classifier. Remember that, for LR, we assumed that

\log P(x|c) = w_c^T x + k

which has the same form as the Gaussian classifier with tied covariances. For the Gaussian classifier with non-tied covariances we have

\log P(x|c) = -\frac{1}{2} x^T \Lambda_c x + x^T \Lambda_c \mu_c - \frac{1}{2}\mu_c^T \Lambda_c \mu_c + k

which we can rewrite as

\log P(x|c) = \langle x x^T, A \rangle + x^T v + k

Multiclass Logistic Regression

The log-pdf \log P(x|c) can be expressed as a linear function in an expanded feature space:

\phi(x) = \begin{bmatrix} \mathrm{vec}(x x^T) \\ x \\ 1 \end{bmatrix}, \qquad w = \begin{bmatrix} -\frac{1}{2}\mathrm{vec}(\Lambda_c) \\ \Lambda_c \mu_c \\ -\frac{1}{2}\mu_c^T \Lambda_c \mu_c + k \end{bmatrix}, \qquad \log P(x|c) = w^T \phi(x)

We can use LR with data points \phi(x) to directly estimate w.
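
A simple sketch of the quadratic feature expansion phi(x) = [vec(x x^T), x, 1], applied column-wise to a D x K data matrix (the layout is an assumption); a linear classifier trained on these features realizes quadratic separation surfaces in the original space.

```python
import numpy as np

def quadratic_expansion(X):
    """Expand each column x of X into [vec(x x^T), x, 1]."""
    D, K = X.shape
    outer = np.einsum('ik,jk->ijk', X, X).reshape(D * D, K)   # vec(x x^T) per sample
    return np.vstack([outer, X, np.ones((1, K))])
```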

Multiclass Logistic Regression

In general, we can consider a transformation \phi(x) of our feature space such that our classes are linearly separable in the expanded feature space. The simple expansion we have just seen produces quadratic separation surfaces (cf. the Gaussian model). The dimensionality of the expanded feature space can grow very quickly.

MNIST error rates for LR with quadratic feature expansion:

DimRed     λ = 0   λ = 1e-5   λ = 1e-3   λ = 1e-1   Gaussian
PCA [50]   2.3%    1.9%       1.7%       3.1%       3.6%
AEC [50]   2.3%    2.0%       2.0%       2.2%       3.5%

Neural Networks for classification

Neural Networks (NNs) provide a method to estimate the function \phi. A neural network can be interpreted as a non-linear parametric function \phi(x, \Pi), whose parameters are learned from the data. The function is represented by means of a directed graph: each node is associated with a function that operates on the node inputs and provides the node output.

Neural Networks

A neural network node: [Figure: inputs x_1, ..., x_4 feeding a node that computes y = f(x, \pi).]

Neural Networks

The function f is usually represented as the composition of an affine projection and a scalar non-linearity:

f(x, \pi) = h(w^T x + b)

Several functions have been proposed for the non-linearity:

- Sigmoid: h(x) = \frac{1}{1 + e^{-x}}
- Hyperbolic tangent: h(x) = \tanh(x)
- Rectified linear (ReLU): h(x) = \max(0, x)

Neural Networks

[Figure: plots of the sigmoid, hyperbolic tangent, and ReLU activation functions.]

Feed-forward Neural Networks

Units are organized in layers. [Figure: fully connected feed-forward network mapping inputs x_1, ..., x_N to outputs y_1, ..., y_M.] Connections are defined between layers (no loops).
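
A minimal sketch of the forward pass of such a layered network with ReLU hidden units; the weights, shapes, and names are illustrative, not from the course material.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mlp_forward(x, layers):
    """Forward pass: each hidden layer computes relu(W^T x + b); the last layer is linear."""
    for W, b in layers[:-1]:
        x = relu(W.T @ x + b)
    W, b = layers[-1]
    return W.T @ x + b

# Example: a 2 -> 16 -> 1 network with random weights
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(2, 16)), np.zeros(16)),
          (rng.normal(size=(16, 1)), np.zeros(1))]
print(mlp_forward(np.array([0.5, -1.0]), layers))
```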

Feed-forward network (revisited)

Training a neural network requires optimizing its parameters (in our case, the parameters of the affine transformations) with respect to our objective function. This requires computing the gradients of the objective function with respect to the network parameters. We can exploit the representation of the network as a composition of functions and apply the chain rule to compute the gradients. An effective approach to compute these gradients is the back-propagation algorithm.

Neural Networks for classification

Training is usually performed using Stochastic Gradient Descent (SGD) over batches. Given a randomly sampled batch, we compute the gradient of the objective function and update the weights using a gradient descent step:

w \leftarrow w - \alpha_t \nabla_w l(\{x\}_{BATCH})

\alpha_t is called the learning rate. Convergence is guaranteed if

\sum_t \alpha_t = \infty \quad\text{and}\quad \sum_t \alpha_t^2 < \infty

More sophisticated approaches have been introduced recently.
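
A bare-bones mini-batch SGD loop with a fixed learning rate, just to make the update rule concrete; the grad_fn interface and names are assumptions, not the course code.

```python
import numpy as np

def sgd(params, grad_fn, data, batch_size=32, lr=0.1, epochs=10, seed=0):
    """Minimal mini-batch SGD: take a gradient step on each randomly sampled batch."""
    rng = np.random.default_rng(seed)
    n = len(data)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = [data[i] for i in idx[start:start + batch_size]]
            params = params - lr * grad_fn(params, batch)   # w = w - alpha * gradient
    return params
```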

Neural Networks for classification

We can combine the neural network with the logistic regression model: the neural network preprocesses our data, which are then classified by means of LR. Recall that, for binary classification, we can write our objective function as

l(x) = -\left[ t \log y + (1 - t)\log(1 - y) \right]

where y = \sigma(w^T x). Adding the NN component we have

y = \sigma(w^T f_n(x))

Neural Networks for classification

The model is equivalent to a NN n_1 with an additional sigmoid activation layer, and loss function

l(x) = -\left[ t \log f_{n_1}(x) + (1 - t)\log(1 - f_{n_1}(x)) \right]

For binary classification we can thus build a network with an extra, single-node sigmoid layer and train the network to minimize the objective function

-\sum_i \left[ t_i \log f(x_i) + (1 - t_i)\log(1 - f(x_i)) \right]

This objective is called the binary cross-entropy.

Neural Networks for classification

A binary 2D example. [Figure: decision function contours produced by a neural network.]

Neural Networks for classification

Overfitting can be much more dramatic than for linear logistic regression. [Figure: contours of an overfitted neural network decision function on the 2D example.]

Neural Networks for classification

Different regularization strategies can be adopted:

- L2 weight regularization
- Dropout
- Early stopping (monitoring the error on a validation set)

Neural Networks for classification

L2 weight regularization. [Figure: decision function contours obtained with L2 regularization.]

Neural Networks for classification

For multiclass problems we consider the multiclass logistic regression objective

l(x) = -\sum_j t_j \log y_j, \qquad y_j = \frac{e^{w_j^T f_{NET}(x)}}{\sum_k e^{w_k^T f_{NET}(x)}}

As for the binary case, we can interpret this model as a NN NET_1 with an additional softmax layer. The network has an output node for each class. Training targets are represented using a 1-out-of-N code.

Multiclass Logistic Regression

MNIST error rates for Neural Nets [9]:

Model                              No Reg.        L2 (λ = 1e-5)   Dropout (p = 0.5)
MLP (Tanh), 512-512-512            1.9% [1.8%]    2.0% [1.7%]     1.5% [1.5%]
MLP (ReLU), 512-512-512            1.6% [1.5%]    1.7% [1.6%]     1.7% [1.4%]
MLP (ReLU), 1024-1024-1024-1024    1.6% [1.5%]    1.6% [1.4%]     1.4% [1.4%]
ConvNet (ReLU)                     1.1% [1.0%]    1.1% [1.0%]     0.9% [0.8%]

[9] The training set was split into development (90% of the data) and validation (10% of the data) sets to select the best performing model. The performance of the model with the lowest error rate on the test set is shown in brackets.