Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 9, 2018
Outline
1 Linear models for classification
2 The perceptron
3 Logistic regression
4 Optimization
Classification problems

predict(x) = { C_1, y(x) ≥ threshold
             { C_2, y(x) < threshold,

with threshold = 0 or threshold = 0.5.

Three ways to address the classification problem:
1 Directly model the discriminant function, e.g. y(x) = w^T x + w_0
2 Generative model: y(x) = P(C_k | x) = P(x | C_k) P(C_k) / P(x)
3 Discriminative model: y(x) = P(C_k | x) = f(x), with f an arbitrary function.
y(x) = f(w^T x + w_0)

- f(·): activation function, may be non-linear
- Even if f(·) is non-linear, the decision boundary is linear
- Also called generalized linear models
- Applicable if, instead of x, we use a vector of basis functions φ(x), corresponding to features in a feature space
Using regression for classification

We can use a regression model, such as least squares, to fit a linear classification model with a linear activation function:

min_{w, w_0} Σ_{i=1}^{l} (t_i − w^T x_i − w_0)²,

where t_i ∈ {−1, 1} is the label of the i-th training sample. But this strategy does not work well.

[Figure 4.4: the left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta) and by the logistic model (green), discussed later in Section 4.3.2. The right plot shows the corresponding results when extra data points are added at the bottom left, showing that least squares is highly sensitive to outliers, unlike logistic regression.]
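A minimal numpy sketch of this least-squares fit (the synthetic data and names are illustrative): the bias w_0 is absorbed into w by appending a constant feature, and the normal equations are solved with `lstsq`.

```python
import numpy as np

# Illustrative sketch: least-squares fit of a linear classifier on +/-1 labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, -2], 0.5, size=(20, 2)),   # class with t = -1
               rng.normal([2, 2], 0.5, size=(20, 2))])    # class with t = +1
t = np.concatenate([-np.ones(20), np.ones(20)])

# Absorb w_0 into w by appending a constant feature of 1.
Xa = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Xa, t, rcond=None)  # minimizes sum_i (t_i - w^T x_i - w_0)^2

def predict(Xnew):
    Xn = np.hstack([Xnew, np.ones((len(Xnew), 1))])
    return np.where(Xn @ w >= 0, 1.0, -1.0)
```

On well-separated data like this the fit looks fine; the outlier sensitivity shown in the figure appears once distant points are added, since the squared loss keeps penalizing samples that are already correctly classified.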
Rosenblatt's perceptron

- Designed by Frank Rosenblatt in 1957
- A hardware implementation of the learning algorithm
- Precursor of neural networks
- Criticized by Marvin Minsky, producing a decline in research funding
Perceptron learning

Activation function:

f(a) = { +1, a ≥ 0
       { −1, a < 0

Loss function:

E_p(w, w_0) = − Σ_{n=1}^{l} f(w^T x_n + w_0) t_n

Learning rule:

w^{(n)} = w^{(n−1)} − η (f_n − t_n) x_n,

where f_n = f(w^T x_n + w_0).
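A small Python sketch of the learning rule above (function name and toy data are illustrative). Samples are visited one at a time and the weights change only on misclassified samples:

```python
import numpy as np

def perceptron_train(X, t, eta=1.0, epochs=100):
    """Perceptron rule sketch: w <- w - eta * (f_n - t_n) * x_n, per sample.

    X: (l, d) inputs; t: labels in {-1, +1}. The bias w_0 is handled by
    augmenting each x_n with a constant feature of 1.
    """
    Xa = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xa.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, t_n in zip(Xa, t):
            f_n = 1.0 if x_n @ w >= 0 else -1.0   # sign activation
            if f_n != t_n:
                w -= eta * (f_n - t_n) * x_n      # equivalently w += 2*eta*t_n*x_n
                errors += 1
        if errors == 0:                            # converged (separable case)
            break
    return w
```

Note the update only fires on mistakes, which is why the final solution depends on the order in which samples are presented.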
Perceptron convergence

- If the training points are linearly separable, the algorithm converges (perceptron convergence theorem)
- It could converge to different solutions depending on the order of presentation of the training samples

[Figure 4.7 (p. 195): illustration of the convergence of the perceptron learning algorithm, showing data points from two classes.]
Perceptron problems

- Non-probabilistic outputs
- Non-convex problem
- No convergence guarantee if samples are not linearly separable
- Its power may be increased by stacking several layers (multilayer perceptrons) and using smooth activation functions
Parametric discrimination

These three conditions are equivalent:

P(C_1 | x) ≥ 0.5
P(C_1 | x) ≥ 1 − P(C_1 | x)
logit(P(C_1 | x)) = log [ P(C_1 | x) / (1 − P(C_1 | x)) ] ≥ 0

If we assume that P(x | C_1) and P(x | C_2) are normally distributed sharing the same covariance matrix Σ:

logit(P(C_1 | x)) = log [ P(C_1 | x) / P(C_2 | x) ] = w^T x + w_0,

where

w = Σ^{−1} (μ_1 − μ_2)
w_0 = −(1/2) (μ_1 + μ_2)^T Σ^{−1} (μ_1 − μ_2) + log [ P(C_1) / P(C_2) ]
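The weights above follow directly from the class means, the shared covariance, and the priors; a numpy sketch (function name illustrative):

```python
import numpy as np

def lda_weights(mu1, mu2, Sigma, p1=0.5):
    """Linear discriminant from shared-covariance Gaussians (a sketch).

    Returns (w, w0) such that logit P(C1|x) = w^T x + w0.
    """
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ (mu1 - mu2)
    w0 = -0.5 * (mu1 + mu2) @ Sigma_inv @ (mu1 - mu2) + np.log(p1 / (1 - p1))
    return w, w0
```

With equal priors the boundary w^T x + w_0 = 0 passes through the midpoint of the two means, as expected.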
The logistic function

logit:

logit(P(C_1 | x)) = log [ P(C_1 | x) / (1 − P(C_1 | x)) ] = w^T x + w_0

inverse logit:

P(C_1 | x) = σ(w^T x + w_0) = 1 / (1 + e^{−(w^T x + w_0)})

σ is called the logistic or sigmoid function.
Logistic regression

y(x) = P(C_1 | x) = σ(w^T x)

Find w using maximum likelihood estimation:

p(t | w) = Π_{n=1}^{l} y_n^{t_n} (1 − y_n)^{1 − t_n},

where t = {t_1, ..., t_l} and y_n = y(x_n).

Cross-entropy error:

E(w) = −ln p(t | w) = − Σ_{n=1}^{l} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ]
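A numpy sketch of σ and of the cross-entropy error E(w) (helper names are illustrative; the clipping constant is an added safeguard against log(0)):

```python
import numpy as np

def sigmoid(a):
    # Numerically stable logistic function 1 / (1 + exp(-a)).
    return np.where(a >= 0,
                    1.0 / (1.0 + np.exp(-np.abs(a))),
                    np.exp(-np.abs(a)) / (1.0 + np.exp(-np.abs(a))))

def cross_entropy(w, X, t):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], y_n = sigmoid(w^T x_n)
    y = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```

The two-branch form of `sigmoid` avoids overflow in `exp` for large negative arguments; it computes the same value as 1/(1+e^{−a}) on both branches.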
Multiclass logistic regression

y_k(x) = P(C_k | x) = e^{w_k^T x} / Σ_j e^{w_j^T x}

Likelihood:

p(T | w_1, ..., w_K) = Π_{n=1}^{l} Π_{k=1}^{K} y_{nk}^{t_{nk}},

where y_{nk} = y_k(x_n) and T ∈ R^{l×K} is a matrix of target variables with elements t_{nk}.

Multiclass cross-entropy error:

E(w_1, ..., w_K) = −ln p(T | w_1, ..., w_K) = − Σ_{n=1}^{l} Σ_{k=1}^{K} t_{nk} ln y_{nk}
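The softmax posterior and the multiclass cross-entropy can be sketched as follows (names illustrative; subtracting the row maximum before `exp` is a standard numerical-stability trick that leaves the ratio unchanged):

```python
import numpy as np

def softmax(A):
    # Row-wise softmax: e^{a_k} / sum_j e^{a_j}, stabilized by max-subtraction.
    A = A - np.max(A, axis=-1, keepdims=True)
    E = np.exp(A)
    return E / np.sum(E, axis=-1, keepdims=True)

def multiclass_cross_entropy(W, X, T):
    # E = -sum_n sum_k t_nk ln y_nk, with Y = softmax(X W), W of shape (d, K).
    Y = softmax(X @ W)
    return -np.sum(T * np.log(np.clip(Y, 1e-12, 1.0)))
```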
Optimization problem

min_w E(w) = min_w − Σ_{n=1}^{l} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ]

Gradient:

∇E(w) = Σ_{n=1}^{l} (y_n − t_n) φ_n

Gradient descent:

w^{(τ+1)} = w^{(τ)} − η Σ_{n=1}^{l} (y_n − t_n) φ_n
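A batch gradient-descent sketch following the update above (the step size and iteration count are illustrative choices, not tuned values):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(Phi, t, eta=0.1, iters=2000):
    """Batch gradient descent: w <- w - eta * Phi^T (y - t)  (a sketch)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        w -= eta * Phi.T @ (y - t)   # gradient of the cross-entropy error
    return w
```

Here `Phi` is the design matrix whose rows are the feature vectors φ_n (a constant column plays the role of w_0).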
Newton-Raphson

w^{(τ+1)} = w^{(τ)} − H^{−1} ∇E(w)

∇E(w) = Φ^T (y − t)

H = ∇∇E(w) = Φ^T R Φ, with R a diagonal matrix with R_{nn} = y_n (1 − y_n).

The resulting algorithm is called iterative reweighted least squares (IRLS).
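An IRLS sketch built directly from these formulas (the small ridge term added to H is an illustrative safeguard against a near-singular Hessian, not part of the original algorithm):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(Phi, t, iters=10):
    """Newton-Raphson / IRLS sketch: w <- w - (Phi^T R Phi)^{-1} Phi^T (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        R = np.diag(y * (1.0 - y))                         # R_nn = y_n (1 - y_n)
        H = Phi.T @ R @ Phi + 1e-8 * np.eye(Phi.shape[1])  # tiny ridge for stability
        w = w - np.linalg.solve(H, Phi.T @ (y - t))
    return w
```

Because each Newton step solves a weighted least-squares problem with weights R, convergence is typically reached in a handful of iterations; note that on perfectly separable data the unregularized maximum-likelihood weights diverge, so this sketch is best exercised on overlapping classes.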
Regularization

min_w −C Σ_{n=1}^{l} [ t_n ln y_n + (1 − t_n) ln(1 − y_n) ] + ||w||²

- Prevents overfitting.
- Equivalent to the inclusion of a prior on w and finding a MAP solution.
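A sketch of gradient descent on this regularized objective; the ||w||² penalty adds a 2w term to the gradient (hyperparameter values are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_l2(Phi, t, C=1.0, eta=0.02, iters=5000):
    """Gradient descent on C * cross-entropy + ||w||^2 (a sketch).

    Gradient: C * Phi^T (y - t) + 2 w.  Smaller C means stronger shrinkage.
    """
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = sigmoid(Phi @ w)
        w -= eta * (C * Phi.T @ (y - t) + 2.0 * w)
    return w
```

With this parameterization (as in sklearn-style objectives), decreasing C increases the relative weight of the penalty, so the fitted weights shrink toward zero.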
Stochastic gradient descent

min_w Q(w) = min_w Σ_{i=1}^{n} Q_i(w)

Batch gradient descent:

w^{(τ+1)} = w^{(τ)} − α ∇Q(w) = w^{(τ)} − α Σ_{i=1}^{n} ∇Q_i(w)

On-line (stochastic) gradient descent:

w^{(τ+1)} = w^{(τ)} − α ∇Q_i(w)
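An on-line sketch for logistic regression: one training sample per update, with reshuffling each epoch (names and hyperparameters are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_sgd(Phi, t, alpha=0.1, epochs=50, seed=0):
    """On-line gradient descent: w <- w - alpha * grad Q_i(w), one i at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(t)):          # new sample order each epoch
            y_i = sigmoid(Phi[i] @ w)
            w -= alpha * (y_i - t[i]) * Phi[i]     # gradient of Q_i(w)
    return w
```

Each update costs O(d) instead of O(ld), which is why this variant scales to large training sets at the price of noisier steps.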
References

- Alpaydin, E. (2010). Introduction to Machine Learning, 2nd Ed. MIT Press. (Chap. 10)
- Russell, S. and Norvig, P. (2010). Artificial Intelligence: A Modern Approach, 3rd Ed. Prentice-Hall. (Sect. 18.6)
- Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer. (Chap. 4; source of Figures 4.4 and 4.7)