April 9, Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá. Linear Classification Models. Fabio A. González Ph.D.

Size: px

Start display at page:

Download "April 9, Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá. Linear Classification Models. Fabio A. González Ph.D."

Malcolm Paul
5 years ago
Views:

1 Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 9, 2018

2 Content

3 Outline

4 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold = 0 or threshold = 0.5.

5 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold = 0 or threshold = 0.5. Three ways to address the classification problem:

6 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold = 0 or threshold = 0.5. Three ways to address the classification problem: 1 Directly model the discrimination function: e.g y(x) = w T x + w 0

7 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold = 0 or threshold = 0.5. Three ways to address the classification problem: 1 Directly model the discrimination function: e.g y(x) = w T x + w 0 2 Generative model: y(x) = P(C k x) = P(x C k)p(c k ) P(x)

8 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold = 0 or threshold = 0.5. Three ways to address the classification problem: 1 Directly model the discrimination function: e.g y(x) = w T x + w 0 2 Generative model: y(x) = P(C k x) = P(x C k)p(c k ) P(x) 3 Discriminative model: with f an arbitrary function. y(x) = P(C k x) = f (x),

9 y(x) = f (w T x + w 0 ) f ( ): activation function, may be non-linear

10 y(x) = f (w T x + w 0 ) f ( ): activation function, may be non-linear Even if f ( ) is non-linear, the decision boundary is linear

11 y(x) = f (w T x + w 0 ) f ( ): activation function, may be non-linear Even if f ( ) is non-linear, the decision boundary is linear Also called generalized linear models

12 y(x) = f (w T x + w 0 ) f ( ): activation function, may be non-linear Even if f ( ) is non-linear, the decision boundary is linear Also called generalized linear models Applicable if instead of x we use a vector of basis functions φ(x), corresponding to features in a feature space

13 Using for classification We can use a model, such as least squares to fit a linear classification model with a linear activation function min w,w o l (t i w T x i + w 0 ) 2, i=1 where t i { 1, 1} is the label of the i-th training sample,

14 Using for classification We can use a model, such as least squares to fit a linear classification model with a linear activation function min w,w o l (t i w T x i + w 0 ) 2, i=1 where t i { 1, 1} is the label of the i-th training sample, but this strategy does not work well: LINEAR MODELS FOR CLASSIFICATION Figure 4.4 left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic model (green curve), which is discussed later in Section right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic.

15 Outline

16 Rosemblatt s Designed by Frank Rossemblat in 1957

17 Rosemblatt s Designed by Frank Rossemblat in 1957 A hardware implementation of the learning algorithm

18 Rosemblatt s Designed by Frank Rossemblat in 1957 A hardware implementation of the learning algorithm precursor of neural networks

19 Rosemblatt s Designed by Frank Rossemblat in 1957 A hardware implementation of the learning algorithm precursor of neural networks Criticized by Marvin Minsky, producing a decline in research funding

20 Activation function: f (a) = Perceptron learning { +1, a 0 1, a < 0

21 Activation function: Loss function: f (a) = E p (w, w 0 ) = Perceptron learning { +1, a 0 1, a < 0 l f (w T x n + w 0 )t n, n=1

22 Activation function: Loss function: f (a) = E p (w, w 0 ) = Perceptron learning { +1, a 0 1, a < 0 l f (w T x n + w 0 )t n, n=1 Learning rule: where f n = f (w T x n + w 0 ) w (n) = w (n 1) + η(f n t n )x n

23 Perceptron convergence If the training points are linearly separable, the algorithms converges ( convergence theorem)

24 Perceptron convergence If the training points are linearly separable, the algorithms converges ( convergence theorem) It could converge to different solutions depending on the order of presentation of training sample

25 Perceptron convergence If the training points are linearly separable, the algorithms converges ( convergence theorem) 4.1. Discriminant Functions 195 It could converge to different solutions depending on the order of presentation of training sample Figure Illustration of the convergence of the learning algorithm, showing data points from two

26 Perceptron problems Non-probabilistic outputs

27 Perceptron problems Non-probabilistic outputs Non-convex problem

28 Perceptron problems Non-probabilistic outputs Non-convex problem No convergence guarantee if samples are not linearly separable

29 Perceptron problems Non-probabilistic outputs Non-convex problem No convergence guarantee if samples are not linearly separable Its power may be increased by stacking several layers (multilayer s) and using smooth activation functions

30 Outline

31 Parametric discrimination se three conditions are equivalent: P(C 1 x) 0.5 P(C 1 x) 1 P(C 1 1 x) logit(p(c 1 x)) = log P(C1 x) 1 P(C 0 1 x)

32 Parametric discrimination se three conditions are equivalent: P(C 1 x) 0.5 P(C 1 x) 1 P(C 1 1 x) logit(p(c 1 x)) = log P(C1 x) 1 P(C 0 1 x) If we assume that P(x C 1 ) and P(x C 2 ) are normally distributed sharing the same covariance matrix: logit(p(c 1 x)) = log P(C 1 x) P(C 2 x) = w T x + w 0, where w = Σ 1 (µ 1 + µ 2 ) w 0 = 1 2 (µ 1 + µ 2 ) T Σ 1 (µ 1 + µ 2 ) + log P(C 1) P(C 2 )

33 logit function: function logit(p(c 1 x)) = log P(C 1 x) 1 P(C 1 x) = w T x + w 0

34 logit function: function logit(p(c 1 x)) = log inverse-logit: P(C 1 x) 1 P(C 1 x) = w T x + w 0 P(C 1 x) = σ(w T 1 x + w 0 ) = 1 + e (w T x+w 0 )

35 logit function: function logit(p(c 1 x)) = log inverse-logit: P(C 1 x) 1 P(C 1 x) = w T x + w 0 P(C 1 x) = σ(w T x + w 0 ) = e (w T x+w 0 ) σ is called the logistic or sigmoid function.

36 y(x) = P(C 1 x) = σ(w T x)

37 y(x) = P(C 1 x) = σ(w T x) Find w using maximum likelihood estimation: p(t w) = l n=1 y tn n (1 y n ) 1 tn, where t = {t 1,..., t l } and y n = y(x n ).

38 y(x) = P(C 1 x) = σ(w T x) Find w using maximum likelihood estimation: p(t w) = l n=1 y tn n (1 y n ) 1 tn, where t = {t 1,..., t l } and y n = y(x n ). Cross-entropy error: l E(w) = ln p(t w) = [t n ln y n + (1 t n ) ln(1 y n )] n=1

39 Multiclass logistic y k (x) = P(C k x) = ewt k x j ewt j x

40 Likelihood: Multiclass logistic y k (x) = P(C k x) = p(t w 1... w K ) = ewt k x j ewt j x l K n=1 k=1 y t nk nk, where y nk = y k (x n ) and T R l K is a matrix of target variables with elements t nk.

41 Likelihood: Multiclass logistic y k (x) = P(C k x) = p(t w 1... w K ) = ewt k x j ewt j x l K n=1 k=1 y t nk nk, where y nk = y k (x n ) and T R l K is a matrix of target variables with elements t nk. Multiclass cross-entropy error: E(w 1,..., w K ) = ln p(t w 1... w K ) = l n=1 k=1 K t nk ln y nk

42 Outline

43 min w Optimization problem l E(w) = min [t n ln y n + (1 t n ) ln(1 y n )] w n=1

44 min w Optimization problem l E(w) = min [t n ln y n + (1 t n ) ln(1 y n )] w n=1 E(w) = l (y n t n )φ n n=1

45 min w Optimization problem l E(w) = min [t n ln y n + (1 t n ) ln(1 y n )] w n=1 E(w) = l (y n t n )φ n n=1 w (τ+1) = w (τ) η l (y n t n )φ n n=1

46 Newton-Raphson w (τ+1) = w (τ) H 1 E(w)

47 Newton-Raphson w (τ+1) = w (τ) H 1 E(w) E(w) = Φ T (y t)

48 Newton-Raphson w (τ+1) = w (τ) H 1 E(w) E(w) = Φ T (y t) H = E(w) = Φ T RΦ, with R a diagonal matrix with R nn = y n (1 y n ).

49 Newton-Raphson w (τ+1) = w (τ) H 1 E(w) E(w) = Φ T (y t) H = E(w) = Φ T RΦ, with R a diagonal matrix with R nn = y n (1 y n ). resulting algorithm is called iterative reweighted least squares.

50 Regularization l min C [t n ln y n + (1 t n ) ln(1 y n )] + w 2 w n=1

51 Regularization l min C [t n ln y n + (1 t n ) ln(1 y n )] + w 2 w n=1 Prevents overfitting.

52 Regularization l min C [t n ln y n + (1 t n ) ln(1 y n )] + w 2 w n=1 Prevents overfitting. Equivalent to the inclusion of a prior and finding a MAP solution for W.

53 Stochastic gradient descent min w Q(w) = min w n Q i (w) i=1

54 Stochastic gradient descent min w Batch gradient descent: Q(w) = min w n Q i (w) i=1 w (τ+1) = w (τ) α Q(w) = w (τ) α n Q i (w) i=1

55 Stochastic gradient descent min w Batch gradient descent: Q(w) = min w n Q i (w) i=1 w (τ+1) = w (τ) α Q(w) = w (τ) α n Q i (w) i=1 on-line gradient descent: w (τ+1) = w (τ) α Q i (w)

56 Alpaydin, E to Machine Learning, 2Ed. MIT Press. (Chap 10) Russell, S and Norvig, P Artificial Intelligence: a Modern Approach, 3rd Ed, Prentice-Hall (Sect 18.6)

Machine Learning. 7. Logistic and Linear Regression

Machine Learning. 7. Logistic and Linear Regression Sapienza University of Rome, Italy - Machine Learning (27/28) University of Rome La Sapienza Master in Artificial Intelligence and Robotics Machine Learning 7. Logistic and Linear Regression Luca Iocchi,