Machine Learning 2017
Volker Roth
Department of Mathematics & Computer Science, University of Basel
21st March 2017

Section 5: Classification

Outline
Goal: understanding the problem settings in classification and regression estimation. Focus on classification:
- Understanding the concept of Bayesian decision theory.
- Understanding the notion of discriminant functions.
- Understanding generalized linear discriminant functions and separating hyperplanes.
- Understanding some simple classification algorithms.

Classification Example
Sorting incoming fish on a conveyor according to species using optical sensing.
Features: length, brightness, width, shape of fins.

Classification
Given an environment characterized by $p(x, y)$ with $k$ classes $c_1, \dots, c_k$; $y$ is a class indicator variable. A supervisor classifies $x$ into the $k$ classes according to $P(c \mid x)$.
Goal: construct a function that classifies in the same manner as the supervisor.
Methods: Linear Discriminant Analysis, Nearest Neighbor, Support Vector Machines, etc.

Bayesian Decision Theory
Assign an observed $x \in \mathbb{R}^d$ to one of $k$ classes. A classifier is a mapping that assigns labels to observations, $f_\alpha : x \mapsto \{1, \dots, k\}$. For any observation $x$ there exists a set of $k$ possible actions $\alpha_i$, i.e. $k$ different assignments of labels. The loss $L$ incurred for taking action $\alpha_i$ when the true label is $j$ is denoted by a loss matrix $L_{ij} = L(\alpha_i \mid c = j)$. A natural 0-1 loss function can be defined by simply counting misclassifications:
$$L_{ij} = 1 - \delta_{ij}, \qquad \delta_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

Bayesian Decision Theory (cont'd)
A classifier is trained on a set of observed pairs $\{(x_1, c_1), \dots, (x_n, c_n)\}$ drawn i.i.d. from $p(x, c) = p(c \mid x)\, p(x)$. The probability that a given $x$ is a member of class $c_j$ is obtained via the Bayes rule:
$$P(c_j \mid x) = \frac{p(x \mid c = j)\, P(c = j)}{p(x)}, \quad \text{where } p(x) = \sum_{j=1}^{k} p(x \mid c = j)\, P(c = j).$$
Given an observation $x$, the expected loss associated with choosing action $\alpha_i$ (the conditional risk) is
$$R(f_{\alpha_i} \mid x) = \sum_{j=1}^{k} L_{ij}\, P(c_j \mid x),$$
which for the 0-1 loss equals $\sum_{j \neq i} P(c_j \mid x) = 1 - P(c_i \mid x)$.

Bayesian Decision Theory (cont'd)
Goal: minimize the overall risk
$$R(f_\alpha) = \int_{\mathbb{R}^d} R(f_\alpha(x) \mid x)\, p(x)\, dx.$$
If $f_\alpha(x)$ minimizes the conditional risk $R(f_\alpha(x) \mid x)$ for every $x$, the overall risk will be minimized as well. This is achieved by the Bayes optimal classifier, which chooses the mapping
$$f^*(x) = \operatorname*{argmin}_i \sum_{j=1}^{k} L_{ij}\, P(c = j \mid x).$$
For 0-1 loss this reduces to classifying $x$ to the class with the highest posterior probability:
$$f^*(x) = \operatorname*{argmax}_i P(c = i \mid x).$$
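
To make the decision rule concrete, here is a minimal sketch (hypothetical Python helper names, not from the slides) that evaluates the conditional risk of each action under a given loss matrix and picks the minimizer; with the 0-1 loss it reduces to the argmax of the posteriors.

```python
import numpy as np

def bayes_optimal_action(posteriors, loss_matrix):
    """Choose the action minimizing the conditional risk R(f_alpha_i | x).

    posteriors  : shape (k,), the class posteriors P(c = j | x)
    loss_matrix : shape (k, k), L[i, j] = loss of action i when the true class is j
    """
    conditional_risk = loss_matrix @ posteriors   # R_i = sum_j L_ij P(c_j | x)
    return int(np.argmin(conditional_risk))

# Example: three classes with 0-1 loss, so the rule reduces to argmax of the posteriors.
k = 3
L01 = 1.0 - np.eye(k)                  # L_ij = 1 - delta_ij
post = np.array([0.2, 0.5, 0.3])       # hypothetical posteriors for one observation x
print(bayes_optimal_action(post, L01)) # prints 1, the class with the highest posterior
```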

Bayesian Decision Theory (cont'd)
Simplification: only 2 classes, so $c$ is a Binomial RV. The Bayes optimal classifier is defined by the zero crossings of the Bayes (optimal) discriminant function
$$G(x) = P(c_1 \mid x) - P(c_2 \mid x), \quad \text{or} \quad g(x) = \log \frac{P(c_1 \mid x)}{P(c_2 \mid x)}.$$
Link to regression: use the encoding $\{+1, -1\}$ for the two possible states $c_{1,2}$ of $c$. The conditional expectation of $c \mid x$ equals the Bayes discriminant function:
$$E[c \mid x] = \sum_{c \in \{+1, -1\}} c\, P(c \mid x) = P(c_1 \mid x) - P(c_2 \mid x) = G(x).$$
Classification can be viewed as a (local) approximation of $G(x) = E[c \mid x]$ near its zero crossings.

Linear Discriminant Functions
Problem: direct approximation of $G$ would require knowledge of the Bayes optimal discriminant. We have to define a hypothesis class from which a function is chosen by some inference mechanism. One such class is the set of linear discriminant functions
$$g(x; w) = w_0 + w^t x.$$
Decide $c_1$ if $g(x; w) > 0$ and $c_2$ if $g(x; w) < 0$. The equation $g(x; w) = 0$ defines the decision surface. Linearity of $g(x; w)$ implies that this surface is a hyperplane: $w$ is orthogonal to any vector lying in the plane. The hyperplane divides the feature space into half-spaces $R_1$ (the "positive side") and $R_2$ (the "negative side").

Decision Hyperplanes
$g(x; w)$ defines the distance $r$ from $x$ to the hyperplane: write $x = x_p + r\, \frac{w}{\|w\|}$ with $g(x_p) = 0$; then $g(x) = r\, \|w\|$, so $r = g(x)/\|w\|$.
[Figure: a point $x$, its projection $x_p$ onto the hyperplane $H$ defined by $g(x) = 0$, the distance $r$, the half-spaces $R_1$ and $R_2$, and the offset $w_0/\|w\|$ of $H$ from the origin.]
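
A small numeric sketch of the last two slides (hypothetical helper names): evaluate $g(x; w) = w_0 + w^t x$, decide the class from its sign, and compute the signed distance $r = g(x)/\|w\|$ to the decision hyperplane.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate g(x; w) = w0 + w^t x."""
    return w0 + w @ x

def decide(x, w, w0):
    """Decide c1 if g(x; w) > 0, otherwise c2."""
    return "c1" if linear_discriminant(x, w, w0) > 0 else "c2"

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x; w) = 0."""
    return linear_discriminant(x, w, w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])
print(decide(x, w, w0), signed_distance(x, w, w0))
```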

Generalized Linear Discriminant Functions
Use basis functions $\{b_1(x), \dots, b_m(x)\}$, where each $b_i(x): \mathbb{R}^d \to \mathbb{R}$, and
$$g(x; w) = w_0 + w_1 b_1(x) + \dots + w_m b_m(x) =: w^t y$$
(note that we have redefined $y$ here in order to be consistent with the following figure).
[Figure 5.5: The mapping $y = (1, x, x^2)^t$ takes a line and transforms it to a parabola in three dimensions. A plane splits the resulting $y$-space into regions corresponding to two categories, and this in turn gives a non-simply connected decision region in the one-dimensional $x$-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]

Generalized Linear Discriminant Functions
Use basis functions $\{b_1(x), \dots, b_m(x)\}$, where each $b_i(x): \mathbb{R}^d \to \mathbb{R}$, and
$$g(x; w) = w_0 + w_1 b_1(x) + \dots + w_m b_m(x) =: w^t y.$$
[Figure 5.6: The two-dimensional input space $x$ is mapped through a polynomial function to $y$, here $y_1 = x_1$, $y_2 = x_2$ and $y_3 = \alpha x_1 x_2$. A linear discriminant in the mapped $y$-space corresponds to a nonlinear decision boundary in $x$-space. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
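
The following sketch illustrates the basis-function idea with the quadratic map $y = (1, x, x^2)^t$ from Figure 5.5 (function names are hypothetical): after the mapping, the discriminant is still linear in the weights, yet its decision region in the original $x$-space need not be connected.

```python
import numpy as np

def quadratic_basis(x):
    """Map a scalar input x to y = (1, x, x^2)^t, as in Figure 5.5."""
    return np.array([1.0, x, x**2])

def g(x, w):
    """Generalized linear discriminant g(x; w) = w^t y with y = b(x)."""
    return w @ quadratic_basis(x)

# One possible weight vector: g(x) = x^2 - 1, so g > 0 exactly when |x| > 1,
# giving a decision region in x that consists of two disjoint pieces.
w = np.array([-1.0, 0.0, 1.0])
for x in [-2.0, 0.0, 2.0]:
    print(x, "R1" if g(x, w) > 0 else "R2")
```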

Separable Case
Consider a sample $\{y_i, c_i\}_{i=1}^n$. If there exists $f(y; w) = y^t w$ which is positive for all examples in class 1 and negative for all examples in class 2, we say that the sample is linearly separable.
"Normalization": replace all samples labeled $c_2$ by their negatives, so we can simply write $y^t w > 0$ for all samples. Each sample places a constraint on the possible location of $w$, defining a solution region.
[Figure 5.8: Four training samples (black for $\omega_1$, red for $\omega_2$) and the solution region in feature space. The left panel shows the raw data with a separating plane; the right panel shows the "normalized" data. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]

Separable Case: margin
Different solution vectors may have different margins $b$: $y^t w \ge b > 0$. Intuitively, large margins are good.
[Figure 5.9: The effect of the margin on the solution region. At the left is the case of no margin ($b = 0$), equivalent to the case shown at the left in Fig. 5.8. At the right, a margin $b > 0$ shrinks the solution region by a distance $b/\|y_i\|$ from each constraint. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification.]

Gradient Descent
Solve $y^t w > 0$ by defining $J(w)$ such that a minimizer of $J$ is a solution. Start with an initial $w(1)$, and choose the next value by moving in the direction of steepest descent (the negative gradient):
$$w(k+1) = w(k) - \eta(k)\, \nabla J(w(k)).$$
Alternatively, use a second-order expansion (Newton):
$$w(k+1) = w(k) - H^{-1} \nabla J(w(k)).$$
[Figure: contours of $J$ over weight space with the sequence of descent steps.]
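
A generic sketch of the first update rule (hypothetical names; a fixed step size stands in for $\eta(k)$, which is an assumption of mine): the caller only supplies the gradient of the chosen criterion $J$.

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, n_steps=100):
    """Steepest descent: w(k+1) = w(k) - eta * grad J(w(k)), with a fixed step size eta."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - eta * grad(w)
    return w

# Toy criterion J(w) = ||w - (1, 2)||^2 with gradient 2 (w - (1, 2)); the minimizer is (1, 2).
target = np.array([1.0, 2.0])
print(gradient_descent(lambda w: 2.0 * (w - target), w0=[0.0, 0.0]))
```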

Minimizing the Perceptron Criterion
Solve $y^t w > 0$ by defining $J(w)$ such that a minimizer of $J$ is a solution. The most obvious choice, the number of misclassifications, is not differentiable. Alternative choice:
$$J_p(w) = \sum_{y \in \mathcal{M}} (-y^t w),$$
where $\mathcal{M}(w)$ is the set of samples misclassified by $w$. Since $y^t w < 0$ for all $y \in \mathcal{M}$, $J_p$ is non-negative, and zero only if $w$ is a solution. Gradient:
$$\nabla J_p(w) = -\sum_{y \in \mathcal{M}} y \quad \Rightarrow \quad w(k+1) = w(k) + \eta(k) \sum_{y \in \mathcal{M}} y.$$
This defines the Batch Perceptron algorithm.
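
A minimal sketch of the Batch Perceptron (hypothetical names; the data are assumed to be already "normalized" as on the previous slides, i.e. class-2 samples negated):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, n_iter=100):
    """Batch Perceptron: repeatedly add eta * sum of currently misclassified samples.

    Y : (n, d) array of normalized samples, so a solution satisfies y^t w > 0 for every row.
    """
    w = np.zeros(Y.shape[1])
    for _ in range(n_iter):
        M = Y[Y @ w <= 0]                # samples misclassified by the current w
        if len(M) == 0:                  # all constraints satisfied: w is a solution
            break
        w = w + eta * M.sum(axis=0)      # w(k+1) = w(k) + eta * sum_{y in M} y
    return w

# Tiny separable example in homogeneous coordinates (1, x1, x2), class-2 rows negated.
Y = np.array([[1, 2, 1], [1, 1, 2], [-1, 1, -2], [-1, 2, -3]], dtype=float)
w = batch_perceptron(Y)
print(w, (Y @ w > 0).all())
```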

Minimizing the Perceptron Criterion (2)
[Figure: the perceptron criterion $J_p$ plotted as a surface over the weight space $(a_1, a_2)$, together with the samples $y_1, y_2, y_3$ and the solution region.]

Fixed-Increment Single Sample Perceptron
Fix the learning rate $\eta(k) = 1$. Sequential single-sample updates: use superscripts $y^1, y^2, \dots$ for misclassified samples $y \in \mathcal{M}$. The ordering is irrelevant. Simple algorithm:
$$w(1) \text{ arbitrary}, \qquad w(k+1) = w(k) + y^k, \quad k \ge 1.$$
Perceptron Convergence Theorem: If the samples are linearly separable, the sequence of weight vectors given by the Fixed-Increment Single Sample Perceptron algorithm will terminate at a solution vector. Proof: exercises.
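
The same toy data as above, now with the single-sample variant (a sketch with hypothetical names; it cycles through the samples in a fixed order, which the slide notes is irrelevant):

```python
import numpy as np

def fixed_increment_perceptron(Y, max_passes=1000):
    """Fixed-increment single-sample perceptron: w(k+1) = w(k) + y^k for misclassified y^k."""
    w = np.zeros(Y.shape[1])             # w(1) arbitrary; zero is a common choice
    for _ in range(max_passes):
        updated = False
        for y in Y:                      # present the samples one at a time
            if y @ w <= 0:               # y is misclassified by the current w
                w = w + y                # fixed-increment update with eta = 1
                updated = True
        if not updated:                  # a full pass without mistakes: solution found
            return w
    raise RuntimeError("no separating vector found; the data may not be separable")

Y = np.array([[1, 2, 1], [1, 1, 2], [-1, 1, -2], [-1, 2, -3]], dtype=float)
print(fixed_increment_perceptron(Y))
```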

Issues
A number of problems with the perceptron algorithm:
- When the data are separable, there are many solutions, and which one is found depends on the starting values. In particular, no separation margin can be guaranteed (however, modified versions exist).
- The number of steps can be very large.
- When the data are not separable, the algorithm will not necessarily converge, and cycles may occur. The cycles can be long and therefore hard to detect.
- The method is technical in nature, with no (obvious) connection to learning theory.

Informative (or Generative) vs Discriminative
Recall: classification can be viewed as a (local) approximation of $G(x) = E[c \mid x]$ near its zero crossings. Two main strategies for solving this problem:
- Informative: informative classifiers model the class densities. The likelihood of each class is examined and classification is done by assigning $x$ to the most likely class.
- Discriminative: these classifiers focus on modeling the class boundaries or the class membership probabilities directly. No attempt is made to model the underlying class conditional densities.

Informative Classifiers
Central idea: model the conditional class densities $p(x \mid c)$. Use the Bayes rule to compute the posteriors:
$$P(c_j \mid x) = \frac{p(x \mid c = j)\, P(c = j)}{p(x)}.$$
Assuming a parametrized class-conditional density $p_{\theta_j}(x \mid c = j)$ and collecting all model parameters in a vector $\theta$, these parameters are usually estimated by maximizing the full log likelihood
$$\hat{\theta}_{\text{MLE}} = \operatorname*{argmax}_\theta \sum_{i=1}^{n} \log p_\theta(x_i, c_i) = \operatorname*{argmax}_\theta \sum_{i=1}^{n} \left( \log p_\theta(x_i \mid c_i) + \log P_\theta(c_i) \right).$$

Informative Classifiers: LDA
In Linear Discriminant Analysis (LDA), a Gaussian model is used where all classes share a common covariance matrix $\Sigma$:
$$p_\theta(x \mid c = j) = \mathcal{N}(x; \mu_j, \Sigma).$$
The resulting discriminant functions are linear:
$$g(x) = \log \frac{P(c = 1)\, \mathcal{N}(x; \mu_1, \Sigma)}{P(c = 2)\, \mathcal{N}(x; \mu_2, \Sigma)} = \underbrace{\log \frac{\pi_1}{\pi_2} - \tfrac{1}{2} (\mu_1 + \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)}_{w_0} + \underbrace{(\mu_1 - \mu_2)^T \Sigma^{-1} x}_{w^T x} = w_0 + w^T x.$$

Informative Classifiers: LDA
The discriminant function $g(x)$ measures differences in the Mahalanobis distances from $x$ to the $k$-th class mean:
$$D(x, \mu_k) = (x - \mu_k)^T \Sigma^{-1} (x - \mu_k).$$
Bayes-optimal decision rule: measure the Mahalanobis distances to each class and assign $x$ to the class with the (prior-corrected) smallest distance. Connection to the LDA algorithm: let $\Sigma_W$ be an estimate of the shared covariance matrix $\Sigma$, and $m_j$ an estimate of $\mu_j$. LDA finds the weight vector
$$w_{\text{LDA}} = \Sigma_W^{-1} (m_1 - m_2).$$
This $w_{\text{LDA}}$ asymptotically coincides with the Bayes-optimal $w$ (if the Gaussian model is correct!).
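
A minimal plug-in sketch of this estimate (hypothetical names; equal class priors are assumed so that the $\log(\pi_1/\pi_2)$ term in $w_0$ vanishes): estimate the class means and the pooled covariance, then take $w = \Sigma_W^{-1}(m_1 - m_2)$.

```python
import numpy as np

def lda_weights(X1, X2):
    """Plug-in LDA: direction w = Sigma_W^{-1} (m1 - m2) and offset w0 for equal priors."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class covariance as an estimate of the shared Sigma
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sigma_W = (S1 + S2) / (len(X1) + len(X2) - 2)
    w = np.linalg.solve(Sigma_W, m1 - m2)
    w0 = -0.5 * (m1 + m2) @ w            # log(pi_1/pi_2) = 0 under equal priors
    return w, w0

rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], 1.0, size=(100, 2))   # class-1 samples
X2 = rng.normal([2.0, 1.0], 1.0, size=(100, 2))   # class-2 samples
w, w0 = lda_weights(X1, X2)
print("decide c1 iff w.x + w0 > 0, with w =", w, "and w0 =", w0)
```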

LDA: Algorithmic view
[Figure 3.5: Projection of the same set of samples onto two different lines in the directions marked $w$. The figure on the right shows greater separation between the red and black projected points. From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright 2001 by John Wiley & Sons, Inc.]
Question: How should we select the projection vector $w$ which optimally discriminates between the different classes?

LDA: Algorithmic view
Samples: $\{(x_1, c_1), \dots, (x_n, c_n)\}$. Partition the sample set into class-specific subsets $X_c$. Projected samples: $\tilde{x}_i = w^T x_i$, $1 \le i \le n$. What is a measure for the discrimination of the projected points?
Sample averages:
$$m_c = \frac{1}{n_c} \sum_{x \in X_c} x, \quad \text{with } n_c = |X_c|.$$
Sample averages of the projected points:
$$\tilde{m}_c = \frac{1}{n_c} \sum_{x \in X_c} w^T x = w^T m_c.$$
Distance of the averages in the 2-class case: $\tilde{m}_1 - \tilde{m}_2 = w^T (m_1 - m_2)$.

LDA: Algorithmic view
Scatter of class $c$:
$$\Sigma_c = \sum_{x \in X_c} (x - m_c)(x - m_c)^T.$$
The scatter matrix is an unscaled estimate of the covariance matrix. "Within" scatter: $\Sigma_W = \Sigma_1 + \Sigma_2$ (unscaled average). Scatter of the projected data:
$$\tilde{\Sigma}_c = \sum_{x \in X_c} (w^T x - \tilde{m}_c)(w^T x - \tilde{m}_c)^T = \sum_{x \in X_c} w^T (x - m_c)(x - m_c)^T w = w^T \Sigma_c w,$$
and in the 2-class case $\tilde{\Sigma}_1 + \tilde{\Sigma}_2 = w^T \Sigma_W w$.

LDA: Algorithmic view
$$J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{\Sigma}_1 + \tilde{\Sigma}_2} = \frac{w^T \overbrace{(m_1 - m_2)(m_1 - m_2)^T}^{=: \Sigma_B} w}{w^T \Sigma_W w}.$$
Idea: the optimal $w$ can be found by maximization of $J(w)$!
$$\frac{d}{dw} J(w) = \frac{d}{dw}\left( (w^T \Sigma_W w)^{-1}\, w^T \Sigma_B w \right) = -2 \Sigma_W w\, (w^T \Sigma_W w)^{-1}\, w^T \Sigma_B w\, (w^T \Sigma_W w)^{-1} + 2 \Sigma_B w\, (w^T \Sigma_W w)^{-1} = 0$$
$$\Rightarrow \quad \Sigma_B w = \Sigma_W w\, \frac{w^T \Sigma_B w}{w^T \Sigma_W w}.$$

LDA: Algorithmic view
Generalized eigenvalue problem: let $\Sigma_W$ be non-singular:
$$\Sigma_W^{-1} \underbrace{\Sigma_B w}_{\propto\, (m_1 - m_2)} = \lambda w, \quad \text{with } \lambda = \frac{w^T \Sigma_B w}{w^T \Sigma_W w}.$$
Unscaled solution:
$$\hat{w} = \Sigma_W^{-1} (m_1 - m_2).$$
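
The sketch below (hypothetical names, synthetic data) builds the two scatter matrices, forms $\hat{w} = \Sigma_W^{-1}(m_1 - m_2)$, and checks numerically that it satisfies the generalized eigenvalue relation stated above.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0.0, 0.0], 1.0, size=(50, 2))   # class-1 samples
X2 = rng.normal([3.0, 1.0], 1.0, size=(50, 2))   # class-2 samples
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

# within and between scatter matrices
Sigma_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
Sigma_B = np.outer(m1 - m2, m1 - m2)

w_hat = np.linalg.solve(Sigma_W, m1 - m2)        # unscaled solution Sigma_W^{-1} (m1 - m2)

# verify Sigma_W^{-1} Sigma_B w_hat = lambda * w_hat with lambda = (w^T Sigma_B w)/(w^T Sigma_W w)
lhs = np.linalg.solve(Sigma_W, Sigma_B @ w_hat)
lam = (w_hat @ Sigma_B @ w_hat) / (w_hat @ Sigma_W @ w_hat)
print(np.allclose(lhs, lam * w_hat))             # True
```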

Discriminative classifiers
Discriminative classifiers focus directly on the discriminant function. In general, they are more flexible with regard to the class-conditional densities they are capable of modeling. Bayes formula:
$$g(x) = \log \frac{P(c = 1 \mid x)}{P(c = 2 \mid x)} = \log \frac{p(x \mid c = 1)\, P(c = 1)}{p(x \mid c = 2)\, P(c = 2)}.$$
They can model any conditional probabilities that are exponential "tilts" of each other:
$$\frac{p(x \mid c = 1)}{p(x \mid c = 2)} = e^{g(x)}\, \frac{P(c = 2)}{P(c = 1)}.$$

Discriminative classifiers
Since the discriminative model directly focuses on the posteriors, it is natural to estimate the model parameters $\theta$ in $p_\theta(x \mid c)$ by maximizing the conditional log likelihood
$$\hat{\theta}_{\text{DISCR}} = \operatorname*{argmax}_\theta \sum_{i=1}^{n} \log p_\theta(c_i \mid x_i).$$
The discriminative model makes no use of the information contained in the marginal density $p(x)$. Thus, if the density model is correct, useful information is ignored and the informative approach may give better results. However, if the class models are incorrect, it may be better not to use them.

Logistic Regression (LOGREG)
Logistic regression uses a linear discriminant function, i.e. $g(x) = w^T x + w_0$. For the special case $p(x \mid c \in \{1, 0\}) = \mathcal{N}(x; \mu_{1,0}, \Sigma)$, LDA and LOGREG coincide:
$$p(x \mid c = 1) = \mathcal{N}(x; \mu_1, \Sigma) = e^{w^T x + w_0}\, \frac{\pi_0}{\pi_1}\, \mathcal{N}(x; \mu_0, \Sigma),$$
with
$$g(x) = w_0 + w^T x = \log \frac{\pi_1\, \mathcal{N}(x; \mu_1, \Sigma)}{\pi_0\, \mathcal{N}(x; \mu_0, \Sigma)}.$$

Logistic Regression (LOGREG)
Two-class problem with a Binomial RV $c$ taking values in $\{0, 1\}$: it is sufficient to represent $P(1 \mid x)$, since $P(0 \mid x) = 1 - P(1 \mid x)$. Success probability of the Binomial RV: $\pi(x) := P(1 \mid x)$.
Basketball example: [figure]

Logistic Regression (LOGREG)
LOGREG:
$$\log \frac{P(c = 1 \mid x)}{P(c = 0 \mid x)} = \log \frac{\pi(x)}{1 - \pi(x)} = w^T x + w_0.$$
Thus, the conditional expectation of $c \mid x$ satisfies the logistic regression model:
$$\pi(x) := E[c \mid x] = \frac{\exp\{w^T x + w_0\}}{1 + \exp\{w^T x + w_0\}} = \sigma(w^T x + w_0).$$
The logistic squashing function $\sigma(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}$ turns linear predictions into probabilities.

Logistic Regression (LOGREG)
Assume that $w_0$ is absorbed in $w$ by using $x \to (1 : x)$. Estimate $w$ by maximizing the conditional likelihood
$$\hat{w}_{\text{DISCR}} = \operatorname*{argmax}_w \prod_{i=1}^{n} p(c_i \mid x_i; w) = \operatorname*{argmax}_w \prod_{i=1}^{n} (\pi(x_i; w))^{c_i} (1 - \pi(x_i; w))^{1 - c_i},$$
or by maximizing the corresponding log likelihood $l$:
$$l(w) = \sum_{i=1}^{n} \left[ c_i \log \pi(x_i; w) + (1 - c_i) \log(1 - \pi(x_i; w)) \right].$$
The score functions are defined as
$$s(w) = \nabla_w\, l(w) = \sum_{i=1}^{n} x_i (c_i - \pi_i).$$

Logistic Regression (LOGREG)
Since the $\pi_i$ depend non-linearly on $w$, the equation system $s(w) = 0$ usually cannot be solved analytically and iterative techniques must be applied. Newton's method updates the parameter estimates $w$ at the $r$-th step by
$$w^{(r+1)} = w^{(r)} - \{H^{(r)}\}^{-1} s^{(r)},$$
where $H^{(r)}$ is the Hessian of $l$, evaluated at $w^{(r)}$:
$$H^{(r)} = \left( \frac{\partial^2 l}{\partial w\, \partial w^T} \right)_{w = w^{(r)}} = -\sum_{i=1}^{n} \pi_i^{(r)} (1 - \pi_i^{(r)})\, x_i x_i^T.$$

Logistic Regression (LOGREG)
The Newton updates can be restated as an Iterated Re-weighted Least Squares (IRLS) problem. The Hessian is equal to $-X^T W X$, with $W = \operatorname{diag}\{\pi_1(1 - \pi_1), \dots, \pi_n(1 - \pi_n)\}$. The gradient of $l$ (the scores) can be written as $s = X^T W e$, where $e$ is a vector with entries $e_j = (c_j - \pi_j)/W_{jj}$. With $q^{(r)} := X w^{(r)} + e$, the updates read
$$(X^T W^{(r)} X)\, w^{(r+1)} = X^T W^{(r)} q^{(r)}.$$
These are the normal equations of a LS problem with input matrix $(W^{(r)})^{1/2} X$ and r.h.s. $(W^{(r)})^{1/2} q^{(r)}$. The values $W_{ii}$ are functions of the current $w^{(r)}$, so that iteration is needed.
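
A compact sketch of these IRLS updates (hypothetical names; the small ridge term and the clipping of $W_{jj}$ are numerical-stability assumptions of mine, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_irls(X, c, n_iter=25, ridge=1e-8):
    """Fit logistic regression by IRLS: solve (X^T W X) w_new = X^T W q with q = X w + e."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        pi = sigmoid(X @ w)                       # pi_i = sigma(w^T x_i)
        W = pi * (1.0 - pi)                       # diagonal of W
        e = (c - pi) / np.maximum(W, 1e-12)       # e_j = (c_j - pi_j) / W_jj
        q = X @ w + e                             # working response q = X w + e
        A = X.T @ (W[:, None] * X) + ridge * np.eye(d)
        w = np.linalg.solve(A, X.T @ (W * q))
    return w

# Toy data; a column of ones absorbs w0, as on the previous slide.
rng = np.random.default_rng(2)
x = rng.normal(size=(200, 1))
X = np.hstack([np.ones((200, 1)), x])
c = (rng.random(200) < sigmoid(2.0 * x[:, 0] - 0.5)).astype(float)
print(logreg_irls(X, c))                          # roughly (-0.5, 2.0), up to sampling noise
```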

Logistic Regression (LOGREG)
A simple binary classification problem in $\mathbb{R}^2$, solved with LOGREG using polynomial basis functions. [figure]

Loss functions
LOGREG maximizes the log likelihood
$$l(w) = \sum_{i=1}^{n} \left[ c_i \log \pi(x_i; w) + (1 - c_i) \log(1 - \pi(x_i; w)) \right], \quad \text{where } \pi = \frac{1}{1 + e^{-z}},\ z = w^t x.$$
This is the same as minimizing
$$-l(w) = \sum_{i=1}^{n} \left[ -c_i \log \pi(x_i; w) - (1 - c_i) \log(1 - \pi(x_i; w)) \right] = \sum_{i=1}^{n} \left[ \log(1 + e^{-z_i}) + (1 - c_i) z_i \right] =: \sum_{i=1}^{n} \text{Loss}(c_i, z_i).$$
Loss function: $(1 - c_i) z + \log(1 + e^{-z})$.
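
A tiny check of this loss (hypothetical helper name): for $c = 1$ the loss decreases as $z = w^t x$ grows, while for $c = 0$ it grows with $z$, so confidently wrong predictions are penalized most.

```python
import numpy as np

def logreg_loss(c, z):
    """Per-sample LOGREG loss: (1 - c) * z + log(1 + exp(-z))."""
    return (1.0 - c) * z + np.log1p(np.exp(-z))

z = np.linspace(-2.0, 3.0, 6)
print(logreg_loss(1.0, z))   # decreasing in z: correct, confident predictions are cheap
print(logreg_loss(0.0, z))   # increasing in z: large positive z is penalized when c = 0
```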

Loss functions
[Figure: the LOGREG loss plotted as a function of $z = w^t x$.]

Summing up
Classification can be viewed as a (local) approximation of $G(x) = E[c \mid x]$ near its zero crossings. We have to define a hypothesis class from which a function is chosen by some inference mechanism. One such class is the set of linear discriminant functions $g(x; w) = w_0 + w^t x$, which can be easily generalized to arbitrary basis functions. Two main strategies for solving this problem: informative (use the marginal density $p(x)$) and discriminative (ignore $p(x)$). Two simple models: LDA and LOGREG.