Statistical Machine Learning Hilary Term 2018

Statistical Machine Learning, Hilary Term 2018
Pier Francesco Palamara
Department of Statistics, University of Oxford
Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
February 14, 2018

Discriminant analysis: Plug-in classification

Consider the 0-1 loss and the risk:
E[ L(Y, f(X)) | X = x ] = sum_{k=1}^K L(k, f(x)) p(Y = k | X = x).
The Bayes classifier provides a solution that minimizes the risk:
f_Bayes(x) = argmax_{k=1,...,K} π_k g_k(x).
We know neither the conditional densities g_k nor the class probabilities π_k!
The plug-in classifier chooses the class
f(x) = argmax_{k=1,...,K} π̂_k ĝ_k(x),
where we plug in estimates π̂_k of the class probabilities π_k and estimates ĝ_k(x) of the conditional densities, for k = 1,...,K.
Linear Discriminant Analysis is an example of plug-in classification.

Discriminant analysis: Summary of Linear Discriminant Analysis

LDA: a plug-in classifier assuming a multivariate normal conditional density g_k(x) = g_k(x | µ_k, Σ) for each class k, all sharing the same covariance Σ:
X | Y = k ~ N(µ_k, Σ),
g_k(x | µ_k, Σ) = (2π)^{-p/2} |Σ|^{-1/2} exp( -1/2 (x - µ_k)^T Σ^{-1} (x - µ_k) ).
LDA minimizes the squared Mahalanobis distance between x and µ̂_k, offset by a term depending on the estimated class proportion π̂_k:
f_LDA(x) = argmax_{k ∈ {1,...,K}} log( π̂_k g_k(x | µ̂_k, Σ̂) )
         = argmax_{k ∈ {1,...,K}} [ ( log π̂_k - 1/2 µ̂_k^T Σ̂^{-1} µ̂_k ) + ( Σ̂^{-1} µ̂_k )^T x ]      (terms depending on k; linear in x)
         = argmin_{k ∈ {1,...,K}} [ 1/2 (x - µ̂_k)^T Σ̂^{-1} (x - µ̂_k) - log π̂_k ].                  (squared Mahalanobis distance, offset by log π̂_k)
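For concreteness, a minimal R sketch of the plug-in LDA rule above (this is not the course's code; the slides later use MASS::lda; function and variable names lda_fit, lda_predict, X, y are illustrative):

## X: n x p matrix of features, y: vector of class labels
lda_fit <- function(X, y) {
  classes <- sort(unique(y))
  n <- nrow(X); p <- ncol(X)
  pi_hat <- sapply(classes, function(k) mean(y == k))   # class proportions pi_k
  mu_hat <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))
  # pooled within-class covariance (the shared Sigma)
  Sigma_hat <- matrix(0, p, p)
  for (k in classes) {
    Xc <- scale(X[y == k, , drop = FALSE], center = TRUE, scale = FALSE)
    Sigma_hat <- Sigma_hat + t(Xc) %*% Xc
  }
  Sigma_hat <- Sigma_hat / n
  list(classes = classes, pi = pi_hat, mu = mu_hat, Sigma = Sigma_hat)
}

lda_predict <- function(fit, Xnew) {
  Sinv <- solve(fit$Sigma)
  # discriminant: log pi_k - 1/2 mu_k' Sinv mu_k + (Sinv mu_k)' x
  scores <- sapply(seq_along(fit$classes), function(k) {
    mu <- fit$mu[k, ]
    log(fit$pi[k]) - 0.5 * drop(t(mu) %*% Sinv %*% mu) + Xnew %*% (Sinv %*% mu)
  })
  scores <- matrix(scores, nrow = nrow(Xnew))
  fit$classes[max.col(scores)]   # class with the largest discriminant
}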

Discriminant analysis: LDA projections. [Figure by R. Gutierrez-Osuna.]

Discriminant analysis: LDA vs PCA projections. [Figure: the data in the LDA projection and in the first two principal components, Comp.1 and Comp.2.] LDA separates the groups better.

Discriminant analysis: Fisherfaces. Eigenfaces vs. Fisherfaces, Belhumeur et al. 1997, http://ieeexplore.ieee.org/document/598228/

Quadratic Discriminant Analysis: conditional densities with different covariances

Given training data with K classes, assume a parametric form for the conditional density g_k(x), where for each class
X | Y = k ~ N(µ_k, Σ_k),
i.e., instead of assuming that every class has a different mean µ_k with the same covariance matrix Σ (LDA), we now allow each class to have its own covariance matrix Σ_k.
Considering log( π_k g_k(x) ) as before,
log( π_k g_k(x) ) = const + log π_k - 1/2 ( log|Σ_k| + (x - µ_k)^T Σ_k^{-1} (x - µ_k) )
                  = const + log π_k - 1/2 ( log|Σ_k| + µ_k^T Σ_k^{-1} µ_k ) + µ_k^T Σ_k^{-1} x - 1/2 x^T Σ_k^{-1} x
                  = a_k + b_k^T x + x^T c_k x.
A quadratic discriminant function instead of a linear one.

Quadratic Discriminant Analysis: quadratic decision boundaries

Again, by considering when we choose class k over class k',
a_k + b_k^T x + x^T c_k x - ( a_{k'} + b_{k'}^T x + x^T c_{k'} x ) = a + b^T x + x^T c x > 0,
we see that the decision boundaries of the Bayes classifier are quadratic surfaces.
The plug-in Bayes classifier under these assumptions is known as the Quadratic Discriminant Analysis (QDA) classifier.

Quadratic Discriminant Analysis: the QDA classifier

LDA classifier:
f_LDA(x) = argmin_{k ∈ {1,...,K}} { (x - µ̂_k)^T Σ̂^{-1} (x - µ̂_k) - 2 log π̂_k }
QDA classifier:
f_QDA(x) = argmin_{k ∈ {1,...,K}} { (x - µ̂_k)^T Σ̂_k^{-1} (x - µ̂_k) - 2 log π̂_k + log|Σ̂_k| }
for each point x ∈ X, where the plug-in estimate µ̂_k is as before and Σ̂_k is (in contrast to LDA) estimated separately for each class k = 1,...,K:
Σ̂_k = (1/n_k) sum_{j : y_j = k} (x_j - µ̂_k)(x_j - µ̂_k)^T.

Quadratic Discriminant Analysis: computing and plotting the QDA boundaries

## fit QDA (qda is in MASS; iris.data and ct are the feature matrix and class labels set up on an earlier slide)
library(MASS)
iris.qda <- qda(x=iris.data, grouping=ct)
## create a grid for our plotting surface
x <- seq(-6,6,0.02)
y <- seq(-4,4,0.02)
z <- as.matrix(expand.grid(x,y))
m <- length(x)
n <- length(y)
## predicted class on the grid; convert the factor to numeric for contour()
iris.qdp <- as.numeric(predict(iris.qda,z)$class)
contour(x, y, matrix(iris.qdp,m,n),
        levels=c(1.5,2.5), add=TRUE, drawlabels=FALSE, lty=2)

Quadratic Discriminant Analysis: Iris example, QDA boundaries. [Figure: iris data with QDA decision boundaries; axes Petal.Length vs Petal.Width.]


Quadratic Discriminant Analysis: LDA or QDA?

Having seen both LDA and QDA in action, it is natural to ask which is the better classifier.
If the covariances of the different classes are very distinct, QDA will probably have an advantage over LDA.
Since parametric models are only ever approximations to the real world, allowing more flexible decision boundaries (QDA) may seem like a good idea. However, there is a price to pay in terms of increased variance and potential overfitting.

Quadratic Discriminant Analysis: Regularized Discriminant Analysis

When data is scarce, compare the number of parameters to estimate:
LDA: K·p mean parameters plus a single shared p × p covariance matrix;
QDA: K·p mean parameters plus K separate p × p covariance matrices.
Using LDA allows us to better estimate the covariance matrix Σ.
Though QDA allows more flexible decision boundaries, the estimates of the K covariance matrices Σ_k are more variable.
RDA combines the strengths of both classifiers by regularizing each covariance matrix Σ̂_k in QDA towards the single matrix Σ̂ in LDA:
Σ̂_k(α) = α Σ̂_k + (1 - α) Σ̂, for some α ∈ [0, 1].
This introduces a new parameter α and allows for a continuum of models between LDA and QDA. α can be selected by cross-validation, for example.
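A minimal R sketch of the RDA covariance shrinkage above, assuming the per-class estimate Sigma_k and the pooled estimate Sigma_pooled have already been computed (the names Sigma_k, Sigma_pooled, alpha are illustrative):

## regularized covariance for class k: Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma
rda_cov <- function(Sigma_k, Sigma_pooled, alpha) {
  stopifnot(alpha >= 0, alpha <= 1)
  alpha * Sigma_k + (1 - alpha) * Sigma_pooled
}
## alpha would typically be chosen by cross-validation over a grid, e.g. seq(0, 1, by = 0.1)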

Logistic regression

Logistic regression: review

In LDA and QDA, we estimate p(x|y), but for classification we are mainly interested in p(y|x). Why not estimate that directly? Logistic regression* is a popular way of doing this.
(* Despite the name "regression", we are using it for classification!)

Logistic regression

One of the most popular methods for classification.
A linear model for the log-odds of the class probabilities.
Dates back to work on population growth curves by Verhulst [1838, 1845, 1847].
Statistical use for classification dates to Cox [1960s].
Independently discovered as the perceptron in machine learning [Rosenblatt 1957].
Main example of discriminative, as opposed to generative, learning.
Naïve approach to classification: linear regression (as seen last time). Logistic regression refines this idea and provides a more suitable model.

Logistic regression

Statistical perspective: consider Y = {0, 1}. Generalised linear model with Bernoulli likelihood and logit link:
Y | X = x, a, b ~ Bernoulli( s(a + b^T x) ),
where s(a + b^T x) = 1 / (1 + exp(-(a + b^T x))).
[Plot: the logistic function, rising from 0 to 1 with value 0.5 at 0.]
ML perspective: a discriminative classifier. Consider binary classification with Y = {+1, -1}. Logistic regression uses a parametric model on the conditional Y | X, not the joint distribution of (X, Y):
p(Y = y | X = x; a, b) = 1 / (1 + exp(-y(a + b^T x))).

Logistic regression: prediction using logistic regression

Logistic regression: hard vs soft classification rules

Consider using LDA for binary classification with Y = {+1, -1}. Predictions are based on a linear decision boundary:
ŷ_LDA(x) = sign{ log( π̂_{+1} g_{+1}(x | µ̂_{+1}, Σ̂) ) - log( π̂_{-1} g_{-1}(x | µ̂_{-1}, Σ̂) ) } = sign{ a + b^T x },
for a and b depending on the fitted parameters θ̂ = ( π̂_{+1}, π̂_{-1}, µ̂_{+1}, µ̂_{-1}, Σ̂ ).
The quantity a + b^T x can be viewed as a soft classification rule. Indeed, it is modelling the difference between the log-discriminant functions, or equivalently, the log-odds ratio:
a + b^T x = log[ p(Y = +1 | X = x; θ̂) / p(Y = -1 | X = x; θ̂) ].
f(x) = a + b^T x corresponds to the confidence of predictions, and loss can be measured as a function of this confidence:
exponential loss: L(y, f(x)) = e^{-y f(x)},
log-loss: L(y, f(x)) = log(1 + e^{-y f(x)}),
hinge loss: L(y, f(x)) = max{1 - y f(x), 0}.
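A tiny R sketch of these three surrogate losses as functions of the label y and the soft rule f(x) (names are illustrative):

## surrogate losses as functions of the margin y * f(x)
exp_loss   <- function(y, f) exp(-y * f)
log_loss   <- function(y, f) log1p(exp(-y * f))   # log(1 + exp(-y f))
hinge_loss <- function(y, f) pmax(1 - y * f, 0)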

Logistic regression: linearity of log-odds and the logistic function

a + b^T x models the log-odds ratio:
log[ p(Y = +1 | X = x; a, b) / p(Y = -1 | X = x; a, b) ] = a + b^T x.
Solve explicitly for the conditional class probabilities (using p(Y = +1 | X = x; a, b) + p(Y = -1 | X = x; a, b) = 1):
p(Y = +1 | X = x; a, b) = 1 / (1 + exp(-(a + b^T x))) =: s(a + b^T x),
p(Y = -1 | X = x; a, b) = 1 / (1 + exp(+(a + b^T x))) = s(-a - b^T x),
where s(z) = 1 / (1 + exp(-z)) is the logistic function.
[Plot: the logistic function s(z).]
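A small R sketch of these quantities (the values of a, b, x below are purely illustrative):

## logistic function
s <- function(z) 1 / (1 + exp(-z))

## conditional class probabilities for a given linear score a + b'x
a <- -0.5; b <- c(1.0, 2.0); x <- c(0.3, -0.2)   # illustrative values
score <- a + sum(b * x)
p_plus  <- s(score)      # p(Y = +1 | X = x)
p_minus <- s(-score)     # p(Y = -1 | X = x); p_plus + p_minus equals 1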

Logistic regression: fitting the parameters of the hyperplane

How to learn a and b given a training data set (x_i, y_i), i = 1,...,n?
Consider maximizing the conditional log-likelihood for Y = {+1, -1}:
p(Y = y_i | X = x_i; a, b) = p(y_i | x_i) = s(a + b^T x_i) if y_i = +1, and 1 - s(a + b^T x_i) if y_i = -1.
Noting that 1 - s(z) = s(-z), we can write the log-likelihood using the compact expression
log p(y_i | x_i) = log s( y_i (a + b^T x_i) ).
The log-likelihood over the whole i.i.d. data set is:
ℓ(a, b) = sum_{i=1}^n log p(y_i | x_i) = sum_{i=1}^n log s( y_i (a + b^T x_i) ).

Logistic regression: fitting the parameters of the hyperplane

How to learn a and b given a training data set (x_i, y_i), i = 1,...,n?
Consider maximizing the conditional log-likelihood:
ℓ(a, b) = sum_{i=1}^n log p(y_i | x_i) = sum_{i=1}^n log s( y_i (a + b^T x_i) ).
This is equivalent to minimizing the empirical risk associated with the log-loss:
R̂_log(f_{a,b}) = -(1/n) sum_{i=1}^n log s( y_i (a + b^T x_i) ) = (1/n) sum_{i=1}^n log( 1 + exp( -y_i (a + b^T x_i) ) ).
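A minimal R sketch of this empirical risk (names are illustrative; X is an n x p matrix, y a vector of +1/-1 labels):

## empirical risk under the log-loss for parameters (a, b)
log_loss_risk <- function(a, b, X, y) {
  scores <- a + as.vector(X %*% b)
  mean(log1p(exp(-y * scores)))   # (1/n) sum_i log(1 + exp(-y_i (a + b'x_i)))
}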

Logistic regression: could we use the 0-1 loss?

With the 0-1 loss, the risk becomes:
R̂(f_{a,b}) = (1/n) sum_{i=1}^n step( -y_i (a + b^T x_i) ).
But what is the gradient?...

Logistic regression: gradient and Hessian

The log-loss is differentiable, but it is not possible to find the optimal a, b analytically.
For simplicity, absorb a as an entry in b by appending 1 to the x vector, as we did before. Objective function:
R̂_log = -(1/n) sum_{i=1}^n log s( y_i x_i^T b ).
Useful facts about the logistic function:
s(-z) = 1 - s(z),
∂_z s(z) = s(z) s(-z),
∂_z log s(z) = s(-z),
∂²_z log s(z) = -s(z) s(-z).
Differentiate with respect to b:
∇_b R̂_log = -(1/n) sum_{i=1}^n s( -y_i x_i^T b ) y_i x_i,
∇²_b R̂_log = (1/n) sum_{i=1}^n s( y_i x_i^T b ) s( -y_i x_i^T b ) x_i x_i^T ⪰ 0.
We cannot set ∇_b R̂_log = 0 and solve: there is no closed-form solution. We'll use numerical methods.
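A short R sketch of the gradient and Hessian formulas above (illustrative; X is assumed to already contain a column of 1s so that a is absorbed into b):

s <- function(z) 1 / (1 + exp(-z))

## gradient and Hessian of the empirical log-loss risk at b
grad_risk <- function(b, X, y) {
  w <- s(-y * as.vector(X %*% b))        # s(-y_i x_i' b)
  -colMeans(w * y * X)                   # -(1/n) sum_i s(-y_i x_i' b) y_i x_i
}
hess_risk <- function(b, X, y) {
  z <- y * as.vector(X %*% b)
  w <- s(z) * s(-z)                      # s(y_i x_i' b) s(-y_i x_i' b)
  t(X) %*% (w * X) / nrow(X)             # (1/n) sum_i w_i x_i x_i'
}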

Gradient Descent

Start at a random point.
Repeat:
  determine a descent direction;
  choose a step size;
  update.
Until a stopping criterion is satisfied.
[Figure: iterates w0, w1, w2, ... descending the curve of f(w) towards the minimizer w*.]
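A generic R sketch of this loop, assuming an objective f and its gradient grad_f are supplied (all names, step sizes and tolerances are illustrative):

## generic gradient descent with a fixed step size and a simple stopping criterion
gradient_descent <- function(f, grad_f, w0, step = 0.1, tol = 1e-6, max_iter = 1000) {
  w <- w0
  for (t in seq_len(max_iter)) {
    g <- grad_f(w)                      # descent direction is -g
    w_new <- w - step * g               # update
    if (abs(f(w_new) - f(w)) < tol) {   # stopping criterion
      w <- w_new
      break
    }
    w <- w_new
  }
  w
}

## example: minimize f(w) = (w - 3)^2, which has its minimum at w = 3
gradient_descent(function(w) (w - 3)^2, function(w) 2 * (w - 3), w0 = 0)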

Where Will We Converge?

[Figure: a convex function f(w) with a single minimizer w*, and a non-convex function g(w) with several local minima.]
Convex: any local minimum is a global minimum.
Non-convex: multiple local minima may exist.
Least Squares, Ridge Regression and Logistic Regression are all convex!

Convexity

How to determine convexity? f(x) is convex if f''(x) ≥ 0.
Example: f(x) = x², f''(x) = 2 > 0.
How to determine convexity in the multivariate case? Use the matrix of second-order derivatives (the Hessian):
H = [ ∂²f(x) / ∂x_j ∂x_k ], a D × D matrix.
If the Hessian is positive semi-definite, H ⪰ 0, then f is convex.
A matrix H is positive semi-definite if and only if, for all z,
z^T H z = sum_{j,k} H_{jk} z_j z_k ≥ 0.
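An equivalent, practical check is that all eigenvalues of the (symmetric) Hessian are non-negative; a one-line R sketch (the function name is illustrative):

## check positive semi-definiteness of a symmetric matrix via its eigenvalues
is_psd <- function(H, tol = 1e-10) {
  all(eigen(H, symmetric = TRUE, only.values = TRUE)$values >= -tol)
}
is_psd(matrix(c(2, 1, 1, 2), 2, 2))   # TRUE: eigenvalues 3 and 1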

Logistic regression: gradient descent

The Hessian is positive-definite: the objective function is convex and there is a single unique global minimum.
Many different algorithms can find the optimal b, e.g.:
Gradient descent:
b_new = b + ε (1/n) sum_{i=1}^n s( -y_i x_i^T b ) y_i x_i.
Stochastic gradient descent:
b_new = b + ε_t (1/|I(t)|) sum_{i ∈ I(t)} s( -y_i x_i^T b ) y_i x_i,
where I(t) is a subset of the data at iteration t, and ε_t → 0 slowly ( sum_t ε_t = ∞, sum_t ε_t² < ∞ ).
Conjugate gradient, LBFGS and other methods from numerical analysis.
Newton-Raphson:
b_new = b - ( ∇²_b R̂_log )^{-1} ∇_b R̂_log.
This is also called iterative reweighted least squares.
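A minimal R sketch of the stochastic gradient descent update above (illustrative; X is an n x p matrix with a leading column of 1s, y a vector of +1/-1 labels, and the step sizes eps0/t satisfy the stated conditions):

s <- function(z) 1 / (1 + exp(-z))

## stochastic gradient step on mini-batches I(t) of size batch_size
logreg_sgd <- function(X, y, eps0 = 1, batch_size = 10, n_iter = 1000) {
  b <- rep(0, ncol(X))
  for (t in seq_len(n_iter)) {
    I <- sample(nrow(X), batch_size)                  # random subset I(t)
    w <- s(-y[I] * as.vector(X[I, , drop = FALSE] %*% b))
    g <- colMeans(w * y[I] * X[I, , drop = FALSE])    # (1/|I|) sum s(-y x'b) y x
    b <- b + (eps0 / t) * g                           # eps_t -> 0, sum eps_t = Inf
  }
  b
}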

Logistic regression: iterative reweighted least squares (IRLS)

We can write the gradient and Hessian in a more compact form. Define µ_i = s(x_i^T b), the diagonal matrix S with µ_i (1 - µ_i) on its diagonal, and the vector c with c_i = 1(y_i = +1). Then
∇_b R̂_log = -(1/n) sum_{i=1}^n s( -y_i x_i^T b ) y_i x_i = (1/n) sum_{i=1}^n x_i ( µ_i - c_i ) = (1/n) X^T ( µ - c ),
∇²_b R̂_log = (1/n) sum_{i=1}^n s( y_i x_i^T b ) s( -y_i x_i^T b ) x_i x_i^T = (1/n) X^T S X.

Logistic regression: iterative reweighted least squares (IRLS)

Let b_t be the parameters after t Newton steps. Up to the factor 1/n (which cancels in the Newton update), the gradient and Hessian at step t are:
g_t = X^T ( µ_t - c ),
H_t = X^T S_t X.
The Newton update rule is:
b_{t+1} = b_t - H_t^{-1} g_t
        = b_t + ( X^T S_t X )^{-1} X^T ( c - µ_t )
        = ( X^T S_t X )^{-1} X^T S_t ( X b_t + S_t^{-1} ( c - µ_t ) )
        = ( X^T S_t X )^{-1} X^T S_t z_t,
where z_t = X b_t + S_t^{-1} ( c - µ_t ). Then b_{t+1} is the solution of the weighted least squares problem:
minimise sum_{i=1}^N S_{t,ii} ( z_{t,i} - b^T x_i )².
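A compact R sketch of these Newton/IRLS updates (illustrative; X includes an intercept column, y has +1/-1 labels; the small ridge term added for numerical stability is an extra not in the slides):

s <- function(z) 1 / (1 + exp(-z))

## IRLS / Newton's method for logistic regression
logreg_irls <- function(X, y, n_iter = 25, ridge = 1e-8) {
  b <- rep(0, ncol(X))
  cvec <- as.numeric(y == +1)                       # c_i = 1(y_i = +1)
  for (t in seq_len(n_iter)) {
    mu <- s(as.vector(X %*% b))                     # mu_i = s(x_i' b)
    w  <- mu * (1 - mu)                             # diagonal of S_t
    g  <- t(X) %*% (mu - cvec)                      # gradient (up to 1/n)
    H  <- t(X) %*% (w * X) + ridge * diag(ncol(X))  # Hessian (up to 1/n) plus small ridge for stability
    b  <- b - solve(H, g)                           # Newton step
  }
  as.vector(b)
}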

Logistic regression: linearly separable data

Assume that the data is linearly separable, i.e., there exist a scalar α and a vector β such that
y_i ( α + β^T x_i ) > 0, for i = 1,...,n.
Let c > 0. The empirical risk for a = cα, b = cβ is
R̂_log(f_{a,b}) = (1/n) sum_{i=1}^n log( 1 + exp( -c y_i ( α + β^T x_i ) ) ),
which can be made arbitrarily close to zero as c → ∞, i.e., the soft classification rule becomes ±∞ (overconfidence): overfitting.
Regularization provides a solution to this problem.
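One common form of such regularization is an L2 penalty on b; this specific choice is an assumption here rather than something stated on the slide. A hedged R sketch of the penalized risk (lambda, X, y are illustrative names):

## L2-regularized empirical log-loss risk: R_log(a, b) + lambda * ||b||^2
## (the intercept a is typically left unpenalized)
reg_log_loss_risk <- function(a, b, X, y, lambda) {
  scores <- a + as.vector(X %*% b)
  mean(log1p(exp(-y * scores))) + lambda * sum(b^2)
}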

Logistic regression: multi-class logistic regression

Multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities p(Y = k | X = x; θ), for K classes k = 1,...,K, i.e.,
p(Y = k | X = x; θ) = exp( w_k^T x + b_k ) / sum_{l=1}^K exp( w_l^T x + b_l ).
The parameters are θ = (b, W), where W = (w_kj) is a K × p matrix of weights and b ∈ R^K is a vector of bias terms.
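A small R sketch of the softmax probabilities (the names W, b_vec, x and the example values are illustrative; subtracting the maximum score is a standard numerical-stability trick not spelled out on the slide):

## softmax class probabilities: p(Y = k | x) proportional to exp(w_k' x + b_k)
softmax_probs <- function(W, b_vec, x) {
  scores <- as.vector(W %*% x) + b_vec      # K scores, one per class
  scores <- scores - max(scores)            # subtract max for numerical stability
  exp(scores) / sum(exp(scores))
}

## example with K = 3 classes and p = 2 features (illustrative values)
W <- matrix(c(0.5, -1, 2, 0.3, -0.7, 1.2), nrow = 3, byrow = TRUE)
b_vec <- c(0.1, 0, -0.2)
softmax_probs(W, b_vec, c(1.5, -0.5))       # probabilities summing to 1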


Logistic regression: crab dataset

library(MASS)
## load crabs data
data(crabs)
ct <- as.numeric(crabs[,1])-1+2*(as.numeric(crabs[,2])-1)
## project into first two LD
cb.lda <- lda(log(crabs[,4:8]),ct)
cb.ldp <- predict(cb.lda)
x <- cb.ldp$x[,1:2]
y <- as.numeric(ct==0)
eqscplot(x,pch=2*y+1,col=y+1)

Logistic regression: crab dataset

## visualize decision boundary
gx1 <- seq(-6,6,.02)
gx2 <- seq(-4,4,.02)
gx <- as.matrix(expand.grid(gx1,gx2))
gm <- length(gx1)
gn <- length(gx2)
gdf <- data.frame(LD1=gx[,1],LD2=gx[,2])
lda <- lda(x,y)
y.lda <- predict(lda,x)$class
eqscplot(x,pch=2*y+1,col=2-as.numeric(y==y.lda))
## predicted class on the grid (convert factor labels back to 0/1 for contour)
y.lda.grid <- as.numeric(as.character(predict(lda,gdf)$class))
contour(gx1,gx2,matrix(y.lda.grid,gm,gn),
        levels=c(0.5), add=TRUE, drawlabels=FALSE, lty=2, lwd=2)

Logistic regression: crab dataset

## logistic regression
xdf <- data.frame(x)
logreg <- glm(y ~ LD1 + LD2, data=xdf, family=binomial)
y.lr <- predict(logreg,type="response")
eqscplot(x,pch=2*y+1,col=2-as.numeric(y==(y.lr>.5)))
y.lr.grid <- predict(logreg,newdata=gdf,type="response")
contour(gx1,gx2,matrix(y.lr.grid,gm,gn),
        levels=c(.1,.25,.75,.9), add=TRUE, drawlabels=FALSE, lty=3, lwd=1)
contour(gx1,gx2,matrix(y.lr.grid,gm,gn),
        levels=c(.5), add=TRUE, drawlabels=FALSE, lty=1, lwd=2)

## logistic regression with quadratic interactions
logreg <- glm(y ~ (LD1 + LD2)^2, data=xdf, family=binomial)
y.lr <- predict(logreg,type="response")
eqscplot(x,pch=2*y+1,col=2-as.numeric(y==(y.lr>.5)))
y.lr.grid <- predict(logreg,newdata=gdf,type="response")
contour(gx1,gx2,matrix(y.lr.grid,gm,gn),
        levels=c(.1,.25,.75,.9), add=TRUE, drawlabels=FALSE, lty=3, lwd=1)
contour(gx1,gx2,matrix(y.lr.grid,gm,gn),
        levels=c(.5), add=TRUE, drawlabels=FALSE, lty=1, lwd=2)

Crab dataset: blue female vs. rest. [Figure: decision boundaries in the first two LD coordinates.] Comparing LDA and logistic regression.

Crab dataset. [Figure: decision boundaries in the first two LD coordinates.] Comparing logistic regression with and without quadratic interactions.

Logistic regression: Python demo

Single-class: https://github.com/vkanade/mlmt2017/blob/master/lecture11/logistic%20regression.ipynb
Multi-class: https://github.com/vkanade/mlmt2017/blob/master/lecture11/multiclass%20logistic%20regression.ipynb