Statistical Data Mining and Machine Learning, Hilary Term 2016
Dino Sejdinovic, Department of Statistics, Oxford
Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml

Naïve Bayes

Naïve Bayes: another plug-in classifier with a simple generative model: it assumes all measured variables/features are independent given the label. Often used in text document classification, e.g. of scientific articles or emails. A basic standard model for text classification consists of considering a pre-specified dictionary of p words and summarizing each document i by a binary vector $x_i$, where

$$x_i^{(j)} = \begin{cases} 1 & \text{if word } j \text{ is present in document } i, \\ 0 & \text{otherwise.} \end{cases}$$

Presence of word j is the j-th feature/dimension. To implement a plug-in classifier, we need a model for the conditional probability mass function $g_k(x) = P(X = x \mid Y = k)$ for each class $k = 1, \ldots, K$.
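
As a concrete illustration of this binary bag-of-words representation, here is a minimal Python sketch (not from the slides; the dictionary and document are made up) that builds the vector $x_i$ for one document:

```python
import numpy as np

# Hypothetical dictionary of p words and an example document (illustration only).
dictionary = ["ball", "game", "election", "vote", "science"]
document = "the game ended after the ball went out of play"

words_in_doc = set(document.split())
# x[j] = 1 if word j of the dictionary appears in the document, 0 otherwise.
x = np.array([1 if word in words_in_doc else 0 for word in dictionary])
print(x)  # -> [1 1 0 0 0]
```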

Naïve Bayes

Naïve Bayes is a plug-in classifier which ignores feature correlations¹ and assumes:

$$g_k(x_i) = P(X = x_i \mid Y = k) = \prod_{j=1}^{p} P(X^{(j)} = x_i^{(j)} \mid Y = k) = \prod_{j=1}^{p} \phi_{kj}^{x_i^{(j)}} (1 - \phi_{kj})^{1 - x_i^{(j)}},$$

where we denote the parametrized conditional PMF by $\phi_{kj} = P(X^{(j)} = 1 \mid Y = k)$ (the probability that the j-th word appears in a class-k document). Given the dataset, the MLE of the parameters is:

$$\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\phi}_{kj} = \frac{\sum_{i: y_i = k} x_i^{(j)}}{n_k}.$$

¹ Given the class, it assumes each word appears in a document independently of all others.
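
A minimal numpy sketch of these MLE formulas and the resulting plug-in prediction, assuming a binary document-term matrix X of shape n × p and labels y taking values in {1, ..., K}; the function names and interface are mine, not the course's:

```python
import numpy as np

def naive_bayes_mle(X, y, K):
    """MLE for Bernoulli naive Bayes: pi_hat[k-1] = n_k / n, phi_hat[k-1, j] = mean of X^(j) in class k."""
    n, p = X.shape
    pi_hat = np.zeros(K)
    phi_hat = np.zeros((K, p))
    for k in range(1, K + 1):
        mask = (y == k)
        n_k = mask.sum()
        pi_hat[k - 1] = n_k / n
        phi_hat[k - 1] = X[mask].sum(axis=0) / n_k
    return pi_hat, phi_hat

def naive_bayes_predict(x, pi_hat, phi_hat):
    """Plug-in classifier: argmax_k pi_k * prod_j phi_kj^x_j (1 - phi_kj)^(1 - x_j), on the log scale."""
    with np.errstate(divide="ignore"):  # log(0) = -inf is acceptable here
        log_lik = np.where(x == 1, np.log(phi_hat), np.log(1.0 - phi_hat))  # shape (K, p)
    log_post = np.log(pi_hat) + log_lik.sum(axis=1)
    return np.argmax(log_post) + 1  # labels are 1, ..., K

# Tiny made-up example: 4 documents, 3 dictionary words, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1]])
y = np.array([1, 1, 2, 2])
pi_hat, phi_hat = naive_bayes_mle(X, y, K=2)
print(pi_hat, phi_hat, naive_bayes_predict(np.array([1, 0, 0]), pi_hat, phi_hat))
```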

Naïve Bayes

MLE: $\hat{\pi}_k = \frac{n_k}{n}$, $\hat{\phi}_{kj} = \frac{\sum_{i: y_i = k} x_i^{(j)}}{n_k}$.

A problem with MLE: if the l-th word did not appear in documents labelled as class k, then $\hat{\phi}_{kl} = 0$ and

$$P(Y = k \mid X = x \text{ with } l\text{-th entry equal to } 1) \propto \hat{\pi}_k \prod_{j=1}^{p} \hat{\phi}_{kj}^{x^{(j)}} (1 - \hat{\phi}_{kj})^{1 - x^{(j)}} = 0,$$

i.e. we will never attribute a new document containing word l to class k (regardless of other words in it). This is an example of overfitting.
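
A tiny numeric illustration of this failure mode, with made-up values of $\hat{\pi}_k$ and $\hat{\phi}_{kj}$ (not taken from any real dataset):

```python
import numpy as np

# Hypothetical fitted values for class k: the l-th word (index 2) never appeared
# in any training document of class k, so its MLE is exactly zero.
pi_k = 0.4
phi_k = np.array([0.5, 0.9, 0.0, 0.2])

# A new document that does contain word l:
x_new = np.array([1, 0, 1, 0])

# Unnormalised posterior: pi_k * prod_j phi_kj^x_j * (1 - phi_kj)^(1 - x_j).
score_k = pi_k * np.prod(phi_k ** x_new * (1 - phi_k) ** (1 - x_new))
print(score_k)  # 0.0 -- class k can never be predicted for this document
```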

Generative Learning

Classifiers we have seen so far are generative: we work with a joint distribution $p_{X,Y}(x, y)$ over data vectors and labels. A learning algorithm constructs $f: \mathcal{X} \to \mathcal{Y}$ which predicts the label of X. Given a loss function L, the risk R of f is

$$R(f) = \mathbb{E}_{p_{X,Y}}[L(Y, f(X))].$$

For 0/1 loss in classification, the Bayes classifier

$$f_{\text{Bayes}}(x) = \underset{k=1,\ldots,K}{\operatorname{argmax}}\ p(Y = k \mid x) = \underset{k=1,\ldots,K}{\operatorname{argmax}}\ p_{X,Y}(x, k)$$

has the minimum risk (the Bayes risk), but is unknown since $p_{X,Y}$ is unknown. Assume a parametric model for the joint: $p_{X,Y}(x, y) = p_{X,Y}(x, y \mid \theta)$. Fit

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta)$$

and plug back into the Bayes classifier:

$$\hat{f}(x) = \underset{k=1,\ldots,K}{\operatorname{argmax}}\ p(Y = k \mid x, \hat{\theta}) = \underset{k=1,\ldots,K}{\operatorname{argmax}}\ p_{X,Y}(x, k \mid \hat{\theta}).$$

Generative vs Discriminative Learning

Generative learning: find parameters which explain all the data available:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log p(x_i, y_i \mid \theta).$$

Examples: LDA, QDA, naïve Bayes. Makes use of all the data available. Flexible modelling framework, so it can incorporate missing features or unlabelled examples. Stronger modelling assumptions, which may not be realistic (Gaussianity, independence of features).

Discriminative learning: find parameters that aid in prediction:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i)) \quad \text{or} \quad \hat{\theta} = \underset{\theta}{\operatorname{argmax}} \sum_{i=1}^{n} \log p(y_i \mid x_i, \theta).$$

Examples: logistic regression, neural nets, support vector machines. Typically performs better on a given task. Weaker modelling assumptions: essentially no model on X, only on Y | X. Can overfit more easily.

Logistic regression

A discriminative classifier. Consider binary classification with $\mathcal{Y} = \{-1, +1\}$. Logistic regression uses a parametric model on the conditional Y | X, not on the joint distribution of (X, Y):

$$p(Y = y \mid X = x; a, b) = \frac{1}{1 + \exp(-y(a + b^\top x))}.$$

a, b are fitted by minimizing the empirical risk with respect to the log loss.

Hard vs Soft classification rules

Consider using LDA for binary classification with $\mathcal{Y} = \{-1, +1\}$. Predictions are based on a linear decision boundary:

$$\hat{y}_{\text{LDA}}(x) = \operatorname{sign}\left\{ \log \hat{\pi}_{+1} g_{+1}(x \mid \hat{\mu}_{+1}, \hat{\Sigma}) - \log \hat{\pi}_{-1} g_{-1}(x \mid \hat{\mu}_{-1}, \hat{\Sigma}) \right\} = \operatorname{sign}\{ a + b^\top x \}$$

for a and b depending on the fitted parameters $\hat{\theta} = (\hat{\pi}_{-1}, \hat{\pi}_{+1}, \hat{\mu}_{-1}, \hat{\mu}_{+1}, \hat{\Sigma})$.

The quantity $a + b^\top x$ can be viewed as a soft classification rule. Indeed, it is modelling the difference between the log-discriminant functions, or equivalently, the log-odds ratio:

$$a + b^\top x = \log \frac{p(Y = +1 \mid X = x; \hat{\theta})}{p(Y = -1 \mid X = x; \hat{\theta})}.$$

$f(x) = a + b^\top x$ corresponds to the confidence of predictions, and loss can be measured as a function of this confidence (see the sketch below):
exponential loss: $L(y, f(x)) = e^{-y f(x)}$,
log-loss: $L(y, f(x)) = \log(1 + e^{-y f(x)})$,
hinge loss: $L(y, f(x)) = \max\{1 - y f(x), 0\}$.
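
All three losses are functions of the margin $y f(x)$; a short sketch (my own, not from the slides) computing them side by side for a few margin values:

```python
import numpy as np

def exponential_loss(margin):
    return np.exp(-margin)

def log_loss(margin):
    return np.log(1.0 + np.exp(-margin))

def hinge_loss(margin):
    return np.maximum(1.0 - margin, 0.0)

# margin = y * f(x): large positive = confident and correct, negative = wrong side.
margins = np.array([-2.0, 0.0, 1.0, 3.0])
for name, loss in [("exp", exponential_loss), ("log", log_loss), ("hinge", hinge_loss)]:
    print(name, loss(margins))
```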

Linearity of log-odds and logistic function

We can treat a and b as parameters in their own right in the model of the conditional Y | X:

$$\log \frac{p(Y = +1 \mid X = x; a, b)}{p(Y = -1 \mid X = x; a, b)} = a + b^\top x.$$

Solve explicitly for the conditional class probabilities:

$$p(Y = +1 \mid X = x; a, b) = \frac{1}{1 + \exp(-(a + b^\top x))} =: s(a + b^\top x),$$
$$p(Y = -1 \mid X = x; a, b) = \frac{1}{1 + \exp(+(a + b^\top x))} = s(-a - b^\top x),$$

where $s(z) = 1/(1 + \exp(-z))$ is the logistic function.

[Figure: plot of the logistic function s(z).]
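
A short numerical check of these relations, with made-up $a$, $b$ and $x$ (a sketch, not course code): the two conditional probabilities sum to one, and their log-odds equal $a + b^\top x$.

```python
import numpy as np

def s(z):
    """Logistic function s(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and input, for illustration only.
a, b = -0.5, np.array([1.0, -2.0])
x = np.array([0.3, 0.8])

p_plus = s(a + b @ x)       # p(Y = +1 | X = x; a, b)
p_minus = s(-(a + b @ x))   # p(Y = -1 | X = x; a, b)
print(p_plus + p_minus)                     # 1.0
print(np.log(p_plus / p_minus), a + b @ x)  # equal: log-odds are linear in x
```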

Fitting the parameters of the hyperplane

Consider maximizing the conditional log likelihood:

$$\ell(a, b) = \sum_{i=1}^{n} \log p(Y = y_i \mid X = x_i) = \sum_{i=1}^{n} \log s(y_i(a + b^\top x_i)).$$

This is equivalent to minimizing the empirical risk associated with the log loss:

$$\hat{R}_{\log}(f_{a,b}) = -\frac{1}{n} \sum_{i=1}^{n} \log s(y_i(a + b^\top x_i)) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-y_i(a + b^\top x_i))\right)$$

over all linear soft classification rules $f_{a,b}(x) = a + b^\top x$.

Not possible to find optimal a, b analytically. For simplicity, absorb a as an entry in b by appending 1 to the x vector. Objective function:

$$\hat{R}_{\log} = -\frac{1}{n} \sum_{i=1}^{n} \log s(y_i x_i^\top b).$$

Properties of the logistic function:
$s(-z) = 1 - s(z)$,
$\partial_z s(z) = s(z)\, s(-z)$,
$\partial_z \log s(z) = s(-z)$,
$\partial_z^2 \log s(z) = -s(z)\, s(-z)$.

Differentiate with respect to b:

$$\nabla_b \hat{R}_{\log} = -\frac{1}{n} \sum_{i=1}^{n} s(-y_i x_i^\top b)\, y_i x_i,$$
$$\nabla_b^2 \hat{R}_{\log} = \frac{1}{n} \sum_{i=1}^{n} s(y_i x_i^\top b)\, s(-y_i x_i^\top b)\, x_i x_i^\top \succeq 0.$$

Second derivative is positive-definite: the objective function is convex and there is a single unique global minimum. Many different algorithms can find the optimal b, e.g.:

Gradient descent (a numpy sketch follows below):
$$b^{\text{new}} = b + \epsilon \frac{1}{n} \sum_{i=1}^{n} s(-y_i x_i^\top b)\, y_i x_i.$$

Stochastic gradient descent:
$$b^{\text{new}} = b + \epsilon_t \frac{1}{|I(t)|} \sum_{i \in I(t)} s(-y_i x_i^\top b)\, y_i x_i,$$
where I(t) is a subset of the data at iteration t, and $\epsilon_t \to 0$ slowly ($\sum_t \epsilon_t = \infty$, $\sum_t \epsilon_t^2 < \infty$).

Newton-Raphson:
$$b^{\text{new}} = b - \left(\nabla_b^2 \hat{R}_{\log}\right)^{-1} \nabla_b \hat{R}_{\log}.$$
This is also called iteratively reweighted least squares.

Conjugate gradient, L-BFGS and other methods from numerical analysis.
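
A minimal numpy sketch of the plain gradient-descent update above, on a synthetic two-class dataset with the intercept absorbed into $b$; the step size, iteration count and data are my own choices, not the course's.

```python
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, eps=0.5, n_iter=2000):
    """Minimise (1/n) sum_i log(1 + exp(-y_i x_i^T b)) by plain gradient descent.

    X: n x p matrix with a column of ones appended (intercept absorbed);
    y: labels in {-1, +1}. Update: b <- b + eps * (1/n) sum_i s(-y_i x_i^T b) y_i x_i.
    """
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        margins = y * (X @ b)                      # y_i x_i^T b
        b = b + eps * ((s(-margins) * y) @ X) / n  # step along the negative gradient
    return b

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
n = 200
X_raw = np.vstack([rng.normal(-1.0, 1.0, size=(n // 2, 2)),
                   rng.normal(+1.0, 1.0, size=(n // 2, 2))])
y = np.hstack([-np.ones(n // 2), np.ones(n // 2)])
X = np.hstack([X_raw, np.ones((n, 1))])            # append 1 to absorb a into b

b_hat = fit_logistic_gd(X, y)
print(b_hat, np.mean(np.sign(X @ b_hat) == y))     # fitted parameters, training accuracy
```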

Logistic regression vs. LDA

Both have linear decision boundaries and model the log-posterior odds as

$$\log \frac{p(Y = +1 \mid X = x)}{p(Y = -1 \mid X = x)} = a + b^\top x.$$

LDA models the marginal density of x as a Gaussian mixture with shared covariance

$$g(x) = \pi_{-1} \mathcal{N}(x; \mu_{-1}, \Sigma) + \pi_{+1} \mathcal{N}(x; \mu_{+1}, \Sigma)$$

and fits the parameters $\theta = (\mu_{-1}, \mu_{+1}, \pi_{-1}, \pi_{+1}, \Sigma)$ by maximizing the joint likelihood $\prod_{i=1}^{n} p(x_i, y_i \mid \theta)$. a and b are then determined from $\theta$.

Logistic regression leaves the marginal density g(x) as an arbitrary density function, and fits the parameters a, b by maximizing the conditional likelihood $\prod_{i=1}^{n} p(y_i \mid x_i; a, b)$.

Linearly separable data

Assume that the data is linearly separable, i.e. there is a scalar $\alpha$ and a vector $\beta$ such that

$$y_i(\alpha + \beta^\top x_i) > 0, \quad i = 1, \ldots, n.$$

Let c > 0. The empirical risk for $a = c\alpha$, $b = c\beta$ is

$$\hat{R}_{\log}(f_{a,b}) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + \exp(-c\, y_i(\alpha + \beta^\top x_i))\right),$$

which can be made arbitrarily close to zero as $c \to \infty$, i.e. the soft classification rule becomes $\pm\infty$ (overconfidence).
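
A quick numeric illustration of this divergence on a tiny made-up separable dataset (the separating $\alpha$, $\beta$ are chosen by hand): scaling them by growing $c$ drives the empirical risk towards zero while the soft rule $a + b^\top x$ blows up.

```python
import numpy as np

# Tiny 1-D separable dataset: y_i (alpha + beta * x_i) > 0 for all i.
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([-1.0, -1.0, 1.0, 1.0])
alpha, beta = 0.0, 1.0

for c in [1.0, 10.0, 100.0]:
    a, b = c * alpha, c * beta
    risk = np.mean(np.log(1.0 + np.exp(-y * (a + b * x))))
    print(c, risk, np.abs(a + b * x).max())
# Risk -> 0 as c grows, while the "confidence" |a + b x| diverges.
```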

Multi-class logistic regression

The multi-class/multinomial logistic regression uses the softmax function to model the conditional class probabilities $p(Y = k \mid X = x; \theta)$, for K classes $k = 1, \ldots, K$, i.e.,

$$p(Y = k \mid X = x; \theta) = \frac{\exp\left(w_k^\top x + b_k\right)}{\sum_{l=1}^{K} \exp\left(w_l^\top x + b_l\right)}.$$

Parameters are $\theta = (b, W)$, where $W = (w_{kj})$ is a $K \times p$ matrix of weights and $b \in \mathbb{R}^K$ is a vector of bias terms.
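
A small sketch of the softmax computation (my own; $W$, $b$ and $x$ are made up), subtracting the maximum score before exponentiating for numerical stability; the subtraction cancels in the ratio, so the probabilities are unchanged.

```python
import numpy as np

def softmax_probs(x, W, b):
    """p(Y = k | X = x) = exp(w_k^T x + b_k) / sum_l exp(w_l^T x + b_l)."""
    scores = W @ x + b              # length-K vector of w_k^T x + b_k
    scores = scores - scores.max()  # numerical stability; does not change the ratios
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

# Made-up parameters: K = 3 classes, p = 2 features.
W = np.array([[1.0, -1.0],
              [0.5,  0.5],
              [-1.0, 2.0]])
b = np.array([0.0, 0.1, -0.2])
x = np.array([0.4, 1.2])

p = softmax_probs(x, W, b)
print(p, p.sum())  # probabilities over the K classes, summing to 1
```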

Logistic regression: Summary

Makes fewer modelling assumptions than generative classifiers, often resulting in better prediction accuracy.
Diverging optimal parameters for linearly separable data: need to regularise / pull them towards zero.
A simple example of a generalised linear model (GLM), for which there is a well-established statistical theory:
assessment of fit via deviance and plots,
well-founded approaches to removing insignificant features (drop-in deviance test, Wald test).

Regularization

Flexible models for high-dimensional problems require many parameters. With many parameters, learners can easily overfit. Regularization: limit flexibility of the model to prevent overfitting. Add a term penalizing large values of the parameters $\theta$:

$$\min_\theta\ \hat{R}(f_\theta) + \lambda \|\theta\|_\rho^\rho = \min_\theta\ \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i)) + \lambda \|\theta\|_\rho^\rho,$$

where $\rho \geq 1$ and $\|\theta\|_\rho = \left(\sum_{j=1}^{p} |\theta_j|^\rho\right)^{1/\rho}$ is the $L_\rho$ norm of $\theta$ (also of interest when $\rho \in [0, 1)$, but then it is no longer a norm). Also known as shrinkage methods: parameters are shrunk towards 0. $\lambda$ is a tuning parameter (or hyperparameter) and controls the amount of regularization, and the resulting complexity of the model.
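
As an illustration, a sketch (not from the slides) that adds a ridge ($\rho = 2$) penalty $\lambda \|b\|_2^2$ to the logistic empirical risk and its gradient, so that any of the gradient-based optimisers from the earlier slides can be run on the penalised objective; the data below are synthetic.

```python
import numpy as np

def s(z):
    return 1.0 / (1.0 + np.exp(-z))

def penalised_risk_and_grad(b, X, y, lam):
    """Log-loss empirical risk plus a ridge (rho = 2) penalty lam * ||b||_2^2, and its gradient."""
    margins = y * (X @ b)
    risk = np.mean(np.log(1.0 + np.exp(-margins))) + lam * np.sum(b ** 2)
    grad = -(s(-margins) * y) @ X / len(y) + 2.0 * lam * b
    return risk, grad

# Minimal usage on synthetic data: larger lam shrinks the fitted parameters towards 0.
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])  # intercept absorbed
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=50))
b = np.zeros(3)
for _ in range(500):
    _, grad = penalised_risk_and_grad(b, X, y, lam=0.1)
    b = b - 0.5 * grad
print(b)
```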

Regularization

[Figure: $L_\rho$ regularization profile for different values of $\rho$ (0.01, 0.10, 0.50, 1.0, 1.5, 2.0).]

Types of Regularization

Ridge regression / Tikhonov regularization: $\rho = 2$ (Euclidean norm).
LASSO: $\rho = 1$ (Manhattan norm).
Sparsity-inducing regularization: $\rho \leq 1$ (nonconvex for $\rho < 1$).
Elastic net regularization: mixed $L_1/L_2$ penalty (see the sketch below):

$$\min_\theta\ \frac{1}{n} \sum_{i=1}^{n} L(y_i, f_\theta(x_i)) + \lambda \left[ (1 - \alpha) \|\theta\|_2^2 + \alpha \|\theta\|_1 \right].$$
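
A sketch of the elastic net penalty term itself (my own helper function, not from the slides), showing how $\alpha$ interpolates between the ridge term ($\alpha = 0$) and the LASSO term ($\alpha = 1$):

```python
import numpy as np

def elastic_net_penalty(theta, lam, alpha):
    """lambda * [ (1 - alpha) * ||theta||_2^2 + alpha * ||theta||_1 ]."""
    return lam * ((1.0 - alpha) * np.sum(theta ** 2) + alpha * np.sum(np.abs(theta)))

theta = np.array([0.5, -1.5, 0.0, 2.0])
for alpha in [0.0, 0.5, 1.0]:  # ridge-like -> mixed -> LASSO-like
    print(alpha, elastic_net_penalty(theta, lam=0.1, alpha=alpha))
```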

$L_1$ promotes sparsity

Figure: the intersection between the $L_1$ (left) and the $L_2$ (right) ball with a hyperplane. $L_1$ regularization often leads to optimal solutions with many zeros, i.e., the regression function depends only on the (small) number of features with non-zero parameters. (Figure from M. Elad, Sparse and Redundant Representations, 2010.)