Introduction to Machine Learning


DATA11002 Introduction to Machine Learning
Lecturer: Antti Ukkonen
TAs: Saska Dönges and Janne Leppä-aho
Department of Computer Science, University of Helsinki
(based in part on material by Patrik Hoyer, Jyrki Kivinen, and Teemu Roos)
November 1st to December 14th, 2018

Classification: Probabilistic Methods

Statistical learning model (recap)

Recall that X denotes the input variables (features, predictors, etc.) and Y the output variable. In classification, we assume Y takes values in a set of class labels, e.g. 0 or 1, or +1 or -1.

Assume that there is a fixed but unknown probability distribution P over X × Y such that the pairs (x_i, y_i) are i.i.d. samples from it.

We wish to minimise the generalisation error (also called risk) of \hat{f}, which is the expected loss
E_{(x,y) \sim P}[L(\hat{f}(x), y)],
where E_{(x,y) \sim P}[\cdot] denotes expectation when a single data point (x, y) is drawn from P.

Some definitions / notation

P(X = x_i, Y = y_i) denotes the probability of the pair (x_i, y_i). (Sometimes we write P(x_i, y_i) for short.)

P(Y = y_i | X = x_i) denotes the conditional probability of observing y_i given x_i. (We can also write P(Y = y_i | x_i) or P(y_i | x_i) for short.)

If we knew P (which we in practice never do!), we could implement an optimal classifier by assigning x_i to the class y_i that maximises P(y_i | x_i).

In the following, we are particularly interested in models for P(Y = y_i | x_i). These models are almost always wrong, but perhaps they give good predictions anyway!

Logistic regression

Logistic regression models are linear models for probabilistic binary classification (so, not really regression, where the response is continuous).

Given an input (vector) x, the output is the probability that Y = 1. Let's denote it by \hat{p}(Y = 1 \mid x).

However, instead of using a linear model directly, as in \hat{p}(Y = 1 \mid x) = \beta^T x, we let
\log \frac{\hat{p}(Y = 1 \mid x)}{\hat{p}(Y = 0 \mid x)} = \beta^T x.
This amounts to the same as
\hat{p}(Y = 1 \mid x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} = \frac{1}{\exp(-\beta^T x) + 1}.
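
As a small illustration, here is a minimal Python sketch (the coefficients and input are made up, not from the lecture) of how the logistic model maps the linear score \beta^T x to a probability:

    import numpy as np

    def sigmoid(z):
        # Logistic function 1 / (1 + exp(-z)); for production code a numerically
        # stable version such as scipy.special.expit is preferable.
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(beta, x):
        # p_hat(Y = 1 | x) = exp(beta^T x) / (1 + exp(beta^T x)) = sigmoid(beta^T x)
        return sigmoid(np.dot(beta, x))

    # Hypothetical example: two features plus a constant (bias) feature appended to x.
    beta = np.array([0.5, -1.2, 0.3])
    x = np.array([2.0, 1.0, 1.0])
    print(predict_proba(beta, x))  # probability that Y = 1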

Logistic regression (2)

For convenience, we use here class labels 0 and 1.

Given the probabilistic prediction \hat{p}(y \mid x), and assuming instance x_i has already been observed, the conditional likelihood of a sample point (x_i, y_i) is
\hat{p}(Y = 1 \mid x_i) if y_i = 1,
1 - \hat{p}(Y = 1 \mid x_i) if y_i = 0,
which we write as
\hat{p}(Y = 1 \mid x_i)^{y_i} (1 - \hat{p}(Y = 1 \mid x_i))^{1 - y_i}.

Logistic regression (3)

The conditional likelihood of a sequence of independent samples (x_i, y_i), i = 1, ..., n, is then
\prod_{i=1}^{n} \hat{p}(Y = 1 \mid x_i)^{y_i} (1 - \hat{p}(Y = 1 \mid x_i))^{1 - y_i}.
We say "conditional" to emphasise that we take the x_i as given and only model the probability of the labels y_i.

To maximise the conditional likelihood, we can equivalently maximise the conditional log-likelihood
LCL(\beta) = \sum_{i=1}^{n} \left( y_i \ln \hat{p}(Y = 1 \mid x_i) + (1 - y_i) \ln(1 - \hat{p}(Y = 1 \mid x_i)) \right).
This is the same as log-loss (except that the sign is flipped, i.e., without the minus)!
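
The following Python sketch computes LCL(\beta) directly from the formula above; the toy data and coefficients are invented for illustration only.

    import numpy as np

    def conditional_log_likelihood(beta, X, y):
        # LCL(beta) = sum_i [ y_i ln p_i + (1 - y_i) ln(1 - p_i) ],
        # where p_i = p_hat(Y = 1 | x_i) under the logistic model.
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        eps = 1e-12  # guard against log(0)
        return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    # Made-up toy data: four points with two features each.
    X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.0, 0.3], [2.0, 1.5]])
    y = np.array([1, 0, 0, 1])
    beta = np.array([0.8, 0.1])

    print(conditional_log_likelihood(beta, X, y))   # higher is better
    print(-conditional_log_likelihood(beta, X, y))  # the corresponding (unnormalised) log-loss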

Logistic regression (4)

Maximising the likelihood (or minimising the log-loss) isn't as straightforward as in the case of linear regression.

Nevertheless, the problem is convex, which means that gradient-based techniques exist to find the optimum. Standard implementations are available in R, Python, Matlab, ...

Often used with regularisation, as in linear regression:
ridge: \arg\max_\beta \left( LCL(\beta) - \lambda \|\beta\|_2^2 \right)
lasso: \arg\max_\beta \left( LCL(\beta) - \lambda \|\beta\|_1 \right)

In particular, if the data is linearly separable, the non-regularised solution tends to infinity.
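
In practice one rarely implements the optimiser by hand. As a hedged sketch (not the course's reference solution), scikit-learn's LogisticRegression fits the model with gradient-based solvers and applies L2 (ridge-style) regularisation by default; its parameter C acts roughly as an inverse regularisation strength, C ≈ 1/λ. The data below is synthetic.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    # Synthetic labels drawn from a "true" logistic model with beta = (2, -1).
    p = 1.0 / (1.0 + np.exp(-(2.0 * X[:, 0] - 1.0 * X[:, 1])))
    y = (rng.random(100) < p).astype(int)

    # L2-regularised ("ridge-like") fit; smaller C means stronger regularisation.
    clf = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
    print(clf.coef_, clf.intercept_)

    # L1-regularised ("lasso-like") fit needs a solver that supports the L1 penalty.
    clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X, y)
    print(clf_l1.coef_)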

Generative vs discriminative learning

Logistic regression is an example of a discriminative and probabilistic classifier that directly models the class distribution P(y | x).

Another probabilistic way to approach the problem is to use generative learning, which builds a model for the whole joint distribution P(x, y), often using the decomposition P(y) P(x | y).

Both approaches have their pros and cons:

Discriminative learning: only solve the task that you need to solve; may provide better accuracy since it focuses on the specific learning task; optimisation tends to be harder.

Generative learning: often more natural to build models for P(x | y) than for P(y | x); handles missing data more naturally; optimisation often easier.

Generative vs discriminative learning (2)

Estimating the class prior P(y) is usually simple. For example, in binary classification (this time with Y ∈ {-1, +1}) we can usually just count the number of positive examples Pos and negative examples Neg and set
P(Y = +1) = \frac{Pos}{Pos + Neg} and P(Y = -1) = \frac{Neg}{Pos + Neg}.

Since P(x, y) = P(x | y) P(y), what remains is estimating P(x | y). In binary classification, we could now e.g.
use the positive examples to build a model for P(x | Y = +1),
use the negative examples to build a model for P(x | Y = -1).

To classify a new data point x, we use the Bayes formula
P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)} = \frac{P(x \mid y) P(y)}{\sum_{y'} P(x \mid y') P(y')}.
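
The sketch below illustrates this generative recipe on made-up one-dimensional data: estimate P(y) by counting, fit a class-conditional model P(x | y) per class (here a univariate Gaussian, chosen purely for illustration), and combine them with the Bayes formula.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(1)
    # Made-up 1-D training data for two classes, labels in {-1, +1}.
    x_pos = rng.normal(loc=2.0, scale=1.0, size=60)
    x_neg = rng.normal(loc=-1.0, scale=1.5, size=40)

    # Class priors from counts.
    pos, neg = len(x_pos), len(x_neg)
    prior_pos, prior_neg = pos / (pos + neg), neg / (pos + neg)

    # Class-conditional models P(x | y): one univariate Gaussian per class (ML estimates).
    dist_pos = norm(loc=x_pos.mean(), scale=x_pos.std())
    dist_neg = norm(loc=x_neg.mean(), scale=x_neg.std())

    def posterior_pos(x):
        # Bayes formula: P(+1 | x) = P(x | +1) P(+1) / sum_y' P(x | y') P(y')
        num = dist_pos.pdf(x) * prior_pos
        return num / (num + dist_neg.pdf(x) * prior_neg)

    print(posterior_pos(1.0))  # posterior probability of class +1 at x = 1.0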

Generative vs discriminative learning (3)

Examples of discriminative classifiers:
logistic regression
k-NN
decision trees
SVM
multilayer perceptron (MLP)

Examples of generative classifiers:
naive Bayes (NB)
linear discriminant analysis (LDA)
quadratic discriminant analysis (QDA)

We will study all of the above except MLP.

Normal distribution

For probabilistic models of real-valued features x_i ∈ R, one basic ingredient is the normal or Gaussian distribution.

Recall that for a single real-valued random variable, the normal distribution has two parameters µ and σ², and density
N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).
If X has this distribution, then E[X] = µ and Var[X] = σ².

For the multivariate case x ∈ R^p, we shall first consider the case where each individual component x_j has a normal distribution with parameters µ_j and σ_j², and the components are independent:
p(x) = N(x_1 \mid \mu_1, \sigma_1^2) \cdots N(x_p \mid \mu_p, \sigma_p^2).

Normal distribution (2)

We get
p(x) = N(x_1 \mid \mu_1, \sigma_1^2) \cdots N(x_p \mid \mu_p, \sigma_p^2)
= \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)
= \frac{1}{(2\pi)^{p/2} \sigma_1 \cdots \sigma_p} \exp\left( -\frac{1}{2} \sum_{j=1}^{p} \frac{(x_j - \mu_j)^2}{\sigma_j^2} \right)
= \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),
where µ = (µ_1, ..., µ_p) ∈ R^p, Σ ∈ R^{p×p} is the diagonal matrix with σ_1², ..., σ_p² on the diagonal, and |Σ| is the determinant of Σ.
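
As a quick numerical sanity check of this identity (the values below are arbitrary; scipy is used only to provide reference densities), the product of the independent one-dimensional densities agrees with the multivariate density using a diagonal Σ:

    import numpy as np
    from scipy.stats import norm, multivariate_normal

    mu = np.array([0.5, -1.0, 2.0])       # made-up component means
    sigma2 = np.array([1.0, 0.25, 4.0])   # made-up component variances
    x = np.array([0.0, -0.5, 1.0])        # point at which to evaluate the density

    # Product of the p one-dimensional densities.
    prod_1d = np.prod(norm.pdf(x, loc=mu, scale=np.sqrt(sigma2)))

    # Multivariate density with diagonal covariance Sigma = diag(sigma2).
    mv = multivariate_normal(mean=mu, cov=np.diag(sigma2)).pdf(x)

    print(prod_1d, mv)  # the two values agree up to floating-point error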

Normal distribution (3)

More generally, let µ ∈ R^p, and let Σ ∈ R^{p×p} be
symmetric: Σ^T = Σ,
positive definite: x^T Σ x > 0 for all x ∈ R^p \ {0}.

We then define the p-dimensional Gaussian density with parameters µ and Σ as
N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).
If Σ is diagonal, we get the special case where the x_j are independent.
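
The density formula can be implemented directly; the sketch below (made-up µ and Σ) evaluates it by hand and compares against scipy's implementation as a check.

    import numpy as np
    from scipy.stats import multivariate_normal

    def gaussian_density(x, mu, Sigma):
        # N(x | mu, Sigma) for a symmetric positive definite Sigma (p-dimensional).
        p = len(mu)
        diff = x - mu
        quad = diff @ np.linalg.solve(Sigma, diff)  # (x - mu)^T Sigma^{-1} (x - mu)
        norm_const = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * quad) / norm_const

    # Made-up parameters: a 2-D Gaussian with correlated components.
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    x = np.array([0.5, -1.5])

    print(gaussian_density(x, mu, Sigma))
    print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should match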

Normal distribution (4)

To understand the multivariate normal distribution, consider a surface of constant density: for some a,
S = { x ∈ R^p : N(x \mid \mu, \Sigma) = a }.
By the definition of N, this can be written, for some b, as
S = { x ∈ R^p : (x - \mu)^T \Sigma^{-1} (x - \mu) = b }.
Because Σ is symmetric and positive definite, so is Σ^{-1}, and this set is an ellipsoid with centre µ.

Normal distribution (5)

More specifically, since Σ is symmetric and positive definite, it has an eigenvalue decomposition
\Sigma = U \Lambda U^T,
where Λ ∈ R^{p×p} is diagonal and U ∈ R^{p×p} is orthogonal (U^T = U^{-1}), and further
\Sigma^{-1} = U \Lambda^{-1} U^T.

We then know from analytic geometry that for the ellipsoid
S = { x ∈ R^p : (x - \mu)^T \Sigma^{-1} (x - \mu) = b },
the directions of the axes are given by the column vectors of U (the eigenvectors of Σ), and the squared lengths of the axes are given by the elements of Λ (the eigenvalues of Σ).
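
The decomposition is easy to compute numerically; this sketch (an arbitrary 2×2 covariance, not taken from the slides) uses numpy's symmetric eigensolver and verifies the two identities above.

    import numpy as np

    # Made-up symmetric positive definite covariance matrix.
    Sigma = np.array([[3.0, 1.0],
                      [1.0, 2.0]])

    # Eigenvalue decomposition Sigma = U Lambda U^T (eigh is for symmetric matrices).
    eigvals, U = np.linalg.eigh(Sigma)
    Lambda = np.diag(eigvals)

    # Columns of U give the directions of the ellipsoid's axes;
    # the eigenvalues give the squared axis lengths up to the constant b.
    print("axis directions (columns):\n", U)
    print("eigenvalues:", eigvals)

    # Check Sigma = U Lambda U^T and Sigma^{-1} = U Lambda^{-1} U^T.
    print(np.allclose(U @ Lambda @ U.T, Sigma))
    print(np.allclose(U @ np.diag(1.0 / eigvals) @ U.T, np.linalg.inv(Sigma)))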

Normal distribution (6)

Let X = (X_1, ..., X_p) have a normal distribution with parameters µ and Σ. Then
E[X] = \mu and E[(X_r - \mu_r)(X_s - \mu_s)] = \Sigma_{rs}.
Hence, we call the parameter µ the mean and Σ the covariance matrix.

Normal distribution (7)

Let x_1, ..., x_n, where x_i = (x_{i,1}, ..., x_{i,p}), be n independent samples from a p-dimensional normal distribution with unknown mean µ and covariance Σ.

The maximum likelihood estimates are given by
(\hat{\mu}, \hat{\Sigma}) = \arg\max_{\mu, \Sigma} \prod_{i=1}^{n} N(x_i \mid \mu, \Sigma),
namely
\hat{\mu}_r = \frac{1}{n} \sum_{i=1}^{n} x_{i,r},
\hat{\Sigma}_{rs} = \frac{1}{n} \sum_{i=1}^{n} (x_{i,r} - \hat{\mu}_r)(x_{i,s} - \hat{\mu}_s).
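
These estimates take only a couple of lines of numpy; the sketch below (synthetic data, made-up "true" parameters) computes them. Note the 1/n factor: numpy's np.cov would by default divide by n - 1 instead.

    import numpy as np

    rng = np.random.default_rng(2)
    # n samples from a made-up 3-dimensional Gaussian.
    true_mu = np.array([1.0, 0.0, -1.0])
    true_Sigma = np.array([[1.0, 0.3, 0.0],
                           [0.3, 2.0, 0.5],
                           [0.0, 0.5, 1.5]])
    X = rng.multivariate_normal(true_mu, true_Sigma, size=1000)  # shape (n, p)

    n = X.shape[0]
    mu_hat = X.mean(axis=0)                  # mu_hat_r = (1/n) sum_i x_{i,r}
    centered = X - mu_hat
    Sigma_hat = (centered.T @ centered) / n  # Sigma_hat_rs = (1/n) sum_i (x_{i,r} - mu_hat_r)(x_{i,s} - mu_hat_s)

    print(mu_hat)
    print(Sigma_hat)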

Gaussians in classification

LDA and QDA are obtained by modelling the positive and the negative examples each with their own Gaussian:
p(x \mid Y = +1) = N(x \mid \mu_+, \Sigma_+),
p(x \mid Y = -1) = N(x \mid \mu_-, \Sigma_-),
where µ_± and Σ_± are obtained, for example, as maximum likelihood estimates.

The decision boundary is given by
N(x \mid \mu_+, \Sigma_+) = N(x \mid \mu_-, \Sigma_-),
or equivalently
\ln N(x \mid \mu_+, \Sigma_+) = \ln N(x \mid \mu_-, \Sigma_-).

Gaussians in classification (2)

By substituting the formula for N into
\ln N(x \mid \mu_+, \Sigma_+) = \ln N(x \mid \mu_-, \Sigma_-)
and simplifying, we get
(x - \mu_+)^T \Sigma_+^{-1} (x - \mu_+) - (x - \mu_-)^T \Sigma_-^{-1} (x - \mu_-) + \ln \frac{|\Sigma_+|}{|\Sigma_-|} = 0.

If Σ_+ = Σ_-, this is a linear equation, so the decision boundary is a hyperplane: LDA.
In the general case it is a quadratic surface: QDA. In QDA, the decision regions may be non-connected.
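
To make the QDA/LDA distinction concrete, here is a hedged Python sketch (synthetic data; not the course's reference implementation) that fits one Gaussian per class by maximum likelihood and classifies by comparing log-densities, i.e. the rule behind the decision boundary above. The class priors from an earlier slide could be included by adding ln P(y) to each side.

    import numpy as np
    from scipy.stats import multivariate_normal

    rng = np.random.default_rng(3)
    # Synthetic 2-D training data for the two classes (made-up parameters).
    X_pos = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.2], [0.2, 1.0]], size=80)
    X_neg = rng.multivariate_normal([-1.0, 0.0], [[2.0, -0.5], [-0.5, 1.5]], size=120)

    def ml_gaussian(X):
        # Maximum likelihood mean and covariance (dividing by n).
        mu = X.mean(axis=0)
        C = X - mu
        return mu, (C.T @ C) / len(X)

    mu_p, Sigma_p = ml_gaussian(X_pos)
    mu_n, Sigma_n = ml_gaussian(X_neg)

    def qda_predict(x):
        # QDA: each class keeps its own covariance; classify by the larger log-density.
        lp = multivariate_normal(mu_p, Sigma_p).logpdf(x)
        ln = multivariate_normal(mu_n, Sigma_n).logpdf(x)
        return +1 if lp > ln else -1

    # LDA would instead use a single shared covariance, e.g. a pooled estimate:
    w = len(X_pos) / (len(X_pos) + len(X_neg))
    Sigma_shared = w * Sigma_p + (1 - w) * Sigma_n

    print(qda_predict(np.array([1.0, 1.5])))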