Introduction to Machine Learning


DATA11002 Introduction to Machine Learning
Lecturer: Teemu Roos
TAs: Ville Hyvönen and Janne Leppä-aho
Department of Computer Science, University of Helsinki
(based in part on material by Patrik Hoyer and Jyrki Kivinen)
November 2nd – December 15th, 2017

Classification: Probabilistic Methods

Logistic regression

Logistic regression models are linear models for probabilistic binary classification (so not really regression, where the response is continuous). Given an input (vector) x, the output is the probability that Y = 1; let us denote it by $\hat{p}(Y = 1 \mid x)$. However, instead of using a linear model directly, as in $\hat{p}(Y = 1 \mid x) = \beta^T x$, we let
$$\log \frac{\hat{p}(Y = 1 \mid x)}{\hat{p}(Y = 0 \mid x)} = \beta^T x.$$
This amounts to the same as
$$\hat{p}(Y = 1 \mid x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)} = \frac{1}{\exp(-\beta^T x) + 1}.$$
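As a small illustration, here is a minimal Python/NumPy sketch of this model: a linear score $\beta^T x$ passed through the logistic function gives a probability in (0, 1). The weights and input below are made-up values, not anything from the lecture.

```python
import numpy as np

def p_hat(x, beta):
    """Logistic model: P(Y = 1 | x) = 1 / (1 + exp(-beta^T x))."""
    return 1.0 / (1.0 + np.exp(-np.dot(beta, x)))

# Made-up weights and input; the trailing 1 in x plays the role of an intercept.
beta = np.array([0.5, -1.2, 0.3])
x = np.array([2.0, 1.0, 1.0])
print(p_hat(x, beta))  # a probability strictly between 0 and 1
```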

Logistic regression (2)

For convenience, we use here class labels 0 and 1. Given the probabilistic prediction $\hat{p}(y \mid x)$, and assuming instance $x_i$ has already been observed, the conditional likelihood for a sample point $(x_i, y_i)$ is
$$\begin{cases} \hat{p}(Y = 1 \mid x_i) & \text{if } y_i = 1 \\ 1 - \hat{p}(Y = 1 \mid x_i) & \text{if } y_i = 0, \end{cases}$$
which we write as
$$\hat{p}(Y = 1 \mid x_i)^{y_i} \, \big(1 - \hat{p}(Y = 1 \mid x_i)\big)^{1 - y_i}.$$

Logistic regression (3)

The conditional likelihood of a sequence of independent samples $(x_i, y_i)$, $i = 1, \ldots, n$, is then
$$\prod_{i=1}^{n} \hat{p}(Y = 1 \mid x_i)^{y_i} \, \big(1 - \hat{p}(Y = 1 \mid x_i)\big)^{1 - y_i}.$$
We say "conditional" to emphasise that we take the $x_i$ as given and only model the probability of the labels $y_i$. To maximise the conditional likelihood, we can equivalently maximise the conditional log-likelihood
$$LCL(\beta) = \sum_{i=1}^{n} \big( y_i \ln \hat{p}(Y = 1 \mid x_i) + (1 - y_i) \ln (1 - \hat{p}(Y = 1 \mid x_i)) \big).$$
Note that maximising this is the same as minimising the log-loss!
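The conditional log-likelihood can be written down directly; the sketch below (a toy example with made-up data and labels in {0, 1}) evaluates $LCL(\beta)$ and illustrates that it is just the negative of the total log-loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lcl(beta, X, y):
    """Conditional log-likelihood LCL(beta) for labels y in {0, 1};
    maximising it is the same as minimising the total log-loss."""
    p = np.clip(sigmoid(X @ beta), 1e-12, 1 - 1e-12)  # avoid log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Made-up toy data: 4 points, 2 features plus a column of ones for the intercept.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 0.5, -1.0, 1.0],
              [-1.5,  0.3, 1.0],
              [ 2.0,  2.5, 1.0]])
y = np.array([1, 0, 0, 1])
print(lcl(np.zeros(3), X, y))  # equals 4 * ln(1/2) at beta = 0
```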

Logistic regression (4)

Maximising the likelihood (or minimising the log-loss) is not as straightforward as in the case of linear regression. Nevertheless, the problem is convex, which means that gradient-based techniques can be used to find the optimum; standard routines exist in R, Python, Matlab, etc. Logistic regression is often used with regularisation, as in linear regression:
$$\text{ridge:} \quad \arg\max_{\beta} \big( LCL(\beta) - \lambda \|\beta\|_2^2 \big)$$
$$\text{lasso:} \quad \arg\max_{\beta} \big( LCL(\beta) - \lambda \|\beta\|_1 \big)$$
In particular, if the data are linearly separable, the non-regularised solution tends to infinity.
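Since the objective is concave in $\beta$ (the minimisation problem is convex), plain gradient ascent already works for a toy problem. The sketch below maximises the ridge-regularised objective $LCL(\beta) - \lambda\|\beta\|_2^2$; the step size, data and penalty are made-up, and in practice one would use a library routine (e.g. glm in R or scikit-learn's LogisticRegression) rather than this hand-rolled loop.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_ridge(X, y, lam=0.1, lr=0.1, n_iter=5000):
    """Gradient ascent on LCL(beta) - lam * ||beta||_2^2.
    The gradient of LCL is X^T (y - p); the penalty adds -2 * lam * beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta += lr * (X.T @ (y - p) - 2 * lam * beta)
    return beta

# Made-up, linearly separable toy data: without the penalty (lam = 0),
# the weights would keep growing; the ridge term keeps them finite.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 0.5, -1.0, 1.0],
              [-1.5,  0.3, 1.0],
              [ 2.0,  2.5, 1.0]])
y = np.array([1, 0, 0, 1])
print(fit_logistic_ridge(X, y))
```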

Generative vs discriminative learning

Logistic regression was an example of a discriminative and probabilistic classifier that directly models the class distribution $P(y \mid x)$. Another probabilistic way to approach the problem is generative learning, which builds a model for the whole joint distribution $P(x, y)$, often using the decomposition $P(y)\,P(x \mid y)$.

Both approaches have their pros and cons. Discriminative learning: you only solve the task that you actually need to solve; it may provide better accuracy since it focuses on the specific learning task; optimisation tends to be harder. Generative learning: it is often more natural to build models for $P(x \mid y)$ than for $P(y \mid x)$; it handles missing data more naturally; optimisation is often easier.

Generative vs discriminative learning (2)

Estimating the class prior $P(y)$ is usually simple. For example, in binary classification, this time with $Y \in \{-1, +1\}$, we can usually just count the number of positive examples Pos and negative examples Neg and set
$$P(Y = +1) = \frac{\text{Pos}}{\text{Pos} + \text{Neg}} \quad \text{and} \quad P(Y = -1) = \frac{\text{Neg}}{\text{Pos} + \text{Neg}}.$$
Since $P(x, y) = P(x \mid y)\,P(y)$, what remains is estimating $P(x \mid y)$. In binary classification, we use the positive examples to build a model for $P(x \mid Y = +1)$ and the negative examples to build a model for $P(x \mid Y = -1)$. To classify a new data point x, we use the Bayes formula
$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} = \frac{P(x \mid y)\,P(y)}{\sum_{y'} P(x \mid y')\,P(y')}.$$
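A short sketch of this recipe in Python, with made-up one-dimensional Gaussian class-conditional models standing in for whatever model of $P(x \mid y)$ has been estimated; only the counting of the priors and the Bayes formula itself come from the slide.

```python
import numpy as np

def class_priors(y):
    """Estimate P(Y = +1) and P(Y = -1) by counting labels in {-1, +1}."""
    pos = np.sum(y == +1)
    neg = np.sum(y == -1)
    return pos / (pos + neg), neg / (pos + neg)

def posterior_positive(x, p_pos, p_neg, prior_pos, prior_neg):
    """Bayes formula: P(Y = +1 | x) = P(x | +1) P(+1) / sum_y' P(x | y') P(y')."""
    num = p_pos(x) * prior_pos
    return num / (num + p_neg(x) * prior_neg)

# Made-up labels and made-up 1-d Gaussian models for P(x | y).
y = np.array([+1, +1, -1, +1, -1])
prior_pos, prior_neg = class_priors(y)        # 3/5 and 2/5
gauss = lambda x, mu, s2: np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
p_pos = lambda x: gauss(x, 1.0, 1.0)          # model for P(x | Y = +1)
p_neg = lambda x: gauss(x, -1.0, 1.0)         # model for P(x | Y = -1)
print(posterior_positive(0.2, p_pos, p_neg, prior_pos, prior_neg))
```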

Generative vs discriminative learning (3)

Examples of discriminative classifiers: logistic regression, k-NN, decision trees, SVM, multilayer perceptron (MLP).

Examples of generative classifiers: naive Bayes (NB), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA).

We will study all of the above except MLP.

Normal distribution

For probabilistic models of real-valued features $x_i \in \mathbb{R}$, one basic ingredient is the normal or Gaussian distribution. Recall that for a single real-valued random variable, the normal distribution has two parameters $\mu$ and $\sigma^2$, and density
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right).$$
If X has this distribution, then $E[X] = \mu$ and $\mathrm{Var}[X] = \sigma^2$. For the multivariate case $x \in \mathbb{R}^p$, we shall first consider the case where each individual component $x_j$ has a normal distribution with parameters $\mu_j$ and $\sigma_j^2$ and the components are independent:
$$p(x) = \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \cdots \mathcal{N}(x_p \mid \mu_p, \sigma_p^2).$$

Normal distribution (2)

We get
$$p(x) = \mathcal{N}(x_1 \mid \mu_1, \sigma_1^2) \cdots \mathcal{N}(x_p \mid \mu_p, \sigma_p^2)
= \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(x_j - \mu_j)^2}{2\sigma_j^2} \right)
= \frac{1}{(2\pi)^{p/2}\,\sigma_1 \cdots \sigma_p} \exp\left( -\frac{1}{2} \sum_{j=1}^{p} \frac{(x_j - \mu_j)^2}{\sigma_j^2} \right)
= \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),$$
where $\mu = (\mu_1, \ldots, \mu_p) \in \mathbb{R}^p$, $\Sigma \in \mathbb{R}^{p \times p}$ is a diagonal matrix with $\sigma_1^2, \ldots, \sigma_p^2$ on the diagonal, and $|\Sigma|$ is the determinant of $\Sigma$.
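As a quick numerical check of this identity (with made-up parameters), the product of univariate densities and the matrix form with diagonal $\Sigma$ give the same value:

```python
import numpy as np

def univariate_normal(x, mu, s2):
    return np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)

def mvn_diag(x, mu, sigma2):
    """N(x | mu, Sigma) with Sigma = diag(sigma2)."""
    Sigma = np.diag(sigma2)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    norm = (2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Made-up 3-dimensional example.
mu = np.array([0.0, 1.0, -1.0])
sigma2 = np.array([1.0, 0.5, 2.0])
x = np.array([0.3, 0.8, -0.5])
product = np.prod([univariate_normal(x[j], mu[j], sigma2[j]) for j in range(3)])
print(product, mvn_diag(x, mu, sigma2))  # the two values agree
```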

Normal distribution (3)

More generally, let $\mu \in \mathbb{R}^p$, and let $\Sigma \in \mathbb{R}^{p \times p}$ be
symmetric: $\Sigma^T = \Sigma$, and
positive definite: $x^T \Sigma x > 0$ for all $x \in \mathbb{R}^p \setminus \{0\}$.
We then define the p-dimensional Gaussian density with parameters $\mu$ and $\Sigma$ as
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).$$
If $\Sigma$ is diagonal, we get the special case where the $x_j$ are independent.
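For the general (non-diagonal) case, the density can be evaluated directly from the formula. The sketch below uses a made-up correlated covariance matrix and, as a sanity check, compares against SciPy's multivariate normal (assuming SciPy is available).

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_density(x, mu, Sigma):
    """General p-dimensional Gaussian density N(x | mu, Sigma)."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)   # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

# Made-up symmetric, positive definite covariance; the off-diagonal 0.8
# makes the two components correlated (the density contours are tilted ellipses).
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([0.5, -0.3])
print(mvn_density(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # same value
```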

Normal distribution (4)

To understand the multivariate normal distribution, consider a surface of constant density:
$$S = \{ x \in \mathbb{R}^p : \mathcal{N}(x \mid \mu, \Sigma) = a \} \quad \text{for some } a.$$
By the definition of $\mathcal{N}$, this can be written as
$$S = \{ x \in \mathbb{R}^p : (x - \mu)^T \Sigma^{-1} (x - \mu) = b \} \quad \text{for some } b.$$
Because $\Sigma$ is symmetric and positive definite, so is $\Sigma^{-1}$, and this set is an ellipsoid with centre $\mu$.

Normal distribution (5)

More specifically, since $\Sigma$ is symmetric and positive definite, it has an eigenvalue decomposition
$$\Sigma = U \Lambda U^T,$$
where $\Lambda \in \mathbb{R}^{p \times p}$ is diagonal and $U \in \mathbb{R}^{p \times p}$ is orthogonal ($U^T = U^{-1}$), and further $\Sigma^{-1} = U \Lambda^{-1} U^T$. We then know from analytic geometry that for the ellipsoid
$$S = \{ x \in \mathbb{R}^p : (x - \mu)^T \Sigma^{-1} (x - \mu) = b \},$$
the directions of the axes are given by the column vectors of $U$ (the eigenvectors of $\Sigma$), and the squared lengths of the axes are proportional to the elements of $\Lambda$ (the eigenvalues of $\Sigma$).
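This connection is easy to inspect numerically. The sketch below (with a made-up 2×2 covariance) computes the eigenvalue decomposition using NumPy's eigh, which is intended for symmetric matrices, and checks the two identities above.

```python
import numpy as np

# Made-up symmetric, positive definite covariance matrix.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# Eigenvalue decomposition Sigma = U Lambda U^T.
eigvals, U = np.linalg.eigh(Sigma)

print("axis directions (columns of U, i.e. eigenvectors):")
print(U)
print("eigenvalues (axis scaling):", eigvals)

# Check Sigma = U Lambda U^T and Sigma^{-1} = U Lambda^{-1} U^T.
print(np.allclose(U @ np.diag(eigvals) @ U.T, Sigma))
print(np.allclose(U @ np.diag(1.0 / eigvals) @ U.T, np.linalg.inv(Sigma)))
```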

Normal distribution (6)

Let $X = (X_1, \ldots, X_p)$ have a normal distribution with parameters $\mu$ and $\Sigma$. Then
$$E[X] = \mu \quad \text{and} \quad E[(X_r - \mu_r)(X_s - \mu_s)] = \Sigma_{rs}.$$
Hence, we call the parameter $\mu$ the mean and $\Sigma$ the covariance matrix.

Normal distribution (7)

Let $x_1, \ldots, x_n$, where $x_i = (x_{i,1}, \ldots, x_{i,p})$, be n independent samples from a p-dimensional normal distribution with unknown mean $\mu$ and covariance $\Sigma$. The maximum likelihood estimates are given by
$$(\hat{\mu}, \hat{\Sigma}) = \arg\max_{\mu, \Sigma} \prod_{i=1}^{n} \mathcal{N}(x_i \mid \mu, \Sigma),$$
$$\hat{\mu}_r = \frac{1}{n} \sum_{i=1}^{n} x_{i,r}, \qquad \hat{\Sigma}_{rs} = \frac{1}{n} \sum_{i=1}^{n} (x_{i,r} - \hat{\mu}_r)(x_{i,s} - \hat{\mu}_s).$$
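These estimates are just the sample mean and the (1/n-normalised) sample covariance. A small sketch, with a made-up "true" Gaussian to sample from:

```python
import numpy as np

def gaussian_mle(X):
    """ML estimates of mu and Sigma from the rows of X (n x p).
    Note the 1/n normalisation (np.cov uses 1/(n-1) by default)."""
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / X.shape[0]
    return mu_hat, Sigma_hat

# Made-up true parameters; sample from them and recover the estimates.
rng = np.random.default_rng(0)
true_mu = np.array([1.0, -2.0])
true_Sigma = np.array([[1.0, 0.4],
                       [0.4, 2.0]])
X = rng.multivariate_normal(true_mu, true_Sigma, size=2000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat)      # close to true_mu
print(Sigma_hat)   # close to true_Sigma
```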

Gaussians in classification

LDA and QDA are obtained by modelling the positive and negative examples each with their own Gaussian:
$$p(x \mid Y = +1) = \mathcal{N}(x \mid \mu_+, \Sigma_+), \qquad p(x \mid Y = -1) = \mathcal{N}(x \mid \mu_-, \Sigma_-),$$
where $\mu_\pm$ and $\Sigma_\pm$ are obtained, for example, as maximum likelihood estimates. Assuming equal class priors, the decision boundary is given by
$$\mathcal{N}(x \mid \mu_+, \Sigma_+) = \mathcal{N}(x \mid \mu_-, \Sigma_-),$$
or equivalently
$$\ln \mathcal{N}(x \mid \mu_+, \Sigma_+) = \ln \mathcal{N}(x \mid \mu_-, \Sigma_-).$$
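Fitting these class-conditional Gaussians amounts to applying the maximum likelihood estimates of the previous slide separately to the positive and the negative examples. A sketch on made-up two-dimensional data (separate covariances, i.e. the QDA setting):

```python
import numpy as np

def fit_class_gaussians(X, y):
    """Fit one Gaussian per class by maximum likelihood; labels y in {-1, +1}.
    Returns a dict mapping each label to its (mu, Sigma) estimate."""
    params = {}
    for label in (+1, -1):
        Xc = X[y == label]
        mu = Xc.mean(axis=0)
        centered = Xc - mu
        params[label] = (mu, centered.T @ centered / Xc.shape[0])
    return params

# Made-up toy data: 200 points per class, drawn from two different Gaussians.
rng = np.random.default_rng(1)
X_pos = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.3], [0.3, 1.0]], size=200)
X_neg = rng.multivariate_normal([-1.0, 0.0], [[0.5, 0.0], [0.0, 2.0]], size=200)
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 200 + [-1] * 200)
params = fit_class_gaussians(X, y)
print(params[+1][0], params[-1][0])  # estimated class means mu_+ and mu_-
```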

Gaussians in classification (2)

By substituting the formula for $\mathcal{N}$ into $\ln \mathcal{N}(x \mid \mu_+, \Sigma_+) = \ln \mathcal{N}(x \mid \mu_-, \Sigma_-)$ and simplifying, we get
$$(x - \mu_+)^T \Sigma_+^{-1} (x - \mu_+) - (x - \mu_-)^T \Sigma_-^{-1} (x - \mu_-) + \ln \frac{|\Sigma_+|}{|\Sigma_-|} = 0.$$
If $\Sigma_+ = \Sigma_-$, this is a linear equation, so the decision boundary is a hyperplane: LDA. In the general case this is a quadratic surface: QDA. In QDA, the decision regions may be non-connected.
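A minimal sketch of the resulting classification rule, under the equal-prior assumption used above: classify x by the sign of $\ln \mathcal{N}(x \mid \mu_+, \Sigma_+) - \ln \mathcal{N}(x \mid \mu_-, \Sigma_-)$. The class parameters below are made-up; in practice they would be the per-class maximum likelihood estimates, or one would simply use a library implementation such as scikit-learn's LinearDiscriminantAnalysis / QuadraticDiscriminantAnalysis.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """ln N(x | mu, Sigma)."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)
    return -0.5 * (p * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)) + quad)

def classify(x, params):
    """Pick the class whose Gaussian has the higher log-density at x.
    With a shared Sigma the boundary is a hyperplane (LDA); with separate
    Sigmas it is a quadratic surface (QDA)."""
    score = log_gaussian(x, *params[+1]) - log_gaussian(x, *params[-1])
    return +1 if score > 0 else -1

# Made-up class parameters (e.g. maximum likelihood estimates as above).
params = {
    +1: (np.array([2.0, 2.0]), np.array([[1.0, 0.3], [0.3, 1.0]])),
    -1: (np.array([-1.0, 0.0]), np.array([[0.5, 0.0], [0.0, 2.0]])),
}
print(classify(np.array([1.5, 1.5]), params))   # +1
print(classify(np.array([-1.0, 0.5]), params))  # -1
```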