The Bayes classifier. Linear discriminant analysis (LDA). Example. Challenges for LDA


Transcription:

The Bayes classifier

Theorem. The classifier $h^*(x) = \arg\max_k \Pr(Y = k \mid X = x)$ satisfies $R(h^*) = \min_h R(h)$, where the min is over all possible classifiers. To calculate the Bayes classifier/Bayes risk, we need to know the class priors $\pi_k = \Pr(Y = k)$ and the class-conditional densities $f_k(x) = f(x \mid Y = k)$. Alternatively, since the maximum of $\Pr(Y = k \mid X = x) = \pi_k f_k(x) / \sum_j \pi_j f_j(x)$ over $k$ does not depend on the denominator, it is sufficient to know $\pi_k f_k(x)$ to find $h^*$.

Linear discriminant analysis (LDA)

In linear discriminant analysis (LDA), we make the (strong) assumption that $f_k(x) = \mathcal{N}(x; \mu_k, \Sigma)$ for $k = 1, \ldots, K$. Here $\mathcal{N}(x; \mu, \Sigma)$ is the multivariate Gaussian/normal distribution with mean $\mu$ and covariance matrix $\Sigma$. Note: each class has the same covariance matrix $\Sigma$.

Example

Suppose that $K = 2$. It turns out that by setting $w = \Sigma^{-1}(\mu_1 - \mu_0)$ and $b = \log(\pi_1 / \pi_0) - \tfrac{1}{2}(\mu_1 + \mu_0)^T \Sigma^{-1} (\mu_1 - \mu_0)$, we can re-write this as $h(x) = \mathbb{1}\{w^T x + b \ge 0\}$, i.e., a linear classifier.

Challenges for LDA

The generative model is rarely valid. Moreover, the number of parameters to be estimated is: class prior probabilities: $K - 1$; means: $Kd$; covariance matrix: $d(d+1)/2$. If $d$ is small and $n$ is large, then we can accurately estimate these parameters (provably, using Hoeffding). If $n$ is small and $d$ is large, then we have more parameters than observations, and will likely obtain very poor estimates. Possible escapes: first apply a dimensionality reduction technique to reduce $d$ (more on this later), or assume a more structured covariance matrix.
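To make the plugin recipe concrete, here is a minimal NumPy sketch (not from the slides) of the two-class LDA rule from the example above. The function names and the {0, 1} label encoding are my own choices; it assumes X is an n-by-d array of features and y is an array of 0/1 labels.

```python
import numpy as np

def fit_lda(X, y):
    """Estimate the LDA parameters (class prior, means, shared covariance)
    from data X (n x d) and labels y in {0, 1}, then form the linear rule."""
    X0, X1 = X[y == 0], X[y == 1]
    pi1 = len(X1) / len(X)
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled (shared) covariance estimate
    Sigma = ((X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)) / (len(X) - 2)
    Sinv = np.linalg.inv(Sigma)
    # Plug the estimates into the linear classifier h(x) = 1{w^T x + b >= 0}
    w = Sinv @ (mu1 - mu0)
    b = np.log(pi1 / (1 - pi1)) - 0.5 * (mu1 + mu0) @ Sinv @ (mu1 - mu0)
    return w, b

def predict_lda(w, b, X):
    return (X @ w + b >= 0).astype(int)
```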

Another possible escape

Recall from the very beginning of the lecture that the Bayes classifier can be stated either in terms of maximizing $\pi_k f_k(x)$ or in terms of maximizing $\Pr(Y = k \mid X = x)$, which is equivalent. In LDA, we are estimating the full joint distribution of $(X, Y)$. All we really need is to be able to estimate $\Pr(Y = k \mid X = x)$; we don't need to know the distribution of $X$ itself. LDA commits one of the cardinal sins of machine learning: never solve a more difficult problem as an intermediate step. Is there a better approach?

Another look at plugin methods

Suppose $Y \in \{0, 1\}$. Define $\eta(x) = \Pr(Y = 1 \mid X = x)$. In this case, another way to express the Bayes classifier is as $h^*(x) = \mathbb{1}\{\eta(x) \ge 1/2\}$. Note that we do not actually need to know the full distribution of $(X, Y)$ to express the Bayes classifier. All we really need is to decide if $\eta(x) \ge 1/2$.

Gaussian case

Suppose that $f_k(x) = \mathcal{N}(x; \mu_k, \Sigma)$ and that $Y \in \{0, 1\}$. Then $\eta(x) = \frac{\pi_1 f_1(x)}{\pi_0 f_0(x) + \pi_1 f_1(x)} = \frac{1}{1 + e^{-(\theta^T x + b)}}$ for some $\theta$ and $b$.

Logistic regression

This observation gives rise to another class of plugin methods, the most important of which is logistic regression, which implements the following strategy:
1. Assume $\eta(x) = \frac{1}{1 + e^{-(\theta^T x + b)}}$ for some $(\theta, b)$.
2. Directly estimate $(\theta, b)$ (somehow) from the data.
3. Plug the estimate into the formula for the Bayes classifier.
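The claim in the Gaussian case, that the posterior is a logistic function of a linear function of $x$, is easy to check numerically. The following quick sketch is mine, not the lecture's; it assumes SciPy is available and uses arbitrary made-up parameters.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 3
mu0, mu1 = rng.normal(size=d), rng.normal(size=d)
A = rng.normal(size=(d, d))
Sigma = A @ A.T + np.eye(d)          # an arbitrary positive definite covariance
pi0, pi1 = 0.4, 0.6
x = rng.normal(size=d)

# eta(x) computed directly from the generative (Gaussian) model
f0 = multivariate_normal.pdf(x, mean=mu0, cov=Sigma)
f1 = multivariate_normal.pdf(x, mean=mu1, cov=Sigma)
eta_direct = pi1 * f1 / (pi0 * f0 + pi1 * f1)

# The same posterior written as a logistic function of theta^T x + b
Sinv = np.linalg.inv(Sigma)
theta = Sinv @ (mu1 - mu0)
b = np.log(pi1 / pi0) - 0.5 * (mu1 + mu0) @ Sinv @ (mu1 - mu0)
eta_logistic = 1.0 / (1.0 + np.exp(-(theta @ x + b)))

print(eta_direct, eta_logistic)      # the two values agree
```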

The logistic function

The function $\sigma(t) = \frac{1}{1 + e^{-t}}$ is called a logistic function (or a sigmoid function in other contexts).

The logistic regression classifier

Denote the logistic regression classifier by $h_\theta(x) = \mathbb{1}\{\sigma(\theta^T x + b) \ge 1/2\}$. Note that $\sigma(\theta^T x + b) \ge 1/2$ exactly when $\theta^T x + b \ge 0$. So $h_\theta$ is a linear classifier.

Example

Estimating the parameters

Challenge: How to estimate the parameters $(\theta, b)$ for $\eta(x) = \sigma(\theta^T x + b)$? One possibility is to go back through the generative (Gaussian) model; the alternative developed below is maximum likelihood estimation. For convenience, let's absorb the offset $b$ into $\theta$ by appending a constant 1 to $x$, so that we can simply write $\sigma(\theta^T x)$. Note that this quantity is really a function of both $x$ and $\theta$, so we will use the notation $\eta_\theta(x)$ to highlight this dependence.
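A minimal sketch of the classifier itself (my own code, not from the slides). It assumes each row of X already includes a constant 1, so the offset is absorbed into theta as described above.

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def predict(theta, X):
    """Logistic regression classifier: 1{ sigma(theta^T x) >= 1/2 },
    which is the same as 1{ theta^T x >= 0 } -- a linear classifier."""
    return (X @ theta >= 0).astype(int)
```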

The a posteriori probability of our data

Suppose that we knew $\theta$. Then we could compute $\Pr(y_i \mid x_i; \theta) = \eta_\theta(x_i)^{y_i} (1 - \eta_\theta(x_i))^{1 - y_i}$. Because of independence, we also have that $\Pr(y_1, \ldots, y_n \mid x_1, \ldots, x_n; \theta) = \prod_{i=1}^n \Pr(y_i \mid x_i; \theta)$.

Maximum likelihood estimation

We don't actually know $\theta$, but we do know the data $(x_1, y_1), \ldots, (x_n, y_n)$. Suppose we view the data to be fixed, and view this probability as just a function of $\theta$. When we do this, the result, called the likelihood (or likelihood function), is $L(\theta) = \prod_{i=1}^n \Pr(y_i \mid x_i; \theta)$. The method of maximum likelihood aims to estimate $\theta$ by finding the $\theta$ that maximizes the likelihood. In practice, it is often more convenient to focus on maximizing the log-likelihood $\ell(\theta) = \log L(\theta)$. It is often easier to work with summations instead of products, and since the log is a monotonic transformation, maximizing $\ell(\theta)$ is equivalent to maximizing $L(\theta)$.

The log-likelihood

To see why, note that the likelihood in our case is given by $L(\theta) = \prod_{i=1}^n \eta_\theta(x_i)^{y_i} (1 - \eta_\theta(x_i))^{1 - y_i}$. Thus, the log-likelihood is given by $\ell(\theta) = \sum_{i=1}^n \left[ y_i \log \eta_\theta(x_i) + (1 - y_i) \log(1 - \eta_\theta(x_i)) \right]$.

Notation

The logistic function satisfies $1 - \sigma(t) = \sigma(-t)$. This means that $1 - \eta_\theta(x_i) = \sigma(-\theta^T x_i)$, which lets us write $\Pr(y_i \mid x_i; \theta) = \sigma(\theta^T x_i)^{y_i} \sigma(-\theta^T x_i)^{1 - y_i}$. Thus, if we let $\tilde{y}_i = 2 y_i - 1 \in \{-1, +1\}$, then we can write $\Pr(y_i \mid x_i; \theta) = \sigma(\tilde{y}_i \theta^T x_i)$.
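As a sketch (my own code, using the $\pm 1$ rewriting above), the log-likelihood can be evaluated as follows; np.logaddexp is used only for numerical stability.

```python
import numpy as np

def log_likelihood(theta, X, y):
    """l(theta) = sum_i [ y_i log eta_theta(x_i) + (1 - y_i) log(1 - eta_theta(x_i)) ],
    evaluated as -sum_i log(1 + exp(-ytilde_i * theta^T x_i)) with ytilde_i = 2 y_i - 1."""
    y_pm = 2 * y - 1                              # map {0, 1} labels to {-1, +1}
    return -np.sum(np.logaddexp(0.0, -y_pm * (X @ theta)))
```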

Simplifying the log-likelihood

Facts: $1 - \sigma(t) = \sigma(-t)$ and $\log \sigma(t) = -\log(1 + e^{-t})$. Thus $\ell(\theta) = \sum_{i=1}^n \log \sigma(\tilde{y}_i \theta^T x_i) = -\sum_{i=1}^n \log\left(1 + e^{-\tilde{y}_i \theta^T x_i}\right)$.

Maximizing the log-likelihood

How can we maximize $\ell(\theta)$ with respect to $\theta$? Find a $\theta$ such that $\nabla \ell(\theta) = 0$ (i.e., compute the partial derivatives and set them to zero). This gives us one equation per component of $\theta$, but they are nonlinear and have no closed-form solution.

Computing the gradient

It is not too hard to show that $\nabla \ell(\theta) = \sum_{i=1}^n \left( y_i - \sigma(\theta^T x_i) \right) x_i$.
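A direct translation of the gradient formula into NumPy (a sketch of my own, with the same X and y conventions as above):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient(theta, X, y):
    """Gradient of the log-likelihood: sum_i (y_i - sigma(theta^T x_i)) x_i."""
    return X.T @ (y - sigmoid(X @ theta))
```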

Optimization

Throughout signal processing and machine learning, we will very often encounter problems of the form $\hat{\theta} = \arg\min_\theta J(\theta)$ (or, for today, $\arg\max_\theta \ell(\theta)$). In many (most?) cases, we cannot compute the solution simply by setting $\nabla J(\theta) = 0$ and solving for $\theta$. However, there are many powerful algorithms for finding $\hat{\theta}$ using a computer.

Gradient descent

A simple way to try to find the minimum of our objective function is to iteratively "roll downhill": from $\theta^k$, take a step in the direction of the negative gradient, $\theta^{k+1} = \theta^k - \alpha \nabla J(\theta^k)$, where $\alpha$ is the step size.

Convergence of gradient descent

The core iteration of gradient descent is to compute $\theta^{k+1} = \theta^k - \alpha \nabla J(\theta^k)$. Note that if $\nabla J(\theta^k) = 0$, then $\theta^{k+1} = \theta^k$ and we have found a minimum, so the algorithm will terminate. If $J$ is convex and sufficiently smooth, then gradient descent (with a sufficiently small fixed step size $\alpha$) is guaranteed to converge to the global minimum of $J$. [Figure: a convex function versus a function that is not convex.]

Step size matters!

Even though gradient descent provably converges, it can potentially take a while.
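A minimal gradient descent sketch for logistic regression (mine, not the lecture's): it minimizes the negative log-likelihood with an assumed fixed step size alpha that would need tuning in practice, and an arbitrary iteration count.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_gd(X, y, alpha=0.1, iters=1000):
    """Minimize the (convex) negative log-likelihood J(theta) = -l(theta) by
    theta^{k+1} = theta^k - alpha * grad J(theta^k)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad_J = -X.T @ (y - sigmoid(X @ theta))   # gradient of the negative log-likelihood
        theta = theta - alpha * grad_J
    return theta
```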

Newton's method

Also known as the Newton-Raphson method, this approach can be viewed as simply using the second derivative to automatically select an appropriate step size: $\theta^{k+1} = \theta^k - \left[ \nabla^2 J(\theta^k) \right]^{-1} \nabla J(\theta^k)$, where $\nabla^2 J(\theta)$ is the Hessian matrix.

Optimization for logistic regression

The negative log-likelihood in logistic regression is a convex function. Both gradient descent and Newton's method are common strategies for setting the parameters in logistic regression. Newton's method is much faster when the dimension $d$ is small, but is impractical when $d$ is large. Why? More on this on next week's homework. (A sketch of the Newton iteration appears after the comparison below.)

Comparison of plugin methods

Naïve Bayes, LDA, and logistic regression are all plugin methods that result in linear classifiers. Naïve Bayes: a plugin method based on density estimation; scales well to high dimensions and naturally handles a mixture of discrete and continuous features. Linear discriminant analysis: better if the Gaussianity assumptions are valid. Logistic regression: models only the distribution of $Y$ given $X$, not of $X$; valid for a larger class of distributions; fewer parameters to estimate.
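For completeness, a sketch of the Newton-Raphson iteration for logistic regression (my own code, not from the slides). The comment on the Hessian hints at why the method becomes impractical when $d$ is large.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic_newton(X, y, iters=10):
    """theta^{k+1} = theta^k - [H(theta^k)]^{-1} grad J(theta^k), where J is the
    negative log-likelihood, grad J = -X^T (y - p) and H = X^T diag(p (1 - p)) X."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad_J = -X.T @ (y - p)
        H = X.T @ (X * (p * (1 - p))[:, None])    # d x d Hessian: forming and solving it costs O(n d^2 + d^3)
        theta = theta - np.linalg.solve(H, grad_J)
    return theta
```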

Beyond plugin methods

Plugin methods can be useful in practice, but ultimately they are very limited. There are always distributions where our assumptions are violated; if our assumptions are wrong, the output is totally unpredictable; it can be hard to verify whether our assumptions are right; and they require solving a more difficult problem as an intermediate step. For most of the remainder of this course we will focus on nonparametric methods that avoid making such strong assumptions about the (unknown) process generating the data.