Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

HW 1 due today. Parameter Estimation, Biometrics CSE 190, Lecture 7. Today's lecture was on the blackboard; these slides are an alternative presentation of the material. (CSE190, Winter10)

Outline (part 1): Introduction; Maximum-Likelihood Estimation; Example of a Specific Case; The Gaussian Case: unknown µ and σ; Bias.

Introduction
- Data availability in a Bayesian framework: we could design an optimal classifier if we knew the priors P(ω_i) and the class-conditional densities P(x | ω_i). Unfortunately, we rarely have this complete information!
- Instead we design a classifier from a training sample. Estimating the priors poses no problem, but the samples are often too small to estimate the class-conditional densities (large dimension of the feature space!).
- A priori information about the problem: normality of P(x | ω_i), i.e. P(x | ω_i) ~ N(µ_i, Σ_i), characterized by two parameters (the mean vector and the covariance matrix).

Estimation techniques
- In ML estimation the parameters are fixed but unknown; the best parameters are obtained by maximizing the probability of obtaining the samples observed.
- Bayesian methods view the parameters as random variables having some known prior distribution.
- Maximum-Likelihood (ML) and Bayesian estimation give nearly identical results, but the approaches are different. In either approach, we use P(ω_i | x) for our classification rule!
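To make the decision rule concrete, here is a minimal sketch (not from the slides; all class parameters below are illustrative placeholders) of classifying a point by the posterior P(ω_i | x) ∝ p(x | ω_i) P(ω_i) under the Gaussian class-conditional assumption:

import numpy as np
from scipy.stats import multivariate_normal

# Classify x by the posterior P(omega_i | x) ∝ p(x | omega_i) P(omega_i),
# with Gaussian class-conditionals N(mu_i, Sigma_i). All numbers are made up.
priors = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
covs = [np.eye(2), np.array([[1.5, 0.3], [0.3, 0.7]])]

def classify(x):
    scores = [p * multivariate_normal.pdf(x, m, c)
              for p, m, c in zip(priors, means, covs)]
    return int(np.argmax(scores))  # index of the most probable class

print(classify(np.array([1.8, 0.9])))  # -> 1

The rest of the chapter is about where the µ_i and Σ_i (and, in the Bayesian view, their distribution) come from when only training data is available.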

Maximum-Likelihood Estimation
- Has good convergence properties as the sample size increases.
- Simpler than alternative techniques.

General principle
- Use the information provided by the training samples to estimate θ = (θ_1, θ_2, ..., θ_c), where each θ_i (i = 1, 2, ..., c) is associated with one category.
- Suppose that D contains n samples, x_1, x_2, ..., x_n.
- Assume we have c classes and P(x | ω_j) ~ N(µ_j, Σ_j), so P(x | ω_j) ≡ P(x | ω_j, θ_j), where θ_j consists of µ_j and Σ_j.
- The ML estimate of θ is, by definition, the value that maximizes P(D | θ); it is the value of θ that best agrees with the actually observed training sample.

Optimal estimation
- Let θ = (θ_1, θ_2, ..., θ_p)^t and let ∇_θ be the gradient operator.
- We define l(θ) as the log-likelihood function: l(θ) = ln P(D | θ).
- New problem statement: determine the θ that maximizes the log-likelihood. The set of necessary conditions for an optimum is ∇_θ l = 0.

Example of a specific case: unknown µ, Σ known
- P(x_k | µ) ~ N(µ, Σ) (the samples are drawn from a multivariate normal population), and here θ = µ.
- Therefore ∇_µ ln P(x_k | µ) = Σ^(-1)(x_k − µ), and the ML estimate for µ must satisfy Σ_k Σ^(-1)(x_k − µ̂) = 0 (sum over the n training samples).
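As a sanity check, here is a small NumPy sketch (illustrative, with made-up data) of the "unknown µ, known Σ" case: the sample mean maximizes the log-likelihood and satisfies the gradient condition above.

import numpy as np

rng = np.random.default_rng(0)

# "Unknown mu, Sigma known": draw a training set D from a 2-D Gaussian.
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
mu_true = np.array([1.0, -2.0])
X = rng.multivariate_normal(mu_true, Sigma, size=500)

def log_likelihood(mu, X, Sigma):
    # l(mu) = sum_k ln N(x_k; mu, Sigma)
    d = X.shape[1]
    Sinv = np.linalg.inv(Sigma)
    diff = X - mu
    quad = np.einsum('ni,ij,nj->n', diff, Sinv, diff)
    const = -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma)))
    return np.sum(const - 0.5 * quad)

mu_hat = X.mean(axis=0)  # closed-form ML estimate: the arithmetic average

# Gradient condition Sigma^{-1} * sum_k (x_k - mu_hat) = 0 holds at mu_hat:
print(np.linalg.inv(Sigma) @ (X - mu_hat).sum(axis=0))   # numerically ~ [0, 0]
print(log_likelihood(mu_hat, X, Sigma) >= log_likelihood(mu_true, X, Sigma))  # True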

Multiplying by Σ and rearranging, we obtain µ̂ = (1/n) Σ_k x_k: just the arithmetic average of the training samples!

ML Estimation, Gaussian case with unknown µ and σ
- θ = (θ_1, θ_2) = (µ, σ^2).
- Setting the derivatives of l(θ) with respect to µ and σ^2 to zero gives two conditions, (1) and (2); combining them, one obtains µ̂ = (1/n) Σ_k x_k and σ̂^2 = (1/n) Σ_k (x_k − µ̂)^2.
- Conclusion: if P(x_k | ω_j) (j = 1, 2, ..., c) is assumed to be Gaussian in a d-dimensional feature space, then we can estimate the parameter vector θ = (θ_1, θ_2, ..., θ_c)^t and perform an optimal classification!

Bias
- The ML estimate for σ^2 is biased: E[σ̂^2] = ((n − 1)/n) σ^2 ≠ σ^2.
- An elementary unbiased estimator for Σ is C = (1/(n − 1)) Σ_k (x_k − µ̂)(x_k − µ̂)^t.

Appendix: ML Problem Statement
- Let D = {x_1, x_2, ..., x_n}, with |D| = n and P(x_1, ..., x_n | θ) = Π_{k=1..n} P(x_k | θ).
- Our goal is to determine θ̂, the value of θ that makes this sample the most representative.

Bayesian Parameter Estimation (part 2), outline: Bayesian Estimation (BE); Bayesian Parameter Estimation: Gaussian Case; Bayesian Parameter Estimation: General Estimation; Problems of Dimensionality; Computational Complexity; Component Analysis and Discriminants; Hidden Markov Models.
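A quick simulation (illustrative numbers, not from the slides) of the bias statement: averaging the ML variance estimate over many small samples recovers roughly ((n − 1)/n) σ^2, while the version that divides by n − 1 is unbiased.

import numpy as np

rng = np.random.default_rng(1)
sigma2_true, n, trials = 4.0, 10, 20000

# Average the ML (biased) and the n-1 (unbiased) variance estimates over many
# small samples to see the (n-1)/n shrinkage predicted above.
ml_est, unbiased_est = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(sigma2_true), size=n)
    mu_hat = x.mean()
    ml_est.append(np.mean((x - mu_hat) ** 2))   # divides by n
    unbiased_est.append(np.var(x, ddof=1))      # divides by n - 1

print(np.mean(ml_est))        # ~ (n-1)/n * 4 = 3.6
print(np.mean(unbiased_est))  # ~ 4.0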

Bayesian Estimation
- In MLE, θ was assumed to be fixed; in BE, θ is a random variable.
- The computation of the posterior probabilities P(ω_i | x) used for classification lies at the heart of Bayesian classification.
- Given the sample D, Bayes formula can be written with two simplifying assumptions:
  - the samples D_i provide information about class i only, where D = {D_1, ..., D_c};
  - P(ω_i) = P(ω_i | D_i) (i.e., the samples D_i determine the prior on ω_i).
- Goal: compute p(ω_i | x, D_i).
- The only term we don't know on the right-hand side of Bayes formula is p(x | ω_i, D_i), the class-conditional density, but this involves a parameter θ that is a random variable. If we knew θ we would be done! But we don't know it.
- So now what do we do? We do know that θ has a known prior p(θ) and that we have observed the samples D_i, so we can re-write the class-conditional density as an average over θ:
  p(x | ω_i, D_i) = ∫ p(x | θ, ω_i) p(θ | ω_i, D_i) dθ.

Bayesian Parameter Estimation: Gaussian Case
- So now we must calculate p(θ | D) and then p(x | D).
- Step I: estimate θ using the a-posteriori density P(θ | D).
- The univariate case: P(µ | D), where µ is the only unknown parameter; p(x | µ) ~ N(µ, σ^2) with σ^2 known, and the prior is p(µ) ~ N(µ_0, σ_0^2) (µ_0 and σ_0 are known!).
- The reproducing density is found to be Gaussian again, p(µ | D) ~ N(µ_n, σ_n^2).
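Here is a minimal sketch of Step I for the univariate case, using the standard conjugate-update expressions for µ_n and σ_n^2 (the textbook formulas; the equations themselves did not survive the transcription, and the data below are made up):

import numpy as np

def posterior_mu(x, sigma2, mu0, sigma0_sq):
    # Posterior p(mu | D) = N(mu_n, sigma_n^2) for known-variance Gaussian data
    # with a N(mu0, sigma0_sq) prior on mu (standard conjugate-update formulas).
    n = len(x)
    mu_hat = np.mean(x)                                   # sample mean
    mu_n = (n * sigma0_sq * mu_hat + sigma2 * mu0) / (n * sigma0_sq + sigma2)
    sigma_n_sq = (sigma0_sq * sigma2) / (n * sigma0_sq + sigma2)
    return mu_n, sigma_n_sq

rng = np.random.default_rng(2)
x = rng.normal(1.5, 1.0, size=20)                         # sigma^2 = 1 is known
mu_n, sigma_n_sq = posterior_mu(x, sigma2=1.0, mu0=0.0, sigma0_sq=10.0)
print(mu_n, sigma_n_sq)   # the posterior concentrates near the sample mean as n grows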

Bayesian Parameter Estimation: Gaussian Case (continued)
- Step II: p(x | D) remains to be computed! The desired class-conditional density p(x | D) can be written as
  p(x | D) = ∫ p(x | µ) p(µ | D) dµ.
- Step III: we do this for each class and combine P(x | D_j, ω_j) with P(ω_j), along with Bayes rule, to get the posterior used for classification.

Bayesian Parameter Estimation: General Theory
- The p(x | D) computation can be applied to any situation in which the unknown density can be parameterized. The basic assumptions are:
  - the form of P(x | θ) is assumed known, but the value of θ is not;
  - our knowledge about θ is contained in a known prior density P(θ);
  - the rest of our knowledge about θ is contained in a set D of n samples x_1, x_2, ..., x_n drawn independently according to P(x).
- The basic problem is:
  - Step I: compute the posterior density P(θ | D);
  - Step II: derive P(x | D);
  - Step III: compute P(ω | x, D).
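A small sketch of Step II for the univariate Gaussian case: integrating p(x | µ) against the posterior p(µ | D) = N(µ_n, σ_n^2) again gives a Gaussian, p(x | D) = N(µ_n, σ^2 + σ_n^2) (the standard textbook result; the values of µ_n, σ_n^2, and σ^2 below are placeholders). The numeric integral and the closed form agree.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

mu_n, sigma_n_sq, sigma2 = 1.2, 0.05, 1.0

def predictive_numeric(x):
    # p(x | D) = integral of p(x | mu) * p(mu | D) over mu
    integrand = lambda mu: norm.pdf(x, mu, np.sqrt(sigma2)) * norm.pdf(mu, mu_n, np.sqrt(sigma_n_sq))
    val, _ = quad(integrand, mu_n - 10, mu_n + 10)
    return val

x = 2.0
closed_form = norm.pdf(x, mu_n, np.sqrt(sigma2 + sigma_n_sq))
print(predictive_numeric(x), closed_form)   # the two values match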

Why Don't We Always Acquire More Features? Problems of Dimensionality
- Consider the case of two classes, multivariate normal with the same covariance Σ. Using Bayes formula, the probability of error can be expressed in terms of the squared Mahalanobis distance between the class means, r^2 = (µ_1 − µ_2)^t Σ^(-1) (µ_1 − µ_2); the error decreases as r grows.
- If the features are independent, then Σ = diag(σ_1^2, ..., σ_d^2) and r^2 = Σ_{i=1..d} ((µ_i1 − µ_i2) / σ_i)^2, where µ_ij is the i-th component of the mean of class j, so every feature makes a non-negative contribution.
- The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
- It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance: we have the wrong model!
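As an illustration (assuming equal priors and the standard result that the Bayes error for this two-Gaussian case is 1 − Φ(r/2); all means and variances below are made up), the theoretical error keeps shrinking as informative independent features are added:

import numpy as np
from scipy.stats import norm

# With independent features, r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2,
# and the equal-prior Bayes error is 1 - Phi(r/2) = norm.sf(r / 2).
mu1 = np.array([0.0, 0.0, 0.0, 0.0])
mu2 = np.array([1.0, 0.8, 0.5, 0.1])
sigma = np.array([1.0, 1.0, 1.0, 1.0])

for d in range(1, len(mu1) + 1):
    r = np.sqrt(np.sum(((mu1[:d] - mu2[:d]) / sigma[:d]) ** 2))
    bayes_error = norm.sf(r / 2)
    print(d, round(bayes_error, 4))   # error decreases as features are added

In practice the estimated parameters, not the true ones, are used, which is one reason extra features can hurt when the model is wrong or the sample is small.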