Pattern Recognition. Parameter Estimation of Probability Density Functions


Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x ∈ F to one of c classes. The classifier is a function from the feature space onto the set of classes, α : F → {ω1, ..., ωc}. (α(x) is the class assigned to x.) The parts of the feature space corresponding to classes ω1, ..., ωc are denoted by R1, ..., Rc.

Classification Problem Feature vectors x that we aim to classify belong to the feature space F. The task is to assign an arbitrary feature vector x ∈ F to one of the c classes. We know 1. the prior probabilities P(ω1), ..., P(ωc) of the classes and 2. the class conditional probability density functions p(x|ω1), ..., p(x|ωc).

The Bayes Classification Rule (for two classes, M = 2) Given x, classify it according to the rule: if P(ω1|x) > P(ω2|x), decide ω1; if P(ω2|x) > P(ω1|x), decide ω2. Equivalently, using Bayes' rule, compare p(x|ω1)P(ω1) with p(x|ω2)P(ω2) and decide in favour of the class with the larger value. For equiprobable classes the test reduces to comparing p(x|ω1) with p(x|ω2).
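As a concrete illustration (added here, not part of the original slides), the two-class rule can be written in a few lines of Python; the 1-D Gaussian class-conditional densities and the priors below are assumed values, chosen only to make the example runnable.

import numpy as np
from scipy.stats import norm

def bayes_classify(x, priors, class_pdfs):
    # priors: list of P(omega_i); class_pdfs: list of callables returning p(x | omega_i)
    scores = [p * pdf(x) for p, pdf in zip(priors, class_pdfs)]
    return int(np.argmax(scores)) + 1   # class index starting at 1

# Hypothetical 1-D example: two Gaussian class-conditionals with unit variance
pdf1 = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
pdf2 = lambda x: norm.pdf(x, loc=2.0, scale=1.0)
print(bayes_classify(0.4, priors=[0.5, 0.5], class_pdfs=[pdf1, pdf2]))  # -> 1

With equal priors the decision depends only on the class-conditional densities, exactly as the slide states.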

Probability Density Function A probability density function (pdf), or density, of a continuous random variable is a function that describes the relative likelihood for this random variable to take on a given value.

Probability Densities

Cumulative Distribution Function The cumulative distribution function (cdf), or just distribution function, describes the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
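A short sketch (an added illustration, not from the slides) showing how the pdf and the cdf of a standard normal distribution are evaluated with scipy:

from scipy.stats import norm

x = 1.0
print(norm.pdf(x, loc=0.0, scale=1.0))  # density at x, about 0.2420
print(norm.cdf(x, loc=0.0, scale=1.0))  # P(X <= x), about 0.8413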

The Gaussian Distribution

Gaussian Mean and Variance

Estimating Class Conditional PDFs and A-Priori Probabilities In practice, the PDFs and a-priori probabilities are not known; they are estimated from training samples (supervised learning). We assume that the training samples are occurrences of independent random variables (that is, they were measured from different objects), and that the samples of class ωi are distributed according to p(x|ωi). In other words, the samples are independent and identically distributed (i.i.d.).

Training Data Types Mixture sampling: a set of objects is randomly selected, their feature vectors are computed, and the objects are then hand-classified to the most appropriate classes. Separate sampling: the training data for each class is collected separately.

Estimating A-Priori Probabilities For classifier training, the difference between the two sampling techniques is that with mixture sampling we can deduce the a-priori probabilities P(ω1), ..., P(ωc) directly from the class frequencies in the training set: P(ωi) ≈ ni / n, where ni is the number of training samples from class ωi and n is the total number of training samples.
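A minimal sketch of this estimate (added for illustration), using a hypothetical label vector obtained by mixture sampling:

import numpy as np

labels = np.array([1, 1, 2, 3, 2, 1, 3, 3, 2, 1])   # hypothetical class labels
classes, counts = np.unique(labels, return_counts=True)
priors = counts / labels.size                        # P(omega_i) ~ n_i / n
print(dict(zip(classes, priors)))                    # e.g. {1: 0.4, 2: 0.3, 3: 0.3}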

Parametric Estimation of PDFs If we assume that p(x|ωi) belongs to some family of parametric distributions, estimating the class conditional pdf p(x|ωi) reduces to estimating the parameter vector θi. For example, we can assume that p(x|ωi) is a normal density with unknown parameters θi = (µi, Σi).

Most Common PDF Normal distribution: Related to real-valued quantities that grow linearly (e.g. errors, offsets)

Normal Distribution

Log-normal distribution Related to positive real-valued quantities that grow exponentially

Log-Normal Distribution

Uniform distribution (discrete) Related to real-valued quantities that are assumed to be uniformly distributed over a (possibly unknown) region

Binomial Distribution The binomial distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p.

Maximum Likelihood Estimation The aim is to estimate the value of the parameter vector θ based on the training samples D = {x1, ..., xn}. We assume that the training samples are occurrences of i.i.d. random variables distributed according to the density p(x|θ).

Maximum Likelihood Estimation The maximum likelihood estimate, or ML estimate, of θ maximizes the probability of the training data with respect to θ. Due to the i.i.d. assumption, the probability of D is p(D|θ) = p(x1|θ) · p(x2|θ) · ... · p(xn|θ), and the ML estimate is the θ that maximizes p(D|θ).
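For concreteness, a minimal sketch (added, not from the slides) that evaluates the log-likelihood, which is usually maximized instead of the raw product because a sum of logarithms is numerically safer than a product of many small densities. The 1-D Gaussian model with known unit variance and the sample values are assumptions made only for this illustration.

import numpy as np
from scipy.stats import norm

D = np.array([1.2, 0.7, 1.9, 1.4])      # hypothetical i.i.d. training samples

def log_likelihood(theta, data):
    # log p(D|theta) = sum_k log p(x_k|theta)
    return np.sum(norm.logpdf(data, loc=theta, scale=1.0))

thetas = np.linspace(0.0, 3.0, 301)
lls = [log_likelihood(t, D) for t in thetas]
print(thetas[int(np.argmax(lls))])       # about 1.3, the sample mean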

Finding the ML Estimate Derive the gradient of the likelihood function or of the log-likelihood function. Solve for the zeros of the gradient, and also search for all the other critical points of the likelihood function (e.g. points where at least one of the partial derivatives of the function is not defined are critical, in addition to the points where the gradient is zero).

Finding the ML Estimate Evaluate the likelihood function at the critical points and select the critical point with the highest likelihood value as the estimate. Warning: the ML estimate does not necessarily exist for all distributions; for example, the likelihood function could grow without limit.
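When no closed-form solution exists, the maximization is typically done numerically. Below is a hedged sketch (added for illustration) that applies scipy's general-purpose optimizer to the negative log-likelihood; the data and the 1-D Gaussian model are assumptions for the example.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

D = np.array([1.2, 0.7, 1.9, 1.4])      # hypothetical samples

def neg_log_likelihood(theta, data):
    mu, log_sigma = theta               # parameterize sigma on the log scale to keep it positive
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]), args=(D,))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                # approximately the sample mean and the ML standard deviation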

ML Estimates for the Normal Density Assuming the covariance Σ is fixed and known, only the mean µ needs to be estimated, and the log-likelihood of a single sample is ln p(xk|µ) = −(1/2) ln((2π)^d |Σ|) − (1/2)(xk − µ)^T Σ^(−1) (xk − µ).

ML Estimates for the Normal Density The gradient with respect to µ is ∇µ ln p(xk|µ) = Σ^(−1)(xk − µ). Setting the gradient of the log-likelihood of the whole sample set to zero gives the ML estimate of the mean: µ̂ = (x1 + x2 + ... + xn)/n, i.e. the sample mean.

ML Estimates for the Normal Density When the covariance is also unknown, its ML estimate is Σ̂ = (1/n) · sum over k of (xk − µ̂)(xk − µ̂)^T, i.e. the sample covariance with divisor n (not n − 1).
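A minimal NumPy sketch of these two estimates (added for illustration), using a small hypothetical sample matrix with one row per sample:

import numpy as np

# Hypothetical training samples for one class (n x d)
X = np.array([[4.9, 3.3, 1.4, 0.2],
              [5.0, 3.4, 1.5, 0.3],
              [4.7, 3.2, 1.4, 0.2],
              [4.8, 3.3, 1.5, 0.2]])

mu_hat = X.mean(axis=0)                            # ML estimate of the mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / X.shape[0]     # ML covariance (divides by n, not n-1)
# np.cov(X, rowvar=False, bias=True) gives the same matrix
print(mu_hat)
print(Sigma_hat)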

Numeric Example

Example Assume we have three classes of objects. Each feature vector has 4 elements, <f1, f2, f3, f4>. The a-priori probabilities are equal (1/3 each), and the pdfs are normal distributions.

Example Sample data for the first class:

Example For the first class: µ = <4.8600, 3.3100, 1.4500, 0.2200>, Σ = [ ]

Example For class two: µ = <6.1000, 2.8700, 4.3700, 1.3800>, Σ = [ ]

Example For class three: µ = <6.5700, 2.9400, 5.7700, 2.0400>, Σ = [ ]

Classifying a Sample Input Input: x = [5.9 4.2 3.0 1.5]^T. Finding the value of p(x|ωi)P(ωi) for each class i = 1, 2, 3, the largest value is obtained for class two, so the input is classified as a member of class 2. Repeat this example with some more sample inputs.
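The covariance matrices of this example did not survive the transcription, so the sketch below (added for illustration) assumes identity covariance matrices purely to show the procedure; under that assumption the computation still yields class 2 for this input.

import numpy as np
from scipy.stats import multivariate_normal

means = [np.array([4.86, 3.31, 1.45, 0.22]),
         np.array([6.10, 2.87, 4.37, 1.38]),
         np.array([6.57, 2.94, 5.77, 2.04])]
covs = [np.eye(4)] * 3          # assumption: identity covariances (the slide's matrices are not shown)
priors = [1/3, 1/3, 1/3]

x = np.array([5.9, 4.2, 3.0, 1.5])
scores = [p * multivariate_normal.pdf(x, mean=m, cov=c)
          for p, m, c in zip(priors, means, covs)]
print(np.argmax(scores) + 1)    # -> 2, matching the decision stated on the slide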

Questions?