Probability Models for Bayesian Recognition


Intelligent Systems: Reasoning and Recognition
James L. Crowley
ENSIMAG / MoSIG, Second Semester
Lesson 9: Probability Models for Bayesian Recognition

Outline:
Notation
Supervised Learning for Bayesian Classification
Ratio of Histograms (Recall)
Variable Sized Histogram Cells
Kernel Density Estimators
K Nearest Neighbors
Probability Density Functions
The Normal Density Function
  Univariate Normal
  Multivariate Normal
  The Mahalanobis Distance
Expected Values and Moments
  Biased and Unbiased Variance

Bibliographical sources:
"Pattern Recognition and Machine Learning", C. M. Bishop, Springer Verlag, 2006.
"Pattern Classification and Scene Analysis", R. O. Duda and P. E. Hart, Wiley, 1973.

Notation

x        A variable.
X        A random variable (an unpredictable value); an observation.
N        The number of possible values for X.
x⃗        A vector of D variables.
X⃗        A vector of D random variables.
D        The number of dimensions of the vector x⃗ or X⃗.
C_k      The class k.
k        Class index.
K        Total number of classes.
ω_k      The statement (assertion) that E ∈ C_k.
P(ω_k) = P(E ∈ C_k)   The probability that the observation E is a member of the class k.
M_k      The number of examples for the class k.
M        The total number of examples: M = Σ_{k=1}^{K} M_k.
{X⃗_m}    A set of training samples.
{y⃗_m}    A set of indicator vectors for the training samples in {X⃗_m}.
p(X)     Probability density function for a continuous value X.
p(X⃗)     Probability density function for a continuous vector X⃗.
p(X⃗ | ω_k)   Probability density for X⃗ given the class k, with ω_k = X⃗ ∈ C_k.
Q        The number of cells in h(X⃗): Q = N^D.
S        The sum over V adjacent histogram cells: S = Σ_{X⃗ ∈ V} h(X⃗).
V        The "volume" (number of cells) of a region of the histogram.

Supervised Learning for Bayesian Classification

Our problem is to build a box that maps a vector of features X⃗ from an observation into a class C_k from a set of K possible classes. Let ω_k be the proposition that the observation belongs to class k:

   ω_k = X⃗ ∈ C_k

To minimize the number of mistakes, we choose the class that maximizes the probability of ω_k given the observation:

   ω̂_k = arg max_{ω_k} P(ω_k | X⃗)

Our primary tool for this is Bayes' rule:

   P(ω_k | X⃗) = P(X⃗ | ω_k) P(ω_k) / P(X⃗) = P(X⃗ | ω_k) P(ω_k) / Σ_{k=1}^{K} P(X⃗ | ω_k) P(ω_k)

We have already seen how to use histograms to represent P(X⃗ | ω_k) and P(X⃗), and to use P(ω_k) = M_k / M.

To apply Bayes' rule, we require a representation for the probabilities P(X⃗ | ω_k), P(X⃗) and P(ω_k). In the next few lectures we will look at techniques that learn to recognize classes using Bayes' rule. These methods are based on an explicit probabilistic model of the training data. We will start with some simple, non-parametric models. We will then look at models using the Gaussian (Normal) density function. This will lead to the use of weighted sums of Gaussians (Gaussian Mixture Models).

Today we will look at three non-parametric representations for P(X⃗ | ω_k) and P(X⃗):
1) Kernel Density Estimators
2) K-Nearest Neighbors
3) Probability density functions (PDFs)
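As an illustration of the decision rule (not from the original notes), here is a minimal sketch of the arg-max in Bayes' rule, assuming the class-conditional values P(X⃗ | ω_k) and the priors P(ω_k) are already available as numpy arrays; the numbers in the example are hypothetical.

```python
import numpy as np

def bayes_classify(likelihoods, priors):
    """Return the class index maximizing P(omega_k | X).

    likelihoods: shape (K,), the values P(X | omega_k) for the observed X.
    priors:      shape (K,), the prior probabilities P(omega_k).
    """
    joint = likelihoods * priors          # P(X | omega_k) * P(omega_k)
    posterior = joint / joint.sum()       # divide by P(X) = sum_k P(X | omega_k) P(omega_k)
    return int(np.argmax(posterior)), posterior

# Hypothetical example with K = 2 classes
k_hat, posterior = bayes_classify(np.array([0.02, 0.05]), np.array([0.7, 0.3]))
print(k_hat, posterior)
```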

Ratio of Histograms (Recall)

Consider an example of K classes of objects, where objects are described by a feature X with N possible values in [1, N]. Assume that we have a "training set" of M samples {X_m} along with indicator variables {y_m}, where the indicator variable gives the class, k, of each training sample.

For each class k, we allocate a histogram, h_k(), with N cells and count the values in the training set:

   ∀m = 1, ..., M:
      h(X_m) := h(X_m) + 1
      IF y_m = k THEN h_k(X_m) := h_k(X_m) + 1;  M_k := M_k + 1

Then

   p(X = x) = (1/M) h(x)
   p(X = x | X ∈ C_k) = p(x | ω_k) = (1/M_k) h_k(x)

and P(ω_k) can be estimated from the relative size of the training set:

   P(X ∈ C_k) = P(ω_k) = M_k / M

giving:

   P(ω_k | X) = p(X | ω_k) P(ω_k) / p(X) = ((1/M_k) h_k(X) · (M_k/M)) / ((1/M) h(X)) = h_k(X) / h(X) = T_k(X)

This can be represented by a lookup table. To illustrate, consider an example with two classes (K = 2) where X can take on 8 values (N = 8, D = 1).

Recall that the number of cells in the histogram is Q = N^D. Having M >> Q is NECESSARY but NOT sufficient. NOT having M >> Q is a guarantee of INSUFFICIENT TRAINING DATA.

So what can you do if M is not >> Q? Adapt the size of the cells to the data.
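A minimal sketch (not from the notes) of the ratio-of-histograms lookup table T_k(X) = h_k(X) / h(X), assuming integer-coded feature values in {0, ..., N−1} and integer class labels in {0, ..., K−1}:

```python
import numpy as np

def histogram_ratio_table(X, y, K, N):
    """Build the K x N lookup table P(omega_k | X = x) = h_k(x) / h(x)."""
    h = np.zeros(N)            # global histogram h(x)
    h_k = np.zeros((K, N))     # per-class histograms h_k(x)
    for x, k in zip(X, y):
        h[x] += 1
        h_k[k, x] += 1
    with np.errstate(divide="ignore", invalid="ignore"):
        T = np.where(h > 0, h_k / h, 0.0)   # 0 where a cell received no training data
    return T

# Hypothetical example: K = 2 classes, N = 8 feature values
rng = np.random.default_rng(0)
X = rng.integers(0, 8, size=200)
y = ((X > 3) ^ (rng.random(200) < 0.1)).astype(int)   # noisy labels
print(histogram_ratio_table(X, y, K=2, N=8))
```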

Variable Sized Histogram Cells

Suppose that we have a D-dimensional feature vector X⃗ with each feature quantized to N possible values, and suppose that we represent p(X⃗) as a D-dimensional histogram h(X⃗). Let us fill the histogram with M training samples {X⃗_m}.

Let us define the volume of each cell as 1. The volume of any block of V cells is then V, and the volume of the entire space is Q = N^D.

If the quantity of training data is too small, i.e. if M < 8Q, then we can combine adjacent cells so as to amass enough data for a reasonable estimate. Suppose we merge V adjacent cells such that we obtain a combined sum of S:

   S = Σ_{X⃗ ∈ V} h(X⃗)

The volume of the combined cells is V. To compute the probability, we replace h(X⃗) with S/V. The probability p(X⃗) for X⃗ ∈ V is:

   p(X⃗ ∈ V) = S / (M·V)

This is typically written as:

   p(X⃗) = S / (M·V)

We can use this equation to develop two alternative non-parametric methods:

   Fix V and determine S  =>  Kernel density estimator.
   Fix S and determine V  =>  K nearest neighbors.

(Note that the symbol K is often used for the sum of the cells. This conflicts with the use of K for the number of classes. Thus we will use the symbol S for the sum of adjacent cells.)
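A rough sketch (not from the notes) of the estimate p(X) = S / (M·V) for a 1-D histogram, where V adjacent cells around the query cell are merged; the choice of V is a free parameter of this illustration.

```python
import numpy as np

def merged_cell_density(h, x_cell, V):
    """Estimate p(X) = S / (M * V) by summing V adjacent cells of the histogram h
    centered on the query cell x_cell."""
    M = h.sum()                        # total number of training samples
    lo = max(0, x_cell - V // 2)
    hi = min(len(h), lo + V)
    S = h[lo:hi].sum()                 # sum of the merged cells
    return S / (M * (hi - lo))         # use the actual merged volume near the borders

# Hypothetical 8-cell histogram filled with M = 40 samples
h = np.array([1, 2, 6, 10, 9, 7, 4, 1])
print(merged_cell_density(h, x_cell=3, V=3))
```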

Kernel Density Estimators

For a kernel density estimator, we represent each training sample with a kernel function k(u⃗). Popular kernel functions include:

   a hypercube of side w,
   a triangular function with base 2w,
   a sphere of radius w,
   a Gaussian of standard deviation σ.

We can define the kernel function for the hypercube as:

   k(u⃗) = 1 if |u_d| ≤ 1/2 for all d = 1, ..., D
   k(u⃗) = 0 otherwise

This is called a Parzen window. Subtracting a point z⃗ centers the Parzen window at that point. Dividing by w scales the Parzen window to a hypercube of side w:

   k((X⃗ − z⃗)/w) is a hypercube of volume w^D centered at z⃗.

The training samples define overlapping Parzen windows. For a feature value X⃗, the sum of the Parzen windows at X⃗ is:

   S = Σ_{m=1}^{M} k((X⃗ − X⃗_m)/w)

The volume of the Parzen window is V = w^D. Thus the probability is:

   p(X⃗) = S/(M·V) = (1/(M·w^D)) Σ_{m=1}^{M} k((X⃗ − X⃗_m)/w)

A Parzen window is discontinuous at its boundaries, creating boundary effects. We can soften this by using a triangular function evaluated within the window:

   k(u⃗) = 1 − |u⃗| if |u⃗| ≤ 1
   k(u⃗) = 0 otherwise

Even better is to use a Gaussian kernel with standard deviation σ:

   k(u⃗) = (1/(2πσ²)^(D/2)) e^(−u⃗ᵀu⃗/(2σ²))

We can note that the volume (the integral) of e^(−u⃗ᵀu⃗/(2σ²)) is V = (2πσ²)^(D/2). In this case:

   p(X⃗) = S/(M·V) = (1/M) Σ_{m=1}^{M} k(X⃗ − X⃗_m)

This corresponds to placing a Gaussian at each training sample and summing the tails at X⃗. The probability for a value X⃗ is the sum of the Gaussians.

In fact, we can choose any function k(u⃗) as a kernel, provided that

   k(u⃗) ≥ 0  and  ∫ k(u⃗) du⃗ = 1
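A minimal sketch (not from the notes) of a Gaussian kernel density estimate in D dimensions, assuming a single standard deviation sigma shared by all dimensions:

```python
import numpy as np

def gaussian_kde(X_train, X_query, sigma=1.0):
    """Kernel density estimate p(X) = (1/M) * sum_m k(X - X_m) with a Gaussian kernel.

    X_train: training samples, shape (M, D)
    X_query: query points, shape (Q, D)
    """
    M, D = X_train.shape
    # Squared distance between every query point and every training sample: shape (Q, M)
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    V = (2.0 * np.pi * sigma**2) ** (D / 2.0)          # integral of the unnormalized kernel
    return np.exp(-d2 / (2.0 * sigma**2)).sum(axis=1) / (M * V)

# Hypothetical example: 500 samples from a 2-D Gaussian
rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 2))
print(gaussian_kde(X_train, np.array([[0.0, 0.0], [3.0, 3.0]]), sigma=0.5))
```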

K Nearest Neighbors

For K nearest neighbors, we hold S constant and vary V. (We use the symbol S for the number of neighbors, rather than K, to avoid confusion with the number of classes.)

From the training samples {X⃗_m} we construct a tree structure (such as a KD-tree) that allows us to easily find the S nearest neighbors of any point.

To compute p(X⃗) we need the volume of the sphere in D dimensions that encloses the S nearest neighbors. Suppose the set of S nearest neighbors is {X⃗_s}. This is the D-dimensional sphere of radius

   R = max_{X⃗_s ∈ {X⃗_s}} ‖X⃗ − X⃗_s‖

whose volume is

   V = (π^(D/2) / Γ(D/2 + 1)) · R^D

where Γ(D) = (D−1)!. For even D this is easy to evaluate. For odd D, use a table to determine Γ(D/2 + 1).

Then, as before:

   p(X⃗) = S / (M·V)
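A minimal sketch (not from the notes) of the S-nearest-neighbor density estimate p(X⃗) = S / (M·V), using scipy's KD-tree to find the neighbors and the D-sphere volume formula above:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X_train, X_query, S=10):
    """Estimate p(X) = S / (M * V), where V is the volume of the D-sphere
    that encloses the S nearest training samples of each query point."""
    M, D = X_train.shape
    tree = cKDTree(X_train)
    dist, _ = tree.query(X_query, k=S)        # distances to the S nearest neighbors
    R = dist[:, -1]                           # radius = distance to the S-th neighbor
    V = (np.pi ** (D / 2.0) / gamma(D / 2.0 + 1.0)) * R ** D
    return S / (M * V)

# Hypothetical example: 1000 samples from a 2-D Gaussian
rng = np.random.default_rng(2)
X_train = rng.normal(size=(1000, 2))
print(knn_density(X_train, np.array([[0.0, 0.0], [3.0, 3.0]]), S=20))
```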

Probability Density Functions

A probability density function p(X) is a function of a continuous variable X such that:

   X is a continuous, real-valued random variable with values in [−∞, ∞]
   ∫_{−∞}^{∞} p(x) dx = 1

Note that p(X) is NOT a number but a continuous function. A probability density function defines the relative likelihood of a specific value of X. Because X is continuous, the probability of any single specific value of X is infinitesimally small. To obtain a probability, we must integrate over some range of X.

In the case of D = 1, the probability that X lies within the interval [A, B] is

   P(X ∈ [A, B]) = ∫_A^B p(x) dx

This integral gives a number that can be used as a probability. Note that we use upper case P(X ∈ [A, B]) to represent a probability value, and lower case p(x) to represent a probability density function.

Classification using Bayes' rule can use probability density functions:

   P(ω_k | X) = p(X | ω_k) P(ω_k) / p(X) = p(X | ω_k) P(ω_k) / Σ_{k=1}^{K} p(X | ω_k) P(ω_k)

Note that the ratio p(X | ω_k) / p(X) IS a number, provided that

   p(X) = Σ_{k=1}^{K} p(X | ω_k) P(ω_k)

Probability density functions are easily generalized to vectors of random variables. Let X⃗ ∈ R^D be a vector of random variables. A probability density function p(X⃗) is a function of a vector of continuous variables such that:

   X⃗ is a vector of D real-valued random variables with values in [−∞, ∞]
   ∫ p(x⃗) dx⃗ = 1
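A small check (not from the notes) that integrating a density over an interval yields a probability, here for the unit Gaussian density using scipy:

```python
from scipy.integrate import quad
from scipy.stats import norm

# P(X in [A, B]) = integral of p(x) dx over [A, B], with p(x) = N(x; 0, 1)
A, B = -1.0, 1.0
P, _err = quad(norm.pdf, A, B)           # numerical integration of the density
print(P, norm.cdf(B) - norm.cdf(A))      # same value from the cumulative distribution
```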

The Normal Density Function

Univariate Normal

Whenever a random variable is determined by a sequence of independent random events, the outcome is a Normal or Gaussian density function. This is known as the Central Limit Theorem. The essence of the derivation is that repeated random events are modeled as repeated convolutions of density functions, and for any finite density function the result tends asymptotically to a Gaussian (or normal) function.

   p(X) = N(X; µ, σ) = (1/(σ√(2π))) e^(−(X−µ)²/(2σ²))

[Figure: the univariate normal density N(x; µ, σ), with the horizontal axis marked at µ−σ, µ and µ+σ.]

The parameters of N(X; µ, σ) are the first and second moments, µ and σ², of the function. According to the Central Limit Theorem, for any real density p(X), the repeated convolution p(X) * p(X) * ··· * p(X) tends to N(X; µ, σ).

Multivariate Normal

For a vector of D features, X⃗, the Normal density has the form:

   p(X⃗) = N(X⃗; µ⃗, Σ) = (1/((2π)^(D/2) det(Σ)^(1/2))) e^(−½(X⃗−µ⃗)ᵀΣ⁻¹(X⃗−µ⃗))

There are 3 parts to N(X⃗; µ⃗, Σ): the normalization factor (2π)^(D/2) det(Σ)^(1/2), the exponential e, and the Mahalanobis distance d² = (X⃗−µ⃗)ᵀΣ⁻¹(X⃗−µ⃗).
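A minimal sketch (not from the notes) that evaluates the multivariate normal density directly from the formula above and checks it against scipy's implementation; the mean and covariance values are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def normal_density(X, mu, Sigma):
    """N(X; mu, Sigma) = exp(-0.5 * d2) / ((2*pi)^(D/2) * det(Sigma)^(1/2)),
    where d2 is the squared Mahalanobis distance."""
    D = len(mu)
    diff = X - mu
    d2 = diff @ np.linalg.inv(Sigma) @ diff
    norm_factor = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d2) / norm_factor

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
x = np.array([1.0, -1.0])
print(normal_density(x, mu, Sigma), multivariate_normal(mu, Sigma).pdf(x))
```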

1) e = 2.718281828... is Euler's number: ∫ eˣ dx = eˣ. It is used to simplify the algebra.

2) The term (2π)^(D/2) det(Σ)^(1/2) is a normalization factor:

   (2π)^(D/2) det(Σ)^(1/2) = ∫∫···∫ e^(−½(X⃗−µ⃗)ᵀΣ⁻¹(X⃗−µ⃗)) dx₁ dx₂ ... dx_D

The determinant, det(Σ), is an operation that gives the volume of Σ.

   For D = 2:  det([a b; c d]) = a·d − b·c

For D > 2 this continues recursively. For example, for D = 3:

   det([a b c; d e f; g h i]) = a·det([e f; h i]) − b·det([d f; g i]) + c·det([d e; g h])
                              = a(e·i − f·h) − b(d·i − f·g) + c(d·h − e·g)

3) The Mahalanobis distance. This is where the action is.

The Mahalanobis Distance

The exponent of N(X⃗; µ⃗, Σ) is known as the "Mahalanobis distance", named for a famous Indian statistician:

   d² = (X⃗−µ⃗)ᵀΣ⁻¹(X⃗−µ⃗)

This is a distance normalized by the covariance. It is positive and second order (quadratic). The covariance provides a distance metric that can be used when the features of X⃗ have different units (for example, height (m) and weight (kg)). In this case, the features are compared to the standard deviations of the population.
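A minimal sketch (not from the notes) of the Mahalanobis distance for features with different units; the height/weight population statistics below are hypothetical.

```python
import numpy as np

def mahalanobis2(x, mu, Sigma):
    """Squared Mahalanobis distance d^2 = (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.inv(Sigma) @ diff)

# Hypothetical population statistics for height (m) and weight (kg)
mu = np.array([1.70, 70.0])
Sigma = np.array([[0.01, 0.5],      # var(height), cov(height, weight)
                  [0.5, 100.0]])    # cov(weight, height), var(weight)
x = np.array([1.80, 85.0])
print(mahalanobis2(x, mu, Sigma))   # unit-free: deviations measured against the covariance
```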

Expected Values and Moments

The average value is the first moment of the samples. A training set of M samples {X⃗_m} can be used to calculate moments.

For samples of a numerical feature value {X_m}, the "expected value" E{X} is defined as the average, or mean:

   E{X} = (1/M) Σ_{m=1}^{M} X_m

µ_x = E{X} is the first moment (or center of gravity) of the values of {X_m}. It is also the first moment (or center of gravity) of the resulting pdf:

   E{X} = (1/M) Σ_{m=1}^{M} X_m = ∫_{−∞}^{∞} p(x)·x dx

This is also true for histograms. The mass of the histogram is

   M = Σ_{x=1}^{N} h(x)

which is also the number of samples used to compute h(x). The expected value of {X_m} is the 1st moment of h(x):

   µ = E{X_m} = (1/M) Σ_{m=1}^{M} X_m = (1/M) Σ_{x=1}^{N} h(x)·x

For D dimensions, the center of gravity is a vector:

   µ⃗ = E{X⃗} = (1/M) Σ_{m=1}^{M} X⃗_m = (E{X_1}, E{X_2}, ..., E{X_D})ᵀ = (µ_1, µ_2, ..., µ_D)ᵀ

The vector µ⃗ is the vector of averages of the components X_d of X⃗. It is also the center of gravity (first moment) of the normal density function:

   µ_d = E{X_{d,m}} = (1/M) Σ_{m=1}^{M} X_{d,m} = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} p(x_1, x_2, ..., x_D)·x_d dx_1 dx_2 ... dx_D
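A small check (not from the notes) that the sample mean and the first moment of the histogram give the same value for integer-coded features:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.integers(1, 9, size=500)                 # feature values in {1, ..., 8}, so N = 8
M = len(X)

mu_samples = X.sum() / M                         # (1/M) * sum_m X_m

values = np.arange(1, 9)
h = np.array([(X == x).sum() for x in values])   # histogram h(x)
mu_hist = (h * values).sum() / M                 # (1/M) * sum_x h(x) * x

print(mu_samples, mu_hist)                       # identical
```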

   µ⃗ = E{X⃗} = (E{X_1}, E{X_2}, ..., E{X_D})ᵀ = (µ_1, µ_2, ..., µ_D)ᵀ

The variance σ² is the square of the expected deviation from the average:

   σ² = E{(X − µ)²} = E{X²} − µ² = E{X²} − E{X}²

This is also the second moment of the pdf:

   σ² = E{(X − µ)²} = (1/M) Σ_{m=1}^{M} (X_m − µ)² = ∫_{−∞}^{∞} p(x)·(x − µ)² dx

and the second moment of a histogram for discrete features:

   σ² = E{(X − E{X})²} = (1/M) Σ_{m=1}^{M} (X_m − µ)² = (1/M) Σ_{x=1}^{N} h(x)·(x − µ)²

For D dimensions, the second moment is a covariance matrix composed of D² terms:

   σ_ij = E{(X_i − µ_i)(X_j − µ_j)} = (1/M) Σ_{m=1}^{M} (X_{i,m} − µ_i)(X_{j,m} − µ_j)

This is often written

   Σ = E{(X⃗ − E{X⃗})(X⃗ − E{X⃗})ᵀ}

and gives

   Σ = [ σ_11  σ_12  ...  σ_1D ]
       [ σ_21  σ_22  ...  σ_2D ]
       [ ...   ...   ...  ...  ]
       [ σ_D1  σ_D2  ...  σ_DD ]

This provides the parameters for

   p(X⃗) = N(X⃗; µ⃗, Σ) = (1/((2π)^(D/2) det(Σ)^(1/2))) e^(−½(X⃗−µ⃗)ᵀΣ⁻¹(X⃗−µ⃗))
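A minimal sketch (not from the notes) of estimating the mean vector and covariance matrix from training samples and using them as the parameters of N(X⃗; µ⃗, Σ); the simulated data are hypothetical.

```python
import numpy as np

def fit_gaussian(X_train):
    """mu_d = (1/M) sum_m X_dm ;  sigma_ij = (1/M) sum_m (X_im - mu_i)(X_jm - mu_j)."""
    M, D = X_train.shape
    mu = X_train.mean(axis=0)
    diff = X_train - mu
    Sigma = (diff.T @ diff) / M        # second central moment (the biased estimate)
    return mu, Sigma

rng = np.random.default_rng(4)
X_train = rng.multivariate_normal([1.70, 70.0], [[0.01, 0.5], [0.5, 100.0]], size=1000)
mu, Sigma = fit_gaussian(X_train)
print(mu)
print(Sigma)
```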

The result can be visualized by looking at the equi-probability contours.

[Figure: equi-probability contours of the normal density, drawn as nested ellipses enclosing decreasing fractions of the mass (99%, 95%, 90%, 75%, 50%, ...).]

If x_i and x_j are statistically independent, then σ_ij = 0.
For positive values of σ_ij, x_i and x_j vary together.
For negative values of σ_ij, x_i and x_j vary in opposite directions.

For example, consider the features x_1 = height (meters) and x_2 = weight (kg). In most people, height and weight vary together, and so σ_12 would be positive.