Intelligent Systems: Reasoning and Recognition

James L. Crowley
ENSIMAG / MoSIG, Second Semester

Lesson 5: Expected Values and Probability Density Functions

Notation
Bayesian Classification (Reminder)
Expected Values and Moments
    The average value is the first moment of the samples
Probability Density Functions
Expected Values for PDFs
The Normal (Gaussian) Density Function
Multivariate Normal Density Functions

Bibliographical Sources:
Pattern Recognition and Machine Learning, C. M. Bishop, Springer Verlag, 2006.
Pattern Classification and Scene Analysis, R. O. Duda and P. E. Hart, Wiley, 1973.
Classical Bayesian Classifiers, Session 5

Notation

x           A variable.
X           A random variable (unpredictable value).
N           The number of possible values for X (can be infinite).
x⃗           A vector of D variables.
X⃗           A vector of D random variables.
D           The number of dimensions for the vector x⃗ or X⃗.
E           An observation; an event.
C_k         The class k.
k           Class index.
K           Total number of classes.
ω_k         The statement (assertion) that E ∈ T_k.
p(ω_k) = p(E ∈ T_k)   The probability that the observation E is a member of the class k.
            Note that p(ω_k) is lower case.
M_k         The number of examples for the class k (think M = Mass).
M           The total number of examples:  M = Σ_{k=1}^{K} M_k
{X_m^k}     A set of M_k examples for the class k.  {X_m} = ∪_{k=1,K} {X_m^k}
P(X)        Probability density function for X.
P(X⃗)        Probability density function for X⃗.
P(X⃗ | ω_k)  Probability density for X⃗ given the class k, where ω_k = E ∈ C_k.
h(n)        A histogram of random values for the feature n.
h_k(n)      A histogram of random values for the feature n for the class k.
            h(x) = Σ_{k=1}^{K} h_k(x)
Q           The number of cells in h(n):  Q = N^D

E{X} = (1/M) Σ_{m=1}^{M} X_m

P(X⃗ | ω_k) = (1 / ((2π)^{D/2} det(C_k)^{1/2})) · exp(−½ (X⃗ − µ⃗_k)^T C_k^{−1} (X⃗ − µ⃗_k))
Bayesian Classification (Reminder)

Our problem is to build a box that maps a set of features X⃗ from an observation E into a class C_k from a set of K possible classes.

   (x₁, x₂, ..., x_D)  →  Class(x₁, x₂, ..., x_D) = ω̂_k

Let ω_k be the proposition that the event belongs to class k:

   ω_k = E ∈ T_k

In order to minimize the number of mistakes, we maximize the probability that the event E belongs to the class k:

   ω̂_k = arg max_{k} Pr(ω_k | X⃗)

We will rely on two tools for this:

Bayes' rule:

   p(ω_k | X⃗) = P(X⃗ | ω_k) p(ω_k) / P(X⃗)

Normal density functions:

   P(X⃗ | ω_k) = (1 / ((2π)^{D/2} det(C_k)^{1/2})) · exp(−½ (X⃗ − µ⃗_k)^T C_k^{−1} (X⃗ − µ⃗_k))

Today we concentrate on normal density functions.
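As a concrete sketch of the arg-max decision rule, the following applies Bayes' rule for K = 2 classes. The likelihood and prior values are invented for illustration, not taken from the lecture:

```python
# Hypothetical numbers for a single observed feature vector X:
# class-conditional likelihoods p(X | omega_k) and priors p(omega_k).
likelihood = {1: 0.40, 2: 0.10}   # p(X | omega_k), invented for illustration
prior = {1: 0.30, 2: 0.70}        # p(omega_k), invented for illustration

# Evidence: p(X) = sum_k p(X | omega_k) p(omega_k)
evidence = sum(likelihood[k] * prior[k] for k in likelihood)

# Bayes' rule: p(omega_k | X) = p(X | omega_k) p(omega_k) / p(X)
posterior = {k: likelihood[k] * prior[k] / evidence for k in likelihood}

# Decision rule: omega_hat = arg max_k p(omega_k | X)
k_hat = max(posterior, key=posterior.get)
```

Note that a class with the smaller prior can still win when its likelihood is large enough; here class 1 is chosen despite its smaller prior, because Bayes' rule weighs likelihood by the prior.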
Expected Values and Moments

The average value is the first moment of the samples

For a numerical feature value for {X_m}, the expected value E{X} is defined as the average, or mean:

   µ_x = E{X} = (1/M) Σ_{m=1}^{M} X_m

E{X} is the first moment (or center of gravity) of the values of {X_m}. This can be seen from the histogram h(n). The mass of the histogram is the zeroth moment,

   M = Σ_{n=1}^{N} h(n)

which is also the number of samples used to compute h(n). The expected value of {X_m} is the average µ:

   µ = E{X_m} = (1/M) Σ_{m=1}^{M} X_m

This is also the first moment of h(n):

   µ = (1/M) Σ_{n=1}^{N} h(n) · n

Thus the center of gravity of the histogram is the expected value of the random variable:

   µ = E{X_m} = (1/M) Σ_{m=1}^{M} X_m = (1/M) Σ_{n=1}^{N} h(n) · n

The second moment is the expected squared deviation from the first moment:

   σ² = E{(X − E{X})²} = (1/M) Σ_{m=1}^{M} (X_m − µ)² = (1/M) Σ_{n=1}^{N} h(n) · (n − µ)²
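The moment computations above can be checked with a small sketch, computing the mean and variance both directly from the samples and from the histogram's center of gravity. The sample values are invented for illustration:

```python
# Invented integer feature values X_m
samples = [2, 3, 3, 4, 4, 4, 5, 5, 6]
M = len(samples)

# First moment (mean) directly from the samples
mu = sum(samples) / M

# Build the histogram h(n) over the N possible values
N = max(samples) + 1
h = [0] * N
for x in samples:
    h[x] += 1

# The mass of the histogram is the zeroth moment: M = sum_n h(n)
# The center of gravity of h(n) gives the same mean
mu_hist = sum(h[n] * n for n in range(N)) / M

# Second moment: expected squared deviation from the mean,
# computed from the samples and from the histogram
sigma2 = sum((x - mu) ** 2 for x in samples) / M
sigma2_hist = sum(h[n] * (n - mu) ** 2 for n in range(N)) / M
```

Both routes give the same µ and σ², since the histogram merely groups identical sample values.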
Probability Density Functions

In many cases, the number of possible feature values, N, or the number of features, D, makes a histogram-based approach infeasible. In such cases we can replace h(x) with a probability density function (pdf).

A probability density function of a continuous random variable is a function that describes the relative likelihood for this random variable to occur at a given point in the observation space.

Note: likelihood is not probability. Definition: likelihood is a relative measure of belief or certainty. We will use the likelihood to determine the parameters for parametric models of probability density functions. To do this, we first need to define probability density functions.

A probability density function, p(X⃗), is a function of a continuous variable or vector, X⃗ ∈ R^D, of random variables such that:

   1) X⃗ is a vector of D real-valued random variables with values in [−∞, ∞]
   2) ∫_{−∞}^{∞} p(X⃗) dX⃗ = 1

In this case we replace

   (1/M) h(n)  →  p(X⃗)

For the Bayesian conditional density (where ω_k = E ∈ T_k):

   (1/M_k) h_k(n)  →  p(X⃗ | ω_k)

Thus:

   p(ω_k | X⃗) = (p(X⃗ | ω_k) / p(X⃗)) · p(ω_k)

Note that the ratio of two pdfs gives a probability value. This equation can be interpreted as

   Posterior probability = (likelihood / evidence) × prior probability
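A quick sketch of the replacement (1/M) h(n) → p(X): for a large sample, the normalized count in a histogram cell approximates the density times the cell width. The bin layout and sample size below are illustrative choices, not from the lecture:

```python
import math
import random

random.seed(2)
M = 100000
samples = [random.gauss(0.0, 1.0) for _ in range(M)]

# Histogram over [-4, 4) with Q cells
Q, lo, hi = 40, -4.0, 4.0
width = (hi - lo) / Q
h = [0] * Q
for x in samples:
    n = int((x - lo) / width)
    if 0 <= n < Q:
        h[n] += 1

def p(x):
    """Standard normal density N(x; 0, 1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# The cell containing x = 0 covers [0.0, 0.2); it holds roughly
# M * p(x) * width samples, so (1/M) h(n) approximates p(x) * width
# (evaluated here at the cell midpoint 0.1).
n0 = int((0.0 - lo) / width)
approx = h[n0] / M
```

Shrinking the cell width (and growing M accordingly) makes (1/M) h(n) / width converge to the density itself.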
The probability for a random variable to fall within a given interval is given by the integral of its density:

   p(X ∈ [A, B]) = ∫_A^B p(x) dx

Thus some authors work with cumulative distribution functions:

   P(x) = ∫_{−∞}^{x} p(u) du

Probability density functions are a primary tool for designing recognition machines. There is one more tool we need: Gaussian (normal) density functions.
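The two formulations agree: integrating the density over [A, B] gives the same probability as the difference of the cumulative distribution, P(B) − P(A). The sketch below checks this for a standard normal density; the trapezoidal integration is an illustrative choice:

```python
import math

def p(x, mu=0.0, sigma=1.0):
    """Normal density N(x; mu, sigma)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def integrate(f, a, b, steps=10000):
    """Simple trapezoidal rule for the integral of f over [a, b]."""
    dx = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    total += sum(f(a + i * dx) for i in range(1, steps))
    return total * dx

def cdf(x, mu=0.0, sigma=1.0):
    """Cumulative distribution P(x) of the normal, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

A, B = -1.0, 1.0
prob_integral = integrate(p, A, B)   # integral of the density over [A, B]
prob_cdf = cdf(B) - cdf(A)           # difference of the CDF
```

For [A, B] = [µ − σ, µ + σ] both give about 0.683, the familiar one-sigma probability mass of a normal density.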
Expected Values for PDFs

Just as with histograms, the expected value is the first moment of a pdf. Remember that for a pdf the mass is, by definition:

   ∫_{−∞}^{∞} p(x) dx = 1

For the continuous random variable {X_m} with pdf p(x):

   µ = E{X} = (1/M) Σ_{m=1}^{M} X_m = ∫_{−∞}^{∞} p(x) · x dx

The second moment is

   σ² = E{(X − µ)²} = (1/M) Σ_{m=1}^{M} (X_m − µ)² = ∫_{−∞}^{∞} p(x) · (x − µ)² dx

Note that

   σ² = E{(X − µ)²} = E{X²} − µ² = E{X²} − E{X}²

Note that this is a biased variance. The unbiased variance would be

   σ² = (1/(M−1)) Σ_{m=1}^{M} (X_m − µ)²

If we draw a random sample {X_m} of M random variables from a normal density with parameters (µ, σ),

   {X_m} ~ N(x; µ, σ)

and then compute the moments, we obtain

   µ̂ = E{X_m} = (1/M) Σ_{m=1}^{M} X_m     and     σ̂² = (1/M) Σ_{m=1}^{M} (X_m − µ̂)²

where

   σ̃² = (M/(M−1)) · σ̂²
Note the notation: ~ means "true", ^ means "estimated". The expectation of σ̂² underestimates the true variance by a factor of 1/M:

   E{σ̂²} = (1 − 1/M) · σ̃²

Later we will see that the expected RMS error for estimating p(x) from M samples is related to this bias. But first, we need to examine the Gaussian (or normal) density function.
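The underestimate can be observed empirically: averaging the (1/M) variance estimator over many repeated draws approaches (1 − 1/M) σ̃², not σ̃². The sample size, trial count, and distribution parameters below are illustrative:

```python
import random

random.seed(0)
M, trials = 5, 50000
true_var = 4.0   # samples drawn with sigma = 2

acc = 0.0
for _ in range(trials):
    xs = [random.gauss(0.0, 2.0) for _ in range(M)]
    m = sum(xs) / M
    # Biased (1/M) estimator, with the sample mean plugged in for mu
    acc += sum((x - m) ** 2 for x in xs) / M

# Average over trials approaches ((M - 1)/M) * true_var = 3.2, not 4.0
expected_biased = acc / trials
```

The bias comes from using the estimated mean µ̂ instead of the true µ: the samples are, on average, closer to their own mean than to the true one.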
The Normal (Gaussian) Density Function

Whenever a random variable is determined by a sequence of independent random events, the outcome will be a normal (Gaussian) density function. This is demonstrated by the Central Limit Theorem. The essence of the derivation is that repeated convolution of any finite density function tends asymptotically to a Gaussian (or normal) function:

   p(x) = N(x; µ, σ) = (1/(√(2π) σ)) · e^{−(x−µ)²/(2σ²)}

[Figure: the density N(x; µ, σ), with marks on the x axis at µ−σ, µ, and µ+σ]

The parameters of N(x; µ, σ) are the first and second moments. This is sometimes expressed as a conditional: N(X | µ, σ).

For most densities p(x), the M-fold convolution tends to a normal density:

   p(x) * p(x) * ... * p(x)  →  N(x; µ, σ)   as M → ∞

This is the Central Limit Theorem. An exception is the Dirac delta, p(x) = δ(x).
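The convolution argument can be sketched numerically: starting from a finite uniform ("box") density and convolving it with itself repeatedly yields a symmetric, bell-shaped density, as the Central Limit Theorem predicts. The box width and the number of convolutions are illustrative choices:

```python
def convolve(p, q):
    """Discrete convolution of two densities given as lists of cell masses."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

box = [0.25] * 4      # a finite uniform density over 4 cells
density = box
for _ in range(5):    # 6-fold convolution in total
    density = convolve(density, box)

# The result keeps unit mass, is symmetric about its center cell,
# and peaks there, approaching the shape of a normal density.
```

Each convolution corresponds to adding one more independent uniform variable; after only six additions the density is already visibly bell-shaped.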
Multivariate Normal Density Functions

In most practical cases, an observation is described by D features. In this case a training set {X⃗_m} can be used to calculate an average feature vector µ⃗:

   µ⃗ = E{X⃗} = (1/M) Σ_{m=1}^{M} X⃗_m = (E{X₁}, E{X₂}, ..., E{X_D})^T = (µ₁, µ₂, ..., µ_D)^T

If the features are mapped onto integers from [1, N], {X⃗_m} → {n⃗_m}, we can build a multi-dimensional histogram using a D-dimensional table:

   ∀m = 1, M :  h(n⃗_m) ← h(n⃗_m) + 1

As before, the average feature vector µ⃗ is the center of gravity (first moment) of the histogram:

   µ_d = E{n_d} = (1/M) Σ_{m=1}^{M} n_dm = (1/M) Σ_{n₁=1}^{N} Σ_{n₂=1}^{N} ... Σ_{n_D=1}^{N} h(n₁, n₂, ..., n_D) · n_d

For real-valued X⃗:

   µ_d = E{X_d} = (1/M) Σ_{m=1}^{M} X_dm = ∫∫...∫ p(x₁, x₂, ..., x_D) · x_d  dx₁ dx₂ ... dx_D

In any case:

   µ⃗ = E{X⃗} = (E{X₁}, E{X₂}, ..., E{X_D})^T = (µ₁, µ₂, ..., µ_D)^T

For D dimensions, the second moment is a covariance matrix composed of D² terms:
   σ_ij² = E{(X_i − µ_i)(X_j − µ_j)} = (1/M) Σ_{m=1}^{M} (X_im − µ_i)(X_jm − µ_j)

This is often written

   Σ = E{(X⃗ − E{X⃗})(X⃗ − E{X⃗})^T}

and gives

       ( σ₁₁²  σ₁₂²  ...  σ₁D² )
   Σ = ( σ₂₁²  σ₂₂²  ...  σ₂D² )
       ( ...   ...   ...  ...  )
       ( σD₁²  σD₂²  ...  σDD² )

This provides the parameters for

   p(X⃗) = N(X⃗; µ⃗, Σ) = (1/((2π)^{D/2} det(Σ)^{1/2})) · e^{−½ (X⃗ − µ⃗)^T Σ^{−1} (X⃗ − µ⃗)}

The term in the exponent is positive and quadratic (2nd order). It is known as the Mahalanobis distance:

   d(X⃗; µ⃗, Σ) = (X⃗ − µ⃗)^T Σ^{−1} (X⃗ − µ⃗)

This is a distance normalized by the covariance. In this case, the covariance is said to provide the distance metric. This is very useful when the components of X⃗ have different units. The result can be visualized by looking at the equi-probability contours.

[Figure: equi-probability contours of p(x⃗ | µ⃗, Σ) in the (x₁, x₂) plane]
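The covariance matrix, Mahalanobis distance, and multivariate normal density can be sketched for D = 2; the five sample vectors below are invented for illustration:

```python
import math

# Invented 2-D training samples {X_m}
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
M = len(X)
mu = [sum(x[d] for x in X) / M for d in range(2)]

# Covariance matrix: sigma_ij^2 = (1/M) sum_m (X_im - mu_i)(X_jm - mu_j)
C = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in X) / M for j in range(2)]
     for i in range(2)]

# 2x2 determinant and inverse, done by hand for this small case
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
Cinv = [[C[1][1] / det, -C[0][1] / det],
        [-C[1][0] / det, C[0][0] / det]]

def mahalanobis2(x):
    """Squared Mahalanobis distance (x - mu)^T C^-1 (x - mu)."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return (d[0] * (Cinv[0][0] * d[0] + Cinv[0][1] * d[1])
            + d[1] * (Cinv[1][0] * d[0] + Cinv[1][1] * d[1]))

def pdf(x):
    """Multivariate normal N(x; mu, C) for D = 2."""
    return math.exp(-0.5 * mahalanobis2(x)) / (2 * math.pi * math.sqrt(det))
```

The distance is zero at the mean and grows with the covariance-normalized deviation, so the density peaks at µ⃗ and the equi-probability contours are the level sets of the Mahalanobis distance.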
If x_i and x_j are statistically independent, then σ_ij² = 0. For positive values of σ_ij², x_i and x_j vary together. For negative values of σ_ij², x_i and x_j vary in opposite directions.

For example, consider the features x₁ = height (m) and x₂ = weight (kg). In most people, height and weight vary together, and so σ₁₂² would be positive.
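A sketch of the height/weight example with invented sample values: taller people weigh more, so the products of deviations are positive and σ₁₂² comes out positive:

```python
# Invented sample values for x1 = height (m) and x2 = weight (kg)
heights = [1.60, 1.65, 1.70, 1.80, 1.85]
weights = [55.0, 62.0, 68.0, 80.0, 85.0]
M = len(heights)
mu_h = sum(heights) / M
mu_w = sum(weights) / M

# sigma_12^2 = (1/M) sum_m (x1_m - mu_1)(x2_m - mu_2)
sigma_hw = sum((h - mu_h) * (w - mu_w) for h, w in zip(heights, weights)) / M
```

Each sample here lies above both means or below both means, so every term of the sum is positive; data varying in opposite directions would make the terms, and hence σ₁₂², negative.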