Probability Distributions


2. Probability Distributions

In Chapter 1, we emphasized the central role played by probability theory in the solution of pattern recognition problems. We turn now to an exploration of some particular examples of probability distributions and their properties. As well as being of great interest in their own right, these distributions can form building blocks for more complex models and will be used extensively throughout the book. The distributions introduced in this chapter will also serve another important purpose, namely to provide us with the opportunity to discuss some key statistical concepts, such as Bayesian inference, in the context of simple models before we encounter them in more complex situations in later chapters.

One role for the distributions discussed in this chapter is to model the probability distribution p(x) of a random variable x, given a finite set x_1, ..., x_N of observations. This problem is known as density estimation. For the purposes of this chapter, we shall assume that the data points are independent and identically distributed. It should be emphasized that the problem of density estimation is fundamentally ill-posed, because there are infinitely many probability distributions that could have given rise to the observed finite data set. Indeed, any distribution p(x) that is nonzero at each of the data points x_1, ..., x_N is a potential candidate. The issue of choosing an appropriate distribution relates to the problem of model selection that has already been encountered in the context of polynomial curve fitting in Chapter 1 and that is a central issue in pattern recognition.

We begin by considering the binomial and multinomial distributions for discrete random variables and the Gaussian distribution for continuous random variables. These are specific examples of parametric distributions, so-called because they are governed by a small number of adaptive parameters, such as the mean and variance in the case of a Gaussian for example. To apply such models to the problem of density estimation, we need a procedure for determining suitable values for the parameters, given an observed data set. In a frequentist treatment, we choose specific values for the parameters by optimizing some criterion, such as the likelihood function. By contrast, in a Bayesian treatment we introduce prior distributions over the parameters and then use Bayes' theorem to compute the corresponding posterior distribution given the observed data.

We shall see that an important role is played by conjugate priors, which lead to posterior distributions having the same functional form as the prior, and which therefore lead to a greatly simplified Bayesian analysis. For example, the conjugate prior for the parameters of the multinomial distribution is called the Dirichlet distribution, while the conjugate prior for the mean of a Gaussian is another Gaussian. All of these distributions are examples of the exponential family of distributions, which possess a number of important properties, and which will be discussed in some detail.

One limitation of the parametric approach is that it assumes a specific functional form for the distribution, which may turn out to be inappropriate for a particular application. An alternative approach is given by nonparametric density estimation methods, in which the form of the distribution typically depends on the size of the data set. Such models still contain parameters, but these control the model complexity rather than the form of the distribution. We end this chapter by considering three nonparametric methods based respectively on histograms, nearest-neighbours, and kernels.

2.1. Binary Variables

We begin by considering a single binary random variable x ∈ {0, 1}. For example, x might describe the outcome of flipping a coin, with x = 1 representing 'heads' and x = 0 representing 'tails'. We can imagine that this is a damaged coin so that the probability of landing heads is not necessarily the same as that of landing tails. The probability of x = 1 will be denoted by the parameter µ, so that

p(x = 1 | µ) = µ    (2.1)

where 0 ≤ µ ≤ 1, from which it follows that p(x = 0 | µ) = 1 − µ. The probability distribution over x can therefore be written in the form

Bern(x | µ) = µ^x (1 − µ)^(1−x)    (2.2)

which is known as the Bernoulli distribution. It is easily verified (Exercise 2.1) that this distribution is normalized and that it has mean and variance given by

E[x] = µ    (2.3)

var[x] = µ(1 − µ).    (2.4)

Now suppose we have a data set D = {x_1, ..., x_N} of observed values of x. We can construct the likelihood function, which is a function of µ, on the assumption that the observations are drawn independently from p(x | µ), so that

p(D | µ) = ∏_{n=1}^{N} p(x_n | µ) = ∏_{n=1}^{N} µ^{x_n} (1 − µ)^{1−x_n}.    (2.5)

In a frequentist setting, we can estimate a value for µ by maximizing the likelihood function, or equivalently by maximizing the logarithm of the likelihood. In the case of the Bernoulli distribution, the log likelihood function is given by

ln p(D | µ) = ∑_{n=1}^{N} ln p(x_n | µ) = ∑_{n=1}^{N} {x_n ln µ + (1 − x_n) ln(1 − µ)}.    (2.6)

At this point, it is worth noting that the log likelihood function depends on the N observations x_n only through their sum ∑_n x_n. This sum provides an example of a sufficient statistic for the data under this distribution, and we shall study the important role of sufficient statistics in some detail in Section 2.4. If we set the derivative of ln p(D | µ) with respect to µ equal to zero, we obtain the maximum likelihood estimator

µ_ML = (1/N) ∑_{n=1}^{N} x_n    (2.7)

which is also known as the sample mean.

Jacob Bernoulli: Jacob Bernoulli, also known as Jacques or James Bernoulli, was a Swiss mathematician and was the first of many in the Bernoulli family to pursue a career in science and mathematics. Although compelled to study philosophy and theology against his will by his parents, he travelled extensively after graduating in order to meet with many of the leading scientists of his time, including Boyle and Hooke in England. When he returned to Switzerland, he taught mechanics and became Professor of Mathematics at Basel in 1687. Unfortunately, rivalry between Jacob and his younger brother Johann turned an initially productive collaboration into a bitter and public dispute. Jacob's most significant contributions to mathematics appeared in The Art of Conjecture, published in 1713, eight years after his death, which deals with topics in probability theory including what has become known as the Bernoulli distribution.
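To make (2.5)-(2.7) concrete, here is a minimal Python sketch (the data values are hypothetical, and NumPy is assumed to be available); it evaluates the log likelihood (2.6) and confirms that the maximum likelihood estimator (2.7) is just the sample mean:

```python
import numpy as np

# Hypothetical coin-flip data: 1 = heads, 0 = tails.
x = np.array([1, 0, 0, 1, 1, 0, 1, 1])
N = len(x)

def log_likelihood(mu):
    """Log likelihood (2.6) of the Bernoulli parameter mu."""
    return np.sum(x * np.log(mu) + (1 - x) * np.log(1 - mu))

mu_ml = x.mean()  # maximum likelihood estimator (2.7): the sample mean
print(mu_ml)      # 0.625

# The log likelihood is largest at the grid point nearest the sample mean.
grid = np.linspace(0.01, 0.99, 99)
print(grid[np.argmax([log_likelihood(mu) for mu in grid])])  # 0.62
```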

Figure 2.1 Histogram plot of the binomial distribution (2.9) as a function of m for N = 10 and µ = 0.25.

If we denote the number of observations of x = 1 (heads) within this data set by m, then we can write (2.7) in the form

µ_ML = m / N    (2.8)

so that the probability of landing heads is given, in this maximum likelihood framework, by the fraction of observations of heads in the data set.

Now suppose we flip a coin, say, 3 times and happen to observe 3 heads. Then N = m = 3 and µ_ML = 1. In this case, the maximum likelihood result would predict that all future observations should give heads. Common sense tells us that this is unreasonable, and in fact this is an extreme example of the over-fitting associated with maximum likelihood. We shall see shortly how to arrive at more sensible conclusions through the introduction of a prior distribution over µ.

We can also work out the distribution of the number m of observations of x = 1, given that the data set has size N. This is called the binomial distribution, and from (2.5) we see that it is proportional to µ^m (1 − µ)^(N−m). In order to obtain the normalization coefficient we note that out of N coin flips, we have to add up all of the possible ways of obtaining m heads (Exercise 2.3), so that the binomial distribution can be written

Bin(m | N, µ) = (N choose m) µ^m (1 − µ)^(N−m)    (2.9)

where

(N choose m) ≡ N! / ((N − m)! m!)    (2.10)

is the number of ways of choosing m objects out of a total of N identical objects. Figure 2.1 shows a plot of the binomial distribution for N = 10 and µ = 0.25.

The mean and variance of the binomial distribution can be found by using the result of Exercise 1.10, which shows that for independent events the mean of the sum is the sum of the means, and the variance of the sum is the sum of the variances.
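The binomial probabilities (2.9)-(2.10), and the mean and variance derived next, are easy to check numerically; a minimal sketch using only the Python standard library:

```python
from math import comb

def binomial_pmf(m, N, mu):
    """Bin(m | N, mu) from (2.9), with the coefficient (2.10)."""
    return comb(N, m) * mu**m * (1 - mu)**(N - m)

N, mu = 10, 0.25   # the values used in Figure 2.1
pmf = [binomial_pmf(m, N, mu) for m in range(N + 1)]

print(abs(sum(pmf) - 1) < 1e-12)                # True: (2.9) is normalized
mean = sum(m * p for m, p in enumerate(pmf))
var = sum((m - mean)**2 * p for m, p in enumerate(pmf))
print(mean, N * mu)                             # both approx. 2.5
print(var, N * mu * (1 - mu))                   # both approx. 1.875
```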

Because m = x_1 + ... + x_N, and for each observation the mean and variance are given by (2.3) and (2.4), respectively, we have

E[m] ≡ ∑_{m=0}^{N} m Bin(m | N, µ) = Nµ    (2.11)

var[m] ≡ ∑_{m=0}^{N} (m − E[m])^2 Bin(m | N, µ) = Nµ(1 − µ).    (2.12)

These results can also be proved directly using calculus (Exercise 2.4).

2.1.1 The beta distribution

We have seen in (2.8) that the maximum likelihood setting for the parameter µ in the Bernoulli distribution, and hence in the binomial distribution, is given by the fraction of the observations in the data set having x = 1. As we have already noted, this can give severely over-fitted results for small data sets. In order to develop a Bayesian treatment for this problem, we need to introduce a prior distribution p(µ) over the parameter µ. Here we consider a form of prior distribution that has a simple interpretation as well as some useful analytical properties. To motivate this prior, we note that the likelihood function takes the form of the product of factors of the form µ^x (1 − µ)^(1−x). If we choose a prior to be proportional to powers of µ and (1 − µ), then the posterior distribution, which is proportional to the product of the prior and the likelihood function, will have the same functional form as the prior. This property is called conjugacy, and we will see several examples of it later in this chapter. We therefore choose a prior, called the beta distribution, given by

Beta(µ | a, b) = [Γ(a + b) / (Γ(a) Γ(b))] µ^(a−1) (1 − µ)^(b−1)    (2.13)

where Γ(x) is the gamma function defined by (1.141), and the coefficient in (2.13) ensures that the beta distribution is normalized (Exercise 2.5), so that

∫_0^1 Beta(µ | a, b) dµ = 1.    (2.14)

The mean and variance of the beta distribution are given by (Exercise 2.6)

E[µ] = a / (a + b)    (2.15)

var[µ] = ab / ((a + b)^2 (a + b + 1)).    (2.16)

The parameters a and b are often called hyperparameters because they control the distribution of the parameter µ. Figure 2.2 shows plots of the beta distribution for various values of the hyperparameters.

The posterior distribution of µ is now obtained by multiplying the beta prior (2.13) by the binomial likelihood function (2.9) and normalizing. Keeping only the factors that depend on µ, we see that this posterior distribution has the form

p(µ | m, l, a, b) ∝ µ^(m+a−1) (1 − µ)^(l+b−1)    (2.17)

where l = N − m, and therefore corresponds to the number of 'tails' in the coin example.
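In code, the beta density (2.13) needs only the gamma function, and the update implied by (2.17) is pure parameter arithmetic; a sketch (the prior hyperparameters and counts below are hypothetical):

```python
from math import gamma

def beta_pdf(mu, a, b):
    """Beta(mu | a, b) from (2.13)."""
    return gamma(a + b) / (gamma(a) * gamma(b)) * mu**(a - 1) * (1 - mu)**(b - 1)

a, b = 2.0, 2.0   # hypothetical prior hyperparameters
m, l = 7, 3       # hypothetical counts of x = 1 and x = 0

# By (2.17), prior times likelihood is itself a beta distribution with
# parameters (m + a, l + b); the binomial coefficient is omitted below
# because it does not depend on mu.
mu = 0.6
unnormalized = beta_pdf(mu, a, b) * mu**m * (1 - mu)**l
posterior = beta_pdf(mu, m + a, l + b)
print(unnormalized / posterior)  # constant in mu: same value for any mu
```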

Figure 2.2 Plots of the beta distribution Beta(µ | a, b) given by (2.13) as a function of µ for various values of the hyperparameters a and b: (a = 0.1, b = 0.1), (a = 1, b = 1), (a = 2, b = 3), and (a = 8, b = 4).

We see that (2.17) has the same functional dependence on µ as the prior distribution, reflecting the conjugacy properties of the prior with respect to the likelihood function. Indeed, it is simply another beta distribution, and its normalization coefficient can therefore be obtained by comparison with (2.13) to give

p(µ | m, l, a, b) = [Γ(m + a + l + b) / (Γ(m + a) Γ(l + b))] µ^(m+a−1) (1 − µ)^(l+b−1).    (2.18)

We see that the effect of observing a data set of m observations of x = 1 and l observations of x = 0 has been to increase the value of a by m, and the value of b by l, in going from the prior distribution to the posterior distribution. This allows us to provide a simple interpretation of the hyperparameters a and b in the prior as an effective number of observations of x = 1 and x = 0, respectively. Note that a and b need not be integers. Furthermore, the posterior distribution can act as the prior if we subsequently observe additional data.

Figure 2.3 Illustration of one step of sequential Bayesian inference, showing the prior, the likelihood function, and the posterior. The prior is given by a beta distribution with parameters a = 2, b = 2, and the likelihood function, given by (2.9) with N = m = 1, corresponds to a single observation of x = 1, so that the posterior is given by a beta distribution with parameters a = 3, b = 2.

To see this, we can imagine taking observations one at a time and after each observation updating the current posterior distribution by multiplying by the likelihood function for the new observation and then normalizing to obtain the new, revised posterior distribution. At each stage, the posterior is a beta distribution with some total number of (prior and actual) observed values for x = 1 and x = 0 given by the parameters a and b. Incorporation of an additional observation of x = 1 simply corresponds to incrementing the value of a by 1, whereas for an observation of x = 0 we increment the value of b by 1. Figure 2.3 illustrates one step in this process.

We see that this sequential approach to learning arises naturally when we adopt a Bayesian viewpoint. It is independent of the choice of prior and of the likelihood function and depends only on the assumption of i.i.d. data. Sequential methods make use of observations one at a time, or in small batches, and then discard them before the next observations are used. They can be used, for example, in real-time learning scenarios where a steady stream of data is arriving, and predictions must be made before all of the data is seen. Because they do not require the whole data set to be stored or loaded into memory, sequential methods are also useful for large data sets. Maximum likelihood methods can also be cast into a sequential framework (Section 2.3.5).

If our goal is to predict, as best we can, the outcome of the next trial, then we must evaluate the predictive distribution of x, given the observed data set D. From the sum and product rules of probability, this takes the form

p(x = 1 | D) = ∫_0^1 p(x = 1 | µ) p(µ | D) dµ = ∫_0^1 µ p(µ | D) dµ = E[µ | D].    (2.19)

Using the result (2.18) for the posterior distribution p(µ | D), together with the result (2.15) for the mean of the beta distribution, we obtain

p(x = 1 | D) = (m + a) / (m + a + l + b)    (2.20)

which has a simple interpretation as the total fraction of observations (both real observations and fictitious prior observations) that correspond to x = 1. Note that in the limit of an infinitely large data set m, l → ∞, the result (2.20) reduces to the maximum likelihood result (2.8). As we shall see, it is a very general property that the Bayesian and maximum likelihood results will agree in the limit of an infinitely large data set.
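The sequential update and the predictive distribution (2.20) together make a compact sketch; the observation stream below is hypothetical:

```python
def sequential_update(a, b, stream):
    """One-at-a-time Bayesian updating of a Beta(a, b) posterior over mu."""
    for x in stream:   # each observation x is 0 or 1
        if x == 1:
            a += 1     # one more (effective) observation of x = 1
        else:
            b += 1     # one more (effective) observation of x = 0
    return a, b

a, b = 2, 2                           # the prior of Figure 2.3
a, b = sequential_update(a, b, [1])   # a single observation of x = 1
print(a, b)                           # 3 2, the posterior of Figure 2.3

# Predictive probability of x = 1 on the next trial, from (2.20):
print(a / (a + b))                    # 0.6
```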

For a finite data set, the posterior mean for µ always lies between the prior mean and the maximum likelihood estimate for µ corresponding to the relative frequencies of events given by (2.7) (Exercise 2.7).

From Figure 2.2, we see that as the number of observations increases, so the posterior distribution becomes more sharply peaked. This can also be seen from the result (2.16) for the variance of the beta distribution, in which we see that the variance goes to zero for a → ∞ or b → ∞. In fact, we might wonder whether it is a general property of Bayesian learning that, as we observe more and more data, the uncertainty represented by the posterior distribution will steadily decrease.

To address this, we can take a frequentist view of Bayesian learning and show that, on average, such a property does indeed hold. Consider a general Bayesian inference problem for a parameter θ for which we have observed a data set D, described by the joint distribution p(θ, D). The following result (Exercise 2.8)

E_θ[θ] = E_D[E_θ[θ | D]]    (2.21)

where

E_θ[θ] ≡ ∫ p(θ) θ dθ    (2.22)

E_D[E_θ[θ | D]] ≡ ∫ { ∫ θ p(θ | D) dθ } p(D) dD    (2.23)

says that the posterior mean of θ, averaged over the distribution generating the data, is equal to the prior mean of θ. Similarly, we can show that

var_θ[θ] = E_D[var_θ[θ | D]] + var_D[E_θ[θ | D]].    (2.24)

The term on the left-hand side of (2.24) is the prior variance of θ. On the right-hand side, the first term is the average posterior variance of θ, and the second term measures the variance in the posterior mean of θ. Because this variance is a positive quantity, this result shows that, on average, the posterior variance of θ is smaller than the prior variance. The reduction in variance is greater if the variance in the posterior mean is greater. Note, however, that this result only holds on average, and that for a particular observed data set it is possible for the posterior variance to be larger than the prior variance.

2.2. Multinomial Variables

Binary variables can be used to describe quantities that can take one of two possible values. Often, however, we encounter discrete variables that can take on one of K possible mutually exclusive states. Although there are various alternative ways to express such variables, we shall see shortly that a particularly convenient representation is the 1-of-K scheme, in which the variable is represented by a K-dimensional vector x in which one of the elements x_k equals 1, and all remaining elements equal 0.

An interesting property of the nearest-neighbour (K = 1) classifier is that, in the limit N → ∞, the error rate is never more than twice the minimum achievable error rate of an optimal classifier, i.e., one that uses the true class distributions (Cover and Hart, 1967).

As discussed so far, both the K-nearest-neighbour method and the kernel density estimator require the entire training data set to be stored, leading to expensive computation if the data set is large. This effect can be offset, at the expense of some additional one-off computation, by constructing tree-based search structures to allow (approximate) near neighbours to be found efficiently without doing an exhaustive search of the data set. Nevertheless, these nonparametric methods are still severely limited. On the other hand, we have seen that simple parametric models are very restricted in terms of the forms of distribution that they can represent. We therefore need to find density models that are very flexible and yet for which the complexity of the models can be controlled independently of the size of the training set, and we shall see in subsequent chapters how to achieve this.

Exercises

2.1 ( ) www Verify that the Bernoulli distribution (2.2) satisfies the following properties

∑_{x=0}^{1} p(x | µ) = 1    (2.257)

E[x] = µ    (2.258)

var[x] = µ(1 − µ).    (2.259)

Show that the entropy H[x] of a Bernoulli distributed random binary variable x is given by

H[x] = −µ ln µ − (1 − µ) ln(1 − µ).    (2.260)

2.2 ( ) The form of the Bernoulli distribution given by (2.2) is not symmetric between the two values of x. In some situations, it will be more convenient to use an equivalent formulation for which x ∈ {−1, 1}, in which case the distribution can be written

p(x | µ) = ((1 − µ)/2)^((1−x)/2) ((1 + µ)/2)^((1+x)/2)    (2.261)

where µ ∈ [−1, 1]. Show that the distribution (2.261) is normalized, and evaluate its mean, variance, and entropy.

2.3 ( ) www In this exercise, we prove that the binomial distribution (2.9) is normalized. First use the definition (2.10) of the number of combinations of m identical objects chosen from a total of N to show that

(N choose m) + (N choose m−1) = (N+1 choose m).    (2.262)

Use this result to prove by induction the following result

(1 + x)^N = ∑_{m=0}^{N} (N choose m) x^m    (2.263)

which is known as the binomial theorem, and which is valid for all real values of x. Finally, show that the binomial distribution is normalized, so that

∑_{m=0}^{N} (N choose m) µ^m (1 − µ)^(N−m) = 1    (2.264)

which can be done by first pulling a factor (1 − µ)^N out of the summation and then making use of the binomial theorem.

2.4 ( ) Show that the mean of the binomial distribution is given by (2.11). To do this, differentiate both sides of the normalization condition (2.264) with respect to µ and then rearrange to obtain an expression for the mean of m. Similarly, by differentiating (2.264) twice with respect to µ and making use of the result (2.11) for the mean of the binomial distribution, prove the result (2.12) for the variance of the binomial.

2.5 ( ) www In this exercise, we prove that the beta distribution, given by (2.13), is correctly normalized, so that (2.14) holds. This is equivalent to showing that

∫_0^1 µ^(a−1) (1 − µ)^(b−1) dµ = Γ(a) Γ(b) / Γ(a + b).    (2.265)

From the definition (1.141) of the gamma function, we have

Γ(a) Γ(b) = ∫_0^∞ exp(−x) x^(a−1) dx ∫_0^∞ exp(−y) y^(b−1) dy.    (2.266)

Use this expression to prove (2.265) as follows. First bring the integral over y inside the integrand of the integral over x, next make the change of variable t = y + x where x is fixed, then interchange the order of the x and t integrations, and finally make the change of variable x = tµ where t is fixed.

2.6 ( ) Make use of the result (2.265) to show that the mean, variance, and mode of the beta distribution (2.13) are given respectively by

E[µ] = a / (a + b)    (2.267)

var[µ] = ab / ((a + b)^2 (a + b + 1))    (2.268)

mode[µ] = (a − 1) / (a + b − 2).    (2.269)
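As a quick numerical sanity check on (2.265) (not a substitute for the derivation), one can compare a quadrature of the left-hand side against the gamma-function expression; a sketch assuming SciPy is available:

```python
from math import gamma
from scipy.integrate import quad

a, b = 3.5, 2.0
integral, _ = quad(lambda mu: mu**(a - 1) * (1 - mu)**(b - 1), 0, 1)
exact = gamma(a) * gamma(b) / gamma(a + b)
print(abs(integral - exact) < 1e-10)  # True, consistent with (2.265)
```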

2.7 ( ) Consider a binomial random variable x given by (2.9), with prior distribution for µ given by the beta distribution (2.13), and suppose we have observed m occurrences of x = 1 and l occurrences of x = 0. Show that the posterior mean value of x lies between the prior mean and the maximum likelihood estimate for µ. To do this, show that the posterior mean can be written as λ times the prior mean plus (1 − λ) times the maximum likelihood estimate, where 0 ≤ λ ≤ 1. This illustrates the concept of the posterior distribution being a compromise between the prior distribution and the maximum likelihood solution.

2.8 ( ) Consider two variables x and y with joint distribution p(x, y). Prove the following two results

E[x] = E_y[E_x[x | y]]    (2.270)

var[x] = E_y[var_x[x | y]] + var_y[E_x[x | y]].    (2.271)

Here E_x[x | y] denotes the expectation of x under the conditional distribution p(x | y), with a similar notation for the conditional variance.

2.9 ( ) www In this exercise, we prove the normalization of the Dirichlet distribution (2.38) using induction. We have already shown in Exercise 2.5 that the beta distribution, which is a special case of the Dirichlet for M = 2, is normalized. We now assume that the Dirichlet distribution is normalized for M − 1 variables and prove that it is normalized for M variables. To do this, consider the Dirichlet distribution over M variables, and take account of the constraint ∑_{k=1}^{M} µ_k = 1 by eliminating µ_M, so that the Dirichlet is written

p_M(µ_1, ..., µ_{M−1}) = C_M ∏_{k=1}^{M−1} µ_k^(α_k − 1) (1 − ∑_{j=1}^{M−1} µ_j)^(α_M − 1)    (2.272)

and our goal is to find an expression for C_M. To do this, integrate over µ_{M−1}, taking care over the limits of integration, and then make a change of variable so that this integral has limits 0 and 1. By assuming the correct result for C_{M−1} and making use of (2.265), derive the expression for C_M.

2.10 ( ) Using the property Γ(x + 1) = xΓ(x) of the gamma function, derive the following results for the mean, variance, and covariance of the Dirichlet distribution given by (2.38), where α_0 is defined by (2.39):

E[µ_j] = α_j / α_0    (2.273)

var[µ_j] = α_j (α_0 − α_j) / (α_0^2 (α_0 + 1))    (2.274)

cov[µ_j, µ_l] = −α_j α_l / (α_0^2 (α_0 + 1)),   j ≠ l.    (2.275)
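The identities (2.270) and (2.271) of Exercise 2.8, and hence the results (2.21) and (2.24) of the main text, can be illustrated numerically for the beta-Bernoulli model, where the posterior moments are available in closed form via conjugacy; a Monte Carlo sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, N, reps = 2.0, 2.0, 5, 200_000

def beta_var(a, b):
    """Variance of Beta(a, b), from (2.16)."""
    return a * b / ((a + b)**2 * (a + b + 1))

theta = rng.beta(a, b, size=reps)   # parameter draws from the prior
m = rng.binomial(N, theta)          # one data set of size N per draw
ap, bp = a + m, b + (N - m)         # exact posterior parameters via conjugacy

post_mean = ap / (ap + bp)
# Prior mean and variance versus their decompositions over data sets:
print(a / (a + b), post_mean.mean())                    # both approx. 0.5
print(beta_var(a, b),
      beta_var(ap, bp).mean() + post_mean.var())        # both approx. 0.05
```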
