FW 544: Computer Lab Probability basics in R
Gabriel Harmon
During this laboratory, students will be taught the properties and uses of several continuous and discrete statistical distributions that are commonly used in ecological models. The students will learn how to generate random data from each distribution using R software and how to develop simple simulation models by passing randomly generated values from one distribution to another. This will provide students with the understanding of probability distributions that will be needed to quantify uncertainty and comprehend the basics of Bayesian probability and Monte Carlo simulation. Laboratory exercises will evaluate the ability of students to build simple simulation models that use randomly generated data to approximate ecological processes, such as survival or the occurrence of a disturbance.

Overview of probability and random variables

We introduced random (stochastic) variables and statistical distributions last week. Here we will define these ideas somewhat more formally, and get into the details of some important statistical distributions. Probability (P) can be thought of as a measure of the uncertainty in a random outcome. If we say that event X occurs with P=1, then we are certain about X; if we say P=0, then we are certain that X does not occur; and if we say P=0.5, we are equally uncertain about whether X occurs or does not. The value or outcome X referred to is a random variable, as distinguished from a deterministic variable, whose values may vary but do so in a predictable or deterministic manner. A probability distribution or statistical distribution (or distribution for short) is a model that describes the relationship between values of a random variable and the probabilities of assuming these values. The basic types of distributions are discrete and continuous.
Discrete distributions model outcomes that occur in discrete classes or integer values; examples include the Bernoulli, Binomial, and Poisson, all discussed in more detail below. Continuous distributions model outcomes that take on continuous (generally, real) values and include the Uniform, Normal, Beta, and Gamma distributions, also discussed in more detail below. The probability density function for a distribution describes the probability that the random variable takes on particular values (for a discrete distribution) or is in the neighborhood of a value (for a continuous distribution). The density is often written as f(x). For example, for a discrete random variable (e.g., from a Poisson distribution) we may write f(4)=0.45, indicating that the value of 4 is taken with probability 0.45. For a discrete distribution, f(x) will sum to 1 over the support (the region of x where f(x)>0) of the distribution. For example, if we have a binomial distribution with parameters n=5 and p=0.2, the distribution has support for x = 0, 1, 2, 3, 4, 5, with f(0)+f(1)+f(2)+f(3)+f(4)+f(5) = 1. The density for continuous distributions follows a similar idea, but because the support is continuous (and thus uncountable), f(x) is not directly interpretable as a point probability. However, by analogy to the discrete distribution, f(x) integrates to 1 over the support of f(x). The probability distribution function (or cumulative distribution function) represents the probability that the random variable x is less than or equal to a particular value: F(x) = Prob(X <= x). For discrete distributions, F(x) is readily obtained by summation, e.g., for the binomial example:

F(3) = f(0)+f(1)+f(2)+f(3) = 0.328+0.410+0.205+0.051 = 0.993
Calculation of F(x) for continuous distributions is trickier and requires integration. By definition

F(x) = integral from -infinity to x of f(v) dv

where the lower limit may sometimes be higher (e.g., if the support starts at zero). Usually these computations are done by computer functions (or looked up in standard tables). Once F(x) is available (for either discrete or continuous distributions) we can easily ask questions like "what is the probability that x is between a and b?", since

Prob(a <= X <= b) = integral from a to b of f(v) dv = F(b) - F(a).

So, for example, if we have a Normal with mean 0 and standard deviation 1, F(2)=0.9772, F(1)=0.8413, and Prob(1 <= X <= 2) = F(2) - F(1) = 0.1359. We can reverse the idea of distributions and, for a given probability level of a distribution function, obtain the value of x (or quantile) associated with that value. The quantiles are essentially found by inverting the distribution function and solving for x, though for discrete distributions they can easily be gotten by examination and interpolation between values. In practice we will get the quantiles of standard distributions using built-in functions in R. To take the example of the normal distribution (mean=0, sd=1), the quantiles associated with F(x) = 0.01, 0.5, and 0.99 are -2.33, 0.00, and 2.33, respectively. You will often encounter the term moments, which refers to a number of important functions of distributions. The most important moments are the mean and the variance. The mean is formally defined in terms of the density function as

E(x) = sum over x of x f(x)

for discrete distributions and

E(x) = integral of v f(v) dv

for continuous distributions, where in both cases summation or integration is over the support of x. The variance of x follows from the definition of expectation and the relationship

V(x) = E[(x - mu)^2].

We generally estimate these population moments by their sample equivalents, the sample mean and variance,

mu_hat = x_bar = sum(x_i)/n and sigma_hat^2 = s^2 = sum((x_i - x_bar)^2)/(n - 1),

where n is the size of a random sample. The Normal Distribution is an example where the distribution's parameters (the constants that determine the behavior of the distribution and what it will predict about the data) are familiar: the parameters of the Normal are just the mean (mu) and the variance (sigma^2). We will introduce parameters for other distributions when we consider the distributions in detail, below.

Random number generation

The idea of random number generation is to produce a value (or a list/sample of values) of a random variable x, given some assumptions about the distribution of x and its parameters. For example, we might wish to obtain a simulated sample of 100 values that come from a normal distribution with mean 5 and standard deviation 10. Depending on whether the random variable is discrete or continuous and the complexity of the distribution function, there are a variety of procedures to generate random variables. Most of the common ones rely on being able to find the inverse distribution function, that is, the function F^-1(U) that, given a value for U, the cumulative probability of x, returns the value for x. The idea is then to generate a uniform random variable between 0 and
1 (the range for a probability) and then solve F^-1(U) to get x. Thus, many random number generators start from the capacity to generate a uniform random number, which can then be used to create random variables from other distributions. In practice, R goes through these steps for you, but we will illustrate them for a few simple examples so that you can see that there is often more than one way to obtain a simulated random variable. Technically, we are not generating true random numbers with these procedures, but rather computer-generated sequences of numbers that behave like random numbers, known as pseudorandom numbers. The exact means by which pseudorandom numbers are generated is an advanced topic beyond the scope of this course, and has been the subject of intensive development and refinement over the years. Suffice it to say that some pseudorandom number generators perform better (i.e., act more like "the real deal") than others, so it is important to be sure that you are using a generator that has been thoroughly tested. Fortunately for us, the developers of R and the R user community have thoroughly vetted the pseudorandom number generators in R, so you can be confident when you use these procedures that the results will be essentially random.

Probability distributions in R

R provides a very convenient way to calculate and plot many common statistical distributions and related functions and to generate random variables, so we will perform most of these tasks using built-in R functions. In a few cases we'll be able to see how to build functions from scratch (or nearly so), which may help you to generalize these principles. R code for all the examples is accumulated and saved in an R script file available on Blackboard.

Uniform Distribution

Density, distribution, and quantiles

Perhaps the simplest distribution is the continuous Uniform (or Rectangular) Distribution, which assumes that values of x over the support of f(x) occur with equal probability.
The parameters of the Uniform are simply the lower and upper bounds for x, so that x is equally likely to be anywhere inside the interval a <= x <= b, but cannot occur outside the interval (i.e., the support is entirely within the interval). Formally, the density for x is then

f(x; a, b) = 1/(b - a), a <= x <= b
           = 0, x < a or x > b

The distribution function is simply

F(x) = (x - a)/(b - a), a <= x <= b

The mean of the uniform is

mu = (a + b)/2

and is just the midpoint between the minimum (a) and maximum (b), the parameters of the distribution, while the variance is (b - a)^2/12. The density is easily implemented in R by the command

>dunif(x,a,b)

where a and b are the parameters and x is a value or list of values. So, for a simple example, we can compute and plot the density for Uniform(a=2, b=8) over the range x from 0 to 10.

#generate 1000 equally spaced values between 0 and 10
>x<-(0:1000)*0.01
#compute the uniform density for each value of x
>density<-dunif(x,2,8)

Alternatively, we could have written a short function in R to do the same thing:

#user-defined density
>my_dunif<-function(x,a,b){1/(b-a)*(x>=a & x<=b)}
>d<-my_dunif(x,2,8)

Either approach should produce a plot from

>plot(x,density)

like this
Notice what happens to the density when x<2 or x>8. Likewise, we can produce and plot a distribution for x by

>distrib<-punif(x,2,8)
>plot(x,distrib)

producing
Quantiles at specified probability levels are produced by the qunif() command, for example:

#quantiles at standard probability levels
>prob_levels<-c(0,.05,.25,.5,.75,.95,1)
>quants<-qunif(prob_levels,2,8)
>quants
[1] 2.0 2.3 3.5 5.0 6.5 7.7 8.0

Likelihood function

We are used to thinking of probability functions as describing the probability of an outcome x given the underlying model and parameter values (e.g., Uniform(2,8) above). We can turn this idea around, though, and ask the question of how likely a given
parameter value is, given the data we have and an underlying model. In this way of thinking, the data are fixed and the model parameter(s) is (are) variable. Mathematically, the calculation is the same but we are just varying different quantities. In the example we just considered, we can ask the question: how likely, given a=2 and a value of x=5, are integer values of the parameter b in the range 3 to 12?

>a<-2
>x<-5
>b<-3:12
>b
 [1]  3  4  5  6  7  8  9 10 11 12
>like<-dunif(x,a,b)
>like
[1] 0.0000000 0.0000000 0.3333333 0.2500000 0.2000000 0.1666667 0.1428571 0.1250000
[9] 0.1111111 0.1000000

We see that there is no likelihood that b is 3 or 4 (obviously ruled out by the value x=5) but that, given the single observation x=5, we can't rule out b being 6, 7, 8, or even higher. We will come back to the likelihood, and just how we use data to estimate parameters, in a later lab.

Random number generation

It is very easy to generate random uniform numbers in R using the runif() function. The first value for the function specifies the number of values you want, and the next two specify the minimum and maximum (a, b) parameters. By default a=0 and b=1, so runif(100), for example, would produce 100 uniform random numbers between 0 and 1, something that is often the starting point for simulating other, more complicated distributions. To take a specific case, suppose we want to generate 100 uniform random numbers between 5.5 and 10.4.

#generating n Uniform(a,b) random numbers
>n<-100
>a<-5.5
>b<-10.4
>x<-runif(n,a,b)

would produce a list of numbers (x) with these characteristics. You can calculate the sample mean from the simulated data

>mean(x)

and confirm that while this gets close to the distribution mean of (a+b)/2 it's not exact. Why is that?

Normal Distribution

Density, distribution, and quantiles

The Normal distribution is perhaps the most familiar statistical distribution. It is symmetric about the mean, with the familiar bell-shaped curve, and is used to model continuous, real values with theoretical range from negative to positive infinity. It is the limiting distribution of many test statistics and functions and is commonly used as an approximation, even when the data are thought to follow some other distribution, often after transformation to reduce skewness or discontinuities in the data. The normal density is determined by the parameters mu and sigma (sigma > 0) as

f(x; mu, sigma) = (1/(sigma sqrt(2 pi))) exp(-(x - mu)^2/(2 sigma^2)), -infinity < x < infinity

For example, the Normal density function over (-50, 50) for mu=5 and sigma=15 is produced by

#normal distribution
#generate equally spaced values between -50 and 50
>x<-(-5000:5000)*0.01
>mu<-5
>sigma<-15
>density<-dnorm(x,mu,sigma)
>plot(x,density)

We can produce a comparable distribution function by

#distribution function
>distrib<-pnorm(x,mu,sigma)
>plot(x,distrib)

producing
Specified probability quantiles are easily obtained from the qnorm() function, for example

>prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
>quants<-qnorm(prob_levels,mu,sigma)
>quants
[1] -41.35 -19.67  -5.12   5.00  15.12  29.67  51.35

Equivalently, we could say that we are 90% confident that x is between -19.67 and 29.67, with 10% probability (5% in each tail) outside this range. Notice that in the density dnorm() and distribution pnorm() functions, we passed the data as a list to the function, for scalar (1-dimensional) values of the parameters. Generally speaking, any of these function arguments can be lists, and it will make sense below (under the likelihood function) to reverse which ones are.

Likelihood function

Again, we can turn the model around and ask the question: how likely is a specific parameter value, given an observation (or a sample of observations)? To keep things simple for the normal, let's assume that we've observed the values x=5 and x=10, and assume that the standard deviation is fixed at 1. Assuming the normal model, how likely are various values of mu (say between 2 and 16)? We can compute a likelihood for each data value using dnorm(). First, let's make life simpler by introducing an R function that will generate a regular sequence at a specified interval, seq(). We use this to produce values for mu in the range of 2 to 16, at 0.5 spacing (finer if we wish), and then feed them into the likelihoods for x=5 and x=10.

#likelihood
>mu<-seq(2,16,0.5)
>like1<-dnorm(5,mu,1)
>like2<-dnorm(10,mu,1)

At this point, we can recognize that, assuming the observations of x are independent, we can multiply their likelihoods, or add them on the log scale, to get a joint (log) likelihood for the data.

>loglike<-dnorm(5,mu,1,log=TRUE)+dnorm(10,mu,1,log=TRUE)

Finally, we can examine our log likelihood, see which value is biggest, and see which value of mu produced that log likelihood. R has a nice built-in index function that will do this, which works like this:

>mu[loglike==max(loglike)]

which basically says "find the index of loglike associated with the biggest value and then tell me what the corresponding mu value is at that same index." In this example, the
result is 7.5, which (not coincidentally) is the arithmetic mean of 5 and 10. What we just did is a very crude (but sometimes effective) way to get the maximum likelihood estimate under a specified model, something we'll explore more in a later lab.

Random number generation

The easiest way to generate random Normal numbers is by using the built-in R function rnorm().

#Generate 100 random numbers for mu=5 and sigma=10
#method 1
n<-100
mu<-5
sigma<-10
x<-rnorm(n,mu,sigma)

The second (and just as valid) way is to first generate 100 random Uniform(0,1) numbers, and then treat these as probability values.

#method 2
#first generate 100 random uniform(0,1) deviates
U<-runif(100)
#now treat these as probability values in qnorm(), which functions as
#the inverse distribution function to return values of x given U.
x<-qnorm(U,mu,sigma)

You should be able to test these approaches out and convince yourself that they produce equivalent results.
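The two methods will not produce identical individual draws, but a quick numerical check (a sketch; the sample size of 10000 and the set.seed() call are our additions, not part of the lab) shows they target the same distribution by comparing sample moments:

```r
# Sketch: compare the built-in generator (method 1) with the
# inverse-distribution-function approach (method 2) for Normal(5, 10).
set.seed(1)                        # assumed seed, for reproducibility only
n <- 10000
mu <- 5
sigma <- 10
x1 <- rnorm(n, mu, sigma)          # method 1: built-in generator
x2 <- qnorm(runif(n), mu, sigma)   # method 2: uniform deviates through qnorm()
# both sample means should be near mu, and both sample sds near sigma
c(mean(x1), sd(x1))
c(mean(x2), sd(x2))
```

With n this large, both pairs of sample moments should agree with mu=5 and sigma=10 to within a few tenths.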
Poisson Distribution

The Poisson Distribution is a very important discrete distribution that models outcomes that take on non-negative integer values (0, 1, 2, ...). Examples include counts of animals, plants, or other objects, under the assumption that the process generating the counts in space is random, in the sense that counts are not clustered or separated except by chance.

Density, distribution, and quantiles

The Poisson Distribution is specified by the single parameter lambda, which is equal to both the population mean and variance. Thus, sometimes the ratio of the sample mean to the variance is used as evidence for (or against) Poisson assumptions, with values of this ratio near 1 taken as support for a Poisson count model. The density function of the Poisson is given by

f(x; lambda) = lambda^x e^(-lambda)/x!, x = 0, 1, 2, 3, ...

where e is the base of the natural logarithm, lambda > 0, and x! denotes the factorial function x(x-1)(x-2)...1. The distribution function is simply given by summation of the density over the discrete values of x:

F(x) = sum from k=0 to x of lambda^k e^(-lambda)/k!, x = 0, 1, 2, 3, ...

The Poisson density and distribution are easily implemented in R, for example for lambda=5:

#poisson distribution
#generate a sequence between 0 and 20
>x<-0:20
>lambda<-5
>density<-dpois(x,lambda)
>plot(x,density,"h")
#distribution function
>distrib<-ppois(x,lambda)
>plot(x,distrib,"h")

Likewise, standard quantiles are easily computed:

> #quantiles
> prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
> quants<-qpois(prob_levels,lambda)
> quants
[1]  0  2  3  5  6  9 13
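Since the Poisson mean and variance both equal lambda, the moment definitions given earlier can be checked directly from dpois() (a small sketch; truncating the support at x=100 is our choice and is harmless because f(x) is negligible there for lambda=5):

```r
# Sketch: compute E(x) and V(x) for a Poisson(5) directly from the density,
# using the moment definitions E(x) = sum(x f(x)) and V(x) = E[(x - mu)^2].
x <- 0:100                 # effectively the whole support when lambda = 5
f <- dpois(x, 5)
m <- sum(x * f)            # mean: should equal lambda = 5
v <- sum((x - m)^2 * f)    # variance: should also equal lambda = 5
c(m, v)
```

Both quantities come out equal to lambda (to numerical precision), illustrating the mean = variance property.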
Likelihood function

As with the other distributions we have considered, the likelihood function is formed from the density, but with the roles of parameters and data reversed: the latter are now fixed and the former vary over some specified range. We will cheat here, because we know that given the data lambda cannot be huge (say >15) and we know that it has to be >0, so we will only look at the likelihood in that range:

> #likelihood
> lambda<-seq(0.01,15.01,0.01)
> x<-8
> like<-dpois(x,lambda)
> loglike<-dpois(x,lambda,log=TRUE)
> plot(lambda,like)

We can also use a device similar to what we used for the normal to get a fairly good approximation of the maximum likelihood value for lambda

> #find the maximum over the list of lambdas
> lambda[loglike==max(loglike)]
[1] 8

which is not surprising: given the simplicity of this model (mean=variance) and the single observation x=8, we expect lambda to be around 8.

Random number generation

Given the discrete nature of the random variable in the Poisson distribution, there are several options for generating random variables, some of them quite simple, others not so simple but more flexible. We illustrate all 3 with the example of generating 100 random values from a Poisson(10) distribution.

Method 1 - built-in R function
First, we can of course rely on the standard, built-in R function rpois().

#Generate 100 random numbers for lambda=10
#method 1
n<-100
lambda<-10
x<-rpois(n,lambda)

Method 2 - Uniform deviate, quantile (inverse distribution) function

The second method also relies on an R function, the quantile or inverse distribution function, but performs the calculations by first computing 100 random Uniform(0,1) numbers and then transforming them with the inverse distribution (quantile) function.

#method 2
>n<-100
>lambda<-10
>U<-runif(n)
>x<-qpois(U,lambda)

Method 3 - Uniform deviate and interpolation from the distribution

In the third approach we have built a random number generator based directly on the cumulative discrete distribution F(x). As in Method 2, we generate 100 random Uniform variables and then use these and the definition of F(x) to obtain values of x (see Evans et al. 2000); this requires a user-defined function (pfun) to map the continuous values of U into discrete values of x:

#method 3
>n<-100
>lambda<-10
#generate F(x) from 0 to 50
>values<-0:50
>F<-ppois(values,lambda)
#define the function to do interpolation from F(x)
>pfun<-function(f,u,v)
> {
>  x<-array(0,dim<-c(length(u)))
>  for (j in 1:length(u))
>  {
>   for (i in 1:50)
>   {
>    if (f[i]<=u[j] & u[j]<f[i+1]) {x[j]<-v[i+1]}
>   }
>  }
>  x
> }
>#generate the values of U and x
>U<-runif(n)
>x<-pfun(F,U,values)

You can confirm that all 3 methods give similar results using large values for n, and that the last two methods give identical results for the same vector of Uniform numbers. However, Method 3 is much slower than Methods 1 or 2, which simply confirms that (usually) the built-in functions in R tend to be more computationally efficient than what beginning users can build. Building a function like this on your own, though, does illustrate that it can be done, and this can be handy in situations where no built-in function exists in R. For example, suppose we have a discrete distribution F(x) without a known mathematical form, but for which we can write out numerical values of F(x). A simple example of this is where we use the quantile function to summarize the data from a sample into an empirical distribution function and treat this as F(x). We can then use an approach such as Method 3 to simulate values from this distribution, even though we have no idea of its
mathematical form. We'll return to these ideas when we get more deeply into simulation in a later lab.

Bernoulli Distribution / Binomial Distribution

The Bernoulli Distribution is the natural distribution for modeling outcomes that can occur in 1 of 2 classes, such as success or failure, lived or died, heads or tails, male or female. The Bernoulli Distribution has a single parameter p that describes the probability of a success (however it is defined). The Binomial Distribution describes the number of successes that occur in n independent Bernoulli trials, each with the same probability of success p. The Binomial is thus based on summing Bernoulli outcomes, and has 2 parameters, n and p. Because these distributions are so closely related, we will consider them together below.

Density, distribution, and quantiles

The Bernoulli random variable x takes on 2 possible values, either 1 (indicating success) or 0 (failure), and has a single parameter, p, denoting the probability of success. The probability density function is written as

f(x; p) = p^x (1-p)^(1-x), x = 0, 1

which simplifies to f(0; p) = 1-p and f(1; p) = p. Note that we assume that there are only 2 possible outcomes, a success with probability p and a failure with probability 1-p, and that by definition the probabilities of success and failure add to 1. The mean of the Bernoulli distribution is E(x) = mu = p and the variance is Var(x) = p(1-p). The Binomial distribution is closely related, with the Binomial variable x defined as the number of successes in n independent Bernoulli trials, each with probability p of success. The Binomial thus has 2 parameters (n and p), though one of these (n) ordinarily is known and will not be estimated from data. The Binomial density function is

f(x; n, p) = [n!/(x!(n-x)!)] p^x (1-p)^(n-x), x = 0, ..., n

The Binomial distribution function is

F(x; n, p) = sum from k=0 to x of [n!/(k!(n-k)!)] p^k (1-p)^(n-k), x = 0, ..., n

The mean and variance are given by E(x) = mu = np and Var(x) = np(1-p). The Binomial density and distribution are easily implemented in R by the dbinom() and pbinom() functions (there is no separate Bernoulli function in R, the Bernoulli simply being a Binomial with a single trial, n=1), e.g., for a Bernoulli with p=0.4:

>#Bernoulli
>p<-0.4
>x<-0:1
>density<-dbinom(x,1,p)
>distrib<-pbinom(x,1,p)
>plot(x,density,"h",ylim=c(0,1))
>plot(x,distrib,"h",ylim=c(0,1))

This produces plots for the density and distribution of:
Taking a Binomial with p=0.4 and n=10 trials we have

>#Binomial
>n<-10
>p<-0.4
>x<-0:n
>density<-dbinom(x,n,p)
>distrib<-pbinom(x,n,p)
>plot(x,density,"h",ylim=c(0,1))
>plot(x,distrib,"h",ylim=c(0,1))

This produces plots
and
Quantiles at specified probability levels are easy to produce using the qbinom() function, e.g.,

> n<-10
> p<-0.4
> #quantiles
> prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
> quants<-qbinom(prob_levels,n,p)
> quants
[1] 0 2 3 4 5 7 9

Likelihood function

As with other distributions, we can reverse the roles of the data and the parameters and now treat the parameters as variables. In the case of either the Bernoulli or the Binomial there is generally only one parameter of interest, p, since we usually know how many trials there are. Take a case where we have 10 trials and we observe 4 successes. We can examine the likelihood over the range p = (0, 1) and try a brute-force maximization as before:

> #Likelihood
> p<-seq(0,1,0.001)
> n<-10
> x<-4
> like<-dbinom(x,n,p)
> loglike<-dbinom(x,n,p,log=TRUE)
> plot(p,like)
> plot(p,loglike)
> #find the maximum over the list of p values
> p[loglike==max(loglike)]
[1] 0.4
The results suggest that a value of p=0.4 maximizes the log likelihood. However, notice how flat the log likelihood function is, with many values of p larger and smaller than 0.4 returning similar values. This suggests that the data (4 successes in only 10 trials) provide relatively poor information about the parameter value. We will return to this point when we consider estimation in more depth later in the course.

Random number generation

Generating random numbers for the Bernoulli and Binomial is quite easy and can be accomplished with either a simple random uniform number generator or the built-in function rbinom(). The first approach computes a Bernoulli outcome by simply comparing a Uniform(0,1) random number to p: if U > p then x=0, otherwise x=1.

>#generating bernoulli random variables
>#specify p and n_reps
>p<-0.35
>n_reps<-100
>#method 1
>x<-(runif(n_reps)<p)*1
>#method 2
>x<-rbinom(n_reps,1,p)

Generating Binomial random variables can be accomplished by generating a series of n Bernoulli variables and then summing these.

#generating Binomial random variables
#specify n, p, and n_reps
n_reps<-100
n<-10
p<-0.35
#method 1
x<-array(0,c(n_reps))
for (i in 1:n_reps)
{
x[i]<-sum(runif(n)<=p)*1
}

Alternatively, you can directly use the rbinom() function in R

#method 2
x<-rbinom(n_reps,n,p)

The advantage of the first approach is that sometimes we will not want to assume that the parameter p remains constant, but instead allow it to vary from sample to sample (or even among Bernoulli trials within a sample). In such cases we can still simulate or model the data, but no longer under Binomial assumptions (which require p to be constant). We will look at an example of this in a bit.

Multinomial Distribution

The Multinomial Distribution is similar to the Binomial, but instead of modeling outcomes that occur in 2 ways ("success" or "failure"), the outcomes can occur in 3 or more ways. For example, suppose that an animal can die, and if it lives can either reproduce or not reproduce, and that these are the only possibilities. If we assign the probabilities to these events as p1 = probability of death, p2 = probability of living and reproducing, and p3 = probability of living and not reproducing, then by definition p1 + p2 + p3 = 1. Thus, if we know 2 of the 3 probabilities we know the 3rd by subtraction, e.g., p3 = 1 - p1 - p2. In general, if we have k categories of outcomes we have k-1 probabilities to describe them, with the last obtained by subtraction. Like the Binomial, the Multinomial is built from a series of n independent trials, each with the same probabilities describing the outcomes. The random variable x is now a vector, denoting the number of the n trials that fall into each category. For example, if we have
100 animals, the outcomes might be 25 die, 50 live and reproduce, and 25 live but do not reproduce. The Multinomial density is

f(x; n, p) = [n!/(x1! x2! ... xk!)] p1^x1 p2^x2 ... pk^xk

Because of its multivariate nature it is difficult to visualize the density, but density and distribution values are readily computed in R. For example, the density for a 3-category multinomial with 10 trials is calculated by

> #example
> n<-10
> p<-c(.25,.5,.25)
> x<-c(1,5,4)
> density<-dmultinom(x,n,p)
> density
[1] 0.03845215

Random Multinomial variables are generated by the rmultinom() function. For instance, to generate 20 instances of the above 10-trial trinomial we would use:

> #Random variables
> rmultinom(20,10,p)

which returns a 3 x 20 matrix of counts, with one column per instance and each column summing to 10.

Beta Distribution

The Beta Distribution is a continuous distribution that models a random variable x that can take on values in the range 0 <= x <= 1. The Beta is therefore appropriate for modeling the distribution of probability values, and in particular for modeling heterogeneity in probabilities. The parameters alpha and beta (or a and b) control the location and shape of the Beta Distribution. The Uniform(0,1) distribution is a special case of the Beta with alpha=beta=1. The Beta Distribution assumes additional importance as a natural (or conjugate) distribution for the Binomial, describing uncertainty in the Binomial parameter p before (prior) and after (posterior) data collection. Finally, the Beta and the Binomial can be combined hierarchically in a model (the Beta-Binomial) in which the random outcome is binary, but the process describing success is heterogeneous. We will return to both of these themes in later labs.

Density, distribution, and quantiles

The mathematical form of the Beta density is

f(x; alpha, beta) = [Gamma(alpha+beta)/(Gamma(alpha)Gamma(beta))] x^(alpha-1) (1-x)^(beta-1); 0 <= x <= 1; alpha, beta > 0

where Gamma(c) is the Gamma function

Gamma(c) = integral from 0 to infinity of exp(-u) u^(c-1) du

The kernel of the Beta (the part that involves the random variable x) is actually quite simple and, not coincidentally, resembles the Binomial density:

x^(alpha-1) (1-x)^(beta-1)

The mean of the Beta distribution is
E(x) = alpha/(alpha+beta)

and the variance is

V(x) = alpha*beta/[(alpha+beta)^2 (alpha+beta+1)]

We can use the relationship between the mean and variance and the parameters to solve for parameter estimates via the Method of Moments; more on this later. The Beta variable x is sometimes interpreted as modeling the probability of success based on previously observing alpha-1 successes and beta-1 failures. The Beta distribution function is obtained by integrating the density from 0 to x. Both the density and the distribution can easily be evaluated in R using the dbeta() and pbeta() functions. For example, for alpha=10 and beta=15 we can produce density and distribution values over the range of x.

#Beta distribution
>a<-10
>b<-15
>x<-seq(0,1,0.001)
>#density
>density<-dbeta(x,a,b)
>distrib<-pbeta(x,a,b)
>plot(x,density)
>plot(x,distrib)

This code will produce a plot of the density
and of the distribution
Notice that the density is centered near 0.4, but takes on a fairly wide range, indicating that Beta(10,15) would be appropriate for modeling a success probability that averages about 0.4 but exhibits heterogeneity. We will return to this theme later when we consider the Beta-Binomial distribution. Standard quantiles of the Beta are easily produced with the qbeta() function, for example

> #quantiles
> prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
> quants<-qbeta(prob_levels,a,b)
> quants

Likelihood function
As in our previous examples, we can consider the data (x) as fixed (observed) and treat the parameters as variables, producing a likelihood function. Note that with the Beta distribution, like the Normal, we have 2 parameters, so we have to find a combination of alpha and beta that maximizes the likelihood. The likelihood is easy to compute using R, for example if we observe x=0.4:

>#Likelihood
>a<-seq(5,25,0.001)
>b<-seq(10,30,0.001)
>x<-0.4
>like<-dbeta(x,a,b)
>loglike<-dbeta(x,a,b,log=TRUE)

It is a bit trickier to use brute-force methods to get the maximum, and instead we will use graphical methods to get an approximation. Because the parameter space is 2-dimensional, we need to display the likelihood in 3 dimensions. The scatterplot3d() function in R will produce a 3-D scatterplot:

>library(scatterplot3d)
>scatterplot3d(a,b,loglike)
The graph indicates that the log likelihood has a maximum at around alpha=20 and beta=25. Graphical methods become cumbersome (and inaccurate) for 2 or more parameters; we will consider more exact methods for maximizing the likelihood in a later chapter.

Random number generation

Random number generation is easily performed in R using the rbeta() function. For example, the following code will generate 100 Beta(10,15) random variables.

>#Random betas
>n<-100
>a<-10
>b<-15
>#Method 1
>x<-rbeta(n,a,b)

If alpha and beta are integers, the following code can also be used to generate Beta random variables from Gamma random variables, which in turn are generated by a log transformation of Uniform random variables:
>#Method 2 (if a and b are integers)
>x<-array(0,c(n))
>for (i in 1:n)
>{ #g1 and g2 are Gamma random variables
>g1<-sum(-log(runif(a)))
>g2<-sum(-log(runif(b)))
>x[i]<-g1/(g1+g2)
>}

Gamma

The Gamma distribution is a continuous distribution which has 2 parameters, b and c, and where x takes on nonnegative values (0 ≤ x < ∞). The Gamma is important in statistics because several other important distributions such as the Chi-square and Exponential are special cases. We also saw above how Gamma distributions can be used to generate Beta random variables. However, most of our interest in the Gamma will be because of its special relationship to the Poisson distribution, both for modeling heterogeneity in the Poisson parameter and as a conjugate distribution for the Poisson in Bayesian analysis.

Density, distribution, and quantiles

The density function of the Gamma is

f(x; b, c) = \frac{1}{b\,\Gamma(c)} (x/b)^{c-1} \exp(-x/b); \quad 0 \le x < \infty;\ b > 0,\ c > 0

The distribution function of the Gamma is given by integration from 0 to x:

F(x; b, c) = \int_0^x \frac{1}{b\,\Gamma(c)} (v/b)^{c-1} \exp(-v/b)\,dv; \quad 0 \le x < \infty;\ b > 0,\ c > 0

The mean and variance of the Gamma are related to the parameters in a straightforward way by E(x) = bc and
V(x) = b^2 c. As we will see, these relationships lead to easy (but not particularly optimal) parameter estimation by the Method of Moments. The density and distribution are easily generated in R using the dgamma() and pgamma() functions. For example, we can plot the density and distribution for Gamma(b=1,c=5) by

>#Gamma distribution
>c<-5
>b<-1
>x<-seq(0,10,0.001)
>#density - note R gamma functions use inverse scale (rate) = 1/b
>density<-dgamma(x,c,1/b)
>distrib<-pgamma(x,c,1/b)
>plot(x,density)
>plot(x,distrib)

This produces the density and the distribution over the range of 0 to 10 (plots not reproduced here).
Quantiles are produced by the qgamma() function. For the same parameter values we can produce several quantiles by

> #quantiles
> prob_levels<-c(0.001,.05,.25,.5,.75,.95,0.999)
> quants<-qgamma(prob_levels,c,1/b)
> quants
[1]

This indicates, for example, that the median (0.5 quantile) of Gamma(1,5) is around 4.7, and that 99.9% of the data can be expected to lie below the largest quantile shown.

Likelihood function

As with other distributions, we can form the likelihood by considering the data (x) as fixed and allowing the parameter values to vary. For example, suppose we observe x=5; we can plot the log likelihood versus values of b and c

>#likelihood
>b<-seq(0.01,1,0.001)
>c<-b*10
>x<-5
>like<-dgamma(x,c,1/b)
>loglike<-dgamma(x,c,1/b,log=TRUE)
>library(scatterplot3d)
>scatterplot3d(c,b,loglike)

By eyeballing this graphic, we can see that values of c around 6-7 and b around 0.6-0.7 appear to maximize the log likelihood.

Random number generation

We present 2 methods for producing random numbers from the Gamma distribution. The easiest is the built-in rgamma() function. For example, to generate 1000 Gamma(1,5) random variables:

>#random number generation
>#method 1
>n<-1000
>c<-5
>b<-1
>x<-rgamma(n,c,1/b)

Gamma variables can be generated directly from Uniform(0,1) random variables via a log transformation (we used this approach already for Beta random variables), if the parameter c has an integer value:

>#method 2 - if c is integer
>n<-1000
>c<-5
>b<-1
>x<-array(0,c(n))
>for (i in 1:n)
>{
>x[i]<-b*sum(-log(runif(c)))
>}

Estimation methods

Fundamentally, all estimation methods are based on considering the sample data x as known, and then using the statistical model to derive values of the parameters based on the data. We will consider 2 approaches: the Method of Moments and Maximum Likelihood, with most emphasis on the second of these.

Method of Moments

The Method of Moments is very simple and can provide reasonable estimates of parameters in some situations. The basic steps are

- Determine the population moments (expected value, variance, etc.) as functions of the parameter(s)
- Set the population moments equal to the sample (data-based) moments
- Solve for the parameter(s) as functions of the data.
To take a very simple case, consider a Binomial experiment where we have 10 independent Bernoulli trials, we observe 6 successes, and we wish to estimate p, the probability of success (assumed homogeneous among trials). The population moment is

E(x) = np

Setting the population moment equal to the sample moment (in this case, simply x) provides

x = np

and solving for p provides

\hat{p} = x/n = 6/10 = 0.6

A somewhat more complicated example involves the Beta distribution and 2 moments: the mean and the variance. Recall that for the Beta the mean and variance are

E(x) = \frac{\alpha}{\alpha+\beta}

and

V(x) = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}

Because there are 2 unknowns (α, β) and 2 equations, we should be able to solve for the parameters, and we can. First, we equate the population moments with the sample moments
\bar{x} = \frac{\alpha}{\alpha+\beta}

and

s^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}

Then we solve for α and β:

\hat{\alpha} = \bar{x}\{[\bar{x}(1-\bar{x})]/s^2 - 1\}

and

\hat{\beta} = (1-\bar{x})\{[\bar{x}(1-\bar{x})]/s^2 - 1\}.

We have written a small R function to provide these calculations

> beta.mom<-function(mean,sd){
+ v<-sd**2
+ x<-mean
+ a<-x*(x*(1-x)/v-1)
+ b<-(1-x)*(x*(1-x)/v-1)
+ c(a,b)
+ }

For example, if we have a sample mean of 0.5 and SD of 0.1 in our data, the program provides

> beta.mom(.5,.1)
[1] 12 12
or estimates of \hat{\alpha} = 12, \hat{\beta} = 12. We can confirm that these correspond to the moments by plugging them into the population moment formulas

> beta.stats<-function(a,b){
+ x<-a/(a+b)
+ v<-a*b/((a+b)^2*(a+b+1))
+ c(x,sqrt(v))
+ }
> beta.stats(12,12)
[1] 0.5 0.1

which returns the correct mean and SD. Unfortunately, the Method of Moments can produce bizarre results. For example

> beta.mom(.1,.4)
[1] -0.04375 -0.39375

However, both parameters of the Beta must be positive numbers, so the Method of Moments in this case does not work. The Beta Method of Moments behaves well in many cases, but can easily produce inadmissible values for the parameters, as just illustrated. The Beta example illustrates one drawback of the Method of Moments, which is that it sometimes can produce nonsensical results (outside the admissible parameter space). The method also does not necessarily provide a way to assess parameter uncertainty (variances, confidence intervals). Finally, the Method of Moments does not share some of the desirable properties of the next method, such as sufficiency, minimum variance, and asymptotic normality. For this reason, most practitioners use the Method of Moments only as a method for quick approximation, if at all.
For completeness, here's a function for estimating the Gamma parameters using the Method of Moments. Remember that this is quick and dirty and could give negative (incorrect) values.

> gamma.mom<-function(mu,sd){
+ v<-sd**2
+ c=v/mu      #this is the scale b
+ b=(mu/sd)^2 #this is the shape c
+ c(b,c)
+ }
> ## gamma MOM using mean 7 and sd of 11
> theta<-gamma.mom(7,11)
> ## again take note of use of inverse scale or rate
> ## let's see how close we were
> mean(x<-rgamma(10000,theta[1],1/theta[2]))
> sd(x)

(Note that the internal names in gamma.mom are swapped relative to the Gamma parameterization: the first value returned is the shape c and the second is the scale b, which matches how theta is used in the rgamma() call. The simulated mean and SD should come out close to 7 and 11.)

Maximum Likelihood

Maximum likelihood methods have several advantages not necessarily shared by other approaches, and therefore are favored in much of statistics. Generally speaking, maximum likelihood estimators (MLEs)

- Are asymptotically (i.e., with large samples) unbiased
- Are asymptotically Normally distributed
- Have minimum variance (i.e., variance smaller than that of any other estimator)
- Provide variance estimates directly as part of estimation

The basic idea of MLE is simple: given the data, we consider the parameter(s) to be unknown variables; the density function now behaves instead as a likelihood function. We then solve for the parameter values that maximize the likelihood function, given the data values. There are several ways to do this:
- By graphing the likelihood function against candidate parameter values
- By brute force searching over the parameter space
- By exact solution using The Calculus
- By numerical optimization methods.

We can illustrate all these approaches by taking a simple case involving the Binomial distribution. Suppose we conduct 100 Bernoulli trials and observe 40 successes. For example, the 100 trials could be 100 nests that we have discovered and have followed from initiation to success (fledging) or failure. Because we know the number of trials (n=100) we will focus on estimating the probability of success. The statistical model is

f(x; n, p) = \binom{n}{x} p^x (1-p)^{n-x}

However, we now know that n=100 and x=40, so we will recast this as a likelihood function

L(p; x=40, n=100) = \binom{100}{40} p^{40} (1-p)^{60}

Now the task is to find a value for p that maximizes this function. Usually, it will be more convenient to work with the natural logarithm of the likelihood function. Because the logarithmic transformation is monotonic, if we find the value of p that maximizes log(L(p)) we've also found the value that maximizes L(p). For this example the log of the likelihood is

\ln L(p; x=40, n=100) = \ln \binom{100}{40} + 40 \ln p + 60 \ln(1-p)

or in general (for any integers n and x ≤ n)

\ln L(p; x, n) = \ln \binom{n}{x} + x \ln p + (n-x) \ln(1-p).

As noted, there are several ways we can go about finding the maximum of this function, which we visited briefly in the previous lab. The first method is based on graphing the likelihood and log likelihood. Rather than use the built-in dbinom() function, we have binomial_likelihood R script.r to graph the likelihood and log likelihood. We do this for
2 reasons: first, we want students to see explicitly what the likelihood function and its log look like, and second, we are going to do some mathematical manipulation in a minute that would not be easy using the built-in R function.

Graphical approach

When we plot L(p; x, n) vs. p we get a curve centered about a value of p ≈ 0.4. Similarly, the log likelihood seems to peak around 0.4. So, p = 0.4 is looking to be a good candidate for the MLE.
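For readers following along in R, the two curves just described can also be generated with the built-in dbinom() function. This sketch is ours, not the lab's binomial_likelihood script, and the grid endpoints are arbitrary choices made to avoid log(0):

```r
# Binomial likelihood and log likelihood for x = 40 successes in n = 100 trials
n <- 100
x <- 40
p <- seq(0.01, 0.99, 0.001)             # avoid p = 0 and p = 1, where log(L) = -Inf
like <- dbinom(x, n, p)                 # likelihood L(p)
loglike <- dbinom(x, n, p, log = TRUE)  # log likelihood ln L(p)
par(mfrow = c(1, 2))
plot(p, like, type = "l", ylab = "L(p)")
plot(p, loglike, type = "l", ylab = "ln L(p)")
# the grid value maximizing both curves
p[which.max(loglike)]
```

Both curves peak at the same place, p = 0.4, illustrating the point above about the monotonic log transformation.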
Brute force

As we saw earlier, we can fairly easily find the maximum by brute force if 1) we have a single parameter (p in this case) and 2) the parameter is constrained over a reasonable range (0 to 1 here). Using our explicit code and the list-maximize trick, we get the following

> #Brute force
> #Likelihood
> p<-seq(0,1,0.001)
> n<-100
> x<-40
> binomial_like<-function(x,n,p_){
+ like=log(choose(n,x))+x*log(p_)+(n-x)*log(1-p_) #choose function evaluates n choose x
+ return(like)
+ }
> loglike<-binomial_like(x,n,p)
> #find the maximum over the list of p values
> p[loglike==max(loglike)]
[1] 0.4

Again, this confirms that p=0.4 appears to be viable as the MLE.

Exact approach using The Calculus

The Calculus provides an exact solution to the likelihood maximization under certain conditions. In particular, if the likelihood is continuous and twice differentiable, then a necessary condition for L(p*) to be a maximum is that the first derivative with respect to p is zero. If the second derivative is negative, this assures that L(p*) is a maximum and not a minimum. For the Binomial likelihood this is best approached by operating with the log likelihood. The first derivative of the log likelihood is

\frac{d \ln L(p; x=40, n=100)}{dp} = \frac{40}{p} - \frac{60}{1-p}

Setting this to zero yields
\frac{40}{p} = \frac{60}{1-p}

and with a little algebra

\hat{p} = 40/100 = 0.4

More generally,

\frac{d \ln L(p; x, n)}{dp} = \frac{x}{p} - \frac{n-x}{1-p} = 0

\hat{p} = x/n

We can confirm graphically that the derivative becomes zero at p=0.4.
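The derivative can also be checked numerically. This small sketch (ours, not part of the lab scripts) codes the score function g(p) = x/p - (n-x)/(1-p) and evaluates it around the candidate MLE:

```r
# Score function: first derivative of the Binomial log likelihood
score <- function(p, x = 40, n = 100) x / p - (n - x) / (1 - p)
score(0.3)   # positive: log likelihood still increasing here
score(0.5)   # negative: we are past the maximum
score(0.4)   # essentially zero at the MLE p-hat = x/n
```

The sign change from positive to negative across p = 0.4, with the score essentially zero at 0.4 itself, confirms the algebraic result.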
Direct solution of the log-likelihood equations by algebra is possible for many statistical models and their parameters. In addition to the Binomial parameter p, the Poisson parameter λ can be estimated in this way, as can the Normal parameters µ and σ, although the analysis becomes more complicated when 2 or more parameters are involved. For example, estimation of the Normal parameters µ and σ requires taking partial derivatives of the log-likelihood with respect to each parameter and setting each of these equations to zero. Solution of these equations for µ and σ then provides the estimates

\hat{\mu} = \sum_{i=1}^n x_i / n

\hat{\sigma}^2 = \sum_{i=1}^n (x_i - \bar{x})^2 / n

Astute students will notice that the second formula differs slightly from the usual sample variance

s^2 = \sum_{i=1}^n (x_i - \bar{x})^2 / (n-1)

The reason is that \hat{\sigma}^2 (the MLE) is slightly biased for small samples, and use of n-1 in the denominator reduces this bias.

Numerical methods

Explicit formulas for MLEs exist and are readily computed for many common statistical models. However, as models become more complex (more parameters and structure) it can be difficult or impossible to obtain algebraic solutions for the MLEs. Fortunately, high-speed computers are capable of solving the likelihood equations via numerical approaches. These approaches really are a special application of optimization approaches that we will consider in more detail later. They generally require the following:

- A mathematical expression (or computer code) for computing the log-likelihood for a given parameter value
- An initial guess for the parameter value (sometimes based on simple statistics from the data)
- A means of searching to see if improvements (higher log-likelihood values) can be made by changing the parameter value
- A stopping rule to determine that the parameter value has converged on the apparent MLE.

Gradient descent methods and Newton's Method are 2 of the more familiar (and simpler) optimization methods. Both require the ability to evaluate 1st and 2nd derivatives (partial derivatives if there is more than 1 parameter) with respect to each candidate parameter value (or combination of values). The derivatives can be either explicitly written (i.e., algebraic) or computed via approximations. We have Newtons method script.r that applies Newton's Method to solving for the MLE of the Binomial parameter p. The basic steps are simple:

1. Start with an initial value for p, p_0
2. Compute the gradient evaluated at the current value of p:
   g(p_i) = \frac{d \ln L(p_i)}{dp_i}
3. Compute the second derivative:
   g'(p_i) = \frac{d^2 \ln L(p_i)}{dp_i^2}
4. Update p by p_{i+1} = p_i - g(p_i)/g'(p_i)
5. Return to Step 2 and repeat until convergence

Convergence can be evaluated by examining how much (or little) p changes and/or by determining that g(p_i) is sufficiently close to zero (i.e., differs from zero by less than some specified small amount). In the example code (n=100, x=30), p is initialized at 0.1 and converges rapidly to 0.3.
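The steps above can be sketched in a few lines. This is our own minimal implementation, not the Newtons method script.r distributed with the lab, applied to the same Binomial example with n = 100 and x = 30; the starting value and tolerance are arbitrary choices:

```r
# Newton's method for the Binomial MLE of p (maximizing the log likelihood)
newton_binom <- function(x, n, p0 = 0.1, tol = 1e-8, maxit = 100) {
  p <- p0
  for (i in 1:maxit) {
    g      <- x / p - (n - x) / (1 - p)          # gradient: d lnL / dp
    gprime <- -x / p^2 - (n - x) / (1 - p)^2     # second derivative: d2 lnL / dp2
    p_new  <- p - g / gprime                     # Newton update
    if (abs(p_new - p) < tol) return(p_new)      # stopping rule: p barely changes
    p <- p_new
  }
  p
}
newton_binom(x = 30, n = 100)   # converges rapidly to 0.3
```

Because the Binomial log likelihood is smooth and concave in p, the iterations home in on p-hat = x/n in just a handful of updates.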
R also has a built-in optimization function optimize() that performs maximization or minimization of a specified function. The attached code applies this function to the above Binomial example.

MLE for higher-dimensioned problems

In principle, exactly the same approaches used for single-parameter models extend to models with multiple parameters. However, both graphical and brute force approaches become cumbersome beyond about 2 parameters (try visualizing a 4-dimensional graph!) and are generally eschewed in favor of either direct or numerical solution of a system of likelihood equations.

Example - Normal likelihood

We can take the example of the Normal likelihood and a sample x of n observations. Assuming that the data are independent, the joint likelihood is formed by the product of n likelihoods:

L(\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

and the log-likelihood is

\log L(\mu, \sigma^2) = -n \log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2

The partial derivatives of the log-likelihood with respect to the parameters simplify to

\frac{\partial \log L(\mu, \sigma^2)}{\partial \mu} = \frac{\sum_{i=1}^n x_i - n\mu}{\sigma^2} = 0
\frac{\partial \log L(\mu, \sigma^2)}{\partial \sigma} = -\frac{n}{\sigma} + \frac{\sum_{i=1}^n (x_i - \mu)^2}{\sigma^3} = 0

These equations can be solved directly by

\hat{\mu} = \sum_{i=1}^n x_i / n

\hat{\sigma}^2 = \sum_{i=1}^n (x_i - \hat{\mu})^2 / n

or by trial and error, gradient methods, Newton's Method, or other numerical methods. Application of Newton's Method and other derivative-based methods requires evaluation of the matrix of partial second derivatives

I = \begin{pmatrix} \dfrac{\partial^2 \ln L}{\partial \mu^2} & \dfrac{\partial^2 \ln L}{\partial \mu\,\partial \sigma} \\ \dfrac{\partial^2 \ln L}{\partial \sigma\,\partial \mu} & \dfrac{\partial^2 \ln L}{\partial \sigma^2} \end{pmatrix}

The matrix I is sometimes known as the Hessian or Information Matrix. The vector of first partial derivatives is

G = \begin{pmatrix} \partial \ln L / \partial \mu \\ \partial \ln L / \partial \sigma \end{pmatrix}

Solutions to the likelihood equations occur when G = 0; the inverse of I provides the estimated Variance-Covariance Matrix, with the variances on the diagonal and the covariances on the off-diagonal. This same approach applies to an MLE problem of any dimension, with the sizes of G and I determined by the number of parameters (k, so G is length k and I is k x k). The optim() procedure in R can be generalized to solve for the MLEs for more complicated likelihoods involving multiple parameters. In optimize.r we perform ML optimization for Binomial, Normal, and Beta examples. Note that optim() performs
by minimization, so to get maximum likelihood we compute the negative log likelihood and then find the parameter values that minimize the function. The parameter method="BFGS" specifies the use of a quasi-Newton method (similar to Newton's Method above) and hessian=TRUE specifies that the algorithm will produce the Hessian matrix, which we can then use to get the variance-covariance matrix.

R built-in functions

As you may have guessed, R users have created several packages that can be used to fit distributions to data; one example is the fitdistr() function in the MASS package.

> library(MASS)
> beta.data<- c(0.05,0.2,0.03,0.4,0.15)
> fitdistr(x=beta.data,"beta",start=list(shape1=1,shape2=1))
shape1 shape2
( ) ( )
Warning messages:
1: In densfun(x, parm[1], parm[2], ...) : NaNs produced
2: In densfun(x, parm[1], parm[2], ...) : NaNs produced
>
> gammer.dater<- c(3,.1,17,1,0.5,1.3,.01)
> fitdistr(x=gammer.dater,"gamma")
shape rate
( ) ( )

We will get more use out of these functions as the course progresses.

Writing simulation programs in R

We have already done a great deal of simulation with individual distributions in R; here we will focus on putting things together into more complicated analyses, and on efficiency.
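The lab exercises ask for simple simulation models of ecological processes such as survival. As a sketch of the general pattern of passing randomly generated values from one distribution to another (our own illustrative code: the cohort size, number of years, and Beta(10,15) parameters are arbitrary choices), annual survival probabilities can be drawn from a Beta and passed to a Binomial:

```r
# Simulate 5 years of survival for a cohort of 50 animals.
# Each year's survival probability is a Beta(10,15) draw (mean about 0.4),
# and the number of survivors is then a Binomial draw with that probability.
set.seed(544)                    # reproducibility; the seed value is arbitrary
years <- 5
N <- 50                          # initial cohort size
alive <- numeric(years + 1)
alive[1] <- N
for (t in 1:years) {
  phi <- rbeta(1, 10, 15)                    # random survival probability, year t
  alive[t + 1] <- rbinom(1, alive[t], phi)   # survivors carried into year t + 1
}
alive                            # cohort trajectory; can only stay level or decline
```

Because the survival probability itself is random, repeated runs of this simulation show more variation in the trajectory than a fixed-p Binomial would, which is exactly the heterogeneity the Beta-Binomial combination is meant to capture.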
More informationVarieties of Count Data
CHAPTER 1 Varieties of Count Data SOME POINTS OF DISCUSSION What are counts? What are count data? What is a linear statistical model? What is the relationship between a probability distribution function
More informationDS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling
DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including
More informationModel Estimation Example
Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions
More informationProbability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur
Probability Methods in Civil Engineering Prof. Dr. Rajib Maity Department of Civil Engineering Indian Institute of Technology, Kharagpur Lecture No. # 33 Probability Models using Gamma and Extreme Value
More informationPractical Algebra. A Step-by-step Approach. Brought to you by Softmath, producers of Algebrator Software
Practical Algebra A Step-by-step Approach Brought to you by Softmath, producers of Algebrator Software 2 Algebra e-book Table of Contents Chapter 1 Algebraic expressions 5 1 Collecting... like terms 5
More informationIntroduction to Maximum Likelihood Estimation
Introduction to Maximum Likelihood Estimation Eric Zivot July 26, 2012 The Likelihood Function Let 1 be an iid sample with pdf ( ; ) where is a ( 1) vector of parameters that characterize ( ; ) Example:
More informationEstimation of Quantiles
9 Estimation of Quantiles The notion of quantiles was introduced in Section 3.2: recall that a quantile x α for an r.v. X is a constant such that P(X x α )=1 α. (9.1) In this chapter we examine quantiles
More informationSubject CS1 Actuarial Statistics 1 Core Principles
Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and
More informationHypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006
Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)
More informationParameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!
Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses
More informationJust Enough Likelihood
Just Enough Likelihood Alan R. Rogers September 2, 2013 1. Introduction Statisticians have developed several methods for comparing hypotheses and for estimating parameters from data. Of these, the method
More informationHuman-Oriented Robotics. Probability Refresher. Kai Arras Social Robotics Lab, University of Freiburg Winter term 2014/2015
Probability Refresher Kai Arras, University of Freiburg Winter term 2014/2015 Probability Refresher Introduction to Probability Random variables Joint distribution Marginalization Conditional probability
More informationSome general observations.
Modeling and analyzing data from computer experiments. Some general observations. 1. For simplicity, I assume that all factors (inputs) x1, x2,, xd are quantitative. 2. Because the code always produces
More informationWeek 1 Quantitative Analysis of Financial Markets Distributions A
Week 1 Quantitative Analysis of Financial Markets Distributions A Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 October
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released
More informationMixture distributions in Exams MLC/3L and C/4
Making sense of... Mixture distributions in Exams MLC/3L and C/4 James W. Daniel Jim Daniel s Actuarial Seminars www.actuarialseminars.com February 1, 2012 c Copyright 2012 by James W. Daniel; reproduction
More informationf(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain
0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher
More information3.4 Complex Zeros and the Fundamental Theorem of Algebra
86 Polynomial Functions 3.4 Complex Zeros and the Fundamental Theorem of Algebra In Section 3.3, we were focused on finding the real zeros of a polynomial function. In this section, we expand our horizons
More informationFourier and Stats / Astro Stats and Measurement : Stats Notes
Fourier and Stats / Astro Stats and Measurement : Stats Notes Andy Lawrence, University of Edinburgh Autumn 2013 1 Probabilities, distributions, and errors Laplace once said Probability theory is nothing
More informationIntroduction to Machine Learning. Lecture 2
Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for
More informationTopic 17: Simple Hypotheses
Topic 17: November, 2011 1 Overview and Terminology Statistical hypothesis testing is designed to address the question: Do the data provide sufficient evidence to conclude that we must depart from our
More informationInferring from data. Theory of estimators
Inferring from data Theory of estimators 1 Estimators Estimator is any function of the data e(x) used to provide an estimate ( a measurement ) of an unknown parameter. Because estimators are functions
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Guy Lebanon February 19, 2011 Maximum likelihood estimation is the most popular general purpose method for obtaining estimating a distribution from a finite sample. It was
More informationEPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7
Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review
More informationBayesian Estimation An Informal Introduction
Mary Parker, Bayesian Estimation An Informal Introduction page 1 of 8 Bayesian Estimation An Informal Introduction Example: I take a coin out of my pocket and I want to estimate the probability of heads
More informationLikelihood and Bayesian Inference for Proportions
Likelihood and Bayesian Inference for Proportions September 18, 2007 Readings Chapter 5 HH Likelihood and Bayesian Inferencefor Proportions p. 1/24 Giardia In a New Zealand research program on human health
More informationQuantitative Understanding in Biology 1.7 Bayesian Methods
Quantitative Understanding in Biology 1.7 Bayesian Methods Jason Banfelder October 25th, 2018 1 Introduction So far, most of the methods we ve looked at fall under the heading of classical, or frequentist
More informationStatistical Models. David M. Blei Columbia University. October 14, 2014
Statistical Models David M. Blei Columbia University October 14, 2014 We have discussed graphical models. Graphical models are a formalism for representing families of probability distributions. They are
More informationMath 123, Week 2: Matrix Operations, Inverses
Math 23, Week 2: Matrix Operations, Inverses Section : Matrices We have introduced ourselves to the grid-like coefficient matrix when performing Gaussian elimination We now formally define general matrices
More informationProbability theory and inference statistics! Dr. Paola Grosso! SNE research group!! (preferred!)!!
Probability theory and inference statistics Dr. Paola Grosso SNE research group p.grosso@uva.nl paola.grosso@os3.nl (preferred) Roadmap Lecture 1: Monday Sep. 22nd Collecting data Presenting data Descriptive
More informationProbability. Table of contents
Probability Table of contents 1. Important definitions 2. Distributions 3. Discrete distributions 4. Continuous distributions 5. The Normal distribution 6. Multivariate random variables 7. Other continuous
More informationSTA 2201/442 Assignment 2
STA 2201/442 Assignment 2 1. This is about how to simulate from a continuous univariate distribution. Let the random variable X have a continuous distribution with density f X (x) and cumulative distribution
More informationLecture 1: Probability Fundamentals
Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability
More informationDiscrete probability distributions
Discrete probability s BSAD 30 Dave Novak Fall 08 Source: Anderson et al., 05 Quantitative Methods for Business th edition some slides are directly from J. Loucks 03 Cengage Learning Covered so far Chapter
More informationExpectation, Variance and Standard Deviation for Continuous Random Variables Class 6, Jeremy Orloff and Jonathan Bloom
Expectation, Variance and Standard Deviation for Continuous Random Variables Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Be able to compute and interpret expectation, variance, and standard
More informationCommon ontinuous random variables
Common ontinuous random variables CE 311S Earlier, we saw a number of distribution families Binomial Negative binomial Hypergeometric Poisson These were useful because they represented common situations:
More informationCommunication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi
Communication Engineering Prof. Surendra Prasad Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 41 Pulse Code Modulation (PCM) So, if you remember we have been talking
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2019 Due: 11am Monday, February 25th, 2019 Submit scan of plots/written responses to Gradebook; submit your
More information