3.3 Estimator quality, confidence sets and bootstrapping

A comparison of two estimators is always a matter of comparing their respective distributions. A good estimator is an estimator with a distribution (under $P_\theta$) that is closely centred around $\theta$ for all $\theta \in \Theta$. It is worth noticing that an estimator should work well for all values of the parameter. Fixing a single value, $\theta_0 \in \Theta$ say, we can define the estimator $\hat\theta$ by $\hat\theta(x) = \theta_0$ for all $x$. This estimator works extremely well when $\theta = \theta_0$, but it works terribly if $\theta$ is far from $\theta_0$. Problems like this make it troublesome in general to decide what properties an optimal estimator should have. Some attempts to define good and bad properties of an estimator in terms of its distribution are described in the next section.

The distribution of an estimator is not only used for comparing estimators but also for judging how uncertain an estimate is. This is usually done via confidence intervals. The construction of confidence intervals relies in principle on explicit knowledge about the distribution of the estimator, something we often don't have, but approximating methods can be used instead. One of these, bootstrapping, is a generally applicable method that in practice relies only on the ability to do some simulations.

Mean Squared Error

The expectation of a random variable $X$ under the probability measure $P_\theta$ is denoted $E_\theta X$.

Definition: If $\hat\theta$ is an estimator of a real valued parameter, $\Theta \subseteq \mathbb{R}$, then the mean squared error of the estimator is

$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta(\hat\theta - \theta)^2.$$

Note that the mean squared error is a function of $\theta$. For a given $\theta$ the number $\mathrm{MSE}_\theta(\hat\theta)$ is a measure of how far the estimator on average is from $\theta$. This is a reasonable measure of how good (or bad) the estimator is: the smaller $\mathrm{MSE}_\theta(\hat\theta)$, the better. As shown in (2.11) the mean squared error can be decomposed into two terms:

$$\mathrm{MSE}_\theta(\hat\theta) = V_\theta\hat\theta + (E_\theta\hat\theta - \theta)^2.$$
The first term is by definition the variance of the estimator, and the second term is called the squared bias of the estimator. An estimator is called unbiased if the second term vanishes for all $\theta$, i.e. an estimator is unbiased if

$$E_\theta\hat\theta = \theta \quad \text{for all } \theta \in \Theta.$$

Unbiasedness seems to be a desirable property, an unbiased estimator is on average right, and a lot of effort has been invested in developing bias-reducing methods. We saw for instance that the MLE of the variance for a normal distribution was corrected

to give an unbiased estimator of the variance. Changing the estimator to remove the squared bias term from $\mathrm{MSE}_\theta(\hat\theta)$ may, however, increase the variance of the estimator to such an extent that the mean squared error becomes larger. If the objective is to minimise the mean squared error of the estimator, one should try to find a suitable tradeoff between the squared bias term and the variance term. This bias-variance tradeoff is a matter of balancing the two types of errors: the systematic error (the bias) and the random error (measured by the variance).

Example: If $X_1,\ldots,X_n$ under $P_p$ are iid Bernoulli variables, $p \in [0,1]$, with the parameter $p$ being the probability that $X_i$ equals 1, then

$$\hat p = \frac{1}{n}\sum_{i=1}^n X_i$$

is an estimator of $p$ (in fact the MLE). The distribution of $n\hat p = \sum_{i=1}^n X_i$ under $P_p$ is a binomial distribution with success parameter $p$ and size parameter $n$, but we don't need this information to find directly that

$$E_p\hat p = \frac{1}{n}\sum_{i=1}^n E_p X_i = \frac{1}{n}\, np = p,$$

so the estimator is unbiased. The mean squared error of the estimator therefore equals the variance:

$$\mathrm{MSE}_p(\hat p) = V_p\hat p = \frac{1}{n^2}\sum_{i=1}^n V_p X_i = \frac{1}{n^2}\, np(1-p) = \frac{p(1-p)}{n}.$$

Example (Evolutionary distance): We continue the earlier example and consider the MLE

$$\hat t = \frac{1}{4\alpha}\log\frac{3n}{3n_1 - n_2}$$

for the evolutionary distance under the Jukes-Cantor model. Note that the estimator is only defined when $3n_1 > n_2$. The distribution, and hence the mean and variance, of this estimator is therefore not really defined, but if we condition on the event that $3n_1 > n_2$ we can find the conditional distribution and hence the conditional mean and variance of the estimator. Rewriting the formula for the estimator gives

$$\hat t = \frac{1}{4\alpha}\bigl(\log 3n - \log(4n_1 - n)\bigr),$$

where we have used that $n_1 + n_2 = n$. Recall that $n_1$ is the number of pairs $(x_i, y_i)$ with $x_i = y_i$. Introducing the Bernoulli random variable $Z_i = 1(X_i = Y_i)$, which is one if $X_i = Y_i$ and zero otherwise, we see that $n_1$ is the realisation of $N_1 = \sum_{i=1}^n Z_i$.
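The conditional mean of $\hat t$ given $3n_1 > n_2$ (equivalently $4N_1 > n$) can be evaluated numerically by summing over the binomial distribution of $N_1$. A minimal sketch, assuming the values $\alpha = 10^{-4}$ and $n = 100$ used in Figures 3.6 and 3.7; the function name is illustrative:

```python
import math

def jc_conditional_mean(t, alpha=1e-4, n=100):
    """Numerically evaluate E_t(t-hat | 4*N1 > n) for the Jukes-Cantor MLE.

    N1 ~ Binomial(n, p(t)) with p(t) = 1/4 + 3/4 * exp(-4*alpha*t);
    the MLE t-hat = (log(3n) - log(4*n1 - n)) / (4*alpha) exists only
    when 4*n1 > n, so we condition on that event.
    """
    p = 0.25 + 0.75 * math.exp(-4 * alpha * t)
    num, prob = 0.0, 0.0
    for n1 in range(n // 4 + 1, n + 1):  # all n1 with 4*n1 > n
        w = math.comb(n, n1) * p**n1 * (1 - p)**(n - n1)
        num += math.log(4 * n1 - n) * w
        prob += w
    return (math.log(3 * n) - num / prob) / (4 * alpha)

print(jc_conditional_mean(100.0))   # close to the true value t = 100
print(jc_conditional_mean(9000.0))  # markedly biased for large t
```

The same sum with $\log(4n_1 - n)^2$-type terms gives the conditional variance; this is exactly the "computational" use of the formulas referred to below.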

Figure 3.6: The graphs show the squared bias (left) and the variance (right) of the MLE as a function of the time parameter $t$ in a Jukes-Cantor model with $\alpha = 10^{-4}$ when having observed 100 aligned letters. The bias as well as the variance are in principle computed conditionally on $3n_1 > n_2$, but for $t$ in the range considered here this event essentially has probability 1. The estimator is clearly biased, and the bias and variance both increase as $t$ increases. The mean squared error of the estimator is the sum of the squared bias and the variance. The bias is, for $t$ in this range, negligible compared to the variance. See also Figure 3.7.

The distribution of $N_1$ under $P_t$ is a binomial distribution with size parameter $n$ and success probability

$$p(t) = \frac{1}{4} + \frac{3}{4}\exp(-4\alpha t).$$

Regarded as a random variable, the distribution of

$$\hat t = \frac{1}{4\alpha}\bigl(\log 3n - \log(4N_1 - n)\bigr)$$

conditionally on $4N_1 > n$ is given as a transformation of a binomial distribution conditionally on $4N_1 > n$. For instance we find that the conditional mean is

$$E_t(\hat t \mid 4N_1 > n) = \frac{1}{4\alpha}\bigl(\log 3n - E_t(\log(4N_1 - n) \mid 4N_1 > n)\bigr),$$

where

$$E_t(\log(4N_1 - n) \mid 4N_1 > n) = \frac{\sum_{n_1 : 4n_1 > n} \log(4n_1 - n)\binom{n}{n_1} p(t)^{n_1}(1 - p(t))^{n - n_1}}{P_t(4N_1 > n)}.$$

The variance can be represented likewise. It is difficult to use this formula and the corresponding variance formula to get any further theoretically, but they can be used computationally. Figure 3.6 shows how the (conditional) bias and variance depend upon the time

Figure 3.7: The graphs show the squared bias (left) and the variance (right) of the MLE as a function of the time parameter $t$ in a Jukes-Cantor model with $\alpha = 10^{-4}$ when having observed 100 aligned letters. Compare with Figure 3.6. It is noticeable that the variance suddenly drops around $t = 9000$ while the squared bias steadily increases. This is explained by the condition $3n_1 > n_2$ for the existence of the MLE. Note also that the squared bias is not negligible compared to the variance.

parameter $t$ in the range 1 to 1000 for $\alpha = 10^{-4}$ and $n = 100$. The MLE is clearly biased, but we observe that in this range the bias is negligible compared to the variance. Figure 3.7 shows another picture for $t$ in the range from 8000 and up. Here the variance reaches a maximum at around 9000 and declines thereafter, whereas the squared bias increases steadily. The squared bias is no longer negligible. The explanation is that when $t$ gets large the event $3n_1 > n_2$ occurs with probability approaching 0.5 (and the probability is in particular less than 1, also from a practical point of view). The estimator will, when $3n_1 > n_2$, take values less than (but close to) its maximal possible value, which is attained at the smallest $n_1$ with $4n_1 > n$ and for $n = 100$ equals $2500\log 75 \approx 10794$. The result is that as $t$ grows the squared bias increases whereas the variance of the estimator stays bounded.

Confidence intervals

If $(P_\theta)_{\theta \in \Theta}$ is a parameterised family of probability measures on $E$, and if we have an observation $x \in E$, then an estimator $\hat\theta : E \to \Theta$ produces an estimate $\hat\vartheta = \hat\theta(x) \in \Theta$. If the observation came to be as a realisation of an experiment that was governed by one

Figure 3.8: These figures show the density for the distribution of the estimator $\hat\mu$ regarded as a two-dimensional function, see Example 3.3.4, with $n = 10$ (left) and $n = 50$ (right). The darker a colour, the higher a value of the density. For a given estimate $\hat\mu(x) = y$ we can read off which $\mu$ (those where the point $(y, \mu)$ is coloured) could produce such an estimate. Note that the dark band is most narrow when $n$ is largest.

probability measure $P_\theta$ in our parameterised family (thus the true parameter is $\theta$), then it is rather unlikely that we have $\hat\theta(x) = \theta$, but we certainly hope that the estimate and the true value are not too far apart. We have seen how the distribution of the estimator, when regarded as a random variable, tells us how estimates will deviate from the true $\theta$. The information flow in the sentence above is that if $\theta$ is the true parameter, then the distribution of $\hat\theta$ under $P_\theta$ tells how far from $\theta$ we can expect to find realisations of $\hat\theta$. We want to turn things upside down and for a given estimate tell how far it is from the true $\theta$. What we are going to do is to combine knowledge about the distribution of $\hat\theta$ for all $\theta \in \Theta$ with an observation $x \in E$ and convert this into knowledge about which parameters we believe could have produced $x$. We first illustrate the main line of thought with two examples.

Example: Let $X_1,\ldots,X_n$ be iid $N(\mu, 1)$ distributed with $\mu \in \mathbb{R}$. Thus our parameter space is $\mathbb{R}$ and the unknown parameter is the mean $\mu$ in the normal distribution. Our sample space is $\mathbb{R}^n$ and the observation is an $n$-dimensional vector $x = (x_1,\ldots,x_n)$. We will consider the estimator

$$\hat\mu = \frac{1}{n}\sum_{k=1}^n X_k,$$

which is the empirical mean (and in this case the MLE as well). The distribution of $\hat\mu$ is a $N(\mu, 1/n)$ distribution.
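The claim that $\hat\mu \sim N(\mu, 1/n)$ is easy to check by simulation. A small sketch, not part of the notes' derivation; the choices $\mu = 2$, $n = 50$ are illustrative:

```python
import random
import statistics

# Simulate the sampling distribution of the empirical mean for
# iid N(mu, 1) data; its mean should be close to mu and its
# variance close to 1/n.
random.seed(1)
mu, n, reps = 2.0, 50, 20000
estimates = [statistics.fmean(random.gauss(mu, 1.0) for _ in range(n))
             for _ in range(reps)]
print(statistics.fmean(estimates))     # roughly mu = 2.0
print(statistics.variance(estimates))  # roughly 1/n = 0.02
```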

Figure 3.9: These figures show the point probabilities for the distribution of the estimator $\hat p$ regarded as a two-dimensional function, see Example 3.3.4, with $n = 10$ (left) and $n = 50$ (right). The darker a colour, the higher a value of the point probabilities. For a given estimate $\hat p(x) = y$ we can read off which $p$ (those where the point $(p, y)$ is coloured) could produce such an estimate. Note the cigar shape of the high values.

Consider the density for the distribution of $\hat\mu$ as a function of two variables,

$$(y, \mu) \mapsto \sqrt{\frac{n}{2\pi}}\exp\Bigl(-\frac{n(y - \mu)^2}{2}\Bigr), \quad (y, \mu) \in \mathbb{R} \times \mathbb{R}.$$

For a given $\mu$ this is simply the density for the distribution of $\hat\mu$ as a function of $y$, but when we change $\mu$ the density changes as well. A combination of $(y, \mu)$ where the density takes a high value has the interpretation that $y$ is a likely estimate if $\mu$ is the true parameter. On Figure 3.8 we see two examples ($n = 10$ and $n = 50$) that illustrate how the density as a function of $(y, \mu)$ behaves. For a given estimate $y = \hat\mu(x)$ we can from the figure read off which values of $\mu$ make the estimate likely and which do not.

Example: Let $X_1,\ldots,X_n$ be iid Bernoulli distributed with success probability $p \in [0,1]$. Our parameter space is $[0,1]$ and the unknown parameter is the success probability $p$. Our sample space is $\{0,1\}^n$ and the observation is an $n$-dimensional vector $x = (x_1,\ldots,x_n)$ of 0-1 variables. We will consider the estimator

$$\hat p = \frac{1}{n}\sum_{k=1}^n X_k,$$

which is the relative frequency of 1's (and the MLE as well). The distribution of

$n\hat p = \sum_{k=1}^n X_k$ is a binomial distribution with parameters $(n, p)$, which implicitly² gives the distribution of $\hat p$. We consider here the point probabilities for the distribution of $\hat p$ as a function of two variables,

$$(y, p) \mapsto \binom{n}{ny} p^{ny}(1-p)^{n-ny}, \quad (y, p) \in \{0, 1/n, \ldots, 1\} \times [0,1].$$

For a given $p$ these are the point probabilities for observing $\hat p(x) = y$ as a function of $y$, but when we change $p$ the point probabilities change as well. A combination of $(y, p)$ with a large point probability has the interpretation that $y$ is a likely estimate if $p$ is the true parameter. On Figure 3.9 we see two examples ($n = 10$ and $n = 50$) that illustrate how the point probabilities change as a function of $(y, p)$. For a given estimate $y = \hat p(x)$ we can from the figure read off which values of $p$ make the estimate likely and which do not. Note, in comparison with the normal distribution as considered in the previous example and on Figure 3.8, that the shape of the large values changes from a simple band around the diagonal to a cigar shape in this binomial example. This is because the variance of the estimator $\hat p$ changes with $p$: it is largest for $p = 0.5$ and smallest when $p$ approaches 0 or 1.

What we lack in the examples above is to quantify precisely how uncertain a given estimate is. We have illustrated how the distribution of the estimator as a function of $\theta$ can be turned around to give information about which values of $\theta$ could have produced a given estimate. In both examples the estimate was judged to be likely for values of the parameter close to the estimate and unlikely for values far from the estimate (high values are close to the diagonal). We would like to report an interval, say, around the estimate such that we are pretty confident that the true parameter is within the interval, but how large should a reasonable interval be?
That essentially depends upon how certain or confident we want to be that the true parameter is in the interval. The following definition captures this in a general formulation.

Definition: A confidence set for the parameter $\theta$ given the observation $x \in E$ is a subset $I(x) \subseteq \Theta$. If we for each $x \in E$ have given a confidence set $I(x)$, we say that the family $(I(x))_{x \in E}$ are level $1-\alpha$ confidence sets for the unknown parameter if for all $\theta \in \Theta$

$$P_\theta(\theta \in I(X)) \geq 1 - \alpha. \quad (3.9)$$

If $\Theta \subseteq \mathbb{R}$ and $I(x)$ is an interval we call $I(x)$ a confidence interval.

Note that $P_\theta(\theta \in I(X))$ is a probability statement about whether the random confidence set $I(X)$ will contain the parameter prior to conducting the experiment, and not about whether the parameter belongs to the confidence set $I(x)$ after having observed the realisation $x$ of $X$. This is a very subtle point about the interpretation of confidence sets. It is the observation, and therefore the confidence set, that is a realisation of the random experiment, and not the unknown

² The distribution of $\hat p$ is a distribution on $\{0, 1/n, 2/n, \ldots, 1\}$, a set that changes with $n$, and the convention is to report the distribution in terms of $n\hat p$, which is a distribution on $\mathbb{Z}$.

parameter. For a given realisation $x$ we simply can't tell whether $I(x)$ contains $\theta$ or not, since $\theta$ is unknown. But if we, prior to making the experiment, decide upon a family of level $1-\alpha$ confidence sets that we will choose among depending on the observation, then we know that the probability that the confidence set we end up with really contains $\theta$ is at least $1-\alpha$, no matter what $\theta$ is. If $\alpha$ is chosen small, $\alpha = 0.05$ say, then we are pretty confident that $\theta$ is actually in $I(x)$, and if $\alpha = 0.01$ we are even more so.

How are we then in practice going to construct confidence sets? In this set of notes we will always base the confidence set upon a given function $H : \Theta \times \Theta \to \mathbb{R}$ and define, for $A \subseteq \mathbb{R}$, the set $I(x)$ by

$$I(x) = \{\theta \in \Theta \mid H(\hat\theta(x), \theta) \in A\}.$$

One then has to decide which set $A$ to choose to attain the desired level for such confidence sets. We find that

$$P_\theta(\theta \in I(X)) = P_\theta(H(\hat\theta, \theta) \in A),$$

so choosing $A$ to make the sets have level $1-\alpha$ is a matter of finding the distribution of the real valued random variable $H(\hat\theta, \theta)$ under $P_\theta$. This distribution is a transformation of the distribution of $\hat\theta$ using the function $\hat\theta \mapsto H(\hat\theta, \theta)$.

Example: Assume that $\Theta \subseteq \mathbb{R}$ and denote by $\sigma(\theta) = \sqrt{V_\theta\hat\theta}$ the standard deviation of the estimator $\hat\theta$ under $P_\theta$. Define $H : \Theta \times \Theta \to \mathbb{R}$ by

$$H(\hat\vartheta, \theta) = \frac{\hat\vartheta - \theta}{\sigma(\hat\vartheta)}.$$

With $A = [-z, z]$ and $x \in E$ it follows (with $\hat\vartheta = \hat\theta(x)$) that

$$I(x) = \{\theta \in \Theta \mid H(\hat\vartheta, \theta) \in A\} = \{\theta \in \Theta \mid -z\sigma(\hat\vartheta) \leq \hat\vartheta - \theta \leq z\sigma(\hat\vartheta)\} = [\hat\vartheta - z\sigma(\hat\vartheta),\ \hat\vartheta + z\sigma(\hat\vartheta)].$$

Note that these confidence sets are always intervals. A typical choice of $z$ is $z = 1.96$ (sometimes the less precise choice $z = 2$ is used). This relies on an important theoretical result valid for many estimators, namely that the distribution of $H(\hat\theta, \theta)$ under $P_\theta$ is approximately a $N(0,1)$-distribution for all $\theta$ when $n$ is large enough. The probability that a normally distributed random variable with mean 0 and variance 1 falls in the interval $[-z, z]$ is

$$\frac{1}{\sqrt{2\pi}}\int_{-z}^{z} \exp\Bigl(-\frac{x^2}{2}\Bigr)\, dx,$$

and it is easy to compute these integrals numerically for different values of $z$.
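This integral equals $\mathrm{erf}(z/\sqrt{2})$, so a quick numerical check needs nothing beyond the standard library (a sketch; the function name is illustrative):

```python
import math

def std_normal_mass(z):
    """P(-z <= Z <= z) for Z ~ N(0, 1), via the error function."""
    return math.erf(z / math.sqrt(2))

print(round(std_normal_mass(1.96), 4))  # 0.95
```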
They equal 0.95 for $z = 1.96$, i.e. with $z = 1.96$ the intervals $I(x)$ defined above are approximately level 0.95 confidence intervals.
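The coverage interpretation of these intervals can also be checked by simulation. A minimal sketch for iid $N(\mu, 1)$ data, where $\sigma(\theta) = 1/\sqrt{n}$ is the exact standard deviation of $\hat\mu$ and does not depend on $\theta$; the values of $\mu$, $n$ and the repetition count are illustrative:

```python
import random
import statistics

# Monte Carlo check of the actual coverage of
# [mu_hat - 1.96/sqrt(n), mu_hat + 1.96/sqrt(n)] for iid N(mu, 1) data.
random.seed(0)
mu, n, reps, z = 0.7, 25, 10000, 1.96
half_width = z / n**0.5
hits = 0
for _ in range(reps):
    mu_hat = statistics.fmean(random.gauss(mu, 1.0) for _ in range(n))
    if mu_hat - half_width <= mu <= mu_hat + half_width:
        hits += 1
print(hits / reps)  # close to the nominal 0.95
```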

In the example above we introduced a construction of confidence sets that only approximately have the desired level $1-\alpha$. This leads to the definitions of the actual and the nominal coverage probability. The actual coverage probability is the function

$$\theta \mapsto P_\theta(\theta \in I(X)),$$

and the sets $I(x)$, $x \in E$, are level $1-\alpha$ confidence sets if the actual coverage probability is larger than $1-\alpha$ for all $\theta$. If we aim at producing level $1-\alpha$ confidence sets we call $1-\alpha$ the nominal coverage probability. The intervals produced as in the example above are therefore said to have nominal coverage probability 0.95, but the actual coverage probability is unknown. Hopefully, if the approximation used is good, it is not far from the nominal coverage probability.

Example: Still with $\Theta \subseteq \mathbb{R}$ and $\sigma(\theta) = \sqrt{V_\theta\hat\theta}$, we may choose the function $H : \Theta \times \Theta \to \mathbb{R}$ as

$$H(\hat\vartheta, \theta) = \frac{\hat\vartheta - \theta}{\sigma(\theta)}.$$

The subtle difference compared to the previous example is that we divide by the standard deviation of the estimator under $P_\theta$ instead of under $P_{\hat\vartheta}$. With this $H$ and $A = [-z, z]$ we find that

$$I(x) = \{\theta \in \Theta \mid -z \leq H(\hat\vartheta, \theta) \leq z\} = \{\theta \in \Theta \mid -\sigma(\theta)z \leq \hat\vartheta - \theta \leq \sigma(\theta)z\},$$

but this is as far as we get. If we don't know more about the standard deviation $\sigma(\theta)$ as a function of $\theta$, we cannot find a more explicit expression for the confidence sets. Moreover, even if we have an analytic formula for the standard deviation, it is not likely to be easy to solve the inequalities, and there is no particular reason that we should get a nice set, e.g. an interval, out of it. The choice $z = 1.96$ will, however, still produce (approximate) 0.95 confidence sets, since also in this setup $H(\hat\theta, \theta)$ has approximately a $N(0,1)$-distribution. If $\sigma(\theta)$ is not too rapidly varying as a function of $\theta$, it plays only a minor role whether the procedure in this example or the procedure in the previous example is used.
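When the inequalities cannot be solved analytically, the set $\{\theta \mid -z\sigma(\theta) \leq \hat\vartheta - \theta \leq z\sigma(\theta)\}$ can simply be traced out numerically on a grid. A sketch, using the binomial standard deviation $\sigma(p) = \sqrt{p(1-p)/n}$ as an illustrative choice (the numbers are made up for the example):

```python
# Trace out I(x) = {theta : |theta_hat - theta| <= z * sigma(theta)}
# on a grid, for the binomial case sigma(p) = sqrt(p(1-p)/n).
n, z = 50, 1.96
p_hat = 0.1  # e.g. 5 successes out of 50

def sigma(p):
    return (p * (1 - p) / n) ** 0.5

grid = [k / 10000 for k in range(10001)]
inside = [p for p in grid if abs(p_hat - p) <= z * sigma(p)]
print(min(inside), max(inside))  # endpoints of the (here interval-shaped) set
```

In this particular case the set happens to be an interval (it is the score-type interval for a binomial proportion), but the grid approach makes no such assumption.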
However, if $\sigma(\theta)$ is rapidly varying, the actual coverage probability of the confidence intervals produced as in the first example may be substantially worse than the actual coverage probability produced by the procedure in this example.

Example: In the two examples above we could consider squaring the $H$-function; corresponding to the first of them we obtain

$$H(\hat\vartheta, \theta) = \frac{(\hat\vartheta - \theta)^2}{\sigma^2(\hat\vartheta)}. \quad (3.10)$$

Proceeding as in those examples, except choosing $A = [0, z]$, the confidence set becomes

$$I(x) = \{\theta \in \Theta \mid H(\hat\vartheta, \theta) \leq z\} = [\hat\vartheta - \sigma(\hat\vartheta)\sqrt{z},\ \hat\vartheta + \sigma(\hat\vartheta)\sqrt{z}].$$

Hence we obtain the same interval as before. The point is that a similar approach works for multidimensional parameter sets, see the Math Box below.

Math Box (Multidimensional confidence sets): The approach above can be generalised to multidimensional parameters. If $\Theta \subseteq \mathbb{R}^p$, the parameter $\theta$ is a $p$-dimensional (column) vector, and if $\Sigma(\theta)$ denotes the covariance matrix of $\hat\theta$ under $P_\theta$, the natural generalisation of (3.10) is given by

$$H(\hat\vartheta, \theta) = (\hat\vartheta - \theta)^T \Sigma(\hat\vartheta)^{-1} (\hat\vartheta - \theta). \quad (3.11)$$

The notation $(\hat\vartheta - \theta)^T$ denotes the transposed (row) vector of the column vector $\hat\vartheta - \theta$. The corresponding confidence set, with $A = [0, z]$, becomes

$$I(x) = \{\theta \in \Theta \mid H(\hat\vartheta, \theta) \leq z\} = \{\theta \in \Theta \mid (\hat\vartheta - \theta)^T \Sigma(\hat\vartheta)^{-1} (\hat\vartheta - \theta) \leq z\}.$$

Such a set is known as an ellipsoid in $\mathbb{R}^p$, and if $p = 2$ the set is an ellipse. The generalisation corresponding to the second construction is given by

$$H(\hat\vartheta, \theta) = (\hat\vartheta - \theta)^T \Sigma(\theta)^{-1} (\hat\vartheta - \theta). \quad (3.12)$$

Just as before, there is no nice explicit form of the corresponding confidence set.

The most serious problem in constructing confidence sets is to find the distribution of $H(\hat\theta, \theta)$ under $P_\theta$. In the two examples above we overcame this problem by taking the distribution of $H(\hat\theta, \theta)$ to be approximately a $N(0,1)$-distribution. The real obstacle left is then to find the standard deviation $\sigma(\theta)$ of the estimator $\hat\theta$ under $P_\theta$. Additional problems with solving certain inequalities also occurred in the second example. In practice, as implemented in a number of statistical software packages, the variance is also approximated by an asymptotic formula, which is obtainable for standard estimators like the MLE, and the approach from the first example is used. Standard programs rarely want to bother with solving inequalities as in the second example, and reporting non-intervals is certainly not attractive. The typical confidence intervals reported by statistical software are therefore based on a number of approximations that are valid only if we have sufficiently many replications of our experiment.
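Checking whether a given $\theta$ belongs to the ellipsoidal set defined by (3.11) is a direct computation. A minimal sketch for $p = 2$; the covariance matrix and the choice of $z$ as a $\chi^2(2)$ quantile are illustrative assumptions, not taken from the notes:

```python
# Membership test for the ellipsoidal confidence set
#   {theta : (theta_hat - theta)^T Sigma^{-1} (theta_hat - theta) <= z}
# in p = 2 dimensions, with an illustrative covariance matrix.

def in_ellipse(theta, theta_hat, cov, z):
    d0 = theta_hat[0] - theta[0]
    d1 = theta_hat[1] - theta[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    # Quadratic form d^T cov^{-1} d via the explicit 2x2 inverse.
    q = (cov[1][1] * d0 * d0 - 2 * cov[0][1] * d0 * d1
         + cov[0][0] * d1 * d1) / det
    return q <= z

theta_hat = (1.0, 2.0)
cov = [[0.04, 0.01], [0.01, 0.02]]  # assumed Sigma(theta_hat)
z = 5.99                            # ~0.95 quantile of a chi-squared(2)
print(in_ellipse((1.0, 2.0), theta_hat, cov, z))  # True: the centre is inside
print(in_ellipse((3.0, 2.0), theta_hat, cov, z))  # False: far from the estimate
```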
It is important to be aware that this can lead to confidence intervals with an actual coverage probability lower than the nominal level.

Example: As a continuation of the normal distribution example, let $X_1,\ldots,X_n$ be iid $N(\mu, \sigma^2)$ distributed. Consider the estimator

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$$

of the mean, and regard the variance as fixed. We know that the mean and variance of $\hat\mu$ are $\mu$ and $\sigma^2/n$ respectively. The distribution of $\hat\mu$ is $N(\mu, \sigma^2/n)$, and therefore the distribution of

$$H(\hat\mu, \mu) = \frac{\sqrt{n}(\hat\mu - \mu)}{\sigma}$$

is $N(0,1)$, so the construction of confidence intervals as in the first example is not an approximation in this particular case. This $H$ function is the mathematical incarnation of Figure 3.8.

Example: Let $X_1,\ldots,X_n$ be iid $N(\mu, \sigma_0^2(\mu))$ distributed with $\sigma_0(\mu)$ some function depending upon $\mu$. That is, the standard deviation changes as a function of the mean. Consider the same estimator

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$$

as above. Then $\hat\mu$ is $N(\mu, \sigma_0^2(\mu)/n)$-distributed, but

$$H(\hat\mu, \mu) = \frac{\sqrt{n}(\hat\mu - \mu)}{\sigma_0(\hat\mu)}$$

does not in general have a normal distribution. In the notation used previously, the standard deviation of the estimator is

$$\sigma(\mu) = \sqrt{V_\mu\hat\mu} = \frac{\sigma_0(\mu)}{\sqrt{n}},$$

and

$$H_2(\hat\mu, \mu) = \frac{\sqrt{n}(\hat\mu - \mu)}{\sigma_0(\mu)}$$

is $N(0,1)$-distributed. Confidence sets based on $H_2$ are therefore not approximate, but the sets are not necessarily intervals. A real horror example of the deficits of using $H$ is obtained by letting

$$\sigma_0(\mu) = \begin{cases} 10 & \text{if } \mu \leq 0 \\ 1 & \text{if } \mu > 0 \end{cases}$$

Take $\mu = 0$ and fix $n = 10$. With the observation $x = (x_1,\ldots,x_{10})$, assume that $\hat\mu = \frac{1}{10}\sum_i x_i > 0$ (something that happens with probability 0.5 under $P_0$); a 0.95 confidence interval based upon $H$ is

$$I(x) = [\hat\mu - 1.96\,\sigma_0(\hat\mu)/\sqrt{10},\ \hat\mu + 1.96\,\sigma_0(\hat\mu)/\sqrt{10}] = [\hat\mu - 0.620,\ \hat\mu + 0.620].$$

We observe that $I(x)$ doesn't contain the value $\mu = 0$ if $\hat\mu > 0.620$, and under $P_0$ this happens with probability approximately 0.42. The conclusion is that the actual coverage probability is lower than 0.58

for $\mu = 0$. Of course $\mu = 0$ is the worst possible choice of $\mu$, but also for other values of $\mu$ near 0 the actual coverage probability falls short of the nominal level. Using $H_2$ instead we are guaranteed to get the right coverage probability, but the confidence sets are in some cases not intervals. We find that

$$I(x) = \{\mu \mid -1.96 \leq H_2(\hat\mu, \mu) \leq 1.96\} = \{\mu \mid -0.620\,\sigma_0(\mu) \leq \hat\mu - \mu \leq 0.620\,\sigma_0(\mu)\} = \bigl([\hat\mu - 0.620,\ \hat\mu + 0.620] \cap (0, \infty)\bigr) \cup \bigl([\hat\mu - 6.20,\ \hat\mu + 6.20] \cap (-\infty, 0]\bigr).$$

The problem in the previous example is that the distribution of $H(\hat\mu, \mu)$ changes very rapidly when $\mu$ changes from being negative to being positive. As a rule of thumb, the less the distribution of $H(\hat\theta, \theta)$ changes with $\theta$, the more we can trust that the confidence intervals have an actual coverage probability close to the nominal level $1-\alpha$.

One should note that if $\Theta \subseteq \mathbb{R}^p$ is a multidimensional parameter space, all the methods discussed in this section can be applied to each of the coordinates of the parameter. That is, an estimator is a $p$-dimensional map $\hat\theta = (\hat\theta_1,\ldots,\hat\theta_p)$ and each of the coordinates $\hat\theta_i$ is a real valued estimator. Considering each coordinate separately can then give marginal information about the uncertainty of a concrete estimate by producing e.g. marginal confidence intervals for each coordinate. One should be careful, though, about making any kind of multidimensional interpretation from such one-dimensional confidence intervals.

Bootstrapping

The idea in bootstrapping for constructing confidence sets is to find an approximation of the distribution of $H(\hat\theta, \theta)$, usually by doing some simulations, that depends upon the observed dataset $x \in E$. What we try is to approximate the distribution of $H(\hat\theta, \theta)$ under $P_\theta$ for all $\theta \in \Theta$ by a single distribution, namely the distribution of $H(\hat\theta, \hat\vartheta)$ under a cleverly chosen probability measure $P_x$, which may depend upon $x$.
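The idea just described can be sketched concretely. Here $P_x$ is chosen parametrically as $N(\hat\vartheta, 1)$ with $H(\hat\vartheta, \theta) = \hat\vartheta - \theta$; this is one possible choice among several, made purely for illustration, and all concrete numbers are assumptions:

```python
import random
import statistics

# Parametric bootstrap sketch: approximate the distribution of
# H(theta_hat, theta) = theta_hat - theta by the distribution of
# theta_hat* - theta_hat, where theta_hat* is re-estimated from data
# simulated under P_x = N(theta_hat, 1).
random.seed(42)
x = [random.gauss(0.5, 1.0) for _ in range(30)]  # stands in for the observed sample
theta_hat = statistics.fmean(x)

boot = []
for _ in range(2000):
    x_star = [random.gauss(theta_hat, 1.0) for _ in range(len(x))]
    boot.append(statistics.fmean(x_star) - theta_hat)

boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
# Invert theta_hat - theta ~ bootstrap distribution to get a 95% interval.
print(theta_hat - hi, theta_hat - lo)
```

Only the simulated distribution of $\hat\theta^* - \hat\vartheta$ is used; no knowledge of the distribution of $\hat\theta$ under every $P_\theta$ is required, which is the whole point of the method.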
Since the distribution is allowed to depend upon the concrete observation $x$, the construction of the confidence set that provides information about the uncertainty of the estimate $\hat\vartheta = \hat\theta(x)$ depends upon the observation itself. Hence what we are going to suggest is to pull information about the uncertainty of an estimate out of the very same data that were used to make the estimate, and for this reason the method is known as bootstrapping. Supposedly one of the stories in The Surprising Adventures of Baron Munchausen by Rudolf Erich Raspe contains a passage where the Baron pulls himself out of a deep lake by his own bootstraps. Such a story can, however, not be found in the original writings by Raspe, but the stories of Baron Munchausen were borrowed and expanded by other writers, and one can find versions where the Baron indeed did something like that. To bootstrap is nowadays, with reference to the Baron Munchausen story, used to describe certain seemingly paradoxical constructions or actions. To boot a computer is an abbreviation of running a so-called bootstrap procedure that gets the computer up and running from scratch.

The problem of finding the distribution of $H(\hat\theta, \theta)$ for all $\theta$ is replaced by the problem of finding a single distribution under a probability measure that depends upon $x$. Different


More information

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables? Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do

More information

Better Bootstrap Confidence Intervals

Better Bootstrap Confidence Intervals by Bradley Efron University of Washington, Department of Statistics April 12, 2012 An example Suppose we wish to make inference on some parameter θ T (F ) (e.g. θ = E F X ), based on data We might suppose

More information

V. Properties of estimators {Parts C, D & E in this file}

V. Properties of estimators {Parts C, D & E in this file} A. Definitions & Desiderata. model. estimator V. Properties of estimators {Parts C, D & E in this file}. sampling errors and sampling distribution 4. unbiasedness 5. low sampling variance 6. low mean squared

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

Review of Discrete Probability (contd.)

Review of Discrete Probability (contd.) Stat 504, Lecture 2 1 Review of Discrete Probability (contd.) Overview of probability and inference Probability Data generating process Observed data Inference The basic problem we study in probability:

More information

7.1 Basic Properties of Confidence Intervals

7.1 Basic Properties of Confidence Intervals 7.1 Basic Properties of Confidence Intervals What s Missing in a Point Just a single estimate What we need: how reliable it is Estimate? No idea how reliable this estimate is some measure of the variability

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

Chapter 12: An introduction to Time Series Analysis. Chapter 12: An introduction to Time Series Analysis

Chapter 12: An introduction to Time Series Analysis. Chapter 12: An introduction to Time Series Analysis Chapter 12: An introduction to Time Series Analysis Introduction In this chapter, we will discuss forecasting with single-series (univariate) Box-Jenkins models. The common name of the models is Auto-Regressive

More information

The Surprising Conditional Adventures of the Bootstrap

The Surprising Conditional Adventures of the Bootstrap The Surprising Conditional Adventures of the Bootstrap G. Alastair Young Department of Mathematics Imperial College London Inaugural Lecture, 13 March 2006 Acknowledgements Early influences: Eric Renshaw,

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing 1. Purpose of statistical inference Statistical inference provides a means of generalizing

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That Statistics Lecture 2 August 7, 2000 Frank Porter Caltech The plan for these lectures: The Fundamentals; Point Estimation Maximum Likelihood, Least Squares and All That What is a Confidence Interval? Interval

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer. Department of Computer Science Virginia Tech Blacksburg, Virginia Copyright c 2015 by Clifford A. Shaffer Computer Science Title page Computer Science Clifford A. Shaffer Fall 2015 Clifford A. Shaffer

More information

DIFFERENTIAL EQUATIONS

DIFFERENTIAL EQUATIONS DIFFERENTIAL EQUATIONS Basic Concepts Paul Dawkins Table of Contents Preface... Basic Concepts... 1 Introduction... 1 Definitions... Direction Fields... 8 Final Thoughts...19 007 Paul Dawkins i http://tutorial.math.lamar.edu/terms.aspx

More information

Chapter 3. Estimation of p. 3.1 Point and Interval Estimates of p

Chapter 3. Estimation of p. 3.1 Point and Interval Estimates of p Chapter 3 Estimation of p 3.1 Point and Interval Estimates of p Suppose that we have Bernoulli Trials (BT). So far, in every example I have told you the (numerical) value of p. In science, usually the

More information

Lecture 6: Finite Fields

Lecture 6: Finite Fields CCS Discrete Math I Professor: Padraic Bartlett Lecture 6: Finite Fields Week 6 UCSB 2014 It ain t what they call you, it s what you answer to. W. C. Fields 1 Fields In the next two weeks, we re going

More information

Advanced Signal Processing Introduction to Estimation Theory

Advanced Signal Processing Introduction to Estimation Theory Advanced Signal Processing Introduction to Estimation Theory Danilo Mandic, room 813, ext: 46271 Department of Electrical and Electronic Engineering Imperial College London, UK d.mandic@imperial.ac.uk,

More information

Interval estimation. October 3, Basic ideas CLT and CI CI for a population mean CI for a population proportion CI for a Normal mean

Interval estimation. October 3, Basic ideas CLT and CI CI for a population mean CI for a population proportion CI for a Normal mean Interval estimation October 3, 2018 STAT 151 Class 7 Slide 1 Pandemic data Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered and 0 = not recovered 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0

More information

ACMS Statistics for Life Sciences. Chapter 13: Sampling Distributions

ACMS Statistics for Life Sciences. Chapter 13: Sampling Distributions ACMS 20340 Statistics for Life Sciences Chapter 13: Sampling Distributions Sampling We use information from a sample to infer something about a population. When using random samples and randomized experiments,

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Estimators as Random Variables

Estimators as Random Variables Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maimum likelihood Consistency Confidence intervals Properties of the mean estimator Introduction Up until

More information

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }

More information

Chapter 8 - Statistical intervals for a single sample

Chapter 8 - Statistical intervals for a single sample Chapter 8 - Statistical intervals for a single sample 8-1 Introduction In statistics, no quantity estimated from data is known for certain. All estimated quantities have probability distributions of their

More information

Topic 5 Notes Jeremy Orloff. 5 Homogeneous, linear, constant coefficient differential equations

Topic 5 Notes Jeremy Orloff. 5 Homogeneous, linear, constant coefficient differential equations Topic 5 Notes Jeremy Orloff 5 Homogeneous, linear, constant coefficient differential equations 5.1 Goals 1. Be able to solve homogeneous constant coefficient linear differential equations using the method

More information

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator Estimation Theory Estimation theory deals with finding numerical values of interesting parameters from given set of data. We start with formulating a family of models that could describe how the data were

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

SIO 221B, Rudnick adapted from Davis 1. 1 x lim. N x 2 n = 1 N. { x} 1 N. N x = 1 N. N x = 1 ( N N x ) x = 0 (3) = 1 x N 2

SIO 221B, Rudnick adapted from Davis 1. 1 x lim. N x 2 n = 1 N. { x} 1 N. N x = 1 N. N x = 1 ( N N x ) x = 0 (3) = 1 x N 2 SIO B, Rudnick adapted from Davis VII. Sampling errors We do not have access to the true statistics, so we must compute sample statistics. By this we mean that the number of realizations we average over

More information

4.2 Estimation on the boundary of the parameter space

4.2 Estimation on the boundary of the parameter space Chapter 4 Non-standard inference As we mentioned in Chapter the the log-likelihood ratio statistic is useful in the context of statistical testing because typically it is pivotal (does not depend on any

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Part 4: Multi-parameter and normal models

Part 4: Multi-parameter and normal models Part 4: Multi-parameter and normal models 1 The normal model Perhaps the most useful (or utilized) probability model for data analysis is the normal distribution There are several reasons for this, e.g.,

More information

Choosing among models

Choosing among models Eco 515 Fall 2014 Chris Sims Choosing among models September 18, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported

More information

Lecture 10: Generalized likelihood ratio test

Lecture 10: Generalized likelihood ratio test Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual

More information

An analogy from Calculus: limits

An analogy from Calculus: limits COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Confidence Intervals

Confidence Intervals Quantitative Foundations Project 3 Instructor: Linwei Wang Confidence Intervals Contents 1 Introduction 3 1.1 Warning....................................... 3 1.2 Goals of Statistics..................................

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain 0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

CSE 103 Homework 8: Solutions November 30, var(x) = np(1 p) = P r( X ) 0.95 P r( X ) 0.

CSE 103 Homework 8: Solutions November 30, var(x) = np(1 p) = P r( X ) 0.95 P r( X ) 0. () () a. X is a binomial distribution with n = 000, p = /6 b. The expected value, variance, and standard deviation of X is: E(X) = np = 000 = 000 6 var(x) = np( p) = 000 5 6 666 stdev(x) = np( p) = 000

More information

Algorithm Independent Topics Lecture 6

Algorithm Independent Topics Lecture 6 Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium November 12, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Estimation MLE-Pandemic data MLE-Financial crisis data Evaluating estimators. Estimation. September 24, STAT 151 Class 6 Slide 1

Estimation MLE-Pandemic data MLE-Financial crisis data Evaluating estimators. Estimation. September 24, STAT 151 Class 6 Slide 1 Estimation September 24, 2018 STAT 151 Class 6 Slide 1 Pandemic data Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered and 0 = not recovered 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Simulations. . p.1/25

Simulations. . p.1/25 Simulations Computer simulations of realizations of random variables has become indispensable as supplement to theoretical investigations and practical applications.. p.1/25 Simulations Computer simulations

More information

Topic 12 Overview of Estimation

Topic 12 Overview of Estimation Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

1 What does the random effect η mean?

1 What does the random effect η mean? Some thoughts on Hanks et al, Environmetrics, 2015, pp. 243-254. Jim Hodges Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota USA 55414 email: hodge003@umn.edu October 13, 2015

More information

Primer on statistics:

Primer on statistics: Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing ryan.reece@gmail.com http://rreece.github.io/ Insight Data Science - AI Fellows Workshop Feb 16, 018 Outline 1. Maximum likelihood

More information

Linear Independence Reading: Lay 1.7

Linear Independence Reading: Lay 1.7 Linear Independence Reading: Lay 17 September 11, 213 In this section, we discuss the concept of linear dependence and independence I am going to introduce the definitions and then work some examples and

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

IEOR E4703: Monte-Carlo Simulation

IEOR E4703: Monte-Carlo Simulation IEOR E4703: Monte-Carlo Simulation Output Analysis for Monte-Carlo Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Output Analysis

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Expectation is linear. So far we saw that E(X + Y ) = E(X) + E(Y ). Let α R. Then,

Expectation is linear. So far we saw that E(X + Y ) = E(X) + E(Y ). Let α R. Then, Expectation is linear So far we saw that E(X + Y ) = E(X) + E(Y ). Let α R. Then, E(αX) = ω = ω (αx)(ω) Pr(ω) αx(ω) Pr(ω) = α ω X(ω) Pr(ω) = αe(x). Corollary. For α, β R, E(αX + βy ) = αe(x) + βe(y ).

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information