Monte Carlo Studies

The response in a Monte Carlo study is a random variable. Its variance comes from the variance of the stochastic elements in the data-generating process. At each scenario, a Monte Carlo study generates multiple realizations of the response, and we work with aggregations of those realizations: means, variances, and so on. How many realizations? That is the Monte Carlo sample size, m. Because a Monte Carlo study of a statistical method involves sample sizes of two kinds, the sizes of the simulated data samples and the Monte Carlo sample size m, we must be clear about what affects the variances of the aggregates (means, etc.).

Example: Variances in a Monte Carlo Study of a Statistical Hypothesis Test

Consider the two-sample t test for equality of means,

H_0: µ_1 = µ_2 vs. H_1: µ_1 ≠ µ_2,

with sample sizes n_1 and n_2. When is the test valid, most powerful, etc. (all the good things you can say about a test)? What about other cases:

1) N(µ_1, σ_1^2) vs. N(µ_2, σ_2^2)
2) N(µ, σ^2) vs. Distribution 2
3) Distribution 1 vs. Distribution 2?

Lots of scenarios. What are they?

Variances in the Monte Carlo Study of a Statistical Hypothesis Test

For a given scenario s, we generate m_s pairs of samples, perform the test on each pair, and add to the count r of the number of rejections. The Monte Carlo estimate of the rejection probability β(s) is

β̂(s) = r / m_s.

In statistics, when we give an estimate, we also give an estimate of the variance of the estimator. What is an estimate of the variance of this estimator of β(s)?

Estimates of Variances in the Monte Carlo Study

Use the sample variance, or, because r is binomial(m_s, β(s)), use

β̂(s)(1 − β̂(s)) / m_s.

What is the point here? The standard deviation is O(m_s^{−1/2}). We choose m_s. How?
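As a minimal sketch of these ideas (not from the text; the scenario, sample sizes, and distributions are all illustrative assumptions), here is one scenario of such a study in R, reporting the Monte Carlo estimate of the rejection probability together with its binomial standard error:

set.seed(42)
m.s <- 1000                          # Monte Carlo sample size for this scenario
n1 <- 20; n2 <- 20                   # data sample sizes
r <- 0                               # count of rejections
for (k in 1:m.s) {
  y1 <- rnorm(n1)                    # sample 1: N(0, 1)
  y2 <- rgamma(n2, shape = 2) - 2    # sample 2: gamma, shifted to mean 0
  if (t.test(y1, y2)$p.value < 0.05) r <- r + 1
}
beta.hat <- r / m.s                                # Monte Carlo estimate of beta(s)
se.hat <- sqrt(beta.hat * (1 - beta.hat) / m.s)    # binomial standard error
c(beta.hat, se.hat)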

Preliminaries (Mostly from Chapter 1)

Data structures and structure in data
Multiple analyses and multiple views
Modeling and computational inference
Probability models
The role of the empirical cumulative distribution function
Statistical functions of the CDF and the ECDF
Plug-in estimators
Order statistics, quantiles, and empirical quantiles
The role of optimization in inference
*** carry over to next lecture
Estimation by minimizing residuals
Estimation by maximum likelihood
Inference about functions
Probability statements in statistical inference
Arithmetic on the computer

As we go along, we will encounter important concepts in statistical inference: sufficiency, unbiasedness, mean-squared error, etc.

Data-Generating Processes and Statistical Models

Our understanding of phenomena is facilitated by means of a model. A model is a description of the phenomenon of interest. We can formulate a model either as a description of a data-generating process or as a prescription for processing data. The model is often expressed as a set of equations that relate data elements to each other. It may include probability distributions for the data elements. If any of the data elements are considered to be realizations of random variables, the model is a stochastic model.

Models

A class of models may have a common form within which the members of the class are distinguished by values of parameters. In models that are not mathematically tractable, computationally intensive methods involving simulations, resamplings, and multiple views may be used to make inferences about the parameters of a model.

Structure in Data

The components of statistical datasets are observations and variables. In general, data structures are ways of organizing data to take advantage of the relationships among the variables constituting the dataset. Data structures may express hierarchical relationships, crossed relationships (as in relational databases), or more complicated aspects of the data (as in object-oriented databases). In data analysis, structure in the data is of interest.

Structure in Data

Structure in the data includes such nonparametric features as modes, gaps, or clusters, the symmetry of the data, and other general aspects of its shape. Because many classical techniques of statistical analysis rely on an assumption of normality of the data, the most interesting structure may be those aspects of the data that deviate most from normality. Graphical displays may be used to discover qualitative structure in the data.

Model Building

The process of building models involves successive refinements. The evolution of the models proceeds from vague, tentative models to more complete ones, and our understanding of the process being modeled grows along the way. The usual statements about statistical methods regarding bias, variance, and so on are made in the context of a model.

Model Building

It is not possible to measure the bias or variance of a procedure for selecting a model, except in the relatively simple case of selection from some well-defined and simple set of possible models. Only within the context of rigid assumptions (a "metamodel") can we do a precise statistical analysis of model selection. Even the simple cases of selection of variables in linear regression analysis under the usual assumptions about the distribution of residuals (and this is a highly idealized situation) present more problems to the analyst than are generally recognized.

Descriptive Statistics, Inferential Statistics, and Model Building

We can distinguish statistical activities that involve:
data collection;
descriptions of a given dataset;
inference within the context of a model or family of models; and
model selection.

Once data are available, whether from a survey, a designed experiment, or just observation, a statistical analysis begins by considering general descriptions of the dataset. These descriptions include ensemble characteristics, such as averages and spreads, and the identification of extreme points. The descriptions take the form of various summary statistics and graphical displays. The descriptive analyses may be computationally intensive for large datasets, especially if there are a large number of variables.

Computational Statistics

The computationally intensive approach also involves multiple views of the data, including consideration of various transformations of the data. A stochastic model is often expressed as a probability density function or as a cumulative distribution function of a random variable. In a simple linear regression model with normal errors,

Y = β_0 + β_1 x + E,

for example, the model may be expressed by use of the probability density function for the random variable E. The probability density function for Y is

p(y) = (1 / (√(2π) σ)) e^{−(y − β_0 − β_1 x)^2 / (2σ^2)}.

The elements of a stochastic model include observable random variables, observable covariates, unobservable parameters, and constants.
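As a quick check (a sketch; all numerical values below are arbitrary assumptions), the displayed density is just the normal density with mean β_0 + β_1 x and standard deviation σ, which R's dnorm confirms:

beta0 <- 1; beta1 <- 2; sigma <- 0.5; x <- 0.3; y <- 1.8   # arbitrary values
p.formula <- exp(-(y - beta0 - beta1 * x)^2 / (2 * sigma^2)) /
  (sqrt(2 * pi) * sigma)                     # the density as displayed above
p.dnorm <- dnorm(y, mean = beta0 + beta1 * x, sd = sigma)
all.equal(p.formula, p.dnorm)                # TRUE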

Statistical Models

The parameters may be considered to be unobservable random variables, and in that sense, a specific data model is defined by a realization of the parameter random variable. In the model, written as

Y = f(x; β) + E,

we identify a systematic component, f(x; β), and a random component, E. The selection of an appropriate model may be very difficult, and almost always involves not only questions of how well the model corresponds to the observed data, but also the tractability of the model. The methods of computational statistics allow a much wider range of tractability than can be contemplated in mathematical statistics.

Classical Statistical Inference

Formal statistical inference involves use of a sample to make decisions about stochastic models based on the probabilities that would result if a given model were indeed the data-generating process. Estimation. Testing. The heuristic paradigm calls for rejection of a model if the probability is small that data arising from the model would be similar to the observed sample. In either case, classical statistical inference may use asymptotic approximations. Asymptotic inference.

Computational Inference

Computationally intensive methods include exploration of a range of models, many of which may be mathematically intractable. In a different approach employing the same paradigm, the statistical methods may involve direct simulation of the hypothesized data-generating process rather than formal computations of probabilities that would result under a given model of the data-generating process. We refer to this approach as computational inference. In a variation of computational inference, we may not even attempt to develop a model of the data-generating process; rather, we build decision rules directly from the data.

The Empirical Cumulative Distribution Function

Methods of statistical inference are based on an assumption (often implicit) that a discrete uniform distribution with mass points at the observed values of a random sample is asymptotically the same as the distribution governing the data-generating process. Thus, the distribution function of this discrete uniform distribution is a model of the distribution function of the data-generating process. For a given set of univariate data, y_1, ..., y_n, the empirical cumulative distribution function, or ECDF, is

P_n(y) = #{y_i, s.t. y_i ≤ y} / n.

The ECDF is the basic function used in many methods of computational inference.
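As a small sketch (the data values are arbitrary), the ECDF at a point can be computed directly from this definition, or with R's built-in ecdf:

y <- c(2.1, 3.5, 1.7, 4.2, 2.8)    # arbitrary data
Pn <- function(t) mean(y <= t)      # #{y_i <= t} / n
Pn(3)                               # 0.6
ecdf(y)(3)                          # same value from the built-in ECDF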

The Empirical Cumulative Distribution Function

It is easy to see that the ECDF is pointwise unbiased for the CDF. That is, if the y_i are independent realizations of random variables Y_i, each with CDF P(·), then for a given y,

E(P_n(y)) = E( (1/n) Σ_{i=1}^n I_{(−∞,y]}(Y_i) )
          = (1/n) Σ_{i=1}^n E( I_{(−∞,y]}(Y_i) )
          = Pr(Y ≤ y)
          = P(y).

So E(P_n(y)) = P(y). Similarly, we find V(P_n(y)) = P(y)(1 − P(y))/n; indeed, at a fixed point y, nP_n(y) is a binomial random variable with parameters n and π = P(y). See Exercise 1.2. Because P_n is a function of the order statistics, which form a complete sufficient statistic for P, there is no unbiased estimator of P(y) with smaller variance.
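A quick simulation (a sketch; the distribution, n, and evaluation point are arbitrary choices) confirms the pointwise mean and variance of P_n(y):

set.seed(1)
n <- 50; y0 <- 1                    # arbitrary: N(0,1) data, evaluate at y = 1
Pn.vals <- replicate(10000, mean(rnorm(n) <= y0))
c(mean(Pn.vals), pnorm(y0))                        # both near 0.841
c(var(Pn.vals), pnorm(y0) * (1 - pnorm(y0)) / n)   # both near 0.00267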

The Empirical Probability Density Function

We also define the empirical probability density function (EPDF) as the derivative of the ECDF:

p_n(y) = (1/n) Σ_{i=1}^n δ(y − y_i),

where δ is the Dirac delta function. The EPDF is just a series of spikes at points corresponding to the observed values. It is not as useful as the ECDF. It is, however, unbiased at any point for the probability density function at that point. The ECDF and the EPDF can be used as estimators of the corresponding population functions, but there are better estimators.

Statistical Functions of the CDF and the ECDF

In many models of interest, a parameter can be expressed as a functional of the probability density function or of the cumulative distribution function of a random variable in the model. The mean of a distribution, for example, can be expressed as a functional Θ of the CDF P:

Θ(P) = ∫_{IR^d} y dP(y).

A functional that defines a parameter is called a statistical function.

Estimation of Statistical Functions

A common task in statistics is to use a random sample to estimate the parameters of a probability distribution. If the statistic T from a random sample is used to estimate the parameter θ, we measure the performance of T by the magnitude of the bias,

|E(T) − θ|,

by the variance,

V(T) = E( (T − E(T)) (T − E(T))^T ),

by the mean squared error,

E( (T − θ)^T (T − θ) ),

and by other expected values of measures of the distance from T to θ.

Properties of Estimators

The order of the mean squared error is an important characteristic of an estimator. For good estimators of location, the order of the mean squared error is typically O(n^{−1}). Good estimators of probability densities, however, typically have mean squared errors of at least order O(n^{−4/5}).

Estimation Using the ECDF

There are many ways to construct an estimator and to make inferences about the population. In the univariate case especially, we often use data to make inferences about a parameter by applying the statistical function to the ECDF. An estimator of a parameter that is defined in this way is called a plug-in estimator. A plug-in estimator for a given parameter is the same functional of the ECDF as the parameter is of the CDF.

Plug-In Estimators

For the mean of the model, for example, we use the estimate that is the same functional of the ECDF as the population mean:

Θ(P_n) = ∫ y dP_n(y)
       = ∫ y d( (1/n) Σ_{i=1}^n I_{(−∞,y]}(y_i) )
       = (1/n) Σ_{i=1}^n y_i
       = ȳ.

The sample mean is thus a plug-in estimator of the population mean.

Plug-In Estimators

An estimator such as the sample mean is called a method of moments estimator. Method of moments estimators are an important type of plug-in estimator. The method of moments results in estimates of the parameters E(Y^r) that are the corresponding sample moments. Statistical properties of plug-in estimators are generally relatively easy to determine. In some cases, the statistical properties, such as expectation and variance, are optimal in some sense.
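For instance (a sketch under an assumed gamma model; the true parameter values are arbitrary), equating the first two sample moments to E(Y) = α/λ and V(Y) = α/λ^2 gives plug-in estimates of the gamma parameters:

set.seed(7)
y <- rgamma(200, shape = 3, rate = 2)   # simulated data; true (alpha, lambda) = (3, 2)
m1 <- mean(y)                  # first sample moment
v <- mean((y - m1)^2)          # plug-in second central moment
lambda.hat <- m1 / v           # from E(Y) = alpha/lambda, V(Y) = alpha/lambda^2
alpha.hat <- m1^2 / v
c(alpha.hat, lambda.hat)       # should be near (3, 2)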

Estimation Using the ECDF

In addition to estimation based on the ECDF, other methods of computational statistics make use of the ECDF. In some cases, such as in bootstrap methods, the ECDF is a surrogate for the CDF. In other cases, such as Monte Carlo methods, an ECDF for an estimator is constructed by repeated sampling, and that ECDF is used to make inferences using the observed value of the estimator from the given sample. Use of the ECDF in statistical inference does not require many assumptions about the distribution.

Estimation Using the ECDF

Viewed as a statistical function, Θ denotes a specific functional form. Any functional of the ECDF is a function of the data, so we may also use the notation Θ(Y_1, ..., Y_n). Often, however, the notation is cleaner if we use another letter to denote the function of the data; for example, T(Y_1, ..., Y_n), even if it might be the case that T(Y_1, ..., Y_n) = Θ(P_n).

Quantiles

A useful distributional measure for describing a univariate distribution with CDF P is a quantity y_π, for π ∈ (0,1), such that

Pr(Y ≤ y_π) ≥ π, and Pr(Y ≥ y_π) ≥ 1 − π.

This quantity is called a π quantile. For an absolutely continuous distribution with CDF P,

y_π = P^{−1}(π).

If P is not absolutely continuous, or in the case of a multivariate random variable, y_π in this equation may not be unique.

The Quantile Function

In the case of a univariate random variable, we can define a useful concept of quantile that always exists. For a probability distribution with CDF P, we define the function P^{−1} on the open interval (0,1) as

P^{−1}(π) = inf{x, s.t. P(x) ≥ π}.

We call P^{−1} the quantile function. Notice that if P is strictly increasing, the quantile function is the ordinary inverse of the cumulative distribution function. If P is not strictly increasing, the quantile function can be interpreted as a generalized inverse of the cumulative distribution function. Notice that for the random variable X with CDF P, if x_(π) = P^{−1}(π), then x_(π) is the π quantile of X as above.
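A sketch of this generalized inverse applied to the ECDF (the data values are arbitrary): P_n^{−1}(π) is the smallest sample value at which the ECDF reaches π, which is also what R's quantile gives with type = 1:

qf <- function(y, pi) {
  ## inf{x, s.t. Pn(x) >= pi} for the ECDF of y
  ys <- sort(y)
  n <- length(ys)
  ys[min(which(seq_len(n) / n >= pi))]
}
y <- c(5, 1, 3, 2, 4)        # arbitrary data
qf(y, 0.5)                   # 3
quantile(y, 0.5, type = 1)   # same: type 1 is the left-continuous inverse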

Quantiles

For a univariate distribution, we can define a unique π quantile as a weighted average of values y_π′ around y_π, where π′ ≈ π and P(y_π′) = π′. It is clear that y_π is a functional of the CDF, say Ξ_π(P). The functional is very simple. It is

Ξ_π(P) = P^{−1}(π),

where P^{−1} is the quantile function. For a univariate random variable, the π quantile is a single point. For a d-variate random variable, a similar definition leads to a (d − 1)-dimensional object that is generally nonunique. (Quantiles are not so useful in the case of multivariate distributions.)

Empirical Quantiles

For a given sample of size n, the order statistics y_(1), ..., y_(n) constitute an obvious set of empirical quantiles. The probabilities from the ECDF that are associated with the order statistic y_(i) are i/n. But these lead to a probability of 1 for the largest sample value, y_(n), and a probability of 1/n for the smallest sample value, y_(1).

Distribution of Order Statistics

If Y_(1), ..., Y_(n) are the order statistics in a random sample of size n from a distribution with PDF p_Y(·) and CDF P_Y(·), then the PDF of the i-th order statistic is

p_{Y_(i)}(y_(i)) = i (n choose i) (P_Y(y_(i)))^{i−1} p_Y(y_(i)) (1 − P_Y(y_(i)))^{n−i}.

Interestingly, the order statistics from a U(0,1) distribution have beta distributions; from the formula above, the i-th order statistic of a U(0,1) sample of size n has a Beta(i, n − i + 1) distribution.
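A quick simulation check of the uniform case (a sketch; n, i, and the replicate count are arbitrary choices):

set.seed(3)
n <- 10; i <- 3
u3 <- replicate(5000, sort(runif(n))[i])   # i-th order statistic of n uniforms
c(mean(u3), i / (n + 1))                   # Beta(i, n-i+1) has mean i/(n+1)
ks.test(u3, pbeta, i, n - i + 1)$p.value   # consistent with Beta(3, 8)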

Estimation of Quantiles

Empirical quantiles can be used as estimators of the population quantiles, but there are generally other estimators that are better, as we can deduce from basic properties of statistical inference. The first thing to note is that the extreme order statistics have very large variances if the support of the underlying distribution is infinite. We would therefore not expect them alone to be the best estimators of an extreme quantile unless the support is finite. A fundamental principle of statistical inference is that a sufficient statistic should be used, if one is available.

Estimation of Quantiles

No order statistic alone is sufficient, except for the minimum or maximum order statistic in the case of a distribution with finite support. The set of all order statistics, however, is always sufficient. Because of the Rao-Blackwell theorem, this leads us to expect that some combination of order statistics would be a better estimator of any population quantile than a single order statistic.

The Harrell-Davis Estimator

The Harrell-Davis estimator uses a weighted combination of all order statistics, where the weights come from a beta distribution. This rests on the fact that, for any continuous CDF P, if Y is a random variable from the distribution with CDF P, then U = P(Y) has a U(0,1) distribution, and the order statistics from a uniform sample have beta distributions. See Exercise 1.7 in the revised Chapter 1.

The Harrell-Davis Estimator

The Harrell-Davis estimator for the π quantile uses the beta distribution with parameters π(n + 1) and (1 − π)(n + 1). Let P_{β_π}(·) be the CDF of the beta distribution with those parameters. The Harrell-Davis estimator for the π quantile is

ŷ_π = Σ_{i=1}^n w_i y_(i),

where

w_i = P_{β_π}(i/n) − P_{β_π}((i − 1)/n).

Monte Carlo Study of the Harrell-Davis Estimator

Let's conduct an empirical study of the relative performance of the sample median and the Harrell-Davis estimator as estimators of the population median. First, write a function to compute this estimator for any given sample and given probability. For example, in R:

hd <- function(y, p) {
  ## Harrell-Davis estimate of the p quantile: a weighted sum of the
  ## order statistics, with weights from the beta(p(n+1), (1-p)(n+1)) CDF
  n <- length(y)
  a <- p * (n + 1)
  b <- (1 - p) * (n + 1)
  w <- pbeta((1:n) / n, a, b) - pbeta((0:(n - 1)) / n, a, b)
  sum(sort(y) * w)
}
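As an illustrative call (the sample is arbitrary), hd(rnorm(25), 0.5) returns the Harrell-Davis estimate of the median of a pseudorandom normal sample of size 25.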

Monte Carlo Study of the Harrell-Davis Estimator

Use samples of size 25, and use 1000 Monte Carlo replicates. For each replicate, generate a pseudorandom sample of size 25, compute the two estimators of the median, and obtain the squared error of each, using the known population value of the median. Use normal, Cauchy, and gamma distributions. The average of the squared errors over the 1000 replicates is your Monte Carlo estimate of the MSE. Summarize your findings in a clearly written report. What are the differences in the relative performance of the sample median and the Harrell-Davis quantile estimator as estimators of the population median? What characteristics of the population seem to have an effect on the relative performance? A possible skeleton for the study is sketched below.
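One possible skeleton for the study (a sketch, not a prescribed solution; the gamma shape parameter and the seed are arbitrary choices), using the hd function defined above:

set.seed(1)
n <- 25; m <- 1000
gens <- list(
  normal = list(r = function(n) rnorm(n), med = 0),
  cauchy = list(r = function(n) rcauchy(n), med = 0),
  gamma = list(r = function(n) rgamma(n, shape = 3), med = qgamma(0.5, shape = 3))
)
for (nm in names(gens)) {
  g <- gens[[nm]]
  se <- replicate(m, {
    y <- g$r(n)
    c((median(y) - g$med)^2, (hd(y, 0.5) - g$med)^2)   # squared errors
  })
  cat(nm, ": MSE(median) =", mean(se[1, ]),
      " MSE(HD) =", mean(se[2, ]), "\n")
}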

Statistical Estimation

Use data and a model. The plug-in estimators are based on the simple principle of applying the defining functional to the ECDF. Other methods of estimation:
minimize residuals from a fitted model, e.g., least squares
maximize the likelihood (what is likelihood?)
These involve optimization.

**** Next week we'll pick up here...