Basic Computations in Statistical Inference


The purpose of an exploration of data may be rather limited and ad hoc, or the purpose may be more general, perhaps to gain understanding of some natural phenomenon. The questions addressed may be somewhat open-ended. The process of understanding often begins with general questions about the structure of the data. At any stage of the analysis, our understanding is facilitated by means of a model.

A model is a description that embodies our current understanding of a phenomenon. In an operational sense, we can formulate a model either as a description of a data-generating process or as a prescription for processing data. The model is often expressed as a set of equations that relate data elements to each other. It may include probability distributions for the data elements. If any of the data elements are considered to be realizations of random variables, the model is a stochastic model.

A model should not limit our analysis; rather, the model should be able to evolve. The process of understanding involves successive refinements of the model. The refinements proceed from vague models to more specific ones. An exploratory data analysis may begin by mining the data to identify interesting properties. These properties generally raise questions that are to be explored further.

A family of models may have a common form within which the members of the family are distinguished by values of parameters. For example, the family of normal probability distributions has a single form of a probability density function that has two parameters. If this form of model is chosen to represent the properties of a dataset, we may seek confidence intervals for values of the two parameters or perform statistical tests of hypothesized values of the parameters. In models that are not as mathematically tractable as the normal probability model (and many realistic models are not), computationally intensive methods involving simulations, resamplings, and multiple views may be used to make inferences about the parameters of a model.

1.1 Discovering Structure: Data Structures and Structure in Data

The components of statistical datasets are observations and variables. In general, data structures are ways of organizing data to take advantage of the relationships among the variables constituting the dataset. Data structures may express hierarchical relationships, crossed relationships (as in relational databases), or more complicated aspects of the data (as in object-oriented databases).

In data analysis, structure in the data is of interest. Structure in the data includes such nonparametric features as modes, gaps, or clusters in the data, the symmetry of the data, and other general aspects of the shape of the data. Because many classical techniques of statistical analysis rely on an assumption of normality of the data, the most interesting structure in the data may be those aspects of the data that deviate most from normality.

Sometimes, it is possible to express the structure in the data in terms of mathematical models. Prior to doing this, graphical displays may be used to discover qualitative structure in the data. Patterns observed in the data may suggest explicit statements of the structure or of relationships among the variables in the dataset. The process of building models of relationships is an iterative one, and graphical displays are useful throughout the process. Graphs comparing data and the fitted models are used to refine the models.

Multiple Analyses and Multiple Views

Effective use of graphics often requires multiple views. For multivariate data, plots of individual variables or combinations of variables can be produced quickly and used to get a general idea of the properties of the data. The data should be inspected from various perspectives. Instead of a single histogram to depict the general shape of univariate data, for example, multiple histograms with different bin widths and different bin locations may provide more insight.

Sometimes, a few data points in a display can completely obscure interesting structure in the other data points. A zooming window to restrict the scope of the display and simultaneously restore the scale to an appropriate viewing size can reveal structure. A zooming window can be used with any graphics software whether the software supports it or not; zooming can be accomplished by deletion of the points in the dataset outside of the window. Scaling the axes can also be used effectively to reveal structure; the relative scale is called the aspect ratio. In Figure 1.1, which is a plot of a bivariate dataset, we form a zooming window that deletes a single observation. The greater magnification and the changed aspect ratio clearly show a relationship between X and Y in a region close to the origin that may not hold for the full range of data. A simple statement of this relationship, however, would not extrapolate outside the window to the outlying point.
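The ideas of multiple histograms and of zooming by subsetting can be sketched in a few lines of R. This is only an illustrative sketch; the bivariate dataset below, with a single outlying point, is hypothetical.

set.seed(42)
x <- c(runif(50), 10)                       # hypothetical data with one outlying point
y <- c(2*x[1:50] + rnorm(50, sd=0.1), 0.5)

# multiple views: histograms with different bin widths
par(mfrow=c(1,3))
hist(y, breaks=5);  hist(y, breaks=15);  hist(y, breaks=30)

# a zooming window formed by deleting the points outside the window
par(mfrow=c(1,2))
plot(x, y)                                  # full data; the outlier dominates the scale
inside <- x < 2
plot(x[inside], y[inside])                  # zoomed view near the origin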

The use of a zooming window is not "deletion of outliers"; it is focusing in on a subset of the data and is done independently of whatever is believed about the data outside of the window.

[Figure 1.1. Scales Matter: the same bivariate data plotted over the full range and within a zooming window.]

One type of structure that may go undetected is that arising from the order in which the data were collected. For data that are recognized as a time series by the analyst, this is obviously not a problem, but often there is a time dependency in the data that is not recognized immediately. Time or location may not be an explicit variable in the dataset, even though it may be an important variable. The index of the observation within the dataset may be a surrogate variable for time, and characteristics of the data may vary as the index varies. Often it is useful to make plots in which one axis is the index number of the observations. More subtle time dependencies are those in which the values of the variables are not directly related to time, but relationships among variables are changing over time. The identification of such time dependencies is much more difficult, and often requires fitting a model and plotting residuals. Another strictly graphical way of observing changes in relationships over time is by using a sequence of graphical displays.

Simple Plots May Reveal the Unexpected

A simple plot of the data will often reveal structure or other characteristics of the data that numerical summaries do not.

An important property of data that is often easily seen in a graph is the unit of measurement. Data on continuous variables are often rounded or measured on a coarse grid. This may indicate other problems in the collection of the data. The horizontal lines in Figure 1.2 indicate that the data do not come from a continuous distribution. Whether we can use methods of data analysis that assume continuity depends on the coarseness of the grid or of the measurement; that is, on the extent to which the data are discrete or the extent to which they have been discretized.

[Figure 1.2. Discrete Data, Rounded Data, or Data Measured Imprecisely]

We discuss graphics further in Chapter 7. The emphasis is on the use of graphics for discovery. The field of statistical graphics is much broader, of course, and includes many issues of design of graphical displays for conveying (rather than discovering) information.

1.2 Modeling and Computational Inference

The process of building models involves successive refinements. The evolution of the models proceeds from vague, tentative models to more complete ones, and our understanding of the process being modeled grows in this process. The usual statements about statistical methods regarding bias, variance, and so on are made in the context of a model. It is not possible to measure bias or variance of a procedure to select a model, except in the relatively simple case of selection from some well-defined and simple set of possible models.

Only within the context of rigid assumptions (a "metamodel") can we do a precise statistical analysis of model selection. Even the simple cases of selection of variables in linear regression analysis under the usual assumptions about the distribution of residuals (and this is a highly idealized situation) present more problems to the analyst than are generally recognized.

Probability Models

Some of the simplest models used in statistics are probability models for random variables. For a random variable X, the model specifies the probability that X is in a given set of real numbers (specifically, a Borel set, but we will not dwell on technical details here). This probability can be expressed in terms of the probability that the random variable is less than or equal to a given number; hence, the fundamental function in a probability model is the cumulative distribution function, or CDF, which for the random variable X yields Pr(X ≤ x) for any real number x. We often denote the CDF of X as P_X(x) or F_X(x). It is clear that the CDF P_X(x) is nondecreasing in x, that P_X(x) → 0 as x → −∞, that P_X(x) → 1 as x → ∞, and that P_X(x) is continuous from the right. Another very important property of the CDF is that if the random variable X has CDF P_X, then

P_X(X) ∼ U(0, 1),   (1.1)

where the symbol ∼ means "is distributed as", and U(0, 1) represents the uniform distribution over the interval (0, 1).

We call the derivative of a CDF P_X the probability density function, or PDF, and we often denote it by the corresponding lower-case letter, p_X or f_X. (If the CDF is not differentiable in the usual sense, the PDF is a special type of derivative, and may denote the probability of a single point.)

There are several probability models that are useful over various ranges of applications. We call a probability model a distribution. Two common ones are the normal model, which we denote as N(µ, σ²), and the uniform model mentioned above. Notation of the form N(µ, σ²) may denote a specific distribution (if µ and σ² are assumed fixed), or it may denote the family of distributions. The concept of families of distributions is pervasive in statistics; the simple tasks of statistical inference often involve inference about specific values of the parameters in a given (assumed) family of distributions. Given any distribution, transformations of the random variable can be used to form various families of related distributions. For example, if X is a random variable, and Y = aX + b with a ≠ 0, the family of distributions of all such Y is called a location-scale family. Many common families of distributions, for example the normal family, are location-scale families.
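Property (1.1) is easy to check by simulation. The following R sketch (with an arbitrary sample size and a standard normal distribution chosen only for illustration) applies the CDF to the random variable itself and examines whether the result looks uniform on (0, 1).

set.seed(1)
x <- rnorm(10000)
u <- pnorm(x)             # P_X(X) for the N(0,1) CDF
hist(u, freq=FALSE)       # approximately flat over (0, 1)
ks.test(u, "punif")       # compare with the U(0,1) distribution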

Often in statistics, we suppose that we have a random sample from a particular distribution. A random sample is the same as a set of random variables that are independent and have the same distribution; that is, the random variables are independent and identically distributed, or iid. We often use notation of the form

X_1, ..., X_n  iid  N(µ, σ²),

for example, to indicate that the n random variables are independent and identically normally distributed.

Descriptive Statistics, Inferential Statistics, and Model Building

We can distinguish statistical activities that involve: data collection; descriptions of a given dataset; inference within the context of a model or family of models; and model selection. In any given application, it is likely that all of these activities will come into play. Sometimes (and often, ideally!), a statistician can specify how data are to be collected, either in surveys or in experiments. We will not be concerned with this aspect of the process in this text.

Once data are available, either from a survey or designed experiment, or just observational data, a statistical analysis begins by considering general descriptions of the dataset. These descriptions include ensemble characteristics, such as averages and spreads, and identification of extreme points. The descriptions are in the form of various summary statistics and graphical displays. The descriptive analyses may be computationally intensive for large datasets, especially if there are a large number of variables. The computationally intensive approach also involves multiple views of the data, including consideration of various transformations of the data. We discuss these methods in Chapters 5 and 7 and in Part II.

A stochastic model is often expressed as a probability density function or as a cumulative distribution function of a random variable. In a simple linear regression model with normal errors,

Y = β_0 + β_1 x + E,   (1.2)

for example, the model may be expressed by use of the probability density function for the random variable E. (Notice that Y and E are written in uppercase because they represent random variables.) The probability density function for Y is

p(y) = (1 / (√(2π) σ)) e^{−(y − β_0 − β_1 x)² / (2σ²)}.   (1.3)

In this model, x is an observable covariate; σ, β_0, and β_1 are unobservable (and, generally, unknown) parameters; and 2 and π are constants. Statistical inference about parameters includes estimation or tests of their values or statements about their probability distributions based on observations of the elements of the model.

The elements of a stochastic model include observable random variables, observable covariates, unobservable parameters, and constants. Some random variables in the model may be considered to be responses. The covariates may be considered to affect the response; they may or may not be random variables. The parameters are variable within a class of models, but for a specific data model the parameters are constants. The parameters may be considered to be unobservable random variables, and in that sense, a specific data model is defined by a realization of the parameter random variable. In the model, written as

Y = f(x; β) + E,   (1.4)

we identify a systematic component, f(x; β), and a random component, E. The selection of an appropriate model may be very difficult, and almost always involves not only questions of how well the model corresponds to the observed data, but also the tractability of the model. The methods of computational statistics allow a much wider range of tractability than can be contemplated in mathematical statistics.

Statistical analyses generally are undertaken with the purpose of making a decision about a dataset, about a population from which a sample dataset is available, or making a prediction about a future event. Much of the theory of statistics developed during the middle third of the twentieth century was concerned with formal inference; that is, use of a sample to make decisions about stochastic models based on probabilities that would result if a given model was indeed the data-generating process. The heuristic paradigm calls for rejection of a model if the probability is small that data arising from the model would be similar to the observed sample. This process can be quite tedious because of the wide range of models that should be explored and because some of the models may not yield mathematically tractable estimators or test statistics. Computationally intensive methods include exploration of a range of models, many of which may be mathematically intractable.

In a different approach employing the same paradigm, the statistical methods may involve direct simulation of the hypothesized data-generating process rather than formal computations of probabilities that would result under a given model of the data-generating process. We refer to this approach as computational inference. We discuss methods of computational inference in Chapters 2, 3, and 4. In a variation of computational inference, we may not even attempt to develop a model of the data-generating process; rather, we build decision rules directly from the data. This is often the approach in clustering and classification, which we discuss in Chapter 10. Computational inference is rooted in classical statistical inference. In subsequent sections of the current chapter, we discuss general techniques used in statistical inference.
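As a concrete illustration of a stochastic model with a systematic component and a random component, the following R sketch simulates data from the simple linear regression model (1.2) with normal errors and fits the model; the parameter values and sample size are arbitrary choices for the illustration.

set.seed(123)
n <- 100
beta0 <- 2; beta1 <- 0.5; sigma <- 1           # arbitrary "true" parameter values
x <- runif(n, 0, 10)                           # observable covariate
y <- beta0 + beta1*x + rnorm(n, sd=sigma)      # realizations of the random variable Y
fit <- lm(y ~ x)
coef(fit)                                      # estimates of beta0 and beta1
summary(fit)$sigma                             # estimate of sigma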

1.3 The Role of the Empirical Cumulative Distribution Function

Methods of statistical inference are based on an assumption (often implicit) that a discrete uniform distribution with mass points at the observed values of a random sample is asymptotically the same as the distribution governing the data-generating process. Thus, the distribution function of this discrete uniform distribution is a model of the distribution function of the data-generating process. For a given set of univariate data, y_1, ..., y_n, the empirical cumulative distribution function, or ECDF, is

P_n(y) = #{y_i, s.t. y_i ≤ y} / n.   (1.5)

The ECDF is the basic function used in many methods of computational inference. Although the ECDF has similar definitions for univariate and multivariate random variables, it is most useful in the univariate case. An equivalent expression for univariate random variables, in terms of intervals on the real line, is

P_n(y) = (1/n) Σ_{i=1}^n I_{(−∞,y]}(y_i),   (1.6)

where I is the indicator function. (See page 413 for the definition and some of the properties of the indicator function. The measure dI_{(−∞,a]}(x), which we use in equation (1.14) below, is particularly interesting.)

It is easy to see that the ECDF is pointwise unbiased for the CDF; that is, if the y_i are independent realizations of random variables Y_i, each with CDF P(·), for a given y,

E(P_n(y)) = E( (1/n) Σ_{i=1}^n I_{(−∞,y]}(Y_i) )
          = (1/n) Σ_{i=1}^n E( I_{(−∞,y]}(Y_i) )
          = Pr(Y ≤ y)
          = P(y).   (1.7)

Similarly, we find

V(P_n(y)) = P(y)(1 − P(y))/n;   (1.8)

indeed, at a fixed point y, nP_n(y) is a binomial random variable with parameters n and π = P(y). Because P_n is a function of the order statistics, which form a complete sufficient statistic for P, there is no unbiased estimator of P(y) with smaller variance.
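The pointwise unbiasedness in equation (1.7) and the binomial variance in equation (1.8) can be verified by simulation. A minimal R sketch, evaluating the ECDF of repeated standard normal samples at an arbitrarily chosen fixed point:

set.seed(2)
n <- 25; y0 <- 0.5
Pn <- replicate(5000, ecdf(rnorm(n))(y0))      # ECDF of each sample, evaluated at y0
mean(Pn); pnorm(y0)                            # approximately P(y0)
var(Pn); pnorm(y0)*(1 - pnorm(y0))/n           # approximately P(y0)(1 - P(y0))/n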

We also define the empirical probability density function (EPDF) as the derivative of the ECDF:

p_n(y) = (1/n) Σ_{i=1}^n δ(y − y_i),   (1.9)

where δ is the Dirac delta function. The EPDF is just a series of spikes at points corresponding to the observed values. It is not as useful as the ECDF. It is, however, unbiased at any point for the probability density function at that point. The ECDF and the EPDF can be used as estimators of the corresponding population functions, but there are better estimators (see Chapter 9).

Statistical Functions of the CDF and the ECDF

In many models of interest, a parameter can be expressed as a functional of the probability density function or of the cumulative distribution function of a random variable in the model. The mean of a distribution, for example, can be expressed as a functional Θ of the CDF P:

Θ(P) = ∫_{IR^d} y dP(y).   (1.10)

A functional that defines a parameter is called a statistical function.

Estimation of Statistical Functions

A common task in statistics is to use a random sample to estimate the parameters of a probability distribution. If the statistic T from a random sample is used to estimate the parameter θ, we measure the performance of T by the magnitude of the bias,

E(T) − θ,   (1.11)

by the variance,

V(T) = E( (T − E(T)) (T − E(T))^T ),   (1.12)

by the mean squared error,

E( (T − θ)^T (T − θ) ),   (1.13)

and by other expected values of measures of the distance from T to θ. (These expressions above are for the scalar case, but similar expressions apply to vectors T and θ, in which case the bias is a vector, the variance is the variance-covariance matrix, and the mean squared error is a dot product and hence a scalar.)
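The measures in equations (1.11) through (1.13) can be approximated by simulation for any estimator. A small R sketch using the sample mean of exponential data as the estimator T (the distribution, sample size, and number of replications are arbitrary):

set.seed(3)
theta <- 2                                     # true mean of the exponential distribution
Tvals <- replicate(10000, mean(rexp(30, rate = 1/theta)))
mean(Tvals) - theta                            # bias, equation (1.11); near 0
var(Tvals)                                     # variance, equation (1.12); near theta^2/30
mean((Tvals - theta)^2)                        # mean squared error, equation (1.13)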

If E(T) = θ, T is unbiased for θ. For sample size n, if E(T) = θ + o(n^{−1/2}), T is said to be first-order accurate for θ; if E(T) = θ + O(n^{−1}), it is second-order accurate. (See page 414 for the definition of O(·). Convergence of E(T) can also be expressed as a stochastic convergence of T, in which case we use the notation O_P(·).)

The order of the mean squared error is an important characteristic of an estimator. For good estimators of location, the order of the mean squared error is typically O(n^{−1}). Good estimators of probability densities, however, typically have mean squared errors of at least order O(n^{−4/5}) (see Chapter 9).

Estimation Using the ECDF

There are many ways to construct an estimator and to make inferences about the population. In the univariate case especially, we often use data to make inferences about a parameter by applying the statistical function to the ECDF. An estimator of a parameter that is defined in this way is called a plug-in estimator. A plug-in estimator for a given parameter is the same functional of the ECDF as the parameter is of the CDF.

For the mean of the model, for example, we use the estimate that is the same functional of the ECDF as the population mean in equation (1.10):

Θ(P_n) = ∫ y dP_n(y)
       = ∫ y d( (1/n) Σ_{i=1}^n I_{(−∞,y]}(y_i) )
       = (1/n) Σ_{i=1}^n ∫ y dI_{(−∞,y]}(y_i)
       = (1/n) Σ_{i=1}^n y_i
       = ȳ.   (1.14)

The sample mean is thus a plug-in estimator of the population mean. Such an estimator is called a method of moments estimator. This is an important type of plug-in estimator. The method of moments results in estimates of the parameters E(Y^r) that are the corresponding sample moments. Statistical properties of plug-in estimators are generally relatively easy to determine. In some cases, the statistical properties, such as expectation and variance, are optimal in some sense.

In addition to estimation based on the ECDF, other methods of computational statistics make use of the ECDF. In some cases, such as in bootstrap methods, the ECDF is a surrogate for the CDF. In other cases, such as Monte Carlo methods, an ECDF for an estimator is constructed by repeated sampling, and that ECDF is used to make inferences using the observed value of the estimator from the given sample.
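A plug-in estimate is obtained by applying the statistical function to the ECDF, which puts mass 1/n on each observation. The following sketch (with an arbitrary simulated sample) does this literally for the first two moments, confirming that the plug-in estimates are just the sample moments, as in equation (1.14).

set.seed(4)
y <- rgamma(50, shape = 3)
sum(y)/length(y);   mean(y)        # plug-in estimate of E(Y): the sample mean
sum(y^2)/length(y); mean(y^2)      # plug-in estimate of E(Y^2): the second sample moment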

Viewed as a statistical function, Θ denotes a specific functional form. Any functional of the ECDF is a function of the data, so we may also use the notation Θ(Y_1, ..., Y_n). Often, however, the notation is cleaner if we use another letter to denote the function of the data; for example, T(Y_1, ..., Y_n), even if it might be the case that T(Y_1, ..., Y_n) = Θ(P_n). We will also often use the same letter that denotes the functional of the sample to represent the random variable computed from a random sample; that is, we may write T = T(Y_1, ..., Y_n). As usual, we will use t to denote a realization of the random variable T.

Use of the ECDF in statistical inference does not require many assumptions about the distribution. Other methods discussed below are based on information or assumptions about the data-generating process.

Order Statistics

In a set of iid random variables X_1, ..., X_n, it is often of interest to consider the ranked values X_{i_1} ≤ ··· ≤ X_{i_n}. These are called the order statistics and are denoted as X_{(1:n)}, ..., X_{(n:n)}. For 1 ≤ k ≤ n, we refer to X_{(k:n)} as the k-th order statistic. We often use the simpler notation X_{(k)}, assuming that n is some fixed and known value. Also, we sometimes drop the parentheses in the other representation, X_{k:n}.

If the CDF of the n iid random variables is P(x) and the PDF is p(x), we can get the PDF of the k-th order statistic by forming the joint density and integrating out all variables except the k-th order statistic. This yields

p_{X_{(k:n)}}(x) = k (n choose k) (P(x))^{k−1} (1 − P(x))^{n−k} p(x).   (1.15)

Clearly, the order statistics X_{(1:n)}, ..., X_{(n:n)} from an iid sample of random variables X_1, ..., X_n are neither independent nor identically distributed.

From equation (1.15), and the fact that if the random variable X has CDF P, then P(X) ∼ U(0, 1) (expression (1.1) on page 9), we see that the distribution of the k-th order statistic from a U(0, 1) is the beta distribution with parameters k and n − k + 1; that is,

X_{(k:n)} ∼ beta(k, n − k + 1),   (1.16)

if X_{(k:n)} is the k-th order statistic in a sample of size n from a uniform(0, 1) distribution.

Another important fact about order statistics from a uniform distribution is that they have simple conditional distributions.

Given the (k + 1)-th order statistic, U_{(k+1:n)} = v, the conditional joint distribution of vU_{(1:k)}, ..., vU_{(k:k)} is the same as the joint distribution of U_{(1:n)}, ..., U_{(k:n)}; that is,

(vU_{(1:k)}, ..., vU_{(k:k)}) =_d (U_{(1:n)}, ..., U_{(k:n)}).   (1.17)

(See Exercise 1.7.)

Quantiles

For α ∈ (0, 1), the α quantile of the distribution with CDF P is the value x_(α) such that P(x_(α)) = α, if such a value exists. (For a univariate random variable, this is a single point. For a d-variate random variable, it is a (d − 1)-dimensional object that is generally nonunique.) In a discrete distribution, for a given value of α, there may be no value x_(α) such that P(x_(α)) = α. We can define a useful concept of quantile, however, that always exists. For a univariate probability distribution with CDF P, we define the function P^{−1} on the open interval (0, 1) as

P^{−1}(α) = inf{x, s.t. P(x) ≥ α}.   (1.18)

We call P^{−1} the quantile function. Notice that if P is strictly increasing, the quantile function is the ordinary inverse of the cumulative distribution function. If P is not strictly increasing, the quantile function can be interpreted as a generalized inverse of the cumulative distribution function. This definition is reasonable (at the expense of overloading the notation used for the ordinary inverse of a function) because, while a CDF may not be an invertible function, it is monotonic nondecreasing. Notice that for the univariate random variable X with CDF P, if x_(α) = P^{−1}(α), then x_(α) is the α quantile of X as above. In a discrete distribution, the quantile function is a step function, and the quantile is the same for values of α in an interval.

The quantile function, just as the CDF, fully determines a probability distribution. It is clear that x_(α) is a functional of the CDF, say Ξ_α(P). The functional is very simple: Ξ_α(P) = P^{−1}(α), where P^{−1} is the quantile function. If P(x) = α, we say the quantile-level of x is α. (We also sometimes use the term "quantile" by itself in a slightly different way: if P(x) = α, we say the quantile of x is α.)

Empirical Quantiles

The quantile function P_n^{−1} associated with an ECDF P_n leads to the order statistics on which the ECDF is based.

These quantiles do not correspond symmetrically to the quantiles of the underlying population. For a quantile-level α ≤ 1/n, P_n^{−1}(α) = x_{(1:n)}; then, for each increase in α of 1/n, P_n^{−1} increases to the next order statistic, until finally it is x_{(n:n)} for all α greater than (n − 1)/n. This disconnect between the quantiles of the ECDF and the quantiles of the underlying distribution leads us to consider other definitions for the empirical quantile, or sample quantile.

For a given sample from a continuous distribution, the intervals between successive order statistics are unbiased estimates of intervals of equal probability. For quantile-levels between 1/n and 1 − 1/n, this leads to estimates of quantiles that are between two specific order statistics; that is, for α between 1/n and 1 − 1/n, an unbiased estimate of the α quantile would be between x_{(i:n)} and x_{(i+1:n)}, where i/n ≤ α ≤ (i + 1)/n. For α between 1/n and 1 − 1/n, we therefore define the empirical quantile as a convex linear combination of two successive order statistics:

q_i = λ x_{(i:n)} + (1 − λ) x_{(i+1:n)},   (1.19)

for 0 ≤ λ ≤ 1. We define the quantile-level of such a quantile as

p_i = (i − ι) / (n + ν)   (1.20)

for some ι ∈ [0, 1/2] and some ν ∈ [0, 1]. This expression is based on a linear interpolation of the ECDF between the order statistics. The values of ι and ν determine the value of λ in equation (1.19).

The values of ι and ν in expression (1.20) that make the empirical quantiles of a random sample correspond most closely to those of the population depend on the distribution of the population, but of course, that distribution is generally unknown. A certain symmetry may be imposed by requiring ν = 1 − 2ι. For a discrete distribution, it is generally best to take ι = ν = 0. A common pair of choices in continuous distributions is ι = 1/2 and ν = 0. Another reasonable pair of choices is ι = 3/8 and ν = 1/4; this corresponds to values that match quantiles of a normal distribution.

Hyndman and Fan (1996) identify nine different versions of expression (1.20) that are used in statistical computer packages. In some cases, the software only uses one version; in other cases, the user is given a choice. The R function quantile allows the user to choose among nine types, and for the chosen type returns the empirical quantiles and the quantile-levels. The R function ppoints generates values of p_i as in equation (1.20). For n less than 11, ppoints uses the values ι = 3/8 and ν = 1/4, and for larger values of n, it uses ι = 1/2 and ν = 0.

Empirical quantiles are used in Monte Carlo inference, in nonparametric inference, and in graphical displays for comparing a sample with a standard distribution or with another sample. Empirical quantiles can also be used as estimators of the population quantiles, but there are other estimators for quantiles of a continuous distribution.
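In R, the nine versions identified by Hyndman and Fan (1996) correspond to the type argument of the function quantile, and the quantile-levels of equation (1.20) can be generated with ppoints. A brief sketch with an arbitrary sample:

set.seed(5)
x <- rnorm(20)
alpha <- c(0.1, 0.25, 0.5, 0.75, 0.9)
quantile(x, probs = alpha, type = 7)    # R's default definition of the sample quantile
quantile(x, probs = alpha, type = 6)    # another of the nine Hyndman-Fan types
ppoints(20)                             # n >= 11: iota = 1/2, nu = 0 in equation (1.20)
ppoints(10)                             # n < 11:  iota = 3/8, nu = 1/4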

Some estimators of quantiles, such as the Kaigh-Lachenbruch estimator and the Harrell-Davis estimator, use a weighted combination of multiple data points instead of just a single one, as in the simple estimators above. (See Kaigh and Lachenbruch (1982) and Harrell and Davis (1982), and also see Exercise 1.9.) If a covariate is available, it may be possible to use it to improve the quantile estimate. This is often the case in simulation studies, where the covariate is called a control variable.

q-q Plots

The quantile-quantile plot, or q-q plot, is a useful graphical display for comparing two distributions, two samples, or a sample and a given theoretical distribution. The most common use is for comparing an observed sample with a given theoretical or reference distribution. Either order statistics of the sample or sample quantiles are plotted along one axis, and either expected values of order statistics or quantiles of the reference distribution corresponding to appropriate quantile-levels are plotted along the other axis. The choices have to do with the appropriate definitions of empirical quantiles, as in equation (1.19), or of population quantiles corresponding to quantile-levels, as in equation (1.20). The sample points may be on the vertical axis and the population points on the horizontal axis, or they can be plotted the other way.

The extent to which the q-q scatterplot fails to lie along a straight line provides a visual assessment of whether the points plotted along the two axes are from distributions or samples whose quantiles are linearly related. The points will fall along a straight line if the two distributions are in the same location-scale family; that is, if the two underlying random variables have a relationship of the form Y = aX + b, where a ≠ 0. One of the most common reference distributions, of course, is the normal. The normal family of distributions is a location-scale family, so a q-q plot for a normal reference distribution does not depend on the mean or variance. The R function qqnorm plots order statistics from a given sample against normal quantiles. The R function qqplot creates more general q-q plots.

Figure 1.3 shows a q-q plot that compares a sample with two gamma reference distributions. The sample was generated from a gamma(10, 10) distribution, and the order statistics from that sample are plotted along the vertical axes. The reference distribution on the left-hand side in Figure 1.3 is a gamma(10, 1) distribution, which is in the same scale family as a gamma(10, 10); hence, the points fall close to a straight line. This plot was produced by the R statements

plot(qgamma(ppoints(length(x)),10),sort(x))
abline(lsfit(qgamma(ppoints(length(x)),10),sort(x)))
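A self-contained version of the plots in Figure 1.3 can be sketched as follows; the seed and the sample size are arbitrary, and the sample is compared with both a gamma(10, 1) and a gamma(1, 1) reference distribution.

set.seed(6)
x <- rgamma(100, shape = 10, rate = 10)
par(mfrow = c(1, 2))
plot(qgamma(ppoints(length(x)), 10), sort(x),       # left panel: same shape, nearly straight
     xlab = "Gamma(10) scores", ylab = "x")
abline(lsfit(qgamma(ppoints(length(x)), 10), sort(x)))
plot(qgamma(ppoints(length(x)), 1), sort(x),        # right panel: different shape, curved
     xlab = "Gamma(1) scores", ylab = "x")
abline(lsfit(qgamma(ppoints(length(x)), 1), sort(x)))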

Note that the worst fit is in the tails; that is, near the end points of the plot. This is characteristic of q-q plots and is a result of the greater skewness of the extreme order statistics.

[Figure 1.3. Quantile-Quantile Plot for Comparing the Sample to Gamma Distributions; left panel: Gamma(10) scores, right panel: Gamma(1) scores, with the sample order statistics on the vertical axes.]

The reference distribution in the plot on the right-hand side in Figure 1.3 is a gamma(1, 1) distribution. The points do not seem to lie on a straight line, and the extremes of the sample do not match the quantiles well at all. The pattern that we observe for the smaller observations (that is, that they are below a straight line that fits most of the data) is characteristic of data with a heavier left tail than the reference distribution to which they are being compared. Conversely, the larger observations, being below the straight line, indicate that the data have a lighter right tail than the reference distribution.

An important property of the q-q plot is that its shape is independent of the location and the scale of the data. In Figure 1.3, the sample is from a gamma distribution with a scale parameter of 10, but the distribution quantiles are from a population with a scale parameter of 1.

For a random sample from the distribution against whose quantiles it is plotted, the points generally deviate most from a straight line in the tails. This is because of the larger variability of the extreme order statistics. Also, because the distributions of the extreme statistics are skewed, the deviation from a straight line is in a specific direction (toward lighter tails) more than half of the time (see Exercise 1.10, page 48).

A q-q plot is an informal visual goodness-of-fit test. (There is no significance level.)

The sup absolute difference between the ECDF and the reference CDF is the Kolmogorov distance, which is the basis for the Kolmogorov test (and the Kolmogorov-Smirnov test) for distributions. The Kolmogorov distance does poorly in measuring differences in the tails of the distribution. A q-q plot, on the other hand, generally is very good in revealing differences in the tails.

A q-q plot is a useful application of the ECDF. As I have mentioned, the ECDF is not so meaningful for multivariate distributions. Plots based on the ECDF of a multivariate dataset are generally difficult to interpret.

1.4 The Role of Optimization in Inference

Important classes of estimators are defined as the point at which some function that involves the parameter and the random variable achieves an optimum. There are, of course, many functions that involve the parameter and the random variable; an example is the probability density. In the use of function optimization in inference, once the objective function is chosen, observations on the random variable are taken and are then considered to be fixed; the parameter in the function is considered to be a variable (the "decision variable", in the parlance often used in the literature on optimization). The function is then optimized with respect to the parameter variable. The nature of the function determines the meaning of "optimized"; if the function is the probability density, for example, "optimized" would logically mean maximized. (This leads to maximum likelihood estimation, which we discuss below.)

In discussing the use of optimization in statistical estimation, we must be careful to distinguish between a symbol that represents a fixed parameter and a symbol that represents a variable parameter. When we denote a probability density function as p(y; θ), we generally expect θ to represent a fixed, but possibly unknown, parameter. A family of probability models specifies a parameter space, say Θ, that determines the possible values of θ in that model. In an estimation method that involves optimizing some function, θ is often used as a variable placeholder. I prefer to use some other symbol for a variable quantity that serves in the place of the parameter; therefore, in the following discussion, I will generally use t in place of θ when I want to treat it as a variable. For a family of probability models with parameter space Θ, we define a corresponding set T of values that the variable t may take, and we require t ∈ T. For other common symbols of parameters, such as β, when I want to treat them as variables in an optimization algorithm, I use corresponding Latin letters, such as b. In an iterative algorithm, I use t^{(k)} to represent a fixed value in the k-th iteration.

Some Comments on Optimization

The solution to an optimization problem is in some sense "best" for that particular problem and its objective function; this may mean, however, that it is considerably less good for some other optimization problem. It is often the case, therefore, that an optimal solution is not robust to assumptions about the phenomenon being studied. Optimization, because it is performed in the context of a set of assumptions, is likely to magnify the effects of the assumptions.

Fitting Statistical Models by Optimization

A statistical model such as Y = β_0 + β_1 x + E in equation (1.2) specifies a family of probability distributions of the data. The specific member of that family of probability distributions depends on the values of the parameters in the model, β_0, β_1, and σ² (see equation (1.3)). Simpler models, such as the statement "X is distributed as N(µ, σ²)", also specify families of probability distributions. An important step in statistical inference is to use observed data to estimate the parameters in the model. This is called fitting the model. The question is, how do we estimate the parameters? What properties should the estimators have?

In the following pages we discuss two general ways in which optimization is used to estimate parameters or to fit models. One is to minimize deviations of observed values from what a model would predict (think least squares, as an example). This is an intuitive procedure that may be chosen without regard to the nature of the data-generating process. The justification for a particular form of the objective function, however, may arise from assumptions about a probability distribution underlying the data-generating process. Another way in which optimization is used in statistical inference is in maximizing the likelihood, which we will discuss more precisely in Section 1.4.3, beginning on page 30. The correct likelihood function depends on the probability distribution underlying the data-generating process, which, of course, is not known and can only be assumed. How poor the maximum likelihood estimator is depends on both the true distribution and the assumed distribution.

In the discussion below, we briefly describe particular optimization techniques that assume that the objective function is a continuous function of the decision variables, or the parameters. We also assume that there are no a priori constraints on the values of the parameters. Techniques appropriate for other situations, such as for discrete optimization and constrained optimization, are available in the general literature on optimization.

We must also realize that the mathematical expressions below do not necessarily imply computational methods. There are many additional considerations for the numerical computations. A standard example of this point is in the solution of the linear full-rank system of n equations in n unknowns: Ax = b.

While we may write the solution as x = A^{−1} b, we would almost never compute the solution by forming the inverse and then multiplying b by it (see Gentle (2007), Chapter 6).

Estimation by Minimizing Residuals

In many applications, we can express the expected value of a random variable as a function of a parameter. For example, in the model Y = β_0 + β_1 x + E in equation (1.2), we have

E(Y) = β_0 + β_1 x,   (1.21)

which involves an observable covariate. In general, we may write the expected value as

E(Y) = f(x, θ),   (1.22)

where f is some function, x may be some observable covariates (possibly a vector), and θ is some unobservable parameter (possibly a vector). (The more difficult and interesting problems, of course, involve the determination of the form of the function f, but for the time being, we concentrate on the simpler problem of determining an appropriate value of θ, assuming that the form of the function f is known.)

Assuming that we have observations y_1, ..., y_n on Y (and observations on the covariates if there are any), we ask what value of θ would make the model fit the data best. There are various ways of approaching this question, but a reasonable first step would be to look at the differences between the observations and their expected (or "predicted") values, y_i − f(x_i, θ). For any particular value of θ, say t, we have the residuals

r_i(t) = y_i − f(x_i, t).   (1.23)

A reasonable estimator of θ would be the value of t that minimizes some norm of r(t), the n-vector of residuals r_i(t). Notice that I am using t in place of θ. The logical reason for doing this is that θ itself is some unknown constant; I want to work with a variable whose value I can choose according to my own criterion. (While you should understand this reasoning, I will often use the same symbol for the variable that I use for the constant unknown parameter. This sloppiness is common in the statistical literature.)

We have now formulated our estimation problem as an optimization problem:

min_t ||r(t)||,   (1.24)

where ||·|| represents a vector norm. We often choose the norm as the L_p norm, which we sometimes represent as ||·||_p. For the n-vector v, the L_p vector norm is (Σ_{i=1}^n |v_i|^p)^{1/p}.

For the L_p norm, we minimize a function of an L_p norm of the residuals,

s_p(t) = Σ_{i=1}^n |y_i − f(x_i, t)|^p,   (1.25)

for some p ≥ 1, to obtain an L_p estimator. Simple choices are the sum of the absolute values and the sum of the squares. The latter choice yields the least squares estimator. More generally, we could minimize

s_ρ(t) = Σ_{i=1}^n ρ(y_i − f(x_i, t))   (1.26)

for some nonnegative function ρ(·) to obtain an M estimator. (The name comes from the similarity of this objective function to the objective function for some maximum likelihood estimators.)

Other Types of Residuals

Recall our original estimation problem. We have a model, say Y = f(x, θ) + E, where E is a random variable with unknown variance, say σ². We have formulated an optimization problem involving residuals formed from the decision variable t for the estimation of θ. What about estimation of σ²? To use the same heuristic, that is, to form a residual involving σ², we would use a residual that involves t:

g( Σ_{i=1}^n (r_i(t))² ) − v²,   (1.27)

where g is some function and where the variable v² is used in place of σ². This residual may look somewhat unfamiliar, and there is no obvious way of choosing g. In certain simple models, however, we can formulate the expression g( Σ_{i=1}^n (r_i(t))² ) in terms of the expected values of Y − f(x, θ). In the case of the model (1.2), Y = β_0 + β_1 x + E, this leads to the estimator v² = Σ_{i=1}^n (r_i(t))² / (n − 2).

Formation of residuals in other estimation problems may not always be so straightforward. For example, consider the problem of estimation of the parameters α and β in a beta(α, β) distribution.

If the random variable Y has this distribution, then E(Y) = α/(α + β). Given observations y_1, ..., y_n, we could form residuals similar to what we did above:

r_{1i}(a, b) = y_i − a/(a + b).

Minimizing a norm of this, however, does not yield unique individual values for a and b. We need other, mathematically independent, residuals. ("Mathematically independent" refers to unique solutions of equations, not statistical independence of random variables.) If the random variable Y has the beta distribution as above, then we know

E(Y²) = V(Y) + (E(Y))² = (α²(α + β + 1) + αβ) / ((α + β)²(α + β + 1)).

We could now form a second set of residuals:

r_{2i}(a, b) = y_i² − (a²(a + b + 1) + ab) / ((a + b)²(a + b + 1)).

We now form a minimization problem involving these two sets of residuals. There are, of course, various minimization problems that could be formulated. The most direct problem, analogous to (1.24), is

min_{a,b} ( ||r_1(a, b)|| + ||r_2(a, b)|| ).

Another optimization problem is the sequential one:

min_{a,b} ||r_1(a, b)||,   (1.28)

and then, subject to a conditional minimum,

min_{ã,b̃} ||r_2(ã, b̃)||.   (1.29)

In Exercise 1.13a, you are asked to estimate the parameters in a gamma distribution using these ideas.

Computational Methods for Estimation by Minimizing Residuals

Standard techniques for optimization can be used to determine estimates that minimize various functions of the residuals, that is, for some appropriate function of the residuals s(·), to solve

min_t s(t).   (1.30)

Here, I am thinking of the model parameter θ, and so I am using t as a variable in place of θ. I will denote the optimal value of t, according to the criteria embodied in the function s(·), as θ̂.
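The combined residual-norm criterion for the beta parameters can be minimized numerically. A minimal R sketch using optim with L2 norms (the data are simulated with arbitrarily chosen true values α = 2 and β = 5; the starting values are also arbitrary):

set.seed(7)
y <- rbeta(200, shape1 = 2, shape2 = 5)

obj <- function(theta) {
  a <- theta[1]; b <- theta[2]
  if (a <= 0 || b <= 0) return(Inf)                            # keep the parameters in range
  r1 <- y   - a/(a + b)                                        # residuals from E(Y)
  r2 <- y^2 - (a^2*(a + b + 1) + a*b)/((a + b)^2*(a + b + 1))  # residuals from E(Y^2)
  sqrt(sum(r1^2)) + sqrt(sum(r2^2))                            # ||r1|| + ||r2||
}

optim(c(1, 1), obj)$par                                        # should be near (2, 5)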

Except for special forms of the objective function, the algorithms to solve expression (1.30) are iterative. If s is twice differentiable, one algorithm is Newton's method, in which the minimizing value of t, θ̂, is obtained as a limit of the iterates

t^{(k)} = t^{(k−1)} − (H_s(t^{(k−1)}))^{−1} ∇s(t^{(k−1)}),   (1.31)

where H_s(t) denotes the Hessian of s and ∇s(t) denotes the gradient of s, both evaluated at t. (Newton's method is sometimes called the Newton-Raphson method.) The function s(·) is usually chosen to be differentiable, at least piecewise.

For various computational considerations, instead of the exact Hessian, a matrix approximating the Hessian is often used. In this case, the technique is called a quasi-Newton method.

Newton's method or a quasi-Newton method often overshoots the best step. The direction t^{(k)} − t^{(k−1)} may be the best direction, but the distance ||t^{(k)} − t^{(k−1)}|| may be too great. A variety of methods using Newton-like iterations involve a system of equations of the form

H_s(t) d = −∇s(t).   (1.32)

These equations are solved for the direction d, and the new point is taken as the old t plus αd, for some damping factor α.

There are various ways of deciding when an iterative optimization algorithm has converged. In general, convergence criteria are based on the size of the change in t^{(k)} from t^{(k−1)}, or the size of the change in s(t^{(k)}) from s(t^{(k−1)}).

Statistical Properties of Minimum-Residual Estimators

It is generally difficult to determine the variance or other high-order statistical properties of an estimator defined as above (that is, defined as the minimizer of some function of the residuals). In many cases, all that is possible is to approximate the variance of the estimator in terms of some relationship that holds for a normal distribution. (In robust statistical methods, for example, it is common to see a scale estimate expressed in terms of some mysterious constant times a function of some transformation of the residuals.)

There are two issues that affect both the computational method and the statistical properties of the estimator defined as the solution to the optimization problem.

One consideration has to do with the acceptable values of the parameter θ. In order for the model to make sense, it may be necessary that the parameter be in some restricted range. In some models, a parameter must be positive, for example. In these cases, the optimization problem has constraints. Such a problem is more difficult to solve than an unconstrained problem. Statistical properties of the solution are also more difficult to determine. More extreme cases of restrictions on the parameter may require the parameter to take values in a countable set. Obviously, in such cases, Newton's method cannot be used because the derivatives cannot be defined. In those cases, a combinatorial optimization algorithm must be used instead. Other situations in which the function is not differentiable also present problems for the optimization algorithm. In such cases, if the domain is continuous, a descending sequence of simplexes can be used.

Secondly, it may turn out that the optimization problem (1.30) has local minima. This depends on the nature of the function f(·) in equation (1.22). Local minima present problems for the computation of the solution because the algorithm may get stuck in a local optimum. Local minima also present conceptual problems concerning the appropriateness of the estimation criterion itself. As long as there is a unique global optimum, it seems reasonable to seek it and to ignore local optima. It is not so clear what to do if there are multiple points at which the global optimum is attained.

Least Squares Estimation

Least squares estimators are generally more tractable than estimators based on other functions of the residuals. They are more tractable both in terms of solving the optimization problem to obtain the estimate and in approximating statistical properties of the estimators, such as their variances.

Consider the model in equation (1.22), E(Y) = f(x, θ), and assume that θ is an m-vector and that f(·) is a smooth function in θ. Letting y be the n-vector of observations, we can write the least squares objective function corresponding to equation (1.25) as

s(t) = (r(t))^T r(t),   (1.33)

where the superscript T indicates the transpose of a vector or matrix. The gradient and the Hessian for a least squares problem have special structures that involve the Jacobian of the residuals, J_r(t). The gradient of s is

∇s(t) = 2 (J_r(t))^T r(t).   (1.34)

Taking derivatives of ∇s(t), we see that the Hessian of s can be written in terms of the Jacobian of r and the individual residuals:

H_s(t) = 2 (J_r(t))^T J_r(t) + 2 Σ_{i=1}^n r_i(t) H_{r_i}(t).   (1.35)
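Dropping the second term in equation (1.35) gives the Gauss-Newton approximation to the Hessian, 2 (J_r(t))^T J_r(t), which underlies many nonlinear least squares algorithms. The following sketch applies full Gauss-Newton steps to a hypothetical exponential-decay model; the data, starting values, and number of iterations are arbitrary, and the result is compared with the R function nls.

set.seed(8)
x <- seq(0, 5, length.out = 50)
y <- 3*exp(-0.7*x) + rnorm(50, sd = 0.1)        # simulated data; true parameters (3, 0.7)

t <- c(2, 0.5)                                  # starting value
for (k in 1:20) {
  r <- y - t[1]*exp(-t[2]*x)                    # residuals r(t)
  J <- cbind(-exp(-t[2]*x),                     # Jacobian of the residuals
             t[1]*x*exp(-t[2]*x))
  d <- solve(crossprod(J), -crossprod(J, r))    # solve (J^T J) d = -J^T r
  t <- t + as.vector(d)                         # full step (damping factor alpha = 1)
}
t                                               # near (3, 0.7)
nls(y ~ a*exp(-b*x), start = list(a = 2, b = 0.5))   # the same fit via nls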


More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Bivariate Paired Numerical Data

Bivariate Paired Numerical Data Bivariate Paired Numerical Data Pearson s correlation, Spearman s ρ and Kendall s τ, tests of independence University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Statistical Methods as Optimization Problems

Statistical Methods as Optimization Problems models ( 1 Statistical Methods as Optimization Problems Optimization problems maximization or imization arise in many areas of statistics. Statistical estimation and modeling both are usually special types

More information

One-Sample Numerical Data

One-Sample Numerical Data One-Sample Numerical Data quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES

ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES ORDER STATISTICS, QUANTILES, AND SAMPLE QUANTILES 1. Order statistics Let X 1,...,X n be n real-valued observations. One can always arrangetheminordertogettheorder statisticsx (1) X (2) X (n). SinceX (k)

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems Principles of Statistical Inference Recap of statistical models Statistical inference (frequentist) Parametric vs. semiparametric

More information

Economics 101A (Lecture 3) Stefano DellaVigna

Economics 101A (Lecture 3) Stefano DellaVigna Economics 101A (Lecture 3) Stefano DellaVigna January 24, 2017 Outline 1. Implicit Function Theorem 2. Envelope Theorem 3. Convexity and concavity 4. Constrained Maximization 1 Implicit function theorem

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population Lecture 5 1 Lecture 3 The Population Variance The population variance, denoted σ 2, is the sum of the squared deviations about the population mean divided by the number of observations in the population,

More information

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1

More information

Nonlinear Programming

Nonlinear Programming Nonlinear Programming Kees Roos e-mail: C.Roos@ewi.tudelft.nl URL: http://www.isa.ewi.tudelft.nl/ roos LNMB Course De Uithof, Utrecht February 6 - May 8, A.D. 2006 Optimization Group 1 Outline for week

More information

POLI 8501 Introduction to Maximum Likelihood Estimation

POLI 8501 Introduction to Maximum Likelihood Estimation POLI 8501 Introduction to Maximum Likelihood Estimation Maximum Likelihood Intuition Consider a model that looks like this: Y i N(µ, σ 2 ) So: E(Y ) = µ V ar(y ) = σ 2 Suppose you have some data on Y,

More information

14.30 Introduction to Statistical Methods in Economics Spring 2009

14.30 Introduction to Statistical Methods in Economics Spring 2009 MIT OpenCourseWare http://ocw.mit.edu 4.0 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Problem 1 (20) Log-normal. f(x) Cauchy

Problem 1 (20) Log-normal. f(x) Cauchy ORF 245. Rigollet Date: 11/21/2008 Problem 1 (20) f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.0 0.2 0.4 0.6 0.8 4 2 0 2 4 Normal (with mean -1) 4 2 0 2 4 Negative-exponential x x f(x) f(x) 0.0 0.1 0.2 0.3 0.4 0.5

More information

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations John R. Michael, Significance, Inc. and William R. Schucany, Southern Methodist University The mixture

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Statistics 3858 : Maximum Likelihood Estimators

Statistics 3858 : Maximum Likelihood Estimators Statistics 3858 : Maximum Likelihood Estimators 1 Method of Maximum Likelihood In this method we construct the so called likelihood function, that is L(θ) = L(θ; X 1, X 2,..., X n ) = f n (X 1, X 2,...,

More information

What s New in Econometrics? Lecture 14 Quantile Methods

What s New in Econometrics? Lecture 14 Quantile Methods What s New in Econometrics? Lecture 14 Quantile Methods Jeff Wooldridge NBER Summer Institute, 2007 1. Reminders About Means, Medians, and Quantiles 2. Some Useful Asymptotic Results 3. Quantile Regression

More information

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015

Part IA Probability. Definitions. Based on lectures by R. Weber Notes taken by Dexter Chua. Lent 2015 Part IA Probability Definitions Based on lectures by R. Weber Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly) after lectures.

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

STA 732: Inference. Notes 2. Neyman-Pearsonian Classical Hypothesis Testing B&D 4

STA 732: Inference. Notes 2. Neyman-Pearsonian Classical Hypothesis Testing B&D 4 STA 73: Inference Notes. Neyman-Pearsonian Classical Hypothesis Testing B&D 4 1 Testing as a rule Fisher s quantification of extremeness of observed evidence clearly lacked rigorous mathematical interpretation.

More information

Chapter 9. Non-Parametric Density Function Estimation

Chapter 9. Non-Parametric Density Function Estimation 9-1 Density Estimation Version 1.1 Chapter 9 Non-Parametric Density Function Estimation 9.1. Introduction We have discussed several estimation techniques: method of moments, maximum likelihood, and least

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Design of the Fuzzy Rank Tests Package

Design of the Fuzzy Rank Tests Package Design of the Fuzzy Rank Tests Package Charles J. Geyer July 15, 2013 1 Introduction We do fuzzy P -values and confidence intervals following Geyer and Meeden (2005) and Thompson and Geyer (2007) for three

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Gradient Descent. Dr. Xiaowei Huang

Gradient Descent. Dr. Xiaowei Huang Gradient Descent Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Three machine learning algorithms: decision tree learning k-nn linear regression only optimization objectives are discussed,

More information

IENG581 Design and Analysis of Experiments INTRODUCTION

IENG581 Design and Analysis of Experiments INTRODUCTION Experimental Design IENG581 Design and Analysis of Experiments INTRODUCTION Experiments are performed by investigators in virtually all fields of inquiry, usually to discover something about a particular

More information

If we want to analyze experimental or simulated data we might encounter the following tasks:

If we want to analyze experimental or simulated data we might encounter the following tasks: Chapter 1 Introduction If we want to analyze experimental or simulated data we might encounter the following tasks: Characterization of the source of the signal and diagnosis Studying dependencies Prediction

More information

Chapter 3. Introduction to Linear Correlation and Regression Part 3

Chapter 3. Introduction to Linear Correlation and Regression Part 3 Tuesday, December 12, 2000 Ch3 Intro Correlation Pt 3 Page: 1 Richard Lowry, 1999-2000 All rights reserved. Chapter 3. Introduction to Linear Correlation and Regression Part 3 Regression The appearance

More information

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation

Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation Univariate Normal Distribution; GLM with the Univariate Normal; Least Squares Estimation PRE 905: Multivariate Analysis Spring 2014 Lecture 4 Today s Class The building blocks: The basics of mathematical

More information

6 Pattern Mixture Models

6 Pattern Mixture Models 6 Pattern Mixture Models A common theme underlying the methods we have discussed so far is that interest focuses on making inference on parameters in a parametric or semiparametric model for the full data

More information

Prentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12)

Prentice Hall Stats: Modeling the World 2004 (Bock) Correlated to: National Advanced Placement (AP) Statistics Course Outline (Grades 9-12) National Advanced Placement (AP) Statistics Course Outline (Grades 9-12) Following is an outline of the major topics covered by the AP Statistics Examination. The ordering here is intended to define the

More information

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

Physics 509: Bootstrap and Robust Parameter Estimation

Physics 509: Bootstrap and Robust Parameter Estimation Physics 509: Bootstrap and Robust Parameter Estimation Scott Oser Lecture #20 Physics 509 1 Nonparametric parameter estimation Question: what error estimate should you assign to the slope and intercept

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics Chapter 6 Order Statistics and Quantiles 61 Extreme Order Statistics Suppose we have a finite sample X 1,, X n Conditional on this sample, we define the values X 1),, X n) to be a permutation of X 1,,

More information

Topic 1. Definitions

Topic 1. Definitions S Topic. Definitions. Scalar A scalar is a number. 2. Vector A vector is a column of numbers. 3. Linear combination A scalar times a vector plus a scalar times a vector, plus a scalar times a vector...

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

5 Handling Constraints

5 Handling Constraints 5 Handling Constraints Engineering design optimization problems are very rarely unconstrained. Moreover, the constraints that appear in these problems are typically nonlinear. This motivates our interest

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Regression Clustering

Regression Clustering Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

A General Overview of Parametric Estimation and Inference Techniques.

A General Overview of Parametric Estimation and Inference Techniques. A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Lecture 2: CDF and EDF

Lecture 2: CDF and EDF STAT 425: Introduction to Nonparametric Statistics Winter 2018 Instructor: Yen-Chi Chen Lecture 2: CDF and EDF 2.1 CDF: Cumulative Distribution Function For a random variable X, its CDF F () contains all

More information

Multivariate Distributions

Multivariate Distributions Copyright Cosma Rohilla Shalizi; do not distribute without permission updates at http://www.stat.cmu.edu/~cshalizi/adafaepov/ Appendix E Multivariate Distributions E.1 Review of Definitions Let s review

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X.

Optimization. The value x is called a maximizer of f and is written argmax X f. g(λx + (1 λ)y) < λg(x) + (1 λ)g(y) 0 < λ < 1; x, y X. Optimization Background: Problem: given a function f(x) defined on X, find x such that f(x ) f(x) for all x X. The value x is called a maximizer of f and is written argmax X f. In general, argmax X f may

More information

P Values and Nuisance Parameters

P Values and Nuisance Parameters P Values and Nuisance Parameters Luc Demortier The Rockefeller University PHYSTAT-LHC Workshop on Statistical Issues for LHC Physics CERN, Geneva, June 27 29, 2007 Definition and interpretation of p values;

More information

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 Contents Preface to Second Edition Preface to First Edition Abbreviations xv xvii xix PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1 1 The Role of Statistical Methods in Modern Industry and Services

More information

Discussion of Maximization by Parts in Likelihood Inference

Discussion of Maximization by Parts in Likelihood Inference Discussion of Maximization by Parts in Likelihood Inference David Ruppert School of Operations Research & Industrial Engineering, 225 Rhodes Hall, Cornell University, Ithaca, NY 4853 email: dr24@cornell.edu

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information