Statistical Data Analysis

DS-GA 1002 Lecture notes 8, Fall 2016

1 Descriptive statistics

In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the dataset, as well as for computing quantities that summarize it effectively. Such quantities are known as descriptive statistics. As we will see in the following sections, these statistics can often be interpreted within a probabilistic framework, but they are also useful when probabilistic assumptions are not warranted. Because of this, we present them as deterministic functions of the available data.

1.1 Histogram

We begin by considering datasets containing one-dimensional data, which are often visualized by plotting their histogram. The histogram is obtained by binning the range of the data and counting the number of instances that fall within each bin. The width of the bins is a parameter that can be adjusted to yield higher or lower resolution. If the data are interpreted as samples from a random variable, then the histogram can be interpreted as an approximation to their pmf or pdf.

Figure 1 shows two histograms computed from temperature data taken at a weather station in Oxford over 150 years. [1] Each data point represents the maximum temperature recorded in January or August of a particular year. Figure 2 shows a histogram of the GDP per capita of all countries in the world in 2014 according to the United Nations. [2]

[1] The data are available at oxforddata.txt.
[2] The data are available at
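The binning procedure is easy to reproduce programmatically. The following sketch (not part of the original notes) uses numpy to bin synthetic temperature-like values, since the Oxford dataset itself is not reproduced here; the number of bins plays the role of the resolution parameter discussed above.

```python
import numpy as np

# Synthetic stand-in for the Oxford January maximum temperatures
# (the real dataset is not reproduced here).
rng = np.random.default_rng(0)
temperatures = rng.normal(loc=6.7, scale=2.0, size=150)

# Bin the range of the data and count how many instances fall in each bin.
counts, bin_edges = np.histogram(temperatures, bins=10)

# Crude text rendering of the histogram: one row per bin.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}): {'#' * int(count)}")
```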

Figure 1: Histograms of temperature data taken at a weather station in Oxford over 150 years (one panel for January and one for August; horizontal axis in degrees Celsius). Each data point equals the maximum temperature recorded in a certain month of a particular year.

Figure 2: Histogram of the GDP per capita in 2014 (in thousands of dollars).

1.2 Empirical mean and variance

Averaging the elements in a one-dimensional dataset provides a one-number summary of the data, which is a deterministic counterpart to the mean of a random variable. This can be extended to multidimensional data by averaging over each dimension separately. Geometrically, the average, also known as the sample or empirical mean, is the center of mass of the data. A common preprocessing step in data analysis is to center a set of data by subtracting its empirical mean.

Definition 1.1 (Empirical mean). Let {x_1, x_2, ..., x_n} be a set of real-valued data. The empirical mean of the data is defined as

    av(x_1, x_2, ..., x_n) := (1/n) \sum_{i=1}^{n} x_i.    (1)

Let {x_1, x_2, ..., x_n} be a set of d-dimensional real-valued data vectors. The empirical mean or center is

    av(x_1, x_2, ..., x_n) := (1/n) \sum_{i=1}^{n} x_i.    (2)

The empirical mean of the data in Figure 1 is 6.73 °C in January and 21.3 °C in August. The empirical mean of the GDPs per capita in Figure 2 is $ .

The empirical variance is the average of the squared deviations from the empirical mean. Geometrically, it quantifies the average variation of the dataset around its center. It is a deterministic counterpart to the variance of a random variable.

Definition 1.2 (Empirical variance and standard deviation). Let {x_1, x_2, ..., x_n} be a set of real-valued data. The empirical variance is defined as

    var(x_1, x_2, ..., x_n) := (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, x_2, ..., x_n))^2.    (3)

The sample standard deviation is the square root of the empirical variance,

    std(x_1, x_2, ..., x_n) := \sqrt{var(x_1, x_2, ..., x_n)}.    (4)

You might be wondering why the normalizing constant is 1/(n-1) instead of 1/n. The reason is that this ensures that the expectation of the empirical variance equals the true variance when the data are iid (see Lemma 2.6). In practice there is not much difference between the two normalizations.

The empirical standard deviation of the temperature data in Figure 1 is 1.99 °C in January and 1.73 °C in August. The empirical standard deviation of the GDP data in Figure 2 is $ .
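As a quick numerical illustration of Definitions 1.1 and 1.2 (a sketch with made-up values, not the Oxford data), note that numpy normalizes the variance by 1/n unless ddof=1 is specified, so the latter is needed to match the 1/(n-1) definition used here.

```python
import numpy as np

x = np.array([3.1, 5.4, 6.8, 7.2, 4.9, 6.1])   # hypothetical measurements

emp_mean = x.mean()                  # av(x_1, ..., x_n), equation (1)
emp_var = x.var(ddof=1)              # 1/(n-1) normalization, equation (3)
emp_std = x.std(ddof=1)              # square root of the empirical variance, (4)

# Centering: subtract the empirical mean (a common preprocessing step).
centered = x - emp_mean

print(emp_mean, emp_var, emp_std)
print(centered.mean())               # approximately 0 after centering
```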

1.3 Order statistics

In some cases, a dataset is well described by its mean and standard deviation: in January the temperature in Oxford is around 6.73 °C, give or take 2 °C. This is a pretty accurate account of the temperature data from the previous section. However, imagine that someone describes the GDP dataset in Figure 2 as: countries typically have a GDP per capita of about $ , give or take $ . This description is pretty terrible. The problem is that most countries have very small GDPs per capita, whereas a few have really large ones, and the empirical mean and standard deviation do not really convey this information. Order statistics provide an alternative description, which is usually more informative in the presence of extreme values.

Definition 1.3 (Quantiles and percentiles). Let x_(1) <= x_(2) <= ... <= x_(n) denote the ordered elements of a set of data {x_1, x_2, ..., x_n}. The q quantile of the data, for 0 < q < 1, is x_([qn+1]). [3] The p/100 quantile is known as the p percentile. The 0.25 and 0.75 quantiles are known as the first and third quartiles, whereas the 0.5 quantile is known as the empirical median. A quarter of the data are smaller than the 0.25 quantile, half are smaller (or larger) than the median, and three quarters are smaller than the 0.75 quantile. If n is even, the empirical median is usually set to

    (x_(n/2) + x_(n/2+1)) / 2.    (5)

The difference between the third and the first quartile is known as the interquartile range (IQR).

[3] [qn+1] is the result of rounding qn+1 to the closest integer.

It turns out that for the temperature dataset in Figure 1 the empirical median is 6.80 °C in January and 21.2 °C in August, which is essentially the same as the empirical mean. The IQR is 2.9 °C in January and 2.1 °C in August, which indicates a spread around the median very similar to the one suggested by the empirical standard deviation. In this particular example, there does not seem to be an advantage in using order statistics.

For the GDP dataset, the median is $ . This means that half of the countries have a GDP per capita of less than $ . In contrast, 71% of the countries have a GDP per capita lower than the empirical mean! The IQR of these data is $ .

To provide a more complete description of the dataset, we can list a five-number summary of order statistics: the minimum x_(1), the first quartile, the empirical median, the third quartile and the maximum x_(n). For the GDP dataset these are $130, $1 960, $6 350, $ , and $ , respectively.

We can visualize the main order statistics of a dataset by using a box plot, which shows the median value of the data enclosed in a box. The bottom and top of the box are the first and third quartiles. This way of visualizing a dataset was proposed by the mathematician John Tukey. Tukey's box plot also includes whiskers. The lower whisker is a line extending from the bottom of the box to the smallest value within 1.5 IQR of the first quartile. The upper whisker extends from the top of the box to the largest value within 1.5 IQR of the third quartile. Values beyond the whiskers are considered outliers and are plotted separately.
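The quantities in this section are straightforward to compute. The sketch below (with hypothetical skewed data standing in for the GDP dataset, not the real values) obtains the quartiles, the five-number summary, the IQR, and the points that Tukey's rule would mark as outliers.

```python
import numpy as np

# Hypothetical, heavily skewed GDP-per-capita-like data (in dollars).
rng = np.random.default_rng(1)
gdp = rng.lognormal(mean=8.8, sigma=1.3, size=200)

q1, median, q3 = np.quantile(gdp, [0.25, 0.5, 0.75])
iqr = q3 - q1

five_number_summary = (gdp.min(), q1, median, q3, gdp.max())
print("five-number summary:", np.round(five_number_summary, 1))
print("IQR:", round(iqr, 1))

# Tukey's rule for the whiskers/outliers of a box plot.
outliers = gdp[(gdp < q1 - 1.5 * iqr) | (gdp > q3 + 1.5 * iqr)]
print("number of outliers:", outliers.size)
```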

Figure 3: Box plots of the Oxford temperature dataset used in Figure 1 (vertical axis in degrees Celsius). Each box plot corresponds to the maximum temperature in a particular month (January, April, August and November) over the last 150 years.

Figure 4: Box plot of the GDP per capita in 2014 (in thousands of dollars). Not all of the outliers are shown.

Figure 3 applies box plots to visualize the temperature dataset used in Figure 1. Each box plot corresponds to the maximum temperature in a particular month (January, April, August and November) over the last 150 years. The box plots allow us to quickly compare the spread of temperatures in the different months. Figure 4 shows a box plot of the GDP data from Figure 2. From the box plot it is immediately apparent that most countries have very small GDPs per capita, that the spread between countries increases for larger GDPs per capita, and that a small number of countries have very large GDPs per capita.

1.4 Empirical covariance

In the previous sections we mostly considered datasets consisting of one-dimensional data (except when we discussed the empirical mean of a multidimensional dataset). In machine-learning lingo, there was only one feature per data point. We now study a multidimensional scenario, where there are several features associated to each data point.

If the dimension of the dataset equals two (i.e. there are two features per data point), we can visualize the data using a scatter plot, where each axis represents one of the features. Figure 5 shows two scatter plots of temperature data. These data are the same as in Figure 1, but we have now arranged them to form two-dimensional datasets. In the plot on the left, one dimension corresponds to the temperature in January and the other dimension to the temperature in August (there is one data point per year). In the plot on the right, one dimension represents the minimum temperature in a particular month and the other dimension represents the maximum temperature in the same month (there is one data point per month).

The empirical covariance quantifies whether the two features of a two-dimensional dataset tend to vary in a similar way on average, just as the covariance quantifies the expected joint variation of two random variables. In order to take into account that each individual feature may vary on a different scale, a common preprocessing step is to normalize each feature, dividing it by its empirical standard deviation. If we normalize before computing the covariance, we obtain the empirical correlation coefficient of the two features. If one of the features represents distance, for example, its correlation coefficient with another feature does not depend on the unit that we are using, but its covariance does.

Definition 1.4 (Empirical covariance). Let {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be a dataset where each example consists of two features. The empirical covariance is defined as

    cov((x_1, y_1), ..., (x_n, y_n)) := (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, ..., x_n)) (y_i - av(y_1, ..., y_n)).    (6)

Figure 5: Scatter plot of the temperature in January and in August (left; ρ = 0.69) and of the minimum and maximum monthly temperature (right; ρ = 0.96) in Oxford over the last 150 years.

Definition 1.5 (Empirical correlation coefficient). Let {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be a dataset where each example consists of two features. The empirical correlation coefficient is defined as

    ρ((x_1, y_1), ..., (x_n, y_n)) := cov((x_1, y_1), ..., (x_n, y_n)) / ( std(x_1, ..., x_n) std(y_1, ..., y_n) ).    (7)

By the Cauchy-Schwarz inequality from linear algebra, which states that for any vectors a and b

    -1 <= a^T b / ( ||a||_2 ||b||_2 ) <= 1,    (8)

the magnitude of the empirical correlation coefficient is bounded by one. If it is equal to 1 or -1, then the two centered datasets are collinear. This Cauchy-Schwarz inequality is related to the Cauchy-Schwarz inequality for random variables (Theorem 4.7 in Lecture Notes 4), but here it applies to deterministic vectors.

Figure 5 shows the empirical correlation coefficients corresponding to the two plots. Maximum and minimum temperatures within the same month are highly correlated, whereas the maximum temperatures in January and August within the same year are only somewhat correlated.
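The following sketch (with synthetic features rather than the temperature data) computes the empirical covariance and correlation coefficient exactly as in equations (6) and (7), and cross-checks the results against numpy's built-in estimators.

```python
import numpy as np

# Two hypothetical features measured on the same n examples.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.7 * x + 0.3 * rng.normal(size=100)   # correlated with x by construction

n = x.size
cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)     # equation (6)
rho = cov / (x.std(ddof=1) * y.std(ddof=1))                 # equation (7)

# Cross-check against numpy (np.cov also uses the 1/(n-1) normalization).
assert np.isclose(cov, np.cov(x, y)[0, 1])
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(f"empirical covariance: {cov:.3f}, correlation coefficient: {rho:.3f}")
```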

1.5 Covariance matrix and principal component analysis

We now turn to the problem of describing how a set of multidimensional data varies around its center when the dimension is larger than two. We begin by defining the empirical covariance matrix, which contains the pairwise empirical covariance between every two features.

Definition 1.6 (Empirical covariance matrix). Let {x_1, x_2, ..., x_n} be a set of d-dimensional real-valued data vectors. The empirical covariance matrix of these data is the d x d matrix

    Σ(x_1, ..., x_n) := (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, ..., x_n)) (x_i - av(x_1, ..., x_n))^T.    (9)

The (i, j) entry of the covariance matrix, where 1 <= i, j <= d, is given by

    Σ(x_1, ..., x_n)_{ij} = var(x_1[i], ..., x_n[i])                         if i = j,
                            cov((x_1[i], x_1[j]), ..., (x_n[i], x_n[j]))     if i != j,    (10)

where x[i] denotes the ith entry of the vector x.

In order to characterize the variation of a multidimensional dataset around its center, we consider its variation in different directions. In particular, we are interested in determining in what directions it varies more and in what directions it varies less. The average variation of the data in a certain direction is quantified by the empirical variance of the projections of the data onto that direction. Let v be a unit-norm vector aligned with a direction of interest; the empirical variance of the dataset in that direction is given by

    var(v^T x_1, ..., v^T x_n) = (1/(n-1)) \sum_{i=1}^{n} ( v^T x_i - av(v^T x_1, ..., v^T x_n) )^2    (11)
                               = (1/(n-1)) \sum_{i=1}^{n} ( v^T (x_i - av(x_1, ..., x_n)) )^2    (12)
                               = v^T ( (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, ..., x_n)) (x_i - av(x_1, ..., x_n))^T ) v
                               = v^T Σ(x_1, ..., x_n) v.    (13)

Using the empirical covariance matrix we can express the variation in every direction! This is a deterministic analog of the fact that the covariance matrix of a random vector encodes its variance in every direction.

To find the direction in which variation is maximal, we need to maximize the quadratic form v^T Σ(x_1, ..., x_n) v over all unit-norm vectors v. Consider the eigendecomposition of the covariance matrix

    Σ(x_1, ..., x_n) = U Λ U^T    (14)
                     = [u_1 u_2 ... u_d] diag(λ_1, λ_2, ..., λ_d) [u_1 u_2 ... u_d]^T.    (15)

Figure 6: PCA of a two-dimensional dataset with n = 20 data points, for different configurations of the data; the arrows show the principal components u_1 and u_2.

By definition, Σ(x_1, ..., x_n) is symmetric, so its eigenvectors are orthogonal. Furthermore, the eigenvectors and eigenvalues have a very intuitive interpretation in terms of the quadratic form of interest.

Theorem 1.7. For any symmetric matrix A ∈ R^{n x n} with normalized eigenvectors u_1, u_2, ..., u_n and corresponding eigenvalues λ_1 >= λ_2 >= ... >= λ_n,

    λ_1 = max_{||v||_2 = 1} v^T A v,    (16)
    u_1 = arg max_{||v||_2 = 1} v^T A v,    (17)
    λ_k = max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k-1}} v^T A v,    (18)
    u_k = arg max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k-1}} v^T A v.    (19)

The maximum of v^T Σ(x_1, ..., x_n) v is equal to the largest eigenvalue λ_1 of Σ(x_1, ..., x_n) and is attained by the corresponding eigenvector u_1. This means that u_1 is the direction of maximum variation. Moreover, the eigenvector u_2 corresponding to the second largest eigenvalue λ_2 is the direction of maximum variation that is orthogonal to u_1. In general, the eigenvector u_k corresponding to the kth largest eigenvalue λ_k reveals the direction of maximum variation that is orthogonal to u_1, u_2, ..., u_{k-1}. Finally, u_d is the direction of minimum variation.

In data analysis, the eigenvectors of the sample covariance matrix are usually called principal components. Computing these eigenvectors to quantify the variation of a dataset in different directions is called principal component analysis (PCA). Figure 6 shows the principal components for several two-dimensional examples.
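To make the connection between the empirical covariance matrix and PCA concrete, the sketch below (a hypothetical two-dimensional Gaussian dataset, not the data in Figure 6) forms Σ as in equation (9), computes its eigendecomposition, and verifies that the empirical variance of the projections onto u_1 equals λ_1, as implied by equation (13) and Theorem 1.7.

```python
import numpy as np

# Hypothetical 2-dimensional dataset (one row per data point).
rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500)

centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / (X.shape[0] - 1)     # empirical covariance, (9)

# Eigendecomposition of the symmetric covariance matrix, as in (14)-(15).
eigenvalues, U = np.linalg.eigh(Sigma)               # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]                # sort by decreasing eigenvalue
eigenvalues, U = eigenvalues[order], U[:, order]

print("variance along the principal components:", np.round(eigenvalues, 2))
print("first principal component u_1:", np.round(U[:, 0], 3))

# The empirical variance of the projections onto u_1 matches lambda_1, cf. (13).
proj = centered @ U[:, 0]
print("var of projections onto u_1:", round(proj.var(ddof=1), 2))
```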

Figure 7: Projection of 7-dimensional vectors describing different wheat seeds onto the first two (left; axes: projections onto the first and second principal components) and the last two (right; axes: projections onto the (d-1)th and dth principal components) principal components of the dataset. Each color represents a variety of wheat.

The following example explains how to apply principal component analysis to dimensionality reduction. The motivation is that in many cases directions of higher variation are more informative about the structure of the dataset.

Example 1.8 (Dimensionality reduction via PCA). We consider a dataset where each data point corresponds to a seed, which has seven features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove. The seeds belong to three different varieties of wheat: Kama, Rosa and Canadian. [4] Our aim is to visualize the data by projecting them down to two dimensions in a way that preserves as much variation as possible. This can be achieved by projecting each point onto the first two principal components of the dataset. Figure 7 shows the projection of the data onto the first two and the last two principal components. In the latter case, there is almost no discernible variation. The structure of the data is much better conserved by the first two components, which allow us to clearly visualize the difference between the three types of seeds. Note however that projection onto the first principal components only ensures that we preserve as much variation as possible, not that the projection will be good for tasks such as classification.

[4] The data can be found at
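A minimal sketch of the dimensionality-reduction procedure described in Example 1.8 is given below; it uses random stand-in data of the same shape as the seed dataset, since the actual features are not reproduced here.

```python
import numpy as np

# Stand-in for the 7-dimensional seed features: n = 210 points in d = 7 dimensions.
rng = np.random.default_rng(4)
X = rng.normal(size=(210, 7)) @ rng.normal(size=(7, 7))

centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / (X.shape[0] - 1)
eigenvalues, U = np.linalg.eigh(Sigma)
U = U[:, np.argsort(eigenvalues)[::-1]]      # columns sorted by decreasing variance

# Project each (centered) data point onto the first two principal components.
projection_2d = centered @ U[:, :2]
print(projection_2d.shape)                   # (210, 2): ready for a scatter plot
```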

2 Statistical estimation

The goal of statistical estimation is to extract information from data. In this section we model the data as a realization of an iid sequence of random variables. This assumption allows us to analyze statistical estimation using probabilistic tools, such as the law of large numbers and the central limit theorem. We study how to approximate a deterministic parameter associated to the underlying distribution of the iid sequence, for example its mean. This is a frequentist framework, as opposed to Bayesian approaches where the parameters of interest are modeled as random quantities. We will study Bayesian statistics later on in the course.

We define an estimator as a deterministic function h which provides an approximation y_n to a parameter of interest γ from the data x_1, x_2, ..., x_n,

    y_n := h(x_1, x_2, ..., x_n).    (20)

Under the assumption that the data are a realization of an iid sequence X, the estimators for different numbers of samples can be interpreted as a random sequence,

    Ỹ_n := h(X_1, X_2, ..., X_n),    (21)

which we can analyze probabilistically. In particular, we consider the following questions:

- Is the estimator guaranteed to produce an arbitrarily good approximation from arbitrarily large amounts of data? More formally, does Ỹ_n converge to γ as n → ∞?

- For finite n, what is the probability that γ is approximated by the estimator up to a certain accuracy?

Before answering these questions, we show that our framework applies to an important scenario: estimating a descriptive statistic of a large population from a randomly chosen subset of individuals.

Example 2.1 (Sampling from a population). Assume that we are studying a population of m individuals. We are interested in a certain feature associated to each person, e.g. their cholesterol level, their salary or who they are voting for in an election. There are k possible values for the feature, {z_1, z_2, ..., z_k}, where k can be equal to m or much smaller. We denote by m_j the number of people for whom the feature is equal to z_j, 1 <= j <= k. In the case of an election with two candidates, k would equal two and m_1 and m_2 would represent the number of people voting for each of the candidates, respectively.

Our goal is to estimate a descriptive statistic of the population, but we can only measure the feature of interest for a reduced number of individuals. If we choose those individuals uniformly at random, then the measurements can be modeled as a sequence X with first-order pmf

    p_{X_i}(z_j) = P(the feature of the ith chosen person equals z_j)    (22)
                 = m_j / m,    1 <= j <= k.    (23)

If we sample with replacement (an individual can be chosen several times), every sample has the same pmf and the different samples are independent, so the data are an iid sequence.
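The following sketch illustrates Example 2.1 with a hypothetical two-candidate population (not part of the original notes): sampling with replacement yields iid measurements whose pmf is given by the population frequencies m_j / m.

```python
import numpy as np

# Hypothetical population of m individuals with a categorical feature.
rng = np.random.default_rng(5)
population = rng.choice(["candidate A", "candidate B"], size=5000, p=[0.55, 0.45])

# Population pmf: m_j / m for each possible value z_j, as in (22)-(23).
true_pmf = {z: np.mean(population == z) for z in np.unique(population)}

# Sampling n individuals uniformly at random *with replacement* yields iid data.
sample = rng.choice(population, size=200, replace=True)
sample_freq = {z: np.mean(sample == z) for z in np.unique(sample)}

print("population pmf:     ", true_pmf)
print("sample frequencies: ", sample_freq)
```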

2.1 Mean square error

As we discussed in Lecture Notes 6 when describing convergence in mean square, the mean square of the difference between two random variables is a reasonable measure of how close they are to each other. The mean square error of an estimator quantifies how accurately it approximates the quantity of interest.

Definition 2.2 (Mean square error). The mean square error (MSE) of an estimator Y that approximates a parameter γ is

    MSE(Y) := E( (Y - γ)^2 ).    (24)

The MSE can be decomposed into a bias term and a variance term. The bias term is the difference between the parameter of interest and the expected value of the estimator. The variance term corresponds to the variation of the estimator around its expected value.

Lemma 2.3 (Bias-variance decomposition). The MSE of an estimator Y that approximates a parameter γ satisfies

    MSE(Y) = E( (Y - E(Y))^2 ) + ( E(Y) - γ )^2,    (25)

where the first term is the variance of the estimator and the second term is its squared bias.

Proof. The lemma is a direct consequence of linearity of expectation.

If the bias is zero, then the estimator equals the quantity of interest on average.

Definition 2.4 (Unbiased estimator). An estimator Y that approximates a parameter γ is unbiased if its bias is equal to zero, i.e. if and only if

    E(Y) = γ.    (26)

An estimator may be unbiased but still incur a large mean square error due to its variance.
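The bias-variance decomposition of Lemma 2.3 is easy to verify numerically. The sketch below (assuming Gaussian data, which the lemma does not require) compares the 1/(n-1) and 1/n versions of the empirical variance as estimators of the true variance: the MSE matches the sum of the variance and the squared bias in both cases, and only the first version is (approximately) unbiased.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 2.0, 3.0, 10            # assumed true parameters
gamma = sigma ** 2                     # parameter of interest: the variance

estimates_unbiased = []                # 1/(n-1) normalization
estimates_biased = []                  # 1/n normalization
for _ in range(100_000):
    x = rng.normal(mu, sigma, size=n)
    estimates_unbiased.append(x.var(ddof=1))
    estimates_biased.append(x.var(ddof=0))

for name, est in [("1/(n-1)", np.array(estimates_unbiased)),
                  ("1/n    ", np.array(estimates_biased))]:
    mse = np.mean((est - gamma) ** 2)
    variance = est.var()
    bias_sq = (est.mean() - gamma) ** 2
    # MSE should match variance + squared bias, as in Lemma 2.3.
    print(f"{name}: MSE={mse:.3f}  var+bias^2={variance + bias_sq:.3f}  "
          f"bias={est.mean() - gamma:+.3f}")
```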

The following lemmas establish that the empirical mean and variance are unbiased estimators of the true mean and variance of an iid sequence of random variables.

Lemma 2.5 (The empirical mean is unbiased). The empirical mean is an unbiased estimator of the mean of an iid sequence of random variables.

Proof. We consider the empirical mean of an iid sequence X with mean µ,

    Ỹ_n := (1/n) \sum_{i=1}^{n} X_i.    (27)

By linearity of expectation,

    E(Ỹ_n) = (1/n) \sum_{i=1}^{n} E(X_i)    (28)
           = µ.    (29)

Lemma 2.6 (The empirical variance is unbiased). The empirical variance is an unbiased estimator of the variance of an iid sequence of random variables.

The proof of this result is in Section A of the appendix.

2.2 Consistency

Intuitively, if we are estimating a scalar quantity, the estimate should improve as we gather more data. In fact, ideally the estimate should converge to the true parameter in the limit when the number of data n → ∞. Estimators that achieve this are said to be consistent.

Definition 2.7 (Consistency). An estimator Ỹ_n := h(X_1, X_2, ..., X_n) that approximates a parameter γ is consistent if it converges to γ as n → ∞ in mean square, with probability one or in probability.

The following theorem shows that the empirical mean is consistent.

Theorem 2.8 (The empirical mean is consistent). The empirical mean is a consistent estimator of the mean of an iid sequence of random variables, as long as the variance of the sequence is bounded.

Proof. We consider the empirical mean of an iid sequence X with mean µ,

    Ỹ_n := (1/n) \sum_{i=1}^{n} X_i.    (30)

The estimator is equal to the moving average of the data. As a result it converges to µ in mean square (and with probability one) by the law of large numbers (Theorem 3.2 in Lecture Notes 6), as long as the variance σ^2 of each of the entries of the iid sequence is bounded.

Example 2.9 (Estimating the average height). In this example we illustrate the consistency of the empirical mean. Imagine that we want to estimate the mean height in a population. To be concrete, we will consider a population of m := 5000 people. Figure 8 shows a histogram of their heights. [5] As explained in Example 2.1, if we sample n individuals from this population with replacement, then their heights form an iid sequence X. The mean of this sequence is

    E(X_i) = \sum_{j=1}^{m} P(person j is chosen) · (height of person j)    (31)
           = (1/m) \sum_{j=1}^{m} h_j    (32)
           = av(h_1, ..., h_m)    (33)

for 1 <= i <= n, where h_1, ..., h_m are the heights of the people. In addition, the variance is bounded because the heights are finite. By Theorem 2.8 the empirical mean of the n data should converge to the mean of the iid sequence, and hence to the average height over the whole population. Figure 9 illustrates this numerically.

If the mean of the underlying distribution is not well defined, or its variance is unbounded, then the empirical mean is not necessarily a consistent estimator. This is related to the fact that the empirical mean can be severely affected by the presence of extreme values, as we discussed in Section 1.2. The empirical median, in contrast, tends to be more robust in such situations, as discussed in Section 1.3. The following theorem establishes that the empirical median is consistent, even if the mean is not well defined or the variance is unbounded. The proof is in Section B of the appendix.

Theorem 2.10 (Empirical median as an estimator of the median). The empirical median is a consistent estimator of the median of an iid sequence of random variables.

Figure 10 compares the moving average and the moving median of an iid sequence of Cauchy random variables for three different realizations. The moving average is unstable and does not converge no matter how many data are available, which is not surprising because the mean is not well defined. In contrast, the moving median does eventually converge to the true median, as predicted by Theorem 2.10.

[5] The dataset can be found here: wiki.stat.ucla.edu/socr/index.php/socr_data_dinov_008_HeightsWeights.
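The contrast between the empirical mean and the empirical median for heavy-tailed data, illustrated in Figure 10, can be reproduced with a few lines of code. The sketch below uses freshly generated Cauchy samples rather than the realizations shown in the figure; the true median of the standard Cauchy distribution is 0.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.standard_cauchy(size=100_000)    # heavy tails: the mean is undefined

for n in [100, 1_000, 10_000, 100_000]:
    running_mean = data[:n].mean()          # does not stabilize as n grows
    running_median = np.median(data[:n])    # converges to the true median, 0
    print(f"n={n:>7}: empirical mean={running_mean:+8.2f}  "
          f"empirical median={running_median:+.3f}")
```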

Figure 8: Histogram of the heights (in inches) of a group of people.

Figure 9: Different realizations of the empirical mean, as a function of n, when individuals from the population in Figure 8 are sampled with replacement, compared to the true mean (heights in inches).

Figure 10: Realizations of the moving average (empirical mean) of an iid Cauchy sequence (top row) compared to the moving median (empirical median, bottom row). Each panel also shows the median of the iid sequence; each column corresponds to a different realization.

Figure 11: Principal components of n samples from a bivariate Gaussian distribution (red) compared to the eigenvectors of the covariance matrix of the distribution (black), for increasing values of n.

The empirical variance and covariance are consistent estimators of the variance and covariance respectively, under certain assumptions on the higher moments of the underlying distributions. This provides an intuitive interpretation for PCA under the assumption that the data are realizations of an iid sequence of random vectors: the principal components approximate the eigenvectors of the true covariance matrix, and hence the directions of maximum variance of the multidimensional distribution. Figure 11 illustrates this with a numerical example, where the principal components indeed converge to the eigenvectors as the number of data increases.

2.3 Confidence intervals

Consistency implies that an estimator will be perfect if we acquire infinite data, but this is of course impossible in practice. It is therefore very important to quantify the accuracy of an estimator for a fixed number of data. Confidence intervals allow us to do this from a frequentist point of view. A confidence interval can be interpreted as a soft estimate of the deterministic parameter of interest, which guarantees that the parameter will belong to the interval with a certain probability.

Definition 2.11 (Confidence interval). A 1 - α confidence interval I for a parameter γ satisfies

    P(γ ∈ I) >= 1 - α,    (34)

where 0 < α < 1.

Confidence intervals are usually of the form [Y - c, Y + c], where Y is an estimator of the quantity of interest and c is a constant that depends on the number of data.

The following theorem shows how to derive a confidence interval for the mean of data that are modeled as an iid sequence. The confidence interval is centered at the empirical mean.

Theorem 2.12 (Confidence interval for the mean of an iid sequence). Let X be an iid sequence with mean µ and variance σ^2 <= b^2 for some b > 0. For any 0 < α < 1,

    I_n := [ Y_n - b / \sqrt{α n},  Y_n + b / \sqrt{α n} ],    Y_n := av(X_1, X_2, ..., X_n),    (35)

is a 1 - α confidence interval for µ.

Proof. Recall that the variance of Y_n equals Var(Y_n) = σ^2 / n (see the proof of Theorem 3.2 in Lecture Notes 6). We have

    P( µ ∈ [ Y_n - b / \sqrt{α n},  Y_n + b / \sqrt{α n} ] ) = 1 - P( |Y_n - µ| > b / \sqrt{α n} )    (36)
        >= 1 - α n Var(Y_n) / b^2    by Chebyshev's inequality    (37)
        = 1 - α σ^2 / b^2    (38)
        >= 1 - α.    (39)

The width of the interval provided in the theorem decreases with n for fixed α, which makes sense, as incorporating more data reduces the variance of the estimator and hence our uncertainty about it.

Example 2.13 (Bears in Yosemite). A scientist is trying to estimate the average weight of the black bears in Yosemite National Park. She manages to capture 300 bears. We assume that the bears are sampled uniformly at random with replacement (a bear can be weighed more than once). Under this assumption, in Example 2.1 we showed that the data can be modeled as iid samples, and in Example 2.9 we showed that the empirical mean is a consistent estimator of the mean of the whole population.

The average weight of the 300 captured bears is Y := 200 lbs. To derive a confidence interval from this information we need a bound on the variance. The maximum weight ever recorded for a black bear is 880 lbs. Let µ and σ^2 be the unknown mean and variance of the weights of the whole population. If X is the weight of a bear chosen uniformly at random from the whole population, then X has mean µ and variance σ^2, so

    σ^2 = E(X^2) - E(X)^2    (40)
        <= E(X^2)    (41)
        <= 880^2    because X <= 880.    (42)

As a result, 880 is an upper bound for the standard deviation. Applying Theorem 2.12,

    [ Y - b / \sqrt{α n},  Y + b / \sqrt{α n} ] = [-27.2, 427.2]    (43)

is a 95% confidence interval for the average weight of the whole population. The interval is not very precise because n is not very large.

As illustrated by this example, confidence intervals derived from Chebyshev's inequality tend to be very conservative. An alternative is to leverage the central limit theorem (CLT). The CLT characterizes the distribution of the empirical mean asymptotically, so confidence intervals derived from it are not guaranteed to be precise for finite n. However, the CLT often provides a very accurate approximation to the distribution of the empirical mean for finite n, as we showed through some numerical examples in Lecture Notes 6.

In order to obtain confidence intervals for the mean of an iid sequence from the CLT as stated in Lecture Notes 6, we would need to have access to the true variance of the sequence, which is unrealistic in practice. However, the following result states that we can substitute the true variance with the empirical variance. The proof is beyond the scope of these notes.

Theorem 2.14 (Central limit theorem with empirical standard deviation). Let X be an iid discrete random process with mean µ_X := µ such that its variance and fourth moment E(X_i^4) are bounded. The sequence

    \sqrt{n} ( av(X_1, ..., X_n) - µ ) / std(X_1, ..., X_n)    (44)

converges in distribution to a standard Gaussian random variable.

Recall that the cdf of a standard Gaussian does not have a closed-form expression. To simplify notation, we express the confidence interval in terms of the Q function.

Definition 2.15 (Q function). Q(x) is the probability that a standard Gaussian random variable is greater than x, for positive x:

    Q(x) := \int_{u=x}^{∞} ( 1 / \sqrt{2 π} ) exp( -u^2 / 2 ) du,    x > 0.    (45)

By symmetry, if U is a standard Gaussian random variable and y < 0,

    P(U < y) = Q(-y).    (46)

Corollary 2.16 (Approximate confidence interval for the mean). Let X be an iid sequence that satisfies the conditions of Theorem 2.14. For any 0 < α < 1,

    I_n := [ Y_n - (S_n / \sqrt{n}) Q^{-1}(α/2),  Y_n + (S_n / \sqrt{n}) Q^{-1}(α/2) ],    (47)
    Y_n := av(X_1, X_2, ..., X_n),    (48)
    S_n := std(X_1, X_2, ..., X_n),    (49)

is an approximate 1 - α confidence interval for µ, i.e.

    P(µ ∈ I_n) ≈ 1 - α.    (50)

Proof. By the central limit theorem, when n → ∞, Y_n is approximately distributed as a Gaussian random variable with mean µ and variance σ^2 / n. As a result,

    P(µ ∈ I_n) = 1 - P( Y_n > µ + (S_n / \sqrt{n}) Q^{-1}(α/2) ) - P( Y_n < µ - (S_n / \sqrt{n}) Q^{-1}(α/2) )    (51)
               = 1 - P( \sqrt{n} (Y_n - µ) / S_n > Q^{-1}(α/2) ) - P( \sqrt{n} (Y_n - µ) / S_n < -Q^{-1}(α/2) )    (52)
               ≈ 1 - Q( Q^{-1}(α/2) ) - Q( Q^{-1}(α/2) )    by Theorem 2.14    (53)
               = 1 - α.    (54)

It is important to stress that the result only provides an accurate confidence interval if n is large enough for the empirical variance to converge to the true variance and for the CLT to take effect.

Example 2.17 (Bears in Yosemite, continued). The empirical standard deviation of the bears captured by the scientist equals 100 lbs. We apply Corollary 2.16 to derive an approximate confidence interval that is tighter than the one obtained by applying Chebyshev's inequality. Given that Q^{-1}(0.025) ≈ 1.96,

    [ Y - (S / \sqrt{n}) Q^{-1}(α/2),  Y + (S / \sqrt{n}) Q^{-1}(α/2) ] ≈ [188.8, 211.3]    (55)

is an approximate 95% confidence interval for the mean weight of the population of bears.
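The two confidence intervals for the bear example can be recomputed as follows. This is a sketch that assumes the summary statistics quoted in Examples 2.13 and 2.17 (n = 300, empirical mean 200 lbs, bound b = 880 lbs, empirical standard deviation 100 lbs) and uses scipy to evaluate the inverse Q function; it is not part of the original notes.

```python
import numpy as np
from scipy.stats import norm

n, alpha = 300, 0.05
y = 200.0          # empirical mean of the captured bears (lbs)
b = 880.0          # upper bound on the standard deviation (max recorded weight)
s = 100.0          # empirical standard deviation (assumed value for illustration)

# Chebyshev-based 95% confidence interval, Theorem 2.12.
half_cheb = b / np.sqrt(alpha * n)
print("Chebyshev interval:", (round(y - half_cheb, 1), round(y + half_cheb, 1)))

# CLT-based approximate 95% confidence interval, Corollary 2.16.
# Q^{-1}(alpha/2) is the (1 - alpha/2) quantile of the standard Gaussian.
q_inv = norm.ppf(1 - alpha / 2)
half_clt = s * q_inv / np.sqrt(n)
print("CLT interval:      ", (round(y - half_clt, 1), round(y + half_clt, 1)))
```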

Figure 12: 95% confidence intervals for the average height of the population in Example 2.9, computed from samples of increasing size n; each plot also shows the true mean.

Interpreting confidence intervals is somewhat tricky. After computing the confidence interval in Example 2.17, one is tempted to state: "The probability that the average weight is between 188.8 and 211.3 lbs is 0.95." However, we are modeling the average weight as a deterministic parameter, so there are no random quantities in this statement! The correct interpretation is that if we repeat the process of sampling the population and computing the confidence interval, then the true parameter will lie in the interval 95% of the time. This is illustrated in the following example and in Figure 12.

Example 2.18 (Estimating the average height, continued). Figure 12 shows several 95% confidence intervals for the average height of the population in Example 2.9. To compute each interval we select n individuals and then apply Corollary 2.16. The width of the intervals decreases as n grows, but because they are all 95% confidence intervals they each contain the true average with probability 0.95. Indeed, this is the case for 113 (94%) of the intervals that are plotted.
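The frequentist interpretation can also be checked by simulation. The sketch below (assuming a hypothetical population of heights and using scipy for the Gaussian quantile) repeatedly draws samples, builds the interval of Corollary 2.16, and counts how often it contains the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
population = rng.normal(loc=69.0, scale=3.5, size=25_000)   # hypothetical heights
true_mean = population.mean()

n, alpha, n_trials = 200, 0.05, 1_000
q_inv = norm.ppf(1 - alpha / 2)     # Q^{-1}(alpha/2)

covered = 0
for _ in range(n_trials):
    sample = rng.choice(population, size=n, replace=True)
    half_width = sample.std(ddof=1) * q_inv / np.sqrt(n)
    if abs(sample.mean() - true_mean) <= half_width:
        covered += 1

# The fraction of intervals containing the true mean should be close to 1 - alpha.
print(f"coverage over {n_trials} intervals: {covered / n_trials:.3f}")
```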

A Proof of Lemma 2.6

We consider the empirical variance of an iid sequence X with mean µ and variance σ^2,

    Ỹ_n := (1/(n-1)) \sum_{i=1}^{n} ( X_i - av(X_1, ..., X_n) )^2    (56)
         = (1/(n-1)) \sum_{i=1}^{n} ( X_i - (1/n) \sum_{j=1}^{n} X_j )^2    (57)
         = (1/(n-1)) \sum_{i=1}^{n} ( X_i^2 + (1/n^2) \sum_{j=1}^{n} \sum_{k=1}^{n} X_j X_k - (2/n) X_i \sum_{j=1}^{n} X_j ).    (58)

To simplify notation, we denote the mean square E(X_i^2) = µ^2 + σ^2 by ξ. We have

    E(Ỹ_n) = (1/(n-1)) \sum_{i=1}^{n} ( E(X_i^2) + (1/n^2) \sum_{j=1}^{n} \sum_{k=1}^{n} E(X_j X_k) - (2/n) \sum_{j=1}^{n} E(X_i X_j) )    (59)
           = (1/(n-1)) \sum_{i=1}^{n} ( ξ + ξ/n + (n-1) µ^2 / n - 2 ξ / n - 2 (n-1) µ^2 / n )    (60)
           = (1/(n-1)) \sum_{i=1}^{n} ( (n-1)/n ) ( ξ - µ^2 )    (61)
           = ξ - µ^2    (62)
           = σ^2.    (63)

B Proof of Theorem 2.10

We denote the empirical median by Ỹ_n. Our aim is to show that for any ε > 0

    lim_{n → ∞} P( |Ỹ_n - γ| >= ε ) = 0.    (64)

We will prove that

    lim_{n → ∞} P( Ỹ_n >= γ + ε ) = 0.    (65)

The same argument allows us to establish

    lim_{n → ∞} P( Ỹ_n <= γ - ε ) = 0.    (66)

If we order the set {X_1, ..., X_n}, then Ỹ_n equals the (n+1)/2-th element if n is odd and the average of the (n/2)-th and the (n/2 + 1)-th elements if n is even. The event Ỹ_n >= γ + ε therefore implies that at least (n+1)/2 of the elements are larger than γ + ε.

For each individual X_i, the probability that X_i > γ + ε is

    p := 1 - F_{X_i}(γ + ε) = 1/2 - ε_2,    (67)

where we assume that ε_2 > 0. (If this is not the case, then the cdf of the iid sequence is flat at γ and the median is not well defined.) The number of random variables in the set {X_1, ..., X_n} which are larger than γ + ε is distributed as a binomial random variable B_n with parameters n and p. As a result, we have

    P( Ỹ_n >= γ + ε ) <= P( (n+1)/2 or more samples are greater than or equal to γ + ε )    (68)
                       = P( B_n >= (n+1)/2 )    (69)
                       = P( B_n - np >= (n+1)/2 - np )    (70)
                       <= P( |B_n - np| >= n ε_2 + 1/2 )    (71)
                       <= Var(B_n) / ( n ε_2 + 1/2 )^2    by Chebyshev's inequality    (72)
                       = n p (1-p) / ( n ε_2 + 1/2 )^2    (73)
                       = p (1-p) / ( n ( ε_2 + 1/(2n) )^2 ),    (74)

which converges to zero as n → ∞. This establishes (65).

Descriptive Statistics

Descriptive Statistics Descriptive Statistics DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Descriptive statistics Techniques to visualize

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

The Singular-Value Decomposition

The Singular-Value Decomposition Mathematical Tools for Data Science Spring 2019 1 Motivation The Singular-Value Decomposition The singular-value decomposition (SVD) is a fundamental tool in linear algebra. In this section, we introduce

More information

Lecture Notes 2: Matrices

Lecture Notes 2: Matrices Optimization-based data analysis Fall 2017 Lecture Notes 2: Matrices Matrices are rectangular arrays of numbers, which are extremely useful for data analysis. They can be interpreted as vectors in a vector

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Expectation. DS GA 1002 Probability and Statistics for Data Science. Carlos Fernandez-Granda

Expectation. DS GA 1002 Probability and Statistics for Data Science.   Carlos Fernandez-Granda Expectation DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Describe random variables with a few numbers: mean,

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

Expectation. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Expectation. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Expectation DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Aim Describe random variables with a few numbers: mean, variance,

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

DS-GA 1002 Lecture notes 12 Fall Linear regression

DS-GA 1002 Lecture notes 12 Fall Linear regression DS-GA Lecture notes 1 Fall 16 1 Linear models Linear regression In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable,

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Multivariate random variables

Multivariate random variables DS-GA 002 Lecture notes 3 Fall 206 Introduction Multivariate random variables Probabilistic models usually include multiple uncertain numerical quantities. In this section we develop tools to characterize

More information

TOPIC: Descriptive Statistics Single Variable

TOPIC: Descriptive Statistics Single Variable TOPIC: Descriptive Statistics Single Variable I. Numerical data summary measurements A. Measures of Location. Measures of central tendency Mean; Median; Mode. Quantiles - measures of noncentral tendency

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved. 1-1 Chapter 1 Sampling and Descriptive Statistics 1-2 Why Statistics? Deal with uncertainty in repeated scientific measurements Draw conclusions from data Design valid experiments and draw reliable conclusions

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Exploratory data analysis: numerical summaries

Exploratory data analysis: numerical summaries 16 Exploratory data analysis: numerical summaries The classical way to describe important features of a dataset is to give several numerical summaries We discuss numerical summaries for the center of a

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision) CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Convergence of Random Processes

Convergence of Random Processes Convergence of Random Processes DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Define convergence for random

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

Week 9 The Central Limit Theorem and Estimation Concepts

Week 9 The Central Limit Theorem and Estimation Concepts Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

L2: Review of probability and statistics

L2: Review of probability and statistics Probability L2: Review of probability and statistics Definition of probability Axioms and properties Conditional probability Bayes theorem Random variables Definition of a random variable Cumulative distribution

More information

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode. Chapter 3 Numerically Summarizing Data Chapter 3.1 Measures of Central Tendency Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode. A1. Mean The

More information

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis

More information

Linear regression. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Linear regression. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Linear regression DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall15 Carlos Fernandez-Granda Linear models Least-squares estimation Overfitting Example:

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer. Department of Computer Science Virginia Tech Blacksburg, Virginia Copyright c 2015 by Clifford A. Shaffer Computer Science Title page Computer Science Clifford A. Shaffer Fall 2015 Clifford A. Shaffer

More information

Learning Objectives for Stat 225

Learning Objectives for Stat 225 Learning Objectives for Stat 225 08/20/12 Introduction to Probability: Get some general ideas about probability, and learn how to use sample space to compute the probability of a specific event. Set Theory:

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

(Re)introduction to Statistics Dan Lizotte

(Re)introduction to Statistics Dan Lizotte (Re)introduction to Statistics Dan Lizotte 2017-01-17 Statistics The systematic collection and arrangement of numerical facts or data of any kind; (also) the branch of science or mathematics concerned

More information

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that? Tastitsticsss? What s that? Statistics describes random mass phanomenons. Principles of Biostatistics and Informatics nd Lecture: Descriptive Statistics 3 th September Dániel VERES Data Collecting (Sampling)

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes We Make Stats Easy. Chapter 4 Tutorial Length 1 Hour 45 Minutes Tutorials Past Tests Chapter 4 Page 1 Chapter 4 Note The following topics will be covered in this chapter: Measures of central location Measures

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

MATH4427 Notebook 4 Fall Semester 2017/2018

MATH4427 Notebook 4 Fall Semester 2017/2018 MATH4427 Notebook 4 Fall Semester 2017/2018 prepared by Professor Jenny Baglivo c Copyright 2009-2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH4427 Notebook 4 3 4.1 K th Order Statistics and Their

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Intelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham

Intelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham Intelligent Data Analysis Principal Component Analysis Peter Tiňo School of Computer Science University of Birmingham Discovering low-dimensional spatial layout in higher dimensional spaces - 1-D/3-D example

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

3.1 Measure of Center

3.1 Measure of Center 3.1 Measure of Center Calculate the mean for a given data set Find the median, and describe why the median is sometimes preferable to the mean Find the mode of a data set Describe how skewness affects

More information

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics Chapter 6 Order Statistics and Quantiles 61 Extreme Order Statistics Suppose we have a finite sample X 1,, X n Conditional on this sample, we define the values X 1),, X n) to be a permutation of X 1,,

More information

Math 180A. Lecture 16 Friday May 7 th. Expectation. Recall the three main probability density functions so far (1) Uniform (2) Exponential.

Math 180A. Lecture 16 Friday May 7 th. Expectation. Recall the three main probability density functions so far (1) Uniform (2) Exponential. Math 8A Lecture 6 Friday May 7 th Epectation Recall the three main probability density functions so far () Uniform () Eponential (3) Power Law e, ( ), Math 8A Lecture 6 Friday May 7 th Epectation Eample

More information

Lecture 2: Review of Basic Probability Theory

Lecture 2: Review of Basic Probability Theory ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent

More information

ECON3150/4150 Spring 2015

ECON3150/4150 Spring 2015 ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2

More information

Lecture Notes 6: Linear Models

Lecture Notes 6: Linear Models Optimization-based data analysis Fall 17 Lecture Notes 6: Linear Models 1 Linear regression 1.1 The regression problem In statistics, regression is the problem of characterizing the relation between a

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Regression Analysis. Ordinary Least Squares. The Linear Model

Regression Analysis. Ordinary Least Squares. The Linear Model Regression Analysis Linear regression is one of the most widely used tools in statistics. Suppose we were jobless college students interested in finding out how big (or small) our salaries would be 20

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

3. Review of Probability and Statistics

3. Review of Probability and Statistics 3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture

More information

Economics 241B Review of Limit Theorems for Sequences of Random Variables

Economics 241B Review of Limit Theorems for Sequences of Random Variables Economics 241B Review of Limit Theorems for Sequences of Random Variables Convergence in Distribution The previous de nitions of convergence focus on the outcome sequences of a random variable. Convergence

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis Summarizing Variability and Determining Distributions CS 147: Computer Systems Performance Analysis Summarizing Variability and Determining Distributions 1

More information

COMPSCI 240: Reasoning Under Uncertainty

COMPSCI 240: Reasoning Under Uncertainty COMPSCI 240: Reasoning Under Uncertainty Andrew Lan and Nic Herndon University of Massachusetts at Amherst Spring 2019 Lecture 20: Central limit theorem & The strong law of large numbers Markov and Chebyshev

More information

Modèles stochastiques II

Modèles stochastiques II Modèles stochastiques II INFO 154 Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 1 http://ulbacbe/di Modéles stochastiques II p1/50 The basics of statistics Statistics starts ith

More information

Lecture 2 and Lecture 3

Lecture 2 and Lecture 3 Lecture 2 and Lecture 3 1 Lecture 2 and Lecture 3 We can describe distributions using 3 characteristics: shape, center and spread. These characteristics have been discussed since the foundation of statistics.

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

CS145: Probability & Computing

CS145: Probability & Computing CS45: Probability & Computing Lecture 5: Concentration Inequalities, Law of Large Numbers, Central Limit Theorem Instructor: Eli Upfal Brown University Computer Science Figure credits: Bertsekas & Tsitsiklis,

More information

Random Processes. DS GA 1002 Probability and Statistics for Data Science.

Random Processes. DS GA 1002 Probability and Statistics for Data Science. Random Processes DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Modeling quantities that evolve in time (or space)

More information

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population Lecture 5 1 Lecture 3 The Population Variance The population variance, denoted σ 2, is the sum of the squared deviations about the population mean divided by the number of observations in the population,

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data Sri Lankan Journal of Applied Statistics (Special Issue) Modern Statistical Methodologies in the Cutting Edge of Science Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations

More information

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the

More information

Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size. Arrange the values from smallest to largest:

Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size. Arrange the values from smallest to largest: 1 Chapter 3 - Descriptive stats: Numerical measures 3.1 Measures of Location Mean Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size Example: The number

More information

Introduction to statistics

Introduction to statistics Introduction to statistics Literature Raj Jain: The Art of Computer Systems Performance Analysis, John Wiley Schickinger, Steger: Diskrete Strukturen Band 2, Springer David Lilja: Measuring Computer Performance:

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

Maximum variance formulation

Maximum variance formulation 12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Statistics and Data Analysis

Statistics and Data Analysis Statistics and Data Analysis The Crash Course Physics 226, Fall 2013 "There are three kinds of lies: lies, damned lies, and statistics. Mark Twain, allegedly after Benjamin Disraeli Statistics and Data

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters, transforms,

More information

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations: Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak, scribe: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters,

More information

Lecture 4: Sampling, Tail Inequalities

Lecture 4: Sampling, Tail Inequalities Lecture 4: Sampling, Tail Inequalities Variance and Covariance Moment and Deviation Concentration and Tail Inequalities Sampling and Estimation c Hung Q. Ngo (SUNY at Buffalo) CSE 694 A Fun Course 1 /

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information