Statistical Data Analysis

DS-GA 1002 Lecture notes 8, Fall 2016

1 Descriptive statistics

In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the dataset, as well as for computing quantities that summarize it effectively. Such quantities are known as descriptive statistics. As we will see in the following sections, these statistics can often be interpreted within a probabilistic framework, but they are also useful when probabilistic assumptions are not warranted. Because of this, we present them as deterministic functions of the available data.

1.1 Histogram

We begin by considering datasets containing one-dimensional data, which are often visualized by plotting their histogram. The histogram is obtained by binning the range of the data and counting the number of instances that fall within each bin. The width of the bins is a parameter that can be adjusted to yield higher or lower resolution. If the data are interpreted as samples from a random variable, then the histogram can be interpreted as an approximation to their pmf or pdf.

Figure 1 shows two histograms computed from temperature data taken at a weather station in Oxford over 150 years. [1] Each data point represents the maximum temperature recorded in January or August of a particular year. Figure 2 shows a histogram of the GDP per capita of all countries in the world in 2014 according to the United Nations. [2]

[1] The data are available at oxforddata.txt.
[2] The data are available at
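The binning procedure is easy to reproduce programmatically. The following sketch (not part of the original notes) uses numpy to bin synthetic temperature-like values, since the Oxford dataset itself is not reproduced here; the number of bins plays the role of the resolution parameter discussed above.

```python
import numpy as np

# Synthetic stand-in for the Oxford January maximum temperatures
# (the real dataset is not reproduced here).
rng = np.random.default_rng(0)
temperatures = rng.normal(loc=6.7, scale=2.0, size=150)

# Bin the range of the data and count how many instances fall in each bin.
counts, bin_edges = np.histogram(temperatures, bins=10)

# Crude text rendering of the histogram: one row per bin.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:5.1f}, {right:5.1f}): {'#' * int(count)}")
```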

Figure 1: Histograms of temperature data taken at a weather station in Oxford over 150 years (one panel for January and one for August; horizontal axis in degrees Celsius). Each data point equals the maximum temperature recorded in a certain month of a particular year.

Figure 2: Histogram of the GDP per capita in 2014 (in thousands of dollars).

1.2 Empirical mean and variance

Averaging the elements in a one-dimensional dataset provides a one-number summary of the data, which is a deterministic counterpart to the mean of a random variable. This can be extended to multidimensional data by averaging over each dimension separately. Geometrically, the average, also known as the sample or empirical mean, is the center of mass of the data. A common preprocessing step in data analysis is to center a set of data by subtracting its empirical mean.

Definition 1.1 (Empirical mean). Let {x_1, x_2, ..., x_n} be a set of real-valued data. The empirical mean of the data is defined as

    av(x_1, x_2, ..., x_n) := (1/n) \sum_{i=1}^{n} x_i.    (1)

Let {x_1, x_2, ..., x_n} be a set of d-dimensional real-valued data vectors. The empirical mean or center is

    av(x_1, x_2, ..., x_n) := (1/n) \sum_{i=1}^{n} x_i.    (2)

The empirical mean of the data in Figure 1 is 6.73 °C in January and 21.3 °C in August. The empirical mean of the GDPs per capita in Figure 2 is $ .

The empirical variance is the average of the squared deviations from the empirical mean. Geometrically, it quantifies the average variation of the dataset around its center. It is a deterministic counterpart to the variance of a random variable.

Definition 1.2 (Empirical variance and standard deviation). Let {x_1, x_2, ..., x_n} be a set of real-valued data. The empirical variance is defined as

    var(x_1, x_2, ..., x_n) := (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, x_2, ..., x_n))^2.    (3)

The sample standard deviation is the square root of the empirical variance,

    std(x_1, x_2, ..., x_n) := \sqrt{var(x_1, x_2, ..., x_n)}.    (4)

You might be wondering why the normalizing constant is 1/(n-1) instead of 1/n. The reason is that this ensures that the expectation of the empirical variance equals the true variance when the data are iid (see Lemma 2.6). In practice there is not much difference between the two normalizations.

The empirical standard deviation of the temperature data in Figure 1 is 1.99 °C in January and 1.73 °C in August. The empirical standard deviation of the GDP data in Figure 2 is $ .
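As a quick numerical illustration of Definitions 1.1 and 1.2 (a sketch with made-up values, not the Oxford data), note that numpy normalizes the variance by 1/n unless ddof=1 is specified, so the latter is needed to match the 1/(n-1) definition used here.

```python
import numpy as np

x = np.array([3.1, 5.4, 6.8, 7.2, 4.9, 6.1])   # hypothetical measurements

emp_mean = x.mean()                  # av(x_1, ..., x_n), equation (1)
emp_var = x.var(ddof=1)              # 1/(n-1) normalization, equation (3)
emp_std = x.std(ddof=1)              # square root of the empirical variance, (4)

# Centering: subtract the empirical mean (a common preprocessing step).
centered = x - emp_mean

print(emp_mean, emp_var, emp_std)
print(centered.mean())               # approximately 0 after centering
```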

1.3 Order statistics

In some cases, a dataset is well described by its mean and standard deviation: in January the temperature in Oxford is around 6.73 °C, give or take 2 °C. This is a pretty accurate account of the temperature data from the previous section. However, imagine that someone describes the GDP dataset in Figure 2 as: countries typically have a GDP per capita of about $ , give or take $ . This description is pretty terrible. The problem is that most countries have very small GDPs per capita, whereas a few have really large ones, and the empirical mean and standard deviation do not really convey this information. Order statistics provide an alternative description, which is usually more informative in the presence of extreme values.

Definition 1.3 (Quantiles and percentiles). Let x_(1) <= x_(2) <= ... <= x_(n) denote the ordered elements of a set of data {x_1, x_2, ..., x_n}. The q quantile of the data, for 0 < q < 1, is x_([qn+1]). [3] The p/100 quantile is known as the p percentile. The 0.25 and 0.75 quantiles are known as the first and third quartiles, whereas the 0.5 quantile is known as the empirical median. A quarter of the data are smaller than the 0.25 quantile, half are smaller (or larger) than the median, and three quarters are smaller than the 0.75 quantile. If n is even, the empirical median is usually set to

    (x_(n/2) + x_(n/2+1)) / 2.    (5)

The difference between the third and the first quartile is known as the interquartile range (IQR).

[3] [qn+1] is the result of rounding qn+1 to the closest integer.

It turns out that for the temperature dataset in Figure 1 the empirical median is 6.80 °C in January and 21.2 °C in August, which is essentially the same as the empirical mean. The IQR is 2.9 °C in January and 2.1 °C in August, which indicates a spread around the median very similar to the one suggested by the empirical standard deviation. In this particular example, there does not seem to be an advantage in using order statistics.

For the GDP dataset, the median is $ . This means that half of the countries have a GDP per capita of less than $ . In contrast, 71% of the countries have a GDP per capita lower than the empirical mean! The IQR of these data is $ .

To provide a more complete description of the dataset, we can list a five-number summary of order statistics: the minimum x_(1), the first quartile, the empirical median, the third quartile and the maximum x_(n). For the GDP dataset these are $130, $1 960, $6 350, $ , and $ , respectively.

We can visualize the main order statistics of a dataset by using a box plot, which shows the median value of the data enclosed in a box. The bottom and top of the box are the first and third quartiles. This way of visualizing a dataset was proposed by the mathematician John Tukey. Tukey's box plot also includes whiskers. The lower whisker is a line extending from the bottom of the box to the smallest value within 1.5 IQR of the first quartile. The upper whisker extends from the top of the box to the largest value within 1.5 IQR of the third quartile. Values beyond the whiskers are considered outliers and are plotted separately.
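The quantities in this section are straightforward to compute. The sketch below (with hypothetical skewed data standing in for the GDP dataset, not the real values) obtains the quartiles, the five-number summary, the IQR, and the points that Tukey's rule would mark as outliers.

```python
import numpy as np

# Hypothetical, heavily skewed GDP-per-capita-like data (in dollars).
rng = np.random.default_rng(1)
gdp = rng.lognormal(mean=8.8, sigma=1.3, size=200)

q1, median, q3 = np.quantile(gdp, [0.25, 0.5, 0.75])
iqr = q3 - q1

five_number_summary = (gdp.min(), q1, median, q3, gdp.max())
print("five-number summary:", np.round(five_number_summary, 1))
print("IQR:", round(iqr, 1))

# Tukey's rule for the whiskers/outliers of a box plot.
outliers = gdp[(gdp < q1 - 1.5 * iqr) | (gdp > q3 + 1.5 * iqr)]
print("number of outliers:", outliers.size)
```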

Figure 3: Box plots of the Oxford temperature dataset used in Figure 1 (vertical axis in degrees Celsius). Each box plot corresponds to the maximum temperature in a particular month (January, April, August and November) over the last 150 years.

Figure 4: Box plot of the GDP per capita in 2014 (in thousands of dollars). Not all of the outliers are shown.

Figure 3 applies box plots to visualize the temperature dataset used in Figure 1. Each box plot corresponds to the maximum temperature in a particular month (January, April, August and November) over the last 150 years. The box plots allow us to quickly compare the spread of temperatures in the different months. Figure 4 shows a box plot of the GDP data from Figure 2. From the box plot it is immediately apparent that most countries have very small GDPs per capita, that the spread between countries increases for larger GDPs per capita, and that a small number of countries have very large GDPs per capita.

1.4 Empirical covariance

In the previous sections we mostly considered datasets consisting of one-dimensional data (except when we discussed the empirical mean of a multidimensional dataset). In machine-learning lingo, there was only one feature per data point. We now study a multidimensional scenario, where there are several features associated to each data point.

If the dimension of the dataset equals two (i.e. there are two features per data point), we can visualize the data using a scatter plot, where each axis represents one of the features. Figure 5 shows two scatter plots of temperature data. These data are the same as in Figure 1, but we have now arranged them to form two-dimensional datasets. In the plot on the left, one dimension corresponds to the temperature in January and the other dimension to the temperature in August (there is one data point per year). In the plot on the right, one dimension represents the minimum temperature in a particular month and the other dimension represents the maximum temperature in the same month (there is one data point per month).

The empirical covariance quantifies whether the two features of a two-dimensional dataset tend to vary in a similar way on average, just as the covariance quantifies the expected joint variation of two random variables. In order to take into account that each individual feature may vary on a different scale, a common preprocessing step is to normalize each feature, dividing it by its empirical standard deviation. If we normalize before computing the covariance, we obtain the empirical correlation coefficient of the two features. If one of the features represents distance, for example, its correlation coefficient with another feature does not depend on the unit that we are using, but its covariance does.

Definition 1.4 (Empirical covariance). Let {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be a dataset where each example consists of two features. The empirical covariance is defined as

    cov((x_1, y_1), ..., (x_n, y_n)) := (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, ..., x_n)) (y_i - av(y_1, ..., y_n)).    (6)

Figure 5: Scatter plot of the temperature in January and in August (left; ρ = 0.69) and of the minimum and maximum monthly temperature (right; ρ = 0.96) in Oxford over the last 150 years.

Definition 1.5 (Empirical correlation coefficient). Let {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} be a dataset where each example consists of two features. The empirical correlation coefficient is defined as

    ρ((x_1, y_1), ..., (x_n, y_n)) := cov((x_1, y_1), ..., (x_n, y_n)) / ( std(x_1, ..., x_n) std(y_1, ..., y_n) ).    (7)

By the Cauchy-Schwarz inequality from linear algebra, which states that for any vectors a and b

    -1 <= a^T b / ( ||a||_2 ||b||_2 ) <= 1,    (8)

the magnitude of the empirical correlation coefficient is bounded by one. If it is equal to 1 or -1, then the two centered datasets are collinear. This Cauchy-Schwarz inequality is related to the Cauchy-Schwarz inequality for random variables (Theorem 4.7 in Lecture Notes 4), but here it applies to deterministic vectors.

Figure 5 shows the empirical correlation coefficients corresponding to the two plots. Maximum and minimum temperatures within the same month are highly correlated, whereas the maximum temperatures in January and August within the same year are only somewhat correlated.
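The following sketch (with synthetic features rather than the temperature data) computes the empirical covariance and correlation coefficient exactly as in equations (6) and (7), and cross-checks the results against numpy's built-in estimators.

```python
import numpy as np

# Two hypothetical features measured on the same n examples.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 0.7 * x + 0.3 * rng.normal(size=100)   # correlated with x by construction

n = x.size
cov = ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)     # equation (6)
rho = cov / (x.std(ddof=1) * y.std(ddof=1))                 # equation (7)

# Cross-check against numpy (np.cov also uses the 1/(n-1) normalization).
assert np.isclose(cov, np.cov(x, y)[0, 1])
assert np.isclose(rho, np.corrcoef(x, y)[0, 1])
print(f"empirical covariance: {cov:.3f}, correlation coefficient: {rho:.3f}")
```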

1.5 Covariance matrix and principal component analysis

We now turn to the problem of describing how a set of multidimensional data varies around its center when the dimension is larger than two. We begin by defining the empirical covariance matrix, which contains the pairwise empirical covariance between every two features.

Definition 1.6 (Empirical covariance matrix). Let {x_1, x_2, ..., x_n} be a set of d-dimensional real-valued data vectors. The empirical covariance matrix of these data is the d x d matrix

    Σ(x_1, ..., x_n) := (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, ..., x_n)) (x_i - av(x_1, ..., x_n))^T.    (9)

The (i, j) entry of the covariance matrix, where 1 <= i, j <= d, is given by

    Σ(x_1, ..., x_n)_{ij} = var(x_1[i], ..., x_n[i])                         if i = j,
                            cov((x_1[i], x_1[j]), ..., (x_n[i], x_n[j]))     if i != j,    (10)

where x[i] denotes the ith entry of the vector x.

In order to characterize the variation of a multidimensional dataset around its center, we consider its variation in different directions. In particular, we are interested in determining in what directions it varies more and in what directions it varies less. The average variation of the data in a certain direction is quantified by the empirical variance of the projections of the data onto that direction. Let v be a unit-norm vector aligned with a direction of interest; the empirical variance of the dataset in that direction is given by

    var(v^T x_1, ..., v^T x_n) = (1/(n-1)) \sum_{i=1}^{n} ( v^T x_i - av(v^T x_1, ..., v^T x_n) )^2    (11)
                               = (1/(n-1)) \sum_{i=1}^{n} ( v^T (x_i - av(x_1, ..., x_n)) )^2    (12)
                               = v^T ( (1/(n-1)) \sum_{i=1}^{n} (x_i - av(x_1, ..., x_n)) (x_i - av(x_1, ..., x_n))^T ) v
                               = v^T Σ(x_1, ..., x_n) v.    (13)

Using the empirical covariance matrix we can express the variation in every direction! This is a deterministic analog of the fact that the covariance matrix of a random vector encodes its variance in every direction.

To find the direction in which variation is maximal, we need to maximize the quadratic form v^T Σ(x_1, ..., x_n) v over all unit-norm vectors v. Consider the eigendecomposition of the covariance matrix

    Σ(x_1, ..., x_n) = U Λ U^T    (14)
                     = [u_1 u_2 ... u_d] diag(λ_1, λ_2, ..., λ_d) [u_1 u_2 ... u_d]^T.    (15)

Figure 6: PCA of a two-dimensional dataset with n = 20 data points, for different configurations of the data; the arrows show the principal components u_1 and u_2.

By definition, Σ(x_1, ..., x_n) is symmetric, so its eigenvectors are orthogonal. Furthermore, the eigenvectors and eigenvalues have a very intuitive interpretation in terms of the quadratic form of interest.

Theorem 1.7. For any symmetric matrix A ∈ R^{n x n} with normalized eigenvectors u_1, u_2, ..., u_n and corresponding eigenvalues λ_1 >= λ_2 >= ... >= λ_n,

    λ_1 = max_{||v||_2 = 1} v^T A v,    (16)
    u_1 = arg max_{||v||_2 = 1} v^T A v,    (17)
    λ_k = max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k-1}} v^T A v,    (18)
    u_k = arg max_{||v||_2 = 1, v ⊥ u_1, ..., u_{k-1}} v^T A v.    (19)

The maximum of v^T Σ(x_1, ..., x_n) v is equal to the largest eigenvalue λ_1 of Σ(x_1, ..., x_n) and is attained by the corresponding eigenvector u_1. This means that u_1 is the direction of maximum variation. Moreover, the eigenvector u_2 corresponding to the second largest eigenvalue λ_2 is the direction of maximum variation that is orthogonal to u_1. In general, the eigenvector u_k corresponding to the kth largest eigenvalue λ_k reveals the direction of maximum variation that is orthogonal to u_1, u_2, ..., u_{k-1}. Finally, u_d is the direction of minimum variation.

In data analysis, the eigenvectors of the sample covariance matrix are usually called principal components. Computing these eigenvectors to quantify the variation of a dataset in different directions is called principal component analysis (PCA). Figure 6 shows the principal components for several two-dimensional examples.
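To make the connection between the empirical covariance matrix and PCA concrete, the sketch below (a hypothetical two-dimensional Gaussian dataset, not the data in Figure 6) forms Σ as in equation (9), computes its eigendecomposition, and verifies that the empirical variance of the projections onto u_1 equals λ_1, as implied by equation (13) and Theorem 1.7.

```python
import numpy as np

# Hypothetical 2-dimensional dataset (one row per data point).
rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.2], [1.2, 1.0]], size=500)

centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / (X.shape[0] - 1)     # empirical covariance, (9)

# Eigendecomposition of the symmetric covariance matrix, as in (14)-(15).
eigenvalues, U = np.linalg.eigh(Sigma)               # eigh returns ascending order
order = np.argsort(eigenvalues)[::-1]                # sort by decreasing eigenvalue
eigenvalues, U = eigenvalues[order], U[:, order]

print("variance along the principal components:", np.round(eigenvalues, 2))
print("first principal component u_1:", np.round(U[:, 0], 3))

# The empirical variance of the projections onto u_1 matches lambda_1, cf. (13).
proj = centered @ U[:, 0]
print("var of projections onto u_1:", round(proj.var(ddof=1), 2))
```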

Figure 7: Projection of 7-dimensional vectors describing different wheat seeds onto the first two (left; axes: projections onto the first and second principal components) and the last two (right; axes: projections onto the (d-1)th and dth principal components) principal components of the dataset. Each color represents a variety of wheat.

The following example explains how to apply principal component analysis to dimensionality reduction. The motivation is that in many cases directions of higher variation are more informative about the structure of the dataset.

Example 1.8 (Dimensionality reduction via PCA). We consider a dataset where each data point corresponds to a seed, which has seven features: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient and length of kernel groove. The seeds belong to three different varieties of wheat: Kama, Rosa and Canadian. [4] Our aim is to visualize the data by projecting them down to two dimensions in a way that preserves as much variation as possible. This can be achieved by projecting each point onto the first two principal components of the dataset. Figure 7 shows the projection of the data onto the first two and the last two principal components. In the latter case, there is almost no discernible variation. The structure of the data is much better conserved by the first two components, which allow us to clearly visualize the difference between the three types of seeds. Note however that projection onto the first principal components only ensures that we preserve as much variation as possible, not that the projection will be good for tasks such as classification.

[4] The data can be found at
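A minimal sketch of the dimensionality-reduction procedure described in Example 1.8 is given below; it uses random stand-in data of the same shape as the seed dataset, since the actual features are not reproduced here.

```python
import numpy as np

# Stand-in for the 7-dimensional seed features: n = 210 points in d = 7 dimensions.
rng = np.random.default_rng(4)
X = rng.normal(size=(210, 7)) @ rng.normal(size=(7, 7))

centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / (X.shape[0] - 1)
eigenvalues, U = np.linalg.eigh(Sigma)
U = U[:, np.argsort(eigenvalues)[::-1]]      # columns sorted by decreasing variance

# Project each (centered) data point onto the first two principal components.
projection_2d = centered @ U[:, :2]
print(projection_2d.shape)                   # (210, 2): ready for a scatter plot
```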

2 Statistical estimation

The goal of statistical estimation is to extract information from data. In this section we model the data as a realization of an iid sequence of random variables. This assumption allows us to analyze statistical estimation using probabilistic tools, such as the law of large numbers and the central limit theorem. We study how to approximate a deterministic parameter associated to the underlying distribution of the iid sequence, for example its mean. This is a frequentist framework, as opposed to Bayesian approaches where the parameters of interest are modeled as random quantities. We will study Bayesian statistics later on in the course.

We define an estimator as a deterministic function h which provides an approximation y_n to a parameter of interest γ from the data x_1, x_2, ..., x_n,

    y_n := h(x_1, x_2, ..., x_n).    (20)

Under the assumption that the data are a realization of an iid sequence X, the estimators for different numbers of samples can be interpreted as a random sequence,

    Ỹ_n := h(X_1, X_2, ..., X_n),    (21)

which we can analyze probabilistically. In particular, we consider the following questions:

- Is the estimator guaranteed to produce an arbitrarily good approximation from arbitrarily large amounts of data? More formally, does Ỹ_n converge to γ as n → ∞?

- For finite n, what is the probability that γ is approximated by the estimator up to a certain accuracy?

Before answering these questions, we show that our framework applies to an important scenario: estimating a descriptive statistic of a large population from a randomly chosen subset of individuals.

Example 2.1 (Sampling from a population). Assume that we are studying a population of m individuals. We are interested in a certain feature associated to each person, e.g. their cholesterol level, their salary or who they are voting for in an election. There are k possible values for the feature, {z_1, z_2, ..., z_k}, where k can be equal to m or much smaller. We denote by m_j the number of people for whom the feature is equal to z_j, 1 <= j <= k. In the case of an election with two candidates, k would equal two and m_1 and m_2 would represent the number of people voting for each of the candidates, respectively.

Our goal is to estimate a descriptive statistic of the population, but we can only measure the feature of interest for a reduced number of individuals. If we choose those individuals uniformly at random, then the measurements can be modeled as a sequence X with first-order pmf

    p_{X_i}(z_j) = P(the feature of the ith chosen person equals z_j)    (22)
                 = m_j / m,    1 <= j <= k.    (23)

If we sample with replacement (an individual can be chosen several times), every sample has the same pmf and the different samples are independent, so the data are an iid sequence.
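The following sketch illustrates Example 2.1 with a hypothetical two-candidate population (not part of the original notes): sampling with replacement yields iid measurements whose pmf is given by the population frequencies m_j / m.

```python
import numpy as np

# Hypothetical population of m individuals with a categorical feature.
rng = np.random.default_rng(5)
population = rng.choice(["candidate A", "candidate B"], size=5000, p=[0.55, 0.45])

# Population pmf: m_j / m for each possible value z_j, as in (22)-(23).
true_pmf = {z: np.mean(population == z) for z in np.unique(population)}

# Sampling n individuals uniformly at random *with replacement* yields iid data.
sample = rng.choice(population, size=200, replace=True)
sample_freq = {z: np.mean(sample == z) for z in np.unique(sample)}

print("population pmf:     ", true_pmf)
print("sample frequencies: ", sample_freq)
```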

2.1 Mean square error

As we discussed in Lecture Notes 6 when describing convergence in mean square, the mean square of the difference between two random variables is a reasonable measure of how close they are to each other. The mean square error of an estimator quantifies how accurately it approximates the quantity of interest.

Definition 2.2 (Mean square error). The mean square error (MSE) of an estimator Y that approximates a parameter γ is

    MSE(Y) := E( (Y - γ)^2 ).    (24)

The MSE can be decomposed into a bias term and a variance term. The bias term is the difference between the parameter of interest and the expected value of the estimator. The variance term corresponds to the variation of the estimator around its expected value.

Lemma 2.3 (Bias-variance decomposition). The MSE of an estimator Y that approximates a parameter γ satisfies

    MSE(Y) = E( (Y - E(Y))^2 ) + ( E(Y) - γ )^2,    (25)

where the first term is the variance of the estimator and the second term is its squared bias.

Proof. The lemma is a direct consequence of linearity of expectation.

If the bias is zero, then the estimator equals the quantity of interest on average.

Definition 2.4 (Unbiased estimator). An estimator Y that approximates a parameter γ is unbiased if its bias is equal to zero, i.e. if and only if

    E(Y) = γ.    (26)

An estimator may be unbiased but still incur a large mean square error due to its variance.
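The bias-variance decomposition of Lemma 2.3 is easy to verify numerically. The sketch below (assuming Gaussian data, which the lemma does not require) compares the 1/(n-1) and 1/n versions of the empirical variance as estimators of the true variance: the MSE matches the sum of the variance and the squared bias in both cases, and only the first version is (approximately) unbiased.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n = 2.0, 3.0, 10            # assumed true parameters
gamma = sigma ** 2                     # parameter of interest: the variance

estimates_unbiased = []                # 1/(n-1) normalization
estimates_biased = []                  # 1/n normalization
for _ in range(100_000):
    x = rng.normal(mu, sigma, size=n)
    estimates_unbiased.append(x.var(ddof=1))
    estimates_biased.append(x.var(ddof=0))

for name, est in [("1/(n-1)", np.array(estimates_unbiased)),
                  ("1/n    ", np.array(estimates_biased))]:
    mse = np.mean((est - gamma) ** 2)
    variance = est.var()
    bias_sq = (est.mean() - gamma) ** 2
    # MSE should match variance + squared bias, as in Lemma 2.3.
    print(f"{name}: MSE={mse:.3f}  var+bias^2={variance + bias_sq:.3f}  "
          f"bias={est.mean() - gamma:+.3f}")
```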

The following lemmas establish that the empirical mean and variance are unbiased estimators of the true mean and variance of an iid sequence of random variables.

Lemma 2.5 (The empirical mean is unbiased). The empirical mean is an unbiased estimator of the mean of an iid sequence of random variables.

Proof. We consider the empirical mean of an iid sequence X with mean µ,

    Ỹ_n := (1/n) \sum_{i=1}^{n} X_i.    (27)

By linearity of expectation,

    E(Ỹ_n) = (1/n) \sum_{i=1}^{n} E(X_i)    (28)
           = µ.    (29)

Lemma 2.6 (The empirical variance is unbiased). The empirical variance is an unbiased estimator of the variance of an iid sequence of random variables.

The proof of this result is in Section A of the appendix.

2.2 Consistency

Intuitively, if we are estimating a scalar quantity, the estimate should improve as we gather more data. In fact, ideally the estimate should converge to the true parameter in the limit when the number of data n → ∞. Estimators that achieve this are said to be consistent.

Definition 2.7 (Consistency). An estimator Ỹ_n := h(X_1, X_2, ..., X_n) that approximates a parameter γ is consistent if it converges to γ as n → ∞ in mean square, with probability one or in probability.

The following theorem shows that the empirical mean is consistent.

Theorem 2.8 (The empirical mean is consistent). The empirical mean is a consistent estimator of the mean of an iid sequence of random variables, as long as the variance of the sequence is bounded.

Proof. We consider the empirical mean of an iid sequence X with mean µ,

    Ỹ_n := (1/n) \sum_{i=1}^{n} X_i.    (30)

The estimator is equal to the moving average of the data. As a result it converges to µ in mean square (and with probability one) by the law of large numbers (Theorem 3.2 in Lecture Notes 6), as long as the variance σ^2 of each of the entries of the iid sequence is bounded.

Example 2.9 (Estimating the average height). In this example we illustrate the consistency of the empirical mean. Imagine that we want to estimate the mean height in a population. To be concrete, we will consider a population of m := 5000 people. Figure 8 shows a histogram of their heights. [5] As explained in Example 2.1, if we sample n individuals from this population with replacement, then their heights form an iid sequence X. The mean of this sequence is

    E(X_i) = \sum_{j=1}^{m} P(person j is chosen) · (height of person j)    (31)
           = (1/m) \sum_{j=1}^{m} h_j    (32)
           = av(h_1, ..., h_m)    (33)

for 1 <= i <= n, where h_1, ..., h_m are the heights of the people. In addition, the variance is bounded because the heights are finite. By Theorem 2.8 the empirical mean of the n data should converge to the mean of the iid sequence, and hence to the average height over the whole population. Figure 9 illustrates this numerically.

If the mean of the underlying distribution is not well defined, or its variance is unbounded, then the empirical mean is not necessarily a consistent estimator. This is related to the fact that the empirical mean can be severely affected by the presence of extreme values, as we discussed in Section 1.2. The empirical median, in contrast, tends to be more robust in such situations, as discussed in Section 1.3. The following theorem establishes that the empirical median is consistent, even if the mean is not well defined or the variance is unbounded. The proof is in Section B of the appendix.

Theorem 2.10 (Empirical median as an estimator of the median). The empirical median is a consistent estimator of the median of an iid sequence of random variables.

Figure 10 compares the moving average and the moving median of an iid sequence of Cauchy random variables for three different realizations. The moving average is unstable and does not converge no matter how many data are available, which is not surprising because the mean is not well defined. In contrast, the moving median does eventually converge to the true median, as predicted by Theorem 2.10.

[5] The dataset can be found here: wiki.stat.ucla.edu/socr/index.php/socr_data_dinov_008_HeightsWeights.
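The contrast between the empirical mean and the empirical median for heavy-tailed data, illustrated in Figure 10, can be reproduced with a few lines of code. The sketch below uses freshly generated Cauchy samples rather than the realizations shown in the figure; the true median of the standard Cauchy distribution is 0.

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.standard_cauchy(size=100_000)    # heavy tails: the mean is undefined

for n in [100, 1_000, 10_000, 100_000]:
    running_mean = data[:n].mean()          # does not stabilize as n grows
    running_median = np.median(data[:n])    # converges to the true median, 0
    print(f"n={n:>7}: empirical mean={running_mean:+8.2f}  "
          f"empirical median={running_median:+.3f}")
```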

Figure 8: Histogram of the heights (in inches) of a group of people.

Figure 9: Different realizations of the empirical mean, as a function of n, when individuals from the population in Figure 8 are sampled with replacement, compared to the true mean (heights in inches).

Figure 10: Realizations of the moving average (empirical mean) of an iid Cauchy sequence (top row) compared to the moving median (empirical median, bottom row). Each panel also shows the median of the iid sequence; each column corresponds to a different realization.

Figure 11: Principal components of n samples from a bivariate Gaussian distribution (red) compared to the eigenvectors of the covariance matrix of the distribution (black), for increasing values of n.

The empirical variance and covariance are consistent estimators of the variance and covariance respectively, under certain assumptions on the higher moments of the underlying distributions. This provides an intuitive interpretation for PCA under the assumption that the data are realizations of an iid sequence of random vectors: the principal components approximate the eigenvectors of the true covariance matrix, and hence the directions of maximum variance of the multidimensional distribution. Figure 11 illustrates this with a numerical example, where the principal components indeed converge to the eigenvectors as the number of data increases.

2.3 Confidence intervals

Consistency implies that an estimator will be perfect if we acquire infinite data, but this is of course impossible in practice. It is therefore very important to quantify the accuracy of an estimator for a fixed number of data. Confidence intervals allow us to do this from a frequentist point of view. A confidence interval can be interpreted as a soft estimate of the deterministic parameter of interest, which guarantees that the parameter will belong to the interval with a certain probability.

Definition 2.11 (Confidence interval). A 1 - α confidence interval I for a parameter γ satisfies

    P(γ ∈ I) >= 1 - α,    (34)

where 0 < α < 1.

Confidence intervals are usually of the form [Y - c, Y + c], where Y is an estimator of the quantity of interest and c is a constant that depends on the number of data.

The following theorem shows how to derive a confidence interval for the mean of data that are modeled as an iid sequence. The confidence interval is centered at the empirical mean.

Theorem 2.12 (Confidence interval for the mean of an iid sequence). Let X be an iid sequence with mean µ and variance σ^2 <= b^2 for some b > 0. For any 0 < α < 1,

    I_n := [ Y_n - b / \sqrt{α n},  Y_n + b / \sqrt{α n} ],    Y_n := av(X_1, X_2, ..., X_n),    (35)

is a 1 - α confidence interval for µ.

Proof. Recall that the variance of Y_n equals Var(Y_n) = σ^2 / n (see the proof of Theorem 3.2 in Lecture Notes 6). We have

    P( µ ∈ [ Y_n - b / \sqrt{α n},  Y_n + b / \sqrt{α n} ] ) = 1 - P( |Y_n - µ| > b / \sqrt{α n} )    (36)
        >= 1 - α n Var(Y_n) / b^2    by Chebyshev's inequality    (37)
        = 1 - α σ^2 / b^2    (38)
        >= 1 - α.    (39)

The width of the interval provided in the theorem decreases with n for fixed α, which makes sense, as incorporating more data reduces the variance of the estimator and hence our uncertainty about it.

Example 2.13 (Bears in Yosemite). A scientist is trying to estimate the average weight of the black bears in Yosemite National Park. She manages to capture 300 bears. We assume that the bears are sampled uniformly at random with replacement (a bear can be weighed more than once). Under this assumption, in Example 2.1 we showed that the data can be modeled as iid samples, and in Example 2.9 we showed that the empirical mean is a consistent estimator of the mean of the whole population.

The average weight of the 300 captured bears is Y := 200 lbs. To derive a confidence interval from this information we need a bound on the variance. The maximum weight ever recorded for a black bear is 880 lbs. Let µ and σ^2 be the unknown mean and variance of the weights of the whole population. If X is the weight of a bear chosen uniformly at random from the whole population, then X has mean µ and variance σ^2, so

    σ^2 = E(X^2) - E(X)^2    (40)
        <= E(X^2)    (41)
        <= 880^2    because X <= 880.    (42)

As a result, 880 is an upper bound for the standard deviation. Applying Theorem 2.12,

    [ Y - b / \sqrt{α n},  Y + b / \sqrt{α n} ] = [-27.2, 427.2]    (43)

is a 95% confidence interval for the average weight of the whole population. The interval is not very precise because n is not very large.

As illustrated by this example, confidence intervals derived from Chebyshev's inequality tend to be very conservative. An alternative is to leverage the central limit theorem (CLT). The CLT characterizes the distribution of the empirical mean asymptotically, so confidence intervals derived from it are not guaranteed to be precise for finite n. However, the CLT often provides a very accurate approximation to the distribution of the empirical mean for finite n, as we showed through some numerical examples in Lecture Notes 6.

In order to obtain confidence intervals for the mean of an iid sequence from the CLT as stated in Lecture Notes 6, we would need to have access to the true variance of the sequence, which is unrealistic in practice. However, the following result states that we can substitute the true variance with the empirical variance. The proof is beyond the scope of these notes.

Theorem 2.14 (Central limit theorem with empirical standard deviation). Let X be an iid discrete random process with mean µ_X := µ such that its variance and fourth moment E(X_i^4) are bounded. The sequence

    \sqrt{n} ( av(X_1, ..., X_n) - µ ) / std(X_1, ..., X_n)    (44)

converges in distribution to a standard Gaussian random variable.

Recall that the cdf of a standard Gaussian does not have a closed-form expression. To simplify notation, we express the confidence interval in terms of the Q function.

Definition 2.15 (Q function). Q(x) is the probability that a standard Gaussian random variable is greater than x, for positive x:

    Q(x) := \int_{u=x}^{∞} ( 1 / \sqrt{2 π} ) exp( -u^2 / 2 ) du,    x > 0.    (45)

By symmetry, if U is a standard Gaussian random variable and y < 0,

    P(U < y) = Q(-y).    (46)

Corollary 2.16 (Approximate confidence interval for the mean). Let X be an iid sequence that satisfies the conditions of Theorem 2.14. For any 0 < α < 1,

    I_n := [ Y_n - (S_n / \sqrt{n}) Q^{-1}(α/2),  Y_n + (S_n / \sqrt{n}) Q^{-1}(α/2) ],    (47)
    Y_n := av(X_1, X_2, ..., X_n),    (48)
    S_n := std(X_1, X_2, ..., X_n),    (49)

is an approximate 1 - α confidence interval for µ, i.e.

    P(µ ∈ I_n) ≈ 1 - α.    (50)

Proof. By the central limit theorem, when n → ∞, Y_n is approximately distributed as a Gaussian random variable with mean µ and variance σ^2 / n. As a result,

    P(µ ∈ I_n) = 1 - P( Y_n > µ + (S_n / \sqrt{n}) Q^{-1}(α/2) ) - P( Y_n < µ - (S_n / \sqrt{n}) Q^{-1}(α/2) )    (51)
               = 1 - P( \sqrt{n} (Y_n - µ) / S_n > Q^{-1}(α/2) ) - P( \sqrt{n} (Y_n - µ) / S_n < -Q^{-1}(α/2) )    (52)
               ≈ 1 - Q( Q^{-1}(α/2) ) - Q( Q^{-1}(α/2) )    by Theorem 2.14    (53)
               = 1 - α.    (54)

It is important to stress that the result only provides an accurate confidence interval if n is large enough for the empirical variance to converge to the true variance and for the CLT to take effect.

Example 2.17 (Bears in Yosemite, continued). The empirical standard deviation of the bears captured by the scientist equals 100 lbs. We apply Corollary 2.16 to derive an approximate confidence interval that is tighter than the one obtained by applying Chebyshev's inequality. Given that Q^{-1}(0.025) ≈ 1.96,

    [ Y - (S / \sqrt{n}) Q^{-1}(α/2),  Y + (S / \sqrt{n}) Q^{-1}(α/2) ] ≈ [188.8, 211.3]    (55)

is an approximate 95% confidence interval for the mean weight of the population of bears.
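The two confidence intervals for the bear example can be recomputed as follows. This is a sketch that assumes the summary statistics quoted in Examples 2.13 and 2.17 (n = 300, empirical mean 200 lbs, bound b = 880 lbs, empirical standard deviation 100 lbs) and uses scipy to evaluate the inverse Q function; it is not part of the original notes.

```python
import numpy as np
from scipy.stats import norm

n, alpha = 300, 0.05
y = 200.0          # empirical mean of the captured bears (lbs)
b = 880.0          # upper bound on the standard deviation (max recorded weight)
s = 100.0          # empirical standard deviation (assumed value for illustration)

# Chebyshev-based 95% confidence interval, Theorem 2.12.
half_cheb = b / np.sqrt(alpha * n)
print("Chebyshev interval:", (round(y - half_cheb, 1), round(y + half_cheb, 1)))

# CLT-based approximate 95% confidence interval, Corollary 2.16.
# Q^{-1}(alpha/2) is the (1 - alpha/2) quantile of the standard Gaussian.
q_inv = norm.ppf(1 - alpha / 2)
half_clt = s * q_inv / np.sqrt(n)
print("CLT interval:      ", (round(y - half_clt, 1), round(y + half_clt, 1)))
```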

Figure 12: 95% confidence intervals for the average height of the population in Example 2.9, computed from samples of increasing size n; each plot also shows the true mean.

Interpreting confidence intervals is somewhat tricky. After computing the confidence interval in Example 2.17, one is tempted to state: "The probability that the average weight is between 188.8 and 211.3 lbs is 0.95." However, we are modeling the average weight as a deterministic parameter, so there are no random quantities in this statement! The correct interpretation is that if we repeat the process of sampling the population and computing the confidence interval, then the true parameter will lie in the interval 95% of the time. This is illustrated in the following example and in Figure 12.

Example 2.18 (Estimating the average height, continued). Figure 12 shows several 95% confidence intervals for the average height of the population in Example 2.9. To compute each interval we select n individuals and then apply Corollary 2.16. The width of the intervals decreases as n grows, but because they are all 95% confidence intervals they each contain the true average with probability 0.95. Indeed, this is the case for 113 (94%) of the intervals that are plotted.
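The frequentist interpretation can also be checked by simulation. The sketch below (assuming a hypothetical population of heights and using scipy for the Gaussian quantile) repeatedly draws samples, builds the interval of Corollary 2.16, and counts how often it contains the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
population = rng.normal(loc=69.0, scale=3.5, size=25_000)   # hypothetical heights
true_mean = population.mean()

n, alpha, n_trials = 200, 0.05, 1_000
q_inv = norm.ppf(1 - alpha / 2)     # Q^{-1}(alpha/2)

covered = 0
for _ in range(n_trials):
    sample = rng.choice(population, size=n, replace=True)
    half_width = sample.std(ddof=1) * q_inv / np.sqrt(n)
    if abs(sample.mean() - true_mean) <= half_width:
        covered += 1

# The fraction of intervals containing the true mean should be close to 1 - alpha.
print(f"coverage over {n_trials} intervals: {covered / n_trials:.3f}")
```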

A Proof of Lemma 2.6

We consider the empirical variance of an iid sequence X with mean µ and variance σ^2,

    Ỹ_n := (1/(n-1)) \sum_{i=1}^{n} ( X_i - av(X_1, ..., X_n) )^2    (56)
         = (1/(n-1)) \sum_{i=1}^{n} ( X_i - (1/n) \sum_{j=1}^{n} X_j )^2    (57)
         = (1/(n-1)) \sum_{i=1}^{n} ( X_i^2 + (1/n^2) \sum_{j=1}^{n} \sum_{k=1}^{n} X_j X_k - (2/n) X_i \sum_{j=1}^{n} X_j ).    (58)

To simplify notation, we denote the mean square E(X_i^2) = µ^2 + σ^2 by ξ. We have

    E(Ỹ_n) = (1/(n-1)) \sum_{i=1}^{n} ( E(X_i^2) + (1/n^2) \sum_{j=1}^{n} \sum_{k=1}^{n} E(X_j X_k) - (2/n) \sum_{j=1}^{n} E(X_i X_j) )    (59)
           = (1/(n-1)) \sum_{i=1}^{n} ( ξ + ξ/n + (n-1) µ^2 / n - 2 ξ / n - 2 (n-1) µ^2 / n )    (60)
           = (1/(n-1)) \sum_{i=1}^{n} ( (n-1)/n ) ( ξ - µ^2 )    (61)
           = ξ - µ^2    (62)
           = σ^2.    (63)

B Proof of Theorem 2.10

We denote the empirical median by Ỹ_n. Our aim is to show that for any ε > 0

    lim_{n → ∞} P( |Ỹ_n - γ| >= ε ) = 0.    (64)

We will prove that

    lim_{n → ∞} P( Ỹ_n >= γ + ε ) = 0.    (65)

The same argument allows us to establish

    lim_{n → ∞} P( Ỹ_n <= γ - ε ) = 0.    (66)

If we order the set {X_1, ..., X_n}, then Ỹ_n equals the (n+1)/2-th element if n is odd and the average of the (n/2)-th and the (n/2 + 1)-th elements if n is even. The event Ỹ_n >= γ + ε therefore implies that at least (n+1)/2 of the elements are larger than γ + ε.

For each individual X_i, the probability that X_i > γ + ε is

    p := 1 - F_{X_i}(γ + ε) = 1/2 - ε_2,    (67)

where we assume that ε_2 > 0. (If this is not the case, then the cdf of the iid sequence is flat at γ and the median is not well defined.) The number of random variables in the set {X_1, ..., X_n} which are larger than γ + ε is distributed as a binomial random variable B_n with parameters n and p. As a result, we have

    P( Ỹ_n >= γ + ε ) <= P( (n+1)/2 or more samples are greater than or equal to γ + ε )    (68)
                       = P( B_n >= (n+1)/2 )    (69)
                       = P( B_n - np >= (n+1)/2 - np )    (70)
                       <= P( |B_n - np| >= n ε_2 + 1/2 )    (71)
                       <= Var(B_n) / ( n ε_2 + 1/2 )^2    by Chebyshev's inequality    (72)
                       = n p (1-p) / ( n ε_2 + 1/2 )^2    (73)
                       = p (1-p) / ( n ( ε_2 + 1/(2n) )^2 ),    (74)

which converges to zero as n → ∞. This establishes (65).

Descriptive Statistics

Descriptive Statistics Descriptive Statistics DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Descriptive statistics Techniques to visualize

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

The Singular-Value Decomposition

The Singular-Value Decomposition Mathematical Tools for Data Science Spring 2019 1 Motivation The Singular-Value Decomposition The singular-value decomposition (SVD) is a fundamental tool in linear algebra. In this section, we introduce

More information

Lecture Notes 2: Matrices

Lecture Notes 2: Matrices Optimization-based data analysis Fall 2017 Lecture Notes 2: Matrices Matrices are rectangular arrays of numbers, which are extremely useful for data analysis. They can be interpreted as vectors in a vector

More information

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Vector spaces. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Vector spaces DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Vector space Consists of: A set V A scalar

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Expectation. DS GA 1002 Probability and Statistics for Data Science. Carlos Fernandez-Granda

Expectation. DS GA 1002 Probability and Statistics for Data Science.   Carlos Fernandez-Granda Expectation DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Describe random variables with a few numbers: mean,

More information

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics

DS-GA 1002 Lecture notes 11 Fall Bayesian statistics DS-GA 100 Lecture notes 11 Fall 016 Bayesian statistics In the frequentist paradigm we model the data as realizations from a distribution that depends on deterministic parameters. In contrast, in Bayesian

More information

Expectation. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Expectation. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Expectation DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Aim Describe random variables with a few numbers: mean, variance,

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

DS-GA 1002 Lecture notes 12 Fall Linear regression

DS-GA 1002 Lecture notes 12 Fall Linear regression DS-GA Lecture notes 1 Fall 16 1 Linear models Linear regression In statistics, regression consists of learning a function relating a certain quantity of interest y, the response or dependent variable,

More information

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis.

Linear Models. DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis. Linear Models DS-GA 1013 / MATH-GA 2824 Optimization-based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_fall17/index.html Carlos Fernandez-Granda Linear regression Least-squares estimation

More information

Multivariate random variables

Multivariate random variables DS-GA 002 Lecture notes 3 Fall 206 Introduction Multivariate random variables Probabilistic models usually include multiple uncertain numerical quantities. In this section we develop tools to characterize

More information

TOPIC: Descriptive Statistics Single Variable

TOPIC: Descriptive Statistics Single Variable TOPIC: Descriptive Statistics Single Variable I. Numerical data summary measurements A. Measures of Location. Measures of central tendency Mean; Median; Mode. Quantiles - measures of noncentral tendency

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

The Hilbert Space of Random Variables

The Hilbert Space of Random Variables The Hilbert Space of Random Variables Electrical Engineering 126 (UC Berkeley) Spring 2018 1 Outline Fix a probability space and consider the set H := {X : X is a real-valued random variable with E[X 2

More information

Introduction to Machine Learning

Introduction to Machine Learning 10-701 Introduction to Machine Learning PCA Slides based on 18-661 Fall 2018 PCA Raw data can be Complex, High-dimensional To understand a phenomenon we measure various related quantities If we knew what

More information

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved.

1-1. Chapter 1. Sampling and Descriptive Statistics by The McGraw-Hill Companies, Inc. All rights reserved. 1-1 Chapter 1 Sampling and Descriptive Statistics 1-2 Why Statistics? Deal with uncertainty in repeated scientific measurements Draw conclusions from data Design valid experiments and draw reliable conclusions

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Exploratory data analysis: numerical summaries

Exploratory data analysis: numerical summaries 16 Exploratory data analysis: numerical summaries The classical way to describe important features of a dataset is to give several numerical summaries We discuss numerical summaries for the center of a

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information

Random variables. DS GA 1002 Probability and Statistics for Data Science.

Random variables. DS GA 1002 Probability and Statistics for Data Science. Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities

More information

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision)

CS4495/6495 Introduction to Computer Vision. 8B-L2 Principle Component Analysis (and its use in Computer Vision) CS4495/6495 Introduction to Computer Vision 8B-L2 Principle Component Analysis (and its use in Computer Vision) Wavelength 2 Wavelength 2 Principal Components Principal components are all about the directions

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

15 Singular Value Decomposition

15 Singular Value Decomposition 15 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Convergence of Random Processes

Convergence of Random Processes Convergence of Random Processes DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Define convergence for random

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

Week 9 The Central Limit Theorem and Estimation Concepts

Week 9 The Central Limit Theorem and Estimation Concepts Week 9 and Estimation Concepts Week 9 and Estimation Concepts Week 9 Objectives 1 The Law of Large Numbers and the concept of consistency of averages are introduced. The condition of existence of the population

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

L2: Review of probability and statistics

L2: Review of probability and statistics Probability L2: Review of probability and statistics Definition of probability Axioms and properties Conditional probability Bayes theorem Random variables Definition of a random variable Cumulative distribution

More information

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode.

Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode. Chapter 3 Numerically Summarizing Data Chapter 3.1 Measures of Central Tendency Objective A: Mean, Median and Mode Three measures of central of tendency: the mean, the median, and the mode. A1. Mean The

More information

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA)

CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) CS68: The Modern Algorithmic Toolbox Lecture #7: Understanding Principal Component Analysis (PCA) Tim Roughgarden & Gregory Valiant April 0, 05 Introduction. Lecture Goal Principal components analysis

More information

Linear regression. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Linear regression. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Linear regression DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall15 Carlos Fernandez-Granda Linear models Least-squares estimation Overfitting Example:

More information

4 Bias-Variance for Ridge Regression (24 points)

4 Bias-Variance for Ridge Regression (24 points) Implement Ridge Regression with λ = 0.00001. Plot the Squared Euclidean test error for the following values of k (the dimensions you reduce to): k = {0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500,

More information

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26

Lecture 13. Principal Component Analysis. Brett Bernstein. April 25, CDS at NYU. Brett Bernstein (CDS at NYU) Lecture 13 April 25, / 26 Principal Component Analysis Brett Bernstein CDS at NYU April 25, 2017 Brett Bernstein (CDS at NYU) Lecture 13 April 25, 2017 1 / 26 Initial Question Intro Question Question Let S R n n be symmetric. 1

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer. Department of Computer Science Virginia Tech Blacksburg, Virginia Copyright c 2015 by Clifford A. Shaffer Computer Science Title page Computer Science Clifford A. Shaffer Fall 2015 Clifford A. Shaffer

More information

Learning Objectives for Stat 225

Learning Objectives for Stat 225 Learning Objectives for Stat 225 08/20/12 Introduction to Probability: Get some general ideas about probability, and learn how to use sample space to compute the probability of a specific event. Set Theory:

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 26, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 55 High dimensional

More information

(Re)introduction to Statistics Dan Lizotte

(Re)introduction to Statistics Dan Lizotte (Re)introduction to Statistics Dan Lizotte 2017-01-17 Statistics The systematic collection and arrangement of numerical facts or data of any kind; (also) the branch of science or mathematics concerned

More information

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that?

Tastitsticsss? What s that? Principles of Biostatistics and Informatics. Variables, outcomes. Tastitsticsss? What s that? Tastitsticsss? What s that? Statistics describes random mass phanomenons. Principles of Biostatistics and Informatics nd Lecture: Descriptive Statistics 3 th September Dániel VERES Data Collecting (Sampling)

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

DS-GA 1002 Lecture notes 2 Fall Random variables

DS-GA 1002 Lecture notes 2 Fall Random variables DS-GA 12 Lecture notes 2 Fall 216 1 Introduction Random variables Random variables are a fundamental tool in probabilistic modeling. They allow us to model numerical quantities that are uncertain: the

More information

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes

ADMS2320.com. We Make Stats Easy. Chapter 4. ADMS2320.com Tutorials Past Tests. Tutorial Length 1 Hour 45 Minutes We Make Stats Easy. Chapter 4 Tutorial Length 1 Hour 45 Minutes Tutorials Past Tests Chapter 4 Page 1 Chapter 4 Note The following topics will be covered in this chapter: Measures of central location Measures

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

MATH4427 Notebook 4 Fall Semester 2017/2018

MATH4427 Notebook 4 Fall Semester 2017/2018 MATH4427 Notebook 4 Fall Semester 2017/2018 prepared by Professor Jenny Baglivo c Copyright 2009-2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH4427 Notebook 4 3 4.1 K th Order Statistics and Their

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

Intelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham

Intelligent Data Analysis. Principal Component Analysis. School of Computer Science University of Birmingham Intelligent Data Analysis Principal Component Analysis Peter Tiňo School of Computer Science University of Birmingham Discovering low-dimensional spatial layout in higher dimensional spaces - 1-D/3-D example

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

3.1 Measure of Center

3.1 Measure of Center 3.1 Measure of Center Calculate the mean for a given data set Find the median, and describe why the median is sometimes preferable to the mean Find the mode of a data set Describe how skewness affects

More information

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics

Chapter 6. Order Statistics and Quantiles. 6.1 Extreme Order Statistics Chapter 6 Order Statistics and Quantiles 61 Extreme Order Statistics Suppose we have a finite sample X 1,, X n Conditional on this sample, we define the values X 1),, X n) to be a permutation of X 1,,

More information

Math 180A. Lecture 16 Friday May 7 th. Expectation. Recall the three main probability density functions so far (1) Uniform (2) Exponential.

Math 180A. Lecture 16 Friday May 7 th. Expectation. Recall the three main probability density functions so far (1) Uniform (2) Exponential. Math 8A Lecture 6 Friday May 7 th Epectation Recall the three main probability density functions so far () Uniform () Eponential (3) Power Law e, ( ), Math 8A Lecture 6 Friday May 7 th Epectation Eample

More information

Lecture 2: Review of Basic Probability Theory

Lecture 2: Review of Basic Probability Theory ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent

More information

ECON3150/4150 Spring 2015

ECON3150/4150 Spring 2015 ECON3150/4150 Spring 2015 Lecture 3&4 - The linear regression model Siv-Elisabeth Skjelbred University of Oslo January 29, 2015 1 / 67 Chapter 4 in S&W Section 17.1 in S&W (extended OLS assumptions) 2

More information

Lecture Notes 6: Linear Models

Lecture Notes 6: Linear Models Optimization-based data analysis Fall 17 Lecture Notes 6: Linear Models 1 Linear regression 1.1 The regression problem In statistics, regression is the problem of characterizing the relation between a

More information

Methods for sparse analysis of high-dimensional data, II

Methods for sparse analysis of high-dimensional data, II Methods for sparse analysis of high-dimensional data, II Rachel Ward May 23, 2011 High dimensional data with low-dimensional structure 300 by 300 pixel images = 90, 000 dimensions 2 / 47 High dimensional

More information

Regression Analysis. Ordinary Least Squares. The Linear Model

Regression Analysis. Ordinary Least Squares. The Linear Model Regression Analysis Linear regression is one of the most widely used tools in statistics. Suppose we were jobless college students interested in finding out how big (or small) our salaries would be 20

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

3. Review of Probability and Statistics

3. Review of Probability and Statistics 3. Review of Probability and Statistics ECE 830, Spring 2014 Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture

More information

Economics 241B Review of Limit Theorems for Sequences of Random Variables

Economics 241B Review of Limit Theorems for Sequences of Random Variables Economics 241B Review of Limit Theorems for Sequences of Random Variables Convergence in Distribution The previous de nitions of convergence focus on the outcome sequences of a random variable. Convergence

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis Summarizing Variability and Determining Distributions CS 147: Computer Systems Performance Analysis Summarizing Variability and Determining Distributions 1

More information

COMPSCI 240: Reasoning Under Uncertainty

COMPSCI 240: Reasoning Under Uncertainty COMPSCI 240: Reasoning Under Uncertainty Andrew Lan and Nic Herndon University of Massachusetts at Amherst Spring 2019 Lecture 20: Central limit theorem & The strong law of large numbers Markov and Chebyshev

More information

Modèles stochastiques II

Modèles stochastiques II Modèles stochastiques II INFO 154 Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 1 http://ulbacbe/di Modéles stochastiques II p1/50 The basics of statistics Statistics starts ith

More information

Lecture 2 and Lecture 3

Lecture 2 and Lecture 3 Lecture 2 and Lecture 3 1 Lecture 2 and Lecture 3 We can describe distributions using 3 characteristics: shape, center and spread. These characteristics have been discussed since the foundation of statistics.

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Lecture 2: Repetition of probability theory and statistics

Lecture 2: Repetition of probability theory and statistics Algorithms for Uncertainty Quantification SS8, IN2345 Tobias Neckel Scientific Computing in Computer Science TUM Lecture 2: Repetition of probability theory and statistics Concept of Building Block: Prerequisites:

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

PCA and admixture models

PCA and admixture models PCA and admixture models CM226: Machine Learning for Bioinformatics. Fall 2016 Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar, Alkes Price PCA and admixture models 1 / 57 Announcements HW1

More information

[POLS 8500] Review of Linear Algebra, Probability and Information Theory

[POLS 8500] Review of Linear Algebra, Probability and Information Theory [POLS 8500] Review of Linear Algebra, Probability and Information Theory Professor Jason Anastasopoulos ljanastas@uga.edu January 12, 2017 For today... Basic linear algebra. Basic probability. Programming

More information

Spanning and Independence Properties of Finite Frames

Spanning and Independence Properties of Finite Frames Chapter 1 Spanning and Independence Properties of Finite Frames Peter G. Casazza and Darrin Speegle Abstract The fundamental notion of frame theory is redundancy. It is this property which makes frames

More information

CS145: Probability & Computing

CS145: Probability & Computing CS45: Probability & Computing Lecture 5: Concentration Inequalities, Law of Large Numbers, Central Limit Theorem Instructor: Eli Upfal Brown University Computer Science Figure credits: Bertsekas & Tsitsiklis,

More information

Random Processes. DS GA 1002 Probability and Statistics for Data Science.

Random Processes. DS GA 1002 Probability and Statistics for Data Science. Random Processes DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Aim Modeling quantities that evolve in time (or space)

More information

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population

Lecture 3. The Population Variance. The population variance, denoted σ 2, is the sum. of the squared deviations about the population Lecture 5 1 Lecture 3 The Population Variance The population variance, denoted σ 2, is the sum of the squared deviations about the population mean divided by the number of observations in the population,

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions

Pattern Recognition and Machine Learning. Bishop Chapter 2: Probability Distributions Pattern Recognition and Machine Learning Chapter 2: Probability Distributions Cécile Amblard Alex Kläser Jakob Verbeek October 11, 27 Probability Distributions: General Density Estimation: given a finite

More information

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data

Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations of High-Dimension, Low-Sample-Size Data Sri Lankan Journal of Applied Statistics (Special Issue) Modern Statistical Methodologies in the Cutting Edge of Science Asymptotic Distribution of the Largest Eigenvalue via Geometric Representations

More information

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom

Central Limit Theorem and the Law of Large Numbers Class 6, Jeremy Orloff and Jonathan Bloom Central Limit Theorem and the Law of Large Numbers Class 6, 8.5 Jeremy Orloff and Jonathan Bloom Learning Goals. Understand the statement of the law of large numbers. 2. Understand the statement of the

More information

Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size. Arrange the values from smallest to largest:

Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size. Arrange the values from smallest to largest: 1 Chapter 3 - Descriptive stats: Numerical measures 3.1 Measures of Location Mean Perhaps the most important measure of location is the mean (average). Sample mean: where n = sample size Example: The number

More information

Introduction to statistics

Introduction to statistics Introduction to statistics Literature Raj Jain: The Art of Computer Systems Performance Analysis, John Wiley Schickinger, Steger: Diskrete Strukturen Band 2, Springer David Lilja: Measuring Computer Performance:

More information

Ordinary Least Squares Linear Regression

Ordinary Least Squares Linear Regression Ordinary Least Squares Linear Regression Ryan P. Adams COS 324 Elements of Machine Learning Princeton University Linear regression is one of the simplest and most fundamental modeling ideas in statistics

More information

Asymptotic Statistics-III. Changliang Zou

Asymptotic Statistics-III. Changliang Zou Asymptotic Statistics-III Changliang Zou The multivariate central limit theorem Theorem (Multivariate CLT for iid case) Let X i be iid random p-vectors with mean µ and and covariance matrix Σ. Then n (

More information

STAT 512 sp 2018 Summary Sheet

STAT 512 sp 2018 Summary Sheet STAT 5 sp 08 Summary Sheet Karl B. Gregory Spring 08. Transformations of a random variable Let X be a rv with support X and let g be a function mapping X to Y with inverse mapping g (A = {x X : g(x A}

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Chapter 2: Tools for Exploring Univariate Data

Chapter 2: Tools for Exploring Univariate Data Stats 11 (Fall 2004) Lecture Note Introduction to Statistical Methods for Business and Economics Instructor: Hongquan Xu Chapter 2: Tools for Exploring Univariate Data Section 2.1: Introduction What is

More information

Maximum variance formulation

Maximum variance formulation 12.1. Principal Component Analysis 561 Figure 12.2 Principal component analysis seeks a space of lower dimensionality, known as the principal subspace and denoted by the magenta line, such that the orthogonal

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Statistics and Data Analysis

Statistics and Data Analysis Statistics and Data Analysis The Crash Course Physics 226, Fall 2013 "There are three kinds of lies: lies, damned lies, and statistics. Mark Twain, allegedly after Benjamin Disraeli Statistics and Data

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

Convergence of Eigenspaces in Kernel Principal Component Analysis

Convergence of Eigenspaces in Kernel Principal Component Analysis Convergence of Eigenspaces in Kernel Principal Component Analysis Shixin Wang Advanced machine learning April 19, 2016 Shixin Wang Convergence of Eigenspaces April 19, 2016 1 / 18 Outline 1 Motivation

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters, transforms,

More information

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations: Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number

More information

Lecture 3: Review of Linear Algebra

Lecture 3: Review of Linear Algebra ECE 83 Fall 2 Statistical Signal Processing instructor: R Nowak, scribe: R Nowak Lecture 3: Review of Linear Algebra Very often in this course we will represent signals as vectors and operators (eg, filters,

More information

Lecture 4: Sampling, Tail Inequalities

Lecture 4: Sampling, Tail Inequalities Lecture 4: Sampling, Tail Inequalities Variance and Covariance Moment and Deviation Concentration and Tail Inequalities Sampling and Estimation c Hung Q. Ngo (SUNY at Buffalo) CSE 694 A Fun Course 1 /

More information

Lecture 6: Methods for high-dimensional problems

Lecture 6: Methods for high-dimensional problems Lecture 6: Methods for high-dimensional problems Hector Corrada Bravo and Rafael A. Irizarry March, 2010 In this Section we will discuss methods where data lies on high-dimensional spaces. In particular,

More information