Lecture 5

The Population Variance

The population variance, denoted σ², is the sum of the squared deviations about the population mean divided by the number of observations in the population, N:

σ² = Σ(x_i − µ)²/N = [(x_1 − µ)² + (x_2 − µ)² + ⋯ + (x_N − µ)²]/N.

An alternative formula is:

σ² = Σx_i²/N − (Σx_i/N)² = Σx_i²/N − µ².

REMARK: To avoid round-off errors, which accumulate quickly in these formulas, do not round until the last computation, and use as many decimal places as your calculator allows.
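As an illustration (not from the lecture), here is a minimal Python sketch computing the population variance with both the definitional formula and the shortcut formula; the data set and function names are made up:

```python
def pop_variance_definitional(xs):
    # sigma^2 = sum((x_i - mu)^2) / N
    n = len(xs)
    mu = sum(xs) / n
    return sum((x - mu) ** 2 for x in xs) / n

def pop_variance_shortcut(xs):
    # sigma^2 = sum(x_i^2)/N - mu^2  (the alternative formula)
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up population
print(pop_variance_definitional(data))  # 4.0
print(pop_variance_shortcut(data))      # 4.0
```

Both formulas agree exactly in theory; in hand computation the shortcut form is where round-off errors tend to accumulate, which is what the remark above warns about.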
The Sample Variance

When the population is large, we approximate the population mean µ with the sample mean x̄. Similarly, we approximate the population variance σ² by the sample variance, denoted s²:

s² = Σ(x_i − x̄)²/(n − 1) = [(x_1 − x̄)² + (x_2 − x̄)² + ⋯ + (x_n − x̄)²]/(n − 1).

The alternative form is:

s² = Σx_i²/(n − 1) − (Σx_i)²/[n(n − 1)].

REMARK: Notice that we divide by the sample size minus one (this is different from the formula for the population variance).
Informally, we say that a sample of size n has n degrees of freedom; one degree of freedom is used up in computing x̄, so only n − 1 degrees of freedom remain for the sample variance.
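The two forms of s² can be checked against each other; a minimal sketch (made-up sample, illustrative function names):

```python
def sample_variance_definitional(xs):
    # s^2 = sum((x_i - xbar)^2) / (n - 1)
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    # s^2 = sum(x_i^2)/(n - 1) - (sum(x_i))^2 / (n(n - 1))
    n = len(xs)
    return sum(x * x for x in xs) / (n - 1) - sum(xs) ** 2 / (n * (n - 1))

sample = [1, 2, 3, 4, 10]  # made-up sample
print(sample_variance_definitional(sample))  # 12.5
print(sample_variance_shortcut(sample))      # 12.5
```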
The Standard Deviation

For both cases (the population or the sample), the standard deviation is the square root of the corresponding variance:

The population standard deviation is denoted by σ: σ = √(σ²).
The sample standard deviation is denoted by s: s = √(s²).

Advantage of the (population or sample) standard deviation: it is given in the same units as the observations.

Advantage of the (population or sample) variance: it is easier to manipulate algebraically, in some cases.

Both standard deviations and variances are interpreted as follows: the larger they are, the more spread out the distribution is (if they equal 0, the smallest possible value, then all observations must be equal).

Remark 1. The standard deviation measures spread about the mean and should be used only when the mean is chosen as the measure of center.

Remark 2. The standard deviation is not robust.

Remark 3. The sum of the deviations of the observations from their mean is always zero.
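Remark 3 and the definition of s can be verified numerically; a small sketch with a made-up sample:

```python
data = [1, 2, 3, 4, 10]  # made-up sample
n = len(data)
xbar = sum(data) / n

# Remark 3: the deviations about the mean sum to zero
deviations = [x - xbar for x in data]
print(sum(deviations))  # 0, up to floating-point error

# the sample standard deviation is the square root of the sample variance
s2 = sum(d * d for d in deviations) / (n - 1)
s = s2 ** 0.5
print(s)  # in the same units as the observations
```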
Density Curves

Histograms are approximations to the exact distribution of a variable. Increasing the number of classes in a histogram makes each rectangle narrower, and as the number of rectangles approaches infinity, the graph becomes a curve, called a density curve.

Properties of the density curve:
1. The curve always lies on or above the x-axis: the function f(x) describing the curve is nonnegative (it could be zero) for all x.
2. The total area underneath the curve and above the x-axis equals 1.
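The two properties can be checked numerically for one concrete density. A minimal sketch (not from the lecture), approximating the area under the bell-shaped curve f(x) = e^(−x²/2)/√(2π) with a Riemann sum:

```python
import math

def f(x):
    # a standard bell-shaped density curve, used here as an example
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Riemann-sum approximation of the total area under the curve;
# the tails beyond +/- 6 contribute a negligible amount
dx = 0.001
area = sum(f(-6 + i * dx) * dx for i in range(int(12 / dx)))
print(area)  # close to 1, as property 2 requires
```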
Density curves, as we saw, have means, medians, and modes, as well as standard deviations. The notation is similar to that for the population mean and standard deviation (why?). Most of the time we use software to estimate density curves. Often we assume that the data follow a certain density curve.
The Normal Distribution

Often called the Gaussian curve, the normal curve was introduced by Carl Friedrich Gauss in 1809 as an error curve in connection with least-squares regression, about which we will talk next time. Note that there are other symmetric, bell-shaped density curves that are not normal.

Remark 4. The curve is described completely by 2 parameters: µ, the mean, and σ, the standard deviation.
The Empirical Rule

If the distribution is approximately bell-shaped (not only normal), then:
1. Approximately 68% of the data will lie within one standard deviation of the mean. That is, about 68% of the data will be between µ − σ and µ + σ.
2. Approximately 95% of the data will lie within two standard deviations of the mean.
3. Approximately 99.7% of the data will lie within three standard deviations of the mean.

For exact values, we need to integrate to find the area between two points.
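The 68–95–99.7 percentages can be seen in a simulation; a sketch with made-up parameters, using the standard library's normal random generator:

```python
import random

# simulated bell-shaped data; random.gauss draws from a normal
# distribution with the given mean and standard deviation
random.seed(0)
mu, sigma, n = 0.0, 1.0, 100_000
sample = [random.gauss(mu, sigma) for _ in range(n)]

def within(k):
    # proportion of the sample within k standard deviations of the mean
    return sum(abs(x - mu) <= k * sigma for x in sample) / n

print(within(1), within(2), within(3))  # roughly 0.68, 0.95, 0.997
```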
In general, for any distribution, not only the normal distribution, Chebyshev's rule can be applied: the proportion of values from a data set that fall within k standard deviations of the mean is at least (1 − 1/k²) · 100%, where k > 1. This rule can be applied to samples too.
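Chebyshev's bound can be checked on a data set that is far from bell-shaped; a sketch with made-up numbers:

```python
def chebyshev_bound(k):
    # Chebyshev: at least (1 - 1/k^2) of the data lies within k SDs of the mean
    return 1 - 1 / k ** 2

def proportion_within(xs, k):
    # actual proportion of xs within k standard deviations of the mean
    n = len(xs)
    mu = sum(xs) / n
    sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
    return sum(abs(x - mu) <= k * sigma for x in xs) / n

data = [1, 1, 1, 2, 2, 3, 50]  # deliberately skewed, non-bell-shaped
for k in (2, 3):
    print(k, proportion_within(data, k), ">=", chebyshev_bound(k))
```

Note that for k = 2 the bound guarantees at least 75%, which holds here even though the empirical rule's 95% need not apply to such a skewed data set.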
Finding the area under the normal density curve is not an easy task; it requires a lot of calculus. One way of avoiding this is to use tables that give us these areas (probabilities). But for each µ and σ we would need a new table. How can we avoid this? By transforming all these curves into a single standard one: choose µ = 0 and σ² = 1.

Standardizing

Convert other values to standard units, or z-scores, by subtracting the mean and dividing by the standard deviation:

z = (x − µ)/σ.
Example: Standardize x = 3 with µ = 2 and σ = 4. What z-score range corresponds to (8, 17) with µ = 12 and σ² = 9?
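The example can be worked out directly; a small sketch:

```python
def z_score(x, mu, sigma):
    # number of standard deviations that x lies from the mean
    return (x - mu) / sigma

print(z_score(3, 2, 4))  # (3 - 2) / 4 = 0.25

# for the interval (8, 17) with mu = 12 and variance 9, sigma = sqrt(9) = 3
print(z_score(8, 12, 3))   # -4/3, about -1.333
print(z_score(17, 12, 3))  # 5/3, about 1.667
```

So the interval (8, 17) corresponds to the z-score range (−4/3, 5/3).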
Interpretation: z is the number of standard deviations that x is away from the mean. The z-score is unit-free. We can use it to compare observations from different sources ("apples to oranges").

Notation: The standard normal distribution is denoted by N(0, 1), and any other normal distribution with mean µ and standard deviation σ by N(µ, σ).
Relations Between Variables. Scatter Diagrams

In practice, statisticians are interested in relationships among multiple variables. For 2 variables, the data points are paired, each pair forming an observation. Sometimes we use the value of one variable in order to predict another variable. The response variable is the variable whose value can be explained by, or is determined by, the value of the explanatory variable. The response variable measures the outcome of a study. An explanatory variable explains or causes changes in the response variable.
The relationship between two variables can be represented by cross-tabulation, side-by-side or clustered bar graphs, and scatterplots.
Definition 5. A scatter diagram is a graph that shows the relationship between two quantitative variables measured on the same individual.

How to draw a scatter diagram:
1. The explanatory variable is plotted on the horizontal axis and the response variable is plotted on the vertical axis.
2. Each individual in the data set is represented by a point in the scatter diagram.
3. Do not connect the points when drawing a scatter diagram.
How We Interpret a Scatter Diagram

A scatter diagram may suggest a linear relationship, a nonlinear relationship, or no relation.

Definition 6. Two variables that are linearly related are said to be positively associated if, whenever the values of the predictor variable increase, the values of the response variable also increase, and negatively associated if, whenever the values of the predictor variable increase, the values of the response variable decrease.
Be careful! Do not conclude causation from association.
Definition 7. The linear correlation coefficient is a measure of the strength of the linear relation between two quantitative variables. The sample correlation coefficient is computed by:

r = [ Σ ((x_i − x̄)/s_x)((y_i − ȳ)/s_y) ] / (n − 1),   the sum running from i = 1 to n,

where
x̄ is the sample mean of the predictor variable,
s_x is the sample standard deviation of the predictor variable,
ȳ is the sample mean of the response variable,
s_y is the sample standard deviation of the response variable,
n is the number of individuals in the sample.
The population correlation coefficient is denoted by ρ.

Example: (0, 0), (1, 2), (2, 2), (3, 5), (4, 6)
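The example can be checked with a short sketch implementing the formula from Definition 7:

```python
import math

def correlation(xs, ys):
    # sample correlation coefficient: sum of products of z-like scores,
    # divided by n - 1
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - ybar) ** 2 for y in ys) / (n - 1))
    return sum(((x - xbar) / s_x) * ((y - ybar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

xs = [0, 1, 2, 3, 4]
ys = [0, 2, 2, 5, 6]
print(correlation(xs, ys))  # about 0.968: a strong positive linear relation
```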
Interpretation and Properties of r

−1 ≤ r ≤ 1.
If r = 1, there is a perfect positive linear relation between the 2 variables.
If r = −1, there is a perfect negative linear relation between the 2 variables.
The closer r is to 1, the stronger the evidence of a positive linear relation; the closer r is to −1, the stronger the evidence of a negative association between the two variables.
If r is close to 0, there is evidence of no linear relation between the 2 variables. This does not mean no relation, just no linear relation.
r is a unitless measure of association.
r is not resistant: it is strongly affected by outliers.
Both variables should be quantitative.