401 Review


Major topics of the course

1. Univariate analysis
2. Bivariate analysis
3. Simple linear regression
4. Linear algebra
5. Multiple regression analysis

Major analysis methods

1. Graphical analysis of one distribution.
2. Graphical comparison of two distributions.
3. One-sample and two-sample hypothesis tests.
4. Confidence intervals for the population mean.
5. Correlation analysis for bivariate data.
6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.
7. Regression analysis: parameter estimation, CIs and PIs, hypothesis tests, model building.
8. Sums of squares in regression.
9. Diagnostics and transformations for all of the above.

Major Definitions

1. Random variable: A natural process whose outcome cannot be predicted with certainty.

2. Sample space: The set of all possible outcomes for a particular random variable.

3. Probability: A number between 0 and 1 which describes the likelihood that some event will occur.

4. Distribution: The sample space, along with the probability of observing each point in the sample space. The distribution is a complete description of a random variable.

5. Qualitative/quantitative (random variable): A random variable whose outcomes are numerical is quantitative (e.g. length, price, time). If this is not the case, the random variable is qualitative (e.g. gender, success/failure).

6. (Empirical) Cumulative Distribution Function (CDF/ECDF): The CDF of a random variable X is a function F(t) such that F(t) = P(X ≤ t). The ECDF is an estimate F̂(t) of the CDF based on a sample X_1, ..., X_n, where F̂(t) is the proportion of the X_i less than or equal to t.

7. Probability Density Function (PDF): The PDF f(t) of a random variable X is a function such that P(a ≤ X ≤ b) is the area under the graph of f between a and b.

8. Population vs. sample: The population describes all possible outcomes of an experiment or measurement (possibly infinite). A sample is a list of outcomes actually obtained by carrying out the experiment some finite number of times. The population is fixed, but the sample is random (if you repeat your experiment, you will get different values in your sample).

9. iid sample/SRS: iid stands for "independent and identically distributed"; SRS stands for "Simple Random Sample". Both terms refer to a set of measurements that can be viewed as arising independently from a single distribution. In practice, this means that the same instruments and experimental procedures are followed for every data point, and the observations are selected at random.

10. Inference: A statement made about the population based on statistical analysis of a sample from the population. Since the data are random, we can never be certain that an inference is correct. A key goal of statistics is to use the information in the data efficiently so that inferences are correct as often as possible.

11. Prediction: Suppose you observe a sample from a population, and based on this sample you are able to learn something about the distribution. This allows you to make a better guess of a future random value from the distribution than you would have been able to make had you not observed the sample. This guess is called a prediction.

12. Sampling variation: The variation of a statistic due to random variation in the sample used to compute it.

13. Quantile: For a quantitative random variable X, if 0 ≤ p ≤ 1, Q(p) is the point in the sample space such that P(X ≤ Q(p)) = p.

14. Histogram: An estimate of the PDF.

15. Order statistics: Given a sample X_1, X_2, ..., X_n, the order statistics are the data listed in increasing order. The notation for the i-th element of the sorted list (the i-th order statistic) is X_(i).

16. Resistant (estimator): A resistant estimator is not highly sensitive to changes in the value of a single data point. Specifically, regardless of how much a single data point is changed, a resistant estimator will only change by a bounded amount.
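The ECDF, empirical quantiles, and order statistics above can be sketched in a few lines of plain Python. The sample below is invented, and the quantile rule shown (smallest order statistic X_(i) with i/n ≥ p) is one of several common conventions:

```python
# Illustration of the ECDF, an empirical quantile, and order statistics,
# using a small made-up sample (plain Python, no libraries).
sample = [4.2, 1.7, 3.3, 2.8, 5.1, 2.8, 3.9]
n = len(sample)

# Order statistics: the data sorted in increasing order; X_(i) is order[i-1].
order = sorted(sample)

def ecdf(t):
    """F_hat(t): proportion of sample values less than or equal to t."""
    return sum(x <= t for x in sample) / n

def quantile(p):
    """One empirical Q(p) convention: smallest X_(i) with i/n >= p."""
    for i, x in enumerate(order, start=1):
        if i / n >= p:
            return x
    return order[-1]

print(ecdf(3.0))      # proportion of values <= 3.0
print(quantile(0.5))  # an empirical median
```

Note that F̂(t) jumps by 1/n at each order statistic and equals 1 at and beyond the sample maximum.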

17. Mean (population/sample): The mean is one way to measure the most typical value of a distribution (a measure of location). The population mean is a balancing point: if you multiply each point in the sample space by its probability, the sums of these values to the right and to the left of the mean are equal. The sample mean is simply the average of the data. The population mean may also be called the expected value or expectation.

18. Variance, standard deviation: These are measures of scale. The variance is a measure of how far random values tend to be from their mean. Specifically, it is the average squared distance to the mean (actually, not quite the average, since you divide by n - 1 rather than n). The standard deviation is the square root of the variance. The sample variance and standard deviation are estimates of the population variance and standard deviation.

19. Median, IQR: The median is the 0.5 quantile. Roughly speaking, the median is the point θ such that half the data are greater than θ, and half the data are less than θ. The IQR is the difference between the 75th and 25th percentiles. The median is a resistant measure of location, and the IQR is a resistant measure of scale.

20. Median center: Subtracting the sample median from all data points gives a data set with median 0, but other statistical properties (such as the variance) are unchanged.

21. Standardize: Subtracting the mean from all data points and then dividing each value by the standard deviation yields a set of points with mean 0 and standard deviation 1. In other respects these values resemble the original values.

22. Right/left skew, symmetric (distribution): A right-skewed distribution has more atypically large values than atypically small values. A left-skewed distribution has more atypically small values than atypically large values. A symmetric distribution has equal tendency to produce atypically large and atypically small values.

23. Thick/thin (tail): A thick tail produces a relatively greater number of extreme values. A thick right tail will produce a greater number of extreme large values, and a thick left tail will produce a greater number of extreme small values (here "small" means far out in the negative direction, not close to 0). Similarly, thin tails produce a relatively smaller number of extreme values. A right-skewed distribution has a thicker right tail and a thinner left tail. A left-skewed distribution has a thicker left tail and a thinner right tail. A symmetric distribution has equally thick right and left tails.

24. QQ (quantile/quantile) plot: A plot of the quantiles of one random variable against those of another. For example, if Q_X(0.75) is the 75th percentile of X and Q_Y(0.75) is the 75th percentile of Y, then the point (Q_X(0.75), Q_Y(0.75)) would be plotted, along with all other points obtained by replacing 0.75 with other numbers between 0 and 1. If the two random variables have the same distribution, a diagonal line results. The further the QQ points fall from the diagonal, the greater the level of difference between the two distributions. Based on a QQ plot, one can determine which of the two variables is larger on average, which is more variable, and which has a thicker right or left tail. QQ plots are often made with median-centered or standardized data, to highlight differences that are not obvious in other analyses.

25. Normal probability plot: A type of QQ plot in which a univariate sample is standardized, then the order statistics are plotted against the corresponding quantiles of the standard normal distribution. Its main application is to assess whether data are normal, since normality is an assumption of many statistical procedures.

26. Translation: The transform that adds a constant to each data point.

27. Scaling: The transform that multiplies each data point by a constant.

28. Invariant: A statistic is invariant to a certain transform if it doesn't change when the transform is applied to the data. For example, the variance is invariant to translations.

29. Power transform: If the data are transformed by a function of the form X^q, they have been power transformed. If the exponent q is close to 0, the power transform is very similar to a log transform.

30. Normal distribution: This distribution is often used to describe the variation in data. More importantly, the sample mean and many other statistics have approximately normal distributions even if the underlying data used to form the statistic are not normally distributed. The standard normal distribution has mean zero and variance one. This is the distribution provided in the normal table. To obtain other normal probabilities you must standardize.

31. t-distribution: This family of distributions describes the variation in certain statistics involving the variance, in which the sample variance has been substituted for the population variance. Due to the extra uncertainty in the sample variance, the result is more variable than a normal distribution.

32. Degrees of freedom: This number determines which particular t-distribution is used in a given problem. The larger the degrees of freedom, the closer the resulting t-distribution is to a normal distribution. Degrees of freedom also occur in the F-distribution (which has two degrees of freedom, one for the MSR and one for the MSE).

33. Null and alternative hypotheses: The standard approach to hypothesis testing requires stating two hypotheses that are compared to each other. These hypotheses are not handled symmetrically: if the evidence is not overwhelmingly in favor of the alternative hypothesis, the decision is made in favor of the null hypothesis. The hypotheses should be devised so that it is a more serious mistake to erroneously decide in favor of the alternative hypothesis than it is to erroneously decide in favor of the null hypothesis.

34. Test statistic: Given a sample X_1, ..., X_n from a population about which you wish to make some inference, a test statistic T is a function that compresses the data into a single number that contains all the relevant information for the inference.
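Standardization (definition 21) is easy to verify numerically: after subtracting the mean and dividing by the n - 1 standard deviation (definition 18), the transformed sample has mean 0 and standard deviation 1. The data below are made up:

```python
# A minimal sketch of standardizing a sample, using the sample mean and
# the n-1 version of the sample standard deviation.
import math

data = [12.0, 15.0, 9.0, 14.0, 10.0, 13.0]  # invented values
n = len(data)

mean = sum(data) / n
var = sum((x - mean) ** 2 for x in data) / (n - 1)   # divide by n-1, not n
sd = math.sqrt(var)

z = [(x - mean) / sd for x in data]   # standardized values

z_mean = sum(z) / n
z_var = sum((x - z_mean) ** 2 for x in z) / (n - 1)
print(z_mean, z_var)   # should be (numerically) 0 and 1
```

Plotting the sorted z values against standard normal quantiles would give the normal probability plot of definition 25.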

35. Rejection region: The rejection region is the set of all test statistic values that are extreme enough that you may reject the null hypothesis. Typically this would be any test statistic value whose p-value is below the chosen significance level (usually 0.05).

36. One-sided, two-sided, right-tailed, left-tailed (tests): If the parameter being tested is θ and the alternative hypothesis is θ > c (right-tailed) or θ < c (left-tailed) for some constant c, we have a one-sided test. If the alternative hypothesis is θ ≠ c, we have a two-sided test.

37. Type I/II error, false negative, false positive: A type I error (false positive) is a decision in favor of the alternative hypothesis when the null hypothesis is true. A type II error (false negative) is a decision in favor of the null hypothesis when the alternative hypothesis is true. In the usual hypothesis testing situation, false positives are worse than false negatives.

38. p-value: The probability of observing a test statistic at least as extreme as the observed test statistic value under the null hypothesis.

39. Power: The power is the probability of rejecting the null hypothesis when the alternative hypothesis is true. To calculate the power, you must know the specific alternative hypothesis (i.e. θ > 0 is not adequate; you must have a specific value like θ = 1), and you must know the significance level at which the null will be rejected (usually 0.05 or 0.01).

40. Effect size: The effect size is the difference between the alternative and null values of the parameter being tested. This term is usually used in the context of power analysis, where one can state the smallest effect size that is detectable at a given power, or the smallest sample size at which a given effect size is detectable at a given power.

41. Point estimate: A numerical estimate of an unknown quantity. For example, the sample mean is a point estimate of the population mean. The point estimate differs from the true value due to random variation in the data.

42. Confidence and prediction intervals: These are intervals that are constructed to cover some unknown quantity with a given probability. CIs are constructed to cover unknown constants (parameters) such as the population mean. PIs are constructed to cover observations that will be made in the future from some distribution.

43. Coverage probability: The actual probability that a CI or PI will cover the value that it is designed to cover. If all assumptions are met, this will be the coverage for which the interval was constructed (i.e. a 95% PI will actually cover 95% of the time). If all assumptions are not met, then the coverage can be lower or higher.

44. Width (of a CI or PI): The difference between the upper and lower bounds of the interval. We prefer CIs and PIs to be as short as possible, since this leads to a more precise statement.
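Definitions 39-40 can be made concrete with a power calculation. The sketch below uses a one-sample, two-sided z-test under a normal approximation (rather than the exact t calculation); the effect size, σ, and n are invented illustration values:

```python
# Power of a one-sample two-sided z-test against a specific alternative
# mu = mu0 + effect, via the normal approximation. All numbers are made up.
import math

def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def power_two_sided(effect, sigma, n, z_crit=1.96):
    # z_crit ~ upper 0.025 standard normal quantile (alpha = 0.05).
    shift = effect / (sigma / math.sqrt(n))   # effect size in SE units
    # Reject when |Z| > z_crit; under the alternative, Z ~ N(shift, 1).
    return (1 - phi(z_crit - shift)) + phi(-z_crit - shift)

p = power_two_sided(effect=0.5, sigma=1.0, n=30)
print(p)   # probability of rejecting H0 when the effect is really 0.5
```

When the effect is 0, the same formula returns the significance level itself (about 0.05), which is a useful sanity check.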

45. One-sample/two-sample (hypothesis test): If two iid samples are observed from two possibly different populations, we are analyzing two-sample data. For this course the only analysis is a test of equality of the two population means. If one iid sample is observed from one population, we can test the population mean against a constant (usually zero).

46. Univariate/bivariate data: If one measurement is made per individual being studied, we are performing a univariate analysis. If two such measurements are made, we are performing a bivariate analysis. Be clear about the difference between bivariate data and two-sample data; they are very different things.

47. Scatterplot: A plot of bivariate data in which the (X_i, Y_i) values are plotted as points in the plane.

48. Positive/negative trend (association): For bivariate data (X, Y), if Y tends to increase when X increases (and hence X tends to increase when Y increases), then X and Y are positively associated. If Y tends to decrease when X increases (and hence X tends to decrease when Y increases), then X and Y are negatively associated. If neither relationship consistently holds, then X and Y have no association.

49. Correlation coefficient: A measure of the association between bivariate measurements X and Y. The correlation coefficient always falls between -1 and 1. Positive values indicate positive association, negative values indicate negative association, and values close to zero indicate no association.

50. Covariance: A measure of association between bivariate measurements. The scale of the covariance depends on the scale of the measurements, making it less useful for analysis. The correlation coefficient is a rescaled version of the covariance.

51. Fisher's (Z) transform: A transformation that stretches the correlation coefficient so that instead of falling between -1 and 1, it falls between -∞ and ∞. Values close to zero are only slightly changed, but values close to ±1 are substantially changed. The specific form of the Fisher transform is f(r) = log((1 + r)/(1 - r))/2, where log is the natural log. The Fisher transform produces a variable that has an approximately normal distribution with mean f(ρ), where ρ is the population correlation coefficient, and variance 1/(n - 3). It can be used to carry out hypothesis tests and calculate confidence intervals for correlation coefficients.

52. Conditional mean and variance: If (Y, X) are bivariate measurements, E(Y | X = x) is a function of x whose value is the average of all Y values paired with X values equal to x. Similarly, var(Y | X = x) is the variance of all Y values paired with X values equal to x, and SD(Y | X = x) is the standard deviation of all Y values paired with X values equal to x.

53. Heteroscedastic/homoscedastic: A heteroscedastic bivariate pair (Y, X) has the property that SD(Y | X = x) varies with x. For a homoscedastic pair, SD(Y | X = x) is constant as a function of x.
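The Fisher-transform recipe of definition 51 can be sketched end to end: compute r, map it with f, build a normal-theory interval with SD 1/√(n - 3), and map the endpoints back. The data pairs below are invented, and 1.96 gives an approximate 95% interval:

```python
# Approximate 95% CI for a correlation via the Fisher transform.
# The (x, y) pairs are made-up illustration data.
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 4.2, 4.0, 5.5, 6.1, 6.8, 8.3]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
r = sxy / math.sqrt(sxx * syy)            # sample correlation coefficient

z = 0.5 * math.log((1 + r) / (1 - r))     # Fisher transform f(r)
se = 1 / math.sqrt(n - 3)                 # SD of f(r) is about 1/sqrt(n-3)
lo, hi = z - 1.96 * se, z + 1.96 * se

# Back-transform the endpoints to the correlation scale (inverse of f).
inv = lambda t: (math.exp(2 * t) - 1) / (math.exp(2 * t) + 1)
ci = (inv(lo), inv(hi))
print(r, ci)
```

Because the inverse map is monotone and bounded by ±1, the back-transformed interval always stays inside (-1, 1).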

54. Simple linear regression: If we have bivariate data, assume E(Y | X) = α + βX (i.e. the mean of Y is linear in X), and the data are homoscedastic, we have simple linear regression.

55. Errors: In a regression model, the observed response values differ from the expected response values by a random error term ε. The error term always has expected value zero.

56. Fitted values: In any regression model, once we have the parameter estimates, we can estimate the expected response value at each X_i, denoted Ŷ_i, by plugging the parameter estimates into the mean function. For example, in simple linear regression, if we estimate α̂ and β̂, the fitted values are Ŷ_i = α̂ + β̂X_i.

57. Residuals: For any regression model, the residuals are the observed response values Y_i minus the fitted values Ŷ_i: r_i = Y_i - Ŷ_i.

58. Least squares: The process of estimating regression parameters by minimizing the sum of squares of the residuals.

59. Outlier: One of a small number of points that is dramatically different from the trend followed by the remaining points. Specifically, any observation i such that |r_i| is greater than 2 or 2.5 times the IQR of all r_i may be considered an outlier. It may be desirable to remove outliers during regression analysis, but they should still be considered as part of the overall analysis.

60. Diagnostic: Any method, especially a graphical method, that is designed to assess whether the assumptions of the linear model are approximately satisfied. The key diagnostics for simple linear regression are the scatterplot of residuals on fitted values (which should have no pattern), and the normal probability plot of the residuals (which should lie approximately on the 45° line).

61. Vector, matrix: A vector is a list of numbers; a matrix is a table of numbers.

62. Dimension (of a vector): The number of entries in a vector.

63. Linear combination: Starting with several vectors of the same dimension, if the vectors are scaled by (possibly different) constants and the resulting vectors are added, the final vector is a linear combination of the original vectors.

64. Dot product: Given two vectors of the same dimension, if corresponding elements are multiplied and the resulting products are summed, a single number results. This is the dot product (also called scalar product or inner product).

65. Perpendicular (orthogonal) vectors: Two vectors of the same dimension with zero dot product are perpendicular.

66. Linearly dependent: A set of vectors of the same dimension is linearly dependent if some linear combination (with at least one nonzero coefficient) of the vectors is zero. If a set of vectors is not linearly dependent, it is linearly independent.
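Definitions 54-58 fit together in a few lines: the least-squares slope and intercept have closed forms, the fitted values come from plugging into the mean function, and the residuals are the leftovers. The data below are invented:

```python
# Closed-form least-squares fit for simple linear regression,
# with fitted values and residuals. Data are made up.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

mx, my = sum(x) / n, sum(y) / n
beta = (sum((a - mx) * (b - my) for a, b in zip(x, y))
        / sum((a - mx) ** 2 for a in x))        # slope estimate
alpha = my - beta * mx                           # intercept estimate

fitted = [alpha + beta * a for a in x]           # Y_hat_i
resid = [b - f for b, f in zip(y, fitted)]       # r_i = Y_i - Y_hat_i

print(beta, alpha)
```

A useful check of the least-squares property: with an intercept in the model, the residuals sum to zero and have zero dot product with the x values.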

67. Symmetric (matrix): A matrix A is symmetric if A_ij = A_ji for all indices i and j.

68. Square (matrix): A matrix is square if it has the same number of rows and columns. Otherwise it is rectangular. A rectangular matrix is tall and thin if it has more rows than columns. The design matrix in a regression problem is always tall and thin.

69. Matrix-vector product: For an m × n matrix A and an n-dimensional vector B, the matrix-vector product AB is an m-dimensional vector. One way to form AB is to take the dot product of each row of A with B, and place the results into a vector. A different, equivalent way to form AB is to construct a linear combination of the columns of A using the elements of B as coefficients.

70. Nullspace (of a matrix): The nullspace of a matrix A is the set of all coefficient vectors B such that AB = 0. The vector B = 0 is always in the nullspace. For some matrices, other nonzero vectors may be in the nullspace as well. If 0 is the only vector in the nullspace, the matrix is nonsingular; otherwise it is singular. A matrix with more columns than rows is always singular. A matrix with equally many, or fewer, columns than rows may be singular or nonsingular.

71. Matrix-matrix product: Two matrices A and B may be multiplied to form AB if the number of columns of A is equal to the number of rows of B. If A is m × n and B is n × r, AB is m × r. The i, j element of AB is the dot product between row i of A and column j of B. The matrix products X'X and XX' always exist. The former is called the column-wise inner product matrix, while the latter is called the row-wise inner product matrix.

72. Identity matrix: The identity matrix I is a square m × m matrix such that if A has m columns, AI = A, and if A has m rows, IA = A.

73. Matrix inverse: If A is a square m × m matrix, the inverse of A is a matrix A⁻¹ such that AA⁻¹ = A⁻¹A = I, where I is the m × m identity matrix. The inverse only exists if A is nonsingular.

74. Multiple regression: Data in which a single response measurement Y is paired with one or more predictor variables X_j can be analyzed using multiple linear regression. The mean function is E(Y | X) = α + β_1 X_1 + β_2 X_2 + ... + β_p X_p, and the data should be homoscedastic.

75. Design matrix: All predictor variable values for all observations in a multiple regression problem can be stored in the design matrix. The first column contains all 1's, and subsequent columns contain the predictor variable values. Each row contains the data for one observation, and each column contains the data for one predictor variable.

76. Proportion of explained variance (PVE): A number between zero and one, such that large values indicate that the predictor variables do a good job tracking the variation in the response values. The PVE is very interpretable, but has some technical drawbacks: it always increases as new variables are added, and it is not easy to do any inference with the PVE. Larger PVE values indicate a better model.
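The design matrix and normal-equations view of definitions 69-76 can be sketched with numpy (assumed available here). The data are invented, and the estimates come from solving (X'X)β = X'y rather than explicitly inverting X'X, which is the numerically preferred route:

```python
# Multiple regression via the normal equations, plus the PVE.
# All data values are made up for illustration.
import numpy as np

# Design matrix: first column all 1's, then one column per predictor.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 3.0, 4.0],
              [1.0, 5.0, 2.0],
              [1.0, 7.0, 5.0],
              [1.0, 8.0, 3.0]])
y = np.array([6.1, 10.8, 13.2, 19.9, 20.0])

XtX = X.T @ X                          # column-wise inner product matrix X'X
beta = np.linalg.solve(XtX, X.T @ y)   # solves (X'X) beta = X'y

fitted = X @ beta
resid = y - fitted

# PVE: 1 - SSE/SST; here the fit is nearly perfect by construction.
sst = np.sum((y - y.mean()) ** 2)
pve = 1.0 - np.sum(resid ** 2) / sst
print(beta, pve)
```

As in the simple regression case, the least-squares residuals are orthogonal to every column of the design matrix, which is exactly the statement X'(y - Xβ̂) = 0.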

77. F statistic: The F statistic is MSR/MSE. Like the PVE, it is larger when the predictor variables do a good job tracking the variation in the response values. It ranges from 0 to ∞, and which values are considered large depends on the degrees of freedom. It is less interpretable than the PVE, but is easy to use in hypothesis testing since tables of the F-distribution are easy to construct. Larger F values indicate a better model.

78. Akaike Information Criterion (AIC): A measure of fit for a regression model that explicitly accounts for the number of variables and the sizes of the residuals. The positive effects of small residuals can be offset by the negative effects of a complex model with many predictor variables. Smaller AIC values indicate a better model.

79. Main effects: For a predictor variable X_j, the main effect is the term β_j X_j which appears in the regression function.

80. Interaction: In a multiple regression model, if the slope for one variable depends on the value of another variable, the two variables interact. The product term X_j X_k for two interacting variables can be included as a new variable in the regression model to account for this interaction.

81. Polynomial regression: If the relationship between Y and one of the predictor variables X_j is not linear, polynomial terms X_j², X_j³, etc. can be included as new variables in the regression model.

82. Full model: A multiple regression model in which main effects are included for every available predictor variable.

83. Forward/backward/all subsets selection: These are three ways to find the best model for a given dataset. The goal is to determine the population model, but like any inferential procedure the correct result will not always be obtained due to random variation in the data.

Key scaling properties:

1. Measures of location: The mean and median scale and translate in the same way that the underlying data are scaled or translated. So if the data are translated by c, the mean and the median are translated by c. If the data are scaled by c, the mean and median are scaled by c.

2. Measures of scale: The variance, IQR, and standard deviation are invariant to translations. The IQR and standard deviation scale with the magnitude of the scale factor: if the data are scaled by c, the IQR and standard deviation are scaled by |c|. The variance scales with the square of the scale factor: if the data are scaled by c, the variance is scaled by c².

3. Measures of association: The correlation and covariance are invariant to translations in both the X and Y variables. If either the X or Y variable is scaled by c, the correlation is scaled by sgn(c) = c/|c|, which is ±1. The covariance scales with the X values and Y values separately, so if the X values are scaled by c and the Y values are scaled by d, the covariance is scaled by c·d.

4. Slopes: For simple linear regression, if the Y values are scaled by c, β̂ is scaled by c. If the X values are scaled by c, β̂ is scaled by 1/c.

Key sampling distributions:

1. Sample mean:
   E(X̄) = E(X_i), var(X̄) = σ²/n, SD(X̄) = σ/√n

2. Correlation coefficient (* denotes the Fisher-transformed value):
   E(r*) ≈ ρ*, var(r*) ≈ 1/(n - 3), SD(r*) ≈ 1/√(n - 3)

3. Simple linear regression slope:
   E(β̂) = β, var(β̂) = σ²/((n - 1)σ_X²), SD(β̂) = σ/(√(n - 1) σ_X)

4. Multiple regression slopes:
   E(β̂_j) = β_j, var(β̂_j) = [σ²(X'X)⁻¹]_jj

Variance/Covariance Identities:

var(X) = cov(X, X)
var(X + Y) = var(X) + var(Y) + 2cov(X, Y)
var(X - Y) = var(X) + var(Y) - 2cov(X, Y)
If X and Y are independent: var(X + Y) = var(X - Y) = var(X) + var(Y).
cov(X, Y + Z) = cov(X, Y) + cov(X, Z)
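The variance/covariance identities above hold exactly for the sample versions as well (with the same n - 1 denominator throughout), so they can be checked directly on made-up paired data:

```python
# Numerical check of var(X) = cov(X, X) and the sum/difference identities,
# using sample (n-1) versions on invented data.
def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def var(v):
    return cov(v, v)   # var(X) = cov(X, X)

X = [1.0, 4.0, 2.0, 8.0, 5.0]
Y = [3.0, 1.0, 6.0, 2.0, 9.0]
S = [a + b for a, b in zip(X, Y)]   # X + Y, pointwise
D = [a - b for a, b in zip(X, Y)]   # X - Y, pointwise

lhs_plus = var(S)
rhs_plus = var(X) + var(Y) + 2 * cov(X, Y)
lhs_minus = var(D)
rhs_minus = var(X) + var(Y) - 2 * cov(X, Y)
print(lhs_plus, rhs_plus, lhs_minus, rhs_minus)
```

The same functions also verify the key scaling properties, e.g. scaling the data by 3 multiplies the variance by 9.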


More information

Regression diagnostics

Regression diagnostics Regression diagnostics Kerby Shedden Department of Statistics, University of Michigan November 5, 018 1 / 6 Motivation When working with a linear model with design matrix X, the conventional linear model

More information

Lecture 14 Simple Linear Regression

Lecture 14 Simple Linear Regression Lecture 4 Simple Linear Regression Ordinary Least Squares (OLS) Consider the following simple linear regression model where, for each unit i, Y i is the dependent variable (response). X i is the independent

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility

The Slow Convergence of OLS Estimators of α, β and Portfolio. β and Portfolio Weights under Long Memory Stochastic Volatility The Slow Convergence of OLS Estimators of α, β and Portfolio Weights under Long Memory Stochastic Volatility New York University Stern School of Business June 21, 2018 Introduction Bivariate long memory

More information

Linear Models and Estimation by Least Squares

Linear Models and Estimation by Least Squares Linear Models and Estimation by Least Squares Jin-Lung Lin 1 Introduction Causal relation investigation lies in the heart of economics. Effect (Dependent variable) cause (Independent variable) Example:

More information

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016

Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

SSR = The sum of squared errors measures how much Y varies around the regression line n. It happily turns out that SSR + SSE = SSTO.

SSR = The sum of squared errors measures how much Y varies around the regression line n. It happily turns out that SSR + SSE = SSTO. Analysis of variance approach to regression If x is useless, i.e. β 1 = 0, then E(Y i ) = β 0. In this case β 0 is estimated by Ȳ. The ith deviation about this grand mean can be written: deviation about

More information

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University.

Summer School in Statistics for Astronomers V June 1 - June 6, Regression. Mosuk Chow Statistics Department Penn State University. Summer School in Statistics for Astronomers V June 1 - June 6, 2009 Regression Mosuk Chow Statistics Department Penn State University. Adapted from notes prepared by RL Karandikar Mean and variance Recall

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Homework 2: Simple Linear Regression

Homework 2: Simple Linear Regression STAT 4385 Applied Regression Analysis Homework : Simple Linear Regression (Simple Linear Regression) Thirty (n = 30) College graduates who have recently entered the job market. For each student, the CGPA

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006

Keller: Stats for Mgmt & Econ, 7th Ed July 17, 2006 Chapter 17 Simple Linear Regression and Correlation 17.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Remedial Measures, Brown-Forsythe test, F test

Remedial Measures, Brown-Forsythe test, F test Remedial Measures, Brown-Forsythe test, F test Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 7, Slide 1 Remedial Measures How do we know that the regression function

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2 MA 575 Linear Models: Cedric E Ginestet, Boston University Revision: Probability and Linear Algebra Week 1, Lecture 2 1 Revision: Probability Theory 11 Random Variables A real-valued random variable is

More information

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College

ANOVA Situation The F Statistic Multiple Comparisons. 1-Way ANOVA MATH 143. Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College An example ANOVA situation Example (Treating Blisters) Subjects: 25 patients with blisters Treatments: Treatment A, Treatment

More information

Correlation and the Analysis of Variance Approach to Simple Linear Regression

Correlation and the Analysis of Variance Approach to Simple Linear Regression Correlation and the Analysis of Variance Approach to Simple Linear Regression Biometry 755 Spring 2009 Correlation and the Analysis of Variance Approach to Simple Linear Regression p. 1/35 Correlation

More information

3 Multiple Linear Regression

3 Multiple Linear Regression 3 Multiple Linear Regression 3.1 The Model Essentially, all models are wrong, but some are useful. Quote by George E.P. Box. Models are supposed to be exact descriptions of the population, but that is

More information

THE ROYAL STATISTICAL SOCIETY 2008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS

THE ROYAL STATISTICAL SOCIETY 2008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS THE ROYAL STATISTICAL SOCIETY 008 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE (MODULAR FORMAT) MODULE 4 LINEAR MODELS The Society provides these solutions to assist candidates preparing for the examinations

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Supervised Learning: Regression I Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Some of the

More information

Correlation and Regression

Correlation and Regression Correlation and Regression October 25, 2017 STAT 151 Class 9 Slide 1 Outline of Topics 1 Associations 2 Scatter plot 3 Correlation 4 Regression 5 Testing and estimation 6 Goodness-of-fit STAT 151 Class

More information

Homoskedasticity. Var (u X) = σ 2. (23)

Homoskedasticity. Var (u X) = σ 2. (23) Homoskedasticity How big is the difference between the OLS estimator and the true parameter? To answer this question, we make an additional assumption called homoskedasticity: Var (u X) = σ 2. (23) This

More information

MATH4427 Notebook 4 Fall Semester 2017/2018

MATH4427 Notebook 4 Fall Semester 2017/2018 MATH4427 Notebook 4 Fall Semester 2017/2018 prepared by Professor Jenny Baglivo c Copyright 2009-2018 by Jenny A. Baglivo. All Rights Reserved. 4 MATH4427 Notebook 4 3 4.1 K th Order Statistics and Their

More information

Introduction and Single Predictor Regression. Correlation

Introduction and Single Predictor Regression. Correlation Introduction and Single Predictor Regression Dr. J. Kyle Roberts Southern Methodist University Simmons School of Education and Human Development Department of Teaching and Learning Correlation A correlation

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

STAT Chapter 11: Regression

STAT Chapter 11: Regression STAT 515 -- Chapter 11: Regression Mostly we have studied the behavior of a single random variable. Often, however, we gather data on two random variables. We wish to determine: Is there a relationship

More information

2008 Winton. Statistical Testing of RNGs

2008 Winton. Statistical Testing of RNGs 1 Statistical Testing of RNGs Criteria for Randomness For a sequence of numbers to be considered a sequence of randomly acquired numbers, it must have two basic statistical properties: Uniformly distributed

More information

Chapter 16. Simple Linear Regression and Correlation

Chapter 16. Simple Linear Regression and Correlation Chapter 16 Simple Linear Regression and Correlation 16.1 Regression Analysis Our problem objective is to analyze the relationship between interval variables; regression analysis is the first tool we will

More information

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression Specication Let Y be a univariate quantitative response variable. We model Y as follows: Y = f(x) + ε where

More information

Correlation and regression

Correlation and regression 1 Correlation and regression Yongjua Laosiritaworn Introductory on Field Epidemiology 6 July 2015, Thailand Data 2 Illustrative data (Doll, 1955) 3 Scatter plot 4 Doll, 1955 5 6 Correlation coefficient,

More information

Math 494: Mathematical Statistics

Math 494: Mathematical Statistics Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/

More information

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference.

Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals. Regression Output. Conditions for inference. Understanding regression output from software Nature vs. nurture? Lecture 18 - Regression: Inference, Outliers, and Intervals In 1966 Cyril Burt published a paper called The genetic determination of differences

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Accelerated Advanced Algebra. Chapter 1 Patterns and Recursion Homework List and Objectives

Accelerated Advanced Algebra. Chapter 1 Patterns and Recursion Homework List and Objectives Chapter 1 Patterns and Recursion Use recursive formulas for generating arithmetic, geometric, and shifted geometric sequences and be able to identify each type from their equations and graphs Write and

More information

Simple linear regression

Simple linear regression Simple linear regression Biometry 755 Spring 2008 Simple linear regression p. 1/40 Overview of regression analysis Evaluate relationship between one or more independent variables (X 1,...,X k ) and a single

More information

Linear models and their mathematical foundations: Simple linear regression

Linear models and their mathematical foundations: Simple linear regression Linear models and their mathematical foundations: Simple linear regression Steffen Unkel Department of Medical Statistics University Medical Center Göttingen, Germany Winter term 2018/19 1/21 Introduction

More information

Statistical Inference

Statistical Inference Statistical Inference Bernhard Klingenberg Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Outline Estimation: Review of concepts

More information

Psychology 282 Lecture #4 Outline Inferences in SLR

Psychology 282 Lecture #4 Outline Inferences in SLR Psychology 282 Lecture #4 Outline Inferences in SLR Assumptions To this point we have not had to make any distributional assumptions. Principle of least squares requires no assumptions. Can use correlations

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College

1-Way ANOVA MATH 143. Spring Department of Mathematics and Statistics Calvin College 1-Way ANOVA MATH 143 Department of Mathematics and Statistics Calvin College Spring 2010 The basic ANOVA situation Two variables: 1 Categorical, 1 Quantitative Main Question: Do the (means of) the quantitative

More information

. a m1 a mn. a 1 a 2 a = a n

. a m1 a mn. a 1 a 2 a = a n Biostat 140655, 2008: Matrix Algebra Review 1 Definition: An m n matrix, A m n, is a rectangular array of real numbers with m rows and n columns Element in the i th row and the j th column is denoted by

More information

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing

Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing Statistical Inference: Estimation and Confidence Intervals Hypothesis Testing 1 In most statistics problems, we assume that the data have been generated from some unknown probability distribution. We desire

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Linear Regression. Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x).

Linear Regression. Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x). Linear Regression Simple linear regression model determines the relationship between one dependent variable (y) and one independent variable (x). A dependent variable is a random variable whose variation

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5)

Simple Linear Regression. (Chs 12.1, 12.2, 12.4, 12.5) 10 Simple Linear Regression (Chs 12.1, 12.2, 12.4, 12.5) Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 2 Simple Linear Regression Rating 20 40 60 80 0 5 10 15 Sugar 3 Simple Linear Regression

More information

Ordinary Least Squares Regression Explained: Vartanian

Ordinary Least Squares Regression Explained: Vartanian Ordinary Least Squares Regression Explained: Vartanian When to Use Ordinary Least Squares Regression Analysis A. Variable types. When you have an interval/ratio scale dependent variable.. When your independent

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Simple Linear Regression. Material from Devore s book (Ed 8), and Cengagebrain.com

Simple Linear Regression. Material from Devore s book (Ed 8), and Cengagebrain.com 12 Simple Linear Regression Material from Devore s book (Ed 8), and Cengagebrain.com The Simple Linear Regression Model The simplest deterministic mathematical relationship between two variables x and

More information

Section 4.6 Simple Linear Regression

Section 4.6 Simple Linear Regression Section 4.6 Simple Linear Regression Objectives ˆ Basic philosophy of SLR and the regression assumptions ˆ Point & interval estimation of the model parameters, and how to make predictions ˆ Point and interval

More information

Business Statistics. Lecture 10: Course Review

Business Statistics. Lecture 10: Course Review Business Statistics Lecture 10: Course Review 1 Descriptive Statistics for Continuous Data Numerical Summaries Location: mean, median Spread or variability: variance, standard deviation, range, percentiles,

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Y i = η + ɛ i, i = 1,...,n.

Y i = η + ɛ i, i = 1,...,n. Nonparametric tests If data do not come from a normal population (and if the sample is not large), we cannot use a t-test. One useful approach to creating test statistics is through the use of rank statistics.

More information

Linear Algebra Review

Linear Algebra Review Linear Algebra Review Yang Feng http://www.stat.columbia.edu/~yangfeng Yang Feng (Columbia University) Linear Algebra Review 1 / 45 Definition of Matrix Rectangular array of elements arranged in rows and

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /1/2016 1/46 BIO5312 Biostatistics Lecture 10:Regression and Correlation Methods Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/1/2016 1/46 Outline In this lecture, we will discuss topics

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

Review (Probability & Linear Algebra)

Review (Probability & Linear Algebra) Review (Probability & Linear Algebra) CE-725 : Statistical Pattern Recognition Sharif University of Technology Spring 2013 M. Soleymani Outline Axioms of probability theory Conditional probability, Joint

More information

Regression Analysis: Exploring relationships between variables. Stat 251

Regression Analysis: Exploring relationships between variables. Stat 251 Regression Analysis: Exploring relationships between variables Stat 251 Introduction Objective of regression analysis is to explore the relationship between two (or more) variables so that information

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Linear Regression for Air Pollution Data

Linear Regression for Air Pollution Data UNIVERSITY OF TEXAS AT SAN ANTONIO Linear Regression for Air Pollution Data Liang Jing April 2008 1 1 GOAL The increasing health problems caused by traffic-related air pollution have caught more and more

More information

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z).

Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). Table of z values and probabilities for the standard normal distribution. z is the first column plus the top row. Each cell shows P(X z). For example P(X.04) =.8508. For z < 0 subtract the value from,

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

Midterm 2 - Solutions

Midterm 2 - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis February 23, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put

More information