Modules 1-2 are background; they are the same for regression analysis and time series.


Regression Analysis, Module 1: Regression Models

(The attached PDF file has better formatting.)

Required reading: Chapter 1, pages 3-13 (until appendix 1.1).

Updated: May 23, 2005

Modules 1-2 are background; they are the same for regression analysis and time series.

Jacob: This chapter does not seem like background. It covers much of regression analysis, as if we have already had a course in this subject.

Rachel: This chapter provides the basic formulas. Chapter 3 begins the actual course, deriving the formulas and explaining the intuition. This chapter is for candidates who have never dealt with regression analysis, and who are not familiar with the basic equations.

Section 1.1 on pages 3-7 covers curve fitting. Know the definition of the least squares criterion on page 5: the line of best fit minimizes the sum of squared deviations. Figure 1.2 on page 5 shows the graphic interpretation.

Jacob: Do we prove that minimizing the sum of squared deviations is the best way to fit a curve?

Rachel: No; this is a definition. We could use other loss functions instead. For the final exam, you must know the pros and cons of different loss functions.

Throughout this course, we discuss deviations. The deviations may be from the regression line (error sum of squares) or from the mean (total sum of squares).

To minimize the sum of squared deviations and keep the algebra simple, we solve a regression problem in two steps: (i) move the graph horizontally and vertically so that the means are zero, and (ii) determine the line of best fit passing through the origin.

Jacob: What does "move the graph horizontally and vertically" mean? Do we move the regression line horizontally and vertically, or do we move the coordinates of the graph?

Rachel: Those are the same thing. Moving the line upward (vertically) ten units is moving the origin of the graph downward ten units.

Figure 1.3 on page 6 shows alternative loss functions. One of the illustrative test questions compares alternative loss functions. We use the least squares loss function, which has some desirable properties. But we do not say that this loss function is correct, that other loss functions are incorrect, or that this loss function is better than others. We might minimize the absolute error, if it were mathematically tractable. In fact, minimizing absolute errors has become a common alternative with the spread of spreadsheet capabilities. In this course, we use the least squares loss function.
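Rachel's two-step procedure is easy to check numerically. The sketch below (in Python, with made-up data values that are not from the textbook) centers the data, fits the slope through the origin, and then recovers the intercept; it gives the same answer as fitting Y = a + bX directly.

```python
# Two-step least squares fit: (i) subtract the means, (ii) fit a line
# through the origin on the centered data, (iii) shift back to get the intercept.
X = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical explanatory values
Y = [2.1, 3.9, 6.2, 8.1, 9.8]          # hypothetical response values
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n

# Deviations (the lower-case x and y of the study guide)
x = [xi - x_bar for xi in X]
y = [yi - y_bar for yi in Y]

# Slope of the best-fit line through the origin for the centered data
b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# The intercept follows from forcing the fitted line through the point of means
a = y_bar - b * x_bar

print(f"b = {b:.4f}, a = {a:.4f}")     # same answer as fitting Y = a + bX directly
```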

Section 1.2 on pages 7-10 is the essence of regression analysis. Equation 1.6 on page 9 in the shaded type is the core equation for this course. In practice, much of regression analysis is finding the value of b, which we refer to in this course as beta-hat, the ordinary least squares estimator. Once we know b, we solve for a (referred to as alpha-hat) from equation 1.3 on page 8.

You are not tested on the derivation of the regression coefficients a and b; the final exam is multiple choice and does not test derivations. But it is worth your while to review the mathematics. We solve for b repeatedly, and it helps to know what this formula does.

Read Example 1.1 on pages 10 and 11. You must work similar problems for both the homework assignments and the final exam. One of the continuing illustrations in this course regresses scores on Course C on the hours each candidate spends studying. We show several variations of this problem, corresponding to the topics in each module.

The equations for the regression coefficients can be written in terms of X and Y or in terms of x and y. The lower case variables are the deviations, or the upper-case variables minus the means: x = X - X̄ and y = Y - Ȳ. Some texts use the formulas in terms of X and Y. You should know those formulas, but we generally use the formulas in terms of the deviations.

Appendix 1.1 covers properties of sums. We use these properties in the rest of the course. Know especially Rule 5 and Rule 6 on page 15; these are used in many of the homework assignments and exam questions. Rules 5 and 6 allow us to convert between absolute numbers (X and Y) and deviations (x and y). On an exam problem, you may be given the sum of squares and the mean and asked to derive the sum of squared deviations; this is Rule 6, equation A1.11, on page 15.

Appendix 1.2 is optional. You will not be tested on the derivation of the ordinary least squares estimators alpha-hat and beta-hat, but you must know the formulas for them (A1.23 and A1.24). Come back to this appendix several times: after modules 3, 5, 7, and 22.

The homework assignment is similar to the grade point average example in Table 1.2 on page 11. Work through the example in the text, and then do the homework assignment. (The PDF attachment shows the Greek letters for the ordinary least squares estimators.)
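Rule 6 converts a sum of squares into a sum of squared deviations: the sum of squared deviations equals the sum of squares minus N times the squared mean. A minimal check, with an arbitrary illustrative sample:

```python
# Rule 6: sum of squared deviations = sum of squares - N * (mean squared)
X = [6.0, 8.0, 10.0, 12.0, 14.0]                       # arbitrary illustrative sample
n = len(X)
x_bar = sum(X) / n

direct   = sum((xi - x_bar) ** 2 for xi in X)          # direct sum of squared deviations
shortcut = sum(xi ** 2 for xi in X) - n * x_bar ** 2   # Rule 6 shortcut

print(direct, shortcut)                                # both print 40.0
```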

Time Series, Modules 1, 2: Statistics and Regressions Practice Problems

(The attached PDF file has better formatting.)

Updated: May 23, 2005

Jacob: To which module do these practice problems apply?

Rachel: Module 2 is statistical background; Module 1 is the basic regression equations; and Regression Analysis Modules 3 and 4 cover the initial concepts. If you already know regression analysis, this material is easy. If you have not had regression analysis, make sure you understand these problems, since the concepts are used in time series as well.

Exercise 1.2: Means

Given the sample below, what are the estimated means of X and Y?

i:   1   2   3   4   5
Xi:  8   9  10  11  12
Yi: 12  14  15  17  17

Solution 1.2:

X̄ = ΣXi / N = (8 + 9 + 10 + 11 + 12) / 5 = 10.0
Ȳ = ΣYi / N = (12 + 14 + 15 + 17 + 17) / 5 = 15.0

Jacob: This is obvious; the mean is the average. What is the point of this problem?

Rachel: The mean uses a divisor of N; the variance uses a divisor of N-1; see below.

Exercise 1.3: Deviations

Given the sample below, what are xi and yi, the deviations of Xi and Yi from the sample means?

Solution 1.3:

i     Xi   Yi   xi   yi
1      8   12   -2   -3
2      9   14   -1   -1
3     10   15    0    0
4     11   17    1    2
5     12   17    2    2
Mean  10   15    0    0

Jacob: Is the deviation the same as the residual?

Rachel: The deviation is the difference from the mean; the residual is the difference from the fitted value.

Exercise 1.4: Sample Variances

What are the sample variances of X and Y?

Solution 1.4:

i     Xi   Yi   xi   yi   xi²   yi²
1      8   12   -2   -3    4     9
2      9   14   -1   -1    1     1
3     10   15    0    0    0     0
4     11   17    1    2    1     4
5     12   17    2    2    4     4
Mean  10   15    0    0

s²(X) = Σxi² / (N-1) = [(-2)² + (-1)² + 0² + 1² + 2²] / 4 = 10 / 4 = 2.5
s²(Y) = Σyi² / (N-1) = [(-3)² + (-1)² + 0² + 2² + 2²] / 4 = 18 / 4 = 4.5

An alternative method is ΣXi² = 510, so s²(X) = (ΣXi² - N × X̄²) / (N-1) = (510 - 5 × 10²) / 4 = 2.5.

Jacob: Why is the divisor N-1?

Rachel: The divisor is the degrees of freedom. Separate postings explain why the degrees of freedom is N-1.

Exercise 1.5: Covariance

What is the estimated covariance between the two random variables?

Solution 1.5:

i     Xi   Yi   xi   yi   xi·yi
1      8   12   -2   -3     6
2      9   14   -1   -1     1
3     10   15    0    0     0
4     11   17    1    2     2
5     12   17    2    2     4

Σxi·yi = 6 + 1 + 0 + 2 + 4 = 13

The estimated covariance of (X, Y) is Σxi·yi / (N-1) = 13 / 4 = 3.25.

Exercise 1.6: Correlation

What is the correlation between the two random variables?

Solution 1.6:

r = covariance / (sX × sY) = 3.25 / (2.5 × 4.5)^½ = 0.969
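A quick numeric check of Exercises 1.4 through 1.6, using the five (X, Y) pairs from the tables above; a minimal sketch in Python:

```python
# Check of Exercises 1.4 - 1.6: sample variances, covariance, and correlation.
X = [8.0, 9.0, 10.0, 11.0, 12.0]
Y = [12.0, 14.0, 15.0, 17.0, 17.0]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
x = [xi - x_bar for xi in X]            # deviations of X
y = [yi - y_bar for yi in Y]            # deviations of Y

var_x  = sum(xi ** 2 for xi in x) / (n - 1)                    # 10 / 4 = 2.50
var_y  = sum(yi ** 2 for yi in y) / (n - 1)                    # 18 / 4 = 4.50
cov_xy = sum(xi * yi for xi, yi in zip(x, y)) / (n - 1)        # 13 / 4 = 3.25

corr = cov_xy / (var_x * var_y) ** 0.5                         # 3.25 / (2.5 * 4.5)^0.5 = 0.969
print(var_x, var_y, cov_xy, round(corr, 3))
```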

{Note: The practice problems above are explained in Module 2, though many actuarial candidates are familiar with these formulas. The practice problem below is for Modules 1, 3, and 4.}

Suppose we fit the two-variable regression model to these pairs of points. We assume now that Y is a random variable, but X is not a random variable.

Exercise 1.7: Ordinary Least Squares Estimates

A. What are the values of ΣX and ΣY?
B. What is the ordinary least squares estimate of beta (the slope coefficient)?
C. What is the ordinary least squares estimate of alpha (the intercept)?
D. Using the values of Σx² and Σxy computed earlier, what is the ordinary least squares estimate of beta using deviations?
E. What are the fitted values of Y?
F. What is the total sum of squares (TSS)?
G. What is the error sum of squares (ESS), or the residual variance?
H. What is the regression sum of squares (RSS), or the explained variance?
I. What is R², the proportion of the variance explained by the regression?
J. What is the variance of the ordinary least squares estimator of beta?
K. What is the variance of the ordinary least squares estimator of alpha?

Solution 1.7:

We calculate the sum of squares, the deviations, the sum of squared deviations, the sum of the cross terms, and the sum of the cross deviations.

i      Xi   Yi   Xi²    Yi²    Xi·Yi
1       8   12    64    144      96
2       9   14    81    196     126
3      10   15   100    225     150
4      11   17   121    289     187
5      12   17   144    289     204
Total  50   75   510  1,143     763
Mean   10   15

Part A: ΣX = 50.0 and ΣY = 75.0. The mean of X is 50 / 5 = 10; the mean of Y is 75 / 5 = 15. Also ΣX² = 510.0, ΣY² = 1,143.0, and ΣXY = 763.0.

Part B: beta-hat = (ΣXY - ΣX × ΣY / 5) / (ΣX² - (ΣX)² / 5) = (763 - 50 × 75 / 5) / (510 - 50² / 5) = 13 / 10 = 1.3

Part C: alpha-hat = Ȳ - beta-hat × X̄ = 75 / 5 - 1.3 × 50 / 5 = 15 - 13 = 2.0

Part D: Σx² = 10 and Σxy = 13, so beta-hat = 13 / 10 = 1.3

Parts E-H: The table below shows the calculations.

i      Xi   Yi   Fitted Yi   Residual   Squared Residual
1       8   12     12.4        -0.4          0.16
2       9   14     13.7         0.3          0.09
3      10   15     15.0         0.0          0.00
4      11   17     16.3         0.7          0.49
5      12   17     17.6        -0.6          0.36
Total  50   75     75.0         0.0          1.10
Mean   10   15     15.0         0.0

The fit is good: the fitted values are close to the actual values and the residuals are small. We determine the fitted values by the estimated regression equation: Y = 2.0 + 1.3 X. For the first row, the fitted value is 2.0 + 1.3 × 8 = 12.4.

The average residual is zero, as is true by construction. If we work out the ordinary least squares estimators correctly, the average residual is zero.

Jacob: Must the rows have different values for X?

Rachel: No; several rows may have the same value of X. Suppose we regress Course C scores on the hours of study, and we examine all 2,500 candidates who take the test in 20XX. If we estimate hours of study to the nearest 10 hours, we may have 50 candidates who study 400 hours. The scores for these candidates may vary, though they have the same fitted score.

Jacob: Is it possible to run a regression analysis if all data points have the same X value?

Rachel: If all X values are the same, the sum of squared deviations is zero, and the ordinary least squares estimator for beta is not defined.

Part F: We work out the total sum of squares two ways:

• The deviations of Y are the Y values minus the mean, or -3, -1, 0, 2, 2. The squares of the deviations are 9, 1, 0, 4, 4; the sum of these squares is 18.
• The sum of squared deviations is ΣY² - N × Ȳ² = 1,143 - 75² / 5 = 18.

Part G: The error sum of squares is shown in the table as 1.10.

Part H: The regression sum of squares is TSS - ESS = 18 - 1.10 = 16.90.

Part I: R² = RSS / TSS = 16.90 / 18 = 93.89%

Jacob: Can we derive R² as the square of the correlation between X and Y?

Rachel: Yes. The correlation is the covariance divided by the standard deviations of each variable. We worked out the needed figures above. The correlation is 13 / (10 × 18)^½ = 0.969; the square of the correlation is 0.9389.

Part J: For the variances of the ordinary least squares estimators, we must know the variance of the error term. We estimate the variance of the error term from the variance of the residuals, which is the error sum of squares divided by the degrees of freedom. We have five data points, so three degrees of freedom. The estimated variance of the error term is 1.10 / 3 = 0.3667.

For the variance of beta-hat, we divide the variance of the error term by the sum of squared deviations of the X variable: 0.3667 / 10 = 0.0367.

Part K: The variance of alpha-hat is 0.3667 / 5 = 0.0733.
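The whole of Exercise 1.7 can be verified with a few lines of code. The sketch below uses the same five data points; the variance of the slope estimator divides the estimated error variance by the sum of squared X deviations, as in Part J.

```python
# Sketch reproducing the Exercise 1.7 calculations end to end.
X = [8.0, 9.0, 10.0, 11.0, 12.0]
Y = [12.0, 14.0, 15.0, 17.0, 17.0]
n = len(X)

x_bar, y_bar = sum(X) / n, sum(Y) / n
x = [xi - x_bar for xi in X]
y = [yi - y_bar for yi in Y]

b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)   # 13 / 10 = 1.3
a = y_bar - b * x_bar                                                 # 15 - 1.3 * 10 = 2.0

fitted    = [a + b * xi for xi in X]
residuals = [yi - fi for yi, fi in zip(Y, fitted)]

tss = sum(yi ** 2 for yi in y)                     # total sum of squares: 18.0
ess = sum(e ** 2 for e in residuals)               # error sum of squares: 1.10
rss = tss - ess                                    # regression sum of squares: 16.90
r_squared = rss / tss                              # 0.9389

s2    = ess / (n - 2)                              # error variance, three degrees of freedom
var_b = s2 / sum(xi ** 2 for xi in x)              # variance of the slope estimator: 0.0367

print(b, a, round(ess, 2), round(r_squared, 4), round(var_b, 4))
```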

Regression Analysis, Module 1, Introduction to the Regression Model

(The attached PDF file has better formatting.)

Homework Assignment

Updated: May 23, 2005

Modules 3 and 4 repeat the information in Module 1 with more explanation. The textbook assumes you know the basic formulas of simple linear regression. The textbook focuses on the concepts and the intuition, not the formulas. You see this from the exercises at the end of each chapter; they review the concepts, not the equations.

Read this homework assignment after reading chapter 1; do the assignment after reading chapter 3. If you have not had regression analysis before, it takes a while to get used to the equations. After the first few weeks of this course, the equations come naturally.

Problem 1: We examine four items: the ordinary least squares estimators for beta and alpha, an in-range forecast, and an out-of-range forecast.

We are examining the effects of study on exam scores. For the eight candidates below, the table shows the number of hours studied and the score on Course C (Exam 4):

Candidate   Hours Studied   Exam Score
a
b
c
d
e
f
g
h

We fit a two-variable regression model (Y = A + B * X) to these observations, where X is hours studied and Y is the Course C exam score.

A. What is the ordinary least squares estimator for beta?
B. What is the ordinary least squares estimator for alpha?
C. How many hours of study are needed to get a 6 on the exam according to the regression equation? (Assume that scores are rounded to the nearest integer, so we solve for Y = 5.5, not Y = 6.)

D. If the candidate does not study, what is the predicted exam score from the regression equation?
E. (Optional) Explain why the regression equation should not be used to estimate the exam score with no hours of study. (Part E is not required for the homework.)

{Part E says that we cannot use the regression equation to make forecasts about outlying scenarios, since we don't know that the regression equation extends to those points.}

Regression Analysis and Time Series, Module 2, Statistical Processes: Required Reading

(The attached PDF file has better formatting.)

Updated: May 25, 2005

Module 2 is background that is the same for regression analysis and time series. If you are taking both courses, one homework assignment suffices for both courses. You learn this material in greater depth in Courses M and C (CAS Exams 3 and 4).

Some subjects proceed in linear sequence: you learn Fact A, then Fact B, then Fact C. For regression analysis and time series, you may not understand Fact A until you have learned Facts Y and Z. These courses are frustrating the first several modules, until you understand the general themes.

There is much material in Module 2. There is nothing that you must master now; you will understand the material as you see the statistical applications in later modules. If you have never had a statistics course, Module 2 is hard, since you can't understand the concepts with no context. This module summarizes the mathematics; as you learn the material in later modules, come back to Module 2 to review the mathematics.

VARIANCES

Read sections 2.1 and 2.2 on pages 19-28; know the equations in these sections. These are background knowledge: equations for the mean, variance, covariance, and correlation of a sample. Know especially equation 2.5 (correlations and covariances). We use this relation also in the corporate finance course for the CAPM beta. (The CAPM beta is the slope parameter of the regression equation where Y is the individual stock return and X is the market return.)

The sample variance and covariance use N-1 as the denominator, not N. For the error variance from a regression equation, we use N-k, where k is the number of explanatory variables (the independent variables plus the constant term).

Jacob: If the population variance uses N as the divisor and the sample variance uses N-1, why do we use the term variance for both? This just confuses the matter.

Rachel: The sample variance is an unbiased estimator of the population variance. By sample variance, we mean the estimate of the population variance using sample data. Example 2.1 on page 23 gives the probability distribution for the population, so we use N, not N-1, to derive the variances and the covariance. In the section on pages 24-27, we use samples, so we use N-1, not N.
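The note above about the CAPM beta can be made concrete: the beta is the slope of a regression of the stock's returns on the market's returns, which equals their sample covariance divided by the sample variance of the market return. A sketch with made-up return figures (the numbers are purely illustrative):

```python
# The CAPM beta as the slope of a regression of stock returns on market returns:
# beta = sample covariance(stock, market) / sample variance(market).
market = [0.02, -0.01, 0.03, 0.01, -0.02, 0.04]   # hypothetical market returns
stock  = [0.03, -0.02, 0.04, 0.02, -0.03, 0.05]   # hypothetical stock returns
n = len(market)

m_bar = sum(market) / n
s_bar = sum(stock) / n

cov_ms = sum((m - m_bar) * (s - s_bar) for m, s in zip(market, stock)) / (n - 1)
var_m  = sum((m - m_bar) ** 2 for m in market) / (n - 1)

beta = cov_ms / var_m        # the slope coefficient of the regression
print(round(beta, 3))
```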

Several of the modules have intuition postings, often in a question and answer format with numerical illustrations. Work through the numerical illustrations to make sure you follow the reasoning. It takes a while to grasp why the sample variance is an unbiased estimate of the population variance.

DEGREES OF FREEDOM

The degrees of freedom is essential for statistical analysis.

Jacob: What does degrees of freedom signify? The textbook uses this term but does not give a clear definition.

Rachel: Suppose we take N deviations from a population whose mean is known, and we want to determine the sum of squared deviations. {Definition: the deviation is the data point minus the mean.} For example, if the mean is 4, and we take 3 deviations of -3, +1, and +4, the sum of squared deviations is 9 + 1 + 16 = 26. We need all three deviations to determine the sum of squared deviations.

Now suppose we take a sample of N deviations from a population whose mean is not known, and we want to determine the sum of squared deviations. We estimate the population mean from the sample of N points. This implies that the sum of the deviations (not the squared deviations) is zero. For example, if we take 3 deviations, of which the first two are -3 and +1, the third deviation must be +2, and the sum of squared deviations is 9 + 1 + 4 = 14. We need only two deviations to determine the sum of squared deviations.

We restate this in statistical language. When we take three deviations from a population with a known mean, we have the freedom to change any of the three points. No constraint limits the relation among the three points. When we take a sample of three deviations from a population with an unknown mean, we use the sample to estimate the population mean. Once we know two of the deviations, the third deviation is constrained by the relation that the sum of the deviations is zero. We have the freedom to change only two of the three points; once two of the three points are known, the third is determined.

Jacob: The terms sample and population confuse me. What is important is whether the mean is known or unknown.

Rachel: What is important is whether the points being used for the sum of squared deviations are also used to determine the mean. If they are, they are constrained by a totality constraint, and the degrees of freedom is reduced by one.

Note: Mahler's Guide to Regression Analysis discusses this topic in more detail.
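Rachel's second example is easy to reproduce. The sketch below uses a hypothetical three-point sample whose deviations from the sample mean are -3, +1, and +2, matching the example: the deviations sum to zero, so only two of them are free.

```python
# When the mean is estimated from the sample, the deviations are forced to sum
# to zero, so only N-1 of them are free to vary.
sample = [1.0, 5.0, 6.0]                 # hypothetical sample of N = 3 points

mean = sum(sample) / len(sample)         # 4.0
deviations = [s - mean for s in sample]  # -3, +1, +2, matching Rachel's example

print(sum(deviations))                   # 0.0: the third deviation is determined by the first two
print(sum(d ** 2 for d in deviations))   # 14.0: the sum of squared deviations from the example
```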

The central limit theorem in the sub-section on page 28 is used throughout actuarial science. The regression analysis and time series final exams do not test this theorem, but you are assumed to know the central limit theorem to understand certain results.

Jacob: Where do we use the central limit theorem in regression analysis?

Rachel: For statistical testing, we assume the error term has a normal distribution. We assume this (we don't prove this), and it is not true in many situations. Under certain conditions, the central limit theorem says this assumption is true in the limit, as the number of stochastic factors increases.

Read section 2.3; understand the meaning of the four properties of estimators:

• 2.3.1: Bias
• 2.3.2: Efficiency
• 2.3.3: Mean squared error
• 2.3.4: Consistency

We deal most with bias and efficiency in this course. Consistency is an important attribute for large samples, but it is not discussed much in this course. The textbook notes the relation of bias, efficiency, and mean squared error. The relation is used in many actuarial applications, but the final exam does not test this relation. (Note: This relation may be tested on the CAS transition exam; see Mahler's Guide to Regression Analysis.)

• Bias: An estimator is unbiased if its expected value is the value we seek.
• Efficiency: One estimator is more efficient than another if it has a smaller variance; see page 29.

Jacob: Can you give an example of an unbiased estimator vs a biased estimator?

Rachel: Suppose we want to determine the mean and variance of a population. We have no prior knowledge of the mean and variance, so we take a sample of 10 points, {X1, X2, ..., X10}.

For the mean, we use ΣXj / 10 as our estimate. This estimate is unbiased. If the mean of the population is 4, the sample average may be more than or less than 4, but the expected value of the sample average is 4.

For the variance, we start with the sum of squared deviations. The deviation is the data point minus the mean. Since we don't know the mean, we use the sample average, which is an unbiased estimate of the population mean. In this illustration, the population mean is 4. But we don't know the population mean, so we use the sample average, which might be 3.5 or 5.2 or some other number.

Suppose we divide the sum of squared deviations by 10, the number of points. If we have a population of exactly these ten points, the proper divisor is 10, and the sum of squared deviations divided by 10 is the variance.

But if we don't know the population mean, and we use the sample average as a proxy, dividing by 10 under-estimates the variance. Dividing by N is a biased estimator. In this case, we have an unbiased estimator as well: dividing by N-1.

The proof that dividing by N-1 gives an unbiased estimator is Result 9 in Appendix 2.1. The proof is not required for this course, though you must know the fact.

Jacob: Do we always divide by N-1 to get an unbiased estimator of the variance?

Rachel: We divide by the degrees of freedom. For the variance of the error terms of a two-variable regression model, we divide by N-2. We deal with this topic in depth when we discuss the F statistic and the adjusted R².

Jacob: Would we ever use biased estimators in practice?

Rachel: In this example, we have an unbiased estimator. An unbiased estimator does not always exist.

Jacob: If we have two estimators, of which one is biased and one is unbiased, would we ever use the biased estimator?

Rachel: In many actuarial pricing scenarios, we have two estimators: one is less biased, and the other is more efficient.

Illustration: Suppose we are setting Homeowners rates for Iowa, and we must estimate the average severity of fire losses. We are making rates for 20X7, and we have Homeowners experience for 20X1 through 20X5. We have two estimators of average fire severity:

A. The average observed fire loss in Iowa from 20X1 through 20X5.
B. The average observed countrywide fire loss from 20X1 through 20X5.

We adjust the figures for inflation (loss cost trends) and other known or expected changes.

Estimator A is unbiased (if we have properly adjusted for inflation and other changes). The observed fire losses in Iowa are a sample of possible fire losses in Iowa, and the sample average is an unbiased estimate of the mean. But fire losses have a high variance, and Iowa is a small state. A few large losses or the absence of large losses may distort the observed average claim severity.

Estimator B is biased. Homes in other states are different from homes in Iowa, in size, construction, and the fire protection facilities in their towns. (Fire protection facilities are fire departments and fire hydrants.) We may not know if the average fire loss in other states is higher or lower than the average loss in Iowa; that is, we may not know if the estimator is biased up or down. But it would be a coincidence if the average fire loss in the country as a whole were the same size as the average fire loss in Iowa.

Estimator B is more efficient. The number of fire losses in the country as a whole may be 100 times the number in Iowa. Random loss fluctuations distort the countrywide average much less than the Iowa average.

Jacob: Isn't it always better to have an unbiased estimator with a large variance than a biased estimator with a small variance?

Rachel: Statisticians tend to use unbiased estimators, and to choose the estimator with the least variance. This is the perspective in the textbook readings, though the authors point out that bias is not always the most important criterion. Actuaries, who often deal with unbiased but highly inefficient estimators, are sensitive to the distorting effects of variance.

Illustration: Suppose the expected fire loss by state ranges from $35,000 to $40,000. Fire losses are volatile, and the average observed loss in any five year period in a small state may be $15,000 lower or $30,000 higher than its mean. Iowa's average loss is unbiased, but the high variance makes it an unstable estimator: some years the indicated rate may be twice as high as needed, and some years it may be only 50% of the adequate rate. The countrywide average fire loss may be biased up or down by 5%, but it is a stable estimator. For pricing Homeowners insurance in Iowa, we may prefer to use the countrywide average fire loss, which is slightly biased but more stable.

Mean Squared Error: Minimizing mean squared error puts together bias and efficiency. A biased estimator has a higher mean squared error than an unbiased estimator (if they have the same efficiency), and a more efficient estimator has a lower mean squared error than a less efficient estimator (if they have the same bias). The mean squared error is the variance plus the square of the bias. The proof is not required for this course. We show an illustration to make the concepts clear.

Illustration: Suppose we have several estimators of home prices.

• A is biased upward by $1,000, but it has no variance.
• B is unbiased, but it is always $1,000 too high or low, with 50% chance of each.
• C is biased upward by $1,000, plus or minus $1,000 with 50% chance of each (= A + B).

We show the mean squared error of each estimator:

• Estimator A has a mean squared error of 1,000² = 1,000,000.
• Estimator B has a mean squared error of ½ × (1,000² + 1,000²) = 1,000,000.
• Estimator C has a mean squared error of ½ × (2,000² + 0²) = 2,000,000.
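A short sketch that reproduces the mean squared errors (and the average absolute errors discussed next) of estimators A, B, and C, and confirms that the mean squared error equals the variance plus the square of the bias:

```python
# Mean squared error of the three home-price estimators in the illustration.
# Each estimator is described by its error distribution: (error, probability) pairs.
estimators = {
    "A (bias 1,000, no variance)": [(1000.0, 1.0)],
    "B (unbiased, +/- 1,000)":     [(1000.0, 0.5), (-1000.0, 0.5)],
    "C (bias 1,000, +/- 1,000)":   [(2000.0, 0.5), (0.0, 0.5)],
}

for name, dist in estimators.items():
    bias     = sum(e * p for e, p in dist)
    mse      = sum(e ** 2 * p for e, p in dist)
    variance = mse - bias ** 2                  # mean squared error = variance + bias^2
    mae      = sum(abs(e) * p for e, p in dist) # average absolute error
    print(name, bias, variance, mse, mae)
```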

Jacob: If we use average absolute error, are the results similar?

Rachel: All three estimators have an average absolute error of 1,000.

Jacob: Which makes more sense: average absolute error or mean squared error?

Rachel: If one error of $2,000 is twice as bad as two errors of $1,000 each, mean squared error is better. If one error of $2,000 is no better or worse than two errors of $1,000 each, average absolute error is better.

Mean squared error is a common test for optimal credibility. Howard Mahler, who wrote one of the credibility readings on Course C and the credibility reading on CAS Exam 9, has developed tools for judging the mean squared error of experience rating credibility. The textbook does not use this formula in the modules for the regression analysis or time series courses, and the final exam does not test this formula. (Note: The CAS transition exam may test this formula; Mahler's Guide to Regression Analysis has practice problems.)

Consistency: A consistent estimator is close to the true value if the sample is large enough. Suppose we estimate the standard deviation of a population; the true standard deviation is σ, and the estimate is s. This estimate is consistent if s becomes close to σ as the sample size grows.

Jacob: I presume (i) only unbiased estimators are consistent and (ii) as the sample size grows, all unbiased estimators are consistent.

Rachel: Neither statement is correct, as the illustration shows.

Illustration: Suppose we want the standard deviation of home prices in a population. Two appraisal firms work in the town. Firm A uses N as the divisor instead of N-1. Firm B uses N-1 as the divisor, but it gives the result to the nearest $10,000.

• Firm A's estimate is biased; it is always too small by a factor of [(N-1)/N]^½. As the sample size grows, this factor becomes close to one, and the bias becomes immaterial.
• Firm B's estimate is unbiased. But the estimate may never get close to the true value. If the true standard deviation is $66,000, Firm B's estimate will be $70,000 for an infinitely large sample size. (In truth, Firm B's estimate may not be unbiased, depending on the distribution of prices.)

This course emphasizes bias, not consistency. A branch of regression analysis deals with large samples, for which consistency is more important than bias.

Jacob: What should we know about these four attributes for the final exam?

Rachel: Know the definitions, and know the examples in the postings and the textbook. Most important, know that bias, efficiency, and consistency are different attributes. Estimator A may be more or less biased or efficient than Estimator B, and either of these estimators may be consistent or not consistent.

Read section 2.4. From section 2.4, you must know the normal distribution sub-section. You are not tested on the equation of the normal distribution, but you must know its properties. The homework assignment for maximum likelihood (Module 21) uses the equation for the normal distribution.

You must know this distribution for Course C, so it does not hurt to learn the equation now.

Jacob: What are the properties of the normal distribution that we must know?

Rachel: The range of the normal distribution is from minus infinity to plus infinity, and the distribution is symmetric about its mean. The distribution is bell-shaped: its value is highest at the mean (its center) and becomes lower the further one moves from the mean.

Know how to use the table for the cumulative normal distribution to test hypotheses. Given a cumulative normal distribution table, an ordinary least squares estimator, the variance of the estimator, and a significance level, you should be able to test hypotheses.

Jacob: Should we know the commonly used values of the normal distribution, such as 1.645 for a 90% confidence interval and 1.96 for a 95% confidence interval?

Rachel: Any final exam problem that uses a significance level gives you the value.

Jacob: So we don't need to practice with these values?

Rachel: You should definitely practice. A common mistake is to confuse a one-tailed test with a two-tailed test. Another common mistake is to assume a null hypothesis of zero when it should be something else. A little practice with the procedures avoids these errors.
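A sketch of the kind of test described above: given an ordinary least squares estimate, its variance, and a significance level, compare the standardized statistic with the critical value from the normal table. The numbers below are hypothetical, and the variance is treated as known, so the normal (z) table applies rather than the t table.

```python
from math import sqrt

# Hypothetical two-tailed test of H0: beta = 0 at the 5% significance level,
# using a critical value that would be read from the cumulative normal table.
beta_hat   = 1.3     # ordinary least squares estimate (made-up number)
var_beta   = 0.36    # variance of the estimator, treated as known (made-up number)
null_value = 0.0

z = (beta_hat - null_value) / sqrt(var_beta)    # standardized test statistic
critical_z = 1.96                               # two-tailed 5% critical value from the table

print(round(z, 3), abs(z) > critical_z)         # 2.167 and True: reject H0 at the 5% level
```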

You must know how to use the chi-squared, t, and F distributions to test hypotheses. The chi-squared distribution is needed to prove certain theorems in regression analysis, but you will not be tested on the properties of the chi-squared distribution. You must know certain attributes of each distribution.

The t distribution has a thicker tail than the normal distribution; see the first full paragraph on page 36. This implies that the critical t values are greater than the critical z values. We deal with this when we discuss hypothesis testing; see the last paragraph on page 36 and see the comments below about hypothesis testing. Know the shape of the distributions in Figure 2.9; you need not know the equation.

For the F distribution, know the shape in Figure 2.10, as summarized by the last sentence in the first paragraph on page 37: the F distribution has a skewed shape and ranges in value from 0 to infinity. Also important are the two parameters of the F distribution: the first is associated with the number of estimated parameters, and the second is associated with the number of degrees of freedom (middle of first paragraph on page 37).

Result 13 summarizes the F distribution mathematically. You will not be tested directly on these formulas. But you must know how to use the F distribution to test hypotheses, which uses Result 13. Don't try to learn Result 13 in abstraction. Learn the applications of the F test in the two modules which discuss it. As you study those modules, you can refer back to Chapter 2 of the textbook to see the mathematical under-pinning.

CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

Read section 2.5. Know the definition of the t statistic on page 40 and the difference between a Z statistic and a t statistic. The Z statistic is used when we know the variance of the random variable; the t statistic is used when we estimate the variance from the observed data; see the paragraph beginning "We have assumed" on page 40.

The textbook shows results assuming first that we know the variance of the population, using the Z statistic. It then shows the corresponding result if we don't know the variance, using the t statistic. In practice, we rarely know the variance. For large samples, the t statistic gives about the same result as the Z statistic.

The 95% confidence interval is the interval (a, b) such that 2.5% of the probability lies in (-infinity, a) and 2.5% lies in (b, +infinity). For the normal distribution, the confidence intervals are symmetric and the shortest possible confidence intervals. This is not true for other distributions, which may be skewed, such as a lognormal, gamma, or Pareto distribution. (The distributions used for insurance claim severity are skewed distributions.)

Page 41 discusses one-tailed tests and two-tailed tests. The authors don't discuss this topic much, since a 95% one-tailed test has the same t value as a 90% two-tailed test.

Jacob: Are we testing whether the hypothesized result is correct?

Rachel: We do not test if a particular value is correct. We test whether a null hypothesis could co-exist with the empirical data. We reject a null hypothesis if the probability that we observe the empirical experience is less than z%, given that the null hypothesis is true.

Later subsections deal with Type 1 and Type 2 errors.

Illustration: Suppose we measure a stock's CAPM beta as 1.250. If we had no data, we would assume the stock has a CAPM beta of 1.000.

• The null hypothesis is that the CAPM beta is 1.000.
• The alternative hypothesis is that the CAPM beta is not 1.000.

The statistical tests are done on the null hypothesis, not the alternative hypothesis. Four scenarios are possible:

1. The null hypothesis is true, and we do not discard it.
2. The null hypothesis is true, and we discard it.
3. The null hypothesis is false, and we do not discard it.
4. The null hypothesis is false, and we discard it.

Jacob: If the null hypothesis is true, why would we discard it?

Rachel: Suppose the null hypothesis is that the average height of North American men is less than 6 feet (about 1.82 meters). To test the hypothesis, we observe the heights of ten men walking along a city street. If the ten men have an average height of 6½ feet (about 2 meters), we would reject the hypothesis.

Jacob: Let me see if I understand this. If we observe ten members of a basketball team who are visiting the city, we might reject the null hypothesis even though it is true.

Rachel: That's close, but not quite correct. If a basketball team is walking down the street, the assumptions of the regression analysis do not hold, since the heights are correlated. The proper scenario is that the ten men are unrelated, but by happenstance they are all tall. This might happen, though its probability is small.

We use the words "discard" and "do not discard." We could replace these terms with discard = do not accept and do not discard = accept. If we have no other information, we presume the null hypothesis is true, so "do not discard" might be replaced by "accept." But statistical testing is like scientific testing. We cannot prove that a scientific hypothesis is true.

Illustration: Newtonian mechanics explains certain facts and not others. For centuries, it was the best explanation of physical events. But contradictory evidence eventually led us to replace Newtonian mechanics with quantum mechanics.

Similarly, we cannot prove that a regression coefficient is correct. The empirical data may suggest that the slope parameter is 1.250, but we do not prove that this hypothesis is true. Rather, we show that it is unlikely to get these empirical data if the slope parameter is actually 1.000.

Scenarios 1 and 4 are proper inferences. If the null hypothesis is true, we should not discard it, and if it is false, we should discard it. Scenarios 2 and 3 are errors. Scenario 2 is a Type 1 error, and Scenario 3 is a Type 2 error; see the section on page 42.

Type 2 errors are common. In many situations, they are almost inevitable, and they do not much concern us. Consider the illustration about the CAPM betas. Suppose the CAPM beta of the stock is actually 1.020, not 1.000. Since 1.020 is so close to 1.000, it is unlikely that we will reject the null hypothesis, even though it is false.

Hypothesis testing focuses on Type 1 errors. Perhaps the null hypothesis is true, but because of random fluctuations, we discard it. We avoid this by choosing a significance level like 5% or 1%, so there is only a small probability of discarding the null hypothesis when it is true.
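The CAPM beta illustration can be carried one step further: compute the probability, under the null hypothesis, of observing an estimate at least as far from 1.000 as the one we measured. The standard error below is a made-up figure for illustration; the calculation is what the following paragraphs call a p-value.

```python
from math import erf, sqrt

def standard_normal_cdf(z: float) -> float:
    """Cumulative standard normal distribution."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# CAPM beta illustration: H0 is beta = 1.000, the measured beta is 1.250.
beta_hat  = 1.250
null_beta = 1.000
std_error = 0.110        # hypothetical standard error of the estimate

z = (beta_hat - null_beta) / std_error
prob_more_extreme = 2.0 * (1.0 - standard_normal_cdf(abs(z)))   # two-tailed

print(round(z, 3), round(prob_more_extreme, 4))
# If this probability is below the chosen significance level (say 5%), we discard H0;
# discarding a true H0 this way is a Type 1 error, which the significance level bounds.
```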

The textbook also discusses p-values, a better way of stating the conclusion of hypothesis testing. Suppose we reject the null hypothesis at a 10% or 5% significance level, but not at a 2% or 1% significance level. This leaves us wondering: what about at a 3% or 4% significance level? A p-value tells us the exact cut-off, such as 2.2% or 4.7%. A p-value of 2.2% makes us more confident that the null hypothesis is false than a p-value of 4.7%.

Section 2.5.3, on pages 43-45, shows the inter-relation of sample size and hypothesis testing; see Example 2.4. A larger sample makes a Type 1 error less common.

Jacob: Does a larger sample size change the expected value of the regression coefficient?

Rachel: No; the regression coefficient is unbiased regardless of the sample size. A larger sample reduces the variance of the ordinary least squares estimator of the coefficient.

Jacob: Suppose that with a sample of 100, the result is significant at a 10% level. With a sample of 400, do we expect the result to be significant at a 2.5% level or a 5% level?

Rachel: The question is incomplete. With a sample of 100, random fluctuations are more likely than with a sample of 400, so we can't compare expected significance. Both the number of observations and the size of the result affect the significance.

Jacob: Let's change the question: If we have the same ordinary least squares estimator with a sample of 400, what is the expected significance of the result?

Rachel: There is no simple relation; we must look up the answer in a cumulative normal distribution table. A larger sample size makes the confidence interval narrower. The final exam may give a similar scenario as the exercise in the textbook, using pass ratios of male and female candidates on Course C.

Section 2.6 is not tested on the final exam, and it has no homework assignments. The material is useful for actuaries, who must present actuarial results to company officers. Histograms and other graphics are useful, since they help non-actuaries interpret our results. But it is hard to test histograms on a final exam.

Appendix 2.1 is simple. We assume you know this material. It is not tested per se on the final exam, and it is not used in the homework assignments. But these results are used throughout the course. We use results 1 through 8 all the time; you are expected to know them on any actuarial exam you take. Result 9 says that the sample variance is an unbiased estimate of the population variance. You are expected to know this, though you do not have to know the proof.

Appendix 2.2 deals with maximum likelihood estimation. You learn this in Module 21. It is not worth reading now; read it six weeks from now.

This module has the most pages of any module in this course, but it is background. Most items that you must know are repeated in later modules.

Regression Analysis and Time Series, Module 2: Statistical Properties

Intuition: Population Variance and Sample Variance

(The attached PDF file has better formatting.)

Updated: May 25, 2005

Throughout this course, we use sample variances as estimates of population variances. The population variance uses a divisor of N, the number of equally likely scenarios; the sample variance uses a divisor of N-1, or the number of data points minus one.

The sample variance (s²) is an unbiased estimator of the population variance (σ²). Jacob and Rachel are discussing the relation of the sample variance to the population variance.

Jacob: Suppose we have a sample of two points: -1 and +1. The mean is zero, and the variation is (-1 - 0)² + (+1 - 0)² = 2. The variation, or total sum of squares (TSS), is the sum of the squares of the deviations of each point from the mean. The sample variance is TSS / (N - 1) = 2 / (2 - 1) = 2.

The simplest hypothesis is that these two points come from a population of two points, -1 and +1, each with a 50% chance of occurring. The population variance of this distribution is TSS / N = 2 / 2 = 1. How can we say that the sample variance is an unbiased estimator of the population variance?

Rachel: If the population has a distribution of two values, -1 and +1, each with a 50% chance of occurring, there are four equally likely samples: (-1, +1), (+1, -1), (-1, -1), and (+1, +1). We compute the sample variances of each sample as well as the (incorrect) variances using N as the denominator instead of N-1.

Pt A   Pt B   Mean   Deviations   Total Variation   Sample Variance (÷ N-1)   Variance (÷ N)
 -1     +1      0     -1, +1            2                    2                      1
 +1     -1      0     +1, -1            2                    2                      1
 -1     -1     -1      0,  0            0                    0                      0
 +1     +1     +1      0,  0            0                    0                      0
Average                                                      1                      ½

The true variance of the distribution is 1, not ½. By using the sample variance of the four samples, the estimated population variance is 1, not ½.

Jacob: Can you show this for other distributions?

Rachel: When the distributions have more points, it is harder to give illustrations. We can prove this result, though the proof is not required for the regression analysis course. (The proof is in the textbook.) Let us look at a distribution with three points.

Suppose a distribution has three possible values, -3, 0, +3, each with a probability of one third. The mean of the distribution is zero, and the population variance of the distribution is {(-3 - 0)² + (0 - 0)² + (3 - 0)²} / 3 = 18 / 3 = 6.

If we draw a sample of three points and they are (-3, 0, +3), the sample variance is (9 + 0 + 9) / 2 = 18 / 2 = 9. But there are 27 equally likely samples of three values.

(Table: the 27 possible samples, with columns for the points A, B, C; the mean; the deviations of A, B, C; the total variation; the sample variance; the variance using N as the divisor; and the averages of each column.)

The table shows the 27 possible samples, the means, the deviations, the total variation, the sample variance, the incorrect sample variance using N as the denominator instead of N-1, and the averages. The sample variance is an unbiased estimator of the population variance.

{Recommendation: If you are not comfortable with sample and population variances, redo these examples with other numbers. Later in the course, you should redo the example using a regression equation with simple data points. Simple illustrations help you master the theory of this course.}
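The 27-sample table is easy to verify by enumeration. A short sketch that lists every equally likely sample of three draws from {-3, 0, +3} and averages the two variance estimates:

```python
from itertools import product

# Enumerate all 27 equally likely samples of three draws from {-3, 0, +3}
# and average the two variance estimates, as in the table described above.
values = [-3.0, 0.0, 3.0]

sample_variances = []       # divisor N - 1 = 2
n_divisor_variances = []    # divisor N = 3 (the "incorrect" column)

for sample in product(values, repeat=3):
    mean = sum(sample) / 3
    total_variation = sum((s - mean) ** 2 for s in sample)
    sample_variances.append(total_variation / 2)
    n_divisor_variances.append(total_variation / 3)

print(sum(sample_variances) / 27)       # 6.0: equals the population variance, so unbiased
print(sum(n_divisor_variances) / 27)    # 4.0: dividing by N is biased low
```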

Regression Analysis and Time Series, Module 2: Means and Variances

(The attached PDF file has better formatting.)

Practice Problems

Updated: December 11, 2006

Exercise 2.1: Means

Given a sample of four (X, Y) pairs (the data table is in the attached PDF), what are the estimated means of X and Y?

Solution 2.1:

X̄ = ΣXi / N = (sum of the four X values) / 4 = 13.5
Ȳ = ΣYi / N = (sum of the four Y values) / 4 = 2.05

Exercise 2.2: Deviations

What are xi and yi, the deviations of Xi and Yi?

Solution 2.2:

The deviations are each observation minus the corresponding sample mean: xi = Xi - X̄ and yi = Yi - Ȳ (see Table 1.1 in the attached PDF for the values).

Exercise 2.3: Sample Variances

What are the sample variances of X and Y?

Solution 2.3:

s²(X) = Σxi² / (N-1) = [(-0.5)² + ... + (-0.1)²] / 3
s²(Y) = Σyi² / (N-1) = [(-0.15)² + ... + (-0.05)²] / 3

An alternative method is s²(X) = (ΣXi² - N × X̄²) / 3. The same alternative formula can be used for the variance of Y.

Jacob: Are the two formulas equivalent?

Rachel: The deviation xi is Xi - X̄. The square of the deviation is Xi² - 2·Xi·X̄ + X̄². We do this for each of the N values and sum them to get ΣXi² - 2·X̄·ΣXi + N·X̄². Since ΣXi = N·X̄, this expression simplifies to ΣXi² - N·X̄². We divide by N-1, since this is a sample.

Jacob: Which method is easier to use?

Rachel: If the problem gives you the mean and the sum of the squares, use the alternative method.

Exercise 2.4: Covariance

What is the estimated covariance between the two random variables?

Solution 2.4:

Σxi·yi = 0.16. The estimated covariance of (X, Y) is Σxi·yi / (N-1) = 0.16 / 3 = 0.0533.

Exercise 2.5: Correlation

What is the correlation between the two random variables?

Solution 2.5:

r = Σxi·yi / (Σxi² × Σyi²)^½ = 0.16 / (Σxi² × Σyi²)^½, where Σxi² and Σyi² are the sums of squared deviations from Solution 2.3.

Regression Analysis and Time Series, Module 2: Means and Variances

(The attached PDF file has better formatting.)

Homework Assignment

Updated: May 25, 2005

We use the sample of (X, Y) pairs shown in the table in the attached PDF.

A. What are the estimated means of X and Y?
B. What are xi and yi, the deviations of Xi and Yi?
C. What are the sample variances of X and Y?
D. What is the estimated covariance between the two random variables?
E. What is the correlation between the two random variables?

{The homework assignment reviews the material in the practice problems.}

Time Series, Module 3, Simple Extrapolation Models: Required Reading

(The attached PDF file has better formatting.)

Updated: May 27, 2005

Read section 15.1; this is an introduction. Focus on the difference between deterministic and stochastic models. The authors say at the bottom of page 467: "These models are deterministic in that no reference is made to the sources or nature of the underlying randomness."

Jacob: This definition seems convoluted. Why not say that a deterministic model gives a point estimate and a stochastic model gives a range of values?

Rachel: Stochastic models also give point estimates, as the expected values of the ranges. Deterministic models also give ranges, as distributions about a point estimate.

Jacob: What do the authors mean by the sources or nature of the underlying randomness?

Rachel: Suppose auto insurance average claim severity is $10,000 in 20X0. We contrast a deterministic exponential trend model with a stochastic autoregressive model to predict future average claim severities.

The deterministic exponential trend model says that the expected average claim severity is $10,000 × e^(0.08t), where t is the number of years after 20X0 and 8% is the continuously compounded loss cost trend. For example, the expected average claim severity for 20X7 is $10,000 × e^(0.08 × 7) = $17,507. This is the best estimate of the average claim severity. The actual average claim severity will not be exactly this amount, but the trend model does not suggest a probability distribution for the claim severity.

Jacob: If we add a probability distribution about the estimate, such as a normal distribution with a standard deviation of $1,000, is the trend model stochastic?

Rachel: This distribution is ad hoc; it does not explain the sources or nature of the uncertainty (randomness). Contrast a stochastic autoregressive model, where Y(t+1) = e^0.08 × Y(t) + ε(t+1). If we know the distribution of the error term ε, we can derive the probability distribution of Y(t+k) for any projection distance k. (A short numerical sketch contrasting the two models follows the list below.)

We vary the form of the model in several ways, each with its own explanation of the source of the uncertainty in the projection:

• moving average vs autoregressive models (MA vs AR)
• one period lags vs higher order lags
• stationary vs homogeneous non-stationary models
• combined ARMA or ARIMA models
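A short numerical sketch of the contrast Rachel describes, assuming (for illustration only) that the autoregressive model's error term is normal with a standard deviation of $300:

```python
import random
from math import exp

# Deterministic exponential trend versus a stochastic autoregressive projection
# of average claim severity. Severity is $10,000 in 20X0 with an 8% continuously
# compounded trend; the normal error term (standard deviation $300) is an assumption.
base   = 10_000.0
growth = 0.08

# Deterministic exponential trend: one point estimate per year, 20X0 through 20X7.
deterministic = [base * exp(growth * t) for t in range(8)]

# Stochastic autoregressive model: Y(t+1) = e^0.08 * Y(t) + error(t+1).
def simulate_path(periods, sigma=300.0):
    path = [base]
    for _ in range(periods):
        path.append(exp(growth) * path[-1] + random.gauss(0.0, sigma))
    return path

random.seed(1)
final_values = [simulate_path(7)[-1] for _ in range(1000)]

print(round(deterministic[-1]))                            # about 17,507: the 20X7 point estimate
print(round(sum(final_values) / len(final_values)))        # simulated mean, close to the point estimate
print(round(min(final_values)), round(max(final_values)))  # the stochastic model also yields a range
```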

Later modules discuss the sources of stochasticity for moving average vs autoregressive models of different lags and for stationary vs homogeneous non-stationary models.

Read the section that covers the deterministic models. Know equations 15.2 (linear trend), 15.4 and its extension into 15.7 (exponential growth), 15.8 (autoregressive trend), and 15.9 (logarithmic autoregressive trend); these models will be tested on the final exam.

Jacob: Is the autoregressive trend model stochastic, as you describe above?

Rachel: An autoregressive model can be stochastic or deterministic. The authors say "autoregressive trend" for the deterministic model and AR(p) for the stochastic model. The deterministic model has no error term. An actuary using a deterministic model knows that there is uncertainty, but the model does not quantify it.

Skip the material on quadratic trend (15.10), logistic curve (15.11), and sales saturation model (15.12 and 15.13). These models are not used by actuaries. In contrast, the first four deterministic models are commonly used by actuaries to price insurance products.

{Note: All the deterministic models, including quadratic trend, logistic curves, and sales saturation models are on the CAS transition exam. If you have questions about these models, post them on the discussion board, though we cannot promise that our faculty will have the time to answer these questions.}

Example 15.1 shows the use of the models. Readers of Part 4 are assumed to be familiar with the regression techniques in Parts 1 and 2. You will not be asked to do a regression on the time series final exam, but you must know how to evaluate a regression, using R², adjusted R², and the Durbin-Watson statistic. See the examples in the regression analysis modules.

Jacob: I took regression several years ago in college. We did not cover the adjusted R² or the Durbin-Watson statistic. What should I do?

Rachel: Review Regression Analysis, Module 4, for R² and adjusted R². Understand degrees of freedom; see especially the postings in Regression, Module 4. When we come to the Durbin-Watson statistic, we specify what you should know.

For the four models in Example 15.1, the R² shows the percentage of the variance explained by the regression equation.

• The autoregressive trend models are better than the linear trend models; actuaries use autoregressive trend models.
• The logarithmic trend model is better than the linear trend model; actuaries use logarithmic trend models. For the autoregressive models, the R² is not materially different between the linear and logarithmic model.

The F statistic, which says whether we should reject the null hypothesis that there is no trend, is significant for all four models. A higher F statistic means the result is more significant (i.e., a lower p-value).


More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Do not copy, post, or distribute

Do not copy, post, or distribute 14 CORRELATION ANALYSIS AND LINEAR REGRESSION Assessing the Covariability of Two Quantitative Properties 14.0 LEARNING OBJECTIVES In this chapter, we discuss two related techniques for assessing a possible

More information

GRE Quantitative Reasoning Practice Questions

GRE Quantitative Reasoning Practice Questions GRE Quantitative Reasoning Practice Questions y O x 7. The figure above shows the graph of the function f in the xy-plane. What is the value of f (f( ))? A B C 0 D E Explanation Note that to find f (f(

More information

1 Random walks and data

1 Random walks and data Inference, Models and Simulation for Complex Systems CSCI 7-1 Lecture 7 15 September 11 Prof. Aaron Clauset 1 Random walks and data Supposeyou have some time-series data x 1,x,x 3,...,x T and you want

More information

Econometrics Part Three

Econometrics Part Three !1 I. Heteroskedasticity A. Definition 1. The variance of the error term is correlated with one of the explanatory variables 2. Example -- the variance of actual spending around the consumption line increases

More information

Plotting data is one method for selecting a probability distribution. The following

Plotting data is one method for selecting a probability distribution. The following Advanced Analytical Models: Over 800 Models and 300 Applications from the Basel II Accord to Wall Street and Beyond By Johnathan Mun Copyright 008 by Johnathan Mun APPENDIX C Understanding and Choosing

More information

EC4051 Project and Introductory Econometrics

EC4051 Project and Introductory Econometrics EC4051 Project and Introductory Econometrics Dudley Cooke Trinity College Dublin Dudley Cooke (Trinity College Dublin) Intro to Econometrics 1 / 23 Project Guidelines Each student is required to undertake

More information

Inference with Simple Regression

Inference with Simple Regression 1 Introduction Inference with Simple Regression Alan B. Gelder 06E:071, The University of Iowa 1 Moving to infinite means: In this course we have seen one-mean problems, twomean problems, and problems

More information

Statistics 251: Statistical Methods

Statistics 251: Statistical Methods Statistics 251: Statistical Methods 1-sample Hypothesis Tests Module 9 2018 Introduction We have learned about estimating parameters by point estimation and interval estimation (specifically confidence

More information

t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression

t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression t-test for b Copyright 2000 Tom Malloy. All rights reserved. Regression Recall, back some time ago, we used a descriptive statistic which allowed us to draw the best fit line through a scatter plot. We

More information

Quantitative Methods for Economics, Finance and Management (A86050 F86050)

Quantitative Methods for Economics, Finance and Management (A86050 F86050) Quantitative Methods for Economics, Finance and Management (A86050 F86050) Matteo Manera matteo.manera@unimib.it Marzio Galeotti marzio.galeotti@unimi.it 1 This material is taken and adapted from Guy Judge

More information

ECON 4230 Intermediate Econometric Theory Exam

ECON 4230 Intermediate Econometric Theory Exam ECON 4230 Intermediate Econometric Theory Exam Multiple Choice (20 pts). Circle the best answer. 1. The Classical assumption of mean zero errors is satisfied if the regression model a) is linear in the

More information

Econometrics Summary Algebraic and Statistical Preliminaries

Econometrics Summary Algebraic and Statistical Preliminaries Econometrics Summary Algebraic and Statistical Preliminaries Elasticity: The point elasticity of Y with respect to L is given by α = ( Y/ L)/(Y/L). The arc elasticity is given by ( Y/ L)/(Y/L), when L

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Statistics for Managers using Microsoft Excel 6 th Edition

Statistics for Managers using Microsoft Excel 6 th Edition Statistics for Managers using Microsoft Excel 6 th Edition Chapter 13 Simple Linear Regression 13-1 Learning Objectives In this chapter, you learn: How to use regression analysis to predict the value of

More information

Volatility. Gerald P. Dwyer. February Clemson University

Volatility. Gerald P. Dwyer. February Clemson University Volatility Gerald P. Dwyer Clemson University February 2016 Outline 1 Volatility Characteristics of Time Series Heteroskedasticity Simpler Estimation Strategies Exponentially Weighted Moving Average Use

More information

SOME BASICS OF TIME-SERIES ANALYSIS

SOME BASICS OF TIME-SERIES ANALYSIS SOME BASICS OF TIME-SERIES ANALYSIS John E. Floyd University of Toronto December 8, 26 An excellent place to learn about time series analysis is from Walter Enders textbook. For a basic understanding of

More information

Error Analysis in Experimental Physical Science Mini-Version

Error Analysis in Experimental Physical Science Mini-Version Error Analysis in Experimental Physical Science Mini-Version by David Harrison and Jason Harlow Last updated July 13, 2012 by Jason Harlow. Original version written by David M. Harrison, Department of

More information

Midterm 2 - Solutions

Midterm 2 - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis February 24, 2010 Instructor: John Parman Midterm 2 - Solutions You have until 10:20am to complete this exam. Please remember to put

More information

y response variable x 1, x 2,, x k -- a set of explanatory variables

y response variable x 1, x 2,, x k -- a set of explanatory variables 11. Multiple Regression and Correlation y response variable x 1, x 2,, x k -- a set of explanatory variables In this chapter, all variables are assumed to be quantitative. Chapters 12-14 show how to incorporate

More information

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n =

Hypothesis testing I. - In particular, we are talking about statistical hypotheses. [get everyone s finger length!] n = Hypothesis testing I I. What is hypothesis testing? [Note we re temporarily bouncing around in the book a lot! Things will settle down again in a week or so] - Exactly what it says. We develop a hypothesis,

More information

Algebra Year 10. Language

Algebra Year 10. Language Algebra Year 10 Introduction In Algebra we do Maths with numbers, but some of those numbers are not known. They are represented with letters, and called unknowns, variables or, most formally, literals.

More information

Probability Distributions

Probability Distributions CONDENSED LESSON 13.1 Probability Distributions In this lesson, you Sketch the graph of the probability distribution for a continuous random variable Find probabilities by finding or approximating areas

More information

Multiple Regression Analysis

Multiple Regression Analysis Multiple Regression Analysis y = β 0 + β 1 x 1 + β 2 x 2 +... β k x k + u 2. Inference 0 Assumptions of the Classical Linear Model (CLM)! So far, we know: 1. The mean and variance of the OLS estimators

More information

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor 1. The regression equation 2. Estimating the equation 3. Assumptions required for

More information

1 Least Squares Estimation - multiple regression.

1 Least Squares Estimation - multiple regression. Introduction to multiple regression. Fall 2010 1 Least Squares Estimation - multiple regression. Let y = {y 1,, y n } be a n 1 vector of dependent variable observations. Let β = {β 0, β 1 } be the 2 1

More information

Solutions to the Spring 2015 CAS Exam ST

Solutions to the Spring 2015 CAS Exam ST Solutions to the Spring 2015 CAS Exam ST (updated to include the CAS Final Answer Key of July 15) There were 25 questions in total, of equal value, on this 2.5 hour exam. There was a 10 minute reading

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

1 Measurement Uncertainties

1 Measurement Uncertainties 1 Measurement Uncertainties (Adapted stolen, really from work by Amin Jaziri) 1.1 Introduction No measurement can be perfectly certain. No measuring device is infinitely sensitive or infinitely precise.

More information

Simple Linear Regression

Simple Linear Regression Simple Linear Regression ST 430/514 Recall: A regression model describes how a dependent variable (or response) Y is affected, on average, by one or more independent variables (or factors, or covariates)

More information

Confidence intervals

Confidence intervals Confidence intervals We now want to take what we ve learned about sampling distributions and standard errors and construct confidence intervals. What are confidence intervals? Simply an interval for which

More information

Section 3: Simple Linear Regression

Section 3: Simple Linear Regression Section 3: Simple Linear Regression Carlos M. Carvalho The University of Texas at Austin McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Regression: General Introduction

More information

AP Final Review II Exploring Data (20% 30%)

AP Final Review II Exploring Data (20% 30%) AP Final Review II Exploring Data (20% 30%) Quantitative vs Categorical Variables Quantitative variables are numerical values for which arithmetic operations such as means make sense. It is usually a measure

More information

Correlation Analysis

Correlation Analysis Simple Regression Correlation Analysis Correlation analysis is used to measure strength of the association (linear relationship) between two variables Correlation is only concerned with strength of the

More information

Chapter 11 Sampling Distribution. Stat 115

Chapter 11 Sampling Distribution. Stat 115 Chapter 11 Sampling Distribution Stat 115 1 Definition 11.1 : Random Sample (finite population) Suppose we select n distinct elements from a population consisting of N elements, using a particular probability

More information

Business Statistics. Lecture 9: Simple Regression

Business Statistics. Lecture 9: Simple Regression Business Statistics Lecture 9: Simple Regression 1 On to Model Building! Up to now, class was about descriptive and inferential statistics Numerical and graphical summaries of data Confidence intervals

More information

Economics 308: Econometrics Professor Moody

Economics 308: Econometrics Professor Moody Economics 308: Econometrics Professor Moody References on reserve: Text Moody, Basic Econometrics with Stata (BES) Pindyck and Rubinfeld, Econometric Models and Economic Forecasts (PR) Wooldridge, Jeffrey

More information

ECON 497 Midterm Spring

ECON 497 Midterm Spring ECON 497 Midterm Spring 2009 1 ECON 497: Economic Research and Forecasting Name: Spring 2009 Bellas Midterm You have three hours and twenty minutes to complete this exam. Answer all questions and explain

More information

Math 5a Reading Assignments for Sections

Math 5a Reading Assignments for Sections Math 5a Reading Assignments for Sections 4.1 4.5 Due Dates for Reading Assignments Note: There will be a very short online reading quiz (WebWork) on each reading assignment due one hour before class on

More information

Marquette University Executive MBA Program Statistics Review Class Notes Summer 2018

Marquette University Executive MBA Program Statistics Review Class Notes Summer 2018 Marquette University Executive MBA Program Statistics Review Class Notes Summer 2018 Chapter One: Data and Statistics Statistics A collection of procedures and principles

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

79 Wyner Math Academy I Spring 2016

79 Wyner Math Academy I Spring 2016 79 Wyner Math Academy I Spring 2016 CHAPTER NINE: HYPOTHESIS TESTING Review May 11 Test May 17 Research requires an understanding of underlying mathematical distributions as well as of the research methods

More information

ACTEX CAS EXAM 3 STUDY GUIDE FOR MATHEMATICAL STATISTICS

ACTEX CAS EXAM 3 STUDY GUIDE FOR MATHEMATICAL STATISTICS ACTEX CAS EXAM 3 STUDY GUIDE FOR MATHEMATICAL STATISTICS TABLE OF CONTENTS INTRODUCTORY NOTE NOTES AND PROBLEM SETS Section 1 - Point Estimation 1 Problem Set 1 15 Section 2 - Confidence Intervals and

More information

MATH 341, Section 001 FALL 2014 Introduction to the Language and Practice of Mathematics

MATH 341, Section 001 FALL 2014 Introduction to the Language and Practice of Mathematics MATH 341, Section 001 FALL 2014 Introduction to the Language and Practice of Mathematics Class Meetings: MW 9:30-10:45 am in EMS E424A, September 3 to December 10 [Thanksgiving break November 26 30; final

More information

Data Science for Engineers Department of Computer Science and Engineering Indian Institute of Technology, Madras

Data Science for Engineers Department of Computer Science and Engineering Indian Institute of Technology, Madras Data Science for Engineers Department of Computer Science and Engineering Indian Institute of Technology, Madras Lecture 36 Simple Linear Regression Model Assessment So, welcome to the second lecture on

More information

Basic Probability Reference Sheet

Basic Probability Reference Sheet February 27, 2001 Basic Probability Reference Sheet 17.846, 2001 This is intended to be used in addition to, not as a substitute for, a textbook. X is a random variable. This means that X is a variable

More information

Final Exam - Solutions

Final Exam - Solutions Ecn 102 - Analysis of Economic Data University of California - Davis March 19, 2010 Instructor: John Parman Final Exam - Solutions You have until 5:30pm to complete this exam. Please remember to put your

More information

MATH 1070 Introductory Statistics Lecture notes Relationships: Correlation and Simple Regression

MATH 1070 Introductory Statistics Lecture notes Relationships: Correlation and Simple Regression MATH 1070 Introductory Statistics Lecture notes Relationships: Correlation and Simple Regression Objectives: 1. Learn the concepts of independent and dependent variables 2. Learn the concept of a scatterplot

More information

Solving Equations by Adding and Subtracting

Solving Equations by Adding and Subtracting SECTION 2.1 Solving Equations by Adding and Subtracting 2.1 OBJECTIVES 1. Determine whether a given number is a solution for an equation 2. Use the addition property to solve equations 3. Determine whether

More information

Simple Linear Regression: One Quantitative IV

Simple Linear Regression: One Quantitative IV Simple Linear Regression: One Quantitative IV Linear regression is frequently used to explain variation observed in a dependent variable (DV) with theoretically linked independent variables (IV). For example,

More information

Multiple Regression Theory 2006 Samuel L. Baker

Multiple Regression Theory 2006 Samuel L. Baker MULTIPLE REGRESSION THEORY 1 Multiple Regression Theory 2006 Samuel L. Baker Multiple regression is regression with two or more independent variables on the right-hand side of the equation. Use multiple

More information

Final Exam Bus 320 Spring 2000 Russell

Final Exam Bus 320 Spring 2000 Russell Name Final Exam Bus 320 Spring 2000 Russell Do not turn over this page until you are told to do so. You will have 3 hours minutes to complete this exam. The exam has a total of 100 points and is divided

More information

Lecture 8. Using the CLR Model. Relation between patent applications and R&D spending. Variables

Lecture 8. Using the CLR Model. Relation between patent applications and R&D spending. Variables Lecture 8. Using the CLR Model Relation between patent applications and R&D spending Variables PATENTS = No. of patents (in 000) filed RDEP = Expenditure on research&development (in billions of 99 $) The

More information

FinQuiz Notes

FinQuiz Notes Reading 10 Multiple Regression and Issues in Regression Analysis 2. MULTIPLE LINEAR REGRESSION Multiple linear regression is a method used to model the linear relationship between a dependent variable

More information

Stochastic Processes

Stochastic Processes Stochastic Processes Stochastic Process Non Formal Definition: Non formal: A stochastic process (random process) is the opposite of a deterministic process such as one defined by a differential equation.

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Chapter 9: Roots and Irrational Numbers

Chapter 9: Roots and Irrational Numbers Chapter 9: Roots and Irrational Numbers Index: A: Square Roots B: Irrational Numbers C: Square Root Functions & Shifting D: Finding Zeros by Completing the Square E: The Quadratic Formula F: Quadratic

More information

Chapter 11. Correlation and Regression

Chapter 11. Correlation and Regression Chapter 11. Correlation and Regression The word correlation is used in everyday life to denote some form of association. We might say that we have noticed a correlation between foggy days and attacks of

More information

Final Exam. Name: Solution:

Final Exam. Name: Solution: Final Exam. Name: Instructions. Answer all questions on the exam. Open books, open notes, but no electronic devices. The first 13 problems are worth 5 points each. The rest are worth 1 point each. HW1.

More information

df=degrees of freedom = n - 1

df=degrees of freedom = n - 1 One sample t-test test of the mean Assumptions: Independent, random samples Approximately normal distribution (from intro class: σ is unknown, need to calculate and use s (sample standard deviation)) Hypotheses:

More information

Ch3. TRENDS. Time Series Analysis

Ch3. TRENDS. Time Series Analysis 3.1 Deterministic Versus Stochastic Trends The simulated random walk in Exhibit 2.1 shows a upward trend. However, it is caused by a strong correlation between the series at nearby time points. The true

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

CHAPTER 21: TIME SERIES ECONOMETRICS: SOME BASIC CONCEPTS

CHAPTER 21: TIME SERIES ECONOMETRICS: SOME BASIC CONCEPTS CHAPTER 21: TIME SERIES ECONOMETRICS: SOME BASIC CONCEPTS 21.1 A stochastic process is said to be weakly stationary if its mean and variance are constant over time and if the value of the covariance between

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

Purposes of Data Analysis. Variables and Samples. Parameters and Statistics. Part 1: Probability Distributions

Purposes of Data Analysis. Variables and Samples. Parameters and Statistics. Part 1: Probability Distributions Part 1: Probability Distributions Purposes of Data Analysis True Distributions or Relationships in the Earths System Probability Distribution Normal Distribution Student-t Distribution Chi Square Distribution

More information

Marketing Research Session 10 Hypothesis Testing with Simple Random samples (Chapter 12)

Marketing Research Session 10 Hypothesis Testing with Simple Random samples (Chapter 12) Marketing Research Session 10 Hypothesis Testing with Simple Random samples (Chapter 12) Remember: Z.05 = 1.645, Z.01 = 2.33 We will only cover one-sided hypothesis testing (cases 12.3, 12.4.2, 12.5.2,

More information