
Lecture 1, Jan 19

Review of the expected value, covariance, correlation coefficient, mean, and variance.

Random variable. A variable that takes on alternative values according to chance. More specifically, a random variable assumes different values, each with probability less than or equal to 1. A continuous random variable may take on any value on the real number line. A discrete random variable may take on only a specific number of values.

Probability mass function (pmf) and probability density function (pdf). The process that generates the values of a random variable. It lists all possible outcomes and the probability that each will occur.

Expected values. The mean, or the expected value, of a random variable X is a weighted average of the possible outcomes, where the probabilities of the outcomes serve as the weights. The expectation operator is denoted E, and the mean of X is denoted μ_X. In the discrete case,

    μ_X = E(X) = p_1 X_1 + p_2 X_2 + ... + p_N X_N = Σ_{i=1}^N p_i X_i,

where p_i = P(X = X_i) and Σ_{i=1}^N p_i = 1. In the continuous case,

    μ_X = E(X) = ∫_Ω x f(x) dx,

where Ω is the support of X and f(x) is the pdf of X.

The sample mean of a set of n outcomes on X is denoted by X̄:

    X̄ = (1/n) Σ_{i=1}^n x_i,

where x_1, ..., x_n are realizations of X.

Variance. The variance of a random variable provides a measure of the spread, or dispersion, around the mean:

    Var(X) = σ_X² = E[X − E(X)]².

In the discrete case,

    Var(X) = σ_X² = Σ_{i=1}^N p_i [X_i − E(X)]².

The positive square root of the variance is called the standard deviation.

Joint distribution. In the discrete case, the joint distribution of X and Y is described by a list of the probabilities of occurrence of all possible outcomes on both X and Y. The covariance of X and Y is

    Cov(X, Y) = E[(X − E(X))(Y − E(Y))].

In the discrete case,

    Cov(X, Y) = Σ_{i=1}^N Σ_{j=1}^M p_ij [X_i − E(X)][Y_j − E(Y)],

where p_ij represents the joint probability of X = X_i and Y = Y_j occurring. The covariance is a measure of the linear association between X and Y. If both variables tend to be above or below their means at the same time, the covariance will be positive. If X is above its mean when Y is below its mean and vice versa, the covariance will be negative. The value of the covariance depends upon the units in which X and Y are measured.

The correlation coefficient ρ_XY is a measure of the association which has been normalized and is scale-free:

    ρ_XY = Cov(X, Y) / (σ_X σ_Y),

where σ_X and σ_Y represent the standard deviations of X and Y respectively. The correlation coefficient is always between −1 and 1. A positive correlation indicates that the variables move in the same direction, while a negative correlation implies that they move in opposite directions.

Properties. Suppose X and Y are random variables and a and b are constants:
1. E(aX + bY) = aE(X) + bE(Y);
2. Var(aX + b) = a² Var(X);
3. Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y);
4. If X and Y are independent, then Cov(X, Y) = 0.

Estimation. Means, variances and covariances can be measured with certainty only if we know all there is to know about all possible outcomes. In practice, we may obtain only a sample of the relevant information needed. Given that x_1, ..., x_n are random observations of X, we want to estimate a population parameter (like the mean or the variance).

The sample mean X̄ is an unbiased estimator of the population mean μ_X:

    E(X̄) = E( (1/n) Σ_{i=1}^n x_i ) = (1/n) Σ_{i=1}^n E(x_i) = (1/n) · n μ_X = μ_X.

The sample variance s_X² = (1/(n−1)) Σ (x_i − X̄)² is an unbiased estimator of the population variance Var(X). The sample covariance s_XY = (1/(n−1)) Σ (x_i − X̄)(y_i − Ȳ) is an unbiased estimator of the population covariance Cov(X, Y). The sample correlation coefficient is defined as

    r_XY = Σ (x_i − X̄)(y_i − Ȳ) / √( Σ (x_i − X̄)² Σ (y_i − Ȳ)² ).

Desired properties of estimators:
1. Lack of bias. The bias associated with an estimated parameter is defined to be Bias(β̂) = E(β̂) − β. β̂ is an unbiased estimator if the mean, or the expected value, of β̂ is equal to the true value, that is, E(β̂) = β.
2. Consistency. β̂ is a consistent estimator of β if for any δ > 0, lim_{n→∞} P(|β̂ − β| < δ) = 1. As the sample size approaches infinity, the probability that β̂ will differ from β gets very small.
3. Efficiency. We say that β̂ is an efficient unbiased estimator if, for a given sample size, the variance of β̂ is smaller than the variance of any other unbiased estimator.
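The unbiasedness claims above can be checked by simulation. A minimal sketch in Python (the population parameters μ = 5, σ = 2 and the sample size are illustrative choices, not from the notes): averaging the sample mean and the (n−1)-denominator sample variance over many replications recovers μ_X and Var(X), while dividing by n instead of n−1 systematically underestimates the variance.

```python
import random

random.seed(0)
mu, sigma = 5.0, 2.0   # illustrative population parameters
n, reps = 10, 20000    # small samples, many replications

means, vars_unbiased, vars_biased = [], [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(sample) / n
    means.append(xbar)
    ss = sum((x - xbar) ** 2 for x in sample)
    vars_unbiased.append(ss / (n - 1))  # s_X^2 with the 1/(n-1) factor
    vars_biased.append(ss / n)          # dividing by n instead

avg_mean = sum(means) / reps                   # close to mu = 5
avg_var_unbiased = sum(vars_unbiased) / reps   # close to sigma^2 = 4
avg_var_biased = sum(vars_biased) / reps       # systematically below 4
```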

Lecture 2, Jan 26

Tradeoff between bias and variance of estimators. When the goal is to maximize the precision of the predictions, an estimator with low variance and some bias may be more desirable than an unbiased estimator with high variance. We may want to minimize the mean square error, defined as

    Mean square error(β̂) = E(β̂ − β)².

It can be shown that

    Mean square error(β̂) = [Bias(β̂)]² + Var(β̂).

The criterion of minimizing mean square error takes into account both the variance and the bias of the estimator.

An alternative criterion to consistency is that the mean square error of the estimator approaches zero as the sample size increases. This implies that asymptotically, or when the sample size is very large, the estimator is unbiased and its variance goes to zero. An estimator whose mean square error approaches zero will be a consistent estimator, but the reverse need not be true.
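The decomposition MSE = Bias² + Var can be verified numerically. A sketch with a deliberately biased estimator (the shrinkage factor 0.9 and the parameter values are hypothetical, chosen only to create bias):

```python
import random

random.seed(1)
mu, sigma, n, reps = 3.0, 1.0, 20, 50000

# A biased estimator: shrink the sample mean toward zero by a factor 0.9.
vals = []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    vals.append(0.9 * sum(sample) / n)

mse = sum((v - mu) ** 2 for v in vals) / reps
mean_v = sum(vals) / reps
bias = mean_v - mu                                   # about -0.3
var = sum((v - mean_v) ** 2 for v in vals) / reps    # about 0.81 * sigma^2 / n
# The empirical moments satisfy mse = bias^2 + var (an algebraic identity)
```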

Probability distributions. The normal distribution is a continuous bell-shaped probability distribution. It can be fully described by its mean and its variance. If X is normally distributed, we write X ~ N(μ_X, σ_X²), which is read "X is normally distributed with mean μ_X and variance σ_X²."

The probability that a single observation of a normally distributed variable will lie within 1.96 standard deviations of its mean is approximately 0.95. The probability that it will lie within 2.57 standard deviations of its mean is approximately 0.99:

    P(μ_X − 1.96σ_X < X < μ_X + 1.96σ_X) ≈ 0.95,
    P(μ_X − 2.57σ_X < X < μ_X + 2.57σ_X) ≈ 0.99.

[Figure: comparison of normal densities with variance 1, variance 4, and a third, larger variance.]

The weighted sum of normal random variables is still normal.
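The 1.96 rule is easy to check by simulation. A quick sketch (the mean 10 and standard deviation 3 are arbitrary illustrative values):

```python
import random

random.seed(4)
mu, sd, reps = 10.0, 3.0, 100000

# Count draws that land within 1.96 standard deviations of the mean
inside = sum(1 for _ in range(reps)
             if abs(random.gauss(mu, sd) - mu) < 1.96 * sd)
frac = inside / reps   # close to 0.95
```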

The chi square distribution with N degrees of freedom is the distribution of the sum of the squares of N independently distributed standard normal random variables (with mean 0 and variance 1). The chi square distribution starts at the origin, is skewed to the right, and has a tail which extends infinitely to the right. The distribution becomes more and more symmetric as the number of degrees of freedom gets larger. It becomes close to the normal distribution when the degrees of freedom is very large.

[Figure: comparison of chi square densities with df = 3, df = 8, and a third, larger df.]

As an example, when we calculate the sample variance s² of n observations from a normal distribution with variance σ², (n−1)s²/σ² is chi square with n−1 degrees of freedom.

The t distribution. Assume X is normal with mean 0 and variance 1, Z is chi square with N degrees of freedom, and X and Z are independent. Then X/√(Z/N) has a t distribution with N degrees of freedom. The t distribution is symmetric, has fatter tails than the normal distribution, and approximates the normal distribution when the degrees of freedom is large.

[Figure: comparison of t densities with df = 1, 3, 8, 30, and the normal density.]

As an example, suppose X ~ N(μ_X, σ_X²), and X̄ = (1/n) Σ_{i=1}^n x_i is the sample mean based on a sample x_1, ..., x_n of size n. Then (X̄ − μ_X)/(σ_X/√n) is normal with mean 0 and variance 1. For unknown σ_X², we replace σ_X² by the sample variance s_X²; then

    (X̄ − μ_X)/(s_X/√n) = [(X̄ − μ_X)/(σ_X/√n)] / √( [(n−1)s_X²/σ_X²] / (n−1) ),

which has a t distribution with n−1 degrees of freedom.
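The fat tails of the studentized mean are visible in simulation: with small n, the statistic (X̄ − μ)/(s_X/√n) exceeds ±1.96 far more often than the roughly 5% a N(0, 1) statistic would. A sketch (sample size and replication count are arbitrary choices):

```python
import random
import math

random.seed(2)
n, reps = 5, 40000   # small n makes the fat tails visible

exceed = 0
for _ in range(reps):
    sample = [random.gauss(0.0, 1.0) for _ in range(n)]
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
    # studentized mean: t with n-1 = 4 degrees of freedom
    if abs(xbar / math.sqrt(s2 / n)) > 1.96:
        exceed += 1

frac = exceed / reps
# A N(0,1) statistic exceeds +-1.96 about 5% of the time;
# the t_4 statistic does so noticeably more often.
```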

The F distribution. If X and Z are independent and distributed as chi square with N_1 and N_2 degrees of freedom, respectively, then (X/N_1)/(Z/N_2) is distributed as an F distribution with N_1 and N_2 degrees of freedom. These two parameters are called the numerator and the denominator degrees of freedom.

[Figure: F densities, labeled F(33,10), F(5,10), and F(2,7).]

Forecasting: to calculate or predict some future event or condition, usually as a result of rational study or analysis of pertinent data. We all make forecasts: a person waiting for a bus; parents expecting a telephone call from their children; a bank manager predicting cash flow for the next quarter; a company manager predicting sales or estimating the number of man-hours required to meet a given production schedule.

Future events involve uncertainty, so forecasts are not perfect. The objective of forecasting is to reduce the forecast error. Each forecast situation has its own specifications, and solutions to one are not solutions in another.

General principles for any forecast system: model specification; model estimation; diagnostic checking; forecast generation; stability checking; forecast updating.

Choice of forecast model: (1) degree of accuracy required; (2) forecast horizon; (3) budget; (4) what data are available. One cannot construct accurate empirical forecast models from a limited and incomplete data base.

Forecast criteria. The actual observation at time t is z_t. Its forecast, which uses the information up to and including time t−1, is z_{t−1}(1). The objective is to make the future forecast error z_t − z_{t−1}(1) as small as possible. However, z_t is unknown, so we can only talk about its expected value, conditional on the observed data up to and including time t−1. We minimize the mean absolute error E|z_t − z_{t−1}(1)| or the mean square error E[z_t − z_{t−1}(1)]². We use the mean square error criterion for simpler mathematical calculations.

What we will discuss. In single-variable forecasting, we use the past history of the series z_t, where t is the time index, to extrapolate into the future. In regression forecasting, we use the relationships between the variable to be forecast and other variables. In the most general form, the regression model can be written as

    y_t = f(x_t1, ..., x_tp; β_1, ..., β_m) + ε_t.

It describes the relationship between one dependent variable Y and p independent variables X_1, ..., X_p. The index t means "at time t" or "for subject t". At index t, we observe y_t and x_t1, ..., x_tp. The parameters β_1, ..., β_m are unknown; the mathematical function form f is known. Randomness in this model arises from the error term ε_t.

Lecture 3, Feb 2

In the most general form, the regression model can be written as

    y_t = f(x_t1, ..., x_tp; β_1, ..., β_m) + ε_t.

It describes the relationship between one dependent variable Y and p independent variables X_1, ..., X_p. The index t means "at time t" or "for subject t". At index t, we observe y_t and x_t1, ..., x_tp. The parameters β_1, ..., β_m are unknown; the mathematical function form f is known. Randomness in this model arises from the error term ε_t.

Models linear in the parameters:
1. y = β_0 + β_1 x_1 + ε;
2. y = β_0 + β_1 x_1 + β_2 x_2 + ε;
3. y = β_0 + β_1 x_1 + β_2 x_1² + ε.

Models linear in the independent variables:
1. y = β_0 + β_1 x_1 + ε;
2. y = β_0 + β_1 x_1 + β_2 x_2 + ε.

Regression through the origin. Examples: y: salary of a sales person; x: number of products sold (no base salary!). y: merit raise in salary; x: number of papers published. Model: y = βx + ε.

Given observations (x_1, y_1), ..., (x_n, y_n), we want to minimize

    S(β) = Σ_{i=1}^n (y_i − βx_i)².

Take the derivative and set it to 0:

    S′(β) = dS(β)/dβ = −2 Σ_{i=1}^n (y_i − βx_i) x_i = 0.

The least squares estimator of β is

    β̂ = Σ x_i y_i / Σ x_i².

Take the second order derivative to make sure this is a minimizer; check

    S″(β) = d²S(β)/dβ² = dS′(β)/dβ = 2 Σ x_i² ≥ 0.

Simple linear regression. Examples: y: resale price of a preowned car; x: mileage. y: running time for a 10-km road race; x: maximal aerobic capacity (oxygen uptake, milliliters per kilogram per minute, ml/(kg·min)). Show MINITAB example! Scatter plot!

Model: y = β_0 + β_1 x + ε. Given observations (x_1, y_1), ..., (x_n, y_n), we want to minimize

    S(β_0, β_1) = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².

Take derivatives and set them to 0:

    ∂S(β_0, β_1)/∂β_0 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i) = 0,
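The through-the-origin estimator is a one-liner. A sketch with hypothetical data (values chosen so that y ≈ 2x), together with a check that β̂ is indeed a minimizer of S:

```python
# Hypothetical data, roughly y = 2x through the origin
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Least squares through the origin: beta_hat = sum(x_i y_i) / sum(x_i^2)
beta_hat = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def S(b):
    """Sum of squared errors S(b) = sum (y_i - b x_i)^2."""
    return sum((y - b * x) ** 2 for x, y in zip(xs, ys))

# Nudging beta_hat in either direction should not lower S
```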

    ∂S(β_0, β_1)/∂β_1 = −2 Σ_{i=1}^n (y_i − β_0 − β_1 x_i) x_i = 0.

We have two equations and two unknowns; solve for β_0 and β_1. Then we have

    β̂_1 = s_xy / s_x² = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,    β̂_0 = ȳ − β̂_1 x̄,

where

    s_xy = (1/(n−1)) [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] = (1/(n−1)) Σ (x_i − x̄)(y_i − ȳ),
    s_x² = (1/(n−1)) [ Σ x_i² − (Σ x_i)²/n ] = (1/(n−1)) Σ (x_i − x̄)².

We still need to take second order derivatives to make sure (β̂_0, β̂_1) are actual minimizers.

Model assumptions:
1. The relationship between x and y is linear, y_i = β_0 + β_1 x_i + ε_i;
2. The x's are non-stochastic variables whose values x_1, ..., x_n are fixed;
3. Normality: the error term ε is normally distributed. Homoscedasticity: the error term ε has mean zero and constant variance for all observations, E(ε_i) = 0, Var(ε_i) = σ². Independence: the random errors ε_i and ε_j are independent of each other for i ≠ j; that is, errors corresponding to different observations are independent.

How do we check those assumptions? DO AN EXAMPLE BY HAND. Show MINITAB RESULTS!

β̂_1, β̂_0 as weighted sums. Let w_i = (x_i − x̄) / Σ (x_i − x̄)²; then β̂_1 = Σ w_i y_i. We notice that Σ w_i = 0 and Σ w_i x_i = Σ w_i (x_i − x̄) = 1.
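The closed-form solutions can be checked directly: at (β̂_0, β̂_1) both normal equations hold, i.e. the residuals sum to zero and are orthogonal to x. A sketch with hypothetical data:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.9, 4.2, 4.9, 6.1]   # hypothetical data

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n

b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
     / sum((x - xbar) ** 2 for x in xs)   # slope
b0 = ybar - b1 * xbar                     # intercept

# The two first-order conditions at the minimizer:
resid = [y - b0 - b1 * x for x, y in zip(xs, ys)]
sum_resid = sum(resid)                               # = 0
sum_x_resid = sum(x * r for x, r in zip(xs, resid))  # = 0
```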

Thus

    E(β̂_1) = E(Σ w_i y_i) = Σ w_i E(y_i) = Σ w_i (β_0 + β_1 x_i) = β_0 Σ w_i + β_1 Σ w_i x_i = β_1,

which means β̂_1 is an unbiased estimator of β_1. Furthermore,

    Σ w_i² = Σ (x_i − x̄)² / [Σ (x_i − x̄)²]² = 1 / Σ (x_i − x̄)²,

and

    Var(β̂_1) = Var(Σ w_i y_i) = Σ w_i² Var(y_i) = σ² Σ w_i² = σ² / Σ (x_i − x̄)².

Similarly,

    β̂_0 = ȳ − β̂_1 x̄ = (1/n) Σ y_i − x̄ Σ w_i y_i = Σ (1/n − x̄ w_i) y_i.

Thus

    E(β̂_0) = Σ (1/n − x̄ w_i) E(y_i) = Σ (1/n − x̄ w_i)(β_0 + β_1 x_i)
            = β_0 − x̄ β_0 Σ w_i + β_1 x̄ − β_1 x̄ Σ w_i x_i = β_0,

which means β̂_0 is an unbiased estimator of β_0. Furthermore,

    Var(β̂_0) = Var[Σ (1/n − x̄ w_i) y_i] = Σ (1/n − x̄ w_i)² Var(y_i) = σ² Σ (1/n − x̄ w_i)²
             = σ² (1/n + x̄² Σ w_i² − 2 x̄ Σ w_i / n)
             = σ² ( 1/n + x̄² / Σ (x_i − x̄)² )
             = σ² Σ x_i² / ( n Σ (x_i − x̄)² ).

We can also show that

    Cov(β̂_0, β̂_1) = − x̄ σ² / Σ (x_i − x̄)².

Gauss–Markov Theorem: if the previous model assumptions are satisfied, then among all the linear unbiased estimators of β_0 and β_1, the least squares estimators β̂_0 and β̂_1 have the smallest variance. The theorem implies that for any unbiased estimator of β_1 of the form Σ w̃_i y_i, its variance is at least Var(β̂_1) = σ² / Σ (x_i − x̄)².

The fitted least squares line ŷ = β̂_0 + β̂_1 x is used to get the fitted values. The i-th residual is the difference between the i-th observation of the dependent variable and its fitted value: r_i = y_i − ŷ_i.

In practice, Var(ε) = σ² is unknown and we need to estimate it. The sum of squares of error (SSE; also known as the sum of squares of residuals) is

    SSE = Σ (y_i − ŷ_i)² = Σ (y_i − β̂_0 − β̂_1 x_i)².

It has n−2 degrees of freedom: we have n random observations, that is n degrees of freedom, and estimating β̂_0 and β̂_1 uses 2 degrees of freedom. The mean square error (MSE) is defined to be

    MSE = SSE/(n−2) = Σ (y_i − β̂_0 − β̂_1 x_i)² / (n−2).

MSE is also denoted by σ̂². It is an unbiased estimator of σ²; it can be shown that E(MSE) = E(σ̂²) = σ². The i-th standardized residual is sr_i = r_i/σ̂.

In summary, we have

    β̂_1 ~ N( β_1, σ² / Σ (x_i − x̄)² ),
    β̂_0 ~ N( β_0, σ² Σ x_i² / (n Σ (x_i − x̄)²) ),
    SSE/σ² = (n−2)σ̂²/σ² ~ χ²_{n−2}.
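The claim E(MSE) = σ² can be checked by simulation. A sketch with an arbitrary true line and error standard deviation (all values illustrative): averaging SSE/(n−2) over many simulated data sets recovers σ².

```python
import random

random.seed(3)
b0_true, b1_true, sigma = 1.0, 2.0, 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n, reps = len(xs), 20000

xbar = sum(xs) / n
sxx = sum((x - xbar) ** 2 for x in xs)

mses = []
for _ in range(reps):
    ys = [b0_true + b1_true * x + random.gauss(0.0, sigma) for x in xs]
    ybar = sum(ys) / n
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
    mses.append(sse / (n - 2))   # MSE = SSE / (n - 2)

avg_mse = sum(mses) / reps       # close to sigma^2 = 0.25
```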

Lecture 4, Feb 16

Simple linear regression. Model: y = β_0 + β_1 x + ε.

Model assumptions:
1. The relationship between x and y is linear, y_i = β_0 + β_1 x_i + ε_i;
2. The x's are non-stochastic variables whose values x_1, ..., x_n are fixed;
3. Normality: the error term ε is normally distributed. Homoscedasticity: the error term ε has mean zero and constant variance for all observations, E(ε_i) = 0, Var(ε_i) = σ². Independence: the random errors ε_i and ε_j are independent of each other for i ≠ j; that is, errors corresponding to different observations are independent.

Given observations (x_1, y_1), ..., (x_n, y_n), we want to minimize

    S(β_0, β_1) = Σ_{i=1}^n (y_i − β_0 − β_1 x_i)².

Take derivatives and set them to 0:

    ∂S(β_0, β_1)/∂β_0 = −2 Σ (y_i − β_0 − β_1 x_i) = 0,
    ∂S(β_0, β_1)/∂β_1 = −2 Σ (y_i − β_0 − β_1 x_i) x_i = 0.

We have two equations and two unknowns; solve for β_0 and β_1. Then we have

    β̂_1 = s_xy / s_x² = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)²,    β̂_0 = ȳ − β̂_1 x̄,

where

    s_xy = (1/(n−1)) [ Σ x_i y_i − (Σ x_i)(Σ y_i)/n ] = (1/(n−1)) Σ (x_i − x̄)(y_i − ȳ),
    s_x² = (1/(n−1)) [ Σ x_i² − (Σ x_i)²/n ] = (1/(n−1)) Σ (x_i − x̄)².

We still need to take second order derivatives to make sure (β̂_0, β̂_1) are actual minimizers.

The fitted least squares line ŷ = β̂_0 + β̂_1 x is used to get the fitted values. The i-th residual is the difference between the i-th observation of the dependent variable and its fitted value: r_i = y_i − ŷ_i.

In practice, Var(ε) = σ² is unknown and we need to estimate it. The sum of squares of error (SSE; also known as the sum of squares of residuals) is

    SSE = Σ (y_i − ŷ_i)² = Σ (y_i − β̂_0 − β̂_1 x_i)².

It has n−2 degrees of freedom: we have n random observations, that is n degrees of freedom, and estimating β̂_0 and β̂_1 uses 2 degrees of freedom. The mean square error (MSE) is defined to be

    MSE = SSE/(n−2) = Σ (y_i − β̂_0 − β̂_1 x_i)² / (n−2).

MSE is also denoted by σ̂². It is an unbiased estimator of σ²; it can be shown that E(MSE) = E(σ̂²) = σ². The i-th standardized residual is sr_i = r_i/σ̂.

In summary, we have

    β̂_1 ~ N( β_1, σ² / Σ (x_i − x̄)² ),
    β̂_0 ~ N( β_0, σ² Σ x_i² / (n Σ (x_i − x̄)²) ),
    SSE/σ² = (n−2)σ̂²/σ² ~ χ²_{n−2}.

Confidence interval. Denote

    σ²_β̂1 = σ² / Σ (x_i − x̄)²,    s²_β̂1 = σ̂² / Σ (x_i − x̄)²;

then (β̂_1 − β_1)/σ_β̂1 ~ N(0, 1). For known σ², the 100(1−α)% confidence interval for β_1 is β̂_1 ± z_{α/2} σ_β̂1.

For unknown σ², plug in s_β̂1 as an estimate of the unknown σ_β̂1 and we have

    t = (β̂_1 − β_1)/s_β̂1 = [(β̂_1 − β_1)/σ_β̂1] / √( s²_β̂1 / σ²_β̂1 )
      = [(β̂_1 − β_1)/σ_β̂1] / √( [(n−2)σ̂²/σ²] / (n−2) )
      ~ N(0, 1) / √( χ²_{n−2} / (n−2) ) = t_{n−2}.

Denote the upper percentile t_{n−2,α/2}, which satisfies P(t > t_{n−2,α/2}) = α/2 for a random variable t ~ t_{n−2}. Thus

    P( −t_{n−2,α/2} < (β̂_1 − β_1)/s_β̂1 < t_{n−2,α/2} ) = 1 − α,
    P( β̂_1 − t_{n−2,α/2} s_β̂1 < β_1 < β̂_1 + t_{n−2,α/2} s_β̂1 ) = 1 − α.

The 100(1−α)% confidence interval for β_1 is β̂_1 ± t_{n−2,α/2} s_β̂1.

Hypothesis tests. We are interested in whether there is a significant predictor effect. Null hypothesis: H_0: β_1 = 0 vs. H_1: β_1 ≠ 0. Test statistic: t = β̂_1/s_β̂1. Under H_0, t ~ t_{n−2}. At significance level α, reject H_0 if |t| > t_{n−2,α/2}.

We are interested in whether there is a significant positive predictor effect. Null hypothesis: H_0: β_1 = 0 vs. H_1: β_1 > 0. Test statistic: t = β̂_1/s_β̂1. Under H_0, t ~ t_{n−2}. At significance level α, reject H_0 if t > t_{n−2,α}.

We are interested in whether the slope is at a certain specified level c. Null hypothesis: H_0: β_1 = c vs. H_1: β_1 ≠ c. Test statistic: t = (β̂_1 − c)/s_β̂1. Under H_0, t ~ t_{n−2}. At significance level α, reject H_0 if |t| > t_{n−2,α/2}.
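The slope interval and t test can be sketched end to end. The data below are hypothetical, and the critical value 2.447 is t_{6, 0.025} taken from a t table (n − 2 = 6 degrees of freedom):

```python
import math

# Hypothetical data, roughly y = x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.1, 2.3, 2.8, 4.5, 5.1, 5.9, 7.2, 8.1]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar

sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
sigma2_hat = sse / (n - 2)                 # MSE
se_b1 = math.sqrt(sigma2_hat / sxx)        # s_{beta_1}

t_crit = 2.447   # t_{6, 0.025} from a t table
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)   # 95% CI for beta_1
t_stat = b1 / se_b1                        # test of H0: beta_1 = 0
```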

Lecture 5, Feb 19

Please check with your classmates for missed lecture notes. The materials listed here are the key points for ANOVA. Make sure you understand EVERYTHING listed here. We are going to get back to ANOVA when we deal with multiple linear regression. This version does not have any examples or demonstrations in MINITAB. TUCAPTURE should be available soon.

ANOVA. The variation of y has the following decomposition:

    Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²,

that is, Total Sum of Squares (SST) = Sum of Squares of Regression (SSR) + Sum of Squares of Error (SSE).

R², the "R squared" or coefficient of determination, is the proportion of variation in y explained by the regression equation:

    R² = SSR/SST = 1 − SSE/SST.

The sample correlation coefficient is R = ±√R²: R > 0 when β̂_1 > 0, and R < 0 when β̂_1 < 0. R measures how closely the data fit a straight line.

ANOVA table:

    Source      df     Sum of Squares        Mean Squares       F
    Regression  1      SSR = Σ (ŷ_i − ȳ)²    MSR = SSR/1        F = MSR/MSE
    Error       n−2    SSE = Σ (y_i − ŷ_i)²  MSE = SSE/(n−2)
    Total       n−1    SST = Σ (y_i − ȳ)²

To test whether there is a significant predictor effect, we don't use R². We use F, which takes the degrees of freedom into consideration. Null hypothesis: H_0: β_1 = 0 vs. H_1: β_1 ≠ 0. If the test statistic F is large, which means a large proportion of the variation in y is explained by the regression, the intuition is that we reject H_0. It can be shown that under H_0, F = MSR/MSE ~ F_{1,n−2}. Denote the upper percentile F_{1,n−2,α}, which satisfies P(F > F_{1,n−2,α}) = α for a random variable F ~ F_{1,n−2}. At significance level α, reject H_0 if F = MSR/MSE > F_{1,n−2,α}.

Notice that

    F = MSR/MSE = SSR / [SSE/(n−2)] = (n−2) SSR / (SST − SSR)
      = (n−2) (SSR/SST) / (SST/SST − SSR/SST) = (n−2) R² / (1 − R²).

A large R² with small n may not be significant, but a moderate R² with a large sample size can be highly significant. It can be shown that the t test and the F test agree with each other.
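The ANOVA identities above are easy to confirm numerically: SST = SSR + SSE, F = (n−2)R²/(1−R²), and F equals the square of the t statistic for β_1. A sketch with hypothetical data:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 5.8]   # hypothetical data

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]

sst = sum((y - ybar) ** 2 for y in ys)
ssr = sum((yh - ybar) ** 2 for yh in yhat)
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))

r2 = ssr / sst
F = (ssr / 1) / (sse / (n - 2))          # MSR / MSE
t2 = b1 ** 2 / ((sse / (n - 2)) / sxx)   # squared t statistic for beta_1
```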

Lecture 6, Feb 23

Model assumptions:
1. The relationship between x and y is linear, y_i = β_0 + β_1 x_i + ε_i;
2. The x's are non-stochastic variables whose values x_1, ..., x_n are fixed;
3. Normality: the error term ε is normally distributed. Homoscedasticity: the error term ε has mean zero and constant variance for all observations, E(ε_i) = 0, Var(ε_i) = σ². Independence: the random errors ε_i and ε_j are independent of each other for i ≠ j; that is, errors corresponding to different observations are independent.

Gauss–Markov Theorem: if the previous model assumptions are satisfied, then among all the linear unbiased estimators of β_0 and β_1, the least squares estimators β̂_0 and β̂_1 have the smallest variance. The theorem implies that for any unbiased estimator of β_1 of the form Σ w̃_i y_i, its variance is at least Var(β̂_1) = σ² / Σ (x_i − x̄)².

Prediction interval. Consider a college trying to predict the first-year GPA of a new student based on the student's high school GPA. Based on present students, the regression equation is ŷ = β̂_0 + β̂_1 x, where x = high school GPA and y = first-year college GPA. For a student with high school GPA x_new = 3.5, we estimate the first-year college GPA by ŷ_new = β̂_0 + β̂_1 x_new = 3.2.

The mean of the forecast error is

    E(y_new − ŷ_new) = E[(β_0 + β_1 x_new + ε_new) − (β̂_0 + β̂_1 x_new)] = 0.

By the Gauss–Markov theorem, ŷ_new is the minimum mean square error forecast among all linear unbiased forecasts. The variance of the forecast error is

    Var(y_new − ŷ_new) = Var[(β_0 + β_1 x_new + ε_new) − (β̂_0 + β̂_1 x_new)]
        = σ² + Var(β̂_0) + x_new² Var(β̂_1) + 2 x_new Cov(β̂_0, β̂_1)
        = σ² + σ² Σ x_i² / (n Σ (x_i − x̄)²) + x_new² σ² / Σ (x_i − x̄)² − 2 x_new x̄ σ² / Σ (x_i − x̄)²
        = σ² ( 1 + 1/n + (x_new − x̄)² / Σ (x_i − x̄)² ).

For known σ², the prediction interval for y_new is

    ŷ_new ± z_{α/2} σ √( 1 + 1/n + (x_new − x̄)² / Σ (x_i − x̄)² ).

For unknown σ², plug in MSE = σ̂² and the prediction interval for y_new is

    ŷ_new ± t_{n−2,α/2} σ̂ √( 1 + 1/n + (x_new − x̄)² / Σ (x_i − x̄)² ).

Which prediction interval is wider? Why?

Confidence interval for the expectation of y. Suppose we are not interested in just one particular student with high school GPA = 3.5, but in the expectation of college GPA for all students with high school GPA = 3.5. With known σ², the confidence interval for y_0 = E(y | x_0) is

    ŷ_0 ± z_{α/2} σ √( 1/n + (x_0 − x̄)² / Σ (x_i − x̄)² ).

For unknown σ², plug in MSE = σ̂² and the confidence interval for y_0 is

    ŷ_0 ± t_{n−2,α/2} σ̂ √( 1/n + (x_0 − x̄)² / Σ (x_i − x̄)² ).
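The two interval half-widths differ only by the extra "1 +" under the square root, which accounts for the new observation's own error term. A sketch with hypothetical data (the critical value 2.447 is t_{6, 0.025} from a t table):

```python
import math

# Hypothetical data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.2, 2.8, 4.1, 4.6, 6.0, 6.3, 7.9, 8.4]

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
sxx = sum((x - xbar) ** 2 for x in xs)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
b0 = ybar - b1 * xbar
sigma_hat = math.sqrt(sum((y - b0 - b1 * x) ** 2
                          for x, y in zip(xs, ys)) / (n - 2))

x0 = 4.5
t_crit = 2.447   # t_{6, 0.025} from a t table
# Half-widths: prediction interval vs confidence interval for E(y | x0)
half_pi = t_crit * sigma_hat * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)
half_ci = t_crit * sigma_hat * math.sqrt(    1 / n + (x0 - xbar) ** 2 / sxx)
# half_pi > half_ci: the prediction interval is always wider
```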

Here ŷ_0 = β̂_0 + β̂_1 x_0. Is the prediction interval wider or the confidence interval wider? Why?

Multiple linear regression. Instead of studying the relationship between the response y and one independent variable (predictor) x, we may also use a linear model to study the relationship between the response y and multiple independent variables x_1, ..., x_p:

    y = β_0 + β_1 x_1 + ... + β_p x_p + ε.

We have a set of n observations (y_i; x_i1, x_i2, ..., x_ip) for i = 1, ..., n. We obtain least squares estimators of the unknown parameters by minimizing the SSE:

    S(β_0, β_1, ..., β_p) = Σ_{i=1}^n [y_i − (β_0 + β_1 x_i1 + ... + β_p x_ip)]².

Set the first order derivatives equal to 0:

    ∂S/∂β_0 = 0, ∂S/∂β_1 = 0, ..., ∂S/∂β_p = 0.

The solutions are denoted by β̂_0, β̂_1, ..., β̂_p. The fitted least squares line is ŷ = β̂_0 + β̂_1 x_1 + ... + β̂_p x_p. The i-th residual is r_i = y_i − ŷ_i.

ANOVA. The variation of y has the same decomposition as in simple linear regression:

    Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²,

that is, SST = SSR + SSE.

ANOVA table:

    Source      df       Sum of Squares        Mean Squares         F
    Regression  p        SSR = Σ (ŷ_i − ȳ)²    MSR = SSR/p          F = MSR/MSE
    Error       n−p−1    SSE = Σ (y_i − ŷ_i)²  MSE = SSE/(n−p−1)
    Total       n−1      SST = Σ (y_i − ȳ)²

To test whether there is a significant predictor effect, we use the simultaneous test with null hypothesis H_0: β_1 = β_2 = ... = β_p = 0 vs. alternative hypothesis H_1: at least one coefficient is not equal to zero. If the test statistic F is large, which means a large proportion of the variation in y is explained by the regression, the intuition is that we reject H_0. It can be shown that under H_0, F = MSR/MSE ~ F_{p,n−p−1}. Denote the upper percentile F_{p,n−p−1,α}, which satisfies P(F > F_{p,n−p−1,α}) = α for a random variable F ~ F_{p,n−p−1}. At significance level α, reject H_0 if F = MSR/MSE > F_{p,n−p−1,α}; this means the regression is significant. Fail to reject H_0 if F = MSR/MSE < F_{p,n−p−1,α}; this means none of the regression coefficients significantly differs from 0.

Notice that

    F = MSR/MSE = [SSR/p] / [SSE/(n−p−1)] = ((n−p−1)/p) · SSR/(SST − SSR)
      = ((n−p−1)/p) · (SSR/SST)/(1 − SSR/SST) = ((n−p−1)/p) · R²/(1 − R²).

The R², or coefficient of determination, is R² = SSR/SST = 1 − SSE/SST. R² always increases as the number of predictors p increases, which may result in model overfitting. We need a new criterion. The corrected (adjusted) R², written R̄², is

    R̄² = 1 − [SSE/(n−p−1)] / [SST/(n−1)].

It takes the degrees of freedom into consideration and penalizes overfitting.
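The overfitting penalty can be seen with made-up sums of squares (the SST/SSE values below are hypothetical, chosen so that an extra predictor barely reduces SSE): R² still creeps up, but the corrected R̄² goes down.

```python
def r2_and_adjusted(sst, sse, n, p):
    """Plain and corrected (adjusted) R^2 for n observations, p predictors."""
    r2 = 1 - sse / sst
    r2_adj = 1 - (sse / (n - p - 1)) / (sst / (n - 1))
    return r2, r2_adj

# Hypothetical fits: adding a third predictor barely helps
r2_a, adj_a = r2_and_adjusted(sst=100.0, sse=40.0, n=20, p=2)
r2_b, adj_b = r2_and_adjusted(sst=100.0, sse=39.5, n=20, p=3)
# r2_b > r2_a (R^2 always increases), but adj_b < adj_a (df penalty wins)
```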

Lecture 7, March 2

Please read Chapter 2 of the textbook, from page 8 to page 41. You may skip Section … and Section … . For the basics of linear algebra about matrix calculation, you may find the entry from Wikipedia useful: http://en.wikipedia.org/wiki/Matrix_(mathematics)

Matrix representation of multiple linear regression. We have a set of n independent observations (y_i; x_i1, x_i2, ..., x_ip) from the model

    y = β_0 + β_1 x_1 + ... + β_p x_p + ε,

so that

    y_1 = β_0 + β_1 x_11 + ... + β_p x_1p + ε_1,
    y_2 = β_0 + β_1 x_21 + ... + β_p x_2p + ε_2,
    ...
    y_n = β_0 + β_1 x_n1 + ... + β_p x_np + ε_n.

Define

    Y = (y_1, y_2, ..., y_n)′,    β = (β_0, β_1, ..., β_p)′,    ε = (ε_1, ε_2, ..., ε_n)′,

and let X be the n × (p+1) matrix whose i-th row is (1, x_i1, ..., x_ip). Then we have

    Y = Xβ + ε,

where ε is multivariate normal with E(ε) = 0 and Var(ε) = σ² I_n. The SSE can be written as

    S(β) = Σ_{i=1}^n [y_i − (β_0 + β_1 x_i1 + ... + β_p x_ip)]² = (Y − Xβ)′(Y − Xβ).

Taking derivatives with respect to β, we have the following normal equation:

    X′Xβ = X′Y,

which leads to the least squares estimator

    β̂ = (X′X)⁻¹X′Y.

Because

    E(β̂) = E[(X′X)⁻¹X′Y] = (X′X)⁻¹X′E(Y) = (X′X)⁻¹X′E(Xβ + ε)
          = (X′X)⁻¹X′[Xβ + E(ε)] = (X′X)⁻¹X′Xβ = I_{p+1} β = β,

β̂ is an unbiased estimator of β. Because Var(AX) = A Var(X) A′ for a random vector X and a constant matrix A, we have

    Var(β̂) = Var[(X′X)⁻¹X′Y] = [(X′X)⁻¹X′] Var(Y) [(X′X)⁻¹X′]′
            = [(X′X)⁻¹X′] Var(Xβ + ε) [(X′X)⁻¹X′]′
            = σ² (X′X)⁻¹X′ I_n X (X′X)⁻¹ = σ² (X′X)⁻¹.

Thus we know β̂ ~ N_{p+1}(β, σ²(X′X)⁻¹), and β̂_i ~ N(β_i, σ² d_i), where d_i is the i-th element on the diagonal of the matrix (X′X)⁻¹.
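The normal equation can be solved directly in a few lines of code. A minimal sketch in pure Python (the design matrix and responses are made-up numbers with an exact fit y = 1 + 2x_1 + 3x_2, so the estimator should recover β exactly):

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][c] * x[c]
                              for c in range(k + 1, n))) / M[k][k]
    return x

# Design matrix with an intercept column; y = 1 + 2*x1 + 3*x2 exactly
X = [[1.0, 0.0, 0.0],
     [1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0],
     [1.0, 2.0, 1.0]]
Y = [1.0, 3.0, 4.0, 6.0, 8.0]

Xt = transpose(X)
XtX = matmul(Xt, X)                                     # X'X
XtY = [row[0] for row in matmul(Xt, [[v] for v in Y])]  # X'Y
beta_hat = solve(XtX, XtY)   # solves the normal equation X'X beta = X'Y
```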


More information

ECON 3150/4150, Spring term Lecture 3

ECON 3150/4150, Spring term Lecture 3 Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio

More information

Lecture 11 Simple Linear Regression

Lecture 11 Simple Linear Regression Lecture 11 Simple Liear Regressio Fall 2013 Prof. Yao Xie, yao.xie@isye.gatech.edu H. Milto Stewart School of Idustrial Systems & Egieerig Georgia Tech Midterm 2 mea: 91.2 media: 93.75 std: 6.5 2 Meddicorp

More information

Worksheet 23 ( ) Introduction to Simple Linear Regression (continued)

Worksheet 23 ( ) Introduction to Simple Linear Regression (continued) Worksheet 3 ( 11.5-11.8) Itroductio to Simple Liear Regressio (cotiued) This worksheet is a cotiuatio of Discussio Sheet 3; please complete that discussio sheet first if you have ot already doe so. This

More information

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise) Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +

More information

Stat 421-SP2012 Interval Estimation Section

Stat 421-SP2012 Interval Estimation Section Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible

More information

7-1. Chapter 4. Part I. Sampling Distributions and Confidence Intervals

7-1. Chapter 4. Part I. Sampling Distributions and Confidence Intervals 7-1 Chapter 4 Part I. Samplig Distributios ad Cofidece Itervals 1 7- Sectio 1. Samplig Distributio 7-3 Usig Statistics Statistical Iferece: Predict ad forecast values of populatio parameters... Test hypotheses

More information

Lecture 3. Properties of Summary Statistics: Sampling Distribution

Lecture 3. Properties of Summary Statistics: Sampling Distribution Lecture 3 Properties of Summary Statistics: Samplig Distributio Mai Theme How ca we use math to justify that our umerical summaries from the sample are good summaries of the populatio? Lecture Summary

More information

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures

FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING. Lectures FACULTY OF MATHEMATICAL STUDIES MATHEMATICS FOR PART I ENGINEERING Lectures MODULE 5 STATISTICS II. Mea ad stadard error of sample data. Biomial distributio. Normal distributio 4. Samplig 5. Cofidece itervals

More information

Statistics 20: Final Exam Solutions Summer Session 2007

Statistics 20: Final Exam Solutions Summer Session 2007 1. 20 poits Testig for Diabetes. Statistics 20: Fial Exam Solutios Summer Sessio 2007 (a) 3 poits Give estimates for the sesitivity of Test I ad of Test II. Solutio: 156 patiets out of total 223 patiets

More information

Chapters 5 and 13: REGRESSION AND CORRELATION. Univariate data: x, Bivariate data (x,y).

Chapters 5 and 13: REGRESSION AND CORRELATION. Univariate data: x, Bivariate data (x,y). Chapters 5 ad 13: REGREION AND CORRELATION (ectios 5.5 ad 13.5 are omitted) Uivariate data: x, Bivariate data (x,y). Example: x: umber of years studets studied paish y: score o a proficiecy test For each

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Explorig Data: Distributios Look for overall patter (shape, ceter, spread) ad deviatios (outliers). Mea (use a calculator): x = x 1 + x 2 + +

More information

Section 14. Simple linear regression.

Section 14. Simple linear regression. Sectio 14 Simple liear regressio. Let us look at the cigarette dataset from [1] (available to dowload from joural s website) ad []. The cigarette dataset cotais measuremets of tar, icotie, weight ad carbo

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

MBACATÓLICA. Quantitative Methods. Faculdade de Ciências Económicas e Empresariais UNIVERSIDADE CATÓLICA PORTUGUESA 9. SAMPLING DISTRIBUTIONS

MBACATÓLICA. Quantitative Methods. Faculdade de Ciências Económicas e Empresariais UNIVERSIDADE CATÓLICA PORTUGUESA 9. SAMPLING DISTRIBUTIONS MBACATÓLICA Quatitative Methods Miguel Gouveia Mauel Leite Moteiro Faculdade de Ciêcias Ecoómicas e Empresariais UNIVERSIDADE CATÓLICA PORTUGUESA 9. SAMPLING DISTRIBUTIONS MBACatólica 006/07 Métodos Quatitativos

More information

SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS

SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS SIMPLE LINEAR REGRESSION AND CORRELATION ANALSIS INTRODUCTION There are lot of statistical ivestigatio to kow whether there is a relatioship amog variables Two aalyses: (1) regressio aalysis; () correlatio

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

t distribution [34] : used to test a mean against an hypothesized value (H 0 : µ = µ 0 ) or the difference

t distribution [34] : used to test a mean against an hypothesized value (H 0 : µ = µ 0 ) or the difference EXST30 Backgroud material Page From the textbook The Statistical Sleuth Mea [0]: I your text the word mea deotes a populatio mea (µ) while the work average deotes a sample average ( ). Variace [0]: The

More information

Circle the single best answer for each multiple choice question. Your choice should be made clearly.

Circle the single best answer for each multiple choice question. Your choice should be made clearly. TEST #1 STA 4853 March 6, 2017 Name: Please read the followig directios. DO NOT TURN THE PAGE UNTIL INSTRUCTED TO DO SO Directios This exam is closed book ad closed otes. There are 32 multiple choice questios.

More information

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. Comments:

STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS. Comments: Recall: STATISTICAL PROPERTIES OF LEAST SQUARES ESTIMATORS Commets:. So far we have estimates of the parameters! 0 ad!, but have o idea how good these estimates are. Assumptio: E(Y x)! 0 +! x (liear coditioal

More information

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen) Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................

More information

Expectation and Variance of a random variable

Expectation and Variance of a random variable Chapter 11 Expectatio ad Variace of a radom variable The aim of this lecture is to defie ad itroduce mathematical Expectatio ad variace of a fuctio of discrete & cotiuous radom variables ad the distributio

More information

Simple Regression. Acknowledgement. These slides are based on presentations created and copyrighted by Prof. Daniel Menasce (GMU) CS 700

Simple Regression. Acknowledgement. These slides are based on presentations created and copyrighted by Prof. Daniel Menasce (GMU) CS 700 Simple Regressio CS 7 Ackowledgemet These slides are based o presetatios created ad copyrighted by Prof. Daiel Measce (GMU) Basics Purpose of regressio aalysis: predict the value of a depedet or respose

More information

Continuous Data that can take on any real number (time/length) based on sample data. Categorical data can only be named or categorised

Continuous Data that can take on any real number (time/length) based on sample data. Categorical data can only be named or categorised Questio 1. (Topics 1-3) A populatio cosists of all the members of a group about which you wat to draw a coclusio (Greek letters (μ, σ, Ν) are used) A sample is the portio of the populatio selected for

More information

CHAPTER 8 FUNDAMENTAL SAMPLING DISTRIBUTIONS AND DATA DESCRIPTIONS. 8.1 Random Sampling. 8.2 Some Important Statistics

CHAPTER 8 FUNDAMENTAL SAMPLING DISTRIBUTIONS AND DATA DESCRIPTIONS. 8.1 Random Sampling. 8.2 Some Important Statistics CHAPTER 8 FUNDAMENTAL SAMPLING DISTRIBUTIONS AND DATA DESCRIPTIONS 8.1 Radom Samplig The basic idea of the statistical iferece is that we are allowed to draw ifereces or coclusios about a populatio based

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL/MAY 2009 EXAMINATIONS ECO220Y1Y PART 1 OF 2 SOLUTIONS

UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL/MAY 2009 EXAMINATIONS ECO220Y1Y PART 1 OF 2 SOLUTIONS PART of UNIVERSITY OF TORONTO Faculty of Arts ad Sciece APRIL/MAY 009 EAMINATIONS ECO0YY PART OF () The sample media is greater tha the sample mea whe there is. (B) () A radom variable is ormally distributed

More information

Chapter 1 Simple Linear Regression (part 6: matrix version)

Chapter 1 Simple Linear Regression (part 6: matrix version) Chapter Simple Liear Regressio (part 6: matrix versio) Overview Simple liear regressio model: respose variable Y, a sigle idepedet variable X Y β 0 + β X + ε Multiple liear regressio model: respose Y,

More information

Final Examination Solutions 17/6/2010

Final Examination Solutions 17/6/2010 The Islamic Uiversity of Gaza Faculty of Commerce epartmet of Ecoomics ad Political Scieces A Itroductio to Statistics Course (ECOE 30) Sprig Semester 009-00 Fial Eamiatio Solutios 7/6/00 Name: I: Istructor:

More information

Asymptotic Results for the Linear Regression Model

Asymptotic Results for the Linear Regression Model Asymptotic Results for the Liear Regressio Model C. Fli November 29, 2000 1. Asymptotic Results uder Classical Assumptios The followig results apply to the liear regressio model y = Xβ + ε, where X is

More information

Statistical and Mathematical Methods DS-GA 1002 December 8, Sample Final Problems Solutions

Statistical and Mathematical Methods DS-GA 1002 December 8, Sample Final Problems Solutions Statistical ad Mathematical Methods DS-GA 00 December 8, 05. Short questios Sample Fial Problems Solutios a. Ax b has a solutio if b is i the rage of A. The dimesio of the rage of A is because A has liearly-idepedet

More information

[ ] ( ) ( ) [ ] ( ) 1 [ ] [ ] Sums of Random Variables Y = a 1 X 1 + a 2 X 2 + +a n X n The expected value of Y is:

[ ] ( ) ( ) [ ] ( ) 1 [ ] [ ] Sums of Random Variables Y = a 1 X 1 + a 2 X 2 + +a n X n The expected value of Y is: PROBABILITY FUNCTIONS A radom variable X has a probabilit associated with each of its possible values. The probabilit is termed a discrete probabilit if X ca assume ol discrete values, or X = x, x, x 3,,

More information

ST 305: Exam 3 ( ) = P(A)P(B A) ( ) = P(A) + P(B) ( ) = 1 P( A) ( ) = P(A) P(B) σ X 2 = σ a+bx. σ ˆp. σ X +Y. σ X Y. σ X. σ Y. σ n.

ST 305: Exam 3 ( ) = P(A)P(B A) ( ) = P(A) + P(B) ( ) = 1 P( A) ( ) = P(A) P(B) σ X 2 = σ a+bx. σ ˆp. σ X +Y. σ X Y. σ X. σ Y. σ n. ST 305: Exam 3 By hadig i this completed exam, I state that I have either give or received assistace from aother perso durig the exam period. I have used o resources other tha the exam itself ad the basic

More information

Correlation. Two variables: Which test? Relationship Between Two Numerical Variables. Two variables: Which test? Contingency table Grouped bar graph

Correlation. Two variables: Which test? Relationship Between Two Numerical Variables. Two variables: Which test? Contingency table Grouped bar graph Correlatio Y Two variables: Which test? X Explaatory variable Respose variable Categorical Numerical Categorical Cotigecy table Cotigecy Logistic Grouped bar graph aalysis regressio Mosaic plot Numerical

More information

Direction: This test is worth 150 points. You are required to complete this test within 55 minutes.

Direction: This test is worth 150 points. You are required to complete this test within 55 minutes. Term Test 3 (Part A) November 1, 004 Name Math 6 Studet Number Directio: This test is worth 10 poits. You are required to complete this test withi miutes. I order to receive full credit, aswer each problem

More information

First, note that the LS residuals are orthogonal to the regressors. X Xb X y = 0 ( normal equations ; (k 1) ) So,

First, note that the LS residuals are orthogonal to the regressors. X Xb X y = 0 ( normal equations ; (k 1) ) So, 0 2. OLS Part II The OLS residuals are orthogoal to the regressors. If the model icludes a itercept, the orthogoality of the residuals ad regressors gives rise to three results, which have limited practical

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Sample Size Determination (Two or More Samples)

Sample Size Determination (Two or More Samples) Sample Sie Determiatio (Two or More Samples) STATGRAPHICS Rev. 963 Summary... Data Iput... Aalysis Summary... 5 Power Curve... 5 Calculatios... 6 Summary This procedure determies a suitable sample sie

More information

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10

DS 100: Principles and Techniques of Data Science Date: April 13, Discussion #10 DS 00: Priciples ad Techiques of Data Sciece Date: April 3, 208 Name: Hypothesis Testig Discussio #0. Defie these terms below as they relate to hypothesis testig. a) Data Geeratio Model: Solutio: A set

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates.

Big Picture. 5. Data, Estimates, and Models: quantifying the accuracy of estimates. 5. Data, Estimates, ad Models: quatifyig the accuracy of estimates. 5. Estimatig a Normal Mea 5.2 The Distributio of the Normal Sample Mea 5.3 Normal data, cofidece iterval for, kow 5.4 Normal data, cofidece

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would

More information

Output Analysis (2, Chapters 10 &11 Law)

Output Analysis (2, Chapters 10 &11 Law) B. Maddah ENMG 6 Simulatio Output Aalysis (, Chapters 10 &11 Law) Comparig alterative system cofiguratio Sice the output of a simulatio is radom, the comparig differet systems via simulatio should be doe

More information

Linear Regression Models, OLS, Assumptions and Properties

Linear Regression Models, OLS, Assumptions and Properties Chapter 2 Liear Regressio Models, OLS, Assumptios ad Properties 2.1 The Liear Regressio Model The liear regressio model is the sigle most useful tool i the ecoometricia s kit. The multiple regressio model

More information

Understanding Samples

Understanding Samples 1 Will Moroe CS 109 Samplig ad Bootstrappig Lecture Notes #17 August 2, 2017 Based o a hadout by Chris Piech I this chapter we are goig to talk about statistics calculated o samples from a populatio. We

More information

Topic 10: Introduction to Estimation

Topic 10: Introduction to Estimation Topic 0: Itroductio to Estimatio Jue, 0 Itroductio I the simplest possible terms, the goal of estimatio theory is to aswer the questio: What is that umber? What is the legth, the reactio rate, the fractio

More information

II. Descriptive Statistics D. Linear Correlation and Regression. 1. Linear Correlation

II. Descriptive Statistics D. Linear Correlation and Regression. 1. Linear Correlation II. Descriptive Statistics D. Liear Correlatio ad Regressio I this sectio Liear Correlatio Cause ad Effect Liear Regressio 1. Liear Correlatio Quatifyig Liear Correlatio The Pearso product-momet correlatio

More information

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors

ECONOMETRIC THEORY. MODULE XIII Lecture - 34 Asymptotic Theory and Stochastic Regressors ECONOMETRIC THEORY MODULE XIII Lecture - 34 Asymptotic Theory ad Stochastic Regressors Dr. Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Asymptotic theory The asymptotic

More information

Chapter 13, Part A Analysis of Variance and Experimental Design

Chapter 13, Part A Analysis of Variance and Experimental Design Slides Prepared by JOHN S. LOUCKS St. Edward s Uiversity Slide 1 Chapter 13, Part A Aalysis of Variace ad Eperimetal Desig Itroductio to Aalysis of Variace Aalysis of Variace: Testig for the Equality of

More information

In this section we derive some finite-sample properties of the OLS estimator. b is an estimator of β. It is a function of the random sample data.

In this section we derive some finite-sample properties of the OLS estimator. b is an estimator of β. It is a function of the random sample data. 17 3. OLS Part III I this sectio we derive some fiite-sample properties of the OLS estimator. 3.1 The Samplig Distributio of the OLS Estimator y = Xβ + ε ; ε ~ N[0, σ 2 I ] b = (X X) 1 X y = f(y) ε is

More information

Module 1 Fundamentals in statistics

Module 1 Fundamentals in statistics Normal Distributio Repeated observatios that differ because of experimetal error ofte vary about some cetral value i a roughly symmetrical distributio i which small deviatios occur much more frequetly

More information

MA Advanced Econometrics: Properties of Least Squares Estimators

MA Advanced Econometrics: Properties of Least Squares Estimators MA Advaced Ecoometrics: Properties of Least Squares Estimators Karl Whela School of Ecoomics, UCD February 5, 20 Karl Whela UCD Least Squares Estimators February 5, 20 / 5 Part I Least Squares: Some Fiite-Sample

More information

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would

More information

The Method of Least Squares. To understand least squares fitting of data.

The Method of Least Squares. To understand least squares fitting of data. The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve

More information

Data Analysis and Statistical Methods Statistics 651

Data Analysis and Statistical Methods Statistics 651 Data Aalysis ad Statistical Methods Statistics 651 http://www.stat.tamu.edu/~suhasii/teachig.html Suhasii Subba Rao Review of testig: Example The admistrator of a ursig home wats to do a time ad motio

More information

Discrete Mathematics for CS Spring 2008 David Wagner Note 22

Discrete Mathematics for CS Spring 2008 David Wagner Note 22 CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 22 I.I.D. Radom Variables Estimatig the bias of a coi Questio: We wat to estimate the proportio p of Democrats i the US populatio, by takig

More information

Probability and statistics: basic terms

Probability and statistics: basic terms Probability ad statistics: basic terms M. Veeraraghava August 203 A radom variable is a rule that assigs a umerical value to each possible outcome of a experimet. Outcomes of a experimet form the sample

More information

Final Review. Fall 2013 Prof. Yao Xie, H. Milton Stewart School of Industrial Systems & Engineering Georgia Tech

Final Review. Fall 2013 Prof. Yao Xie, H. Milton Stewart School of Industrial Systems & Engineering Georgia Tech Fial Review Fall 2013 Prof. Yao Xie, yao.xie@isye.gatech.edu H. Milto Stewart School of Idustrial Systems & Egieerig Georgia Tech 1 Radom samplig model radom samples populatio radom samples: x 1,..., x

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Direction: This test is worth 250 points. You are required to complete this test within 50 minutes.

Direction: This test is worth 250 points. You are required to complete this test within 50 minutes. Term Test October 3, 003 Name Math 56 Studet Number Directio: This test is worth 50 poits. You are required to complete this test withi 50 miutes. I order to receive full credit, aswer each problem completely

More information

Lecture 7: Properties of Random Samples

Lecture 7: Properties of Random Samples Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ

More information

Linear Regression Demystified

Linear Regression Demystified Liear Regressio Demystified Liear regressio is a importat subject i statistics. I elemetary statistics courses, formulae related to liear regressio are ofte stated without derivatio. This ote iteds to

More information

Lesson 11: Simple Linear Regression

Lesson 11: Simple Linear Regression Lesso 11: Simple Liear Regressio Ka-fu WONG December 2, 2004 I previous lessos, we have covered maily about the estimatio of populatio mea (or expected value) ad its iferece. Sometimes we are iterested

More information

Binomial Distribution

Binomial Distribution 0.0 0.5 1.0 1.5 2.0 2.5 3.0 0 1 2 3 4 5 6 7 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Overview Example: coi tossed three times Defiitio Formula Recall that a r.v. is discrete if there are either a fiite umber of possible

More information

1 Models for Matched Pairs

1 Models for Matched Pairs 1 Models for Matched Pairs Matched pairs occur whe we aalyse samples such that for each measuremet i oe of the samples there is a measuremet i the other sample that directly relates to the measuremet i

More information

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d

Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

TMA4245 Statistics. Corrected 30 May and 4 June Norwegian University of Science and Technology Department of Mathematical Sciences.

TMA4245 Statistics. Corrected 30 May and 4 June Norwegian University of Science and Technology Department of Mathematical Sciences. Norwegia Uiversity of Sciece ad Techology Departmet of Mathematical Scieces Corrected 3 May ad 4 Jue Solutios TMA445 Statistics Saturday 6 May 9: 3: Problem Sow desity a The probability is.9.5 6x x dx

More information

Inferential Statistics. Inference Process. Inferential Statistics and Probability a Holistic Approach. Inference Process.

Inferential Statistics. Inference Process. Inferential Statistics and Probability a Holistic Approach. Inference Process. Iferetial Statistics ad Probability a Holistic Approach Iferece Process Chapter 8 Poit Estimatio ad Cofidece Itervals This Course Material by Maurice Geraghty is licesed uder a Creative Commos Attributio-ShareAlike

More information

If, for instance, we were required to test whether the population mean μ could be equal to a certain value μ

If, for instance, we were required to test whether the population mean μ could be equal to a certain value μ STATISTICAL INFERENCE INTRODUCTION Statistical iferece is that brach of Statistics i which oe typically makes a statemet about a populatio based upo the results of a sample. I oesample testig, we essetially

More information

Probability and Statistics

Probability and Statistics ICME Refresher Course: robability ad Statistics Staford Uiversity robability ad Statistics Luyag Che September 20, 2016 1 Basic robability Theory 11 robability Spaces A probability space is a triple (Ω,

More information

Statistics 203 Introduction to Regression and Analysis of Variance Assignment #1 Solutions January 20, 2005

Statistics 203 Introduction to Regression and Analysis of Variance Assignment #1 Solutions January 20, 2005 Statistics 203 Itroductio to Regressio ad Aalysis of Variace Assigmet #1 Solutios Jauary 20, 2005 Q. 1) (MP 2.7) (a) Let x deote the hydrocarbo percetage, ad let y deote the oxyge purity. The simple liear

More information

Overview. p 2. Chapter 9. Pooled Estimate of. q = 1 p. Notation for Two Proportions. Inferences about Two Proportions. Assumptions

Overview. p 2. Chapter 9. Pooled Estimate of. q = 1 p. Notation for Two Proportions. Inferences about Two Proportions. Assumptions Chapter 9 Slide Ifereces from Two Samples 9- Overview 9- Ifereces about Two Proportios 9- Ifereces about Two Meas: Idepedet Samples 9-4 Ifereces about Matched Pairs 9-5 Comparig Variatio i Two Samples

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

University of California, Los Angeles Department of Statistics. Simple regression analysis

University of California, Los Angeles Department of Statistics. Simple regression analysis Uiversity of Califoria, Los Ageles Departmet of Statistics Statistics 100C Istructor: Nicolas Christou Simple regressio aalysis Itroductio: Regressio aalysis is a statistical method aimig at discoverig

More information

Open book and notes. 120 minutes. Cover page and six pages of exam. No calculators.

Open book and notes. 120 minutes. Cover page and six pages of exam. No calculators. IE 330 Seat # Ope book ad otes 120 miutes Cover page ad six pages of exam No calculators Score Fial Exam (example) Schmeiser Ope book ad otes No calculator 120 miutes 1 True or false (for each, 2 poits

More information

Homework for 4/9 Due 4/16

Homework for 4/9 Due 4/16 Name: ID: Homework for 4/9 Due 4/16 1. [ 13-6] It is covetioal wisdom i military squadros that pilots ted to father more girls tha boys. Syder 1961 gathered data for military fighter pilots. The sex of

More information