Statistics GIDP Ph.D. Qualifying Exam Methodology


May 26, 2017, 9:00am-1:00pm

Instructions: Put your ID (not your name) on each sheet. Complete exactly 5 of 6 problems; turn in only those sheets you wish to have graded. Each question, but not necessarily each part, is equally weighted. Provide answers on the supplied pads of paper and/or use a Microsoft Word document or equivalent to report your software code and outputs. Number each problem. You may turn in only one electronic document. Embed relevant code and output/graphics into your Word document. Write on only one side of each sheet if you use paper. You may use the computer and/or a calculator. Stay calm and do your best. Good luck!

1. A process engineer is testing the yield of a product manufactured on five machines. Each machine has two operators, one for the day shift and one for the night shift. Assume the operator factor is random. We take five samples from each machine for each operator and obtain the data in machine.csv (columns: Machine, Day Operator, Night Operator).
(a) What design is this?
(b) State the statistical model and assumptions.
(c) Analyze the data and draw a conclusion.
(d) If these five machines were randomly selected from many machines in the factory, would the conclusion be the same as the one obtained in (c)? Explain (no calculation needed).
(e) Attach your SAS code here.

2. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem, as it can lead to non-recoverable failure. A test is run at the parts producer to determine the effects of four factors on cracks. The four factors are pouring temperature (A), titanium content (B), heat treatment method (C), and the amount of grain

refiner used (D). Each factor contains two levels and 16 runs are performed. Two operators need to take care of these 16 runs. There might be some variation between the two operators.

A B C D Operator
[table of the 16 runs; entries omitted]

(a) Help them divide the workload equally by filling in the table above.
(b) Assume the response measurements in the above table (from top to bottom) are 25, 71, 48, 45, 68, 40, 60, 65, 43, 80, 25, 104, 55, 86, 70, 76 and the dataset is given in aircraft.csv. Use SAS code to estimate the factor effects. Which factor effects appear to be large? Is there a large variation between the two operators?
(c) Conduct an analysis of variance to verify the conclusion of (b).
(d) Attach your SAS code here.

3. A study was carried out to compare the writing lifetime of four premium brands of pens. It was thought that the writing surface would affect lifetime, so three different surfaces were used; the data, with surface and brand averages, are given in pen.csv.
(a) What design is this?
(b) State the statistical model with assumptions.
(c) How would you check whether there exists any significant interaction between the surfaces and brands of pens? State your hypothesis in mathematical notation.
(d) Analyze the data using the given dataset pen.csv.
(e) Attach your SAS code.
(f) Assume that in this study 3 observations were collected for each combination and each value in the above table was the average of 3 replicates. A two-way ANOVA model with interaction is fitted and the MSE is given. Complete the following ANOVA table and draw conclusions.

Source       DF   SS   MS   F-value   P-value
Brand
Surface
Interaction
Error

4. In a study of carbohydrate uptake (Y) as a function of other factors in male diabetics, observations were taken on Y, Age (x1), Weight (x2), and Dietary Protein (x3). Analyze these data to determine which (if any) of the predictor variables (including any appropriate interactions) appear to significantly affect carbohydrate uptake. Throughout, set α = 0.05, but for simplicity do not employ any adjustments for multiplicity/multiple inferences when assessing the effects of the predictor variables. Remember to assess the quality of the fit via standard diagnostics. Attach supporting components of your computer code. Report your findings. The data are found in the file diet.csv.

5. A large dataset of n = 1030 samples of concrete involving a total of p − 1 = 8 predictor variables was collected:
x1 = Age
x2 = Cement
x3 = Furnace Slag
x4 = Superplasticizer
x5 = Water
x6 = Fly ash
x7 = Coarse Aggregate
x8 = Fine Aggregate
along with Y = Compressive Strength. The data appear in the file concrete.csv. Consider a multiple linear regression (MLR) model for these data, with E[Y] = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7 + β8X8. Conduct a variable selection search to identify a possible reduced model among the eight predictor variables with this data set. Employ backward elimination and take minimum BIC

as your selection criterion. Attach supporting components of your computer code. What is the recommended set of variables for further study?

6. Consider the simple linear regression model: Yi ~ indep. N(β0 + β1Xi, σ²), i = 1, ..., n, where in particular it is known that β0 = 1, so that E[Y] = 1 + β1X. Suppose interest exists in estimating the X value at which E[Y] = 0. Let this target parameter be ξ.
a) Find ξ as a function of β1. Also find the maximum likelihood estimator for ξ. Call this ξ̂.
b) Recall from Casella & Berger that the Delta Method can be used to determine the asymptotic features of a function of random variables. In particular, for a random variable U and a differentiable function g(u), where E[U] = θ, a first-order approximation to E[g(U)] is
E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ}E(U − θ).
Use this to find a first-order approximation for E[ξ̂].
c) In part (b), a second-order approximation to E[g(U)] is also available from Casella & Berger's book:
E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ}E(U − θ) + ½{∂²g(θ)/∂θ²}E[(U − θ)²].
Use this to find a second-order approximation for E[ξ̂].

2017 May - method

1. A process engineer is testing the yield of a product manufactured on five machines. Each machine has two operators, one for the day shift and one for the night shift. Assume the operator factor is random. We take five samples from each machine for each operator and obtain the data in machine.csv.

(a) What design is this?
Nested design (operator is nested within machine).

(b) State the statistical model and assumptions.
y_ijk = μ + τ_i + β_j(i) + ε_ijk, with Σ_i τ_i = 0 (machines fixed), β_j(i) ~ N(0, σ_β²) independent (operators random), and ε_ijk ~ N(0, σ²) independent.

(c) Analyze the data and draw a conclusion.
From the SAS output of the above model, the machine effect is significant while the operator effect (nested within machine) is not. The Type 1 Analysis of Variance gives the expected mean squares
machine: Var(Residual) + 5 Var(operator(machine)) + Q(machine), tested against MS(operator(machine));
operator(machine): Var(Residual) + 5 Var(operator(machine)), tested against MS(Residual);
Residual: Var(Residual).
[numeric SS, F, and p-values omitted]
Or use the default setting method=reml, which gives:

Type 3 Tests of Fixed Effects: Effect = machine, with Num DF, Den DF, F Value, and Pr > F as reported by PROC MIXED [numeric output omitted].

(d) If these five machines were randomly selected from many machines in the factory, would the conclusion be the same as the one obtained in (c)? Explain (no calculation needed).
Yes, as the F-statistic for testing the machine effect is the same as the F-statistic in (c): in both cases machine is tested against MS(operator(machine)).

(e) Attach your SAS code here.
data Q1;
  input operator machine y;
  datalines;
  [data lines omitted; see machine.csv]
;
run;
proc mixed method=type1 data=Q1;
  class operator machine;
  model y = machine;
  random operator(machine);

run;

2. A nickel-titanium alloy is used to make components for jet turbine aircraft engines. Cracking is a potentially serious problem, as it can lead to non-recoverable failure. A test is run at the parts producer to determine the effects of four factors on cracks. The four factors are pouring temperature (A), titanium content (B), heat treatment method (C), and the amount of grain refiner used (D). Each factor contains two levels and 16 runs are performed. Two operators need to take care of these 16 runs. There might be some variation between the two operators.

(a) Help them divide the workload equally by filling in the Operator column of the table.
Assign the operators by confounding the operator (block) effect with the four-factor interaction ABCD: one operator takes the 8 runs with ABCD = +1 and the other takes the 8 runs with ABCD = −1. [The filled table of ± levels did not survive transcription; the assignment corresponds to block = A*B*C*D in the SAS code below.]
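The confounding assignment can be sketched in a few lines of Python. This is a sketch rather than part of the exam solution: the run order and the labels "operator 1/2" are assumptions; only the ABCD-based split is taken from the SAS code's block = A*B*C*D.

```python
from itertools import product

# Build the 16-run 2^4 design in standard (Yates) order, A varying fastest,
# and assign operators by the sign of the ABCD interaction, so the operator
# "block" is confounded only with the highest-order interaction ABCD.
runs = [(a, b, c, d) for d, c, b, a in product((-1, 1), repeat=4)]
assignment = [1 if a * b * c * d == 1 else 2 for (a, b, c, d) in runs]

# Each operator handles exactly 8 runs, so the workload is split equally.
print(assignment.count(1), assignment.count(2))  # → 8 8
```

Because ABCD = +1 on exactly half of the 16 sign combinations, each operator receives 8 runs, and only the (usually negligible) four-factor interaction is sacrificed to the blocking.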

(b) Assume the response measurements in the above table (from top to bottom) are 25, 71, 48, 45, 68, 40, 60, 65, 43, 80, 25, 104, 55, 86, 70, 76 and the dataset is given in aircraft.csv. Use SAS code to estimate the factor effects. Which factor effects appear to be large? Is there a large variation between the two operators?
It seems that A, C, and D and the interactions AC and AD have large effects, as does the operator factor. Sorted effect estimates (_NAME_, COL1, effect; numeric values omitted), from smallest to largest:
operator, AC, BCD, ACD, CD, BD, AB, ABC, BC, B, ABD, C, D, AD, A

(c) Conduct an analysis of variance to verify the conclusion of (b).
The ANOVA result below shows that the factors A and D and the interactions AC and AD are significant, as is the operator.

Type III ANOVA (Source, DF, Type III SS, Mean Square, F Value, Pr > F) for A, C, D, AC, AD, and operator; several p-values print as < .0001 [remaining numeric output omitted].

(d) Attach your SAS code here.
data Q2;
  input A B C D operator y;
  datalines;
  [data lines omitted; see aircraft.csv]
;
run;
data inter;
  set Q2;
  AB=A*B; AC=A*C; AD=A*D; BC=B*C; BD=B*D; CD=C*D;
  ABC=AB*C; ABD=AB*D; ACD=AC*D; BCD=BC*D;
  block=ABC*D;  /* block = ABCD, the operator assignment */
proc reg outest=effects data=inter;
  model y=A B C D AB AC AD BC BD CD ABC ABD ACD BCD block;

run;
data effect2;
  set effects;
  drop y Intercept _RMSE_;
run;
proc transpose data=effect2 out=effect3;
data effect4;
  set effect3;
  effect=col1*2;
proc sort data=effect4;
  by effect;
proc print data=effect4;
run;
data effect5;
  set effect4;
  where _NAME_^='block';
proc print data=effect5;
run;
proc rank data=effect5 normal=blom;
  var effect;
  ranks neff;
run;
proc gplot;
  plot effect*neff=_NAME_;
run;
proc glm data=inter;
  class A C D AC AD;
  model y=A C D AC AD;
run;
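For readers without SAS, the contrast arithmetic behind these effect estimates can be sketched in Python. The run order below is an assumption (standard Yates order with A varying fastest), since the design table did not survive transcription; under that assumption the estimates reproduce the solution's list of large effects.

```python
import numpy as np

# Responses as listed in part (b), assumed to be in standard (Yates) order.
y = np.array([25, 71, 48, 45, 68, 40, 60, 65,
              43, 80, 25, 104, 55, 86, 70, 76], float)

idx = np.arange(16)
A = np.where(idx % 2 == 1, 1, -1)         # A: - + - + ...
B = np.where((idx // 2) % 2 == 1, 1, -1)  # B: - - + + ...
C = np.where((idx // 4) % 2 == 1, 1, -1)
D = np.where((idx // 8) % 2 == 1, 1, -1)

def effect(contrast):
    # In a 2^4 design, effect = contrast'y / (N/2) = contrast'y / 8.
    return float(contrast @ y) / 8

print({k: effect(v) for k, v in
       {"A": A, "C": C, "D": D, "AC": A * C, "AD": A * D}.items()})
```

Under this assumed ordering the estimates are A = 21.625, D = 14.625, AD = 16.625, AC = −18.125, and C = 9.875, consistent with A, AD, and AC standing out in the sorted effect list above.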

3. A study was carried out to compare the writing lifetime of four premium brands of pens. It was thought that the writing surface would affect lifetime, so three different surfaces were used; the data, with surface and brand averages, are given in pen.csv.

(a) What design is this?
Randomized complete block design (RCBD): the surfaces represent a known source of variation, so they serve as the block factor.

(b) State the statistical model with assumptions.
y_ij = μ + τ_i + β_j + ε_ij, with Σ_i τ_i = 0, Σ_j β_j = 0, and ε_ij ~ N(0, σ²) independent.

(c) How would you check whether there exists any significant interaction between the surfaces and brands of pens? State your hypothesis in mathematical notation.
Use Tukey's one-degree-of-freedom test for nonadditivity, fitting
y_ij = μ + τ_i + β_j + γ τ_i β_j + ε_ij
and testing H0: γ = 0 vs. H1: γ ≠ 0.

(d) Analyze the data using the given dataset pen.csv.
Tukey's one-degree-of-freedom test shows that the interaction is not significant: in the Type III SS table for surface, brand, and the added regressor q, the p-value for q is nonsignificant [numeric output omitted]. So use the additive model y_ij = μ + τ_i + β_j + ε_ij. The Type III SS ANOVA table shows that both surface and brand have significant effects.

Type III ANOVA for the additive model (Source, DF, Type III SS, Mean Square, F Value, Pr > F) for surface and brand [numeric output omitted].

Check model adequacy: the residual plot and QQ-plot show no unusual pattern. Tests for normality (Shapiro-Wilk W, Kolmogorov-Smirnov D, Cramér-von Mises W-Sq, Anderson-Darling A-Sq) likewise raise no concerns [statistics and p-values omitted].

(e) Attach your SAS code.
data Q3;
  input surface brand lifetime;
  datalines;

  [data lines omitted; see pen.csv]
;
proc glm data=Q3;
  class surface brand;
  model lifetime=surface brand;
  output out=diag r=res p=pred;
run;
data two;
  set diag;
  q=pred*pred;
proc glm data=two;
  class surface brand;
  model lifetime=surface brand q/ss3;
run;
proc sgplot data=diag;
  scatter x=pred y=res;
  refline 0;
run;
proc univariate data=diag normal;
  var res;
  qqplot res/normal (L=1 mu=est sigma=est);
run;

(f) Assume that in this study 3 observations were collected for each combination and each value in the above table was the average of 3 replicates. A two-way ANOVA model with interaction is fitted and the MSE is given. Complete the following ANOVA table and draw conclusions.

Source       DF   SS        MS        F-value    P-value
Brand         3   25938     8646      [omitted]  <0.001
Surface       2   5269.5    2634.75   [omitted]  <0.01
Interaction   6   [omitted] [omitted] [omitted]  >0.1
Error        24

Both brand and surface have a significant effect on the lifetime, but their interaction does not.

Effect estimates (signs recovered from the sum-to-zero constraints where possible; several interaction values and signs were lost in transcription):
μ = 692
τ1 = 46, τ2 = −10, τ3 = −21, τ4 = −15
β1 = 13.25, β2 = 2.75, β3 = −16
(τβ)11 = 9.25, (τβ)12 = [lost], (τβ)13 = 2, (τβ)21 = 9.75, (τβ)22 = 8.75, (τβ)23 = 1, (τβ)31 = [lost], (τβ)32 = 1.75, (τβ)33 = 10, (τβ)41 = [lost], (τβ)42 = 0.75, (τβ)43 = 13
SS_brand = 3·3·(46² + 10² + 21² + 15²) = 25,938
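The SS_brand arithmetic above, and the SS_surface value computed next from the surface effects, can be checked in a few lines. This is a sketch: the signs of the effect estimates are taken from the sum-to-zero constraints, an assumption since the transcription lost them (the sums of squares depend only on the magnitudes).

```python
# a = 4 brands, b = 3 surfaces, n = 3 replicates per cell.
a, b, n = 4, 3, 3
brand_effects = [46, -10, -21, -15]     # signs chosen so they sum to zero
surface_effects = [13.25, 2.75, -16.0]  # likewise sum to zero

SS_brand = b * n * sum(t**2 for t in brand_effects)
SS_surface = a * n * sum(s**2 for s in surface_effects)
print(SS_brand, SS_surface)  # → 25938 5269.5
```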

SS_surface = 4·3·(13.25² + 2.75² + 16²) = 5,269.5
SS_brand×surface = 3·(9.25² + [lost]² + 2² + 9.75² + 8.75² + [lost]² + 1.75² + 10² + [lost]² + 0.75² + 13²) = [lost]

Or use the totals instead: if the table entries are average lifetimes, the cell totals are recovered by multiplying by 3, i.e., y_ij· = 3·ȳ_ij·, and similarly the listed averages recover the marginal totals. Then with a = 4, b = 3, n = 3:
SS_brand = Σ_i y_i··²/(bn) − y···²/(abn) = 25,938
SS_surface = Σ_j y_·j·²/(an) − y···²/(abn)
SS_brand×surface = Σ_ij y_ij·²/n − y···²/(abn) − SS_brand − SS_surface
SS_E = MSE·ab(n − 1)
[most intermediate totals were lost in transcription]

4. In a study of carbohydrate uptake (Y) as a function of other factors in male diabetics, observations were taken on Y, Age (x1), Weight (x2), and Dietary Protein (x3).

Analyze these data to determine which (if any) of the predictor variables (including any appropriate interactions) appear to significantly affect carbohydrate uptake. Throughout, set α = 0.05, but for simplicity do not employ any adjustments for multiplicity/multiple inferences when assessing the effects of the predictor variables. Remember to assess the quality of the fit via standard diagnostics. Attach supporting components of your computer code. Report your findings. The data are found in the file diet.csv.

To start, always plot the data! Sample R code:
diet.df = read.csv( file.choose() )
attach( diet.df )
Y = Y; X1 = Age; X2 = Weight; X3 = Dietary.Protein
pairs( cbind(Y,X1,X2,X3), pch=19 )

No disturbing patterns are seen in the scatterplot matrix. Now fit the model; sample R code follows (notice the use of centered predictors to properly accommodate the higher-order interaction terms):
x1 = Age - mean(Age); x2 = Weight - mean(Weight)
x3 = Dietary.Protein - mean(Dietary.Protein)
diet.lm = lm( Y ~ x1*x2*x3 )
anova( diet.lm )
This yields the following ANOVA table (output edited):

Analysis of Variance Table
Response: Y
Terms: x1, x2, x3, x1:x2, x1:x3, x2:x3, x1:x2:x3, Residuals, each with Df, Sum Sq, Mean Sq, F value, Pr(>F) [numeric output omitted]

From the ANOVA table, the sequential sums of squares, read from the bottom up (where they match the partial SS), show no significant interaction of any type (pointwise, at the 5% level). Formally, we test this via:
anova( lm(Y~x1+x2+x3), diet.lm )
producing
Model 1: Y ~ x1 + x2 + x3
Model 2: Y ~ x1 * x2 * x3
[Res.Df, RSS, Df, Sum of Sq, F, Pr(>F) output omitted]
The P-value for testing all four interactions satisfies P > 0.05 = α. Again, no interactions are significant. Move now to a reduced model with only main-effect terms (so return to the original, uncentered predictor variables):
dietrm.lm = lm( Y~Age+Weight+Dietary.Protein ); anova( dietrm.lm )
Analysis of Variance Table
Response: Y
Terms: Age, Weight, Dietary.Protein, Residuals [numeric output omitted]
Examining the partial SS shows that Protein is significant at the (pointwise) 5% level. To study the other terms we can either (i) rearrange the sequential order of the reduced-model ANOVA to isolate Weight and then Age, and test their partial SS contributions, or (ii) since each is a 1 d.f. test, just examine the t-tests assessing each β-coefficient pointwise. The latter approach is faster:
summary( dietrm.lm )

producing (output edited)
Call: lm(formula = Y ~ Age + Weight + Dietary.Protein)
Coefficients: (Intercept), Age, Weight, Dietary.Protein, each with Estimate, Std. Error, t value, Pr(>|t|) [numeric output omitted]
Residual standard error on 16 degrees of freedom; Multiple R-squared: 0.515 [remaining summary values omitted]

We see Age is insignificant with P > 0.05, but Weight is significant with P < 0.05, each at the (pointwise) 5% level. Thus a final reduced model retains only Weight and Protein:
dietfinal.lm = lm( Y~Weight+Dietary.Protein ); summary( dietfinal.lm )
Call: lm(formula = Y ~ Weight + Dietary.Protein)
Coefficients: (Intercept), Weight, Dietary.Protein [numeric output omitted]; residual standard error on 17 degrees of freedom.

For diagnostic quality assessment:
(i) Check VIFs for multicollinearity between Weight and Dietary.Protein:
library( car )
vif( dietfinal.lm ); mean(vif( dietfinal.lm ))
Since both VIFs are below 10 and their mean is well below 6, no concerns with multicollinearity are evident.
(ii) Check the normal Q-Q plot; sample R code
qqnorm( resid(dietfinal.lm), main=NULL, pch=19 )
qqline( resid(dietfinal.lm) )
produces the following graphic (no substantive concerns are evidenced).

(iii) Studentized residual plot (with outlier screen): sample R code is
n = length(Y); p = length( coef(dietfinal.lm) )
tcrit = qt( 1-.5*(.05/n), n-p-1 )
plot( rstudent(dietfinal.lm) ~ fitted(dietfinal.lm), pch=19,
      ylim=c(-ceiling(tcrit),ceiling(tcrit)) )
abline( h=0 )
abline( h=tcrit, lty=2 ); abline( h=-tcrit, lty=2 )
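The quantities behind this screen (leverages and internally studentized residuals) can also be sketched in Python. This is a sketch on synthetic data: the diet.csv values are not reproduced here, and the design matrix below is hypothetical with the same n = 20 and p = 3 as the final model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
h = np.diag(H)                         # leverages h_ii
e = y - H @ y                          # residuals
s2 = e @ e / (n - p)                   # MSE
r = e / np.sqrt(s2 * (1 - h))          # internally studentized residuals

# The leverages sum to p (trace of the hat matrix), which is what makes
# 2p/n a natural "twice the average leverage" screening threshold.
print(round(h.sum(), 6), 2 * p / n)  # → 3.0 0.3
```

The check that the leverages sum to p underlies the 2p/n high-leverage rule applied to the diet data below.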

From the residual plot, no troublesome patterns are seen, and no outliers are observed to extend past the screening limits ±t(1 − 0.05/(2n); n − p − 1).
(iv) Influence measures: sample R code is
influence.measures( dietfinal.lm )
which produces output (edited) with columns dfb.1_, dfb.Wght, dfb.Dt.P, dffit, cov.r, cook.d, hat, and an "inf" flag marking rows with * [numeric output omitted].

We see observations at i = 4 and i = 12 are marked for further study: at i = 4 and i = 12 the hat matrix diagonals h_ii exceed 2p/n = 0.3, indicating high leverage at these points. At i = 4 the value of DFFITS exceeds 1 in absolute value, so this point also exhibits high influence. The Cook's distance D_i values are available as the sixth column of the influence.measures object, so we can check their associated F-probability values via
Di = influence.measures(dietfinal.lm)$infmat[,6]
which( pf(Di, df1=p, df2=n-p) > 0.5 )
the result of which is null. Thus no influence is seen on the Cook's distance metric. Lastly, no DFBETAS values for the Weight or Dietary.Protein coefficients exceed 1 in absolute value, so no influence is seen on that measure.

5. A large dataset of n = 1030 samples of concrete involving a total of p − 1 = 8 predictor variables was collected:
x1 = Age, x2 = Cement, x3 = Furnace Slag, x4 = Superplasticizer, x5 = Water, x6 = Fly ash, x7 = Coarse Aggregate, x8 = Fine Aggregate, along with Y = Compressive Strength.
The data appear in the file concrete.csv. Consider a multiple linear regression (MLR) model for these data, with E[Y] = β0 + β1X1 + β2X2 + β3X3 + β4X4 + β5X5 + β6X6 + β7X7 + β8X8. Conduct a variable selection search to identify a possible reduced model among the eight predictor variables with this data set. Employ backward elimination and take minimum BIC as your selection criterion. Attach supporting components of your computer code. What is the recommended set of variables for further study?

Begin by loading the data and creating the X variables (the response variable Y is Compressive Strength):

concrete.df = read.csv(file.choose())
attach( concrete.df )
Y = Y
x1 = age
x2 = cement
x3 = slag
x4 = superplasticizer
x5 = water
x6 = fly.ash
x7 = coarse.aggregate
x8 = fine.aggregate

Always plot the data! The command
pairs( concrete.df )
produces a scatterplot matrix, in which a number of interesting patterns appear. None, however, is grossly disturbing at face value.

Next, build the regression fit and apply backward elimination with BIC control:
library( leaps )
cement.lm = lm( Y ~ x1+x2+x3+x4+x5+x6+x7+x8 )
n = length(Y)
step( cement.lm, direction="backward", k=log(n) )  #BIC
This produces [output edited -- note that R labels the criterion "AIC" in the step() output, but the choice k=log(n) makes it the BIC]:
Start: AIC=

Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
(candidate single-term deletions with Df, Sum of Sq, RSS, AIC; the best deletion is x8 [numeric output omitted])

Step:
Y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7
(the best deletion is now x7 [numeric output omitted])

Step:
Y ~ x1 + x2 + x3 + x4 + x5 + x6
(<none> is now the minimum-BIC choice, so elimination stops)

Call: lm(formula = Y ~ x1 + x2 + x3 + x4 + x5 + x6)
Coefficients: (Intercept), x1, x2, x3, x4, x5, x6 [numeric values omitted]

We see that after two backward steps, a reduced model with only the first six predictors
x1 = age, x2 = cement, x3 = slag, x4 = superplasticizer, x5 = water, x6 = fly.ash
is recommended for further study.
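The backward-elimination-by-BIC logic that step() performs can be sketched without R. This is a sketch on synthetic data: the data, the number of candidate predictors, and the true coefficients below are hypothetical, and the BIC is the usual Gaussian-likelihood version up to additive constants.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))
# Only columns 0 and 1 truly matter in this toy example.
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

def bic(cols):
    # BIC up to constants: n*log(RSS/n) + k*log(n), k = #fitted coefficients.
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = float(np.sum((y - Z @ beta) ** 2))
    return n * np.log(rss / n) + Z.shape[1] * np.log(n)

active = [0, 1, 2, 3]
while active:
    # Try every single-term deletion and keep the best one, if it helps.
    drops = [(bic([c for c in active if c != j]), j) for j in active]
    best_bic, best_j = min(drops)
    if best_bic < bic(active):
        active.remove(best_j)
    else:
        break
print(sorted(active))
```

At each pass the single deletion that lowers BIC the most is taken; elimination stops when no deletion improves BIC, mirroring the step() output where <none> rises to the top of the table.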

6. Consider the simple linear regression model: Yi ~ indep. N(β0 + β1Xi, σ²), i = 1, ..., n, where in particular it is known that β0 = 1, so that E[Y] = 1 + β1X. Suppose interest exists in estimating the X value at which E[Y] = 0. Let this target parameter be ξ.

a) Find ξ as a function of β1. Also find the maximum likelihood estimator for ξ. Call this ξ̂.
This is essentially a one-parameter inverse regression problem. We have E[Y] = 1 + β1X. Clearly, at E[Y] = 0 we have 0 = 1 + β1X, so solving for X produces ξ = −1/β1. To find the MLE ξ̂, appeal to ML invariance and first find β̂1. The fastest way to do so is to recognize that if E[Y] = 1 + β1X, then E[Y − 1] = β1X. That is, we essentially regress the new response variable (Yi − 1) against Xi through the origin! Referring to the various equations in Sec. 4.4 of Kutner et al., we find that the least squares estimator for β1 is β̂1 = Σi Xi(Yi − 1) / Σi Xi². Under the homogeneous-variance, normal-parent assumption here, this estimator is identical to the MLE, so take ξ̂ = −1/β̂1 = −Σi Xi² / Σi Xi(Yi − 1).

b) Recall from Casella & Berger that the Delta Method can be used to determine the asymptotic features of a function of random variables. In particular, for a random variable U and a differentiable function g(u), where E[U] = θ, a first-order approximation to E[g(U)] is
E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ}E(U − θ).
Use this to find a first-order approximation for E[ξ̂].
Let g(β1) = ξ = −1/β1. We know that the MLE for β1 is unbiased, so that E[β̂1] = β1. Then from the Delta Method we see
E[ξ̂] = E[−1/β̂1] ≈ g(β1) + {∂g(β1)/∂β1}E(β̂1 − β1) = −1/β1 + {∂g(β1)/∂β1}(0) = −1/β1 = ξ.

c) In part (b), a second-order approximation to E[g(U)] is also available from Casella & Berger's book:
E[g(U)] ≈ g(θ) + {∂g(θ)/∂θ}E(U − θ) + ½{∂²g(θ)/∂θ²}E[(U − θ)²].
Use this to find a second-order approximation for E[ξ̂].
Again, let g(β1) = ξ = −1/β1. We know that the MLE for β1 is unbiased, so that E[β̂1] = β1. Thus for the second-order Delta Method approximation we have E(β̂1 − β1) = 0 and E[(β̂1 − β1)²] = Var[β̂1]. This latter quantity is
Var[β̂1] = Var[Σi Xi(Yi − 1)/Σj Xj²] = Var[Σi Xi(Yi − 1)]/(Σj Xj²)²

= Σi Xi² Var[Yi − 1]/(Σj Xj²)² = Σi Xi² Var[Yi]/(Σj Xj²)² = σ² Σi Xi²/(Σj Xj²)² = σ²/Σj Xj².

Collecting all this together, and using ∂²g(β1)/∂β1² = −2/β1³, yields
E[ξ̂] = E[−1/β̂1] ≈ g(β1) + {∂g(β1)/∂β1}(0) + ½{∂²g(β1)/∂β1²}Var[β̂1]
= −1/β1 − σ²/(β1³ Σj Xj²)
= ξ + ξ³σ²/Σj Xj².
(We see that to second order, a bias exists in the point estimator. However, it can in fact be shown that E[ξ̂] does not exist, as E[|ξ̂|] diverges. Thus, one must always be careful with these sorts of approximate expansions.)
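The closed form for ξ̂ in part (a) can be sanity-checked numerically. This is a sketch with synthetic data: β1 = 2 is an assumed true value (so ξ = −1/2), and the design points are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(1, 5, 25)
y = 1 + 2 * x + 0.1 * rng.normal(size=25)  # beta0 = 1 known, beta1 = 2

# Regress (y - 1) on x through the origin: beta1_hat = sum(x*(y-1))/sum(x^2),
# and by ML invariance xi_hat = -1/beta1_hat = -sum(x^2)/sum(x*(y-1)).
beta1_hat = np.sum(x * (y - 1)) / np.sum(x ** 2)
xi_hat = -1 / beta1_hat

# The two algebraic forms of xi_hat agree.
assert np.isclose(xi_hat, -np.sum(x ** 2) / np.sum(x * (y - 1)))
print(beta1_hat, xi_hat)
```

With this low noise level the estimate lands very close to ξ = −0.5; the divergence of E[ξ̂] noted above only bites when β̂1 has appreciable probability mass near zero.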


More information

holding all other predictors constant

holding all other predictors constant Multiple Regression Numeric Response variable (y) p Numeric predictor variables (p < n) Model: Y = b 0 + b 1 x 1 + + b p x p + e Partial Regression Coefficients: b i effect (on the mean response) of increasing

More information

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model

Topic 17 - Single Factor Analysis of Variance. Outline. One-way ANOVA. The Data / Notation. One way ANOVA Cell means model Factor effects model Topic 17 - Single Factor Analysis of Variance - Fall 2013 One way ANOVA Cell means model Factor effects model Outline Topic 17 2 One-way ANOVA Response variable Y is continuous Explanatory variable is

More information

STAT 571A Advanced Statistical Regression Analysis. Chapter 8 NOTES Quantitative and Qualitative Predictors for MLR

STAT 571A Advanced Statistical Regression Analysis. Chapter 8 NOTES Quantitative and Qualitative Predictors for MLR STAT 571A Advanced Statistical Regression Analysis Chapter 8 NOTES Quantitative and Qualitative Predictors for MLR 2015 University of Arizona Statistics GIDP. All rights reserved, except where previous

More information

Booklet of Code and Output for STAC32 Final Exam

Booklet of Code and Output for STAC32 Final Exam Booklet of Code and Output for STAC32 Final Exam December 7, 2017 Figure captions are below the Figures they refer to. LowCalorie LowFat LowCarbo Control 8 2 3 2 9 4 5 2 6 3 4-1 7 5 2 0 3 1 3 3 Figure

More information

General Linear Model (Chapter 4)

General Linear Model (Chapter 4) General Linear Model (Chapter 4) Outcome variable is considered continuous Simple linear regression Scatterplots OLS is BLUE under basic assumptions MSE estimates residual variance testing regression coefficients

More information

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression

Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression BSTT523: Kutner et al., Chapter 1 1 Chapter 1: Linear Regression with One Predictor Variable also known as: Simple Linear Regression Bivariate Linear Regression Introduction: Functional relation between

More information

No other aids are allowed. For example you are not allowed to have any other textbook or past exams.

No other aids are allowed. For example you are not allowed to have any other textbook or past exams. UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences Sample Exam Note: This is one of our past exams, In fact the only past exam with R. Before that we were using SAS. In

More information

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c

SAS Procedures Inference about the Line ffl model statement in proc reg has many options ffl To construct confidence intervals use alpha=, clm, cli, c Inference About the Slope ffl As with all estimates, ^fi1 subject to sampling var ffl Because Y jx _ Normal, the estimate ^fi1 _ Normal A linear combination of indep Normals is Normal Simple Linear Regression

More information

STAT 3900/4950 MIDTERM TWO Name: Spring, 2015 (print: first last ) Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis

STAT 3900/4950 MIDTERM TWO Name: Spring, 2015 (print: first last ) Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis STAT 3900/4950 MIDTERM TWO Name: Spring, 205 (print: first last ) Covered topics: Two-way ANOVA, ANCOVA, SLR, MLR and correlation analysis Instructions: You may use your books, notes, and SPSS/SAS. NO

More information

Regression Diagnostics

Regression Diagnostics Diag 1 / 78 Regression Diagnostics Paul E. Johnson 1 2 1 Department of Political Science 2 Center for Research Methods and Data Analysis, University of Kansas 2015 Diag 2 / 78 Outline 1 Introduction 2

More information

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a

STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a STATISTICS 174: APPLIED STATISTICS TAKE-HOME FINAL EXAM POSTED ON WEBPAGE: 6:00 pm, DECEMBER 6, 2004 HAND IN BY: 6:00 pm, DECEMBER 7, 2004 This is a take-home exam. You are expected to work on it by yourself

More information

STATISTICS 110/201 PRACTICE FINAL EXAM

STATISTICS 110/201 PRACTICE FINAL EXAM STATISTICS 110/201 PRACTICE FINAL EXAM Questions 1 to 5: There is a downloadable Stata package that produces sequential sums of squares for regression. In other words, the SS is built up as each variable

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 15: Examples of hypothesis tests (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 32 The recipe 2 / 32 The hypothesis testing recipe In this lecture we repeatedly apply the

More information

Inference for Regression

Inference for Regression Inference for Regression Section 9.4 Cathy Poliak, Ph.D. cathy@math.uh.edu Office in Fleming 11c Department of Mathematics University of Houston Lecture 13b - 3339 Cathy Poliak, Ph.D. cathy@math.uh.edu

More information

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3 Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3.1 through 3.3 Fall, 2013 Page 1 Tensile Strength Experiment Investigate the tensile strength of a new synthetic fiber. The factor is the

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from

More information

SAS Commands. General Plan. Output. Construct scatterplot / interaction plot. Run full model

SAS Commands. General Plan. Output. Construct scatterplot / interaction plot. Run full model Topic 23 - Unequal Replication Data Model Outline - Fall 2013 Parameter Estimates Inference Topic 23 2 Example Page 954 Data for Two Factor ANOVA Y is the response variable Factor A has levels i = 1, 2,...,

More information

CAS MA575 Linear Models

CAS MA575 Linear Models CAS MA575 Linear Models Boston University, Fall 2013 Midterm Exam (Correction) Instructor: Cedric Ginestet Date: 22 Oct 2013. Maximal Score: 200pts. Please Note: You will only be graded on work and answers

More information

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata'

Business Statistics. Tommaso Proietti. Linear Regression. DEF - Università di Roma 'Tor Vergata' Business Statistics Tommaso Proietti DEF - Università di Roma 'Tor Vergata' Linear Regression Specication Let Y be a univariate quantitative response variable. We model Y as follows: Y = f(x) + ε where

More information

Booklet of Code and Output for STAC32 Final Exam

Booklet of Code and Output for STAC32 Final Exam Booklet of Code and Output for STAC32 Final Exam December 8, 2014 List of Figures in this document by page: List of Figures 1 Popcorn data............................. 2 2 MDs by city, with normal quantile

More information

5.3 Three-Stage Nested Design Example

5.3 Three-Stage Nested Design Example 5.3 Three-Stage Nested Design Example A researcher designs an experiment to study the of a metal alloy. A three-stage nested design was conducted that included Two alloy chemistry compositions. Three ovens

More information

1 Introduction 1. 2 The Multiple Regression Model 1

1 Introduction 1. 2 The Multiple Regression Model 1 Multiple Linear Regression Contents 1 Introduction 1 2 The Multiple Regression Model 1 3 Setting Up a Multiple Regression Model 2 3.1 Introduction.............................. 2 3.2 Significance Tests

More information

14 Multiple Linear Regression

14 Multiple Linear Regression B.Sc./Cert./M.Sc. Qualif. - Statistics: Theory and Practice 14 Multiple Linear Regression 14.1 The multiple linear regression model In simple linear regression, the response variable y is expressed in

More information

STAT 350: Summer Semester Midterm 1: Solutions

STAT 350: Summer Semester Midterm 1: Solutions Name: Student Number: STAT 350: Summer Semester 2008 Midterm 1: Solutions 9 June 2008 Instructor: Richard Lockhart Instructions: This is an open book test. You may use notes, text, other books and a calculator.

More information

MATH 644: Regression Analysis Methods

MATH 644: Regression Analysis Methods MATH 644: Regression Analysis Methods FINAL EXAM Fall, 2012 INSTRUCTIONS TO STUDENTS: 1. This test contains SIX questions. It comprises ELEVEN printed pages. 2. Answer ALL questions for a total of 100

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models Lecture 3. Hypothesis testing. Goodness of Fit. Model diagnostics GLM (Spring, 2018) Lecture 3 1 / 34 Models Let M(X r ) be a model with design matrix X r (with r columns) r n

More information

Ch 2: Simple Linear Regression

Ch 2: Simple Linear Regression Ch 2: Simple Linear Regression 1. Simple Linear Regression Model A simple regression model with a single regressor x is y = β 0 + β 1 x + ɛ, where we assume that the error ɛ is independent random component

More information

Lecture 1: Linear Models and Applications

Lecture 1: Linear Models and Applications Lecture 1: Linear Models and Applications Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction to linear models Exploratory data analysis (EDA) Estimation

More information

Stat 5102 Final Exam May 14, 2015

Stat 5102 Final Exam May 14, 2015 Stat 5102 Final Exam May 14, 2015 Name Student ID The exam is closed book and closed notes. You may use three 8 1 11 2 sheets of paper with formulas, etc. You may also use the handouts on brand name distributions

More information

Regression Analysis for Data Containing Outliers and High Leverage Points

Regression Analysis for Data Containing Outliers and High Leverage Points Alabama Journal of Mathematics 39 (2015) ISSN 2373-0404 Regression Analysis for Data Containing Outliers and High Leverage Points Asim Kumer Dey Department of Mathematics Lamar University Md. Amir Hossain

More information

Chapter 1 Linear Regression with One Predictor

Chapter 1 Linear Regression with One Predictor STAT 525 FALL 2018 Chapter 1 Linear Regression with One Predictor Professor Min Zhang Goals of Regression Analysis Serve three purposes Describes an association between X and Y In some applications, the

More information

Chapter 5 Introduction to Factorial Designs Solutions

Chapter 5 Introduction to Factorial Designs Solutions Solutions from Montgomery, D. C. (1) Design and Analysis of Experiments, Wiley, NY Chapter 5 Introduction to Factorial Designs Solutions 5.1. The following output was obtained from a computer program that

More information

Lecture 10: Experiments with Random Effects

Lecture 10: Experiments with Random Effects Lecture 10: Experiments with Random Effects Montgomery, Chapter 13 1 Lecture 10 Page 1 Example 1 A textile company weaves a fabric on a large number of looms. It would like the looms to be homogeneous

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

Multiple Linear Regression. Chapter 12

Multiple Linear Regression. Chapter 12 13 Multiple Linear Regression Chapter 12 Multiple Regression Analysis Definition The multiple regression model equation is Y = b 0 + b 1 x 1 + b 2 x 2 +... + b p x p + ε where E(ε) = 0 and Var(ε) = s 2.

More information

Fractional Factorial Designs

Fractional Factorial Designs Fractional Factorial Designs ST 516 Each replicate of a 2 k design requires 2 k runs. E.g. 64 runs for k = 6, or 1024 runs for k = 10. When this is infeasible, we use a fraction of the runs. As a result,

More information

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model

Linear Regression. In this lecture we will study a particular type of regression model: the linear regression model 1 Linear Regression 2 Linear Regression In this lecture we will study a particular type of regression model: the linear regression model We will first consider the case of the model with one predictor

More information

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model

Lab 3 A Quick Introduction to Multiple Linear Regression Psychology The Multiple Linear Regression Model Lab 3 A Quick Introduction to Multiple Linear Regression Psychology 310 Instructions.Work through the lab, saving the output as you go. You will be submitting your assignment as an R Markdown document.

More information

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc

IES 612/STA 4-573/STA Winter 2008 Week 1--IES 612-STA STA doc IES 612/STA 4-573/STA 4-576 Winter 2008 Week 1--IES 612-STA 4-573-STA 4-576.doc Review Notes: [OL] = Ott & Longnecker Statistical Methods and Data Analysis, 5 th edition. [Handouts based on notes prepared

More information

Suppose we needed four batches of formaldehyde, and coulddoonly4runsperbatch. Thisisthena2 4 factorial in 2 2 blocks.

Suppose we needed four batches of formaldehyde, and coulddoonly4runsperbatch. Thisisthena2 4 factorial in 2 2 blocks. 58 2. 2 factorials in 2 blocks Suppose we needed four batches of formaldehyde, and coulddoonly4runsperbatch. Thisisthena2 4 factorial in 2 2 blocks. Some more algebra: If two effects are confounded with

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

Topic 20: Single Factor Analysis of Variance

Topic 20: Single Factor Analysis of Variance Topic 20: Single Factor Analysis of Variance Outline Single factor Analysis of Variance One set of treatments Cell means model Factor effects model Link to linear regression using indicator explanatory

More information

Stat 500 Midterm 2 12 November 2009 page 0 of 11

Stat 500 Midterm 2 12 November 2009 page 0 of 11 Stat 500 Midterm 2 12 November 2009 page 0 of 11 Please put your name on the back of your answer book. Do NOT put it on the front. Thanks. Do not start until I tell you to. The exam is closed book, closed

More information

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter.

1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. 1) Answer the following questions as true (T) or false (F) by circling the appropriate letter. T F T F T F a) Variance estimates should always be positive, but covariance estimates can be either positive

More information

STAT22200 Spring 2014 Chapter 14

STAT22200 Spring 2014 Chapter 14 STAT22200 Spring 2014 Chapter 14 Yibi Huang May 27, 2014 Chapter 14 Incomplete Block Designs 14.1 Balanced Incomplete Block Designs (BIBD) Chapter 14-1 Incomplete Block Designs A Brief Introduction to

More information

Overview Scatter Plot Example

Overview Scatter Plot Example Overview Topic 22 - Linear Regression and Correlation STAT 5 Professor Bruce Craig Consider one population but two variables For each sampling unit observe X and Y Assume linear relationship between variables

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Statistics for exp. medical researchers Regression and Correlation

Statistics for exp. medical researchers Regression and Correlation Faculty of Health Sciences Regression analysis Statistics for exp. medical researchers Regression and Correlation Lene Theil Skovgaard Sept. 28, 2015 Linear regression, Estimation and Testing Confidence

More information

SCHOOL OF MATHEMATICS AND STATISTICS

SCHOOL OF MATHEMATICS AND STATISTICS RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: Statistics Tables by H.R. Neave MAS5052 SCHOOL OF MATHEMATICS AND STATISTICS Basic Statistics Spring Semester

More information

Lecture 4. Random Effects in Completely Randomized Design

Lecture 4. Random Effects in Completely Randomized Design Lecture 4. Random Effects in Completely Randomized Design Montgomery: 3.9, 13.1 and 13.7 1 Lecture 4 Page 1 Random Effects vs Fixed Effects Consider factor with numerous possible levels Want to draw inference

More information

Chapter 6 The 2 k Factorial Design Solutions

Chapter 6 The 2 k Factorial Design Solutions Solutions from Montgomery, D. C. () Design and Analysis of Experiments, Wiley, NY Chapter 6 The k Factorial Design Solutions 6.. An engineer is interested in the effects of cutting speed (A), tool geometry

More information

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA

Example: Poisondata. 22s:152 Applied Linear Regression. Chapter 8: ANOVA s:5 Applied Linear Regression Chapter 8: ANOVA Two-way ANOVA Used to compare populations means when the populations are classified by two factors (or categorical variables) For example sex and occupation

More information

Comparison of a Population Means

Comparison of a Population Means Analysis of Variance Interested in comparing Several treatments Several levels of one treatment Comparison of a Population Means Could do numerous two-sample t-tests but... ANOVA provides method of joint

More information

Institutionen för matematik och matematisk statistik Umeå universitet November 7, Inlämningsuppgift 3. Mariam Shirdel

Institutionen för matematik och matematisk statistik Umeå universitet November 7, Inlämningsuppgift 3. Mariam Shirdel Institutionen för matematik och matematisk statistik Umeå universitet November 7, 2011 Inlämningsuppgift 3 Mariam Shirdel (mash0007@student.umu.se) Kvalitetsteknik och försöksplanering, 7.5 hp 1 Uppgift

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Chapter 10 Building the Regression Model II: Diagnostics

Chapter 10 Building the Regression Model II: Diagnostics Chapter 10 Building the Regression Model II: Diagnostics 許湘伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 41 10.1 Model Adequacy for a Predictor Variable-Added

More information

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3

Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3 Lecture 3. Experiments with a Single Factor: ANOVA Montgomery 3-1 through 3-3 Page 1 Tensile Strength Experiment Investigate the tensile strength of a new synthetic fiber. The factor is the weight percent

More information

Regression Model Building

Regression Model Building Regression Model Building Setting: Possibly a large set of predictor variables (including interactions). Goal: Fit a parsimonious model that explains variation in Y with a small set of predictors Automated

More information

Introduction to Design and Analysis of Experiments with the SAS System (Stat 7010 Lecture Notes)

Introduction to Design and Analysis of Experiments with the SAS System (Stat 7010 Lecture Notes) Introduction to Design and Analysis of Experiments with the SAS System (Stat 7010 Lecture Notes) Asheber Abebe Discrete and Statistical Sciences Auburn University Contents 1 Completely Randomized Design

More information

Lecture 4. Checking Model Adequacy

Lecture 4. Checking Model Adequacy Lecture 4. Checking Model Adequacy Montgomery: 3-4, 15-1.1 Page 1 Model Checking and Diagnostics Model Assumptions 1 Model is correct 2 Independent observations 3 Errors normally distributed 4 Constant

More information

Week 7 Multiple factors. Ch , Some miscellaneous parts

Week 7 Multiple factors. Ch , Some miscellaneous parts Week 7 Multiple factors Ch. 18-19, Some miscellaneous parts Multiple Factors Most experiments will involve multiple factors, some of which will be nuisance variables Dealing with these factors requires

More information

Lecture 1 Linear Regression with One Predictor Variable.p2

Lecture 1 Linear Regression with One Predictor Variable.p2 Lecture Linear Regression with One Predictor Variablep - Basics - Meaning of regression parameters p - β - the slope of the regression line -it indicates the change in mean of the probability distn of

More information

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation.

Simple Regression Model Setup Estimation Inference Prediction. Model Diagnostic. Multiple Regression. Model Setup and Estimation. Statistical Computation Math 475 Jimin Ding Department of Mathematics Washington University in St. Louis www.math.wustl.edu/ jmding/math475/index.html October 10, 2013 Ridge Part IV October 10, 2013 1

More information

Diagnostics and Transformations Part 2

Diagnostics and Transformations Part 2 Diagnostics and Transformations Part 2 Bivariate Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University Multilevel Regression Modeling, 2009 Diagnostics

More information

EXST 7015 Fall 2014 Lab 08: Polynomial Regression

EXST 7015 Fall 2014 Lab 08: Polynomial Regression EXST 7015 Fall 2014 Lab 08: Polynomial Regression OBJECTIVES Polynomial regression is a statistical modeling technique to fit the curvilinear data that either shows a maximum or a minimum in the curve,

More information

SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester

SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester RESTRICTED OPEN BOOK EXAMINATION (Not to be removed from the examination hall) Data provided: "Statistics Tables" by H.R. Neave PAS 371 SCHOOL OF MATHEMATICS AND STATISTICS Autumn Semester 2008 9 Linear

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club

BE640 Intermediate Biostatistics 2. Regression and Correlation. Simple Linear Regression Software: SAS. Emergency Calls to the New York Auto Club BE640 Intermediate Biostatistics 2. Regression and Correlation Simple Linear Regression Software: SAS Emergency Calls to the New York Auto Club Source: Chatterjee, S; Handcock MS and Simonoff JS A Casebook

More information

One-way ANOVA Model Assumptions

One-way ANOVA Model Assumptions One-way ANOVA Model Assumptions STAT:5201 Week 4: Lecture 1 1 / 31 One-way ANOVA: Model Assumptions Consider the single factor model: Y ij = µ + α }{{} i ij iid with ɛ ij N(0, σ 2 ) mean structure random

More information