2017 Financial Mathematics Orientation - Statistics


2017 Financial Mathematics Orientation – Statistics
Written by Long Wang, Edited by Joshua Agterberg
August 21, 2018

Contents

1 Preliminaries
  1.1 Samples and Population
  1.2 Distributions
  1.3 Central limit theorem (CLT)
  1.4 Main Ideas
2 One-sample
  2.1 Large-sample population mean: Estimation; Testing; Duality between confidence interval and testing; Choosing sample size
  2.2 Small-sample population mean: Estimation; Testing
  2.3 Population variance: Estimation; Testing
  2.4 Large-sample population proportion: Estimation; Testing
3 Two-sample
  3.1 Large-sample two population means: Estimation; Testing
  3.2 Small-sample two population means (identical variance): Practical rule; Estimation; Testing
  3.3 Small-sample two population means (non-identical variance): Practical rule; Estimation; Testing
  3.4 Small-sample two population means (matched-pair): Estimation; Testing
  3.5 Two population variances: Estimation; Testing
  3.6 Large-sample two population proportions: Estimation; Testing (identical proportion hypothesis); Testing (non-identical proportion hypothesis)
4 General distribution
  4.1 Estimation: Mean-variance trade-off; Method of moments (MOM); Maximum likelihood estimator (MLE); Asymptotic normality for MLE; Bootstrap; Bayesian
  4.2 Testing: Likelihood ratio test; Generalized likelihood ratio test
5 ANOVA
  5.1 One-way ANOVA: Model; Testing; Estimation
  5.2 Randomized block design: Model; Testing; Estimation
  5.3 Two-way ANOVA: Model; Testing; Estimation
6 Linear regression
  6.1 Simple linear regression: Model; Testing; Estimation; T-test; Coefficient of determination
  6.2 Variable Selection: Measures of Fit; Stepwise Regression; Ridge Regression; Lasso Regression
7 Other important concepts to know: Cross Validation; Dimensionality Reduction; Classification; Clustering; Non-parametric methods; Model selection

1 Preliminaries

1.1 Samples and Population

A sample is a randomly selected (i.i.d.) subset of the population/distribution. From a sample, we want to learn as much as we can about the properties/parameters of the population/distribution.

1.2 Distributions

Bernoulli: X ~ Ber(p), P(X = 1) = p
Binomial: X ~ Bin(n, p), P(X = k) = (n choose k) p^k (1 − p)^(n−k); Bin(n, p) = Σᵢ₌₁ⁿ Ber(p)
Normal: X ~ N(µ, σ²), f(x) = (1/(√(2π)σ)) exp(−(x − µ)²/(2σ²)); Bin(n, p) ≈ N(np, npq); (N(µ, σ²) − µ)/σ = N(0, 1)
Chi-square: X ~ χ²_ν, χ²_ν = Σᵢ₌₁^ν N(0, 1)²
t-distribution: X ~ t_ν, t_ν = N(0, 1)/√(χ²_ν/ν); t_∞ = N(0, 1); t₁ = Cauchy (undefined mean and variance)
F-distribution: X ~ F_{n,m}, F_{n,m} = (χ²_n/n)/(χ²_m/m); t²_ν = F_{1,ν}
Others: Exponential, Poisson, Gamma, Beta, Negative Binomial

1.3 Central limit theorem (CLT)

If X₁, ..., X_n are i.i.d. random variables with mean µ and variance σ² < ∞, then (X̄ − µ)/(σ/√n) → N(0, 1).
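A quick simulation (not part of the original notes; pure standard-library Python, variable names my own) illustrates the CLT: standardized means of Exp(1) samples behave approximately like N(0, 1) even though the underlying distribution is skewed.

```python
import random
import statistics

random.seed(42)
n, reps = 200, 2000
mu, sigma = 1.0, 1.0  # Exp(1) has mean 1 and variance 1

# Standardize each sample mean: (X-bar - mu)/(sigma/sqrt(n)) should be roughly N(0, 1)
zs = []
for _ in range(reps):
    xbar = statistics.fmean(random.expovariate(1.0) for _ in range(n))
    zs.append((xbar - mu) / (sigma / n ** 0.5))

print(round(statistics.fmean(zs), 3), round(statistics.stdev(zs), 3))
```

The empirical mean of the standardized values should be near 0 and their standard deviation near 1.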

1.4 Main Ideas

Assuming there is a true parameter θ (e.g. the mean, the variance), can we use our data to discover the true distribution? (This is the frequentist approach.) We can estimate θ, perform a hypothesis test, or find a confidence interval for the true parameter.

Always pay attention to assumptions! In many cases the assumptions do not hold, but they make our lives easier. We often assume a distribution and then perform statistical inference based on it. We typically write X₁, ..., X_n i.i.d. ~ F_θ, where F is completely determined by θ (e.g. N(µ, σ²) with θ = (µ, σ)); this is the parametric setting. If F is not completely determined by θ, we are typically in a non-parametric setting. (There is also a semi-parametric setting, which we ignore for now.)

2 One-sample

Here we assume we have one sample of i.i.d. data. We often assume the data are normal, or approximately normal by the CLT.

2.1 Large-sample population mean

2.1.1 Estimation

Data: X_i ~ N(µ, σ²)
Estimator: µ̂ = X̄
Distribution: (X̄ − µ)/(σ/√n) ~ N(0, 1) (why?)
C.I.: X̄ ± z(α/2)·σ/√n, or X̄ + z(α)·σ/√n (one-sided), or X̄ − z(α)·σ/√n (one-sided)
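The large-sample C.I. above needs only the standard library; a minimal sketch (the function name `z_confidence_interval` is mine, not from the notes):

```python
from statistics import NormalDist, fmean

def z_confidence_interval(data, sigma, alpha=0.05):
    """Large-sample CI for the mean with known sigma: X-bar +/- z(alpha/2)*sigma/sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z(alpha/2), about 1.96 for alpha = 0.05
    half = z * sigma / len(data) ** 0.5
    xbar = fmean(data)
    return xbar - half, xbar + half
```

For n = 100 observations and σ = 1 the half-width is z(α/2)/10 ≈ 0.196.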

2.1.2 Testing

Hypothesis: H₀: µ = µ₀ vs. H_A: µ ≠ µ₀ (or H_A: µ < µ₀ or H_A: µ > µ₀)
Test statistic: Z = (X̄ − µ₀)/(σ/√n) ~ N(0, 1)
Rejection region: |Z| > z(α/2), or Z < −z(α), or Z > z(α)
Power against H_A: µ = µ_A: Power = P(Reject H₀ | µ = µ_A)
Type-I error (α) and Type-II error (β)
p-value: 2P(Z > |z|), or P(Z < z), or P(Z > z)

2.1.3 Duality between confidence interval and testing

µ₀ ∈ C.I. ⇔ Accept H₀: µ = µ₀

2.1.4 Choosing sample size

Let B be an upper bound on the error of estimate; the width of the confidence interval equals 2B:
B ≥ z(α/2)·σ/√n ⇔ n ≥ z(α/2)²·σ²/B²

2.2 Small-sample population mean

2.2.1 Estimation

Data: X_i ~ N(µ, σ²)
Estimator: µ̂ = X̄
Distribution: (X̄ − µ)/(s/√n) ~ t_{n−1} (proof: (X̄ − µ)/(s/√n) = [(X̄ − µ)/(σ/√n)] / √(s²/σ²))
C.I.: X̄ ± t_{n−1}(α/2)·s/√n

2.2.2 Testing

Hypothesis: H₀: µ = µ₀ vs. H_A: µ ≠ µ₀
Test statistic: T = (X̄ − µ₀)/(s/√n) ~ t_{n−1}
Rejection region: |T| > t_{n−1}(α/2)
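The sample-size bound n ≥ z(α/2)²σ²/B² is easy to evaluate; a small helper (name `required_sample_size` is mine):

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(sigma, B, alpha=0.05):
    """Smallest n with z(alpha/2)*sigma/sqrt(n) <= B, i.e. n >= z(alpha/2)^2*sigma^2/B^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z * sigma / B) ** 2)
```

For example, σ = 1 and B = 0.1 at α = 0.05 gives n = 385.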

2.3 Population variance

2.3.1 Estimation

Data: X_i ~ N(µ, σ²)
Estimator: σ̂² = s², with E[s²] = σ²
Distribution: (n − 1)s²/σ² ~ χ²_{n−1} (proof: (n − 1)s²/σ² + (X̄ − µ)²/(σ²/n) = Σᵢ₌₁ⁿ (X_i − µ)²/σ²)
C.I.: ((n − 1)s²/χ²_{n−1}(α/2), (n − 1)s²/χ²_{n−1}(1 − α/2))

2.3.2 Testing

Hypothesis: H₀: σ² = σ₀² vs. H_A: σ² ≠ σ₀²
Test statistic: X² = (n − 1)s²/σ₀² ~ χ²_{n−1}
Rejection region: X² > χ²_{n−1}(α/2) or X² < χ²_{n−1}(1 − α/2)

2.4 Large-sample population proportion

2.4.1 Estimation

Data: X_i ~ Ber(p)
Estimator: p̂ = X̄
Distribution: (p̂ − p)/√(p(1 − p)/n) ~ N(0, 1)
C.I.: p̂ ± z(α/2)·√(p̂(1 − p̂)/n)

2.4.2 Testing

Hypothesis: H₀: p = p₀ vs. H_A: p ≠ p₀
Test statistic: Z = (p̂ − p₀)/√(p₀(1 − p₀)/n) ~ N(0, 1)
Rejection region: |Z| > z(α/2)
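The proportion test is a one-liner once the statistic is written out; a sketch (function name `proportion_z_test` is mine):

```python
from statistics import NormalDist

def proportion_z_test(successes, n, p0):
    """Large-sample two-sided test of H0: p = p0 via Z = (phat - p0)/sqrt(p0(1-p0)/n)."""
    phat = successes / n
    z = (phat - p0) / (p0 * (1 - p0) / n) ** 0.5
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

With 60 successes in 100 trials against p₀ = 0.5, Z = 2 exactly and the p-value is about 0.046, so H₀ is rejected at α = 0.05.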

3 Two-sample

Note: we can come back to this if there is more interest in other sections. The setup is very similar to Section 2, only now we have two samples and want to perform inference on whether the means or other population parameters are similar. For example, suppose we have a sample of undergraduate GPAs from Hopkins graduate students and a sample of undergraduate GPAs from people who did not go to graduate school. Are the mean GPAs different? Are the distributions different?

3.1 Large-sample two population means

3.1.1 Estimation

Data: X_i ~ N(µ_X, σ²_X), Y_j ~ N(µ_Y, σ²_Y)
Estimator: µ̂_X − µ̂_Y = X̄ − Ȳ
Distribution: (X̄ − Ȳ − (µ_X − µ_Y))/√(σ²_X/n + σ²_Y/m) ~ N(0, 1)
C.I.: X̄ − Ȳ ± z(α/2)·√(σ²_X/n + σ²_Y/m)

3.1.2 Testing

Hypothesis: H₀: µ_X − µ_Y = D₀ vs. H_A: µ_X − µ_Y ≠ D₀
Test statistic: Z = (X̄ − Ȳ − D₀)/√(σ²_X/n + σ²_Y/m) ~ N(0, 1)
Rejection region: |Z| > z(α/2)

3.2 Small-sample two population means (identical variance)

3.2.1 Practical rule

max(s²_X, s²_Y)/min(s²_X, s²_Y) < 3

3.2.2 Estimation

Data: X_i ~ N(µ_X, σ²), Y_j ~ N(µ_Y, σ²)
Estimator: µ̂_X − µ̂_Y = X̄ − Ȳ
Distribution: (X̄ − Ȳ − (µ_X − µ_Y))/√(s²_p/n + s²_p/m) ~ t_{n+m−2}, where s²_p = [(n − 1)s²_X + (m − 1)s²_Y]/(n + m − 2)
C.I.: X̄ − Ȳ ± t_{n+m−2}(α/2)·√(s²_p/n + s²_p/m)

3.2.3 Testing

Hypothesis: H₀: µ_X − µ_Y = D₀ vs. H_A: µ_X − µ_Y ≠ D₀
Test statistic: T = (X̄ − Ȳ − D₀)/√(s²_p/n + s²_p/m) ~ t_{n+m−2}
Rejection region: |T| > t_{n+m−2}(α/2)

3.3 Small-sample two population means (non-identical variance)

3.3.1 Practical rule

max(s²_X, s²_Y)/min(s²_X, s²_Y) > 3

3.3.2 Estimation

Data: X_i ~ N(µ_X, σ²_X), Y_j ~ N(µ_Y, σ²_Y)
Estimator: µ̂_X − µ̂_Y = X̄ − Ȳ
Distribution: (X̄ − Ȳ − (µ_X − µ_Y))/√(s²_X/n + s²_Y/m) ~ t_ν, where ν = (s²_X/n + s²_Y/m)² / [(s²_X/n)²/(n − 1) + (s²_Y/m)²/(m − 1)]
C.I.: X̄ − Ȳ ± t_ν(α/2)·√(s²_X/n + s²_Y/m)

3.3.3 Testing

Hypothesis: H₀: µ_X − µ_Y = D₀ vs. H_A: µ_X − µ_Y ≠ D₀
Test statistic: T = (X̄ − Ȳ − D₀)/√(s²_X/n + s²_Y/m) ~ t_ν
Rejection region: |T| > t_ν(α/2)
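The approximate degrees of freedom ν (the Welch–Satterthwaite formula) is the only fiddly piece; a direct transcription (function name `welch_df` is mine):

```python
def welch_df(s2x, n, s2y, m):
    """Welch-Satterthwaite nu = (s2x/n + s2y/m)^2 / [(s2x/n)^2/(n-1) + (s2y/m)^2/(m-1)]."""
    num = (s2x / n + s2y / m) ** 2
    den = (s2x / n) ** 2 / (n - 1) + (s2y / m) ** 2 / (m - 1)
    return num / den
```

A sanity check on the formula: equal variances and equal sample sizes give ν = n + m − 2 (here 18 for n = m = 10), matching the pooled test's degrees of freedom; unequal variances pull ν below that.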

3.4 Small-sample two population means (matched-pair)

3.4.1 Estimation

Data: X_i ~ N(µ_X, σ²_X), Y_i ~ N(µ_Y, σ²_Y), D_i = X_i − Y_i
Estimator: µ̂_D = D̄
Distribution: (D̄ − µ_D)/(s_D/√n) ~ t_{n−1}
C.I.: D̄ ± t_{n−1}(α/2)·s_D/√n

3.4.2 Testing

Hypothesis: H₀: µ_D = µ₀ vs. H_A: µ_D ≠ µ₀
Test statistic: T = (D̄ − µ₀)/(s_D/√n) ~ t_{n−1}
Rejection region: |T| > t_{n−1}(α/2)

3.5 Two population variances

3.5.1 Estimation

Data: X_i ~ N(µ_X, σ²_X), Y_j ~ N(µ_Y, σ²_Y)
Estimator: σ̂²_X/σ̂²_Y = s²_X/s²_Y
Distribution: (s²_X/σ²_X)/(s²_Y/σ²_Y) ~ F_{n−1,m−1}
C.I.: ((s²_X/s²_Y)/F_{n−1,m−1}(α/2), (s²_X/s²_Y)/F_{n−1,m−1}(1 − α/2))

3.5.2 Testing

Hypothesis: H₀: σ²_X/σ²_Y = r₀ vs. H_A: σ²_X/σ²_Y ≠ r₀
Test statistic: F = s²_X/(s²_Y·r₀) ~ F_{n−1,m−1}
Rejection region: F > F_{n−1,m−1}(α/2) or F < F_{n−1,m−1}(1 − α/2)
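The matched-pair procedure is just the one-sample t-test applied to the differences; a sketch (function name `paired_t_statistic` is mine):

```python
from statistics import fmean, stdev

def paired_t_statistic(x, y, mu0=0.0):
    """Matched-pair T = (D-bar - mu0)/(s_D/sqrt(n)) on differences D_i = X_i - Y_i."""
    d = [xi - yi for xi, yi in zip(x, y)]
    return (fmean(d) - mu0) / (stdev(d) / len(d) ** 0.5)
```

Note that only the variability of the differences matters, which is why pairing helps when X_i and Y_i are positively correlated.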

3.6 Large-sample two population proportions

3.6.1 Estimation

Data: X_i ~ Ber(p_X), Y_j ~ Ber(p_Y)
Estimator: p̂_X − p̂_Y = X̄ − Ȳ
Distribution: (p̂_X − p̂_Y − (p_X − p_Y))/√(p_X(1 − p_X)/n + p_Y(1 − p_Y)/m) ~ N(0, 1)
C.I.: p̂_X − p̂_Y ± z(α/2)·√(p̂_X(1 − p̂_X)/n + p̂_Y(1 − p̂_Y)/m)

3.6.2 Testing (identical proportion hypothesis)

Hypothesis: H₀: p_X = p_Y vs. H_A: p_X ≠ p_Y
Test statistic: Z = (p̂_X − p̂_Y)/√(p̂(1 − p̂)/n + p̂(1 − p̂)/m) ~ N(0, 1), where p̂ = (nX̄ + mȲ)/(n + m)
Rejection region: |Z| > z(α/2)

3.6.3 Testing (non-identical proportion hypothesis)

Hypothesis: H₀: p_X − p_Y = D₀ vs. H_A: p_X − p_Y ≠ D₀
Test statistic: Z = (p̂_X − p̂_Y − D₀)/√(p̂_X(1 − p̂_X)/n + p̂_Y(1 − p̂_Y)/m) ~ N(0, 1)
Rejection region: |Z| > z(α/2)

4 General distribution

4.1 Estimation

4.1.1 Mean-variance trade-off

Unbiased: E(θ̂) = θ
Mean square error: E[(θ̂ − θ)²] = E[(θ̂ − E(θ̂) + E(θ̂) − θ)²] = (E(θ̂) − θ)² + Var(θ̂) = Bias² + Variance

4.1.2 Method of moments (MOM)

µ_k = E[X^k] and µ̂_k = (1/n)Σᵢ₌₁ⁿ X_i^k; set µ_k(θ) = µ̂_k and solve for θ.

Examples:
X ~ Poi(λ): µ₁ = λ, so µ̂₁ = X̄ and λ̂ = X̄
X ~ N(µ, σ²): µ₁ = µ, µ₂ = µ² + σ², so µ̂ = X̄ and σ̂² = (1/n)Σᵢ₌₁ⁿ (X_i − X̄)² (biased)
X ~ Γ(α, β): µ₁ = α/β, µ₂ = α(α + 1)/β², so β̂ = µ̂₁/(µ̂₂ − µ̂₁²) and α̂ = β̂·µ̂₁
X ~ U(0, θ): µ₁ = θ/2, so θ̂ = 2X̄ (could make no sense, e.g. if some X_i > 2X̄)
Pros: easy, consistent, asymptotically unbiased
Cons: could make no sense, not efficient

4.1.3 Maximum likelihood estimator (MLE)

θ̂ = arg max lik(θ) = arg max Πᵢ₌₁ⁿ f(X_i | θ)
θ̂ = arg max l(θ) = arg max Σᵢ₌₁ⁿ log f(X_i | θ)

Examples:
X ~ Poi(λ): P(X = x) = λ^x e^{−λ}/x!
l(λ) = Σᵢ₌₁ⁿ (X_i log λ − λ − log X_i!) = log λ·Σᵢ₌₁ⁿ X_i − nλ − Σᵢ₌₁ⁿ log X_i!
l′(λ) = (1/λ)Σᵢ₌₁ⁿ X_i − n = 0 ⇒ λ̂ = X̄

X ~ N(µ, σ²): lik(µ, σ) = Πᵢ₌₁ⁿ (1/(√(2π)σ)) exp(−(X_i − µ)²/(2σ²))
l(µ, σ) = −n log σ − (n/2) log 2π − (1/(2σ²))Σᵢ₌₁ⁿ (X_i − µ)²
∂l/∂µ = (1/σ²)Σᵢ₌₁ⁿ (X_i − µ) = 0 ⇒ µ̂ = X̄
∂l/∂σ = −n/σ + (1/σ³)Σᵢ₌₁ⁿ (X_i − µ)² = 0 ⇒ σ̂² = (1/n)Σᵢ₌₁ⁿ (X_i − X̄)²

X ~ Γ(α, β): lik(α, β) = Πᵢ₌₁ⁿ (β^α/Γ(α)) X_i^{α−1} e^{−βX_i}
l(α, β) = nα log β + (α − 1)Σᵢ₌₁ⁿ log X_i − βΣᵢ₌₁ⁿ X_i − n log Γ(α)
∂l/∂α = n log β + Σᵢ₌₁ⁿ log X_i − nΓ′(α)/Γ(α) = 0 (no closed-form solution)
∂l/∂β = nα/β − Σᵢ₌₁ⁿ X_i = 0 ⇒ β̂ = α̂/X̄
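As a numerical check of the Poisson example, one can maximize l(λ) = log λ·Σx_i − nλ (dropping the constant −Σ log x_i!) over a grid and see that the maximizer lands on the closed-form MLE λ̂ = X̄ (this snippet and its names are my own illustration, not from the notes):

```python
from math import log
from statistics import fmean

def poisson_loglik(lam, data):
    """Poisson log-likelihood up to an additive constant: l(lam) = log(lam)*sum(x) - n*lam."""
    return log(lam) * sum(data) - len(data) * lam

data = [2, 3, 1, 4, 0, 2]
grid = [k / 100 for k in range(1, 1001)]
lam_hat = max(grid, key=lambda lam: poisson_loglik(lam, data))
# the grid maximizer coincides with the closed-form MLE: lam_hat = X-bar
```

Here X̄ = 2.0 lies exactly on the grid, so the grid search recovers it.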

X ~ U(0, θ): lik(θ) = Πᵢ₌₁ⁿ (1/θ)·1{X_i < θ} = (1/θⁿ)·1{X₍ₙ₎ < θ} ⇒ θ̂ = X₍ₙ₎ (the sample maximum)
Pros: consistent, asymptotically unbiased, asymptotically normal, asymptotically efficient (unbiased + achieves the Cramér–Rao lower bound (CRLB))
Cons: not robust, could have no closed-form solution (could be sensitive to the optimization algorithm)

4.1.4 Asymptotic Normality for MLE

Asymptotic normality: √(nI(θ₀))·(θ̂ − θ₀) → N(0, 1)
C.I.: θ̂ ± z(α/2)/√(nI(θ̂)), where I(θ) = E[(∂/∂θ log f(X | θ))²] (Fisher information)
Under appropriate smoothness conditions: I(θ) = −E[∂²/∂θ² log f(X | θ)]

Examples:
X ~ Poi(λ): f(x | λ) = λ^x e^{−λ}/x!, E[X] = λ, Var(X) = λ, λ̂_MLE = X̄
I(λ) = E[(∂/∂λ log f(X | λ))²] = E[(X/λ − 1)²] = (E(X²) − 2λE(X) + λ²)/λ² = λ/λ² = 1/λ
or I(λ) = −E[∂²/∂λ² log f(X | λ)] = E[X/λ²] = 1/λ
C.I.: X̄ ± z(α/2)·√(X̄/n)

X ~ N(µ, σ²): log f(x | µ, σ²) = −(1/2) log(2πσ²) − (x − µ)²/(2σ²)
∂/∂µ log f = (x − µ)/σ², ∂²/∂µ² log f = −1/σ², so I(µ) = 1/σ²
C.I.: X̄ ± z(α/2)·σ/√n

Cramér–Rao lower bound: Var(θ̂) ≥ 1/(nI(θ)) for any unbiased estimator θ̂

4.1.5 Bootstrap

Parametric vs. non-parametric
Bootstrap replicates: θ*₁, ..., θ*_B, the statistic recomputed on resampled data

Percentile method: C.I. = (θ*_L, θ*_U), the empirical α/2 and 1 − α/2 quantiles of the bootstrap replicates
Normal approximation: se(θ̂) ≈ s_{θ*}; C.I.: θ̂ ± z(α/2)·s_{θ*}
Distribution approximation: the C.I. for θ₀ is (2θ̂ − θ*_U, 2θ̂ − θ*_L). Reason: if P(L ≤ θ̂ − θ₀ ≤ U) = 1 − α, then the C.I. is (θ̂ − U, θ̂ − L); approximating the distribution of θ̂ − θ₀ by that of θ* − θ̂ gives L ≈ θ*_L − θ̂ and U ≈ θ*_U − θ̂, hence the C.I. (2θ̂ − θ*_U, 2θ̂ − θ*_L)
If the bootstrap distribution is symmetric about θ̂, i.e. θ̂ − θ*_L = θ*_U − θ̂, then 2θ̂ − θ*_U = θ*_L and 2θ̂ − θ*_L = θ*_U, and this agrees with the percentile method
See also: cross-validation (in the literature these are often considered simultaneously)

4.1.6 Bayesian

Prior + likelihood → posterior (matters in small samples; small effect in large samples)

Examples:
Prior: U(0, 1): π(p) ∝ 1, or Beta(α, β): π(p) ∝ p^{α−1}(1 − p)^{β−1}
Likelihood: Bin(n, p): lik(p) ∝ p^N(1 − p)^{n−N}
Posterior: π*(p) ∝ p^{α+N−1}(1 − p)^{β+n−N−1}, i.e. Beta(α + N, β + n − N)
Prior mean µ = α/(α + β), prior variance σ² = αβ/[(α + β)²(α + β + 1)]
Posterior mean µ* = (α + N)/(α + β + n) = [(α + β)/(α + β + n)]·[α/(α + β)] + [n/(α + β + n)]·(N/n), a weighted average of the prior mean and the sample proportion
Posterior variance σ*² = (α + N)(β + n − N)/[(α + β + n)²(α + β + n + 1)] → 0 as n → ∞

Prior: Γ(a, b): π(θ) ∝ θ^{a−1} e^{−bθ}, E(θ) = a/b, Var(θ) = a/b²
Likelihood: Poi(θ): lik(θ) = Πᵢ₌₁ⁿ θ^{X_i} e^{−θ}/X_i! ∝ θ^S e^{−nθ}, with S = Σᵢ₌₁ⁿ X_i
Posterior: π*(θ) ∝ θ^{S+a−1} e^{−(n+b)θ}, i.e. Γ(a + S, b + n)
Posterior mean µ* = (a + S)/(b + n) = [b/(b + n)]·(a/b) + [n/(b + n)]·(S/n); posterior variance σ*² = (a + S)/(b + n)² → 0

Prior: Γ(a, b): π(λ) ∝ λ^{a−1} e^{−bλ}, E(λ) = a/b, Var(λ) = a/b²
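The percentile method can be sketched in a few lines of standard-library Python (function name `bootstrap_percentile_ci` and defaults are my own choices, not from the notes):

```python
import random
from statistics import fmean

def bootstrap_percentile_ci(data, stat=fmean, B=2000, alpha=0.05, seed=0):
    """Percentile-method bootstrap CI: resample with replacement B times,
    recompute the statistic, and take the empirical alpha/2 and 1-alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(data)
    reps = sorted(stat([rng.choice(data) for _ in range(n)]) for _ in range(B))
    return reps[int(B * alpha / 2)], reps[int(B * (1 - alpha / 2)) - 1]
```

The interval brackets the observed statistic; for the mean of 1, ..., 20 it straddles X̄ = 10.5.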

Likelihood: Exp(λ): lik(λ) = Πᵢ₌₁ⁿ λe^{−λx_i} = λⁿ e^{−sλ}, with s = Σᵢ₌₁ⁿ x_i
Posterior: π*(λ) ∝ λ^{n+a−1} e^{−(s+b)λ}, i.e. Γ(a + n, b + s)
Posterior mean µ* = (a + n)/(b + s) = [b/(b + s)]·(a/b) + [s/(b + s)]·(n/s); posterior variance σ*² = (a + n)/(b + s)²

Prior: N(a, b²): π(θ) ∝ exp(−(θ − a)²/(2b²))
Likelihood: N(θ, σ²): lik(θ) ∝ Πᵢ₌₁ⁿ exp(−(x_i − θ)²/(2σ²)) ∝ exp(−(SS + n(θ − x̄)²)/(2σ²))
Posterior: π*(θ) ∝ exp(−(θ − θ*)²/(2b*²)), where
θ* = (aσ² + nX̄b²)/(σ² + nb²) = [σ²/(σ² + nb²)]·a + [nb²/(σ² + nb²)]·X̄
b*² = b²σ²/(σ² + nb²), i.e. 1/b*² = 1/b² + 1/(σ²/n)

4.2 Testing

4.2.1 Likelihood ratio test

Hypothesis: H₀: µ = µ₀ vs. H_A: µ = µ_A (both simple)
Test statistic: Λ = f(X | H₀)/f(X | H_A)
Rejection region: small values of Λ(X)
By the Neyman–Pearson lemma, this test is most powerful for a simple null vs. a simple alternative

Example: H₀: µ = µ₀; H_A: µ = µ_A
Λ = exp[−(1/(2σ²))Σᵢ₌₁ⁿ (X_i − µ₀)²] / exp[−(1/(2σ²))Σᵢ₌₁ⁿ (X_i − µ_A)²]
Reject for small values of Σᵢ₌₁ⁿ (X_i − µ_A)² − Σᵢ₌₁ⁿ (X_i − µ₀)² = 2nX̄(µ₀ − µ_A) + nµ_A² − nµ₀².
If µ₀ > µ_A, reject for small values of X̄; if µ₀ < µ_A, reject for large values of X̄.
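The Beta–Binomial update from the notes is purely arithmetic, which makes it easy to sketch (function name `beta_binomial_posterior` is mine):

```python
def beta_binomial_posterior(alpha, beta, N, n):
    """Conjugate update from the notes: Beta(alpha, beta) prior + Bin(n, p) likelihood
    with N successes gives posterior Beta(alpha + N, beta + n - N)."""
    a, b = alpha + N, beta + (n - N)
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return a, b, mean, var
```

With a flat U(0, 1) = Beta(1, 1) prior and 7 successes in 10 trials, the posterior is Beta(8, 4) with mean 2/3, pulled slightly toward the prior mean 1/2 relative to the sample proportion 0.7.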

4.2.2 Generalized likelihood ratio test

Hypothesis: composite null vs. composite alternative
Test statistic: Λ = max_{θ ∈ ω₀} f(X | θ) / max_{θ ∈ Ω} f(X | θ), where ω₀ is the null parameter set and Ω = ω₀ ∪ ω_A
−2 log Λ → χ²_{dim Ω − dim ω₀} as n → ∞
Rejection region: small values of Λ(X), i.e. large values of −2 log Λ

Example: H₀: µ = µ₀ vs. H_A: µ ≠ µ₀, with σ² known
Λ(X) = exp[−(1/(2σ²))Σᵢ₌₁ⁿ (X_i − µ₀)²] / exp[−(1/(2σ²))Σᵢ₌₁ⁿ (X_i − X̄)²] = exp(−(1/(2σ²))[Σᵢ₌₁ⁿ (X_i − µ₀)² − Σᵢ₌₁ⁿ (X_i − X̄)²])
−2 log Λ(X) = (1/σ²)(Σᵢ₌₁ⁿ (X_i − µ₀)² − Σᵢ₌₁ⁿ (X_i − X̄)²) = (n/σ²)(X̄ − µ₀)² = ((X̄ − µ₀)/(σ/√n))² ~ χ²₁
[using Σᵢ₌₁ⁿ (X_i − µ₀)² = Σᵢ₌₁ⁿ (X_i − X̄)² + n(X̄ − µ₀)²]
Reject when −2 log Λ > χ²₁(α) ⇔ ((X̄ − µ₀)/(σ/√n))² > χ²₁(α) ⇔ |X̄ − µ₀|/(σ/√n) > z(α/2)

Poisson dispersion test: X_i ~ Poi(λ_i), H₀: λ₁ = ⋯ = λ_n = λ
Λ = Πᵢ₌₁ⁿ (λ̂^{x_i} e^{−λ̂}/x_i!) / Πᵢ₌₁ⁿ (λ̂_i^{x_i} e^{−λ̂_i}/x_i!) = Πᵢ₌₁ⁿ (x̄/x_i)^{x_i} e^{x_i − x̄}, where λ̂ = x̄ and λ̂_i = x_i
−2 log Λ = 2Σᵢ₌₁ⁿ [x_i log(x_i/x̄) + (x̄ − x_i)] = 2Σᵢ₌₁ⁿ x_i log(x_i/x̄) ~ χ²_{n−1}
Reject when −2 log Λ > χ²_{n−1}(α)

5 ANOVA

5.1 One-way ANOVA

5.1.1 Model

Y_ij ~ N(µ + α_i, σ²) for i = 1, ..., I and j = 1, ..., n_i
Constraint: Σᵢ₌₁ᴵ α_i = 0

5.1.2 Testing

SST = SSB + SSE
SST = Σ_{i,j} (Y_ij − Ȳ··)², SSB = Σᵢ n_i(Ȳ_i· − Ȳ··)², SSE = Σ_{i,j} (Y_ij − Ȳ_i·)²
d.f.: (n − 1) = (I − 1) + (n − I), where n = Σᵢ n_i (in the balanced case n_i = J, n − I = I(J − 1))
Hypothesis: H₀: α₁ = ⋯ = α_I = 0
Test statistic: F = [SSB/(I − 1)]/[SSE/(n − I)] = MSB/MSE ~ F_{I−1,n−I}
Rejection region: F > F_{I−1,n−I}(α)

5.1.3 Estimation

µ̂ = Ȳ··, α̂_i = Ȳ_i· − Ȳ··, σ̂² = s²_p = MSE = SSE/(n − I)
C.I. for µ + α_i: Ȳ_i· ± t_{n−I}(α/2)·√(MSE/n_i)
C.I. for α_i − α_j:
Two-sample: (Ȳ_i· − Ȳ_j·) ± t_{n−I}(α/2)·√(MSE/n_i + MSE/n_j)
Bonferroni: (Ȳ_i· − Ȳ_j·) ± t_{n−I}((α/2)/(I choose 2))·√(MSE/n_i + MSE/n_j)
Tukey (n₁ = ⋯ = n_I = J): (Ȳ_i· − Ȳ_j·) ± q_{I,n−I}(α)·√(MSE/J)

5.2 Randomized block design

5.2.1 Model

Y_ij ~ N(µ + α_i + β_j, σ²) for i = 1, ..., I and j = 1, ..., J
Constraints: Σᵢ α_i = 0, Σⱼ β_j = 0

5.2.2 Testing

SST = SSA + SSB + SSE
SST = Σ_{i,j} (Y_ij − Ȳ··)², SSA = Σᵢ J(Ȳ_i· − Ȳ··)², SSB = Σⱼ I(Ȳ_·j − Ȳ··)²
SSE = Σ_{i,j} (Y_ij − Ȳ_i· − Ȳ_·j + Ȳ··)²
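The one-way F statistic follows directly from the sums of squares above; a compact sketch (function name `one_way_anova_F` is mine):

```python
from statistics import fmean

def one_way_anova_F(groups):
    """F = MSB/MSE = [SSB/(I-1)] / [SSE/(n-I)] for I groups with n total observations."""
    all_obs = [y for g in groups for y in g]
    grand = fmean(all_obs)
    I, n = len(groups), len(all_obs)
    ssb = sum(len(g) * (fmean(g) - grand) ** 2 for g in groups)
    sse = sum((y - fmean(g)) ** 2 for g in groups for y in g)
    return (ssb / (I - 1)) / (sse / (n - I))
```

For three groups with means 2, 3, 4 and identical within-group spread, SSB = SSE = 6, so F = 3/1 = 3.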

d.f.: (IJ − 1) = (I − 1) + (J − 1) + (I − 1)(J − 1)
Hypothesis: H₀: α₁ = ⋯ = α_I = 0
Test statistic: F = [SSA/(I − 1)]/[SSE/((I − 1)(J − 1))] = MSA/MSE ~ F_{I−1,(I−1)(J−1)}
Rejection region: F > F_{I−1,(I−1)(J−1)}(α)
Hypothesis: H₀: β₁ = ⋯ = β_J = 0
Test statistic: F = [SSB/(J − 1)]/[SSE/((I − 1)(J − 1))] = MSB/MSE ~ F_{J−1,(I−1)(J−1)}
Rejection region: F > F_{J−1,(I−1)(J−1)}(α)

5.2.3 Estimation

µ̂ = Ȳ··, α̂_i = Ȳ_i· − Ȳ··, β̂_j = Ȳ_·j − Ȳ··, σ̂² = s²_p = MSE = SSE/((I − 1)(J − 1))
C.I. for µ + α_i: Ȳ_i· ± t_{(I−1)(J−1)}(α/2)·√(MSE/J)
C.I. for µ + β_j: Ȳ_·j ± t_{(I−1)(J−1)}(α/2)·√(MSE/I)
C.I. for α_i − α_j:
Two-sample: (Ȳ_i· − Ȳ_j·) ± t_{(I−1)(J−1)}(α/2)·√(MSE/J + MSE/J)
Tukey: (Ȳ_i· − Ȳ_j·) ± q_{I,(I−1)(J−1)}(α)·√(MSE/J)
C.I. for β_i − β_j:
Two-sample: (Ȳ_·i − Ȳ_·j) ± t_{(I−1)(J−1)}(α/2)·√(MSE/I + MSE/I)
Tukey: (Ȳ_·i − Ȳ_·j) ± q_{J,(I−1)(J−1)}(α)·√(MSE/I)

5.3 Two-way ANOVA

5.3.1 Model

Y_ijk ~ N(µ + α_i + β_j + δ_ij, σ²) for i = 1, ..., I, j = 1, ..., J, k = 1, ..., K
Constraints: Σᵢ α_i = 0, Σⱼ β_j = 0, Σᵢ δ_ij = 0 for each j, Σⱼ δ_ij = 0 for each i

5.3.2 Testing

SST = SSA + SSB + SSAB + SSE
SST = Σ_{i,j,k} (Y_ijk − Ȳ···)², SSA = Σᵢ JK(Ȳ_i·· − Ȳ···)², SSB = Σⱼ IK(Ȳ_·j· − Ȳ···)²
SSAB = Σ_{i,j} K(Ȳ_ij· − Ȳ_i·· − Ȳ_·j· + Ȳ···)², SSE = Σ_{i,j,k} (Y_ijk − Ȳ_ij·)²
d.f.: (IJK − 1) = (I − 1) + (J − 1) + (I − 1)(J − 1) + IJ(K − 1)
Hypothesis: H₀: α₁ = ⋯ = α_I = 0
Test statistic: F = [SSA/(I − 1)]/[SSE/(IJ(K − 1))] = MSA/MSE ~ F_{I−1,IJ(K−1)}
Rejection region: F > F_{I−1,IJ(K−1)}(α)
Hypothesis: H₀: β₁ = ⋯ = β_J = 0
Test statistic: F = [SSB/(J − 1)]/[SSE/(IJ(K − 1))] = MSB/MSE ~ F_{J−1,IJ(K−1)}
Rejection region: F > F_{J−1,IJ(K−1)}(α)
Hypothesis: H₀: δ_ij = 0 for all i, j
Test statistic: F = [SSAB/((I − 1)(J − 1))]/[SSE/(IJ(K − 1))] = MSAB/MSE ~ F_{(I−1)(J−1),IJ(K−1)}
Rejection region: F > F_{(I−1)(J−1),IJ(K−1)}(α)

5.3.3 Estimation

σ̂² = s²_p = MSE = SSE/(IJ(K − 1))
Tukey (when the interaction is significant): (Ȳ_ij· − Ȳ_i′j′·) ± q_{IJ,IJ(K−1)}(α)·√(MSE/K)

6 Linear regression

6.1 Simple linear regression

6.1.1 Model

y ~ N(β₀ + β₁x, σ²)

6.1.2 Testing

SST = SSR + SSE
SST = Σ(y_i − ȳ)², SSR = Σ(ŷ_i − ȳ)², SSE = Σ(y_i − ŷ_i)²
d.f.: (n − 1) = (2 − 1) + (n − 2)
Hypothesis: H₀: β₁ = 0
Test statistic: F = [SSR/(2 − 1)]/[SSE/(n − 2)] = MSR/MSE ~ F_{1,n−2}
Rejection region: F > F_{1,n−2}(α)

6.1.3 Estimation

Least-squares estimates: β̂₁ = S_xy/S_xx, β̂₀ = ȳ − β̂₁x̄, where S_xy = Σ(x_i − x̄)(y_i − ȳ) and S_xx = Σ(x_i − x̄)²
Distributions (with x²̄ = (1/n)Σx_i²):
(β̂₁ − β₁)/√(σ²/S_xx) ~ N(0, 1), (β̂₁ − β₁)/√(MSE/S_xx) ~ t_{n−2}
(β̂₀ − β₀)/√(σ²·x²̄/S_xx) ~ N(0, 1), (β̂₀ − β₀)/√(MSE·x²̄/S_xx) ~ t_{n−2}
(ŷ* − (β₀ + β₁x*))/√(MSE(1/n + (x* − x̄)²/S_xx)) ~ t_{n−2}
C.I. for β₁: β̂₁ ± t_{n−2}(α/2)·√(MSE/S_xx)
C.I. for β₀: β̂₀ ± t_{n−2}(α/2)·√(MSE·x²̄/S_xx)
C.I. for β₀ + β₁x*: ŷ* ± t_{n−2}(α/2)·√(MSE(1/n + (x* − x̄)²/S_xx))
Prediction interval for y*: ŷ* ± t_{n−2}(α/2)·√(MSE(1 + 1/n + (x* − x̄)²/S_xx))

6.1.4 T-test

H₀: β₁ = 0, T = β̂₁/√(MSE/S_xx) ~ t_{n−2}
H₀: β₀ = 0, T = β̂₀/√(MSE·x²̄/S_xx) ~ t_{n−2}
The t-test for β₁ is equivalent to the analysis of variance (T² = F).
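The least-squares formulas translate directly to code; a minimal sketch (function name `simple_linear_regression` is mine):

```python
from statistics import fmean

def simple_linear_regression(x, y):
    """Least squares: beta1-hat = S_xy/S_xx, beta0-hat = y-bar - beta1-hat * x-bar."""
    xbar, ybar = fmean(x), fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1
```

On exactly linear data y = 1 + 2x the estimates recover the intercept 1 and slope 2.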

6.1.5 Coefficient of determination

r² = SSR/SST = S²_xy/(S_xx S_yy)
The percent reduction in the total variation obtained by using the regression line.
Correlation analysis: H₀: ρ = 0, T = r√(n − 2)/√(1 − r²) ~ t_{n−2}; equivalent to the t-test for β₁.

6.2 Variable Selection

Sometimes it is inefficient to use all the variables, so we seek to choose a subset. There are several ways to do this. First, we need to discuss measures of fit. R² is a good measure, but it only works for linear regression.

6.2.1 Measures of Fit

We can minimize: AIC, BIC, or MSE/classification error.
AIC: 2k − 2 log(L̂), where k is the number of variables and L̂ is the maximized likelihood.
BIC: log(n)·k − 2 log(L̂), where n is the number of observations.
MSE: same as before.

6.2.2 Stepwise Regression

Forward stepwise regression:
1. Initialize V := ∅
2. For each variable v ∉ V:
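The two information criteria are one-line formulas; for concreteness (function names are mine), given a model's maximized log-likelihood:

```python
from math import log

def aic(k, loglik):
    """AIC = 2k - 2*log(L-hat); smaller is better."""
    return 2 * k - 2 * loglik

def bic(k, n, loglik):
    """BIC = k*log(n) - 2*log(L-hat); penalizes extra variables more than AIC once n > e^2."""
    return k * log(n) - 2 * loglik
```

The only difference is the per-variable penalty, 2 for AIC versus log n for BIC, which is why BIC selects sparser models for large n.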

(a) Run a model with all the variables in V plus the variable v
(b) Keep track of the AIC/BIC
3. Find the variable v that most improves (minimizes) the AIC, and set V := V ∪ {v}.
4. Go back to step 2.

6.2.3 Ridge Regression

Solve: arg min_β Σᵢ (x_iᵀβ − y_i)² + λΣⱼ β_j².

6.2.4 Lasso Regression

Solve: arg min_β Σᵢ (x_iᵀβ − y_i)² + λΣⱼ |β_j|.

7 Other important concepts to know

7.1 Cross Validation

7.2 Dimensionality Reduction

PCA
Variable Selection / Stepwise Regression

7.3 Classification

Logistic Regression
K-NN
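The shrinkage effect of the ridge penalty is easiest to see in the single-predictor, no-intercept case, where the objective has a closed form (this reduction and the name `ridge_1d` are my own illustration, not from the notes):

```python
def ridge_1d(x, y, lam):
    """Single-predictor, no-intercept ridge: beta-hat = sum(x_i*y_i) / (sum(x_i^2) + lam).
    lam = 0 recovers ordinary least squares; larger lam shrinks beta-hat toward 0."""
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)
```

With y_i = 2x_i exactly, λ = 0 returns the OLS slope 2, while λ = Σx_i² halves it, showing the bias introduced in exchange for lower variance (the mean-variance trade-off of Section 4.1.1).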

Random Forests

7.4 Clustering

K-means
E-M algorithm
Other clustering methods (subspace clustering, K-medoids, etc.)

7.5 Non-parametric methods

Random Forest
Support Vector Machines
Neural Networks (technically semi-parametric)

7.6 Model selection

Elbow method
See also: cross-validation


More information

Simple and Multiple Linear Regression

Simple and Multiple Linear Regression Sta. 113 Chapter 12 and 13 of Devore March 12, 2010 Table of contents 1 Simple Linear Regression 2 Model Simple Linear Regression A simple linear regression model is given by Y = β 0 + β 1 x + ɛ where

More information

Final Examination Statistics 200C. T. Ferguson June 11, 2009

Final Examination Statistics 200C. T. Ferguson June 11, 2009 Final Examination Statistics 00C T. Ferguson June, 009. (a) Define: X n converges in probability to X. (b) Define: X m converges in quadratic mean to X. (c) Show that if X n converges in quadratic mean

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

Chapter 8.8.1: A factorization theorem

Chapter 8.8.1: A factorization theorem LECTURE 14 Chapter 8.8.1: A factorization theorem The characterization of a sufficient statistic in terms of the conditional distribution of the data given the statistic can be difficult to work with.

More information

Two-Way Analysis of Variance - no interaction

Two-Way Analysis of Variance - no interaction 1 Two-Way Analysis of Variance - no interaction Example: Tests were conducted to assess the effects of two factors, engine type, and propellant type, on propellant burn rate in fired missiles. Three engine

More information

Statistics Ph.D. Qualifying Exam: Part II November 3, 2001

Statistics Ph.D. Qualifying Exam: Part II November 3, 2001 Statistics Ph.D. Qualifying Exam: Part II November 3, 2001 Student Name: 1. Answer 8 out of 12 problems. Mark the problems you selected in the following table. 1 2 3 4 5 6 7 8 9 10 11 12 2. Write your

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Simple linear regression tries to fit a simple line between two variables Y and X. If X is linearly related to Y this explains some of the variability in Y. In most cases, there

More information

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness

Information in Data. Sufficiency, Ancillarity, Minimality, and Completeness Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

Part IB. Statistics. Year

Part IB. Statistics. Year Part IB Year 2018 2017 2016 2015 2014 2013 2012 2011 2010 2009 2008 2007 2006 2005 2004 2003 2002 2001 2018 42 Paper 1, Section I 7H X 1, X 2,..., X n form a random sample from a distribution whose probability

More information

Analysis of Variance and Design of Experiments-I

Analysis of Variance and Design of Experiments-I Analysis of Variance and Design of Experiments-I MODULE VIII LECTURE - 35 ANALYSIS OF VARIANCE IN RANDOM-EFFECTS MODEL AND MIXED-EFFECTS MODEL Dr. Shalabh Department of Mathematics and Statistics Indian

More information

Final Examination a. STA 532: Statistical Inference. Wednesday, 2015 Apr 29, 7:00 10:00pm. Thisisaclosed bookexam books&phonesonthefloor.

Final Examination a. STA 532: Statistical Inference. Wednesday, 2015 Apr 29, 7:00 10:00pm. Thisisaclosed bookexam books&phonesonthefloor. Final Examination a STA 532: Statistical Inference Wednesday, 2015 Apr 29, 7:00 10:00pm Thisisaclosed bookexam books&phonesonthefloor Youmayuseacalculatorandtwo pagesofyourownnotes Do not share calculators

More information

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013

UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013 UNIVERSITY OF TORONTO SCARBOROUGH Department of Computer and Mathematical Sciences FINAL EXAMINATION, APRIL 2013 STAB57H3 Introduction to Statistics Duration: 3 hours Last Name: First Name: Student number:

More information

Ch. 1: Data and Distributions

Ch. 1: Data and Distributions Ch. 1: Data and Distributions Populations vs. Samples How to graphically display data Histograms, dot plots, stem plots, etc Helps to show how samples are distributed Distributions of both continuous and

More information

LIST OF FORMULAS FOR STK1100 AND STK1110

LIST OF FORMULAS FOR STK1100 AND STK1110 LIST OF FORMULAS FOR STK1100 AND STK1110 (Version of 11. November 2015) 1. Probability Let A, B, A 1, A 2,..., B 1, B 2,... be events, that is, subsets of a sample space Ω. a) Axioms: A probability function

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 05 Full points may be obtained for correct answers to eight questions Each numbered question (which may have several parts) is worth

More information

Statistics Ph.D. Qualifying Exam

Statistics Ph.D. Qualifying Exam Department of Statistics Carnegie Mellon University May 7 2008 Statistics Ph.D. Qualifying Exam You are not expected to solve all five problems. Complete solutions to few problems will be preferred to

More information

Statistical Inference

Statistical Inference Statistical Inference Serik Sagitov, Chalmers University of Technology and Gothenburg University Abstract This text is a compendium for the undergraduate course on course MVE155 Statistical Inference worth

More information

Spring 2012 Math 541A Exam 1. X i, S 2 = 1 n. n 1. X i I(X i < c), T n =

Spring 2012 Math 541A Exam 1. X i, S 2 = 1 n. n 1. X i I(X i < c), T n = Spring 2012 Math 541A Exam 1 1. (a) Let Z i be independent N(0, 1), i = 1, 2,, n. Are Z = 1 n n Z i and S 2 Z = 1 n 1 n (Z i Z) 2 independent? Prove your claim. (b) Let X 1, X 2,, X n be independent identically

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

First Year Examination Department of Statistics, University of Florida

First Year Examination Department of Statistics, University of Florida First Year Examination Department of Statistics, University of Florida August 19, 010, 8:00 am - 1:00 noon Instructions: 1. You have four hours to answer questions in this examination.. You must show your

More information

Math 423/533: The Main Theoretical Topics

Math 423/533: The Main Theoretical Topics Math 423/533: The Main Theoretical Topics Notation sample size n, data index i number of predictors, p (p = 2 for simple linear regression) y i : response for individual i x i = (x i1,..., x ip ) (1 p)

More information

iron retention (log) high Fe2+ medium Fe2+ high Fe3+ medium Fe3+ low Fe2+ low Fe3+ 2 Two-way ANOVA

iron retention (log) high Fe2+ medium Fe2+ high Fe3+ medium Fe3+ low Fe2+ low Fe3+ 2 Two-way ANOVA iron retention (log) 0 1 2 3 high Fe2+ high Fe3+ low Fe2+ low Fe3+ medium Fe2+ medium Fe3+ 2 Two-way ANOVA In the one-way design there is only one factor. What if there are several factors? Often, we are

More information

Summary of Chapter 7 (Sections ) and Chapter 8 (Section 8.1)

Summary of Chapter 7 (Sections ) and Chapter 8 (Section 8.1) Summary of Chapter 7 (Sections 7.2-7.5) and Chapter 8 (Section 8.1) Chapter 7. Tests of Statistical Hypotheses 7.2. Tests about One Mean (1) Test about One Mean Case 1: σ is known. Assume that X N(µ, σ

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Some Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2

Some Assorted Formulae. Some confidence intervals: σ n. x ± z α/2. x ± t n 1;α/2 n. ˆp(1 ˆp) ˆp ± z α/2 n. χ 2 n 1;1 α/2. n 1;α/2 STA 248 H1S MIDTERM TEST February 26, 2008 SURNAME: SOLUTIONS GIVEN NAME: STUDENT NUMBER: INSTRUCTIONS: Time: 1 hour and 50 minutes Aids allowed: calculator Tables of the standard normal, t and chi-square

More information

Probability and Statistics qualifying exam, May 2015

Probability and Statistics qualifying exam, May 2015 Probability and Statistics qualifying exam, May 2015 Name: Instructions: 1. The exam is divided into 3 sections: Linear Models, Mathematical Statistics and Probability. You must pass each section to pass

More information

Data Mining Stat 588

Data Mining Stat 588 Data Mining Stat 588 Lecture 02: Linear Methods for Regression Department of Statistics & Biostatistics Rutgers University September 13 2011 Regression Problem Quantitative generic output variable Y. Generic

More information

9 Asymptotic Approximations and Practical Asymptotic Tools

9 Asymptotic Approximations and Practical Asymptotic Tools 9 Asymptotic Approximations and Practical Asymptotic Tools A point estimator is merely an educated guess about the true value of an unknown parameter. The utility of a point estimate without some idea

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium November 12, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Statistics GIDP Ph.D. Qualifying Exam Theory Jan 11, 2016, 9:00am-1:00pm

Statistics GIDP Ph.D. Qualifying Exam Theory Jan 11, 2016, 9:00am-1:00pm Statistics GIDP Ph.D. Qualifying Exam Theory Jan, 06, 9:00am-:00pm Instructions: Provide answers on the supplied pads of paper; write on only one side of each sheet. Complete exactly 5 of the 6 problems.

More information

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions.

For iid Y i the stronger conclusion holds; for our heuristics ignore differences between these notions. Large Sample Theory Study approximate behaviour of ˆθ by studying the function U. Notice U is sum of independent random variables. Theorem: If Y 1, Y 2,... are iid with mean µ then Yi n µ Called law of

More information

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:.

Statement: With my signature I confirm that the solutions are the product of my own work. Name: Signature:. MATHEMATICAL STATISTICS Homework assignment Instructions Please turn in the homework with this cover page. You do not need to edit the solutions. Just make sure the handwriting is legible. You may discuss

More information

Factorial and Unbalanced Analysis of Variance

Factorial and Unbalanced Analysis of Variance Factorial and Unbalanced Analysis of Variance Nathaniel E. Helwig Assistant Professor of Psychology and Statistics University of Minnesota (Twin Cities) Updated 04-Jan-2017 Nathaniel E. Helwig (U of Minnesota)

More information

Maximum Likelihood Large Sample Theory

Maximum Likelihood Large Sample Theory Maximum Likelihood Large Sample Theory MIT 18.443 Dr. Kempthorne Spring 2015 1 Outline 1 Large Sample Theory of Maximum Likelihood Estimates 2 Asymptotic Results: Overview Asymptotic Framework Data Model

More information

Linear regression. We have that the estimated mean in linear regression is. ˆµ Y X=x = ˆβ 0 + ˆβ 1 x. The standard error of ˆµ Y X=x is.

Linear regression. We have that the estimated mean in linear regression is. ˆµ Y X=x = ˆβ 0 + ˆβ 1 x. The standard error of ˆµ Y X=x is. Linear regression We have that the estimated mean in linear regression is The standard error of ˆµ Y X=x is where x = 1 n s.e.(ˆµ Y X=x ) = σ ˆµ Y X=x = ˆβ 0 + ˆβ 1 x. 1 n + (x x)2 i (x i x) 2 i x i. The

More information

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30 MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)

More information

Notes on the Multivariate Normal and Related Topics

Notes on the Multivariate Normal and Related Topics Version: July 10, 2013 Notes on the Multivariate Normal and Related Topics Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions

More information

Hypothesis testing: theory and methods

Hypothesis testing: theory and methods Statistical Methods Warsaw School of Economics November 3, 2017 Statistical hypothesis is the name of any conjecture about unknown parameters of a population distribution. The hypothesis should be verifiable

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

Lectures on Simple Linear Regression Stat 431, Summer 2012

Lectures on Simple Linear Regression Stat 431, Summer 2012 Lectures on Simple Linear Regression Stat 43, Summer 0 Hyunseung Kang July 6-8, 0 Last Updated: July 8, 0 :59PM Introduction Previously, we have been investigating various properties of the population

More information

Formal Statement of Simple Linear Regression Model

Formal Statement of Simple Linear Regression Model Formal Statement of Simple Linear Regression Model Y i = β 0 + β 1 X i + ɛ i Y i value of the response variable in the i th trial β 0 and β 1 are parameters X i is a known constant, the value of the predictor

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Estimation, Inference, and Hypothesis Testing

Estimation, Inference, and Hypothesis Testing Chapter 2 Estimation, Inference, and Hypothesis Testing Note: The primary reference for these notes is Ch. 7 and 8 of Casella & Berger 2. This text may be challenging if new to this topic and Ch. 7 of

More information

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018

Mathematics Ph.D. Qualifying Examination Stat Probability, January 2018 Mathematics Ph.D. Qualifying Examination Stat 52800 Probability, January 2018 NOTE: Answers all questions completely. Justify every step. Time allowed: 3 hours. 1. Let X 1,..., X n be a random sample from

More information

Foundations of Statistical Inference

Foundations of Statistical Inference Foundations of Statistical Inference Julien Berestycki Department of Statistics University of Oxford MT 2016 Julien Berestycki (University of Oxford) SB2a MT 2016 1 / 20 Lecture 6 : Bayesian Inference

More information

1 General problem. 2 Terminalogy. Estimation. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ).

1 General problem. 2 Terminalogy. Estimation. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ). Estimation February 3, 206 Debdeep Pati General problem Model: {P θ : θ Θ}. Observe X P θ, θ Θ unknown. Estimate θ. (Pick a plausible distribution from family. ) Or estimate τ = τ(θ). Examples: θ = (µ,

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

Statistics 135 Fall 2008 Final Exam

Statistics 135 Fall 2008 Final Exam Name: SID: Statistics 135 Fall 2008 Final Exam Show your work. The number of points each question is worth is shown at the beginning of the question. There are 10 problems. 1. [2] The normal equations

More information

Statistical Hypothesis Testing

Statistical Hypothesis Testing Statistical Hypothesis Testing Dr. Phillip YAM 2012/2013 Spring Semester Reference: Chapter 7 of Tests of Statistical Hypotheses by Hogg and Tanis. Section 7.1 Tests about Proportions A statistical hypothesis

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

SOLUTION FOR HOMEWORK 6, STAT 6331

SOLUTION FOR HOMEWORK 6, STAT 6331 SOLUTION FOR HOMEWORK 6, STAT 633. Exerc.7.. It is given that X,...,X n is a sample from N(θ, σ ), and the Bayesian approach is used with Θ N(µ, τ ). The parameters σ, µ and τ are given. (a) Find the joinf

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

Lecture 23 Maximum Likelihood Estimation and Bayesian Inference

Lecture 23 Maximum Likelihood Estimation and Bayesian Inference Lecture 23 Maximum Likelihood Estimation and Bayesian Inference Thais Paiva STA 111 - Summer 2013 Term II August 7, 2013 1 / 31 Thais Paiva STA 111 - Summer 2013 Term II Lecture 23, 08/07/2013 Lecture

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information