Cross-Validation with Confidence

1 Cross-Validation with Confidence
Jing Lei, Department of Statistics, Carnegie Mellon University
UMN Statistics Seminar, Mar 30, 2017

2-8 Overview

                 Parameter est.          Model selection
Point est.       MLE, M-est., ...        Cross-validation
Interval est.    Confidence interval     CVC

9 Outline
Background: cross-validation, overfitting, and uncertainty of model selection
Cross-validation with confidence: a hypothesis testing framework; p-value calculation; validity of the confidence set
Model selection consistency for (low-dim.) sparse linear models
Examples

10-11 A regression setting
Data: $D = \{(X_i, Y_i) : 1 \le i \le n\}$, i.i.d. from a joint distribution $P$ on $\mathbb{R}^p \times \mathbb{R}^1$.
$Y = f(X) + \varepsilon$, with $E(\varepsilon \mid X) = 0$.
Loss function: $\ell(\cdot,\cdot) : \mathbb{R}^2 \to \mathbb{R}$.
Goal: find $\hat f \approx f$ so that $Q(\hat f) \equiv E[\ell(\hat f(X), Y) \mid \hat f]$ is small.
The framework can be extended to unsupervised learning problems.

12-14 Model selection
Candidate set: $\mathcal{M} = \{1, \ldots, M\}$. Each $m \in \mathcal{M}$ corresponds to a candidate model.
1. $m$ can represent a competing theory about $P$ (e.g., $f$ is linear, $f$ is quadratic, variable $j$ is irrelevant, etc.).
2. $m$ can represent a particular value of a tuning parameter of a certain algorithm used to compute $\hat f$ (e.g., $\lambda$ in the lasso, the number of steps in forward selection).
Given $m$ and data $D$, there is an estimate $\hat f(D, m)$ of $f$.
Model selection: find the best $m$
1. such that it equals the true model, or
2. such that it minimizes $Q(\hat f)$ over all $m \in \mathcal{M}$ with high probability.

15 Cross-validation
Sample split: let $I_{tr}$ and $I_{te}$ be a partition of $\{1, \ldots, n\}$.
Fitting: $\hat f_m = \hat f(D_{tr}, m)$, where $D_{tr} = \{(X_i, Y_i) : i \in I_{tr}\}$.
Validation: $\hat Q(\hat f_m) = n_{te}^{-1} \sum_{i \in I_{te}} \ell(\hat f_m(X_i), Y_i)$.
CV model selection: $\hat m_{cv} = \arg\min_{m \in \mathcal{M}} \hat Q(\hat f_m)$.
V-fold cross-validation:
1. For $V \ge 2$, split the data into $V$ folds.
2. Rotate over each fold as $I_{te}$ to obtain $\hat Q^{(v)}(\hat f_m^{(v)})$.
3. $\hat m = \arg\min_m V^{-1} \sum_{v=1}^{V} \hat Q^{(v)}(\hat f_m^{(v)})$.
4. Popular choices of $V$: 10, and $V = n$ (leave-one-out cross-validation).
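To make the recipe above concrete, here is a minimal sketch (mine, not from the talk) of V-fold CV model selection in Python, using scikit-learn's Lasso over a grid of penalties as the family of fitting procedures and squared-error loss; the function name and data layout are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def vfold_cv_select(X, y, alphas, V=5, seed=0):
    """V-fold cross-validation over a grid of lasso penalties.

    Returns the index of the penalty with the smallest average held-out
    squared error, plus the vector of per-model CV errors."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V          # fold label for each observation
    cv_err = np.zeros(len(alphas))
    for v in range(V):
        te, tr = folds == v, folds != v
        for m, a in enumerate(alphas):
            fhat = Lasso(alpha=a, max_iter=10000).fit(X[tr], y[tr])
            cv_err[m] += np.mean((y[te] - fhat.predict(X[te])) ** 2) / V
    return int(np.argmin(cv_err)), cv_err

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 20))
    beta = np.r_[1.0, 1.0, 1.0, np.zeros(17)]
    y = X @ beta + rng.standard_normal(200)
    alphas = np.logspace(-3, 0, 20)
    m_cv, errs = vfold_cv_select(X, y, alphas)
    print("chosen lambda:", alphas[m_cv])
```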

16-17 Why can cross-validation be successful?
To find the best model:
1. The fitting procedure $\hat f(D, m)$ needs to be stable, so that the best model (almost) always gives the best fit.
2. Conditional inference: given $(\hat f_m : m \in \mathcal{M})$, cross-validation approximately minimizes $Q(\hat f_m)$ over all $m$.
The value of cross-validation is in conditional inference.

18-20 A simple negative example
Model: $Y = \mu + \varepsilon$, where $\varepsilon \sim N(0, 1)$. $\mathcal{M} = \{1, 2\}$; $m = 1$: $\mu = 0$; $m = 2$: $\mu \in \mathbb{R}$. Truth: $\mu = 0$.
Consider a single split: $\hat f_1 \equiv 0$, $\hat f_2 = \bar\varepsilon_{tr}$.
$\hat m_{cv} = 1 \iff 0 < \hat Q(\hat f_2) - \hat Q(\hat f_1) = \bar\varepsilon_{tr}^2 - 2\,\bar\varepsilon_{tr}\bar\varepsilon_{te}$.
If $n_{tr}/n_{te} \asymp 1$, then $\sqrt{n}\,\bar\varepsilon_{tr}$ and $\sqrt{n}\,\bar\varepsilon_{te}$ are independent normal random variables with constant variances, so $P(\hat m_{cv} = 1)$ is bounded away from 1.
(Shao 93, Zhang 93, Yang 07) $\hat m_{cv}$ is inconsistent unless $n_{tr} = o(n)$. V-fold does not help!
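A quick simulation (my own sketch, not part of the slides) makes the non-vanishing error probability visible: with a 50/50 split, the chance that CV picks the true model m = 1 stays near 0.65 no matter how large n gets.

```python
import numpy as np

# Simple negative example: mu = 0, half/half split, f1 = 0 vs. f2 = training mean.
rng = np.random.default_rng(0)
for n in (100, 1000, 10000):
    reps, correct = 5000, 0
    for _ in range(reps):
        eps = rng.standard_normal(n)
        tr, te = eps[: n // 2], eps[n // 2:]
        q1 = np.mean(te ** 2)                # test loss of f1 = 0 (true model)
        q2 = np.mean((te - tr.mean()) ** 2)  # test loss of f2 = training mean
        correct += q1 <= q2
    print(n, correct / reps)                 # hovers around ~0.65, not tending to 1
```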

21-25 A closer look at the example
Two potential sources of mistake: $\hat f(D_{tr}, m)$ is not stable, or CV does not work as expected.
Cross-validation makes a mistake if $\hat Q(\hat f_2) - \hat Q(\hat f_1) = \bar\varepsilon_{tr}^2 - 2\,\bar\varepsilon_{tr}\bar\varepsilon_{te} < 0$.
But $Q(\hat f_2) - Q(\hat f_1) = \bar\varepsilon_{tr}^2 > 0$. So the best model indeed gives the best fit. The problem is in CV!
Here $\bar\varepsilon_{tr}^2$ is the signal, and $2\,\bar\varepsilon_{tr}\bar\varepsilon_{te}$ is the noise.
Observation: cross-validation makes a mistake when it fails to take into account the uncertainty in the testing sample.

26 Cross-validation with confidence (CVC)
We want to avoid the kind of mistake made in the simple example, while still using conventional split ratios and a V-fold implementation.

27-29 A fix for the simple example: hypothesis testing
The fundamental question: when we see $\hat Q(\hat f_2) < \hat Q(\hat f_1)$, do we feel confident to say $Q(\hat f_2) < Q(\hat f_1)$?
A standard solution uses hypothesis testing: $H_0: Q(\hat f_1) \le Q(\hat f_2)$, conditioning on $\hat f_1, \hat f_2$.
This can be done using a paired-sample t-test, say with type I error level $\alpha$.
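As a sketch of this step (mine, not from the talk), the conditional comparison of two fitted models can be run as a one-sided paired t-test on the per-observation loss differences over the test set; this assumes SciPy 1.6+ for the `alternative` keyword, and the toy data layout is illustrative.

```python
import numpy as np
from scipy import stats

def paired_test(loss1, loss2):
    """One-sided paired t-test of H0: Q(f1) <= Q(f2), i.e. model 1 is no worse.

    loss1, loss2: per-observation test-set losses of the two fitted models.
    Small p-values give confidence that Q(f2) < Q(f1)."""
    d = loss1 - loss2                      # d_i = l(f1(X_i),Y_i) - l(f2(X_i),Y_i)
    return stats.ttest_1samp(d, 0.0, alternative="greater").pvalue

# Toy usage on the simple example: f1 = 0 (true), f2 = training mean.
rng = np.random.default_rng(0)
eps = rng.standard_normal(200)
tr, te = eps[:100], eps[100:]
loss1 = te ** 2                            # squared loss of f1
loss2 = (te - tr.mean()) ** 2              # squared loss of f2
print(paired_test(loss1, loss2))           # typically large: cannot claim f2 beats f1
```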

30-32 CVC for the simple example
Recall that $H_0: Q(\hat f_1) \le Q(\hat f_2)$. When $H_0$ is not rejected, does it mean we should just pick $m = 1$?
No. Because if we consider $H_0': Q(\hat f_2) \le Q(\hat f_1)$, then $H_0'$ need not be rejected either (the probability of not rejecting $H_0'$ is bounded away from 0). Most likely, we reject neither $H_0$ nor $H_0'$.
We accept both fitted models $\hat f_1$ and $\hat f_2$: they are very similar, and the difference cannot be detected from the data.

33-35 Existing work
Hansen et al. (2011, Econometrica): sequential testing, only for low-dimensional problems.
Ferrari and Yang (2014): F-tests; need a good variable screening procedure in high dimensions.
Our approach: one step, with provable coverage and power under mild assumptions in high dimensions.
Key technique: high-dimensional Gaussian comparison of sample means (Chernozhukov et al.).

36-38 CVC in general
Now suppose we have a set of candidate models $\mathcal{M} = \{1, \ldots, M\}$. Split the data into $D_{tr}$ and $D_{te}$, and use $D_{tr}$ to obtain $\hat f_m$ for each $m$. Recall that the model quality is $Q(\hat f) = E[\ell(\hat f(X), Y) \mid \hat f]$.
For each $m$, test the hypothesis (conditioning on $\hat f_1, \ldots, \hat f_M$)
$H_{0,m}: \min_{j \ne m} Q(\hat f_j) \ge Q(\hat f_m)$,
and let $\hat p_m$ be a valid p-value.
$A_{cvc} = \{m : \hat p_m > \alpha\}$ is our confidence set for the best fitted model: $P(m^* \in A_{cvc}) \ge 1 - \alpha$, where $m^* = \arg\min_m Q(\hat f_m)$.

39-43 Calculating $\hat p_m$
Recall that $D_{tr}$ is the training data and $D_{te}$ is the testing data. The test and p-values are conditional on $D_{tr}$.
Data: the $n_{te} \times (M-1)$ matrix ($I_{te}$ is the index set of $D_{te}$)
$[\xi^{(i)}_{m,j}]_{i \in I_{te},\, j \ne m}$, where $\xi^{(i)}_{m,j} = \ell(\hat f_m(X_i), Y_i) - \ell(\hat f_j(X_i), Y_i)$.
Multivariate mean testing: $H_{0,m}: E(\xi_{m,j}) \le 0$ for all $j \ne m$.
Challenges
1. High dimensionality: $M$ can be large.
2. Potentially high correlation between $\xi_{m,j}$ and $\xi_{m,j'}$.
3. Vastly different scaling: $\mathrm{Var}(\xi_{m,j})$ can be $O(1)$ or $O(n^{-1})$.
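A minimal sketch (mine) of building this matrix from per-observation test losses; the `test_losses` layout, with one column per fitted model, is an assumption for illustration.

```python
import numpy as np

def loss_diff_matrix(test_losses, m):
    """Build the n_te x (M-1) matrix xi for candidate m.

    test_losses: array of shape (n_te, M); entry (i, k) is l(fhat_k(X_i), Y_i),
    the loss of fitted model k on test observation i.
    Returns xi[i, .] = l(fhat_m(X_i), Y_i) - l(fhat_j(X_i), Y_i) over all j != m."""
    others = [j for j in range(test_losses.shape[1]) if j != m]
    return test_losses[:, [m]] - test_losses[:, others]
```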

44 Calculating $\hat p_m$
$H_{0,m}: E(\xi_{m,j}) \le 0$ for all $j \ne m$.
Let $\hat\mu_{m,j}$ and $\hat\sigma_{m,j}$ be the sample mean and standard deviation of $(\xi^{(i)}_{m,j} : i \in I_{te})$.
Naturally, one would reject $H_{0,m}$ for large values of $\max_{j \ne m} \hat\mu_{m,j}/\hat\sigma_{m,j}$.
Approximate the null distribution using high-dimensional Gaussian comparison.

45-47 Studentized Gaussian Multiplier Bootstrap
1. $T_m = \max_{j \ne m} \sqrt{n_{te}}\, \hat\mu_{m,j} / \hat\sigma_{m,j}$.
2. Let $B$ be the bootstrap sample size. For $b = 1, \ldots, B$:
   2.1 Generate i.i.d. standard Gaussian $\zeta_i$, $i \in I_{te}$.
   2.2 $T^*_b = \max_{j \ne m} \frac{1}{\sqrt{n_{te}}} \sum_{i \in I_{te}} \frac{\xi^{(i)}_{m,j} - \hat\mu_{m,j}}{\hat\sigma_{m,j}}\, \zeta_i$.
3. $\hat p_m = B^{-1} \sum_{b=1}^{B} 1(T^*_b > T_m)$.
- The studentization takes care of the scaling difference.
- The bootstrap Gaussian comparison takes care of the dimensionality and correlation.
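Here is a numpy sketch (mine, following the algorithm above) of the studentized Gaussian multiplier bootstrap p-value, together with the resulting confidence set A_cvc = {m : p_hat_m > alpha}; function names and the per-observation loss layout are assumptions.

```python
import numpy as np

def cvc_pvalue(xi, B=2000, seed=0):
    """Bootstrap p-value for one candidate m, testing H_{0,m}: E(xi_{m,j}) <= 0 for all j != m.

    xi: n_te x (M-1) array with xi[i, .] = l(fhat_m(X_i), Y_i) - l(fhat_j(X_i), Y_i)."""
    rng = np.random.default_rng(seed)
    n_te = xi.shape[0]
    mu = xi.mean(axis=0)
    sigma = np.maximum(xi.std(axis=0, ddof=1), 1e-12)  # guard against identical models
    T_m = np.sqrt(n_te) * np.max(mu / sigma)
    scores = (xi - mu) / sigma                # centered, studentized scores
    zeta = rng.standard_normal((B, n_te))     # Gaussian multipliers, one row per draw
    T_boot = (zeta @ scores / np.sqrt(n_te)).max(axis=1)
    return np.mean(T_boot > T_m)

def cvc_confidence_set(test_losses, alpha=0.05, B=2000):
    """A_cvc = {m : p_hat_m > alpha}, given per-observation test losses (n_te x M)."""
    M = test_losses.shape[1]
    pvals = []
    for m in range(M):
        others = [j for j in range(M) if j != m]
        xi = test_losses[:, [m]] - test_losses[:, others]
        pvals.append(cvc_pvalue(xi, B=B))
    return [m for m in range(M) if pvals[m] > alpha], pvals
```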

48 Properties of CVC
$A_{cvc} = \{m : \hat p_m > \alpha\}$. Let $\hat m_{cv} = \arg\min_m \hat Q(\hat f_m)$. By construction, $T_{\hat m_{cv}} \le 0$.
Proposition. If $\alpha < 0.5$, then $P(\hat m_{cv} \in A_{cvc}) \to 1$ as $B \to \infty$.
Proof: $\big[ n_{te}^{-1/2} \sum_{i \in I_{te}} \frac{\xi^{(i)}_{m,j} - \hat\mu_{m,j}}{\hat\sigma_{m,j}}\, \zeta_i \big]_{j \ne m}$ is a zero-mean Gaussian random vector, so the upper $\alpha$ quantile of its maximum must be positive.
We can view $\hat m_{cv}$ as the center of the confidence set.

49 Coverage of $A_{cvc}$
Recall $\xi_{m,j} = \ell(\hat f_m(X), Y) - \ell(\hat f_j(X), Y)$, with an independent $(X, Y)$. Let $\mu_{m,j} = E[\xi_{m,j} \mid \hat f_m, \hat f_j]$, $\sigma^2_{m,j} = \mathrm{Var}[\xi_{m,j} \mid \hat f_m, \hat f_j]$.
Theorem. Assume that $(\xi_{m,j} - \mu_{m,j})/(A_n \sigma_{m,j})$ has sub-exponential tails for all $m \ne j$ and some $A_n \ge 1$ such that $A_n^6 \log^7(M \vee n) = O(n^{1-c})$ for some $c > 0$.
1. If $\max_{j \ne m} (\mu_{m,j}/\sigma_{m,j})_+ = o\big(1/\sqrt{n \log(M \vee n)}\big)$, then $P(m \in A_{cvc}) \ge 1 - \alpha + o(1)$.
2. If $\max_{j \ne m} (\mu_{m,j}/\sigma_{m,j})_+ \ge C A_n \sqrt{\log(M \vee n)/n}$ for some constant $C$, and $\alpha \ge n^{-1}$, then $P(m \in A_{cvc}) = o(1)$.

50 Coverage of $A_{cvc}$
1. If $m^*$ is the best fitted model, minimizing $Q(\hat f_m)$ over all $\{\hat f_m : m \in \mathcal{M}\}$, then $\mu_{m^*,j}/\sigma_{m^*,j} \le 0$ for all $j$. Thus $P(m^* \in A_{cvc}) \ge 1 - \alpha + o(1)$.
2. If $\mu_{m^*,j} = 0$ for all $j$, then $P(m^* \in A_{cvc}) = 1 - \alpha + o(1)$.
3. Part 2 of the theorem ensures that bad models are excluded with high probability.

51 Proof of coverage
Let $Z(\Sigma) = \max N(0, \Sigma)$, and $z(1-\alpha, \Sigma)$ its $1-\alpha$ quantile. Let $\hat\Gamma$ and $\Gamma$ be the sample and population correlation matrices of $(\xi^{(i)}_{m,j})_{i \in I_{te},\, j \ne m}$.
When $B \to \infty$,
$P(\hat p_m \le \alpha) = P\big(\max_j \sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j} \ge z(1-\alpha, \hat\Gamma)\big)$.
Tools (2 and 3 are due to Chernozhukov et al.):
1. Concentration: $\sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j} \le \sqrt{n_{te}}\, (\hat\mu_{m,j} - \mu_{m,j})/\sigma_{m,j} + o(1/\sqrt{\log M})$.
2. Gaussian comparison: $\max_j \sqrt{n_{te}}\, (\hat\mu_{m,j} - \mu_{m,j})/\sigma_{m,j} \approx_d Z(\Gamma) \approx_d Z(\hat\Gamma)$.
3. Anti-concentration: $Z(\hat\Gamma)$ and $Z(\Gamma)$ have densities bounded by $O(\sqrt{\log M})$.

52-57 V-fold CVC
Split the data into $V$ folds. Let $v_i$ be the fold that contains data point $i$.
Let $\hat f_{m,v}$ be the estimate using model $m$ and all data but fold $v$.
$\xi^{(i)}_{m,j} = \ell(\hat f_{m,v_i}(X_i), Y_i) - \ell(\hat f_{j,v_i}(X_i), Y_i)$, for all $1 \le i \le n$.
Calculate $T_m$ and $T^*_b$ correspondingly using the $n \times (M-1)$ cross-validated error difference matrix $(\xi^{(i)}_{m,j})_{1 \le i \le n,\, j \ne m}$.
Rigorous justification is hard due to dependence between folds, but empirically this works much better.
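A sketch (mine) of the V-fold version: compute the n x M matrix of cross-validated losses, whose column differences can be fed into the same bootstrap p-value as in the earlier sketch; Lasso over a penalty grid and squared loss are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def cv_loss_matrix(X, y, alphas, V=5, seed=0):
    """n x M matrix of cross-validated squared losses.

    Entry (i, m) is the loss of fhat_{m, v_i} on observation i, where
    fhat_{m, v} is fit on all data except fold v."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V
    losses = np.empty((n, len(alphas)))
    for v in range(V):
        te, tr = folds == v, folds != v
        for m, a in enumerate(alphas):
            fhat = Lasso(alpha=a, max_iter=10000).fit(X[tr], y[tr])
            losses[te, m] = (y[te] - fhat.predict(X[te])) ** 2
    return losses
```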

58 Example: the diabetes data (Efron et al. 04)
$n = 442$, with 10 covariates: age, sex, bmi, blood pressure, etc. Response: diabetes progression after one year. Including all quadratic terms, $p = 64$.
5-fold CVC with $\alpha = 0.05$, using the lasso with 50 values of $\lambda$.
[Figure: test error against $\log(\lambda)$; triangles mark models in $A_{cvc}$, the solid triangle marks $\hat m_{cv}$.]

59 Simulations: coverage of $A_{cvc}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated).
$\beta = (1,1,1,0,\ldots,0)^T$ (simple), or $\beta = (1,1,1,0.7,0.5,0.3,0,\ldots,0)^T$ (mixed).
5-fold CVC with $\alpha = 0.05$ using the lasso with 50 values of $\lambda$.

setting of (Σ, β)      coverage     |A_cvc|      cv is opt.
identity, simple       .92 (.03)    5.1 (.19)    .27 (.04)
identity, mixed        .95 (.02)    5.1 (.18)    .37 (.05)
correlated, simple     .96 (.02)    7.5 (.18)    .18 (.04)
correlated, mixed      .93 (.03)    7.4 (.23)    .19 (.04)

60 Simulations: coverage of $A_{cvc}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated).
$\beta = (1,1,1,0,\ldots,0)^T$ (simple), or $\beta = (1,1,1,0.7,0.5,0.3,0,\ldots,0)^T$ (mixed).
5-fold CVC with $\alpha = 0.05$ using forward stepwise.

setting of (Σ, β)      coverage     |A_cvc|      cv is opt.
identity, simple       1 (0)        3.7 (.29)    .87 (.03)
identity, mixed        .95 (.02)    5.2 (.33)    .58 (.05)
correlated, simple     .97 (.02)    4.1 (.31)    .80 (.04)
correlated, mixed      .93 (.03)    6.3 (.36)    .44 (.05)

61-64 How to use $A_{cvc}$?
We are often interested in picking one model, not a subset of models. $A_{cvc}$ provides some flexibility in picking among a subset of highly competitive models.
1. $A_{cvc}$ may contain a model that includes a particularly interesting variable.
2. $A_{cvc}$ can be used to answer questions like: is fitting procedure A better than procedure B?
3. We can also simply choose the most parsimonious model in $A_{cvc}$.

65 The most parsimonious model in $A_{cvc}$
Now consider the linear regression problem $Y = X^T\beta + \varepsilon$. Let $J_m$ be the subset of variables selected using model $m$:
$\hat m_{cvc.min} = \arg\min_{m \in A_{cvc}} |J_m|$.
$\hat m_{cvc.min}$ is the simplest model that gives a similar predictive risk as $\hat m_{cv}$.
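As a sketch (mine), picking the most parsimonious member of the confidence set is a one-liner once the support size |J_m| of each candidate is known; `support_sizes` is an assumed lookup, e.g. the number of nonzero lasso coefficients of each model.

```python
def cvc_min(A_cvc, support_sizes):
    """Among the models kept in A_cvc, pick the one using the fewest variables."""
    return min(A_cvc, key=lambda m: support_sizes[m])
```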

66 A classical setting
$Y = X^T\beta + \varepsilon$, $X \in \mathbb{R}^p$, $\mathrm{Var}(X) = \Sigma$ has full rank. $\varepsilon$ has mean zero and variance $\sigma^2 < \infty$.
Assume that $(p, \Sigma, \sigma^2)$ are fixed and $n \to \infty$. $\mathcal{M}$ contains the true model $m^*$ and at least one overfitting model. $n_{tr}/n_{te} \asymp 1$.
Using squared loss, the true model and all overfitting models give $\sqrt{n}$-consistent estimates.
Early results (Shao 93, Zhang 93, Yang 07) show that $P(\hat m_{cv} \ne m^*)$ is bounded away from 0.

67-69 Consistency of $\hat m_{cvc.min}$
Theorem. Assume that $X$ and $\varepsilon$ are independent and sub-Gaussian, and $A_{cvc}$ is the output of CVC with $\alpha = o(1)$ and $\alpha \ge n^{-1}$. Then
$\lim_{n \to \infty} P(\hat m_{cvc.min} = m^*) = 1$.
Sub-Gaussianity of $X$ and $\varepsilon$ implies that $(Y - X^T\beta)^2$ is sub-exponential.
We can allow $p$ to grow slowly with $n$ using a union bound.

70 Example in low-dim. variable selection
Synthetic data with $p = 5$, $n = 40$, as in [Shao 93]. $Y = X^T\beta + \varepsilon$, $\beta = (2, 9, 0, 4, 8)^T$, $\varepsilon \sim N(0, 1)$.
Generated additional rows for $n = 60, 80, 100, 120, 140, 160$.
Candidate models: $(1,4,5)$, $(1,2,4,5)$, $(1,3,4,5)$, $(1,2,3,4,5)$.
Repeated 1000 times, using OLS with 5-fold CVC.
[Figure: rate of correct selection against $n$, for cvc.min and cv.]

71 Simulations: variable selection with $\hat m_{cvc.min}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated). $\beta = (1,1,1,0,\ldots,0)^T$ (simple).
5-fold CVC with $\alpha = 0.05$ using forward stepwise.

setting of (Σ, β)      oracle    m_cvc.min    m_cv
identity, simple
correlated, simple

Proportion of correct model selection over 100 independent data sets. Oracle method: the number of steps that gives the smallest prediction risk.

72 Simulations: variable selection with $\hat m_{cvc.min}$
$Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0, 1)$, $n = 200$, $p = 200$.
$\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated). $\beta = (1,1,1,0,\ldots,0)^T$ (simple).
5-fold CVC with $\alpha = 0.05$ using Lasso + Least Squares.

setting of (Σ, β)      oracle    m_cvc.min    m_cv
identity, simple
correlated, simple

Proportion of correct model selection over 100 independent data sets. Oracle method: the $\lambda$ value that gives the smallest prediction risk.

73 The diabetes data revisited
Split $n = 442$ into 300 (estimation) and 142 (risk approximation).
5-fold CVC is applied to the 300 sample points, with a final re-fit. The final estimate is evaluated on the 142 hold-out samples.
Repeat 100 times, using the lasso with 50 values of $\lambda$.
[Figure: boxplots of test error and model size, comparing cv and cvc.min.]

74 Summary
CVC: confidence sets for model selection. $\hat m_{cvc.min}$ has similar risk to $\hat m_{cv}$ while using a simpler model.
Extensions:
Validity of CVC in high-dimensional and nonparametric settings.
Unsupervised problems: 1. clustering, 2. matrix decomposition (PCA, SVD, etc.), 3. network models.
Other sample-splitting-based inference methods.

75 Thanks! Questions? Paper: Cross-Validation with Confidence, arxiv.org/ Slides:
