Cross-Validation with Confidence

1 Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University WHOA-PSI Workshop, St Louis, 2017

2 Quotes from Day 1 and Day 2: "Good model or pure model?" "Occam's razor." "We really want a λ that looks reasonable."

9 Overview

                 Parameter est.        Model selection
Point est.       MLE, M-est., ...      Cross-validation
Interval est.    Confidence interval   CVC

10 Outline: background (cross-validation, overfitting, and the uncertainty of model selection); cross-validation with confidence, including a hypothesis testing framework, p-value calculation, and validity of the confidence set; examples.

12 A regression setting. Data: $D = \{(X_i, Y_i) : 1 \le i \le n\}$, i.i.d. from a joint distribution $P$ on $\mathbb{R}^p \times \mathbb{R}$. Model: $Y = f(X) + \varepsilon$, with $E(\varepsilon \mid X) = 0$. Loss function: $\ell(\cdot,\cdot) : \mathbb{R}^2 \to \mathbb{R}$. Goal: find $\hat f \approx f$ so that $Q(\hat f) \equiv E[\ell(\hat f(X), Y) \mid \hat f]$ is small. The framework can be extended to unsupervised learning problems, including network data.

15 Model selection. Candidate set: $\mathcal{M} = \{1, \ldots, M\}$; each $m \in \mathcal{M}$ corresponds to a candidate model. 1. $m$ can represent a competing theory about $P$ (e.g., $f$ is linear, $f$ is quadratic, variable $j$ is irrelevant, etc.). 2. $m$ can represent a particular value of a tuning parameter of an algorithm used to compute $\hat f$ (e.g., $\lambda$ in the lasso, the number of steps in forward selection). Given $m$ and data $D$, there is an estimate $\hat f(D, m)$ of $f$. Model selection: find the best $m$, either 1. such that it equals the true model, or 2. such that it minimizes $Q(\hat f)$ over all $m \in \mathcal{M}$ with high probability.

16 Cross-validation. Sample split: let $I_{tr}$ and $I_{te}$ be a partition of $\{1, \ldots, n\}$. Fitting: $\hat f_m = \hat f(D_{tr}, m)$, where $D_{tr} = \{(X_i, Y_i) : i \in I_{tr}\}$. Validation: $\hat Q(\hat f_m) = n_{te}^{-1} \sum_{i \in I_{te}} \ell(\hat f_m(X_i), Y_i)$. CV model selection: $\hat m_{cv} = \arg\min_{m \in \mathcal{M}} \hat Q(\hat f_m)$. V-fold cross-validation: 1. For $V \ge 2$, split the data into $V$ folds. 2. Rotate over each fold as $I_{te}$ to obtain $\hat Q^{(v)}(\hat f_m^{(v)})$. 3. $\hat m = \arg\min_m V^{-1} \sum_{v=1}^{V} \hat Q^{(v)}(\hat f_m^{(v)})$. 4. Popular choices of $V$: 10, and $V = n$ (leave-one-out cross-validation).
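As a concrete illustration of the procedure above, here is a minimal Python/numpy sketch of V-fold cross-validated model selection; it is not from the talk, and the `fitters` and `loss` callables, the fold assignment, and all variable names are placeholders.

```python
import numpy as np

def vfold_cv_select(X, y, fitters, loss, V=10, seed=0):
    """Pick the candidate with the smallest V-fold cross-validated risk.

    fitters: list of callables; each takes (X_tr, y_tr) and returns a
             prediction function f_hat(X_new). One entry per candidate m.
    loss:    elementwise loss, e.g. lambda pred, y: (pred - y) ** 2.
    """
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V            # random fold assignment v_i
    cv_risk = np.zeros(len(fitters))
    for v in range(V):
        te = folds == v                       # fold v is held out for validation
        tr = ~te
        for m, fit in enumerate(fitters):
            f_hat = fit(X[tr], y[tr])         # train on all data but fold v
            cv_risk[m] += loss(f_hat(X[te]), y[te]).sum()
    cv_risk /= n                              # average validation loss Q_hat
    return int(np.argmin(cv_risk)), cv_risk   # m_hat_cv and all CV risks
```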

19 A simple negative example. Model: $Y = \mu + \varepsilon$, where $\varepsilon \sim N(0,1)$. $\mathcal{M} = \{1, 2\}$; $m = 1$: $\mu = 0$; $m = 2$: $\mu \in \mathbb{R}$. Truth: $\mu = 0$. Consider a single split: $\hat f_1 \equiv 0$, $\hat f_2 \equiv \bar\varepsilon_{tr}$ (the training sample mean). Then $\hat m_{cv} = 1 \iff 0 < \hat Q(\hat f_2) - \hat Q(\hat f_1) = \bar\varepsilon_{tr}^2 - 2\,\bar\varepsilon_{tr}\bar\varepsilon_{te}$. If $n_{tr}/n_{te} \asymp 1$, then $\sqrt{n}\,\bar\varepsilon_{tr}$ and $\sqrt{n}\,\bar\varepsilon_{te}$ are independent normal random variables with constant variances, so $P(\hat m_{cv} = 1)$ is bounded away from 1. (Shao 93, Zhang 93, Yang 07): $\hat m_{cv}$ is inconsistent unless $n_{tr} = o(n)$. V-fold does not help!
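The claim that $P(\hat m_{cv} = 1)$ stays bounded away from 1 is easy to check numerically; the following small simulation sketch (not from the talk) uses a half/half split under the true model $\mu = 0$, with illustrative constants.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 200, 5000
picks_null = 0
for _ in range(reps):
    eps = rng.standard_normal(n)             # Y = mu + eps with mu = 0
    tr, te = eps[: n // 2], eps[n // 2 :]    # half/half split, n_tr / n_te = 1
    mu_hat = tr.mean()                       # f_hat_2 fitted on the training half
    q1 = np.mean(te ** 2)                    # validation risk of f_hat_1 = 0
    q2 = np.mean((te - mu_hat) ** 2)         # validation risk of f_hat_2
    picks_null += q1 <= q2                   # CV picks the true (null) model
print("P(m_cv = 1) approx", picks_null / reps)   # stabilizes well below 1
```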

22 A fix for the simple example: hypothesis testing. The fundamental question: when we see $\hat Q(\hat f_2) < \hat Q(\hat f_1)$, do we feel confident to say $Q(\hat f_2) < Q(\hat f_1)$? A standard solution uses hypothesis testing: $H_0 : Q(\hat f_1) \le Q(\hat f_2)$, conditioning on $\hat f_1, \hat f_2$. This can be done with a paired-sample t-test, say at type I error level $\alpha$.
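A minimal sketch of that paired test, assuming squared-error loss and a held-out test fold; the one-sided alternative is that $\hat f_2$ has strictly smaller risk than $\hat f_1$, and the function and argument names are illustrative only.

```python
import numpy as np
from scipy import stats

def paired_risk_test(f1_pred, f2_pred, y_te):
    """One-sided paired t-test of H0: Q(f1) <= Q(f2) on the test fold."""
    d = (f1_pred - y_te) ** 2 - (f2_pred - y_te) ** 2   # per-point loss differences
    t_stat, p_two = stats.ttest_1samp(d, popmean=0.0)
    # reject H0 for a large positive mean difference (f2 predicts better)
    p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2
    return t_stat, p_one
```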

25 CVC for the simple example. Recall that $H_0 : Q(\hat f_1) \le Q(\hat f_2)$. When $H_0$ is not rejected, does that mean we should just pick $m = 1$? No. Because if we consider $H_0' : Q(\hat f_2) \le Q(\hat f_1)$, then $H_0'$ will not be rejected either (the probability of not rejecting $H_0'$ is bounded away from 0). Most likely, we reject neither $H_0$ nor $H_0'$. We accept both fitted models $\hat f_1$ and $\hat f_2$: they are very similar, and the difference cannot be detected from the data.

28 Existing work. Hansen et al. (2011, Econometrica): sequential testing, only for low-dimensional problems. Ferrari and Yang (2014): F-tests, which need a good variable screening procedure in high dimensions. Our approach: one step, with provable coverage and power under mild assumptions in high dimensions. Key technique: high-dimensional Gaussian comparison of sample means (Chernozhukov et al.).

31 CVC in general. Now suppose we have a set of candidate models $\mathcal{M} = \{1, \ldots, M\}$. Split the data into $D_{tr}$ and $D_{te}$, and use $D_{tr}$ to obtain $\hat f_m$ for each $m$. Recall that the model quality is $Q(\hat f) = E[\ell(\hat f(X), Y) \mid \hat f]$. For each $m$, test the hypothesis (conditioning on $\hat f_1, \ldots, \hat f_M$) $H_{0,m} : \min_{j \ne m} Q(\hat f_j) \ge Q(\hat f_m)$, and let $\hat p_m$ be a valid p-value. Then $A_{cvc} = \{m : \hat p_m > \alpha\}$ is our confidence set for the best fitted model: $P(m^* \in A_{cvc}) \ge 1 - \alpha$, where $m^* = \arg\min_m Q(\hat f_m)$.

36 Calculating $\hat p_m$. Recall that $D_{tr}$ is the training data and $D_{te}$ is the testing data; the test and p-values are conditional on $D_{tr}$. Data: the $n_{te} \times (M-1)$ matrix ($I_{te}$ is the index set of $D_{te}$) $[\xi^{(i)}_{m,j}]_{i \in I_{te},\, j \ne m}$, where $\xi^{(i)}_{m,j} = \ell(\hat f_m(X_i), Y_i) - \ell(\hat f_j(X_i), Y_i)$. This is a multivariate mean testing problem: $H_{0,m} : E(\xi_{m,j}) \le 0$ for all $j \ne m$. Challenges: 1. High dimensionality: $M$ can be large. 2. Potentially high correlation between $\xi_{m,j}$ and $\xi_{m,j'}$. 3. Vastly different scaling: $\mathrm{Var}(\xi_{m,j})$ can be $O(1)$ or $O(n^{-1})$.
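To make the data structure concrete, here is a small sketch (not from the talk) that builds the $n_{te} \times (M-1)$ loss-difference matrix for a given $m$, assuming squared-error loss; `preds` is assumed to hold each candidate's predictions on the test fold.

```python
import numpy as np

def loss_diff_matrix(preds, y_te, m):
    """xi[i, k] = loss of model m minus loss of its k-th competitor at test point i.

    preds: array of shape (M, n_te), predictions of each candidate on D_te.
    """
    losses = (preds - y_te[None, :]) ** 2     # squared-error loss, shape (M, n_te)
    others = np.delete(losses, m, axis=0)     # the M - 1 competitors j != m
    xi = losses[m][None, :] - others          # shape (M - 1, n_te)
    return xi.T                               # n_te x (M - 1), rows indexed by i
```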

37 Calculating $\hat p_m$. $H_{0,m} : E(\xi_{m,j}) \le 0$ for all $j \ne m$. Let $\hat\mu_{m,j}$ and $\hat\sigma_{m,j}$ be the sample mean and standard deviation of $(\xi^{(i)}_{m,j} : i \in I_{te})$. Naturally, one would reject $H_{0,m}$ for large values of $\max_{j \ne m} \hat\mu_{m,j}/\hat\sigma_{m,j}$. Approximate the null distribution using high-dimensional Gaussian comparison.

40 Studentized Gaussian multiplier bootstrap. 1. $T_m = \max_{j \ne m} \sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j}$. 2. Let $B$ be the bootstrap sample size. For $b = 1, \ldots, B$: 2.1 generate i.i.d. standard Gaussian $\zeta_i$, $i \in I_{te}$; 2.2 $T^*_b = \max_{j \ne m} \frac{1}{\sqrt{n_{te}}} \sum_{i \in I_{te}} \frac{\xi^{(i)}_{m,j} - \hat\mu_{m,j}}{\hat\sigma_{m,j}}\, \zeta_i$. 3. $\hat p_m = B^{-1} \sum_{b=1}^{B} 1(T^*_b > T_m)$. The studentization takes care of the scaling difference; the bootstrap Gaussian comparison takes care of the dimensionality and correlation.
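A minimal numpy sketch of this studentized multiplier bootstrap, written against the loss-difference matrix `xi` from the sketch above (one $n_{te} \times (M-1)$ array per candidate); looping it over all candidates and keeping those with $\hat p_m > \alpha$ gives the confidence set $A_{cvc}$. All names are illustrative, not the authors' implementation.

```python
import numpy as np

def cvc_pvalue(xi, B=1000, seed=0):
    """Studentized Gaussian multiplier bootstrap p-value for H_{0,m}.

    xi: n_te x (M-1) array; column j holds the loss differences xi^{(i)}_{m,j}.
    """
    rng = np.random.default_rng(seed)
    n_te = xi.shape[0]
    mu = xi.mean(axis=0)                                    # hat mu_{m,j}
    sd = xi.std(axis=0, ddof=1)                             # hat sigma_{m,j}
    t_obs = np.sqrt(n_te) * np.max(mu / sd)                 # observed statistic T_m
    centered = (xi - mu) / sd                               # studentized, centered columns
    zeta = rng.standard_normal((B, n_te))                   # Gaussian multipliers zeta_i
    t_boot = (zeta @ centered / np.sqrt(n_te)).max(axis=1)  # T*_b, b = 1..B
    return float(np.mean(t_boot > t_obs))                   # hat p_m

def cvc_set(xi_by_model, alpha=0.05):
    """Confidence set A_cvc = {m : hat p_m > alpha}."""
    return [m for m, xi in enumerate(xi_by_model) if cvc_pvalue(xi) > alpha]
```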

41 Properties of CVC. $A_{cvc} = \{m : \hat p_m > \alpha\}$. Let $\hat m_{cv} = \arg\min_m \hat Q(\hat f_m)$; by construction $T_{\hat m_{cv}} \le 0$. Proposition: if $\alpha < 0.5$, then $P(\hat m_{cv} \in A_{cvc}) \to 1$ as $B \to \infty$. Proof: $\left[\frac{1}{\sqrt{n_{te}}} \sum_{i \in I_{te}} \frac{\xi^{(i)}_{m,j} - \hat\mu_{m,j}}{\hat\sigma_{m,j}}\, \zeta_i\right]_{j \ne m}$ is a zero-mean Gaussian random vector, so the upper $\alpha$ quantile of its maximum must be positive. One can view $\hat m_{cv}$ as the center of the confidence set.

42 Coverage of $A_{cvc}$. Recall $\xi_{m,j} = \ell(\hat f_m(X), Y) - \ell(\hat f_j(X), Y)$, where $(X, Y)$ is a new draw independent of $D_{tr}$. Let $\mu_{m,j} = E[\xi_{m,j} \mid \hat f_m, \hat f_j]$ and $\sigma^2_{m,j} = \mathrm{Var}[\xi_{m,j} \mid \hat f_m, \hat f_j]$. Theorem. Assume that $(\xi_{m,j} - \mu_{m,j})/(A_n \sigma_{m,j})$ has a sub-exponential tail for all $m \ne j$ and some $A_n \ge 1$ such that $A_n^6 \log^7(M \vee n) = O(n^{1-c})$ for some $c > 0$. 1. If $\max_{j \ne m} (\mu_{m,j}/\sigma_{m,j})_+ = o\!\left(1/\sqrt{n \log(M \vee n)}\right)$, then $P(m \in A_{cvc}) \ge 1 - \alpha + o(1)$. 2. If $\max_{j \ne m} (\mu_{m,j}/\sigma_{m,j})_+ \ge C A_n \sqrt{\log(M \vee n)/n}$ for some constant $C$, and $\alpha \gg n^{-1}$, then $P(m \in A_{cvc}) = o(1)$.

43 Proof of coverage. Let $Z(\Sigma) = \max N(0, \Sigma)$ (the maximum coordinate of a centered Gaussian vector with covariance $\Sigma$), and $z(1-\alpha, \Sigma)$ its $1-\alpha$ quantile. Let $\hat\Gamma$ and $\Gamma$ be the sample and population correlation matrices of $(\xi^{(i)}_{m,j})_{i \in I_{te},\, j \ne m}$. When $B \to \infty$, $P(\hat p_m \le \alpha) = P\!\left(\max_{j \ne m} \sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j} \ge z(1-\alpha, \hat\Gamma)\right)$. Tools (2 and 3 are due to Chernozhukov et al.): 1. Concentration: $\sqrt{n_{te}}\, \hat\mu_{m,j}/\hat\sigma_{m,j} \le \sqrt{n_{te}}\, (\hat\mu_{m,j} - \mu_{m,j})/\sigma_{m,j} + o(1/\sqrt{\log M})$ under $H_{0,m}$. 2. Gaussian comparison: $\max_{j \ne m} \sqrt{n_{te}}\, (\hat\mu_{m,j} - \mu_{m,j})/\sigma_{m,j} \approx_d Z(\Gamma) \approx_d Z(\hat\Gamma)$. 3. Anti-concentration: $Z(\hat\Gamma)$ and $Z(\Gamma)$ have densities of order at most $\sqrt{\log M}$.

49 V-fold CVC. Split the data into $V$ folds. Let $v_i$ be the fold that contains data point $i$, and let $\hat f_{m,v}$ be the estimate using model $m$ and all data but fold $v$. Define $\xi^{(i)}_{m,j} = \ell(\hat f_{m,v_i}(X_i), Y_i) - \ell(\hat f_{j,v_i}(X_i), Y_i)$ for all $1 \le i \le n$. Calculate $T_m$ and $T^*_b$ correspondingly using the $n \times (M-1)$ cross-validated error difference matrix $(\xi^{(i)}_{m,j})_{1 \le i \le n,\, j \ne m}$. Rigorous justification is hard due to dependence between folds, but empirically this version works much better.
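A sketch of how the cross-validated error-difference matrix could be assembled (illustrative names, squared-error loss assumed); it reuses the fold structure from the V-fold CV sketch earlier, and the resulting per-model matrix can be fed into the same bootstrap routine.

```python
import numpy as np

def vfold_loss_matrix(X, y, fitters, V=5, seed=0):
    """Return the n x M matrix of cross-validated losses l(f_{m, v_i}(X_i), Y_i)."""
    n, M = len(y), len(fitters)
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % V
    losses = np.empty((n, M))
    for v in range(V):
        te = folds == v
        for m, fit in enumerate(fitters):
            f_hat = fit(X[~te], y[~te])                  # f_{m, v}: fit without fold v
            losses[te, m] = (f_hat(X[te]) - y[te]) ** 2  # squared-error loss on fold v
    return losses

# xi matrix for candidate m: column j holds xi^{(i)}_{m,j} for a competitor j != m
# losses = vfold_loss_matrix(X, y, fitters)
# xi_m = losses[:, [m]] - np.delete(losses, m, axis=1)
```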

50 Example: the diabetes data (Efron et al. 04). $n = 442$, with 10 covariates: age, sex, bmi, blood pressure, etc. The response is diabetes progression after one year. Including all quadratic terms, $p = 64$. 5-fold CVC with $\alpha = 0.05$, using the lasso with 50 values of $\lambda$. [Figure: test error vs. $\log(\lambda)$. Triangles: models in $A_{cvc}$; solid triangle: $\hat m_{cv}$.]

51 Simulations: coverage of $A_{cvc}$. $Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0,1)$, $n = 200$, $p = 200$. $\Sigma = I_{200}$ (identity), or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated). $\beta = (1,1,1,0,\ldots,0)^T$ (simple), or $\beta = (1,1,1,0.7,0.5,0.3,0,\ldots,0)^T$ (mixed). 5-fold CVC with $\alpha = 0.05$ using the lasso with 50 values of $\lambda$.

setting of (Σ, β)      coverage    |A_cvc|     cv is opt.
identity, simple       .92 (.03)   5.1 (.19)   .27 (.04)
identity, mixed        .95 (.02)   5.1 (.18)   .37 (.05)
correlated, simple     .96 (.02)   7.5 (.18)   .18 (.04)
correlated, mixed      .93 (.03)   7.4 (.23)   .19 (.04)

52 Simulations: coverage of $A_{cvc}$ (continued). Same data-generating setup as above, with 5-fold CVC at $\alpha = 0.05$ using forward stepwise selection.

setting of (Σ, β)      coverage    |A_cvc|     cv is opt.
identity, simple       1 (0)       3.7 (.29)   .87 (.03)
identity, mixed        .95 (.02)   5.2 (.33)   .58 (.05)
correlated, simple     .97 (.02)   4.1 (.31)   .80 (.04)
correlated, mixed      .93 (.03)   6.3 (.36)   .44 (.05)
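For readers who want to reproduce the flavor of these simulations, here is a hedged sketch of the data-generating design only (not the full experiment); the correlation level `delta` and the seed are free parameters chosen for illustration, not values stated in the talk.

```python
import numpy as np

def simulate_design(n=200, p=200, delta=0.0, beta=None, seed=0):
    """Draw (X, Y): X ~ N(0, Sigma) with Sigma_jk = delta^|j-k|, Y = X beta + N(0, 1)."""
    rng = np.random.default_rng(seed)
    if beta is None:
        beta = np.zeros(p)
        beta[:3] = 1.0                        # the "simple" coefficient vector (1,1,1,0,...,0)
    idx = np.arange(p)
    sigma = delta ** np.abs(idx[:, None] - idx[None, :]) if delta else np.eye(p)
    X = rng.standard_normal((n, p)) @ np.linalg.cholesky(sigma).T
    y = X @ beta + rng.standard_normal(n)
    return X, y
```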

56 How to use $A_{cvc}$? We are often interested in picking one model, not a subset of models. $A_{cvc}$ provides some flexibility in picking among a subset of highly competitive models. 1. $A_{cvc}$ may contain a model that includes a particularly interesting variable. 2. $A_{cvc}$ can be used to answer questions like: is fitting procedure A better than procedure B? 3. We can also simply choose the most parsimonious model in $A_{cvc}$.

57 The most parsimonious model in $A_{cvc}$. Now consider the linear regression problem $Y = X^T\beta + \varepsilon$. Let $J_m$ be the subset of variables selected by model $m$, and define $\hat m_{cvc.min} = \arg\min_{m \in A_{cvc}} |J_m|$. Then $\hat m_{cvc.min}$ is the simplest model that gives a predictive risk similar to $\hat m_{cv}$.
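Continuing the earlier sketches, this rule is a one-liner once the confidence set and each candidate's selected variable set are available (the names below are illustrative, not from the talk):

```python
# A_cvc: list of model indices with hat p_m > alpha (e.g., from cvc_set above)
# selected_vars: dict mapping model index m to the set J_m of selected variables
def cvc_min(A_cvc, selected_vars):
    """Most parsimonious member of the confidence set: argmin_{m in A_cvc} |J_m|."""
    return min(A_cvc, key=lambda m: len(selected_vars[m]))
```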

58 A classical setting. $Y = X^T\beta + \varepsilon$, $X \in \mathbb{R}^p$, $\mathrm{Var}(X) = \Sigma$ has full rank; $\varepsilon$ has mean zero and variance $\sigma^2 < \infty$. Assume that $(p, \Sigma, \sigma^2)$ are fixed and $n \to \infty$. $\mathcal{M}$ contains the true model $m^*$ and at least one overfitting model; $n_{tr}/n_{te} \asymp 1$. Using squared loss, the true model and all overfitting models give $\sqrt{n}$-consistent estimates. Early results (Shao 93, Zhang 93, Yang 07) show that $P(\hat m_{cv} \ne m^*)$ is bounded away from 0.

61 Consistency of $\hat m_{cvc.min}$. Theorem. Assume that $X$ and $\varepsilon$ are independent and sub-Gaussian, and that $A_{cvc}$ is the output of CVC with $\alpha = o(1)$ and $\alpha \gg n^{-1}$; then $\lim_{n \to \infty} P(\hat m_{cvc.min} = m^*) = 1$. Sub-Gaussianity of $X$ and $\varepsilon$ implies that $(Y - X^T\beta)^2$ is sub-exponential. One can allow $p$ to grow slowly with $n$ using a union bound.

62 Example in low-dimensional variable selection. Synthetic data with $p = 5$, $n = 40$, as in [Shao 93]. $Y = X^T\beta + \varepsilon$, $\beta = (2, 9, 0, 4, 8)^T$, $\varepsilon \sim N(0,1)$. Additional rows were generated for $n = 60, 80, 100, 120, 140, 160$. Candidate models: $(1,4,5)$, $(1,2,4,5)$, $(1,3,4,5)$, $(1,2,3,4,5)$. Repeated 1000 times, using OLS with 5-fold CVC. [Figure: rate of correct selection vs. $n$, comparing cvc.min and cv.]

63 Simulations: variable selection with $\hat m_{cvc.min}$. $Y = X^T\beta + \varepsilon$, $X \sim N(0, \Sigma)$, $\varepsilon \sim N(0,1)$, $n = 200$, $p = 200$. $\Sigma = I_{200}$ (identity) or $\Sigma_{jk} = \delta^{|j-k|}$ (correlated); $\beta = (1,1,1,0,\ldots,0)^T$ (simple). 5-fold CVC with $\alpha = 0.05$ using forward stepwise. [Table: proportion of correct model selection over 100 independent data sets; rows: (identity, simple), (correlated, simple); columns: oracle, $\hat m_{cvc.min}$, $\hat m_{cv}$.] Oracle method: the number of steps that gives the smallest prediction risk.

64 Simulations: variable selection with $\hat m_{cvc.min}$ (continued). Same setup as above, with 5-fold CVC at $\alpha = 0.05$ using Lasso + least squares refitting. [Table: proportion of correct model selection over 100 independent data sets; rows: (identity, simple), (correlated, simple); columns: oracle, $\hat m_{cvc.min}$, $\hat m_{cv}$.] Oracle method: the $\lambda$ value that gives the smallest prediction risk.

65 The diabetes data revisited. Split $n = 442$ into 300 (estimation) and 142 (risk approximation). 5-fold CVC is applied to the 300 sample points, with a final re-fit, and the final estimate is evaluated on the 142 hold-out samples. Repeated 100 times, using the lasso with 50 values of $\lambda$. [Figure: test error and model size, comparing cv and cvc.min.]

66 Summary. CVC: confidence sets for model selection; $\hat m_{cvc.min}$ has similar risk to $\hat m_{cv}$ while using a simpler model. Extensions: validity of CVC in high dimensions and nonparametric settings; unsupervised problems (1. clustering, 2. matrix decomposition (PCA, SVD, etc.), 3. network models); other sample-splitting based inference methods.

67 Thanks! Questions? Paper: Cross-Validation with Confidence, arxiv.org/ Slides:
