Cross-Validation with Confidence
1 Cross-Validation with Confidence Jing Lei Department of Statistics, Carnegie Mellon University WHOA-PSI Workshop, St Louis, 2017
2 Quotes from Day 1 and Day 2: "Good model or pure model?" "Occam's razor" "We really want a λ that looks reasonable"
3-9 Overview
                Parameter est.         Model selection
Point est.      MLE, M-est., ...       Cross-validation
Interval est.   Confidence interval    CVC
10 Outline. Background: cross-validation, overfitting, and the uncertainty of model selection. Cross-validation with confidence: a hypothesis testing framework; p-value calculation; validity of the confidence set. Examples.
11-12 A regression setting. Data: D = {(X_i, Y_i) : 1 ≤ i ≤ n}, i.i.d. from a joint distribution P on R^p × R. Model: Y = f(X) + ε, with E(ε | X) = 0. Loss function: ℓ(·, ·) : R² → R. Goal: find ˆf ≈ f so that Q(ˆf) := E[ℓ(ˆf(X), Y) | ˆf] is small. The framework can be extended to unsupervised learning problems, including network data.
13-15 Model selection. Candidate set: M = {1, ..., M}; each m ∈ M corresponds to a candidate model. 1. m can represent a competing theory about P (e.g., f is linear, f is quadratic, variable j is irrelevant, etc.). 2. m can represent a particular value of a tuning parameter of an algorithm for computing ˆf (e.g., λ in the lasso, the number of steps in forward selection). Given m and data D, there is an estimate ˆf(D, m) of f. Model selection: find the best m, either 1. such that it equals the true model, or 2. such that it minimizes Q(ˆf) over all m ∈ M with high probability.
16 Cross-validation. Sample split: let I_tr and I_te be a partition of {1, ..., n}. Fitting: ˆf_m = ˆf(D_tr, m), where D_tr = {(X_i, Y_i) : i ∈ I_tr}. Validation: ˆQ(ˆf_m) = n_te^{-1} Σ_{i ∈ I_te} ℓ(ˆf_m(X_i), Y_i). CV model selection: ˆm_cv = argmin_{m ∈ M} ˆQ(ˆf_m). V-fold cross-validation: 1. For V ≥ 2, split the data into V folds. 2. Rotate over each fold as I_te to obtain ˆQ^(v)(ˆf_m^(v)). 3. ˆm_cv = argmin_m V^{-1} Σ_{v=1}^V ˆQ^(v)(ˆf_m^(v)). 4. Popular choices: V = 10, and V = n (leave-one-out cross-validation).
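The V-fold procedure above can be sketched in a few lines. This is my own illustration, not code from the talk; the `fit_fns` interface (each candidate model is a function that trains on part of the data and returns a predictor) is a hypothetical choice for this sketch.

```python
import numpy as np

def vfold_cv_select(X, y, fit_fns, V=10, loss=lambda p, y: (p - y) ** 2, seed=0):
    """V-fold CV: rotate each fold as the held-out set, average the loss
    over all held-out points, and pick the model with the smallest average."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % V  # fold label per point
    Qhat = np.zeros(len(fit_fns))
    for m, fit in enumerate(fit_fns):
        for v in range(V):
            tr, te = folds != v, folds == v
            pred = fit(X[tr], y[tr])                  # f-hat_m trained without fold v
            Qhat[m] += loss(pred(X[te]), y[te]).sum()
    Qhat /= n                                          # average over all n held-out evaluations
    return int(np.argmin(Qhat)), Qhat

# toy usage with two candidate mean models: mu = 0 versus mu estimated
rng = np.random.default_rng(1)
y = rng.normal(size=200)
X = np.zeros((200, 1))
models = [
    lambda Xt, yt: (lambda Xe: np.zeros(len(Xe))),                    # m = 1: mu = 0
    lambda Xt, yt: (lambda Xe, mu=yt.mean(): np.full(len(Xe), mu)),   # m = 2: mu estimated
]
m_cv, Qhat = vfold_cv_select(X, y, models, V=5)
```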
17-19 A simple negative example. Model: Y = μ + ε, where ε ~ N(0, 1). M = {1, 2}; m = 1: μ = 0; m = 2: μ ∈ R. Truth: μ = 0. Consider a single split: ˆf_1 ≡ 0, ˆf_2 = ε̄_tr. Then ˆm_cv = 1 ⟺ 0 < ˆQ(ˆf_2) − ˆQ(ˆf_1) = ε̄_tr² − 2 ε̄_tr ε̄_te. If n_tr/n_te → 1, then √n ε̄_tr and √n ε̄_te are independent normal random variables with constant variances, so P(ˆm_cv = 1) is bounded away from 1. (Shao 93, Zhang 93, Yang 07): ˆm_cv is inconsistent unless n_tr = o(n). V-fold does not help!
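The bounded-away-from-1 claim is easy to check numerically. This simulation is my own, not from the slides; it uses the closed-form difference ˆQ(ˆf_2) − ˆQ(ˆf_1) = ε̄_tr² − 2 ε̄_tr ε̄_te from the example, with a single 50/50 split.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 400, 4000
picks_null = 0
for _ in range(reps):
    eps = rng.normal(size=n)                 # mu = 0, so model 1 is the true model
    tr, te = eps[: n // 2], eps[n // 2:]     # a single 50/50 train/test split
    # Qhat(f2) - Qhat(f1) = eps_bar_tr**2 - 2 * eps_bar_tr * eps_bar_te
    diff = tr.mean() ** 2 - 2 * tr.mean() * te.mean()
    picks_null += diff > 0                   # m_cv = 1 iff the null model wins
p_hat = picks_null / reps
# p_hat stabilizes well below 1, no matter how large n is
```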
20-22 A fix for the simple example: hypothesis testing. The fundamental question: when we see ˆQ(ˆf_2) < ˆQ(ˆf_1), do we feel confident to say Q(ˆf_2) < Q(ˆf_1)? A standard solution uses hypothesis testing: H_0: Q(ˆf_1) ≤ Q(ˆf_2), conditioning on ˆf_1, ˆf_2. Can do this using a paired-sample t-test, say with type I error level α.
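A sketch of that conditional test, written by me for illustration (the function name and the normal approximation to the t reference distribution are my choices): given the two vectors of per-point held-out losses, test H_0: Q(ˆf_1) ≤ Q(ˆf_2) by a one-sided paired comparison.

```python
import numpy as np
from math import erf, sqrt

def paired_loss_test(loss1, loss2):
    """One-sided paired test of H0: E[loss1 - loss2] <= 0, i.e. Q(f1) <= Q(f2),
    conditional on the fitted models; a small p-value favors f2.
    Uses a normal approximation to the t reference, fine for large folds."""
    d = np.asarray(loss1, float) - np.asarray(loss2, float)  # per-point differences
    t = d.mean() / (d.std(ddof=1) / sqrt(len(d)))
    p = 1 - 0.5 * (1 + erf(t / sqrt(2)))                     # upper-tail normal p-value
    return t, p

# usage on hypothetical loss vectors: f2 clearly better -> small p, and vice versa
rng = np.random.default_rng(0)
la = rng.normal(loc=1.0, scale=0.3, size=500)   # per-point losses of f1
lb = rng.normal(loc=0.5, scale=0.3, size=500)   # per-point losses of f2
t1, p1 = paired_loss_test(la, lb)               # expect a small p-value
t2, p2 = paired_loss_test(lb, la)               # reversed: expect a large p-value
```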
23-25 CVC for the simple example. Recall that H_0: Q(ˆf_1) ≤ Q(ˆf_2). When H_0 is not rejected, does it mean we shall just pick m = 1? No. Because if we consider H_0′: Q(ˆf_2) ≤ Q(ˆf_1), H_0′ will not be rejected either (the probability of rejecting H_0′ is bounded away from 0). Most likely, we reject neither H_0 nor H_0′. We accept both fitted models ˆf_1 and ˆf_2, as they are very similar and the difference cannot be detected from the data.
26-28 Existing work. Hansen et al. (2011, Econometrica): sequential testing, only for low-dimensional problems. Ferrari and Yang (2014): F-tests; need a good variable screening procedure in high dimensions. Our approach: one step, with provable coverage and power under mild assumptions in high dimensions. Key technique: high-dimensional Gaussian comparison of sample means (Chernozhukov et al.).
29-31 CVC in general. Now suppose we have a set of candidate models M = {1, ..., M}. Split the data into D_tr and D_te, and use D_tr to obtain ˆf_m for each m. Recall that the model quality is Q(ˆf) = E[ℓ(ˆf(X), Y) | ˆf]. For each m, test the hypothesis (conditioning on ˆf_1, ..., ˆf_M) H_{0,m}: min_{j ≠ m} Q(ˆf_j) ≥ Q(ˆf_m), and let ˆp_m be a valid p-value. A_cvc = {m : ˆp_m > α} is our confidence set for the best fitted model: P(m* ∈ A_cvc) ≥ 1 − α, where m* = argmin_m Q(ˆf_m).
32-36 Calculating ˆp_m. Recall that D_tr is the training data and D_te is the testing data; the test and p-values are conditional on D_tr. Data: an n_te × (M − 1) matrix (I_te is the index set of D_te) [ξ^(i)_{m,j}]_{i ∈ I_te, j ≠ m}, where ξ^(i)_{m,j} = ℓ(ˆf_m(X_i), Y_i) − ℓ(ˆf_j(X_i), Y_i). This is multivariate mean testing: H_{0,m}: E(ξ_{m,j}) ≤ 0, ∀ j ≠ m. Challenges: 1. High dimensionality: M can be large. 2. Potentially high correlation between ξ_{m,j} and ξ_{m,j′}. 3. Vastly different scaling: Var(ξ_{m,j}) can be O(1) or O(n^{−1}).
37 Calculating ˆp_m, continued. H_{0,m}: E(ξ_{m,j}) ≤ 0, ∀ j ≠ m. Let ˆμ_{m,j} and ˆσ_{m,j} be the sample mean and standard deviation of (ξ^(i)_{m,j} : i ∈ I_te). Naturally, one would reject H_{0,m} for large values of max_{j ≠ m} ˆμ_{m,j}/ˆσ_{m,j}. Approximate the null distribution using high-dimensional Gaussian comparison.
38-40 Studentized Gaussian multiplier bootstrap. 1. T_m = max_{j ≠ m} √n_te ˆμ_{m,j}/ˆσ_{m,j}. 2. Let B be the bootstrap sample size. For b = 1, ..., B: 2.1 generate i.i.d. standard Gaussian ζ_i, i ∈ I_te; 2.2 T*_b = max_{j ≠ m} n_te^{−1/2} Σ_{i ∈ I_te} [(ξ^(i)_{m,j} − ˆμ_{m,j})/ˆσ_{m,j}] ζ_i. 3. ˆp_m = B^{−1} Σ_{b=1}^B 1(T*_b > T_m). The studentization takes care of the scaling difference; the bootstrap Gaussian comparison takes care of the dimensionality and correlation.
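The three displayed steps translate directly to code. This is my own implementation of the slide's procedure (the function name is mine); it takes as input the n_te × (M−1) matrix of loss differences ξ for a fixed candidate m.

```python
import numpy as np

def cvc_pvalue(xi, B=1000, seed=0):
    """Studentized Gaussian multiplier bootstrap p-value for H_{0,m}.
    xi[i, j] = l(f_m(X_i), Y_i) - l(f_j(X_i), Y_i) over the test fold."""
    rng = np.random.default_rng(seed)
    n_te = xi.shape[0]
    mu = xi.mean(axis=0)
    sd = xi.std(axis=0, ddof=1)
    T_m = np.max(np.sqrt(n_te) * mu / sd)                    # step 1: test statistic
    zeta = rng.standard_normal((B, n_te))                    # step 2.1: Gaussian multipliers
    centered = (xi - mu) / sd                                # studentize and center
    T_star = (zeta @ centered / np.sqrt(n_te)).max(axis=1)   # step 2.2: bootstrap statistics
    return float(np.mean(T_star > T_m))                      # step 3: p-hat_m

# sanity check: if model m dominates (all loss differences negative on average)
# the p-value is large; if m is clearly beaten, it is near zero
rng = np.random.default_rng(1)
noise = rng.standard_normal((200, 5))
p_good = cvc_pvalue(noise - 1.0)   # m better than every competitor
p_bad = cvc_pvalue(noise + 1.0)   # m worse than every competitor
```

With xi built from a single train/test split this gives the split-sample CVC p-value; the confidence set is then {m : cvc_pvalue(xi_m) > α}.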
41 Properties of CVC. A_cvc = {m : ˆp_m > α}. Let ˆm_cv = argmin_m ˆQ(ˆf_m). By construction, T_{ˆm_cv} ≤ 0. Proposition: if α < 0.5, then P(ˆm_cv ∈ A_cvc) → 1 as B → ∞. Proof: [n_te^{−1/2} Σ_{i ∈ I_te} ((ξ^(i)_{m,j} − ˆμ_{m,j})/ˆσ_{m,j}) ζ_i]_{j ≠ m} is a zero-mean Gaussian random vector, so the upper-α quantile of its maximum must be positive. Can view ˆm_cv as the center of the confidence set.
42 Coverage of A_cvc. Recall ξ_{m,j} = ℓ(ˆf_m(X), Y) − ℓ(ˆf_j(X), Y), with an independent (X, Y). Let μ_{m,j} = E[ξ_{m,j} | ˆf_m, ˆf_j], σ²_{m,j} = Var[ξ_{m,j} | ˆf_m, ˆf_j]. Theorem: assume that (ξ_{m,j} − μ_{m,j})/(A_n σ_{m,j}) has a sub-exponential tail for all m ≠ j and some A_n ≥ 1 such that A_n^6 log^7(M ∨ n) = O(n^{1−c}) for some c > 0. 1. If max_{j ≠ m} (μ_{m,j}/σ_{m,j})_+ = o(1/√(n log(M ∨ n))), then P(m ∈ A_cvc) ≥ 1 − α + o(1). 2. If max_{j ≠ m} (μ_{m,j}/σ_{m,j})_+ ≥ C A_n √(log(M ∨ n)/n) for some constant C, and α ≫ n^{−1}, then P(m ∈ A_cvc) = o(1).
43 Proof of coverage. Let Z(Σ) = max N(0, Σ), and z(1 − α, Σ) its 1 − α quantile. Let ˆΓ and Γ be the sample and population correlation matrices of (ξ^(i)_{m,j})_{i ∈ I_te, j ≠ m}. When B → ∞, P(ˆp_m ≤ α) = P[max_j √n_te ˆμ_{m,j}/ˆσ_{m,j} ≥ z(1 − α, ˆΓ)]. Tools (2 and 3 are due to Chernozhukov et al.): 1. Concentration: √n_te ˆμ_{m,j}/ˆσ_{m,j} ≤ √n_te (ˆμ_{m,j} − μ_{m,j})/σ_{m,j} + o(1/√log M). 2. Gaussian comparison: max_j √n_te (ˆμ_{m,j} − μ_{m,j})/σ_{m,j} ≈_d Z(Γ) ≈_d Z(ˆΓ). 3. Anti-concentration: Z(ˆΓ) and Z(Γ) have densities bounded by O(√log M).
44-49 V-fold CVC. Split the data into V folds. Let v_i be the fold that contains data point i. Let ˆf_{m,v} be the estimate using model m and all data but fold v. ξ^(i)_{m,j} = ℓ(ˆf_{m,v_i}(X_i), Y_i) − ℓ(ˆf_{j,v_i}(X_i), Y_i), for all 1 ≤ i ≤ n. Calculate T_m and T*_b correspondingly using the n × (M − 1) cross-validated error difference matrix (ξ^(i)_{m,j})_{1 ≤ i ≤ n, j ≠ m}. Rigorous justification is hard due to dependence between folds, but empirically this is much better.
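A sketch of the cross-validated loss matrix the slide describes (my own code, with the same hypothetical fit-function interface as before): from the n × M matrix of out-of-fold losses, the n × (M−1) difference matrix for any fixed m follows by subtraction.

```python
import numpy as np

def cv_loss_matrix(X, y, fit_fns, V=5, loss=lambda p, y: (p - y) ** 2, seed=0):
    """n x M matrix of out-of-fold losses: entry (i, m) evaluates model m
    fitted on all folds except the fold v_i containing point i."""
    n, M = len(y), len(fit_fns)
    folds = np.random.default_rng(seed).permutation(n) % V   # v_i for each i
    L = np.empty((n, M))
    for v in range(V):
        tr, te = folds != v, folds == v
        for m, fit in enumerate(fit_fns):
            L[te, m] = loss(fit(X[tr], y[tr])(X[te]), y[te])
    return L

def xi_for(L, m):
    """Cross-validated difference matrix for model m: xi[i, .] = L[i, m] - L[i, j], j != m."""
    return L[:, [m]] - np.delete(L, m, axis=1)

# toy usage with the two candidate mean models from the negative example
rng = np.random.default_rng(2)
y = rng.normal(size=120)
X = np.zeros((120, 1))
models = [
    lambda Xt, yt: (lambda Xe: np.zeros(len(Xe))),                    # mu = 0
    lambda Xt, yt: (lambda Xe, mu=yt.mean(): np.full(len(Xe), mu)),   # mu estimated
]
L = cv_loss_matrix(X, y, models)
xi = xi_for(L, 0)   # n x (M-1) cross-validated error difference matrix
```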
50 Example: the diabetes data (Efron et al. 04). n = 442, with 10 covariates: age, sex, bmi, blood pressure, etc. Response: diabetes progression after one year. Including all quadratic terms, p = 64. 5-fold CVC with α = 0.05, using the lasso with 50 values of λ. Plot: test error vs. log(λ). Triangles: models in A_cvc; solid triangle: ˆm_cv.
51 Simulations: coverage of A_cvc. Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200. Σ = I_200 (identity), or Σ_{jk} = δ^{|j−k|} (correlated). β = (1, 1, 1, 0, ..., 0)^T (simple), or β = (1, 1, 1, 0.7, 0.5, 0.3, 0, ..., 0)^T (mixed). 5-fold CVC with α = 0.05 using the lasso with 50 values of λ.

setting of (Σ, β)    coverage    |A_cvc|     cv is opt.
identity, simple     .92 (.03)   5.1 (.19)   .27 (.04)
identity, mixed      .95 (.02)   5.1 (.18)   .37 (.05)
correlated, simple   .96 (.02)   7.5 (.18)   .18 (.04)
correlated, mixed    .93 (.03)   7.4 (.23)   .19 (.04)
52 Simulations: coverage of A_cvc, continued. Same setup: Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200; identity or correlated Σ, simple or mixed β. 5-fold CVC with α = 0.05 using forward stepwise.

setting of (Σ, β)    coverage    |A_cvc|     cv is opt.
identity, simple     1 (0)       3.7 (.29)   .87 (.03)
identity, mixed      .95 (.02)   5.2 (.33)   .58 (.05)
correlated, simple   .97 (.02)   4.1 (.31)   .80 (.04)
correlated, mixed    .93 (.03)   6.3 (.36)   .44 (.05)
53-56 How to use A_cvc? We are often interested in picking one model, not a subset of models. A_cvc provides some flexibility of picking among a subset of highly competitive models. 1. A_cvc may contain a model that includes a particularly interesting variable. 2. A_cvc can be used to answer questions like: is fitting procedure A better than procedure B? 3. We can also simply choose the most parsimonious model in A_cvc.
57 The most parsimonious model in A_cvc. Now consider the linear regression problem Y = X^T β + ε. Let J_m be the subset of variables selected using model m. ˆm_cvc.min = argmin_{m ∈ A_cvc} |J_m|. ˆm_cvc.min is the simplest model that gives a similar predictive risk as ˆm_cv.
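Given CVC p-values and model sizes |J_m|, both the confidence set and ˆm_cvc.min are one-liners. The numbers below are hypothetical, chosen only to illustrate the selection rule.

```python
import numpy as np

p_hat = np.array([0.40, 0.03, 0.22, 0.60])   # hypothetical CVC p-values, one per model
size = np.array([3, 2, 4, 5])                # |J_m|: number of selected variables
alpha = 0.05

A_cvc = np.flatnonzero(p_hat > alpha)        # confidence set {m : p_hat_m > alpha}
m_cvc_min = A_cvc[np.argmin(size[A_cvc])]    # most parsimonious member of A_cvc
# -> A_cvc = [0, 2, 3]; m_cvc_min = 0 (size 3 beats sizes 4 and 5)
```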
58 A classical setting. Y = X^T β + ε, X ∈ R^p, Var(X) = Σ has full rank, and ε has mean zero and variance σ² < ∞. Assume that (p, Σ, σ²) are fixed and n → ∞. M contains the true model m*, and at least one overfitting model. n_tr/n_te → 1. Using squared loss, the true model and all overfitting models give √n-consistent estimates. Early results (Shao 93, Zhang 93, Yang 07) show that P(ˆm_cv ≠ m*) is bounded away from 0.
59-61 Consistency of ˆm_cvc.min. Theorem: assume that X and ε are independent and sub-Gaussian, and A_cvc is the output of CVC with α = o(1) and α ≫ n^{−1}; then lim_{n→∞} P(ˆm_cvc.min = m*) = 1. Sub-Gaussianity of X and ε implies that (Y − X^T β)² is sub-exponential. Can allow p to grow slowly with n using a union bound.
62 Example in low-dimensional variable selection. Synthetic data with p = 5, n = 40, as in [Shao 93]. Y = X^T β + ε, β = (2, 9, 0, 4, 8)^T, ε ~ N(0, 1). Generated additional rows for n = 60, 80, 100, 120, 140, 160. Candidates: (1,4,5), (1,2,4,5), (1,3,4,5), (1,2,3,4,5). Repeated 1000 times, using OLS with 5-fold CVC. Plot: rate of correct selection vs. n, for cvc.min and cv.
63 Simulations: variable selection with ˆm_cvc.min. Y = X^T β + ε, X ~ N(0, Σ), ε ~ N(0, 1), n = 200, p = 200. Σ = I_200 (identity), or Σ_{jk} = δ^{|j−k|} (correlated). β = (1, 1, 1, 0, ..., 0)^T (simple). 5-fold CVC with α = 0.05 using forward stepwise. Table: proportion of correct model selection over 100 independent data sets, comparing oracle, ˆm_cvc.min, and ˆm_cv for the (identity, simple) and (correlated, simple) settings. Oracle method: the number of steps that gives the smallest prediction risk.
64 Simulations: variable selection with ˆm_cvc.min, continued. Same setup, 5-fold CVC with α = 0.05 using lasso + least squares. Table: proportion of correct model selection over 100 independent data sets, comparing oracle, ˆm_cvc.min, and ˆm_cv for the (identity, simple) and (correlated, simple) settings. Oracle method: the λ value that gives the smallest prediction risk.
65 The diabetes data revisited. Split n = 442 into 300 (estimation) and 142 (risk approximation). 5-fold CVC applied to the 300 sample points, with a final re-fit. The final estimate is evaluated using the 142 hold-out samples. Repeated 100 times, using the lasso with 50 values of λ. Boxplots: test error and model size, for cv and cvc.min.
66 Summary. CVC: confidence sets for model selection. ˆm_cvc.min has similar risk to ˆm_cv using a simpler model. Extensions: validity of CVC in high dimensions and nonparametric settings; unsupervised problems (1. clustering, 2. matrix decomposition (PCA, SVD, etc.), 3. network models); other sample-splitting based inference methods.
67 Thanks! Questions? Paper: Cross-Validation with Confidence, arxiv.org/ Slides:
More informationMachine Learning Linear Classification. Prof. Matteo Matteucci
Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)
More informationMFE Financial Econometrics 2018 Final Exam Model Solutions
MFE Financial Econometrics 2018 Final Exam Model Solutions Tuesday 12 th March, 2019 1. If (X, ε) N (0, I 2 ) what is the distribution of Y = µ + β X + ε? Y N ( µ, β 2 + 1 ) 2. What is the Cramer-Rao lower
More informationPrediction & Feature Selection in GLM
Tarigan Statistical Consulting & Coaching statistical-coaching.ch Doctoral Program in Computer Science of the Universities of Fribourg, Geneva, Lausanne, Neuchâtel, Bern and the EPFL Hands-on Data Analysis
More informationDecision trees COMS 4771
Decision trees COMS 4771 1. Prediction functions (again) Learning prediction functions IID model for supervised learning: (X 1, Y 1),..., (X n, Y n), (X, Y ) are iid random pairs (i.e., labeled examples).
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, What about continuous variables?
Linear Regression Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2014 1 What about continuous variables? n Billionaire says: If I am measuring a continuous variable, what
More informationLinear Methods for Prediction
Chapter 5 Linear Methods for Prediction 5.1 Introduction We now revisit the classification problem and focus on linear methods. Since our prediction Ĝ(x) will always take values in the discrete set G we
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04
More informationInference on distributions and quantiles using a finite-sample Dirichlet process
Dirichlet IDEAL Theory/methods Simulations Inference on distributions and quantiles using a finite-sample Dirichlet process David M. Kaplan University of Missouri Matt Goldman UC San Diego Midwest Econometrics
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationHigh-dimensional regression with unknown variance
High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f
More informationPAC Learning Introduction to Machine Learning. Matt Gormley Lecture 14 March 5, 2018
10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University PAC Learning Matt Gormley Lecture 14 March 5, 2018 1 ML Big Picture Learning Paradigms:
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013
Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you
More informationLecture 6: Discrete Choice: Qualitative Response
Lecture 6: Instructor: Department of Economics Stanford University 2011 Types of Discrete Choice Models Univariate Models Binary: Linear; Probit; Logit; Arctan, etc. Multinomial: Logit; Nested Logit; GEV;
More informationConfidence Intervals for Low-dimensional Parameters with High-dimensional Data
Confidence Intervals for Low-dimensional Parameters with High-dimensional Data Cun-Hui Zhang and Stephanie S. Zhang Rutgers University and Columbia University September 14, 2012 Outline Introduction Methodology
More informationMath 494: Mathematical Statistics
Math 494: Mathematical Statistics Instructor: Jimin Ding jmding@wustl.edu Department of Mathematics Washington University in St. Louis Class materials are available on course website (www.math.wustl.edu/
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Prediction Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict the
More informationThe risk of machine learning
/ 33 The risk of machine learning Alberto Abadie Maximilian Kasy July 27, 27 2 / 33 Two key features of machine learning procedures Regularization / shrinkage: Improve prediction or estimation performance
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationTime Series and Forecasting Lecture 4 NonLinear Time Series
Time Series and Forecasting Lecture 4 NonLinear Time Series Bruce E. Hansen Summer School in Economics and Econometrics University of Crete July 23-27, 2012 Bruce Hansen (University of Wisconsin) Foundations
More informationMS&E 226: Small Data
MS&E 226: Small Data Lecture 6: Model complexity scores (v3) Ramesh Johari ramesh.johari@stanford.edu Fall 2015 1 / 34 Estimating prediction error 2 / 34 Estimating prediction error We saw how we can estimate
More informationFeature Engineering, Model Evaluations
Feature Engineering, Model Evaluations Giri Iyengar Cornell University gi43@cornell.edu Feb 5, 2018 Giri Iyengar (Cornell Tech) Feature Engineering Feb 5, 2018 1 / 35 Overview 1 ETL 2 Feature Engineering
More informationDimension Reduction Methods
Dimension Reduction Methods And Bayesian Machine Learning Marek Petrik 2/28 Previously in Machine Learning How to choose the right features if we have (too) many options Methods: 1. Subset selection 2.
More informationData Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University
Data Mining Chapter 4: Data Analysis and Uncertainty Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University Why uncertainty? Why should data mining care about uncertainty? We
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationIndependent and conditionally independent counterfactual distributions
Independent and conditionally independent counterfactual distributions Marcin Wolski European Investment Bank M.Wolski@eib.org Society for Nonlinear Dynamics and Econometrics Tokyo March 19, 2018 Views
More informationA Significance Test for the Lasso
A Significance Test for the Lasso Lockhart R, Taylor J, Tibshirani R, and Tibshirani R Ashley Petersen May 14, 2013 1 Last time Problem: Many clinical covariates which are important to a certain medical
More informationMultivariate Regression
Multivariate Regression The so-called supervised learning problem is the following: we want to approximate the random variable Y with an appropriate function of the random variables X 1,..., X p with the
More informationLecture Slides for INTRODUCTION TO. Machine Learning. ETHEM ALPAYDIN The MIT Press,
Lecture Slides for INTRODUCTION TO Machine Learning ETHEM ALPAYDIN The MIT Press, 2004 alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml CHAPTER 14: Assessing and Comparing Classification Algorithms
More informationA union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling
A union of Bayesian, frequentist and fiducial inferences by confidence distribution and artificial data sampling Min-ge Xie Department of Statistics, Rutgers University Workshop on Higher-Order Asymptotics
More informationRobust Backtesting Tests for Value-at-Risk Models
Robust Backtesting Tests for Value-at-Risk Models Jose Olmo City University London (joint work with Juan Carlos Escanciano, Indiana University) Far East and South Asia Meeting of the Econometric Society
More informationSUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION. University of Minnesota
Submitted to the Annals of Statistics arxiv: math.pr/0000000 SUPPLEMENT TO PARAMETRIC OR NONPARAMETRIC? A PARAMETRICNESS INDEX FOR MODEL SELECTION By Wei Liu and Yuhong Yang University of Minnesota In
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationDISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania
Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationHigher-Order von Mises Expansions, Bagging and Assumption-Lean Inference
Higher-Order von Mises Expansions, Bagging and Assumption-Lean Inference Andreas Buja joint with: Richard Berk, Lawrence Brown, Linda Zhao, Arun Kuchibhotla, Kai Zhang Werner Stützle, Ed George, Mikhail
More informationProbabilistic Graphical Models
School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06
More informationDirect Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:
More informationIntroduction to bivariate analysis
Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.
More informationLinear regression COMS 4771
Linear regression COMS 4771 1. Old Faithful and prediction functions Prediction problem: Old Faithful geyser (Yellowstone) Task: Predict time of next eruption. 1 / 40 Statistical model for time between
More informationBig Data, Machine Learning, and Causal Inference
Big Data, Machine Learning, and Causal Inference I C T 4 E v a l I n t e r n a t i o n a l C o n f e r e n c e I F A D H e a d q u a r t e r s, R o m e, I t a l y P a u l J a s p e r p a u l. j a s p e
More informationFundamentals of Machine Learning (Part I)
Fundamentals of Machine Learning (Part I) Mohammad Emtiyaz Khan AIP (RIKEN), Tokyo http://emtiyaz.github.io emtiyaz.khan@riken.jp April 12, 2018 Mohammad Emtiyaz Khan 2018 1 Goals Understand (some) fundamentals
More informationSTAT 730 Chapter 4: Estimation
STAT 730 Chapter 4: Estimation Timothy Hanson Department of Statistics, University of South Carolina Stat 730: Multivariate Analysis 1 / 23 The likelihood We have iid data, at least initially. Each datum
More informationSYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions
SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu
More informationShrinkage Methods: Ridge and Lasso
Shrinkage Methods: Ridge and Lasso Jonathan Hersh 1 Chapman University, Argyros School of Business hersh@chapman.edu February 27, 2019 J.Hersh (Chapman) Ridge & Lasso February 27, 2019 1 / 43 1 Intro and
More informationIntroduction to bivariate analysis
Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.
More informationGAUSSIAN PROCESS REGRESSION
GAUSSIAN PROCESS REGRESSION CSE 515T Spring 2015 1. BACKGROUND The kernel trick again... The Kernel Trick Consider again the linear regression model: y(x) = φ(x) w + ε, with prior p(w) = N (w; 0, Σ). The
More informationResampling techniques for statistical modeling
Resampling techniques for statistical modeling Gianluca Bontempi Département d Informatique Boulevard de Triomphe - CP 212 http://www.ulb.ac.be/di Resampling techniques p.1/33 Beyond the empirical error
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationAn Introduction to Statistical Machine Learning - Theoretical Aspects -
An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,
More information