Variable Selection in Wide Data Sets


Variable Selection in Wide Data Sets
Bob Stine
Department of Statistics, The Wharton School of the University of Pennsylvania
www-stat.wharton.upenn.edu/~stine
Univ of Penn

Overview
Problem: Picking the features to use in a model. Wide data sets imply many choices.
Examples: Simulated idealized problems; predicting credit risk.
Themes: Modifying familiar tools for data mining; dealing with the problem of multiplicity.
Collaboration with Dean Foster.

Questions
Credit scoring: Who will repay the loan? Who will declare bankruptcy?
Identifying faces: Is this a picture of a face?
Genomics: Does a pattern of genes predict higher risk of a disease?
Text processing: Which references were left out of this paper?
These are great statistics problems, so... why not use our workhorse, regression? It's familiar, with diagnostics available.

Regression Models
Simple structure: n independent observations of m features, q predictors in the model with error variance σ²:
Y = β0 + β1 X1 + ... + βq Xq + ε
Calibrated predictions allow various link functions, E(Y | X) = h(β0 + β1 X1 + ... + βq Xq): linear regression has the identity link, logistic regression the logit link.
A rich feature space is key to the model:
- Usual mix: continuous, categorical, dummy variables, ...
- Missing-data indicators
- Combinations (principal components) and clusters
- Nonlinear terms, transformations (quadratics)

Hard Question
If we allow for the possibility of interactions among the predictors, then the feature space becomes larger still.
Which features generate the best predictions? Also hope to do this without cross-validation.

A favorite example: Stock Market
Where's the stock market headed in 2005? Daily percentage changes in the closing price of the S&P 500 during the last 3 months of 2004 (85 trading days)...
[Figure: daily percent change vs. trading day, with the 2005 values marked "???"]

Predicting the Market
Problem: Predict returns on the S&P 500 in 2005 using features built from 12 exogenous factors (a.k.a. technical trading rules).
Regression model: R² = 0.85 using q = 28 predictors. With n = 85, F = 12 with p-value < 0.00001. Coefficients are impressive...

Term        Estimate   Std Error   t-ratio   p-value
Intercept    0.323      0.078       4.14     0.0001
X4           0.172      0.040       4.34     0.0000
(X1)*(X1)   -0.202      0.039      -5.16     0.0000
(X1)*(X5)    0.256      0.048       5.34     0.0000
(X2)*(X6)    0.289      0.044       6.59     0.0000
(X7)*(X9)    0.249      0.046       5.37     0.0000

Model Fit and Predictions
Predictions track the historical returns closely, even matching turning points.
[Figure: daily percent change with fitted values]
Better short the market that day... No guts, no glory!

Prediction Errors
In-sample prediction errors are small...
[Figure: in-sample prediction errors by day]

Prediction Errors
How'd you lose the house? Even Bonferroni lost his house!
[Figure: prediction errors by day]

What was that model?
Exogenous base variables: 12 columns of Gaussian noise. The null model is the right model.
Selecting predictors: Turn stepwise regression loose on m = 90 features (12 linear + 12 squares + 66 interactions) with promiscuous settings for adding variables, prob-to-enter = 0.16 (AIC). Then run stepwise backward to remove extraneous effects.
Why does this fool Bonferroni? s² becomes biased, so the remaining predictors appear more significant. This often cascades into a perfect fit. One cannot fit the saturated model to get an unbiased estimate of σ² because m > n.
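
This cautionary tale is easy to reproduce. A minimal sketch (assuming only NumPy; the 30-step cap and seed are arbitrary): regress pure noise on 90 derived features with a greedy forward search and an AIC-like entry rule (partial F > 2, roughly prob-to-enter 0.16), and watch R² climb even though the null model is true.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 85, 12
X = rng.standard_normal((n, k))   # 12 columns of Gaussian noise
y = rng.standard_normal(n)        # pure noise: the null model is right

# m = 90 candidate features: 12 linear, 12 squares, 66 pairwise interactions
feats = [X[:, j] for j in range(k)]
feats += [X[:, j] ** 2 for j in range(k)]
feats += [X[:, i] * X[:, j] for i in range(k) for j in range(i + 1, k)]
F = np.column_stack(feats)

def rss(cols):
    """Residual sum of squares of OLS of y on the listed columns + intercept."""
    Z = np.column_stack([np.ones(n)] + [F[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    r = y - Z @ beta
    return r @ r

tss = ((y - y.mean()) ** 2).sum()
chosen, cur = [], rss([])
for step in range(30):   # greedy forward stepwise with AIC-like entry
    best, j = min((rss(chosen + [c]), c) for c in range(F.shape[1]) if c not in chosen)
    df = n - len(chosen) - 2              # residual df after adding the candidate
    if (cur - best) / (best / df) < 2:    # partial F < 2: stop adding
        break
    chosen.append(j)
    cur = best

r2 = 1 - cur / tss
print(len(chosen), round(r2, 2))   # many "significant" predictors, large R^2
```

Running backward elimination afterward, as on the slide, only prunes the weakest of these spurious terms; the inflated fit remains.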

Big Picture, 1
Moral: Stepwise regression can make a silk purse from a sow's ear.
Must control the fitting process: Use Bonferroni from the start, not after the fact. No predictor joining the model is the right answer here.
So, how does one control the fitting process? Particularly when there are many possible choices.

Second Example: Predicting Bankruptcy
Predict onset of personal bankruptcy: Estimate the probability that a customer declares bankruptcy.
Challenge: Can stepwise regression predict as well as commercial data-mining tools or substantive models?
Many features:
- About 350 basic variables
- Short time series for each account: spending, utilization, payments, background
- Missing data and indicators
- Interactions are important (cash advance in Vegas)
- m > 67,000 predictors! Transaction history would vastly expand the problem.

Complication: Sparse Response
How much data do you really have? 3 million months of activity, each with 67,000 predictors.
Needle in many haystacks: Observe only 2,244 bankruptcies.
Further complication: What loss function should be minimized? Profitable customers look risky. Want to lose them? Who is the ideal customer? One who borrows lots of money and pays it back slowly.
Cross-validation: Less appealing with so few events.

Approach
Forward stepwise search: Use p-values to determine whether to add predictors.
Two questions:
1. How to calculate a p-value?
2. Where to put the threshold for the p-value?
Risk inflation criterion (RIC) (m = # possible predictors): Hard thresholding, Bonferroni, Fisher's method.
Add X_j if t_j² > 2 log m, equivalently p_j < 1/m.
Obtains the RIC bound (Foster & George 1994):
min_β̂ max_β E‖Xβ̂ − Xβ‖² / (σ² ‖β‖₀) ≈ 2 log m,  where ‖β‖₀ = #{j : β_j ≠ 0}.
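
A quick numerical check of the equivalence on this slide (an illustrative sketch, not from the talk): for m = 67,000 candidate predictors, the RIC hard threshold √(2 log m) sits at a two-sided normal p-value of the same order as the Bonferroni cut 1/m.

```python
import math

m = 67_000                             # number of candidate predictors
t_cut = math.sqrt(2 * math.log(m))     # RIC hard threshold on |t|

# Two-sided normal p-value at that threshold vs. the Bonferroni cut 1/m;
# the two rules agree up to the usual sqrt(log m) factor
p_at_cut = math.erfc(t_cut / math.sqrt(2))
print(round(t_cut, 2), p_at_cut, 1 / m)
```

Here t_cut ≈ 4.7, so for this problem RIC only admits predictors whose t-statistics are well beyond conventional significance.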

Adaptive Variable Selection
Sparse models: RIC is best when the truth is sparse, but it lacks power if there is much signal. The false discovery rate motivates an alternative.
Adaptive threshold: If you've added q predictors, add X_j if p_j < (q+1)/m.
Further motivation:
- Half-normal plot (Cuthbert Daniel)
- Generalized degrees of freedom (Ye 1998, 2002)
- Empirical Bayes (George & Foster 2000)
- Information theory (Foster, Stine & Wyner 2002): universal prior
Which predictors minimize this ratio?
min_q̂ max_π E‖Y − Ŷ(q̂)‖² / E‖Y − Ŷ(π)‖²  for β ~ π
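
The adaptive rule can be sketched in a few lines. This hypothetical version adds an explicit test level α (here 0.05), as in step-down FDR testing — the slide's p_j < (q+1)/m corresponds to α = 1 — and uses idealized, evenly spaced null p-values so the behavior is deterministic.

```python
import numpy as np

def adaptive_select(pvals, alpha=0.05):
    """Greedy adaptive thresholding: having accepted q predictors,
    accept the next-smallest p-value if p < alpha * (q + 1) / m."""
    m = len(pvals)
    accepted = 0
    for p in np.sort(pvals):
        if p < alpha * (accepted + 1) / m:
            accepted += 1
        else:
            break
    return accepted

m = 1000
null_p = (np.arange(1, m + 1) - 0.5) / m    # idealized uniform p-values: no signal
mixed = null_p.copy()
mixed[:50] = 1e-5                           # 50 strong signals among the nulls
print(adaptive_select(null_p), adaptive_select(mixed))
```

With no signal the rule stops immediately, but with 50 strong signals the rising threshold lets all 50 in — the extra power over a fixed RIC/Bonferroni cut that the slide describes.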

Example: Finding Subtle Signal
Signal is a Brownian bridge, a stylized version of financial volatility:
Y_t = BB_t + σ ε_t
[Figure: simulated Brownian bridge plus noise, t = 1, ..., 1000]

Example: Finding Subtle Signal
The wavelet transform has many coefficients large relative to the others, but none is very large.
[Figure: half-normal plot of wavelet coefficients; MSE with adaptive (top) vs. hard (bottom) thresholding]

But it fails for bankruptcy!
Thresholding classical p-values does not work: Fit on 600,000 cases, predict 2,400,000. As fitting proceeds, the out-of-sample error shows a sudden spike. Diagnostic plots revealed the problem.
Stylized problem: n = 10,000, dithered x's: x_1 = 1, x_2, ..., x_10,000 ~ N(0, 0.025); P(Y = 1) = 1/1000, independent of X.
[Figure: scatter of binary Y against X]
t = 14, but common sense suggests p ≈ 1/1000.

What went wrong?
Theory implies p-values are known:
- Normal distribution on the error (thin tailed).
- Know the true error variance σ², plus that it is constant.
- Predictors are orthogonal.
- β̂_j ~ iid N(β_j, SE_j²)
In practice...
- Is any data ever normally distributed?
- Don't believe there's a true model, much less a fixed error variance.
- Collinear features, frequently with m > n.
- At best, one can only guess p-values.

Honest p-values
Concern from the stock example: Must avoid false positives that lead to inaccurate predictions and a cascade of mistakes.
Robustness of validity: Rank regression would also protect you to some extent, but it has problems dealing with extreme heteroscedasticity.
Alternatives:

Method                           Requires
Robust estimator (e.g. ranks)    Homoscedasticity
White estimator                  Symmetry
Bennett bounds                   Bounded Y

Bounded-influence estimators are another possibility, but can they be computed fast enough?

White Estimator
Y = β̂_0 + β̂_{q,1} X_{q,1} + ... + β̂_{q,q} X_{q,q} + ε
Sandwich formula (H. White, 1980, Econometrica):
Var(β̂_q) = (X_q' X_q)^{-1} X_q' Var(ε) X_q (X_q' X_q)^{-1}
Estimate the middle term using the residuals from the prior step:
Vâr(β̂_q) = (X_q' X_q)^{-1} X_q' Diag(e²_{q-1}) X_q (X_q' X_q)^{-1}
Result in the stylized problem: The estimated SE is 10 times larger, giving a more appropriate p-value.
Moral: Watch your null hypothesis! Only test H_0: β_{q,q} = 0 rather than H_0: β_{q,q} = 0 and Var(ε_i) = σ².
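
A minimal NumPy sketch of the sandwich (HC0) estimator. The data-generating choices are illustrative assumptions, in the spirit of the "missed signal" example that follows: noise variance is largest near x = 0 and smallest at the extremes, so the White SE for the slope is smaller than the classical one.

```python
import numpy as np

def ols_se(X, y):
    """Classical and White (HC0) standard errors for OLS coefficients.
    Returns (beta_hat, se_classical, se_white)."""
    n, p = X.shape
    Z = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(Z.T @ Z)
    beta = XtX_inv @ Z.T @ y
    e = y - Z @ beta
    s2 = e @ e / (n - p - 1)
    se_classic = np.sqrt(s2 * np.diag(XtX_inv))
    meat = Z.T @ (Z * (e ** 2)[:, None])      # X' Diag(e^2) X
    sandwich = XtX_inv @ meat @ XtX_inv
    se_white = np.sqrt(np.diag(sandwich))
    return beta, se_classic, se_white

# Heteroscedastic data: true slope 2, noise sd largest at x = 0
rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(-1, 1, n)
y = 2 * x + rng.standard_normal(n) * 5 * (1 - np.abs(x))
beta, se_c, se_w = ols_se(x[:, None], y)
print(round(beta[1], 2), round(se_c[1], 3), round(se_w[1], 3))
```

The classical formula pools all the residual variance, while the sandwich weights residuals by leverage in the slope direction, which is why the two can disagree in either direction.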

Example: Finding Missed Signal
Question: Does the White estimator ever find an effect that OLS misses? Usually OLS underestimates the SE.
Heteroscedastic data: High variance at 0 obscures the differences at the extremes.
[Figure: scatter with large vertical spread near x = 0]
Least squares: Standard OLS reports SE = 3.6, giving t = 1.2. The White SE ≈ 0.9 and finds the underlying effect.

Honest p-values for Testing H_0: β = 0
Three methods: Each makes some assumption about the data in order to produce a reliable p-value.

Method                           Requires
Robust estimator (e.g. ranks)    Homoscedasticity
White estimator                  Symmetry
Bennett bounds                   Bounded Y

Bennett Inequality
Think differently: The sampling distribution of β̂ is not normal because huge leverage contributes Poisson-like variation.
Bennett inequality (Bennett, 1962, JASA): For bounded, independent r.v. U_1, ..., U_n with max |U_i| ≤ M, E U_i = 0, and Σ_i E U_i² = 1,
P(Σ_i U_i ≥ τ) ≤ exp( τ/M − (τ/M + 1/M²) log(1 + Mτ) )
If Mτ is small, log(1 + Mτ) ≈ Mτ − M²τ²/2, and the bound reduces to the normal tail exp(−τ²/2).
Allows heteroscedastic data: Free to divvy up the variances as you choose, albeit only for bounded random variables. In the stylized example, it assigns a p-value ≈ 1/100.
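
The bound is easy to evaluate numerically. A small sketch (illustrative, not from the talk): as the bound M on the summands shrinks, the Bennett tail bound approaches the Gaussian tail exp(−τ²/2), matching the expansion above; for large M it is much more conservative.

```python
import math

def bennett_tail(tau, M):
    """Bennett upper bound on P(sum U_i >= tau) for independent, mean-zero
    U_i with |U_i| <= M and total variance sum E[U_i^2] = 1."""
    return math.exp(tau / M - (tau / M + 1 / M**2) * math.log1p(M * tau))

# With small M the bound approaches the Gaussian tail exp(-tau^2 / 2)
for M in (1.0, 0.1, 0.001):
    print(M, bennett_tail(3.0, M))
print(math.exp(-4.5))
```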

Classification Results
Success! Not only does the model obtain smaller costs than C4.5 (with or without boosting), it has huge lift:
[Figure: lift chart, % of bankruptcies found vs. % called]
But...
- Most predictors were interactions.
- Know that you missed some things.
- Slloooowwwwwwww.

Big Picture, 2
Moral: Better get the right p-values.
Successful: Carefully computed standard errors and p-values + adaptive thresholding rules = competitive procedure.
Downside:
- Know that we're not leveraging domain knowledge.
- The model is more complicated than it need be.
- Took forever to run.

Obesity
The issues arising in the bankruptcy application only get worse as the data set gets wider... and data sets seem to be getting wider and wider.

Application   Number of Cases   Number of Raw Features
Bankruptcy    3,000,000         350
Faces         10,000            1,400
Genetics      1,000             10,000
CiteSeer      500               10,000,000

Sequential Selection
Simple idea: Search features in some order rather than all at once.
Opportunities:
- Heuristics: Incorporate substantive knowledge into the search order; sequential selection of features based on current status; an open rather than closed view of the space of predictors.
- Theory from adaptive selection suggests that if you can order the predictors, there is little to be gained from knowing β.
- Alpha-spending rules in clinical trials.
- Run faster: Depth-first rather than breadth-first search.

Multiple Hypothesis Testing
Test m null hypotheses {H_1, ..., H_m}, H_j: θ_j = 0.

                 Accept H_0     Reject H_0
H_0 true         U_θ(m)         V_θ(m)        m_0
H_0 false        T_θ(m)         S_θ(m)        m − m_0
                 m − R(m)       R(m)          m

V_θ(m) = Σ_{j=1}^m V_j, a sum of indicators.
Classical criterion: Familywise error rate, FWER(m) = P_0(V_θ(m) ≥ 1).
Classical procedure (Bonferroni): Test each H_j at level α_j so that P_0(V_j = 1) ≤ α_j, with Σ_j α_j ≤ α.

False Discovery Rate
Approach: Control the testing procedure once it rejects.
FDR criterion (Benjamini & Hochberg 1995, JRSSB): The expected proportion of false positives among the rejections,
FDR(m) = E_θ( V_θ(m)/R(m) | R(m) > 0 ) P(R(m) > 0).
Step-down testing (procedure): Order the p-values of m independent tests,
p_(1) ≤ p_(2) ≤ ... ≤ p_(m).
Start with the Bonferroni comparison p_(1) ≤ α/m, then gradually raise the threshold as more hypotheses are rejected: reject H_(j) if p_(j) ≤ αj/m.
Controls FWER and FDR with more power.
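
The step-down comparisons just described can be sketched directly (hypothetical helper name; the classical Benjamini-Hochberg procedure is the step-up version, which scans from the largest p-value instead):

```python
import numpy as np

def stepdown_reject(pvals, alpha=0.05):
    """Reject hypotheses in order of increasing p-value while
    p_(j) <= alpha * j / m; stop at the first comparison that fails."""
    m = len(pvals)
    order = np.argsort(pvals)
    reject = np.zeros(m, dtype=bool)
    for j, idx in enumerate(order, start=1):
        if pvals[idx] <= alpha * j / m:
            reject[idx] = True
        else:
            break
    return reject

pvals = np.array([0.001, 0.008, 0.039, 0.041, 0.09, 0.20, 0.51, 0.74])
print(stepdown_reject(pvals).sum())   # rejects the two smallest
```

At α = 0.05 the first two p-values clear their thresholds (0.00625 and 0.0125) but the third does not (0.039 > 0.01875), so the scan stops; Bonferroni alone (0.05/8 = 0.00625) would reject only the first.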

An Alternative to FDR
Count the number of correct rejections S_θ(m) in excess of a fraction of the total number rejected:
EDC_{α,γ}(m) = E_θ[S_θ(m) − γ R(m)] + α,  0 < α, γ < 1.
Heuristically, α controls FWER and γ controls FDR.
A testing procedure controls EDC if EDC ≥ 0.
[Figure: E_θ S_θ vs. γ E_θ R_θ − α as the signal grows from none to moderate to strong]

Comparison to FDR
Simulate tests of m = 200 hypotheses with
μ_j = 0 w.p. 1 − π_1,  μ_j ~ N(0, σ²) w.p. π_1.
[Figure: FDR and EDC as functions of π_1 for three methods: naive fixed α level, step-down, Bonferroni]

Alpha-Investing Rules
Sequentially test hypotheses in the style of alpha-spending rules.
Alpha-wealth: Test the next hypothesis H_j at any level α_j up to the current alpha-wealth, 0 ≤ α_j ≤ W(j−1).
P-value: Determines how the alpha-wealth changes:
W(j) − W(j−1) = ω − p_j  if p_j ≤ α_j,
W(j) − W(j−1) = −α_j     if p_j > α_j.
Payout: If the test rejects, the rule earns a payout ω toward future tests. Otherwise, its wealth decreases by α_j.
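
The wealth dynamics can be sketched directly. Everything here besides the update rule is an illustrative assumption: the half-wealth wager is one arbitrary spending strategy, and the stream front-loads five strong signals ahead of fifty nulls.

```python
import numpy as np

def alpha_invest(pvals, w0=0.05, omega=0.025):
    """Alpha-investing over a stream of p-values: wager half the current
    wealth on each test; a rejection earns omega - p, a miss costs the wager."""
    wealth, rejects = w0, []
    for j, p in enumerate(pvals):
        level = wealth / 2            # one possible spending strategy
        if p <= level:
            rejects.append(j)
            wealth += omega - p       # payout for a rejection
        else:
            wealth -= level           # spend the level tested at
    return rejects, wealth

rng = np.random.default_rng(3)
stream = np.concatenate([np.full(5, 1e-4), rng.uniform(size=50)])
rej, w = alpha_invest(stream)
print(rej[:5], round(w, 4))
```

Each early rejection replenishes the wealth, so the rule keeps power for later tests, while a run of nulls steadily drains it; wealth never goes negative because each wager is limited by the wealth in hand.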

Alpha-investing Controls EDC
Benefit of EDC: Because it uses differences in counts rather than an expected ratio, we can prove that alpha-investing satisfies the EDC criterion.
Theorem: An alpha-investing rule with initial wealth W(0) ≤ α and payout ω ≤ 1 − γ controls EDC for any stopping time M:
inf_M inf_θ E_θ EDC_{α,γ}(M) ≥ 0
Generality: The tests need not be independent, just honest in the sense that E(V_j | R_1, ..., R_{j−1}) ≤ α_j under H_j.

Comparison to Step-Down Testing
Simulation: Same setup, 200 hypotheses, varying the proportion of signal π_1.
Testing procedures:
- Step-down testing
- Alpha-investing: conservative and aggressive
Two scenarios: What is known? The order of testing differs in the two cases:
- Nothing: random order.
- Lots: ordered by μ_j (not Y_j).

Approaches to Alpha-Investing
Conservative: Micro-investing matches the performance of FDR. Suppose FDR rejects k hypotheses, and α = ω = W(0).
- Test all m hypotheses at the Bonferroni level α/m.
- Rejects H_(1): pay α = m · α/m, earn α.
- Test the other m − 1 hypotheses given p_j > α/m.
- Rejects H_(2): pays α and earns α.
- Continue... Rejects at least those FDR rejects, and perhaps more.
Aggressive: Spend more α testing the leading hypotheses, since these should have the most signal if your science is right. E.g., set the levels as α_j = c W(0)/j², j = 1, 2, ...

Results of Simulation
Simulate tests of m = 200 hypotheses with
μ_j = 0 w.p. 1 − π_1,  μ_j ~ N(0, σ²) w.p. π_1.
[Figure: FDR and EDC vs. π_1 for step-down testing and alpha-investing (conservative, aggressive)]

Knowledge = Power
Percentage of rejections relative to step-down testing:
100 × S_θ(m, aggressive investing) / S_θ(m, step-down) > 150%
Aggressive alpha-investing works well when the information it uses is accurate.
[Figure: % rejected vs. step-down for aggressive and conservative investing as π_1 varies]

Going Further
In for a penny, in for a pound: EDC and alpha-investing provide a framework for testing a stream of hypotheses using several investing rules.
Multiple predictor streams: Start with the raw variables as the basic stream. Once X_1 and X_2 are selected, try X_1 × X_2.
Auction = multiple alpha-investing rules. It works!
- Two rules: one for the X's and a second for interactions.
- Each starts with wealth α/2.
- Easy to wager more for the m X's than for the m² interactions.
- The best strategy accumulates wealth and makes the most choices.
- Have run auctions through 1,000,000 predictors; finds linear effects and high-order interactions.

Discussion
Variable selection:
- Importance of p-values (which not all methods have).
- Role for cross-validation.
- Provably as good as a class of Bayes estimators.
Strategies for feature creation: Room for serious improvement.
- A substantive expert can order features (chemist).
- Parasitic rules that exploit substantive rules.
Multiple testing using EDC:
- Adaptive sequential trials.
- Universal wagering.