
Problem statement

The data: orthonormal regression with lots of X's (possibly many of the β's are zero):
$$Y_i = \beta_0 + \sum_{j=1}^{p} \beta_j X_{ij} + \sigma Z_i, \qquad Z_i \sim N(0,1).$$

Equivalent form: the normal means problem (known σ):
$$Y_i = \mu_i + Z_i, \qquad Z_i \sim N(0,1).$$

Unimodal prior for µ: π ∈ M iff π(µ) is symmetric and |µ| ≤ |µ'| implies π(µ) ≥ π(µ').

Risk function: Kullback-Leibler divergence,
$$R_n(\vec\mu, \pi) = \int \log\frac{P_{\vec\mu}(Y \mid X)}{P_{\pi}(Y \mid X)} \; P_{\vec\mu}(Y \mid X)\, dY.$$

Problem: find a universal π.
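A minimal numerical sketch of this setup (my own illustration, not code from the talk): the prior is discretized on a grid, the marginal φ_π is the convolution of that prior with the standard normal, and R_n is estimated by Monte Carlo. All names (`kl_risk`, `mu_grid`, the example N(0, 2²) prior, the sparse µ⃗) are mine.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def marginal_density(y, mu_grid, prior_weights):
    """phi_pi(y) = integral of phi(y - mu) pi(mu) dmu, prior on a grid."""
    return norm.pdf(y[:, None] - mu_grid[None, :]) @ prior_weights

def kl_risk(mu, mu_grid, prior_weights, n_mc=4000):
    """Monte Carlo estimate of R_n(mu, pi) = E_mu[log P_mu(Y) / P_pi(Y)]."""
    risk = 0.0
    for mu_i in mu:
        y = rng.normal(mu_i, 1.0, size=n_mc)
        log_p_mu = norm.logpdf(y, loc=mu_i)
        log_p_pi = np.log(marginal_density(y, mu_grid, prior_weights))
        risk += np.mean(log_p_mu - log_p_pi)
    return risk

# Example: a unimodal prior = discretized N(0, 2^2), and a sparse mean vector.
mu_grid = np.linspace(-30, 30, 601)
w = norm.pdf(mu_grid, scale=2.0)
w /= w.sum()
mu = np.array([0.0] * 8 + [5.0] * 2)   # mostly-null mu
print(kl_risk(mu, mu_grid, w))
```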

Risk lower bounds

Theorem 1. For all n, for all µ⃗, and all π ∈ M,
$$R_n(\vec\mu, \pi) \;\ge\; c \sum_i \min\!\Big( \mu_i^2 + \epsilon(\pi),\; 1 \wedge \mu_i^2 + \log\tfrac{|\mu_i|}{\epsilon(\pi)} \Big).$$

But how is ε(π) defined?

Marginal distribution of Y_i:
$$\phi_\pi(y) = \int \phi(y - \mu)\, \pi(\mu)\, d\mu.$$

τ(π) says when φ_π's tail gets fat relative to a normal tail:
$$\tau(\pi) = \inf_\tau \Big\{ \tau : \frac{\int_\tau^\infty \phi_\pi(y)\, dy}{\int_\tau^\infty \phi(y)\, dy} > e^2 = 7.38\ldots \Big\}.$$

ε(π) measures how big this fat tail is:
$$\epsilon(\pi) = \int_{\tau(\pi)}^\infty \phi_\pi(y)\, dy.$$
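A sketch (mine, following the definitions above) of computing τ(π) and ε(π) numerically for a grid-discretized prior: scan τ upward until the marginal's tail mass exceeds e² times the N(0,1) tail mass, then report the marginal tail mass at that point. The grids and the example prior are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def tau_and_eps(mu_grid, prior_weights, y_grid):
    """Find tau(pi) and eps(pi) for a prior discretized on mu_grid."""
    # Marginal phi_pi on y_grid: mixture of N(mu, 1) densities.
    phi_pi = norm.pdf(y_grid[:, None] - mu_grid[None, :]) @ prior_weights
    for tau in y_grid[y_grid > 0]:
        mask = y_grid >= tau
        tail = np.trapz(phi_pi[mask], y_grid[mask])
        if tail / norm.sf(tau) > np.e ** 2:   # threshold e^2 = 7.38...
            return tau, tail                   # tau(pi), eps(pi)
    return np.inf, 0.0                         # tail never gets "fat"

# Example: discretized N(0, 2^2) prior.
mu_grid = np.linspace(-30, 30, 601)
w = norm.pdf(mu_grid, scale=2.0)
w /= w.sum()
y_grid = np.linspace(-40, 40, 2001)
print(tau_and_eps(mu_grid, w, y_grid))
```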

Knowing ε(π) is as good as knowing π

Goal: find a single prior that does almost as well as any unimodal prior with a fixed value of ε(π).

Spike and slab (Cauchy slab):
$$\hat\pi_\epsilon(\mu) = (1 - \epsilon)\,\text{Spike} + \epsilon\,\text{Cauchy}.$$

Theorem 2. For all n, for all µ⃗, and ε ≤ .5,
$$R_n(\vec\mu, \hat\pi_\epsilon) \;\le\; 2 \sum_i \min\!\Big( \mu_i^2 + \epsilon,\; 1 \wedge \mu_i^2 + \log\tfrac{|\mu_i|}{\epsilon} \Big).$$

Note: same shape as the lower bound, so it is off by only a constant factor.
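A numerical sketch (my own, not the authors' code) of the p = 1 risk of the spike-and-Cauchy prior, the quantity plotted in Figure 1 on the next slide. The slab is taken to be a standard Cauchy (an assumption; the talk does not state the scale), and both the marginal and the KL risk are computed by quadrature.

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad

def marginal(y, eps):
    """phi_{pi_hat_eps}(y) = (1 - eps) phi(y) + eps (phi * Cauchy)(y)."""
    slab, _ = quad(lambda m: norm.pdf(y - m) * cauchy.pdf(m), -50, 50)
    return (1 - eps) * norm.pdf(y) + eps * slab

def risk(mu, eps):
    """R_1(mu, pi_hat_eps) = E_mu[log phi(Y - mu) - log phi_pi(Y)]."""
    integrand = lambda y: (norm.logpdf(y - mu)
                           - np.log(marginal(y, eps))) * norm.pdf(y - mu)
    val, _ = quad(integrand, mu - 10, mu + 10)
    return val

# Trace out the risk curve for eps = 0.001, as in Figure 1.
for mu in [0.0, 1.0, 3.0, 6.0]:
    print(mu, risk(mu, eps=0.001))
```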

Suppose p = 1: our risk compared to the lower bound.

Figure 1: Risk of the Cauchy mixture $\hat\pi_{0.001}$ and the lower bound on the divergence attainable by any Bayes prior with ε(π) = 0.001.

Figures 2 and 3: The ratio is bounded by 6 in these examples, for ε = 0.01 (left) and ε = 0.00001 (right).

Empirical Bayes: doing without ε

Put a prior on ε: ε ~ Beta(0, p), strongly biased toward the null model:
- puts most of the weight near ε = 0;
- P(ε < 1/p) > .5.

This induces an exchangeable prior over µ⃗; call it π*.

Theorem 3.
$$R_n(\vec\mu, \pi^*) \;\le\; \omega_0 + \omega_1 \inf_{\pi \in M} R_n(\vec\mu, \pi).$$

Key point: π* has almost as good a risk as the best unimodal prior.
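Beta(0, p) is the improper limit of Beta(a, p) as a → 0, so a quick sanity check (mine) approximates it with a small but proper first parameter and verifies the tail-mass claim above: already at a = 1 more than half the mass lies below 1/p, and the fraction grows as a shrinks.

```python
from scipy.stats import beta

p = 100
for a in [1.0, 0.1, 0.01]:
    # P(eps < 1/p) under Beta(a, p); exceeds .5 for all these choices of a.
    print(a, beta.cdf(1 / p, a, p))
```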

Do there exist other procedures with ω_0 and ω_1 both constant?

- Normal is bad: a spike and normal slab has unbounded ω_1 (even if calibration is used, as in George and Foster).
- Traditional rules are bad: AIC / BIC / C_p have unbounded ω_1.
- Risk inflation is better: the best a testimator can achieve is ω_1 = O(log p) (Donoho and Johnstone / Foster and George).
- Jeffreys is competitive: if ε ~ Beta(.5, .5) then ω_1 is constant, but ω_0 = O(log p). So still not linear.
- Adaptive rules work: some adaptive procedures might work (though nothing has been proven):
  - Simes-like methods (Benjamini and Hochberg)
  - estimated degrees of freedom (Ye)
  - empirical Bayes? (Zhang)

Everyone likes a good forecast

If you don't like the risk perspective, how about a forecasting perspective? Dawid's prequential approach:
- predict successive observations;
- use so-called log-loss: the decision-maker gives a forecast P(·), Y is observed, and Loss = log(1/P(Y)).

Our total loss = intrinsic loss (µ known) + O(best Bayes excess):
$$\sum_{i=1}^n \log\frac{1}{P^{i-1}_{\hat\pi}(Y_i)} \;=\; \underbrace{\sum_{i=1}^n \log\frac{1}{P^{i-1}_{\vec\mu}(Y_i)}}_{O(n)} \;+\; \underbrace{O_p\Big( \inf_{\pi \in M} R_n(\vec\mu, \pi) \Big)}_{O(\log n)}.$$
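A sketch (mine) of this prequential comparison: accumulate log-loss for the known-µ forecaster and for a Bayes forecaster. Two simplifications relative to the talk: the slab is a standard Cauchy, and ε is held fixed rather than updated empirically, so with an iid prior the one-step predictive for each Y_i is just the marginal φ_π. The gap between the two losses is the Bayes excess.

```python
import numpy as np
from scipy.stats import norm, cauchy
from scipy.integrate import quad

rng = np.random.default_rng(1)

def log_marginal(y, eps):
    """log phi_pi(y) for the fixed spike-and-Cauchy prior."""
    slab, _ = quad(lambda m: norm.pdf(y - m) * cauchy.pdf(m), -50, 50)
    return np.log((1 - eps) * norm.pdf(y) + eps * slab)

mu = np.array([0.0] * 18 + [4.0] * 2)   # sparse signal
y = rng.normal(mu, 1.0)

loss_known = -norm.logpdf(y, loc=mu).sum()               # intrinsic loss, O(n)
loss_bayes = -sum(log_marginal(yi, eps=0.05) for yi in y)
print(loss_known, loss_bayes, loss_bayes - loss_known)   # excess is the gap
```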

Take-home messages

- Don't worry about eliciting the shape of an IID prior for variable selection: automatic methods do it well enough that the effort isn't justified.
- Bias your priors toward not including variables: pretend you have seen p insignificant variables before you start, and make sure about half of your probability is on the no-signal model.
- Cauchy priors are cool!

Adaptive Variable Selection with Bayesian Oracles

Dean Foster & Bob Stine
Department of Statistics, The Wharton School
University of Pennsylvania, Philadelphia PA
diskworld.wharton.upenn.edu

November 3, 2002


Abstract

We analyze the performance of adaptive variable selection with the aid of a Bayesian oracle. A Bayesian oracle supplies the statistician with a distribution for the unknown model parameters, here the coefficients in an orthonormal regression. We derive lower bounds for the predictive risk of regression models constructed with the aid of a class of Bayesian oracles, those that are unimodal and symmetric about zero. These bounds are not asymptotic and hold for all sample sizes and model parameters. We then construct a model whose predictive risk is bounded by a linear function of the risk obtained by the best Bayesian oracle. The procedure that achieves this performance is related to an empirical Bayes estimator and to those derived from step-up/step-down testing.
