Cross-Validation in Function Estimation

Chong Gu

October 1, 2006

Cross-validation is an intuitive and effective technique for model selection in data analysis. In this discussion, I try to present a few incarnations of the general technique in a few nonparametric function estimation settings. Justifications of the technique in Gaussian regression settings will be discussed, along with possible reasons for the lack of similar justification in other settings. There will be discussions of some subtle conceptual issues which put certain widely adopted concepts and practices under scrutiny.

1 Cross-Validation and Related Techniques

1.1 PRESS and $C_p$

Consider a linear regression model with $P-1$ predictors $X_1, \dots, X_{P-1}$,

\[ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{P-1} X_{P-1} + \epsilon, \]

where $\epsilon \sim N(0, \sigma^2)$. It is assumed that $\mu_Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_{P-1} X_{P-1}$, but some of the $\beta_j X_j$'s may contribute very little or not at all. For model selection in this setting (a.k.a. variable selection), two effective techniques are PRESS and $C_p$.

Observing $(Y_i, X_{i,1}, \dots, X_{i,P-1})$, $i = 1, \dots, n$, one may calculate PRESS (Predicted REsidual Sum of Squares) for every one of the $(2^{P-1} - 1)$ possible models,

\[ \mathrm{PRESS} = \sum_{i=1}^n (Y_i - \hat{Y}_{i(i)})^2, \]

where $\hat{Y}_{i(i)} = \hat\beta_0^{[i]} + \hat\beta_1^{[i]} X_{i,1} + \cdots + \hat\beta_{P-1}^{[i]} X_{i,P-1}$ and the $\hat\beta_j^{[i]}$'s are LS estimates using the $(n-1)$ observations excluding $(Y_i, X_{i,1}, \dots, X_{i,P-1})$. One can then choose the model with the minimum PRESS score. PRESS is probably the first incarnation of cross-validation, and the objective of model selection is to achieve more precise prediction.

For every one of the $(2^{P-1} - 1)$ possible models, one may also calculate the $C_p$ statistic,

\[ C_p = \frac{\mathrm{SSE}}{\mathrm{MSE}(X_1, \dots, X_{P-1})} - (n - 2p), \]

where SSE is for the model under consideration and $p$ is its number of coefficients. One selects a model with a small $C_p$ (so that the MSE $E(\hat\mu - \mu)^2$ is small) that is close to $p$ (so that $E\hat\mu \approx \mu$). The objective of $C_p$-based selection is the estimation precision of $\mu_Y$, not the identification of the correct model, as the full model is assumed to be correct to start with. The estimation of negligible $\beta_j$'s diverts resources and thus results in higher variance of the important $\hat\beta_j$'s, and the elimination of the minor terms, while introducing bias in $\hat\mu$, helps to reduce the variance.
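The two scores of Section 1.1 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names (`full_model_mse`, `press_and_cp`, `press_brute_force`) and the simulated data are hypothetical and not from the notes, and PRESS is computed through the standard hat-matrix shortcut $(Y_i - \hat{Y}_i)/(1 - h_{ii})$ for least squares rather than by refitting $n$ times.

```python
import numpy as np

def full_model_mse(X, y):
    """Residual mean square SSE/(n - P) of the full model (intercept added)."""
    n = len(y)
    Xf = np.column_stack([np.ones(n), X])
    resid = y - Xf @ np.linalg.pinv(Xf) @ y
    return np.sum(resid ** 2) / (n - Xf.shape[1])

def press_and_cp(X, y, cols, mse_full):
    """PRESS and C_p for the submodel using predictor columns `cols`."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, cols]])
    p = Xs.shape[1]                          # number of coefficients
    H = Xs @ np.linalg.pinv(Xs)              # hat matrix of the submodel
    resid = y - H @ y
    # Delete-one residuals without refitting: (Y_i - Yhat_i) / (1 - h_ii)
    press = np.sum((resid / (1.0 - np.diag(H))) ** 2)
    cp = np.sum(resid ** 2) / mse_full - (n - 2 * p)
    return press, cp

def press_brute_force(X, y, cols):
    """PRESS by actually refitting without observation i (for checking)."""
    n = len(y)
    Xs = np.column_stack([np.ones(n), X[:, cols]])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(Xs[keep], y[keep], rcond=None)
        total += (y[i] - Xs[i] @ beta) ** 2
    return total

# Hypothetical data: only the first of three predictors matters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 1.0 + 2.0 * X_demo[:, 0] + rng.normal(size=50)
mse_demo = full_model_mse(X_demo, y_demo)
for cols in ([0], [0, 1], [0, 1, 2]):
    print(cols, press_and_cp(X_demo, y_demo, cols, mse_demo))
```

For the full model the $C_p$ formula reduces to $C_p = P$ by construction, which gives a quick sanity check on any implementation.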

1.2 Cross-validation, GCV, and $C_L$

Consider a regression model $Y = \eta(x) + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ and $\eta(x)$ is smooth. Two of the popular approaches to the nonparametric estimation of $\eta(x)$ are the kernel method and the penalized least squares method. To keep things simple, consider only univariate $x$.

Observing $(Y_i, X_i)$, $i = 1, \dots, n$, one may estimate $\eta(x)$ via

\[ \hat\eta_h(x) = \frac{1}{nh} \sum_{i=1}^n Y_i K\!\left(\frac{X_i - x}{h}\right), \]

where $K(\cdot)$ is a given kernel function. Typically, $K(x)$ is unimodal and symmetric with respect to 0, $\int K(x)\,dx = 1$, and $\int x^j K(x)\,dx = 0$ for $j = 1, \dots, m$. This provides a family of estimates indexed by the bandwidth $h$, with smaller $h$ yielding smaller bias but larger variance and larger $h$ yielding smaller variance but larger bias.

Assuming $\int (\eta^{(m)}(x))^2 dx < \infty$, one may estimate $\eta(x)$ using the minimizer $\eta_\lambda(x)$ of

\[ \frac{1}{n} \sum_{i=1}^n (Y_i - \eta(X_i))^2 + \lambda \int (\eta^{(m)}(x))^2 dx, \tag{1} \]

where the smoothing parameter $\lambda$ plays a role similar to that of $h$ for kernel estimates. With $\lambda = \infty$ one forces a parametric model of the form $\eta(x) = \beta_0 + \beta_1 x + \cdots + \beta_{m-1} x^{m-1}$. With $\lambda = 0+$ one obtains an interpolant with minimum $\int (\eta^{(m)}(x))^2 dx$. The solution is called a smoothing (natural) spline, as $\eta_\lambda$ is a piecewise polynomial of order $2m - 1$.

Evaluating the estimated $\eta(x)$ at the data points $X_i$, one gets the predicted values $\hat{Y}_i$. For the kernel estimates, smoothing spline estimates, and other nonparametric estimates of $\eta(x)$ known as linear smoothers, one has $\hat{Y} = AY$, where $A$ is the smoothing matrix indexed by $h$ or $\lambda$ or the like. In practice, one needs to select $h$ or $\lambda$, for which the methods in the section title are designed; the smoothing matrix $A$ plays an important role here.

Write $\hat{Y}_i = \eta_\lambda(X_i)$ with $\eta_\lambda(x)$ the minimizer of (1), and $\hat{Y}_{i(i)} = \eta_\lambda^{[i]}(X_i)$, where $\eta_\lambda^{[i]}(x)$ minimizes the delete-one version of (1),

\[ \frac{1}{n} \sum_{j \neq i} (Y_j - \eta(X_j))^2 + \lambda \int (\eta^{(m)}(x))^2 dx. \tag{2} \]

It can be shown that $\eta_\lambda^{[i]}(x)$ is the minimizer of (1) with $\eta_\lambda^{[i]}(X_i)$ replacing $Y_i$, thus $a_{i,i}(Y_i - \hat{Y}_{i(i)}) = \hat{Y}_i - \hat{Y}_{i(i)}$, or $Y_i - \hat{Y}_{i(i)} = (Y_i - \hat{Y}_i)/(1 - a_{i,i})$. This leads to the ordinary cross-validation score

\[ V_0(\lambda) = \frac{1}{n} \sum_{i=1}^n (Y_i - \hat{Y}_{i(i)})^2 = \frac{1}{n} \sum_{i=1}^n \frac{(Y_i - \hat{Y}_i)^2}{(1 - a_{i,i})^2}. \tag{3} \]

An invariance argument suggests the replacement of the $a_{i,i}$ by their average value $\mathrm{trace}A/n$, yielding the generalized cross-validation (GCV) of Craven and Wahba (1979),

\[ V(\lambda) = \frac{n^{-1}\, Y^T (I - A(\lambda))^2 Y}{\{n^{-1}\, \mathrm{trace}(I - A(\lambda))\}^2}. \tag{4} \]
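The delete-one identity behind (3) and the GCV score (4) can be checked numerically. The sketch below does not fit a true smoothing spline: as a stated assumption, it replaces the integral penalty in (1) by a discrete second-difference penalty on the fitted values, so that $A(\lambda) = (I + n\lambda D^T D)^{-1}$ with $D$ the second-difference matrix. The function names are illustrative only.

```python
import numpy as np

def smoother_matrix(n, lam):
    """Smoothing matrix of a discrete stand-in for (1) with m = 2:
    minimize (1/n)||y - eta||^2 + lam * ||D eta||^2."""
    D = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n second differences
    return np.linalg.inv(np.eye(n) + n * lam * D.T @ D)

def ocv_gcv(y, A):
    """Ordinary CV score (3) and GCV score (4) for a symmetric smoother A."""
    n = len(y)
    resid = y - A @ y
    ocv = np.mean((resid / (1.0 - np.diag(A))) ** 2)
    gcv = np.mean(resid ** 2) / (np.trace(np.eye(n) - A) / n) ** 2
    return ocv, gcv

def ocv_brute_force(y, lam):
    """Score (3) computed by solving each delete-one problem (2) directly."""
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)
    K = n * lam * D.T @ D
    total = 0.0
    for i in range(n):
        E = np.eye(n)
        E[i, i] = 0.0                            # drop observation i from the fit
        eta_i = np.linalg.solve(E + K, E @ y)
        total += (y[i] - eta_i[i]) ** 2
    return total / n

# Hypothetical data on an equally spaced grid.
rng = np.random.default_rng(0)
x = (np.arange(1, 31) - 0.5) / 30
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=30)
print(ocv_gcv(y, smoother_matrix(30, 0.05)), ocv_brute_force(y, 0.05))
```

The shortcut and the brute-force delete-one fits agree up to rounding, which is the content of the identity $Y_i - \hat{Y}_{i(i)} = (Y_i - \hat{Y}_i)/(1 - a_{i,i})$ for penalized least squares.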

Note that although the derivation of GCV is through (1), it can be, and is, used for all linear smoothers. Closely related to GCV is the $C_L$ score, which assumes a known $\sigma^2$,

\[ U(\lambda) = \frac{1}{n} Y^T (I - A(\lambda))^2 Y + \frac{2}{n} \sigma^2\, \mathrm{trace}A(\lambda). \tag{5} \]

1.3 Optimality of $U(\lambda)$ and $V(\lambda)$

Before presenting the theoretical justification of $C_L$ and GCV, let me try to clarify a few conceptual issues.

In parametric statistics, one has a discrete collection of tentative models and assumes that the correct model is among them, and a model selection method is consistent if it zeroes in on this correct model as $n \to \infty$. The consistency property is not prediction-oriented as in the design of PRESS, nor is it finite-sample MSE-oriented as in the practical use of $C_p$; with $n \to \infty$ for a fixed number of parameters, variance is of no concern and all that matters is the bias. [I am not familiar with this line of literature, so I do not know if there are results when none of the tentative models is correct, or whether one could formulate the problem in such a way that the number of parameters increases with $n$ so that the bias-variance trade-off remains relevant.]

In nonparametric function estimation, the family of tentative estimates forms a trajectory in the function space, and one could not and does not assume that the true function is on the trajectory. Instead, one looks for the estimate that is closest to the true function in some sense. The objective of such model selection is not to identify the correct model (there is none), but rather to locate the best estimate given the observed data. Bias-variance trade-off is the central issue here (though not always explicitly), and as $n \to \infty$, the optimal choice would have smaller and smaller bias.

For Gaussian regression, a natural measure for estimation precision is the MSE on the data points,

\[ L(\lambda) = \frac{1}{n} \sum_{i=1}^n (\hat\eta_\lambda(X_i) - \eta(X_i))^2, \]

and the estimate on the trajectory that minimizes this MSE would be the optimal choice. Note that the trajectory depends on the observed data, and so do the MSE loss and its minimizer. The $C_L$ score of (5) is actually an unbiased estimate of relative loss, with relative meaning the dropping of terms that do not depend on $\lambda$.

The optimality of $U(\lambda)$ and $V(\lambda)$ was established by Ker-Chau Li (1986): Let $\lambda^*$, $\lambda_u$, and $\lambda_v$ be the minimizers of $L(\lambda)$, $U(\lambda)$, and $V(\lambda)$, respectively, then

\[ \frac{L(\lambda^*)}{L(\lambda_u)} \to_p 1, \qquad \frac{L(\lambda^*)}{L(\lambda_v)} \to_p 1. \tag{6} \]

The key results leading to these are that as $\lambda \to 0$ at certain rates that contain the optimal one,

\[ U(\lambda) - L(\lambda) - n^{-1} \epsilon^T \epsilon = o_p(L(\lambda)), \qquad V(\lambda) - L(\lambda) - n^{-1} \epsilon^T \epsilon = o_p(L(\lambda)). \]

The main condition for these is that $nR(\lambda) \to \infty$, where $R(\lambda) = E[L(\lambda)]$, which states that the parametric $\sqrt{n}$-consistency is not achievable in the setting. Note that $U(\lambda)/V(\lambda) \to_p 1$ and $L(\lambda) = o_p(1)$, so these are delicate results. For the record, as $n \to \infty$ and $\lambda \to 0$, $L(\lambda) = O_p(\lambda + n^{-1} \lambda^{-1/2m})$ for the minimizer of (1).
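The statement that the $C_L$ score is unbiased for the relative loss can be verified by direct expansion. The following sketch assumes a symmetric smoothing matrix $A = A(\lambda)$ that does not depend on $Y$ for fixed $\lambda$ (as is the case for the minimizer of (1)); here $\eta$ and $\epsilon$ stand for the $n$-vectors of $\eta(X_i)$ and $\epsilon_i$, so that $Y = \eta + \epsilon$.

```latex
\begin{align*}
\tfrac{1}{n} Y^T (I-A)^2 Y
  &= \tfrac{1}{n} \|(I-A)\eta\|^2 + \tfrac{2}{n}\, \eta^T (I-A)^2 \epsilon
     + \tfrac{1}{n}\, \epsilon^T (I-A)^2 \epsilon, \\
L(\lambda) = \tfrac{1}{n} \|AY - \eta\|^2
  &= \tfrac{1}{n} \|(I-A)\eta\|^2 - \tfrac{2}{n}\, \eta^T (I-A) A \epsilon
     + \tfrac{1}{n}\, \epsilon^T A^2 \epsilon, \\
U(\lambda) - L(\lambda) - \tfrac{1}{n}\, \epsilon^T \epsilon
  &= \tfrac{2}{n}\, \eta^T (I-A) \epsilon
     - \tfrac{2}{n} \bigl\{ \epsilon^T A \epsilon - \sigma^2\, \mathrm{trace} A \bigr\}.
\end{align*}
```

Both terms on the right of the last line have mean zero, since $E[\epsilon] = 0$ and $E[\epsilon^T A \epsilon] = \sigma^2\, \mathrm{trace}A$, while $n^{-1} \epsilon^T \epsilon$ does not involve $\lambda$. The optimality result (6) requires the stronger statement that these remainder terms are negligible relative to $L(\lambda)$ itself.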

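1.4 Cross-validation for density estimation

Consider a density estimation problem with independent samples $X_i \sim f(x)$, $i = 1, \dots, n$. To estimate $f(x)$, one may use the kernel estimate,

\[ f_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{X_i - x}{h}\right), \]

where $K(\cdot)$ is as given earlier. To assess the performance of $f_h(x)$ as an estimate of $f(x)$, one may use the Kullback-Leibler loss $L(h) = E_f[\log\{f(X)/f_h(X)\}]$, which, after dropping the term $E_f[\log f(X)]$ not involving $h$, reduces to the relative KL discrepancy $-E_f[\log f_h(X)]$. Estimating the relative KL by a cross-validated sample mean, one has the KL cross-validation score,

\[ V(h) = -\frac{1}{n} \sum_{i=1}^n \log f_h^{[i]}(X_i), \tag{7} \]

where $f_h^{[i]}(x)$ is based on the $(n-1)$ samples excluding $X_i$. It was shown by Peter Hall (1987) that if the tails of the kernel $K(x)$ are no thinner than the tails of $f(x)$, then

\[ \frac{L(h^*)}{L(h_v)} \to_p 1, \tag{8} \]

where $h^*$ and $h_v$ minimize $L(h)$ and $V(h)$, respectively.

Parallel to (1), one may assume a finite support $\mathcal{X}$ for $f(x)$ and employ the logistic density transform $f(x) = e^{\eta(x)} / \int_{\mathcal{X}} e^{\eta}$, and estimate $\eta(x)$ by the minimizer $\eta_\lambda(x)$ of the penalized likelihood functional,

\[ -\frac{1}{n} \sum_{i=1}^n \Bigl\{ \eta(X_i) - \log \int_{\mathcal{X}} e^{\eta(x)} \Bigr\} + \frac{\lambda}{2} \int (\eta^{(m)}(x))^2 dx. \tag{9} \]

The Kullback-Leibler distance is now

\[ L(\lambda) = E_f[\log\{f(X)/f_\lambda(X)\}] = E_f[\eta(X) - \eta_\lambda(X)] + \Bigl\{ \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - \log \int_{\mathcal{X}} e^{\eta(x)} \Bigr\}, \tag{10} \]

with a relative KL discrepancy

\[ \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - E_f[\eta_\lambda(X)], \]

where the first term can be computed and the second term can be estimated through a cross-validated sample mean, $\frac{1}{n} \sum_{i=1}^n \eta_\lambda^{[i]}(X_i)$, with $\eta_\lambda^{[i]}(x)$ minimizing the delete-one version of (9),

\[ -\frac{1}{n-1} \sum_{j \neq i} \Bigl\{ \eta(X_j) - \log \int_{\mathcal{X}} e^{\eta(x)} \Bigr\} + \frac{\lambda}{2} \int (\eta^{(m)}(x))^2 dx. \]

This yields a cross-validation score

\[ V(\lambda) = \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - \frac{1}{n} \sum_{i=1}^n \eta_\lambda^{[i]}(X_i). \tag{11} \]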
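For the kernel estimate, the KL cross-validation score (7) is easy to compute directly. The sketch below assumes a Gaussian kernel and simulated standard normal data, neither of which is specified in the notes; `kl_cv_score` and `select_bandwidth` are hypothetical names.

```python
import numpy as np

def kl_cv_score(x, h):
    """KL cross-validation score (7): -(1/n) sum_i log f_h^{[i]}(X_i),
    using a Gaussian kernel (an assumption of this sketch)."""
    n = len(x)
    u = (x[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    np.fill_diagonal(K, 0.0)                  # exclude X_i from its own estimate
    f_loo = K.sum(axis=1) / ((n - 1) * h)     # delete-one density at each X_i
    return -np.mean(np.log(f_loo))

def select_bandwidth(x, grid):
    """Return the bandwidth on `grid` minimizing the score (7)."""
    scores = np.array([kl_cv_score(x, h) for h in grid])
    return grid[int(np.argmin(scores))]

# Hypothetical sample and bandwidth grid.
rng = np.random.default_rng(0)
sample = rng.normal(size=200)
grid = np.exp(np.linspace(np.log(0.05), np.log(2.0), 40))
print(select_bandwidth(sample, grid))
```

Setting the diagonal of the kernel matrix to zero is what makes the score a delete-one quantity; without it the score would decrease without bound as $h \to 0$.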

While empirical results strongly suggest optimality similar to those established by Li (1986) and Hall (1987), attempts at the theoretical analysis have not been successful. It can be shown that the symmetrized KL loss,

\[ E_f[\log\{f(X)/f_\lambda(X)\}] - E_{f_\lambda}[\log\{f(X)/f_\lambda(X)\}], \]

which roughly doubles the KL loss $L(\lambda)$ of (10), is of the order $O_p(\lambda + n^{-1} \lambda^{-1/2m})$, so the minimum KL loss is of order $O_p(n^{-2m/(1+2m)}) = o_p(n^{-1/2})$. On the other hand, the estimation of $E_f[g(X)]$ through the sample mean is at best of the order $O_p(n^{-1/2})$. One seems to need more delicate term grouping for any success at a theoretical analysis of $V(\lambda)$.

1.5 Cross-validation for non-Gaussian regression and hazard estimation

For non-Gaussian regression and hazard estimation, scores similar to (11) were derived following similar lines. The empirical performances of these scores suggest optimality properties similar to (6) and (8), but theoretical analysis is lacking.

1.6 Cross-validation for regression with correlated errors

For regression with correlated data, two scenarios have been treated in two dissertations by my former students, Ping Ma and Chun Han, respectively.

The first thesis was by Ping Ma, which concerns mixed-effect/variance-component models of the form

\[ Y = \eta(x) + z^T b + \epsilon, \]

where $b \sim N(0, B)$ and $\epsilon$ i.i.d. $N(0, \sigma^2)$, with the dimension $p$ of $b$ much smaller than $n$ and its variance-covariance matrix partly or entirely unknown; the error variance-covariance matrices are low-rank modifications of $\sigma^2 I$. To estimate $\eta(x)$, one may minimize

\[ \frac{1}{n} \sum_{i=1}^n (Y_i - \eta(X_i) - z_i^T b)^2 + \lambda \int (\eta^{(m)}(x))^2 dx + \frac{1}{n} b^T \Sigma b, \]

where $(\lambda, \Sigma)$ are tuning parameters; $\Sigma$ should reflect the structure of $B^{-1}$. The joint estimation of $(\eta(x), b)$ yields $\hat{Y} = \hat\eta(x) + z^T \hat{b}$, and one still has an expression $\hat{Y} = A(\lambda, \Sigma) Y$. The selection of $(\lambda, \Sigma)$ can be done using $C_L$ and GCV, and we were able to establish the optimality of such practice similar to (6), with the losses

\[ L_1(\lambda, \Sigma) = \frac{1}{n} \sum_{i=1}^n (\hat{Y}_i - \eta(X_i) - z_i^T b)^2 \quad \text{and} \quad L_2(\lambda, \Sigma) = \frac{1}{n} (\hat\eta - \eta)^T P_Z^{\perp} (\hat\eta - \eta), \]

where $\eta^T = (\eta(X_1), \dots, \eta(X_n))$, $P_Z^{\perp} = I - Z(Z^T Z)^{-1} Z^T$, and $Z^T = (z_1, \dots, z_n)$. Note that we are not concerned with the estimation of $B$ in this exercise, but only use $\Sigma$ (or $B^{-1}$) as tuning parameters for the estimation of $\eta(x)$ or $\eta(x) + z^T b$. For the optimality with respect to $L_1$, one needs p = O( ). For $L_2$, one needs $p$ fixed. Similar practice was applied to non-Gaussian regression with satisfactory empirical performance, but theory is lacking.

The second thesis was by Chun Han, which concerns stationary time series models or mixed-effect models with the dimension of $b$ growing with $n$; the error variance-covariance matrices are no longer low-rank modifications of $\sigma^2 I$. Formally, one has

\[ Y_i = \eta(x_i) + \epsilon_i, \qquad \epsilon \sim N(0, \sigma^2 W^{-1}), \]

where $W$ is known up to a set of parameters $\gamma$. One may estimate $\eta(x)$ via the minimization of

\[ (Y - \eta)^T W (Y - \eta) + \lambda \int (\eta^{(m)}(x))^2 dx, \]

with $(\lambda, \gamma)$ as tuning parameters. For the joint selection of $(\lambda, \gamma)$, we derived the $C_L$-like score

\[ U(\lambda, \gamma) = \frac{1}{n\sigma^2} Y^T W^{1/2} (I - A)^2 W^{1/2} Y - \frac{1}{n} \log|W| + \frac{2}{n} \mathrm{trace}A, \]

where $W^{1/2} \hat{Y} = A W^{1/2} Y$, and the GCV-like score

\[ V(\lambda, \gamma) = \log\Bigl\{ \frac{1}{n} Y^T W^{1/2} (I - A)^2 W^{1/2} Y \Bigr\} - \frac{1}{n} \log|W| + \frac{2\, \mathrm{trace}A}{n - \mathrm{trace}A}. \]

Remember that $W$ depends on $\gamma$ and $A$ depends on $(\lambda, \gamma)$. Using the Kullback-Leibler loss for the joint estimation of $(\eta_0(x), \gamma_0)$ by $(\eta(x), \gamma)$,

\[
\begin{aligned}
L(\lambda, \gamma) &= E_0\Bigl[ \frac{1}{2\sigma^2} (Y - \eta)^T W (Y - \eta) - \frac{1}{2} \log|W| - \frac{1}{2\sigma^2} (Y - \eta_0)^T W_0 (Y - \eta_0) + \frac{1}{2} \log|W_0| \Bigr] \\
&= \frac{1}{2\sigma^2} (\eta_0 - \eta)^T W (\eta_0 - \eta) + \frac{1}{2} \mathrm{tr}(W W_0^{-1} - I) - \frac{1}{2} \log|W W_0^{-1}|,
\end{aligned}
\]

the optimality similar to (6) was established for $U$ and $V$, and the resulting $\hat\gamma$ is $\sqrt{n}$-consistent. A key condition for the theory is that $W^{-1} > cI$ for some $c > 0$ uniformly over the tentative $\gamma$. The theory applies to the standard stationary and invertible ARMA models with $\gamma$ in a compact set, and to a mixed-effect model with $p$ growing with $n$.

1.7 References

PRESS is in nearly every textbook on regression analysis and linear models, though I do not know exactly who first proposed it. To read about $C_p$ and $C_L$, check out the classical reference of Mallows.

Mallows, C. L. (1973). Some comments on $C_P$, Technometrics 15, 661-675.

For the motivation and derivation of GCV along with some attempt at the theoretical justification of $U(\lambda)$ and $V(\lambda)$, check out the seminal work of Craven and Wahba (1979); the risk-based theory does not really justify its practical use, however. The really relevant loss-based justification of $U(\lambda)$ and $V(\lambda)$ was by Li (1986).

Craven, P. and G. Wahba (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation, Numer. Math. 31, 377-403.

Li, K.-C. (1986). Asymptotic optimality of $C_L$ and generalized cross-validation in the ridge regression with application to spline smoothing, Ann. Statist. 14, 1101-1112.

Kernel density estimation and the associated bandwidth selection was once a business by itself. The work of Hall (1987) for the justification of the KL-based cross-validation score (7) was an important contribution, but it was taken by many as negative on the KL distance as performance measure and negative on CV as bandwidth selector, and partly inspired the development of the so-called plug-in methods for bandwidth selection; I consider this an unfortunate turn of events.

Hall, P. (1987). On Kullback-Leibler loss and density estimation, Ann. Statist. 15, 1491-1519.

The implementation and empirical performance of (11) can be found in my joint work with Jingyuan Wang and in Chapter 6 of my book. Discussions concerning non-Gaussian regression and hazard estimation are in Chapters 5 and 7 of my book.

Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation and scalable approximation, Statist. Sin. 13, 811-826.

Gu, C. (2002). Smoothing Spline ANOVA Models. New York: Springer-Verlag.

Details concerning cross-validation for regression with correlated errors are to be found in the following articles.

Gu, C. and P. Ma (2005). Optimal smoothing in nonparametric mixed-effect models, Ann. Statist. 33, 1357-1379.

Gu, C. and P. Ma (2005). Generalized nonparametric mixed-effect models: Computation and smoothing parameter selection, J. Comput. Graph. Statist. 14, 485-504.

Han, C. and C. Gu (2006). Optimal smoothing with correlated data, manuscript.

2 Problems (Real and Perceived), Concepts, and Controversies

2.1 A simple simulation setting

We will use a simple simulation setting to illustrate some of the issues to be discussed. The issues are not specific to this particular setting, as similar simulations in other settings demonstrate the same qualitative characteristics.

On $x_i = (i - .5)/n$, $i = 1, \dots, n$, we generate 100 replicates of data from

\[ Y_i = \eta(x_i) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \]

with $\eta(x) = 1 + 3\sin(2\pi x - \pi)$ and $\sigma^2 = 1$. For $\lambda$ on a fine grid of $\log_{10}\lambda = (-5)(.05)(-1)$, we calculate the minimizers of (1) with $m = 2$ for each of the replicates and evaluate various quantities associated with them. The grid was broad enough to bracket the $\lambda$ of interest for all the 100 replicates.

2.2 Undersmoothing and modifications of cross-validation

One problem suffered by cross-validation methods is undersmoothing: in up to 10% of the cases, the methods lead to very small $\lambda$ or $h$ or the like, resulting in undersmoothing or even interpolation. The problem does not seem to go away with larger $n$, at least not for $n$ up to 500.

An alternative to $V(\lambda)$ of (4) is the so-called generalized maximum likelihood (GML) method derived under the empirical Bayes interpretation of smoothing splines, which is simply the restricted maximum likelihood (REML) method for mixed-effect/variance component models. The GML score is given by

\[ M(\lambda) = \frac{n^{-1}\, Y^T (I - A(\lambda)) Y}{|I - A(\lambda)|_+^{1/(n-m)}}, \]

where $|I - A|_+$ is the product of the $n - m$ positive eigenvalues of $(I - A)$. The GML method never interpolates, but consistently undersmoothes for $\eta(x)$ super smooth.

A simple modification seems to cure the undersmoothing problem for cross-validation. For $V(\lambda)$ of (4), one may use

\[ V_\alpha(\lambda) = \frac{n^{-1}\, Y^T (I - A(\lambda))^2 Y}{\{n^{-1}\, \mathrm{trace}(I - \alpha A(\lambda))\}^2}, \tag{12} \]

where $\alpha > 1$, and for $V(\lambda)$ of (11), one may use

\[ V_\alpha(\lambda) = \log \int_{\mathcal{X}} e^{\eta_\lambda(x)} - \frac{1}{n} \sum_{i=1}^n \eta_\lambda(X_i) + \frac{\alpha}{n} \sum_{i=1}^n \{\eta_\lambda(X_i) - \eta_\lambda^{[i]}(X_i)\}. \tag{13} \]

Simulation studies suggest that an $\alpha$ in the range 1.2 to 1.4 would be the most effective.

In the setting of §2.1, one may calculate the loss $L(\lambda)$ as well as the selection scores $V_\alpha(\lambda)$ and $M(\lambda)$ on the $\lambda$ grid for all the replicates, and identify $\lambda^*$, $\lambda_v$ along with the associated $L(\lambda)$. Figure 1 summarizes some of the empirical results.

Figure 1: Left: Performances of $V_\alpha(\lambda)$ with $\alpha = 1$ (faded) and $\alpha = 1.4$ for $n = 100$ (axes: $\min L(\lambda)$ versus $L(\lambda)$ of GCV). Center: the $\lambda$ minimizing $V_1(\lambda)$ versus that minimizing $M(\lambda)$ (faded) or $V_{1.4}(\lambda)$, for $n = 100$. Right: $\min L(\lambda)/L(\hat\lambda)$ with $\hat\lambda$ minimizing $M(\lambda)$ or $V_\alpha(\lambda)$ at $\alpha = 1, 1.2, 1.4, 1.6, 1.8$, for $n = 100$ (fatter boxes) and $n = 500$ (thinner boxes).

A so-called extended exponential (EE) method was proposed by Kou and Efron (2002) to combine the strengths of $C_p$ and GML, but I am not able to follow the arguments.

2.3 Negative correlation and model indexing

One major criticism of cross-validation in the literature was the famous negative correlation between the optimal and cross-validated bandwidths, as demonstrated in the middle frame of Figure 2. The negative correlation is bothersome only when the index $\lambda$ is meaningful across-replicate, however, which will be analyzed below.
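Returning briefly to the modified score (12) of Section 2.2, the effect of $\alpha$ can be sketched on data from the setting of Section 2.1. As a loudly stated assumption, the smoothing spline is replaced here by a discrete second-difference smoother, so the $\lambda$ scale and grid below are not comparable to the grid used in the notes; the function names are hypothetical.

```python
import numpy as np

def smoother_matrix(n, lam):
    """Discrete stand-in for the m = 2 smoothing spline: (I + n*lam*D'D)^{-1}."""
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + n * lam * D.T @ D)

def gcv_alpha(y, A, alpha=1.4):
    """Modified GCV score (12); alpha = 1 recovers the GCV score (4)."""
    n = len(y)
    resid = y - A @ y
    denom = np.trace(np.eye(n) - alpha * A) / n
    return np.mean(resid ** 2) / denom ** 2

def select_lambda(y, log10_lams, alpha):
    """Grid minimizer of V_alpha; returns log10 of the selected lambda."""
    n = len(y)
    scores = [gcv_alpha(y, smoother_matrix(n, 10.0 ** l), alpha) for l in log10_lams]
    return log10_lams[int(np.argmin(scores))]

# One replicate from the simulation setting of Section 2.1 (n = 100, sigma = 1).
rng = np.random.default_rng(0)
n = 100
x = (np.arange(1, n + 1) - 0.5) / n
y = 1 + 3 * np.sin(2 * np.pi * x - np.pi) + rng.normal(size=n)
grid = np.linspace(0.0, 5.0, 41)           # grid for the discrete stand-in only
print(select_lambda(y, grid, 1.0), select_lambda(y, grid, 1.4))
```

Inflating $\mathrm{trace}A$ by $\alpha > 1$ shrinks the denominator more for rough fits than for smooth ones, so the modified score penalizes small $\lambda$ more heavily than (4) does.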

Figure 2: Left: Performances of $V_1(\lambda)$ for $n = 100$ (axes: $\min L(\lambda)$ versus $L(\lambda)$ of GCV). Center: $\log_{10}(\lambda^*)$ versus $\log_{10}(\lambda_v)$. Right: $\log_{10}(\rho^*)$ versus $\log_{10}(\rho_v)$.

Figure 3: Left and Center: The $\rho$-$\lambda$ mapping and the $(\rho^*, \lambda^*)$ pair of 100 replicates. Right: Efficacy of typical $\rho$ or $\lambda$.

Remember that unlike in the parametric settings where one selects from a discrete set of models, we are choosing from a continuum of tentative estimates $\eta_\lambda(x)$. An important issue is how to align estimates based on different data.

Mathematically, the minimizer of (1) is the solution to a constrained LS problem,

\[ \min \sum_{i=1}^n (Y_i - \eta(X_i))^2, \quad \text{s.t.} \quad \int (\eta^{(m)}(x))^2 dx \le \rho \]

for some $\rho > 0$. There is a one-to-one correspondence between $\lambda$ and $\rho$ given the data. Intuitively, $\rho$ is meaningful across-replicate as it imposes the same constraint on $\eta(x)$ in the estimation process regardless of the data one observes, whereas the same $\lambda$ implies different constraints for different observations. A simple simulation will confirm that this is indeed the case, that $\eta_\lambda(x)$ based on different data should be aligned by the corresponding $\rho$, but we first observe in the right frame of Figure 2 that the negative correlation disappears on the $\rho$ scale.

In the setting of §2.1, one may calculate $\rho = \int_0^1 (\eta^{(2)}(x))^2 dx$ for all estimates $\eta_\lambda(x)$ on the $\lambda$ grid, for all replicates, and for the true function $\eta(x) = 1 + 3\sin(2\pi x - \pi)$; the true function has $\rho = 10^{3.846}$. In the left and middle frames of Figure 3, we demonstrate the $\rho$-$\lambda$ mapping for 100 replicates of samples of size $n = 100$ and identify the 100 optimal $(\rho^*, \lambda^*)$ on the grid; it is reassuring to see that $10^{3.846}$ is in the middle of the $\rho^*$'s. To assess the across-replicate interpretability of $\rho$ and $\lambda$, we pick a typical optimal $\tilde\rho = 10^{3.846}$ and a typical optimal $\tilde\lambda = \mathrm{median}(\lambda^*)$, and calculate their efficacy over the replicates via $L(\rho^*)/L(\tilde\rho)$ and $L(\lambda^*)/L(\tilde\lambda)$, which is shown in the right frame of Figure 3. The tight spread of $\rho$'s confirms the intuition that $\rho$ is the proper model index here.

Besides smoothing splines, there does not seem to exist a $\rho$-index for other smoothing methods. Nevertheless, model indexing remains an important issue, which has subtle implications in the theory and practice of bandwidth selection. Even for smoothing splines, the $\rho$-index is difficult to work with, and the identification of it as the proper model index does not offer any operational help. It does, however, help to identify some popular but questionable concepts and practices.

One misleading concept is the degree-of-freedom in regression, which is defined as $\mathrm{trace}A$: given the $x_i$'s, $\lambda$-$\mathrm{trace}A$ is a one-to-one mapping, so the selection of $\lambda$ through $\mathrm{trace}A$ simply leaves things in the hands of the random noise $\epsilon_i$. In classical parametric statistics, the degree-of-freedom is defined in terms of model dimension, and it is relevant in settings other than regression. The coincidence that the trace of the hat matrix in linear regression matches the model dimension does not automatically qualify the trace as a valid model index. In fact, $\mathrm{trace}A$ is much worse than $\lambda$, as it allows one to compare $A(\lambda)$ with $A(h)$ while they are not comparable.

Resampling is widely used in many phases of statistical analysis. Working with an index such as $\lambda$ that is not meaningful across-replicate, however, one should avoid using resampling for bandwidth selection.

Besides the negative correlation, another criticism against cross-validation is that the cross-validated bandwidth converges very slowly asymptotically. Two aspects of this argument need close scrutiny: i) if the target of convergence is something like a fixed $\lambda$, then it is no worry because the optimal $\lambda^*$ changes with data and $\lambda_v$ may simply be chasing $\lambda^*$.
ii) The bandwidth is not part of the stochastic setting but a tuning parameter in the estimation process, and its meaning is only through $L(\lambda)$, so it's okay for $\lambda_v$ to be far from $\lambda^*$ as long as $L(\lambda^*)/L(\lambda_v) \to 1$.

2.4 Loss versus risk

Risk calculation is a basic exercise in the theoretical analysis of statistical procedures, but when done using a model index not interpretable across-replicate, the proper use of it can be tricky.

Imagine that you are studying the MSE performance of LS regression empirically, and you have 3 models $M_1 = \{\mu = E[Y] = \beta_0 + \beta_1 x\}$, $M_2 = \{\mu = \beta_0 + \beta_1 x + \beta_2 x^2\}$, and $M_3 = \{\mu = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3\}$. After generating data, you calculate $\hat\mu^{(j)}$, $j = 1, 2, 3$, under $M_j$ and calculate the MSE $\frac{1}{n} \sum_{i=1}^n (\hat\mu_i^{(j)} - \mu_i)^2$, and you average over replicates the MSEs corresponding to the same $\hat\mu^{(j)}$. You'd be crazy to take $\hat\mu^{(1)}$ from the first 20 replicates to average with $\hat\mu^{(2)}$ from the next 40 replicates.

While to a much lesser extent, the calculation of $R(\lambda) = E[L(\lambda)]$ is like mixing $\hat\mu^{(1)}$'s from some replicates with $\hat\mu^{(2)}$'s from other replicates in the above example. You may notice that I never call $\eta_\lambda(x)$ an estimator but only an estimate given the data. Many would call the minimizer of $R(\lambda)$ optimal, but since I never observe an average sample, I consider a risk-optimal $\lambda$ meaningless, and do not care about any convergence towards it.

While derivative concepts based on $R(\lambda)$ have no meaning in my books, $R(\lambda)$ serves as an important analytical device. For example, to prove $V(\lambda) - L(\lambda) - n^{-1} \epsilon^T \epsilon = o_p(L(\lambda))$, one proceeds by showing that $V(\lambda) - L(\lambda) - n^{-1} \epsilon^T \epsilon = o_p(R(\lambda))$ and that $L(\lambda) - R(\lambda) = o_p(R(\lambda))$; to establish $L(\lambda) = O_p(\lambda + n^{-1} \lambda^{-1/2m})$, one simply shows that $R(\lambda) = O(\lambda + n^{-1} \lambda^{-1/2m})$.
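The distinction between the per-replicate loss $L(\lambda)$ and the risk $R(\lambda) = E[L(\lambda)]$ can be made concrete with a small simulation. This is a sketch only: it uses the data-generating setting of Section 2.1 but, as an assumption, a discrete second-difference smoother in place of the smoothing spline, with its own hypothetical $\lambda$ grid and function names.

```python
import numpy as np

def smoother_matrix(n, lam):
    """Discrete stand-in for the m = 2 smoothing spline: (I + n*lam*D'D)^{-1}."""
    D = np.diff(np.eye(n), n=2, axis=0)
    return np.linalg.inv(np.eye(n) + n * lam * D.T @ D)

def loss_table(n=100, reps=100, log10_lams=np.linspace(0.0, 5.0, 21), seed=0):
    """L[r, k] = loss L(lambda_k) of replicate r in the setting of Section 2.1."""
    rng = np.random.default_rng(seed)
    x = (np.arange(1, n + 1) - 0.5) / n
    eta = 1 + 3 * np.sin(2 * np.pi * x - np.pi)
    smoothers = [smoother_matrix(n, 10.0 ** l) for l in log10_lams]
    L = np.empty((reps, len(log10_lams)))
    for r in range(reps):
        y = eta + rng.normal(size=n)              # sigma^2 = 1
        for k, A in enumerate(smoothers):
            L[r, k] = np.mean((A @ y - eta) ** 2)
    return L

def efficacy_of_risk_optimal(L):
    """min_lambda L_r(lambda) / L_r(lambda_R), with lambda_R the grid
    minimizer of the Monte Carlo risk R(lambda) = average of L over replicates."""
    k_risk = int(np.argmin(L.mean(axis=0)))
    return L.min(axis=1) / L[:, k_risk]

L = loss_table()
print(np.unique(np.argmin(L, axis=1)))            # loss-optimal grid indices vary
print(np.quantile(efficacy_of_risk_optimal(L), [0.05, 0.5, 0.95]))
```

The loss-optimal index moves from replicate to replicate, whereas the risk minimizer is a single fixed $\lambda$; the efficacy ratio shows how much is lost on each replicate by using that fixed value.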

Risk calculation using the $\rho$-index would be conceptually meaningful, if only it were possible.

Although no one questions the appeal of the loss-optimal bandwidth and most do simulations in terms of it (negative correlation would not be there otherwise), many authors claim that the data do not contain enough information for one to pursue the loss-optimal bandwidth. In view of the results by Li (1986) and Hall (1987), such claims are misleading.

2.5 References

For the derivation of GML and its asymptotic analysis, check Wahba (1985). The GML always picks $\lambda \asymp n^{-2m/(2m+1)}$, but for $\eta(x)$ super smooth the optimal one is $\lambda \asymp n^{-2m/(4m+1)}$.

Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem, Ann. Statist. 13, 1378-1402.

The modifications of cross-validation in (12) and (13) and the empirical performances can be found in the articles based on the theses of Jingyuan Wang and Young-Ju Kim.

Gu, C. and J. Wang (2003). Penalized likelihood density estimation: Direct cross-validation and scalable approximation, Statist. Sin. 13, 811-826.

Kim, Y.-J. and C. Gu (2004). Smoothing spline Gaussian regression: More scalable computation via efficient approximation, J. Roy. Statist. Soc. Ser. B 66, 337-356.

Details of the EE method can be found in the reference below.

Kou, S. and B. Efron (2002). Smoothers and the $C_p$, GML and EE criteria: A geometric approach, J. Amer. Statist. Assoc. 97, 766-782.

The negative correlation between the cross-validated bandwidth and the optimal bandwidth was observed by many, and was publicized by Scott and Terrell (1987) and Hall and Johnstone (1992), though all took it at face value and tried to fix it, which is unnecessary.

Scott, D. W. and G. R. Terrell (1987). Biased and unbiased cross-validation in density estimation, J. Amer. Statist. Assoc. 82, 1131-1146.

Hall, P. and I. Johnstone (1992). Empirical functionals and efficient smoothing parameter selection (with discussion), J. Roy. Statist. Soc. Ser. B 54, 475-530.

Detailed discussion of model indexing and the ramifications in bandwidth selection can be found in the following reference.

Gu, C. (1998). Model indexing and smoothing parameter selection in nonparametric function estimation (with discussion), Statist. Sin. 8, 607-646.