STATISTICS 593C: Spring, Model Selection and Regularization


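STATISTICS 593C: Spring, 2007
Model Selection and Regularization
Jon A. Wellner

Lecture 2 (March 29): General Notation and Some Examples

Here is some notation and terminology that I will try to use (more or less) systematically throughout the course. Suppose that we have:

Data: $\xi^{(n)} \sim P_{n,\theta} = P_\theta$ for $\theta \in \Theta$; $\xi^{(n)} \in \Xi$.
Parameter space: $\Theta$. Sometimes $\Theta \subset \mathbb{R}^k$ for some $k$; often $\Theta$ is some (large) collection of functions.
A sieve (or finite-dimensional subsets of $\Theta$): $\{\Theta_m : m \in \mathcal{M}\}$ where $\Theta_m \subset \Theta$ with $\dim(\Theta_m) = D_m < \infty$.
A collection of models: $\{\mathcal{P}_m : m \in \mathcal{M}\}$, with $\mathcal{P}_m = \{P_\theta : \theta \in \Theta_m\}$.
An empirical contrast function: $\gamma_n : \Theta \times \Xi \to \mathbb{R}$, $\gamma_n(\theta, \xi^{(n)}) \equiv \gamma_n(\theta)$, $\theta \in \Theta$.
Empirical contrast estimators: $\hat\theta_{n,m} = \operatorname{argmin}_{\theta \in \Theta_m} \gamma_n(\theta)$ for $m \in \mathcal{M}$.
Risk functions: $R_{n,m}(\theta, \hat\theta_{n,m}) = E_\theta[d^2(\theta, \hat\theta_{n,m})]$ = risk of $\hat\theta_{n,m}$ at $\theta$ for $m \in \mathcal{M}$ and $\theta \in \Theta$.

Here are several examples:

Example 1. (density estimation) Suppose that $\xi^{(n)} = (X_1, \ldots, X_n)$ where the $X_i$ are i.i.d. $p_\theta$ with respect to a dominating measure $\mu$. Here are two common choices for a contrast function $\gamma$:

Maximum likelihood: $\gamma_n(\theta) = \mathbb{P}_n \gamma(X, \theta) = n^{-1} \sum_{i=1}^n \gamma(X_i, \theta)$ with $\gamma(x, \theta) = -\log p_\theta(x)$. Then $\hat\theta_{n,m} = \operatorname{argmin}_{\theta \in \Theta_m} \gamma_n(\theta)$ is the maximum likelihood estimator of $\theta$ over $\Theta_m$.

Least squares: In this case we take $\gamma(x, \theta) = \int p_\theta^2 \, d\mu - 2 p_\theta(x)$, so that
\[
\gamma_n(\theta) = \mathbb{P}_n \gamma(X, \theta) = \int p_\theta^2 \, d\mu - 2 \int p_\theta \, d\mathbb{P}_n = \int (p_\theta - p_n)^2 \, d\mu - \int p_n^2 \, d\mu
\]
if $p_n = d\mathbb{P}_n / d\mu$ exists.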
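To make the two contrasts of Example 1 concrete, here is a minimal sketch under assumptions that are not in the notes: the sieve is taken to be piecewise-constant densities on $m$ equal-width bins of $[0,1]$, the sample is simulated from a Beta(2, 5) law, and all function names are hypothetical. On this sieve the histogram density minimizes both empirical contrasts, so moving mass between two bins should increase each contrast.

```python
import math
import random

def histogram_heights(xs, m):
    """Heights of the histogram density on m equal-width bins of [0, 1]."""
    counts = [0] * m
    for x in xs:
        counts[min(int(x * m), m - 1)] += 1
    n = len(xs)
    return [m * c / n for c in counts]  # height = (fraction in bin) / (bin width)

def density_at(heights, x):
    """Value at x of the piecewise-constant density with the given heights."""
    m = len(heights)
    return heights[min(int(x * m), m - 1)]

def ls_contrast(heights, xs):
    """Least squares contrast: int p^2 dmu - (2/n) sum_i p(X_i)."""
    m = len(heights)
    integral_p2 = sum(h * h for h in heights) / m
    return integral_p2 - 2.0 * sum(density_at(heights, x) for x in xs) / len(xs)

def ml_contrast(heights, xs):
    """Maximum likelihood contrast: -(1/n) sum_i log p(X_i)."""
    return -sum(math.log(density_at(heights, x)) for x in xs) / len(xs)

random.seed(0)
xs = [random.betavariate(2, 5) for _ in range(500)]  # assumed sample, illustration only
m = 8
p_hat = histogram_heights(xs, m)

# A competitor in the same sieve: shift a little mass from bin 1 to bin 0.
p_alt = list(p_hat)
p_alt[0] += 0.1
p_alt[1] -= 0.1

print("LS contrast:", ls_contrast(p_hat, xs), "vs", ls_contrast(p_alt, xs))
print("ML contrast:", ml_contrast(p_hat, xs), "vs", ml_contrast(p_alt, xs))
```

For the histogram estimator the least squares contrast reduces to $-\int \hat p^2 \, d\mu$, which matches the identity $\gamma_n(\theta) = \int (p_\theta - p_n)^2 d\mu - \int p_n^2 d\mu$ when $p_n$ is read as the histogram itself.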

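Example 2. (regression) In this case $\xi^{(n)} = (\xi_1, \ldots, \xi_n)$ where $\xi_i = (X_i, Y_i)$ with
\[
Y_i = \theta(X_i) + \epsilon_i, \qquad 1 \le i \le n,
\]
the $X_i \sim G_i$ are independent, and the $\epsilon_i$ are independent with $E(\epsilon_i \mid X_i) = 0$ for $1 \le i \le n$. If $\mu \equiv \bar{G} = n^{-1} \sum_{i=1}^n G_i$, then for $\theta \in L_2(\mu)$ set
\[
\gamma(\xi, \theta) = \gamma((x, y), \theta) = (y - \theta(x))^2 .
\]
Then $\hat\theta_{n,m} = \operatorname{argmin}_{\theta \in \Theta_m} \gamma_n(\theta)$ is the least squares estimator of $\theta$ over $\Theta_m$.

Example 3. (binary classification) In this case $\xi^{(n)} = (\xi_1, \ldots, \xi_n)$ where $\xi_i = (X_i, Y_i)$ with $Y_i \in \{0, 1\}$, and
\[
\gamma(\xi, \theta) = \gamma((x, y), \theta) = (y - f_\theta(x))^2 ,
\]
where, as in Example 2, $\theta(x) \equiv E(Y \mid X = x)$, and $f_\theta(x) = 1\{\theta(x) \ge 1/2\}$.

Example 4. (Gaussian white noise model) Let $\xi^{(n)}$ be the process on $[0,1]^d$ defined by
\[
d\xi^{(n)}(x) = \theta(x)\,dx + \frac{\sigma}{\sqrt{n}}\, dW(x),
\]
where $W$ is a Brownian sheet on $[0,1]^d$ (i.e. a mean zero Gaussian process with $E\,W(x)W(y) = x \wedge y \equiv \prod_{j=1}^d (x_j \wedge y_j)$). For $\theta \in L_2([0,1]^d, \lambda)$, define
\[
\gamma_n(\theta) = \|\theta\|^2 - 2 \int \theta(x)\, d\xi^{(n)}(x).
\]
Then $\hat\theta_{n,m} = \operatorname{argmin}_{\theta \in \Theta_m} \gamma_n(\theta)$ is the MLE of $\theta$ over $\Theta_m$ and is also the least squares estimator of $\theta$ over $\Theta_m$.

Here is a natural choice of $d^2$ for each of these problems:
\[
d^2(\theta, \theta') = E_\theta \gamma_n(\theta') - E_\theta \gamma_n(\theta).
\]
In Example 1 with the maximum likelihood empirical contrast function,
\[
d^2(\theta, \theta') = \int p_\theta \log(p_\theta / p_{\theta'})\, d\mu = K(P_\theta, P_{\theta'}).
\]
For the least squares empirical contrast function,
\[
d^2(\theta, \theta') = \int (p_\theta - p_{\theta'})^2 \, d\mu = \|p_\theta - p_{\theta'}\|^2_{L_2(\mu)}.
\]
In Example 2,
\[
d^2(\theta, \theta') = \|\theta - \theta'\|^2 \equiv \|\theta - \theta'\|^2_{L_2(\mu)}.
\]
In Example 3, we find, after some computation,
\[
d^2(\theta, \theta') = E_\theta \{ |2\theta(X) - 1| \, |f_\theta(X) - f_{\theta'}(X)| \}.
\]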
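The "some computation" behind the Example 3 formula can be checked numerically. The sketch below assumes a covariate $X$ with finitely many support points, so that both sides of the identity are exact finite sums; the support weights, the two regression functions, and the function names are all hypothetical. It compares $E_\theta \gamma(\xi, \theta') - E_\theta \gamma(\xi, \theta)$, computed directly, with $E_\theta\{|2\theta(X) - 1|\,|f_\theta(X) - f_{\theta'}(X)|\}$.

```python
def bayes_classifier(theta):
    """f_theta(x) = 1{theta(x) >= 1/2}, evaluated at each support point."""
    return [1 if t >= 0.5 else 0 for t in theta]

def expected_contrast(theta_true, f, weights):
    """E (Y - f(X))^2 when P(X = x_k) = weights[k] and P(Y = 1 | X = x_k) = theta_true[k]."""
    return sum(w * (t * (1 - fx) ** 2 + (1 - t) * fx ** 2)
               for w, t, fx in zip(weights, theta_true, f))

def d2_direct(theta, theta_alt, weights):
    """d^2(theta, theta') as a difference of expected contrasts."""
    return (expected_contrast(theta, bayes_classifier(theta_alt), weights)
            - expected_contrast(theta, bayes_classifier(theta), weights))

def d2_formula(theta, theta_alt, weights):
    """d^2(theta, theta') via E{ |2 theta(X) - 1| |f_theta(X) - f_theta'(X)| }."""
    f = bayes_classifier(theta)
    f_alt = bayes_classifier(theta_alt)
    return sum(w * abs(2 * t - 1) * abs(a - b)
               for w, t, a, b in zip(weights, theta, f, f_alt))

# Hypothetical four-point covariate distribution and two regression functions.
weights = [0.1, 0.2, 0.3, 0.4]
theta = [0.9, 0.6, 0.3, 0.5]
theta_alt = [0.8, 0.2, 0.7, 0.1]

print(d2_direct(theta, theta_alt, weights), d2_formula(theta, theta_alt, weights))
```

The point with $\theta(x) = 1/2$ contributes nothing to $d^2$ even though the two classifiers disagree there, which is the role of the factor $|2\theta(X) - 1|$.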

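In Example 4,
\[
d^2(\theta, \theta') = \|\theta - \theta'\|^2_{L_2(\lambda)}.
\]

Now we continue with some computations for the white noise model, Example 4. Suppose that $\Theta_m = [\phi_1, \ldots, \phi_D]$ = linear span of $\phi_1, \ldots, \phi_D$ in $L_2([0,1]^d, \lambda)$, where $D = D_m$ and $\{\phi_j\}_{j=1}^\infty$ is an orthonormal basis for $L_2([0,1]^d)$. Then the least squares estimator $\hat\theta_{n,m}$ is given by
\[
\hat\theta_{n,m}(x) = \sum_{j=1}^D \left( \int_{[0,1]^d} \phi_j \, d\xi^{(n)} \right) \phi_j(x);
\]
note that
\[
\int_{[0,1]^d} \phi_j \, d\xi^{(n)} = \int_{[0,1]^d} \phi_j \theta \, d\lambda + \frac{\sigma}{\sqrt{n}} Z_j, \qquad j \in \{1, \ldots, D\},
\]
where the $Z_j = \int \phi_j \, dW$ are i.i.d. $N(0, 1)$. Note that this can be re-written as a Gaussian sequence model:
\[
Y_j = \mu_j + \epsilon_j, \qquad j \in \{1, \ldots, D\},
\]
where $Y_j = \int \phi_j \, d\xi^{(n)}$, $\mu_j = \int \phi_j \theta \, d\lambda$, and $\epsilon_j = \sigma Z_j / \sqrt{n} \sim N(0, \sigma^2/n)$, $j = 1, \ldots, D$. Thus
\[
\theta - \hat\theta_{n,m} = \sum_{j=D+1}^\infty \left( \int \phi_j \theta \, d\lambda \right) \phi_j - \frac{\sigma}{\sqrt{n}} \sum_{j=1}^D \phi_j \int \phi_j \, dW,
\]
so
\[
\|\theta - \hat\theta_{n,m}\|^2 = \sum_{j=D+1}^\infty \left( \int \phi_j \theta \, d\lambda \right)^2 + \frac{\sigma^2}{n} \sum_{j=1}^D Z_j^2,
\]
and hence
\[
R(\theta, \hat\theta_{n,m}) = E_\theta \{ \|\theta - \hat\theta_{n,m}\|^2 \}
= \|\theta - \Pi(\theta \mid \Theta_m)\|^2 + \frac{\sigma^2 D}{n}
= \min_{\theta' \in \Theta_m} \|\theta - \theta'\|^2 + \frac{\sigma^2 D}{n}
\equiv \|\theta - \theta_m\|^2 + \frac{\sigma^2 D}{n}.
\]
This is a classical formula involving a bias versus variance trade-off via the choice of $D$: increasing $D$ leads to smaller bias but larger variance.

Model selection via penalization: Consider
\[
\gamma_n(\hat\theta_{n,m}) + \mathrm{pen}(m), \qquad m \in \mathcal{M}. \tag{1}
\]
For model selection via Mallows' $C_p$ or AIC, $\mathrm{pen}(m) = 2 D_m \sigma^2 / n$, assuming that $\sigma^2$ is known. (If $\sigma^2$ is unknown, then we should estimate it using a low-bias model.) Then we choose $\hat m \in \mathcal{M}$ to minimize the penalized contrast function in (1): i.e.
\[
\hat m = \operatorname{argmin}_{m \in \mathcal{M}} \{ \gamma_n(\hat\theta_{n,m}) + \mathrm{pen}(m) \},
\]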

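and then $\hat\Theta \equiv \Theta_{\hat m}$ and $\hat\theta \equiv \hat\theta_{n,\hat m}$.

Heuristics for Mallows' $C_p$ penalty: One way to proceed: an ideal model $m^*$ should minimize the quadratic risk we calculated above:
\[
m^* = \operatorname{argmin}_{m \in \mathcal{M}} \left\{ \|\theta - \theta_m\|^2 + \frac{\sigma^2 D_m}{n} \right\}
= \operatorname{argmin}_{m \in \mathcal{M}} \left\{ \|\theta\|^2 - \|\theta_m\|^2 + \frac{\sigma^2 D_m}{n} \right\},
\]
or, equivalently, minimize
\[
-\|\theta_m\|^2 + \frac{\sigma^2 D_m}{n}.
\]
This depends on the true $\theta$ through $\theta_m = \Pi(\theta \mid \Theta_m)$. But we can estimate $\|\theta_m\|^2$ by its natural unbiased estimator, namely $\|\hat\theta_{n,m}\|^2 - \sigma^2 D_m / n$. Thus we can choose $\hat m$ by minimizing
\[
-\|\hat\theta_{n,m}\|^2 + \frac{2 D_m \sigma^2}{n}. \tag{2}
\]
This is (very nearly) equivalent to minimizing
\[
\gamma_n(\hat\theta_{n,m}) + \frac{2 \sigma^2 D_m}{n}. \tag{3}
\]

General version of heuristics: The following is from the introduction of Barron, Birgé, and Massart (1999). An ideal model might be taken to be one that minimizes $R_{n,m}(\theta, \hat\theta_{n,m})$ over $m \in \mathcal{M}$. (However, even if $\theta \in \Theta_m$ is true, this true model may be far from the ideal model.) Since $\theta$ is unknown, we cannot determine such an ideal model exactly.

Goal 1: Find $\hat m \in \mathcal{M}$ based on the data, such that
\[
R_{n,\hat m}(\theta, \hat\theta_{n,\hat m}) \approx \inf_{m \in \mathcal{M}} R_{n,m}(\theta, \hat\theta_{n,m}) \equiv \text{minimal risk}.
\]
Unfortunately, Goal 1 is too hard in most problems. Instead, consider replacing the target of minimal risk by some appropriate accuracy index
\[
a_n(\theta) = \inf_{m \in \mathcal{M}} \{ d^2(\theta, \Theta_m) + \mathrm{pen}_n(m) \}
= \inf_{m \in \mathcal{M}} \left\{ \inf_{\theta' \in \Theta_m} d^2(\theta, \theta') + \mathrm{pen}_n(m) \right\}
\approx \text{minimal risk}.
\]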
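The bias versus variance formula and the Mallows $C_p$ heuristic can be illustrated in the Gaussian sequence model $Y_j = \mu_j + \sigma Z_j/\sqrt{n}$. This is only a sketch under assumed settings: nested models $\Theta_D = [\phi_1, \ldots, \phi_D]$, a true $\theta$ with exactly $J = 50$ nonzero coefficients $\mu_j = j^{-2}$, known $\sigma = 1$, and $n = 100$; none of these values or function names come from the notes. It computes the risk-minimizing dimension from $\|\theta - \theta_m\|^2 + \sigma^2 D_m/n$ and the data-driven choice from criterion (2).

```python
import math
import random

def oracle_risk(mu, sigma, n, D):
    """Squared bias plus variance: sum_{j > D} mu_j^2 + sigma^2 * D / n."""
    return sum(v * v for v in mu[D:]) + sigma ** 2 * D / n

def cp_criterion(y, sigma, n, D):
    """Criterion (2): -||theta_hat_{n,m}||^2 + 2 * D * sigma^2 / n."""
    return -sum(v * v for v in y[:D]) + 2.0 * D * sigma ** 2 / n

random.seed(1)
J, n, sigma = 50, 100, 1.0                      # assumed illustration values
mu = [1.0 / j ** 2 for j in range(1, J + 1)]    # assumed true coefficients
y = [m_j + sigma / math.sqrt(n) * random.gauss(0.0, 1.0) for m_j in mu]

dims = range(1, J + 1)
oracle_dim = min(dims, key=lambda D: oracle_risk(mu, sigma, n, D))
cp_dim = min(dims, key=lambda D: cp_criterion(y, sigma, n, D))

# Realized loss ||theta - theta_hat||^2 of the projection estimator at the selected dimension.
loss_at_cp = (sum((a - b) ** 2 for a, b in zip(mu[:cp_dim], y[:cp_dim]))
              + sum(v * v for v in mu[cp_dim:]))

print("risk-minimizing dimension:", oracle_dim)
print("C_p-selected dimension:", cp_dim, "with loss", loss_at_cp)
```

In this setup the risk changes by $\sigma^2/n - \mu_D^2$ when the dimension goes from $D-1$ to $D$, so the ideal model keeps exactly the coordinates with $\mu_D^2 > \sigma^2/n$; the $C_p$ choice depends on the simulated noise and need not coincide with it.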

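Goal 2: Find $\hat m \in \mathcal{M}$ based on the data, such that
\[
E_\theta d^2(\theta, \hat\theta_{n,\hat m}) \le C(\theta)\, a_n(\theta) \qquad \text{for all } n.
\]
One way to do this: choose
\[
\hat m = \operatorname{argmin}_{m \in \mathcal{M}} \{ \gamma_n(\hat\theta_{n,m}) + \mathrm{pen}_n(m) \};
\]
typically
\[
\mathrm{pen}_n(m) =
\begin{cases}
\kappa L_m D_m / n, & L_m = \text{weights}, \\
K_m(\xi^{(n)}) L_m D_m / n, & L_m = \text{weights},\ K_m \text{ a function of the data},
\end{cases}
\]
where
\[
\sum_{m \in \mathcal{M}} \exp(-L_m D_m) \le 1.
\]
Then the results of Birgé and Massart (1997), (2001) and Barron, Birgé and Massart (1999) are of the form
\[
E_\theta d^2(\theta, \hat\theta_{n,\hat m}) \le C(\theta) \inf_{m \in \mathcal{M}} \left\{ d^2(\theta, \Theta_m) + \frac{\kappa L_m D_m}{n} \right\}.
\]
We will return to results of this type later in the quarter.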
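As a small illustration of the weight condition and of the accuracy index with $\mathrm{pen}_n(m) = \kappa L_m D_m / n$, the sketch below uses assumed values throughout: nested models with $D_m = m$ for $m = 1, \ldots, 50$, constant weights $L_m = 1$, $\kappa = 2$, $n = 100$, and true coefficients $\mu_j = j^{-2}$ as in a Gaussian sequence model. The function names are hypothetical, and $\kappa = 2$ is an arbitrary choice rather than a constant taken from the cited papers.

```python
import math

def weights_admissible(L, dims):
    """Check the condition sum_m exp(-L_m * D_m) <= 1."""
    return sum(math.exp(-L_m * D_m) for L_m, D_m in zip(L, dims)) <= 1.0

def accuracy_index(mu, dims, L, kappa, n):
    """Return (a_n(theta), minimizing D_m) for inf_m { d^2(theta, Theta_m) + kappa * L_m * D_m / n }.

    Here d^2(theta, Theta_m) = sum_{j > D_m} mu_j^2 (nested coordinate models).
    """
    best_value, best_dim = None, None
    for L_m, D_m in zip(L, dims):
        value = sum(v * v for v in mu[D_m:]) + kappa * L_m * D_m / n
        if best_value is None or value < best_value:
            best_value, best_dim = value, D_m
    return best_value, best_dim

dims = list(range(1, 51))                    # assumed nested family, D_m = m
L = [1.0] * len(dims)                        # assumed constant weights
mu = [1.0 / j ** 2 for j in range(1, 51)]    # assumed true coefficients
kappa, n = 2.0, 100                          # assumed illustration values

a_n, best_dim = accuracy_index(mu, dims, L, kappa, n)
print("weights admissible:", weights_admissible(L, dims))
print("accuracy index:", a_n, "attained at dimension", best_dim)
```

With constant weights on a nested family the sum is a geometric series, so $L_m = 1$ satisfies the condition while a smaller constant such as $L_m = 0.5$ does not; richer families with many models per dimension generally need larger $L_m$.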
