2 Collaborators: Felix Abramovich, Yoav Benjamini, David Donoho, Noureddine El Karoui, Peter Forrester, Gérard Kerkyacharian, Debashis Paul, Dominique Picard, Bernard Silverman
Thanks. Presentation help: B. Narasimhan, N. El Karoui, J-M Corcuera. Grant support: NIH, NSF, ARC (via P. Hall) p.2
3 Abraham Wald p.3
4 Three Talks
1. Function Estimation & Classical Normal Theory: X_n ~ N_{p(n)}(θ_n, I), p(n) → ∞ with n (MVN)
2. The Threshold Selection Problem: in (MVN) with, say, θ̂_i = X_i I{|X_i| > t̂}, how to select t̂ = t̂(X) reliably?
3. Large Covariance Matrices: X_n ~ N_{p(n)}(I ⊗ Σ_{p(n)}); especially X_n = [Y_n Z_n]; spectral properties of n⁻¹ X_n X_nᵀ; PCA, CCA, MANOVA p.4
5 Focus on Gaussian Models Cue from Wald (1943), Trans. Amer. Math. Soc., an inspiration for Le Cam's theory of Local Asymptotic Normality (LAN) p.5
6 Focus on Gaussian Models, ctd. Growing models: p = p(n) → ∞ with n is now commonplace (classification, genomics, ...) Reality check: But no real data is Gaussian... Yes (in many fields), and yet, consider the power of fable and fairy tale... p.6
7 1. Function Estimation and Classical Normal Theory
Theme: Gaussian White Noise model & multiresolution point of view allow classical parametric theory of N_d(θ, I) to be exploited in nonparametric function estimation
Reference: book manuscript in (non-)progress: Function Estimation and Gaussian Sequence Models, www-stat.stanford.edu/~imj p.7
8 Agenda Classical Parametric Ideas Nonparametric Estimation and Growing Gaussian Models I. Kernel Estimation and James-Stein Shrinkage II. Thresholding and Sparsity III. Bernstein-von Mises phenomenon p.8
9 Classical Ideas
a) Multinormal shift model
X_1, ..., X_n data from P_η(dx) = f_η(x) µ(dx), η ∈ H ⊂ ℝᵈ. Let I_0 = Fisher information matrix at η_0.
Local asymptotic Gaussian approximation:
{P^n_{η_0 + θ/√n}, θ ∈ ℝᵈ} ≈ {N_d(θ, I_0⁻¹), θ ∈ ℝᵈ} p.9
10 Classical Ideas, ctd
b) ANOVA, Projections and Model Selection
Y = Xβ + σɛ, with Y: n×1, X: n×d, β: d×1
Submodels: X = [X_0 X_1], projections P_{X_0}. Canonical form: y = θ + σz.
Projections: (P_0 y)_i = y_i if i ∈ I_0, 0 otherwise
MSE: E‖P_0 y − θ‖² = Σ_{i∈I_0} σ² + Σ_{i∉I_0} θ_i² p.10
11 Classical Ideas, ctd
c) Minimax estimation of θ:
inf_θ̂ sup_{θ∈ℝᵈ} E‖θ̂(y) − θ‖² = dσ², attained by the MLE θ̂_MLE(y) = y.
d) James-Stein estimate:
θ̂_JS(y) = (1 − (d − 2)/‖y‖²) y dominates the MLE if d ≥ 3. p.11
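The dominance claim is easy to check by simulation. The sketch below (an added illustration, not part of the slides) compares the MLE and James-Stein risks at θ = 0, where the gap is largest: the MLE risk equals d while the JS risk is close to 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, reps = 50, 4000
theta = np.zeros(d)          # null signal: JS savings are largest here

def james_stein(y):
    # (1 - (d-2)/||y||^2) y, dominating the MLE y for d >= 3
    return (1 - (len(y) - 2) / np.sum(y ** 2)) * y

mle_mse = js_mse = 0.0
for _ in range(reps):
    y = theta + rng.standard_normal(d)
    mle_mse += np.sum((y - theta) ** 2) / reps
    js_mse += np.sum((james_stein(y) - theta) ** 2) / reps

print(round(mle_mse), round(js_mse))   # approximately d = 50 versus 2
```

At a nonzero θ the gap narrows, but the JS risk never exceeds the MLE risk in expectation.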
12 Classical Ideas, ctd
e) Conjugate priors
Prior: θ ~ N(0, τ²I); likelihood: y | θ ~ N(θ, σ²I)
Posterior: θ | y ~ N(θ̂_y, σ²_y I), with θ̂_y = (τ²/(σ² + τ²)) y, σ²_y = σ²τ²/(σ² + τ²)
f) Unbiased risk estimate
Y ~ N_d(θ, I). For g (weakly) differentiable,
E‖Y + g(Y) − θ‖² = E[d + 2∇ᵀg(Y) + ‖g(Y)‖²] = E[U(Y)]
If g = g(Y; t) then U = U(Y; t). p.12
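The unbiased-risk identity can be verified numerically. A minimal sketch (added illustration) with the linear choice g(y) = −λy, for which ∇ᵀg(Y) = −λd and hence U(Y) = d − 2λd + λ²‖Y‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
d, reps, lam = 20, 20000, 0.3
theta = rng.standard_normal(d)           # arbitrary fixed mean vector

loss_sum = sure_sum = 0.0
for _ in range(reps):
    y = theta + rng.standard_normal(d)
    g = -lam * y                         # weakly differentiable; div g = -lam * d
    loss_sum += np.sum((y + g - theta) ** 2)
    sure_sum += d + 2 * (-lam * d) + lam ** 2 * np.sum(y ** 2)

# both Monte Carlo averages estimate the same quantity,
# E||Y + g(Y) - theta||^2 = (1 - lam)^2 d + lam^2 ||theta||^2
print(loss_sum / reps, sure_sum / reps)
```

The second average never touches θ, which is the point: U(Y) is computable from the data alone, so it can drive data-based choices of the tuning index t in g(Y; t).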
13 Agenda Classical Parametric Ideas Nonparametric Estimation and Growing Gaussian Models I. Kernel Estimation and James-Stein Shrinkage II. Thresholding and Sparsity III. Bernstein-von Mises phenomenon p.13
14 Nonparametric function estimation
Fix and Hodges (1951): nonparametric classification; spectrum estimation in time series; kernel methods for density estimation, regression; roughness penalty methods (splines). Techniques largely differ from parametric normal theory.
Ibragimov & Hasminskii (and school): importance of the Gaussian white noise (GWN) model:
Y_t = ∫₀ᵗ f(s) ds + ɛ W_t, 0 ≤ t ≤ 1 p.14
15 Motivation for GWN model
GWN model emerges as large-sample limit of equispaced regression, density estimation, spectrum estimation, ...
y_j = f(j/n) + σ w_j, j = 1, ..., n; ɛ = σ/√n
(1/n) Σ_{j=1}^{[nt]} y_j = (1/n) Σ_{j=1}^{[nt]} f(j/n) + (σ/√n) · n^{−1/2} Σ_{j=1}^{[nt]} w_j
→ Y_t = ∫₀ᵗ f(s) ds + (σ/√n) W_t p.15
18 Series form of WN Model
Y_t = ∫₀ᵗ f(s) ds + ɛ W_t
For any orthonormal basis {ψ_λ, λ ∈ Λ} for L²[0, 1]:
∫ ψ_λ dY_t = ∫ ψ_λ f dt + ɛ ∫ ψ_λ dW_t
y_λ = θ_λ + ɛ z_λ, (z_λ) i.i.d. N(0, 1), λ ∈ Λ
Parseval relation implies ∫(f̂ − f)² = Σ(θ̂_λ − θ_λ)² = ‖θ̂ − θ‖², etc.
⇒ analysis of infinite sequences in ℓ²(ℕ) p.16
19 Multiresolution connection
Wavelet orthonormal bases {ψ_jk} have double index: level ("octave") j = 1, 2, ...; location k = 1, ..., 2ʲ
Collect coefficients in a single level: y_j = (y_jk), θ_j = (θ_jk), k = 1, ..., 2ʲ, etc.
Finite (but growing with j) multivariate normal model: y_j ~ N_{2ʲ}(θ_j, ɛ²I), j = 1, 2, ...
⇒ apply classical normal theory to the vectors y_j. p.17
20 [Figure: Some S8 Symmlets at various scales and locations] p.18
21 Example 600 Electrical Signal Nationwide electricity consumption in France Here: 3 days (Thur, Fri, Sat), summer 1990, t =1min. from Misiti et. al., Rev. Statist. Appliquée 1994, N.B. Malfunction of meters: d. 2, d. 3, p.19
22 [Figure: Electrical Signal, CDJV wavelet coefficients] Often o.k. to treat a single level as N_{2ʲ}(θ_j, ɛ²I) p.20
23 [Figure: Electrical Signal, CDJV wavelet coefficients] Or, apply the N_d(θ, I) model to a block within a level. p.21
24 The Minimax principle
Given loss function L(a, θ) and risk R(θ̂, θ) = E_θ L(θ̂, θ), choose the estimator θ̂ so as to minimize the maximum risk sup_{θ∈Θ} E_θ L(θ̂, θ).
introduced into statistics by Wald
L.J. Savage (1951): "the only rule of comparable generality proposed since [that of] Bayes was published in 1763"
standard complaint: θ̂ depends on Θ: optimizing on the worst case may be irrelevant p.22
25 Simultaneous near-minimaxity
Shift in perspective on minimaxity: fix f̂ in advance; an estimator to be evaluated for many spaces F
compare sup_F R(f̂, f) to R(F) = inf_f̂ sup_F R(f̂, f)
shift from "exact answer to wrong problem" to "approx answer to many (related) problems"
an imperfect, yet serviceable tool p.23
26 Agenda Classical Parametric Ideas Nonparametric Estimation and Growing Gaussian Models I. Kernel Estimation and James-Stein Shrinkage II. Thresholding and Sparsity III. Bernstein-von Mises phenomenon p.24
27 I. Kernel estimation & James-Stein Shrinkage
Priestley-Chao estimator: f̂_h(t) = (1/(Nh)) Σ_{i=1}^N K((t − t_i)/h) Y_i, e.g. K(x) = (1 − x²)⁴₊
Automatic choice of h? Huge literature (e.g. Wand & Jones, 1995)
James-Stein provides a simple, powerful approach p.25
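A direct implementation of the Priestley-Chao estimator with the quartic kernel from the slide (illustrative sketch; the raw kernel integrates to 256/315, so it is rescaled here to integrate to 1):

```python
import numpy as np

def priestley_chao(t, t_obs, y_obs, h):
    """Priestley-Chao estimate f_hat(t) = (1/(N h)) sum_i K((t - t_i)/h) Y_i,
    using K(x) = (1 - x^2)_+^4 normalized so that K integrates to 1."""
    K = lambda x: np.clip(1 - x ** 2, 0, None) ** 4 / (256 / 315)
    w = K((t[:, None] - t_obs[None, :]) / h)     # kernel weights, shape (len(t), N)
    return w @ y_obs / (len(t_obs) * h)

# noiseless sanity check on a linear function: a symmetric kernel reproduces
# linear trends in the interior, up to discretization error
n = 2000
t_obs = np.arange(1, n + 1) / n
f = lambda t: 1 + 2 * t
t_eval = np.array([0.3, 0.5, 0.7])
fh = priestley_chao(t_eval, t_obs, f(t_obs), h=0.05)
assert np.max(np.abs(fh - f(t_eval))) < 0.01
```

With noisy Y_i the same code applies; the whole difficulty, as the slide says, is the automatic choice of h.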
28 Fourier Form of Kernel Smoothing
Convolution ↔ multiplication: DFT → shrink → iDFT
Shrinkage factors s_k(h) ∈ [0, 1] decrease with frequency
Flatten shrinkage in blocks: θ̂_k = (1 − w_j) y_k, k ∈ B_j. How to choose w_j? p.26
29 James-Stein Shrinkage
For y ~ N_d(θ, I) and d ≥ 3:
θ̂_JS+ = (1 − (d − 2)/‖y‖²)₊ y
Unbiased estimate of risk: E‖θ̂_JS − θ‖² = E_θ{d − (d − 2)²/‖y‖²} ≤ d p.27
30 Smoothing via Blockwise James-Stein
Block Fourier or wavelet (here) coefficients: y_j = (y_jk), k = 1, ..., d_j = 2ʲ; θ_j = (θ_jk), etc.
Apply J-S shrinkage to each block, j ≤ log₂ N: [refs]
θ̂_j^JS+ = (1 − β_j ɛ²/‖y_j‖²)₊ y_j
β_j = d_j − 2 (ExactJS) or β_j = d_j + √(2 d_j · 2 log d_j) (ConsJS)
For ConsJS, if θ_j = 0, then P{‖y_j‖² < β_j} → 1, so θ̂_j^JS = 0. p.28
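In sequence-model form, blockwise James-Stein is a few lines. The sketch below (illustrative; the signal scaling and noise level are invented for the demo) applies the positive-part rule with the conservative β_j to dyadic blocks whose energy decays like that of a smooth function:

```python
import numpy as np

def block_js(y, beta, eps2):
    """Positive-part blockwise James-Stein: (1 - beta*eps^2/||y||^2)_+ * y."""
    shrink = max(0.0, 1.0 - beta * eps2 / np.sum(y ** 2))
    return shrink * y

rng = np.random.default_rng(1)
eps = 0.05                                # noise level (illustrative)
mse_js = mse_raw = 0.0
for j in range(1, 11):
    d = 2 ** j
    # hypothetical smooth signal: per-coefficient scale 2^{-(alpha+1/2)j}, alpha = 1
    theta = rng.standard_normal(d) * 2.0 ** (-1.5 * j)
    y = theta + eps * rng.standard_normal(d)
    beta = d + np.sqrt(2 * d * 2 * np.log(d))   # conservative beta_j from the slide
    theta_hat = block_js(y, beta, eps ** 2)
    mse_js += np.sum((theta_hat - theta) ** 2)
    mse_raw += np.sum((y - theta) ** 2)

print(mse_js < mse_raw)   # shrinkage beats the raw coefficients
```

Coarse levels, where ‖y_j‖² far exceeds β_j ɛ², pass almost untouched; fine levels, which are essentially pure noise, are zeroed out, which is exactly the kernel-like behaviour of the next slides.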
31 Block James-Stein imitates Kernel
[Figure] Red: James-Stein equivalent kernels; Blue: actual kernel p.29
33 Properties of Block James-Stein
Block JS imitates kernel smoothing, but near automatic choice of bandwidth; canonical choices of β_j
Easy theoretical analysis from unbiased risk bounds:
E‖θ̂_j^JS − θ_j‖² ≤ 2ɛ² + ‖θ_j‖² ∧ 2ʲɛ²
e.g. below: MSE for Hölder smooth functions: for 0 < δ < 1,
|f(x) − f(y)| ≤ B|x − y|^δ for all x, y (*)
α = r + δ, r ∈ ℕ, D^r f satisfies (*).
In terms of wavelet coefficients: f ∈ H^α(C) ⇔ |θ_jk| ≤ C 2^{−(α+1/2)j} for all j, k p.31
35 A single level determines MSE
From unbiased risk bounds and Hölder smoothness:
E‖θ̂^JS − θ‖² ≤ Σ_{j ≤ j*} [2ɛ² + ‖θ_j‖² ∧ 2ʲɛ²] + Σ_{j > j*} ‖θ_j‖²
f ∈ H^α(C) ⇒ ‖θ_j‖² ≤ C² 2^{−2αj}
[Figure: MSE contribution by level] p.32
37 Growing Gaussians
[Figure: MSE by level] geometric decay of MSE away from the critical level j*
Growing Gaussian aspect: as the noise ɛ = σ/√n decreases, the worst level j* = j*(ɛ, C) increases:
d_{j*} = 2^{j*} = c(C/ɛ)^{2/(2α+1)} p.33
38 MultiJS is rate-adaptive
The optimal rate for H^α(C) is ɛ^{2r}, with r = r(α) = 2α/(2α + 1).
JS risk bounds imply simultaneous near-minimaxity:
sup_{H^α(C)} R(f̂_JS, f) ≤ c C^{2(1−r)} ɛ^{2r} (1 + o(1)) ≍ R(H^α(C)).
For a single, prespecified estimator f̂_JS, valid for all smoothness α ∈ (0, ∞) and bounds C ∈ (0, ∞).
No speed limit to rate of convergence ↔ infinite order kernel (for certain wavelets)
Conclusion follows easily from single-level James-Stein analysis in the multinormal mean model p.34
39 Agenda Classical Parametric Ideas Nonparametric Estimation and Growing Gaussian Models I. Kernel Estimation and James-Stein Shrinkage II. Thresholding and Sparsity III. Bernstein-von Mises phenomenon p.35
40 James-Stein Fails on Sparse Signals
d‖θ‖²/(d + ‖θ‖²) ≤ R(θ̂_JS, θ) ≤ 2 + d‖θ‖²/(d + ‖θ‖²)
r-spike θ_r: r co-ords at √(d/r), ‖θ‖² = d. So R(θ̂_JS, θ_r) ≥ d/2.
... Hard thresholding at t = √(2 log d):
R(θ̂_HT, θ_r) ≤ d · 2tφ(t) + r[1 + η(d/r)] ≈ r + 3√(2 log d) p.36
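A simulation of the r-spike example (added illustration): with d = 1000 and r = 10 spikes of height √(d/r), James-Stein is stuck near d/2 while hard thresholding at √(2 log d) pays only a small price:

```python
import numpy as np

rng = np.random.default_rng(2)
d, r, reps = 1000, 10, 200
t = np.sqrt(2 * np.log(d))             # threshold sqrt(2 log d), about 3.72
theta = np.zeros(d)
theta[:r] = np.sqrt(d / r)             # r-spike: r coords at sqrt(d/r), ||theta||^2 = d

js_mse = ht_mse = 0.0
for _ in range(reps):
    y = theta + rng.standard_normal(d)
    js = (1 - (d - 2) / np.sum(y ** 2)) * y     # James-Stein: one global shrink factor
    ht = y * (np.abs(y) > t)                    # hard thresholding: per-coordinate
    js_mse += np.sum((js - theta) ** 2) / reps
    ht_mse += np.sum((ht - theta) ** 2) / reps

print(round(js_mse), round(ht_mse))    # JS near d/2 = 500; HT far smaller
```

The single global shrinkage factor is the culprit: ‖y‖² ≈ 2d forces JS to halve the large spike coordinates, while thresholding treats each coordinate on its own.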
42 A more systematic story for thresholding
Based on ℓ_p norms and balls: ‖θ‖_p^p = Σ_{i=1}^d |θ_i|^p (mostly with p < 2)
1. ℓ_p norms capture sparsity
2. thresholding arises from least squares estimation with ℓ_p constraints
3. describe best possible estimation over ℓ_p balls
4. show that thresholding (nearly) attains the best possible p.37
44 ℓ_p norms, p < 2, capture sparsity
[Figure] ℓ₁ ball smaller than ℓ₂ ⇒ expect better estimation
Sparsity of representation depends on basis: rotation sends (0, C, 0, ...) → (C/√d, ..., C/√d) p.38
45 Thresholding from constrained least squares
min Σ (y_i − θ_i)² s.t. Σ |θ_i|^p ≤ C^p, i.e. min Σ (y_i − θ_i)² + λ Σ |θ_i|^p, leads to:
p = 2: linear shrinkage, θ̂_i = (1 + λ)⁻¹ y_i
p = 1: soft thresholding, θ̂_i = y_i − λ̄ if y_i > λ̄; 0 if |y_i| ≤ λ̄; y_i + λ̄ if y_i < −λ̄ [λ̄ = λ/2]
p = 0: hard thresholding, θ̂_i = y_i I{|y_i| > λ} [penalty λ Σ I{θ_i ≠ 0}] p.39
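The three closed forms can be checked against brute-force minimization of the penalized criterion, one coordinate at a time (illustrative sketch; note the soft threshold sits at λ/2 because of how the penalty is parametrized on the slide):

```python
import numpy as np

def penalized_ls(y, lam, p):
    """Closed-form minimizer of (y - t)^2 + lam * pen_p(t) for p in {2, 1, 0}."""
    if p == 2:                       # ridge penalty: linear shrinkage
        return y / (1 + lam)
    if p == 1:                       # l1 penalty: soft threshold at lam/2
        return np.sign(y) * max(abs(y) - lam / 2, 0.0)
    if p == 0:                       # l0 penalty: keep y iff it beats t = 0, i.e. y^2 > lam
        return y if y ** 2 > lam else 0.0

# brute-force check of each closed form on a fine grid
grid = np.linspace(-5, 5, 200001)
lam = 1.0
for p in (2, 1, 0):
    for y in (-2.3, 0.4, 1.7):
        pen = lam * (grid != 0) if p == 0 else lam * np.abs(grid) ** p
        best = grid[np.argmin((y - grid) ** 2 + pen)]
        assert abs(penalized_ls(y, lam, p) - best) < 1e-3
print("closed forms match grid search")
```

The qualitative difference is visible in the formulas themselves: p = 2 shrinks every coordinate by the same factor, while p ≤ 1 produces exact zeros, which is why the ℓ_p story with p < 2 is the natural home for sparsity.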
46 Bounds on best estimation on ℓ_p balls
Minimax risk: R_{d,p}(C) = inf_θ̂ sup_{‖θ‖_p ≤ C} E_θ Σ_{i=1}^d (θ̂_i − θ_i)².
Non-asymptotic bounds [Birgé-Massart]:
c₁ r_{d,p}(C) ≤ R_{d,p}(C) ≤ c₂ [log d + r_{d,p}(C)] (upper bound attained by thresholding)
ℓ₁ ball smaller ⇒ (much) smaller minimax risk p.40
48 Thresholding nearly attains the bound
For simplicity: threshold t = √(2 log d)
Oracle inequality for soft thresholding [non-asymptotic!]:
E‖θ̂_ST − θ‖² ≤ (2 log d + 1)[1 + Σ (θ_i² ∧ 1)]
Apply to the ℓ₁ ball {θ: ‖θ‖₁ ≤ C}:
sup_{‖θ‖₁ ≤ C} E‖θ̂_ST − θ‖² ≤ (2 log d + 1)(C + 1).
Better bounds possible for data-dependent thresholds t = t(Y) (Lec. 2) p.41
49 Summary for N_d(θ, I)
For X ~ N_d(θ, I), comparing James-Stein shrinkage (β = d − 2) and soft thresholding at t = √(2 log d):
James-Stein shrinkage, orthogonally invariant:
½ (‖θ‖² ∧ d) ≤ E‖θ̂_JS − θ‖² ≤ 2 + (‖θ‖² ∧ d)
Thresholding, co-ordinatewise and co-ordinate dependent:
½ Σ (θ_i² ∧ 1) ≤ E‖θ̂_ST − θ‖² ≤ (2 log d + 1)(1 + Σ (θ_i² ∧ 1)) p.42
50 Implications for Function Estimation
Key example: functions of bounded total variation: TV(C) = {f: ‖f‖_TV ≤ C}
‖f‖_TV = sup_{t₁ < ... < t_N} Σ_{i=1}^{N−1} |f(t_{i+1}) − f(t_i)| + ‖f‖₁
Well captured by weighted combinations of ℓ₁ norms on wavelet coefficients:
c₁ sup_j 2^{j/2} ‖θ_j‖₁ ≤ ‖f‖_TV ≤ c₂ Σ_j 2^{j/2} ‖θ_j‖₁
Best possible (minimax) MSE: R(TV(C), ɛ) = inf_f̂ sup_{f ∈ TV(C)} E‖f̂ − f‖² p.43
51 Reduction to single levels
sup_θ E‖θ̂ − θ‖² = Σ_j sup_{θ_j} E‖θ̂_j − θ_j‖²
Apply thresholding bounds for each j, to N_{2ʲ}(θ_j, ɛ²I):
a worst-case level j* = j*(ɛ, C); geometric decay of sup E‖θ̂_j − θ_j‖² as j moves away from j*.
Growing Gaussian aspect: as the noise ɛ = σ/√n decreases, the worst level j* = j*(ɛ, C) increases: d_{j*} = 2^{j*} = c(C/ɛ)^{2/3} p.44
52 Final Result
sup_{TV(C)} R(f̂_thr, f) ≤ C^{2(1−r)} ɛ^{2r}, r = 2/3
sup_{TV(C)} R(f̂_JS, f) ≍ C^{2(1−r_L)} ɛ^{2r_L}, r_L = 1/2
[Figure: Blocks signal; noisy version; √(2 log d) threshold; Block JS shrinkage] p.45
53 Agenda Classical Parametric Ideas Nonparametric Estimation and Growing Gaussian Models I. Kernel Estimation and James-Stein Shrinkage II. Thresholding and Sparsity III. Bernstein-von Mises phenomenon p.46
54 Bernstein-von Mises Phenomenon
Asymptotic match of frequentist & Bayesian confidence intervals. Classical version: Y₁, ..., Y_n i.i.d. p_θ(y)dµ, θ ∈ Θ ⊂ ℝᵈ, d fixed.
Under mild conditions, as n → ∞,
‖P_{θ|Y} − N(θ̂_MLE, n⁻¹ I_{θ₀}⁻¹)‖ → 0 in P_{n,θ₀}-probability,
where I_θ = E_θ[∇_θ log p_θ][∇_θ log p_θ]ᵀ is the Fisher information, and ‖P − Q‖ = max_A |P(A) − Q(A)| is total variation distance p.47
55 Non-parametric Regression
Y_i = f(i/n) + σ₀ w_i, i = 1, ..., n
For typical smoothness priors (e.g. d-th integrated Wiener process prior), Bernstein-von Mises fails [Dennis Cox, D. A. Freedman]: center and scale mismatch in the frequentist & posterior laws of ‖f̂ − f‖₂².
Here, revisit via an elementary Growing Gaussian Model approach; apply to single levels of the wavelet decomposition p.48
56 Growing Gaussian Model
Data: Y₁, ..., Y_n | θ i.i.d. N_p(θ, σ₀² I). Prior: θ ~ N_p(0, τ_n² I).
growing: p = p(n) → ∞ with n; σ_n² = Var(Ȳ_n | θ) = σ₀²/n; τ_n² may depend on n.
Goal: compare L(θ̂_MLE − θ), L(θ̂_Bayes − θ) with L(θ | Y).
Posterior: all Gaussian, so centering and scaling are key:
L(θ | Ȳ = ȳ) = N_p(θ̂_B = w_n ȳ, w_n σ_n² I), w_n = τ_n²/(σ_n² + τ_n²) p.49
57 Growing Gaussians and BvM
Correspondences:
P_{θ|Y} ↔ N_p(w_n ȳ, w_n σ_n² I); N(θ̂_MLE, n⁻¹ I⁻¹) ↔ N_p(ȳ, σ_n² I)
For BvM to hold, now need w_n → 1 sufficiently fast:
Proposition: In the growing Gaussian model, ‖P_{θ|Y} − N(θ̂_MLE, n⁻¹ I⁻¹)‖ → 0 in P_{n,θ₀}-probability if and only if
√p σ_n²/τ_n² = (√p/n) σ₀²/τ_n² → 0, i.e. w_n = 1 − o(1/√p_n). p.50
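Since both laws are Gaussian, their discrepancy has a closed form. The sketch below (an added illustration, using KL divergence as a convenient computable proxy for total variation) shows the condition in action: with τ² fixed, p = n keeps w_n close enough to 1 and the divergence shrinks, while p = n³ makes it blow up.

```python
import numpy as np

def kl_post_freq(p, n, sigma0=1.0, tau2=1.0, seed=0):
    """KL( N_p(w*ybar, w*s2*I) || N_p(ybar, s2*I) ) in the growing Gaussian
    model, at truth theta_0 = 0, for one draw of ybar ~ N_p(0, sigma_n^2 I)."""
    rng = np.random.default_rng(seed)
    s2 = sigma0 ** 2 / n                    # sigma_n^2
    w = tau2 / (s2 + tau2)                  # posterior shrinkage weight w_n
    ybar = rng.standard_normal(p) * np.sqrt(s2)
    # Gaussian KL: (1/2)[p(w - 1 - log w) + (1 - w)^2 ||ybar||^2 / s2]
    return 0.5 * (p * (w - 1 - np.log(w)) + (1 - w) ** 2 * np.sum(ybar ** 2) / s2)

# p = n: sqrt(p) * sigma_n^2 / tau_n^2 -> 0, so BvM holds (divergence -> 0)
small = [kl_post_freq(n, n) for n in (10, 100, 1000)]
# p = n^3: the condition fails and the divergence grows without bound
big = [kl_post_freq(n ** 3, n) for n in (10, 30, 60)]
print(small, big)
```

A Taylor expansion of the KL at w near 1 gives roughly (3/4) p (1 − w_n)², so KL → 0 exactly when √p (1 − w_n) → 0, matching the proposition.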
58 Example: Pinsker priors
Back to regression: dY_t = f(t)dt + σ_n dW_t
Minimax MSE linear estimation of f over {f: ∫₀¹ (D^α f)² ≤ C²}
Least favourable prior on wavelet coeffs (for sample size n): θ_jk indep N(0, τ_j²),
τ_j² = σ_n² (λ_n 2^{−jα} − 1)₊, λ_n = c(C/σ_n)^{2α/(2α+1)}
critical level j* = j*(n, α) grows with n (Growing Gaussian model again) p.51
59 Validity of B-vM depends on level
The Bayes estimator for the Pinsker prior attains the exact minimax MSE (asymptotically).
But Bernstein-von Mises fails at the critical level j*(n):
At j*(n): τ_{j*}²/σ_n² → 2^α − 1 [fine scale features], so 1 − w_n = 1/(1 + τ_{j*}²/σ_n²) → 2^{−α}: BvM fails
At fixed j₀: p = 2^{j₀} FIXED (or growing slowly) [coarse scale]: τ_{j₀}²/σ_n² = λ_n 2^{−j₀α} − 1 → ∞, so 1 − w_n → 0: BvM holds
Difficulty lies with the high-dimensional features. p.52
60 Three Talks
1. Function Estimation & Classical Normal Theory: X_n ~ N_{p(n)}(θ_n, I), p(n) → ∞ with n (MVN)
2. The Threshold Selection Problem: in (MVN) with, say, θ̂_i = X_i I{|X_i| > t̂}, how to select t̂ = t̂(X) reliably?
3. Large Covariance Matrices: X_n ~ N_{p(n)}(I ⊗ Σ_{p(n)}); especially X_n = [Y_n Z_n]; spectral properties of n⁻¹ X_n X_nᵀ; PCA, CCA, MANOVA p.53
More informationRidge regression. Patrick Breheny. February 8. Penalized regression Ridge regression Bayesian interpretation
Patrick Breheny February 8 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/27 Introduction Basic idea Standardization Large-scale testing is, of course, a big area and we could keep talking
More informationON BLOCK THRESHOLDING IN WAVELET REGRESSION: ADAPTIVITY, BLOCK SIZE, AND THRESHOLD LEVEL
Statistica Sinica 12(2002), 1241-1273 ON BLOCK THRESHOLDING IN WAVELET REGRESSION: ADAPTIVITY, BLOCK SIZE, AND THRESHOLD LEVEL T. Tony Cai University of Pennsylvania Abstract: In this article we investigate
More information19.1 Maximum Likelihood estimator and risk upper bound
ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 19: Denoising sparse vectors - Ris upper bound Lecturer: Yihong Wu Scribe: Ravi Kiran Raman, Apr 1, 016 This lecture
More informationgroup. redshifts. However, the sampling is fairly complete out to about z = 0.2.
Non parametric Inference in Astrophysics Larry Wasserman, Chris Miller, Bob Nichol, Chris Genovese, Woncheol Jang, Andy Connolly, Andrew Moore, Jeff Schneider and the PICA group. 1 We discuss non parametric
More information(a). Bumps (b). Wavelet Coefficients
Incorporating Information on Neighboring Coecients into Wavelet Estimation T. Tony Cai Bernard W. Silverman Department of Statistics Department of Mathematics Purdue University University of Bristol West
More informationBayesian Asymptotics
BS2 Statistical Inference, Lecture 8, Hilary Term 2008 May 7, 2008 The univariate case The multivariate case For large λ we have the approximation I = b a e λg(y) h(y) dy = e λg(y ) h(y ) 2π λg (y ) {
More informationInformation Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n
Information Measure Estimation and Applications: Boosting the Effective Sample Size from n to n ln n Jiantao Jiao (Stanford EE) Joint work with: Kartik Venkat Yanjun Han Tsachy Weissman Stanford EE Tsinghua
More informationLocal Polynomial Modelling and Its Applications
Local Polynomial Modelling and Its Applications J. Fan Department of Statistics University of North Carolina Chapel Hill, USA and I. Gijbels Institute of Statistics Catholic University oflouvain Louvain-la-Neuve,
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationESTIMATORS FOR GAUSSIAN MODELS HAVING A BLOCK-WISE STRUCTURE
Statistica Sinica 9 2009, 885-903 ESTIMATORS FOR GAUSSIAN MODELS HAVING A BLOCK-WISE STRUCTURE Lawrence D. Brown and Linda H. Zhao University of Pennsylvania Abstract: Many multivariate Gaussian models
More informationPenalty Methods for Bivariate Smoothing and Chicago Land Values
Penalty Methods for Bivariate Smoothing and Chicago Land Values Roger Koenker University of Illinois, Urbana-Champaign Ivan Mizera University of Alberta, Edmonton Northwestern University: October 2001
More informationMIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications. Class 19: Data Representation by Design
MIT 9.520/6.860, Fall 2017 Statistical Learning Theory and Applications Class 19: Data Representation by Design What is data representation? Let X be a data-space X M (M) F (M) X A data representation
More informationPermutation-invariant regularization of large covariance matrices. Liza Levina
Liza Levina Permutation-invariant covariance regularization 1/42 Permutation-invariant regularization of large covariance matrices Liza Levina Department of Statistics University of Michigan Joint work
More informationThe lasso. Patrick Breheny. February 15. The lasso Convex optimization Soft thresholding
Patrick Breheny February 15 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/24 Introduction Last week, we introduced penalized regression and discussed ridge regression, in which the penalty
More informationGaussian estimation: Sequence and wavelet models
Gaussian estimation: Sequence and wavelet models Draft version, September 8, 2015 Comments welcome Iain M. Johnstone c2015. Iain M. Johnstone iii Contents (working) Preface List of Notation ix xiv 1 Introduction
More information19.1 Problem setup: Sparse linear regression
ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 19: Minimax rates for sparse linear regression Lecturer: Yihong Wu Scribe: Subhadeep Paul, April 13/14, 2016 In
More informationsparse and low-rank tensor recovery Cubic-Sketching
Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru
More informationSparse Nonparametric Density Estimation in High Dimensions Using the Rodeo
Outline in High Dimensions Using the Rodeo Han Liu 1,2 John Lafferty 2,3 Larry Wasserman 1,2 1 Statistics Department, 2 Machine Learning Department, 3 Computer Science Department, Carnegie Mellon University
More informationOPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1
The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute
More informationNon linear estimation in anisotropic multiindex denoising II
Non linear estimation in anisotropic multiindex denoising II Gérard Kerkyacharian, Oleg Lepski, Dominique Picard Abstract In dimension one, it has long been observed that the minimax rates of convergences
More informationwith Current Status Data
Estimation and Testing with Current Status Data Jon A. Wellner University of Washington Estimation and Testing p. 1/4 joint work with Moulinath Banerjee, University of Michigan Talk at Université Paul
More information21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)
10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC
More informationASSESSING A VECTOR PARAMETER
SUMMARY ASSESSING A VECTOR PARAMETER By D.A.S. Fraser and N. Reid Department of Statistics, University of Toronto St. George Street, Toronto, Canada M5S 3G3 dfraser@utstat.toronto.edu Some key words. Ancillary;
More informationLecture Notes 5: Multiresolution Analysis
Optimization-based data analysis Fall 2017 Lecture Notes 5: Multiresolution Analysis 1 Frames A frame is a generalization of an orthonormal basis. The inner products between the vectors in a frame and
More informationHomework # , Spring Due 14 May Convergence of the empirical CDF, uniform samples
Homework #3 36-754, Spring 27 Due 14 May 27 1 Convergence of the empirical CDF, uniform samples In this problem and the next, X i are IID samples on the real line, with cumulative distribution function
More informationRATES OF CONVERGENCE OF ESTIMATES, KOLMOGOROV S ENTROPY AND THE DIMENSIONALITY REDUCTION PRINCIPLE IN REGRESSION 1
The Annals of Statistics 1997, Vol. 25, No. 6, 2493 2511 RATES OF CONVERGENCE OF ESTIMATES, KOLMOGOROV S ENTROPY AND THE DIMENSIONALITY REDUCTION PRINCIPLE IN REGRESSION 1 By Theodoros Nicoleris and Yannis
More informationInverse Statistical Learning
Inverse Statistical Learning Minimax theory, adaptation and algorithm avec (par ordre d apparition) C. Marteau, M. Chichignoud, C. Brunet and S. Souchet Dijon, le 15 janvier 2014 Inverse Statistical Learning
More informationMinimax Estimation of a nonlinear functional on a structured high-dimensional model
Minimax Estimation of a nonlinear functional on a structured high-dimensional model Eric Tchetgen Tchetgen Professor of Biostatistics and Epidemiologic Methods, Harvard U. (Minimax ) 1 / 38 Outline Heuristics
More informationLeast squares under convex constraint
Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption
More informationEstimation of cumulative distribution function with spline functions
INTERNATIONAL JOURNAL OF ECONOMICS AND STATISTICS Volume 5, 017 Estimation of cumulative distribution function with functions Akhlitdin Nizamitdinov, Aladdin Shamilov Abstract The estimation of the cumulative
More informationIntroduction to Estimation Methods for Time Series models Lecture 2
Introduction to Estimation Methods for Time Series models Lecture 2 Fulvio Corsi SNS Pisa Fulvio Corsi Introduction to Estimation () Methods for Time Series models Lecture 2 SNS Pisa 1 / 21 Estimators:
More informationNonparametric Estimation: Part I
Nonparametric Estimation: Part I Regression in Function Space Yanjun Han Department of Electrical Engineering Stanford University yjhan@stanford.edu November 13, 2015 (Pray for Paris) Outline 1 Nonparametric
More information