Model Selection and Geometry

Size: px

Start display at page:

Download "Model Selection and Geometry"

Cleopatra Cross
5 years ago
Views:

1 Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February

2 Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model selection may be usefull for geometric inference! Focus on the Gaussian framework: convenient to provide the main ideas and intuitions

3 Concentration of measure

4 Isoperimetry The Gaussian isoperimetric Theorem: let P be the standard Gaussian measure on! N. Given θ, among the sets A with P A { } P d (., A) < ε ( ) = θ is minimal for a half-space R N 1 (,ε θ with Q( ε ) θ = θ ( : standard Gaussian tail function, d : Euclidean distance)

5 Gaussian concentration inequality and therefore if θ 1/, then ε 0 θ and { ( ) ε} Q( ε + ε ) θ Q( ε ) e ε / P d., A Given some L-Lipschitz function f, applying this inequality with ε = x to the set { } A = f Mf where Mf is a median of f leads to P{ f Mf L x} e x True (not obvious) with instead of Ef Mf

6 Chi-quare distribution If we apply this inequality to the Euclidean norm itself, i.e. f : z z, then L = 1 and since Ef = N Ef N and therefore { } e x P f N + x Of course f follows a chi-square distribution χ N. ( )

7 The Gaussian concentration inequality also applies to suprema of Gaussian processes. { ( ),t T } Let G t be some centered Gaussian process. Assume that T is equipped with the covariance pseudo-metric d defined by d u,t If (T,d) is separable and if continuous on (T,d), setting ( ) ( ) = Var G( t) G( u) { G( t),t T } Z = supg t t T P Z E Z + σ x is a.s. one has ( ) e x where σ = sup t T ( ) Var G t

8 Outside the Gaussian world! Concentration of measure for product measures has been studied for years by M. Talagrand in a series of remarkable papers.! Talagrand s concentration inequality for empirical processes plays a fundamental role in statistics in general and for model selection in particular.! Gaussian life is easier!

9 Model Selection

10 What are we talking about? Statistical inference One observes X ( random vector, random process ) with unknown distribution P. Purpose: take a decision about some quantity related to. Predict with a level of confidence. What to do: design a genuine estimation procedure ˆθ ( X ) of θ and get some idea of how far it is from the target. The role of probability theory Problem: the exact distribution of ˆθ is generally unknown. Solution: make some approximation or evaluation based on Probability Theory.

11 Asymptotic Theory Typically, when X = ( X 1,..., X ) n, n and X 1,..., X n are independent. Asymptotic theory in statistics uses limit theorems ( Central limit Theorems, Large Deviation Principles) as approximation tools. Example: when P belongs to some given model ( which does not depend of n ), behavior of maximum likelihood estimators. Model selection Designing a genuine requires some prior knowledge on P. Choosing a proper model for is a major problem for the statistician. Aim of model selection : construct data-driven criteria to select a model among a given list.

12 Asymptotic approach to model selection - Idea of using some penalized empirical criterion goes back to the seminal works of Akaike ( 70). - Akaike celebrated criterion (AIC) suggests to penalize the log-likelihood by the number of parameters of the parametric model. - This criterion is based on some asymptotic approximation that essentially relies on Wilks Theorem

13 Wilks Theorem: under some proper regularity conditions the log-likelihood L ( n θ ) based on n i.i.d. observations with distribution belonging to a parametric model with D parameters obeys to the following weak convergence result where denotes the MLE and is the true value of the parameter.

14 Non asymptotic Theory In many situations, it is usefull to make the size of the models tend to infinity or make the list of models depend on n. In these situations, classical asymptotic analysis breaks down and one needs to introduce an alternative approach that we call non asymptotic. We still like Large values of n! But the size of the models as well the size of the list of models should be authorized to be large too. This approach is based on Concentration Inequalities

15 This approach has been fruitfully used in several works. Among others: Baraud ( 00) and ( 03) for least squares in the regression framework, Castellan ( 03) for log-splines density estimation, Patricia Reynaud ( 03) for poisson processes, etc

16 Gaussian Model Selection

17 The Gaussian framework We consider the generalized linear Gaussian model. This means that, given some separable Hilbert space, one observes W Y ε ( t) = s,t + εw ( t), t H where is some centered Gaussian isonormal process, i.e. maps isometrically onto some Gaussian subspace of L ( Ω). We have in mind that writing the level of noise as allows an easy comparison with other frameworks involving observations.

18 This framework is convenient to cover both the infinite dimensional white noise model and the finite dimensional linear model for which and, where is a standard Gaussian random vector. Consider some collection of (typically closed and convex) subsets of. We also consider the least squares criterion L ε ( t) = t Y ε t and, for each, ŝ m minimizing over. The quality of model is measured by the quadratic risk S m S m E s ŝ m ( ) L ε

19 Does the risk reflect the model choice paradigm? Let us compute it in the simplest situation where is a linear model with dimension. Then D m S m E s ŝ m = ε D m + d The best (oracle) model according to this criterion should make a trade-off between its size and its quality of approximation to the truth. The aim is now to mimic it. Our purpose is to analyze the penalized LSE, where ˆm = argmin m M ( s,s ) m { L ( ε ŝ ) m + pen( m) }

20 Selecting linear models Let us begin with model selection among linear models and review some results obtained in joint works with Lucien Birgé (JEMS 01 and PTRF 07). Each model S m is assumed to be linear with dimension D m and represented by the least squares estimator on. Then S m E s ŝ m = ε D m + d ( s,s ) m Variance Bias infe s ŝ m M m Oracle : Ideal model achieving Aim: mimic the oracle by estimating the risk

21 Mallows heuristics Let denote the orthogonal projection of s on. An «ideal» model should minimize the quadratic risk s s m + ε D m = s s m + ε D m Pythagore or equivalently s m + ε D m Substituting to s m its natural unbiased estimator leads to Mallows criterion ŝ m ε D m ŝ m + ε D m = L ε ( ŝ ) m + ε D m Issue : look the way uniformly w.r.t. m M. concentrates around ε D m

22 Theorem (Birgé, Massart 01) Let ( ) m M x m e xm m M such that be a family of non negative weights = Σ < Let K > 1. Assume that and take pen( m) K ε ( ) D m + x m ˆm = argmin m M ŝ m + pen m ( ) Then E ŝ ˆm s C K { } ( ) inf m M d ( s,s ) m + pen( m) + Σε

23 A typical choice is Choice of the weights. Then if one defines ( ) = α D + log# { m M : D m = D} x D the weights actually appear as the price to pay for redundancy (i.e. many models with the same dimension). The penalty becomes and the upper bound (up to constant) writes { inf { D inf d ( Dm s,s ) =D m + pen( D) }} Approximation property of D m =D is essential S m

24 Comparison with the oracle Since If the weights and E ŝ m s = d ( s,s ) m + ε D m are such that then the upper bound provided by the Theorem becomes inf m M E ŝ m s (up to some multiplicative constant). Conclusion : The selected estimator performs (almost) as well as an oracle.

25 { } Variable Selection Let ϕ j, j N be a family of elements in. Let be some collection of subsets of { 1,...,N } and define S m = ϕ j, j m, m M. a. Ordered variable selection M is the collection of subsets of the form with D N (possibly N = ) Then one can take pen m ( ) = K 'ε m { 1,...,D } Comparable to the oracle (linearly independent case) E s ŝ ˆm C ( K ' )inf m M E s ŝ m sharp

26 b. Complete variable selection Let be the collection of all subsets of. One can take and Σ = ( ) x m = m ln N e x m = C D N e Dln N m M D N ( which leads to ( ) e. pen( m) = K ε m 1+ ln N ( ) with. In the orthonormal case can be explicited. It is simply hard thresholding (Donoho, Johnstone, Kerkyacharian, Picard) ŝ ˆm = N j=1 ˆβ j 1ˆβ j T ϕ j, with ˆβ j = Y ( ϕ ) j )

27 with T = Kε 1+ ( ln( N) ). Then { ( )} E s ŝ ˆm C ' ( K )inf D N inf m =D s s m + ε Dlog N sharp Role of the basis ( ( ) + ) Refinement : One can choose x m = D ln N / D whenever, which leads to an optimal risk bound (minimax sense).

28 Link with non parametric adaptation { } Let ϕ j, j 1 be the Fourier basis with its α natural ordering then s s D κ s ( ) D α. Up to constant E s s ŝ ˆm R 4α α +1 α +1 ε is bounded by uniformly w.r.t s satisfying s α R. Optimal in minimax sense since infŝ sup s ( α ) R Conclusion Minimax, up to constant over all the Sobolev balls ( s α ) R simultanously. ( ) E s s ŝ κ ' R 4α α +1 α +1 ε

29 Conclusions Mallows criterion can underpenalize. Condition K>1 is sharp. What penalty should be recommanded? One can try to optimize the oracle inequality. The result is that twice the minimal value is a good choice (Birgé, Massart (007)) Practical use One does not know the level of noise but one can retain from the theory that «optimal» penalty = «minimal» penalty data-driven penalty

30 Back to Geometry

31 Theorem (M. 007) Assume that for every, there exists some function φ m such that φ m x on and ( ) / x for any positive and any point in. Let us define D m such that φ m ε D m and consider some family of weights ( xm ) m M Let be some constant with and take Then E sup t S m W ( t) W ( u) t u + x ( ) = ε D m K > 1 x φ m ( x) ( ). pen( m) K ε D m + x m such that

32 Comments ' If is finite dimensional with dimension D m ' we can take D m = D m which shows that the above Theorem strictly implies the linear model selection Theorem (Birgé,Massart ( 01)). Indeed since x t u x + t u if ( ψ ) j 1 ' j Dm sup t S m therefore denotes some orthonormal basis of W ( t) W ( u) t u + x x 1 sup t S m E sup t S m So we may take φ m W ( t) t W ( t) W ( u) t u + x ' D = m W ψ j j=1 x 1 ( ) D m ' ( x) ' ' = x D m D m = D m

33 More generally, the function should be understood as a modulus of continuity of model over. One can indeed use a pealing device to ensure that if for every E sup t S m, t u σ φ m u S m ( W ( t) W ( u) ) ϕ m and every ( σ ) where ϕ m is a non decreasing and concave function with ϕ m 0, then one can take ( ) = 0 φ m ( x) = 5ϕ m x ( ) One possibility is to use covering numbers ϕ m σ 0 ( σ ) = κ log N δ,s m ( ) dδ

34 Simplicial approximation Here is an illustration of what we are promoting with Frédéric Chazal and our students. C. Caillerie and B. Michel (011) have applied the Gaussian model selection Theorem above to simplicial approximation. Consider that one observes x = s + σξ with 1 i n i i i The unobserved deterministic points s belong to! p and the variables ξ i s are i.i.d. standard normal random vectors. One has in mind that the points s i s are sampled on some geometrical object that one wants to learn. s i

35 They consider some collection of k-homogenous simplicial complexes { C m,m M} in! p. They use covering numbers to define a proper penalty term and derive a risk bound on the selected estimator E s ŝ ˆm s! np where simply denotes the vector of «built» from the s i s. As expected (at least if the collection is not too rich) the penalty recommanded by the Theorem is found to be proportionnal to log C m k

36 where C m k is measuring the size of the simplicial complex. More precisely C m k = ( ) k 1/k + ω Cm Δ ω where the sum is extended to the set of simplices with maximal dimension k and Δ ω denotes the diameter of the smallest Euclidean ball containing the simplex ω. The bad news are that the penalty involves nasty constant (and the level of noise that we do not know in practice). They use a heuristics to solve this problem.

37 They use the penalized criterion X ŝ m + ˆβ log Cm k where ˆβ is estimated from the data by using the following phase transition which is empirically observed: when β is too small ( β < ˆβ ) the criterion X ŝ m + β log Cm k chooses a simplicial complex with a size which is close to the maximal one in the collection.

38 Many remaining issues Among which: escaping from the square loss, building an adaptive estimation theory Thanks for your attention!

Model selection theory: a tutorial with applications to learning

Model selection theory: a tutorial with applications to learning Pascal Massart Université Paris-Sud, Orsay ALT 2012, October 29 Asymptotic approach to model selection - Idea of using some penalized empirical