A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo

1 A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron, YALE UNIVERSITY. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo. Frejus, France, September 1-5, 2008

2 A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo.
Mon: Greedy Algorithms for l1 Penalization and Subset Selection. Tue: Upper Bounds and Adaptation. Wed: Lower Bounds and Minimax Rates. Thu: Role of Approximation Theory.

3 Information Theory & Statistical Risk Analysis. Andrew R. Barron.

4 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

5 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

6 Shannon, Kraft, McMillan
Characterization of uniquely decodable codelengths L(x), x ∈ X: Σ_x 2^{−L(x)} ≤ 1 (Kraft inequality)
L(x) = log 1/q(x) corresponds to q(x) = 2^{−L(x)}
Operational meaning of probability: a distribution is given by a choice of code
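To make the codelength-probability correspondence concrete, here is a minimal numeric sketch (not part of the original slides; the distribution q is an arbitrary example):

```python
import numpy as np

# A probability mass function q on a four-symbol alphabet and its idealized
# codelengths L(x) = log2 1/q(x); the Kraft sum of 2^{-L(x)} is at most 1.
q = np.array([0.5, 0.25, 0.125, 0.125])
L = np.log2(1.0 / q)                     # codelengths in bits: [1, 2, 3, 3]
print(L)
print(np.sum(2.0 ** (-L)) <= 1 + 1e-12)  # Kraft inequality (equality here)
```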

7 Codelength Comparison
Targets p are possible distributions
Compare codelength log 1/q(x) to the target log 1/p(x)
Redundancy or regret: log 1/q(x) − log 1/p(x)
Expected redundancy: D(P_X ‖ Q_X) = E_P[ log p(X)/q(X) ]

8 Universal Codes
MODELS: a family of coding strategies or a family of probability distributions, indexed by parameters or functions:
{ L_θ(x) : θ ∈ Θ } and { p_θ(x) : θ ∈ Θ }
{ L_f(x) : f ∈ F } and { p_f(x) : f ∈ F }
Universal codes, universal probabilities q(x), with L(x) = log 1/q(x)
Redundancy: log 1/q(x) − log 1/p_θ(x)
Want it small either uniformly in x, θ or in expectation

9 Statistical Aim
Training data x → estimator p̂ = p_θ̂
Subsequent data x′
Want log 1/p̂(x′) to compare favorably to log 1/p(x′), for targets p close to or in the families

10 Loss
Kullback information-divergence: D(P_X′ ‖ Q_X′) = E[ log p(X′)/q(X′) ]
Bhattacharyya, Hellinger, Chernoff, Rényi divergence: d(P_X′, Q_X′) = 2 log 1/E[ (q(X′)/p(X′))^{1/2} ]
Product model case p(x′) = Π_{i=1}^n p(x′_i): D(P_X′ ‖ Q_X′) = n D(P ‖ Q), and likewise d(P_X′, Q_X′) = n d(P, Q)

11 Loss
Relationship: d(P, Q) ≤ D(P ‖ Q), and, if the log density ratio is not more than B, then with C_B = 2 + B, D(P ‖ Q) ≤ C_B d(P, Q)
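A quick numeric check of these relations, on an arbitrary pair of discrete distributions (my own illustration, not from the slides):

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

D = np.sum(p * np.log(p / q))              # Kullback divergence (nats)
d = -2.0 * np.log(np.sum(np.sqrt(p * q)))  # Bhattacharyya divergence 2 log 1/E_P[(q/p)^{1/2}]
B = np.max(np.log(p / q))                  # bound on the log density ratio
print(d <= D <= (2 + B) * d)               # True for this pair
```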

12 Minimum Description Length (MDL) Principle
Universal coding brought into statistical play
Minimum Description Length Principle: the shortest code for the data gives the best statistical model

13 MDL: Two-stage Version
Two-stage codelength: L(x) = min_{θ ∈ Θ} [ log 1/p_θ(x) + L(θ) ], i.e. bits for x given θ plus bits for θ
The corresponding statistical estimator is p̂ = p_θ̂
Typically in d-dimensional families L(θ) is of order (d/2) log n
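A toy illustration of the two-stage criterion (my own sketch, not code from the talk): choose a polynomial degree by minimizing log 1/p_θ̂(x) + L(θ) with the classical complexity L(θ) of order (d/2) log n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.3
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, sigma, n)   # true curve is quadratic

best = None
for degree in range(6):
    X = np.vander(x, degree + 1, increasing=True)           # d = degree + 1 parameters
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ theta_hat) ** 2)
    neg_loglik = rss / (2 * sigma**2) + (n / 2) * np.log(2 * np.pi * sigma**2)
    complexity = ((degree + 1) / 2) * np.log(n)              # (d/2) log n nats for theta
    score = neg_loglik + complexity
    if best is None or score < best[1]:
        best = (degree, score)
print("selected degree:", best[0])                           # typically picks 2
```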

14 MDL: Mixture Versions
Codelength based on a Bayes mixture: L(x) = log 1/q(x), where q(x) = ∫ p(x|θ) w(θ) dθ
Average-case optimal and pointwise optimal for a.e. θ
Codelength approximation (Barron 1985, Clarke and Barron 1990, 1994):
log 1/q(x) ≈ log 1/p(x|θ̂) + (d/2) log( n/2π ) + log( |Î(θ̂)|^{1/2} / w(θ̂) )
where Î(θ̂) is the empirical Fisher information at the MLE
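The accuracy of this approximation is easy to check numerically in a one-parameter example (my own sketch): n Bernoulli(θ) observations with a uniform prior, where the mixture q(x) has a closed form and I(θ) = 1/(θ(1−θ)).

```python
import math

n, k = 100, 30                        # n coin flips with k ones
theta_hat = k / n

# exact mixture codelength in nats: -log Beta(k+1, n-k+1)
exact = -(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2))

# approximation: log 1/p(x|theta_hat) + (d/2) log(n/2pi) + log(|I_hat|^{1/2}/w), d = 1, w = 1
neg_loglik = -(k * math.log(theta_hat) + (n - k) * math.log(1 - theta_hat))
fisher = 1.0 / (theta_hat * (1 - theta_hat))
approx = neg_loglik + 0.5 * math.log(n / (2 * math.pi)) + 0.5 * math.log(fisher)

print(round(exact, 3), round(approx, 3))   # the two codelengths agree to within ~0.01 nats
```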

15 MDL: Mixture Versions
Codelength based on a Bayes mixture: L(x) = log 1/q(x), where q(x) = ∫ p(x|θ) w(θ) dθ
Average-case optimal and pointwise optimal for a.e. θ
Corresponding statistical estimator:
p̂(x′) = q(x′|x) = ∫ p(x′|θ) p(x|θ) w(θ) dθ / ∫ p(x|θ) w(θ) dθ
which is a predictive distribution

16 Predictive Versions
Chain rule for joint densities: q(x) = q(x_1) q(x_2|x_1) ⋯ q(x_n|x_1, x_2, ..., x_{n−1})
Chain rule for codelengths based on predictive distributions:
L(x) = log 1/q(x_1) + log 1/q(x_2|x_1) + ... + log 1/q(x_n|x_1, ..., x_{n−1})
Corresponding statistical estimator at x′ = x_{n+1}: p̂(x′) = q(x_{n+1}|x_1, ..., x_n)

17 Chain Rule of Divergence
Expected redundancy of compression of x by predictive distributions corresponds to cumulative risk for sample sizes n′ < n:
D(P_X ‖ Q_X) = E[ log p(X_1)/q(X_1) + log p(X_2)/q(X_2|X_1) + ... + log p(X_n)/q(X_n|X_1, ..., X_{n−1}) ]
= Σ_{n′=0}^{n−1} E[ D( P_X′ ‖ P̂_{X′|X_1,...,X_{n′}} ) ]
Individual risk: E[ D( P_X′ ‖ P̂_{X′|X_1,...,X_{n′}} ) ]
Cumulative risk of (d/2) log n corresponds to individual risk of d/(2n)
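The chain rule can be seen numerically (my own sketch, Bernoulli data with the uniform-prior Bayes mixture, whose predictive distribution is the Laplace add-one rule): the total predictive codelength minus the target codelength behaves like (d/2) log n with d = 1.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
theta, n = 0.3, 2000
x = rng.binomial(1, theta, size=n)

cum, ones = 0.0, 0
for i, xi in enumerate(x):
    p_one = (ones + 1) / (i + 2)               # Laplace add-one predictive probability
    cum += -math.log(p_one if xi == 1 else 1 - p_one)
    ones += xi

target = -(ones * math.log(theta) + (n - ones) * math.log(1 - theta))
print(round(cum - target, 2), round(0.5 * math.log(n), 2))  # both are of order (1/2) log n
```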

18 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

19 Information Capacity
A channel θ → Y is a family of probability distributions { P_{Y|θ} : θ ∈ Θ }
Information capacity: C = max_{P_θ} I(θ; Y)

20 Communications Capacity
C_com = maximum rate of reliable communication, measured in message bits per use of a channel
Shannon Channel Capacity Theorem (Shannon 1948): C_com = C

21 Data Compression Capacity
Minimax redundancy: Red = min_{Q_Y} max_{θ ∈ Θ} D(P_{Y|θ} ‖ Q_Y)
Data Compression Capacity Theorem: Red = C (Gallager; Davisson & Leon-Garcia; Ryabko)

22 Statistical Risk Setting
Loss function ℓ(θ, θ′)
Examples: Kullback loss ℓ(θ, θ′) = D(P_{Y|θ} ‖ P_{Y|θ′}); squared metric loss, e.g. squared Hellinger loss ℓ(θ, θ′) = d²(θ, θ′)

23 Statistical Capacity
Estimators θ̂_n based on a sample Y of size n
Minimax risk (Wald): r_n = min_{θ̂_n} max_θ E ℓ(θ, θ̂_n)

24 Ingredients for Determination of Statistical Capacity
Kolmogorov metric entropy of S ⊂ Θ: H(ε) = max{ log Card(Θ_ε) : d(θ, θ′) > ε for θ, θ′ ∈ Θ_ε ⊂ S }
Loss assumption, for θ, θ′ ∈ S: ℓ(θ, θ′) ≍ D(P_{Y|θ} ‖ P_{Y|θ′}) ≍ d²(θ, θ′)

25 Statistical Capacity Theorem
For infinite-dimensional Θ, with the metric entropy evaluated at a critical separation ε_n
Statistical Capacity Theorem: Minimax Risk ≍ Info Capacity Rate ≍ Metric Entropy Rate, i.e.
r_n ≍ C_n/n ≍ H(ε_n)/n ≍ ε_n²
(Yang 1997; Yang and Barron 1999; Haussler and Opper 1997)

26 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

27 Two-stage Code Redundancy
Expected codelength minus the target at p_θ*:
Redundancy = E[ min_{θ ∈ Θ} { log 1/p_θ(x) + L(θ) } − log 1/p_θ*(x) ]
Redundancy approximation in smooth families: (d/2) log( n/2π ) + log( |I(θ*)|^{1/2} / w(θ*) )

28 Redundancy and Resolvability
Redundancy = E[ min_{θ ∈ Θ} { log p_θ*(x)/p_θ(x) + L(θ) } ]
Resolvability = min_{θ ∈ Θ} E[ log p_θ*(x)/p_θ(x) + L(θ) ] = min_{θ ∈ Θ} [ D(P_{X|θ*} ‖ P_{X|θ}) + L(θ) ]
Ideal tradeoff of Kullback approximation error and complexity
Population analogue of the two-stage code MDL criterion
Divide by n to express as a rate. In the i.i.d. case: R_n(θ*) = min_{θ ∈ Θ} [ D(θ* ‖ θ) + L(θ)/n ]
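The index of resolvability is simple to compute for a toy countable family (my own construction, not from the slides): dyadic grids of Bernoulli parameters, with roughly 2m bits spent on a grid point at resolution level m.

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta_star = 0.3137                              # target parameter
for n in [10, 100, 1000, 10000]:
    best = float("inf")
    for m in range(1, 16):                       # resolution level of the grid
        L = 2 * m * math.log(2)                  # ~2m bits for level and index, in nats
        for j in range(1, 2 ** m):
            best = min(best, kl_bernoulli(theta_star, j / 2 ** m) + L / n)
    print(n, round(best, 5))                     # resolvability shrinks roughly like log(n)/n
```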

29 Risk of Estimator Based on Two-stage Code
Estimator θ̂ is the choice achieving the minimization min_{θ ∈ Θ} { log 1/p_θ(x) + L(θ) }
Codelengths for θ are L(θ) = 2 L̃(θ), with Σ_{θ ∈ Θ} 2^{−L̃(θ)} ≤ 1
Total loss d_n(θ*, θ̂), with d_n(θ*, θ) = d(P_{X|θ*}, P_{X|θ})
Risk = E[ d_n(θ*, θ̂) ]
Info-theory bound on risk (Barron 1985; Barron and Cover 1991; Jonathan Li 1999): Risk ≤ Redundancy ≤ Resolvability

30 Risk of Estimator Based on Two-stage Code
Estimator θ̂ achieves min_{θ ∈ Θ} { log 1/p_θ(x) + L(θ) }
Codelengths require Σ_{θ ∈ Θ} 2^{−L(θ)} ≤ 1
Risk ≤ Resolvability. Specialized to the i.i.d. case: E d(θ*, θ̂) ≤ min_{θ ∈ Θ} [ D(θ* ‖ θ) + L(θ)/n ]
As n → ∞, tolerate a more complex P_{X|θ} if needed to get near P_{X|θ*}
The rate is 1/n, or close to that rate if the target is simple
Drawback: the code interpretation entails a countable Θ

31 Key to Risk Analysis
Log likelihood-ratio discrepancy at training x and future x′: [ log p_θ*(x)/p_θ(x) ] − d_n(θ*, θ)
The proof shows, for L(θ) satisfying Kraft, that
min_{θ ∈ Θ} { [ log p_θ*(x)/p_θ(x) − d_n(θ*, θ) ] + L(θ) }
has expectation ≥ 0. From this the risk bound follows.

32 Information-theoretically Valid Penalties
Penalized likelihood: min_{θ ∈ Θ} { log 1/p_θ(x) + Pen(θ) }, with possibly uncountable Θ
Yields a data compression interpretation if there exist a countable Θ̃ and L satisfying Kraft such that the above is not less than
min_{θ̃ ∈ Θ̃} { log 1/p_θ̃(x) + L(θ̃) }

33 Data-Compression Valid Penalties
Equivalently, Pen(θ) is valid for penalized likelihood with uncountable Θ to have a data compression interpretation if there are a countable Θ̃ and a Kraft-summable L(θ̃) such that, for every θ in Θ, there is a representor θ̃ in Θ̃ with
Pen(θ) ≥ L(θ̃) + log p_θ(x)/p_θ̃(x)
This is the link between the uncountable and countable cases.

34 Statistical-Risk Valid Penalties
Penalized likelihood: θ̂ = argmin_{θ ∈ Θ} { log 1/p_θ(x) + Pen(θ) }
Again: possibly uncountable Θ
Task: determine a condition on Pen(θ) such that the risk is captured by the population analogue
E d_n(θ*, θ̂) ≤ inf_{θ ∈ Θ} { E log p_θ*(X)/p_θ(X) + Pen(θ) }

35 Statistical-Risk Valid Penalty
For an uncountable Θ and a penalty Pen(θ), θ ∈ Θ, suppose there is a countable Θ̃ and L(θ̃) = 2 L̃(θ̃), where L̃(θ̃) satisfies Kraft, such that, for all x, θ*:
min_{θ ∈ Θ} { [ log p_θ*(x)/p_θ(x) − d_n(θ*, θ) ] + Pen(θ) } ≥ min_{θ̃ ∈ Θ̃} { [ log p_θ*(x)/p_θ̃(x) − d_n(θ*, θ̃) ] + L(θ̃) }
Proof of the risk conclusion: the second expression has expectation ≥ 0, so the first expression does too.
This condition and result are obtained with J. Li and X. Luo (in the Rissanen Festschrift, 2008).

36 Variable Complexity, Variable Distortion Cover
Equivalent statement of the condition: Pen(θ) is a valid penalty if for each θ in Θ there is a representor θ̃ in Θ̃ with complexity L(θ̃) and distortion Δ_n(θ̃, θ) such that
Pen(θ) ≥ L(θ̃) + Δ_n(θ̃, θ)
where the distortion Δ_n(θ̃, θ) is the difference of the discrepancies at θ̃ and θ:
Δ_n(θ̃, θ) = log p_θ(x)/p_θ̃(x) + d_n(θ*, θ) − d_n(θ*, θ̃)

37 A Setting for Regression and Log-density Estimation: Linear Span of a Dictionary
G is a dictionary of candidate basis functions, e.g. wavelets, splines, polynomials, trigonometric terms, sigmoids, explanatory variables and their interactions
Candidate functions in the linear span: f_θ(x) = Σ_{g ∈ G} θ_g g(x)
Weighted l1 norm of the coefficients: ‖θ‖_1 = Σ_g a_g |θ_g|, with weights a_g = ‖g‖_n where ‖g‖_n² = (1/n) Σ_{i=1}^n g²(x_i)
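The weighted l1 norm is computed directly from the empirical norms of the dictionary elements; a small synthetic example (my own, with an arbitrary four-element dictionary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(-1, 1, n)
G = np.column_stack([x, x**2, np.sin(np.pi * x), np.cos(np.pi * x)])  # columns hold g(x_i)

a = np.sqrt(np.mean(G**2, axis=0))         # weights a_g = ||g||_n
theta = np.array([1.5, 0.0, -2.0, 0.5])    # coefficients of f_theta = sum_g theta_g g
print(np.sum(a * np.abs(theta)))           # weighted l1 norm ||theta||_1
```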

38 Example Models
Regression (focus of the current presentation): p_θ(y|x) = Normal(f_θ(x), σ²)
Logistic regression with y ∈ {0, 1}: p_θ(y = 1|x) = Logistic(f_θ(x))
Log-density estimation (focus of the Festschrift paper), e.g. Gaussian graphical models: p_θ(x) = p_0(x) exp{ f_θ(x) } / c_{f_θ}

39 l1 Penalty
pen(θ) = λ ‖θ‖_1, where θ are the coefficients of f_θ(x) = Σ_{g ∈ G} θ_g g(x)
Popular penalty: Chen & Donoho (96) Basis Pursuit; Tibshirani (96) LASSO; Efron et al. (04) LARS; precursors: Jones (92), Barron (90, 93, 94) greedy algorithm and analysis of a combined l1 and l0 penalty
We want to avoid cross-validation in the choice of λ
Data compression: specify a valid λ for the coding interpretation
Risk analysis: specify a valid λ for risk resolvability
Computation analysis: bounds the accuracy of the l1-penalized greedy pursuit algorithm

40 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

41 Regression with l1 Penalty
l1-penalized log-density estimation, i.i.d. case: θ̂ = argmin_θ { (1/n) log 1/p_{f_θ}(x) + λ_n ‖θ‖_1 }
Regression with the Gaussian model, fixed σ²:
min_θ (1/n) { (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (n/2) log 2πσ² } + λ_n ‖θ‖_1 / σ
Valid for λ_n ≥ sqrt( 2 log(2 M_G) / n ), with M_G = Card(G)
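A computational sketch of this fit (assuming scikit-learn is available; the talk's own algorithm is the greedy pursuit mentioned earlier, so this is only an equivalent-criterion illustration). With dictionary columns normalized so that ‖g‖_n = 1, the criterion above is equivalent to (1/(2n)) Σ (Y_i − f_θ(x_i))² + σ λ_n ‖θ‖_1, which matches the Lasso objective with alpha = σ λ_n.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, M, sigma = 200, 500, 1.0
G = rng.normal(size=(n, M))
G /= np.sqrt(np.mean(G**2, axis=0))              # normalize so each ||g||_n = 1
theta_true = np.zeros(M)
theta_true[:3] = [2.0, -1.5, 1.0]
Y = G @ theta_true + rng.normal(0, sigma, n)

lam_n = np.sqrt(2 * np.log(2 * M) / n)           # information-theoretically valid level
fit = Lasso(alpha=sigma * lam_n, fit_intercept=False, max_iter=50000).fit(G, Y)
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
```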

42 Adaptive Risk Bound
For log-density estimation with suitable λ_n: E d(f, f_θ̂) ≤ inf_θ { D(f ‖ f_θ) + λ_n ‖θ‖_1 }
For regression with fixed design points x_i, fixed σ, and λ_n = sqrt( 2 log(2M) / n ):
E ‖f − f_θ̂‖_n² ≤ 4σ² inf_θ { ‖f − f_θ‖_n² / (2σ²) + λ_n ‖θ‖_1 / σ }

43 Adaptive Risk Bound Specialized to Regression
Again for fixed design and λ_n = sqrt( 2 log(2M) / n ), multiplying through by 4σ²:
E ‖f − f_θ̂‖_n² ≤ inf_θ { 2 ‖f − f_θ‖_n² + 4σ λ_n ‖θ‖_1 }
In particular, for all targets f = f_θ* with finite ‖θ*‖_1, the risk bound 4σ λ_n ‖θ*‖_1 is of order sqrt( log M / n )
Details in Barron and Luo (Proceedings of the Workshop on Information Theory Methods in Science & Engineering 2008), Tampere, Finland: last week

44 Comments on Proof
Likelihood discrepancy plus complexity:
(1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ̃(x_i))² − (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + K log(2M)
Representor f_θ̃ of f_θ of the following form, with v near ‖θ‖_1:
f_θ̃(x) = (v/K) Σ_{k=1}^K g_k(x) / ‖g_k‖
with g_1, ..., g_K picked at random from G, independently, where g arises with probability proportional to |θ_g| a_g
Shows there exists a representor with likelihood discrepancy + complexity ≤ n v² / (2K) + K log(2M)
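A simulation of the random representor (my own sketch; the sign of each coefficient is carried along with the drawn term) illustrates the Maurey-type bound behind the n v²/(2K) term: the average squared distance ‖f_θ − f_θ̃‖_n² stays at most v²/K.

```python
import numpy as np

rng = np.random.default_rng(4)
n, M, K = 400, 50, 25
G = rng.normal(size=(n, M))
a = np.sqrt(np.mean(G**2, axis=0))               # a_g = ||g||_n
theta = np.zeros(M)
theta[:5] = [1.0, -0.7, 0.5, 0.3, -0.2]
v = np.sum(a * np.abs(theta))                    # weighted l1 norm
f_theta = G @ theta

probs = a * np.abs(theta) / v                    # draw g with probability prop. to |theta_g| a_g
errs = []
for _ in range(200):                             # Monte Carlo over random representors
    idx = rng.choice(M, size=K, p=probs)
    f_tilde = (v / K) * np.sum(np.sign(theta[idx]) * G[:, idx] / a[idx], axis=1)
    errs.append(np.mean((f_theta - f_tilde) ** 2))
print(round(np.mean(errs), 4), round(v**2 / K, 4))   # the average squared error sits below v^2/K
```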

45 Comments on Proof
Optimizing it over K yields a penalty proportional to the l1 norm
The penalty λ ‖θ‖_1 is valid for both the data compression and the statistical risk requirements for λ ≥ λ_n, where λ_n = sqrt( 2 log(2M) / n )
Especially useful for very large dictionaries
An improvement for small dictionaries gets rid of the log factor: log(2M) may be replaced by log( 2e max{ M/n, 1 } )

46 Comments on Proof
Existence of the representor, shown by a random draw, is a Shannon-like demonstration of the variable cover (code)
A similar approximation appears in the analysis of greedy computation of l1-penalized least squares

47 Aside: Random Design Case
May allow X_i random with distribution P_X, presuming the functions in the library G are uniformly bounded
Analogous risk bounds hold (in a submitted paper by Huang, Cheang, and Barron):
E ‖f − f_θ̂‖² ≤ inf_θ { 2 ‖f − f_θ‖² + c λ_n ‖θ‖_1 }
Allows libraries G of infinite cardinality, replacing log M_G with a metric entropy of G
If the linear span of G is dense in L_2, then for all L_2 functions f the approximation error ‖f − f_θ‖² can be arranged to go to zero as the size of v = ‖θ‖_1 increases

48 Aside: Simultaneous Minimax Rate Optimality
The resolvability bound tends to zero as λ_n = sqrt( 2 log M_G / n ) gets small:
E ‖f − f_θ̂‖² ≤ inf_θ { 2 ‖f − f_θ‖² + c λ_n ‖θ‖_1 }
Current work with Cong Huang shows that, for a broad class of approximation rates, the risk rate obtained from this resolvability bound is minimax optimal for the class of target functions { f : inf_{θ: ‖θ‖_1 = v} ‖f − f_θ‖² ≤ A_v }, simultaneously for all such classes

49 Fixed σ versus Unknown σ
MDL with the l1 penalty for each possible σ. Recall:
min_θ (1/n) { (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (n/2) log 2πσ² } + λ_n ‖θ‖_1 / σ
This provides a family of fits indexed by σ. For unknown σ we suggest optimization over σ as well as θ:
min_{θ,σ} (1/n) { (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (n/2) log 2πσ² } + λ_n ‖θ‖_1 / σ
A slight modification of this does indeed satisfy our condition for an information-theoretically valid penalty and risk bound (details in the WITMSE 2008 proceedings)

50 Best σ
The best σ for each θ solves the quadratic equation
σ² = σ λ_n ‖θ‖_1 + (1/n) Σ_{i=1}^n ( Y_i − f_θ(x_i) )²
By the quadratic formula the solution is
σ = (1/2) λ_n ‖θ‖_1 + [ ( (1/2) λ_n ‖θ‖_1 )² + (1/n) Σ_{i=1}^n ( Y_i − f_θ(x_i) )² ]^{1/2}
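The closed-form best σ is a one-liner; here is a quick check on synthetic residuals (toy numbers, my own example) that it solves the quadratic equation above:

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 200, 100
residuals = rng.normal(0, 1.3, n)        # stand-in for Y_i - f_theta(x_i)
l1_norm = 4.0                            # stand-in for the weighted ||theta||_1
lam_n = np.sqrt(2 * np.log(2 * M) / n)

mean_sq = np.mean(residuals**2)
sigma_hat = 0.5 * lam_n * l1_norm + np.sqrt((0.5 * lam_n * l1_norm) ** 2 + mean_sq)
print(np.isclose(sigma_hat**2, sigma_hat * lam_n * l1_norm + mean_sq))   # True
```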

51 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

52 Handle penalized likelihoods with continuous domains for θ
Information-theoretically valid penalties: the penalty exceeds the complexity plus distortion of an optimized representor of θ
Yields an MDL interpretation and statistical risk controlled by the resolvability
l0 penalty (dim/2) log n: classically analyzed
l1 penalty σ λ_n ‖θ‖_1 analyzed here: valid in regression for λ_n ≥ sqrt( 2 log(2M) / n )
Can handle fixed or unknown σ
