A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo

1 A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron, YALE UNIVERSITY. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo. Frejus, France, September 1-5, 2008

2 A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo.
Mon: Greedy Algorithms for l1 Penalization and Subset Selection. Tue: Upper Bounds and Adaptation. Wed: Lower Bounds and Minimax Rates. Thu: Role of Approximation Theory.

3 Information Theory & Statistical Risk Analysis. Andrew R. Barron.

4 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

5 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

6 Shannon, Kraft, McMillan
Characterization of uniquely decodable codelengths L(x), x ∈ X: Σ_x 2^{−L(x)} ≤ 1 (Kraft inequality)
L(x) = log 1/q(x) corresponds to q(x) = 2^{−L(x)}
Operational meaning of probability: a distribution is given by a choice of code
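To make the codelength-probability correspondence concrete, here is a minimal numeric sketch (not part of the original slides; the distribution q is an arbitrary example):

```python
import numpy as np

# A probability mass function q on a four-symbol alphabet and its idealized
# codelengths L(x) = log2 1/q(x); the Kraft sum of 2^{-L(x)} is at most 1.
q = np.array([0.5, 0.25, 0.125, 0.125])
L = np.log2(1.0 / q)                     # codelengths in bits: [1, 2, 3, 3]
print(L)
print(np.sum(2.0 ** (-L)) <= 1 + 1e-12)  # Kraft inequality (equality here)
```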

7 Codelength Comparison
Targets p are possible distributions
Compare codelength log 1/q(x) to the target log 1/p(x)
Redundancy or regret: log 1/q(x) − log 1/p(x)
Expected redundancy: D(P_X ‖ Q_X) = E_P[ log p(X)/q(X) ]

8 Universal Codes
MODELS: a family of coding strategies or a family of probability distributions, indexed by parameters or functions:
{ L_θ(x) : θ ∈ Θ } and { p_θ(x) : θ ∈ Θ }
{ L_f(x) : f ∈ F } and { p_f(x) : f ∈ F }
Universal codes, universal probabilities q(x), with L(x) = log 1/q(x)
Redundancy: log 1/q(x) − log 1/p_θ(x)
Want it small either uniformly in x, θ or in expectation

9 Statistical Aim
Training data x → estimator p̂ = p_θ̂
Subsequent data x′
Want log 1/p̂(x′) to compare favorably to log 1/p(x′), for targets p close to or in the families

10 Loss
Kullback information-divergence: D(P_X′ ‖ Q_X′) = E[ log p(X′)/q(X′) ]
Bhattacharyya, Hellinger, Chernoff, Rényi divergence: d(P_X′, Q_X′) = 2 log 1/E[ (q(X′)/p(X′))^{1/2} ]
Product model case p(x′) = Π_{i=1}^n p(x′_i): D(P_X′ ‖ Q_X′) = n D(P ‖ Q), and likewise d(P_X′, Q_X′) = n d(P, Q)

11 Loss
Relationship: d(P, Q) ≤ D(P ‖ Q), and, if the log density ratio is not more than B, then with C_B = 2 + B, D(P ‖ Q) ≤ C_B d(P, Q)
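A quick numeric check of these relations, on an arbitrary pair of discrete distributions (my own illustration, not from the slides):

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

D = np.sum(p * np.log(p / q))              # Kullback divergence (nats)
d = -2.0 * np.log(np.sum(np.sqrt(p * q)))  # Bhattacharyya divergence 2 log 1/E_P[(q/p)^{1/2}]
B = np.max(np.log(p / q))                  # bound on the log density ratio
print(d <= D <= (2 + B) * d)               # True for this pair
```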

12 Minimum Description Length (MDL) Principle
Universal coding brought into statistical play
Minimum Description Length Principle: the shortest code for the data gives the best statistical model

13 MDL: Two-stage Version
Two-stage codelength: L(x) = min_{θ ∈ Θ} [ log 1/p_θ(x) + L(θ) ], i.e. bits for x given θ plus bits for θ
The corresponding statistical estimator is p̂ = p_θ̂
Typically in d-dimensional families L(θ) is of order (d/2) log n
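A toy illustration of the two-stage criterion (my own sketch, not code from the talk): choose a polynomial degree by minimizing log 1/p_θ̂(x) + L(θ) with the classical complexity L(θ) of order (d/2) log n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.3
x = np.linspace(-1, 1, n)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, sigma, n)   # true curve is quadratic

best = None
for degree in range(6):
    X = np.vander(x, degree + 1, increasing=True)           # d = degree + 1 parameters
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ theta_hat) ** 2)
    neg_loglik = rss / (2 * sigma**2) + (n / 2) * np.log(2 * np.pi * sigma**2)
    complexity = ((degree + 1) / 2) * np.log(n)              # (d/2) log n nats for theta
    score = neg_loglik + complexity
    if best is None or score < best[1]:
        best = (degree, score)
print("selected degree:", best[0])                           # typically picks 2
```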

14 MDL: Mixture Versions
Codelength based on a Bayes mixture: L(x) = log 1/q(x), where q(x) = ∫ p(x|θ) w(θ) dθ
Average-case optimal and pointwise optimal for a.e. θ
Codelength approximation (Barron 1985, Clarke and Barron 1990, 1994):
log 1/q(x) ≈ log 1/p(x|θ̂) + (d/2) log( n/2π ) + log( |Î(θ̂)|^{1/2} / w(θ̂) )
where Î(θ̂) is the empirical Fisher information at the MLE
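The accuracy of this approximation is easy to check numerically in a one-parameter example (my own sketch): n Bernoulli(θ) observations with a uniform prior, where the mixture q(x) has a closed form and I(θ) = 1/(θ(1−θ)).

```python
import math

n, k = 100, 30                        # n coin flips with k ones
theta_hat = k / n

# exact mixture codelength in nats: -log Beta(k+1, n-k+1)
exact = -(math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2))

# approximation: log 1/p(x|theta_hat) + (d/2) log(n/2pi) + log(|I_hat|^{1/2}/w), d = 1, w = 1
neg_loglik = -(k * math.log(theta_hat) + (n - k) * math.log(1 - theta_hat))
fisher = 1.0 / (theta_hat * (1 - theta_hat))
approx = neg_loglik + 0.5 * math.log(n / (2 * math.pi)) + 0.5 * math.log(fisher)

print(round(exact, 3), round(approx, 3))   # the two codelengths agree to within ~0.01 nats
```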

15 MDL: Mixture Versions
Codelength based on a Bayes mixture: L(x) = log 1/q(x), where q(x) = ∫ p(x|θ) w(θ) dθ
Average-case optimal and pointwise optimal for a.e. θ
Corresponding statistical estimator:
p̂(x′) = q(x′|x) = ∫ p(x′|θ) p(x|θ) w(θ) dθ / ∫ p(x|θ) w(θ) dθ
which is a predictive distribution

16 Predictive Versions
Chain rule for joint densities: q(x) = q(x_1) q(x_2|x_1) ⋯ q(x_n|x_1, x_2, ..., x_{n−1})
Chain rule for codelengths based on predictive distributions:
L(x) = log 1/q(x_1) + log 1/q(x_2|x_1) + ... + log 1/q(x_n|x_1, ..., x_{n−1})
Corresponding statistical estimator at x′ = x_{n+1}: p̂(x′) = q(x_{n+1}|x_1, ..., x_n)

17 Chain Rule of Divergence
Expected redundancy of compression of x by predictive distributions corresponds to cumulative risk for sample sizes n′ < n:
D(P_X ‖ Q_X) = E[ log p(X_1)/q(X_1) + log p(X_2)/q(X_2|X_1) + ... + log p(X_n)/q(X_n|X_1, ..., X_{n−1}) ]
= Σ_{n′=0}^{n−1} E[ D( P_X′ ‖ P̂_{X′|X_1,...,X_{n′}} ) ]
Individual risk: E[ D( P_X′ ‖ P̂_{X′|X_1,...,X_{n′}} ) ]
Cumulative risk of (d/2) log n corresponds to individual risk of d/(2n)
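The chain rule can be seen numerically (my own sketch, Bernoulli data with the uniform-prior Bayes mixture, whose predictive distribution is the Laplace add-one rule): the total predictive codelength minus the target codelength behaves like (d/2) log n with d = 1.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
theta, n = 0.3, 2000
x = rng.binomial(1, theta, size=n)

cum, ones = 0.0, 0
for i, xi in enumerate(x):
    p_one = (ones + 1) / (i + 2)               # Laplace add-one predictive probability
    cum += -math.log(p_one if xi == 1 else 1 - p_one)
    ones += xi

target = -(ones * math.log(theta) + (n - ones) * math.log(1 - theta))
print(round(cum - target, 2), round(0.5 * math.log(n), 2))  # both are of order (1/2) log n
```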

18 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

19 Information Capacity
A channel θ → Y is a family of probability distributions { P_{Y|θ} : θ ∈ Θ }
Information capacity: C = max_{P_θ} I(θ; Y)

20 Communications Capacity
C_com = maximum rate of reliable communication, measured in message bits per use of a channel
Shannon Channel Capacity Theorem (Shannon 1948): C_com = C

21 Data Compression Capacity
Minimax redundancy: Red = min_{Q_Y} max_{θ ∈ Θ} D(P_{Y|θ} ‖ Q_Y)
Data Compression Capacity Theorem: Red = C (Gallager; Davisson & Leon-Garcia; Ryabko)

22 Statistical Risk Setting
Loss function ℓ(θ, θ′)
Examples: Kullback loss ℓ(θ, θ′) = D(P_{Y|θ} ‖ P_{Y|θ′}); squared metric loss, e.g. squared Hellinger loss ℓ(θ, θ′) = d²(θ, θ′)

23 Statistical Capacity
Estimators θ̂_n based on a sample Y of size n
Minimax risk (Wald): r_n = min_{θ̂_n} max_θ E ℓ(θ, θ̂_n)

24 Ingredients for Determination of Statistical Capacity
Kolmogorov metric entropy of S ⊂ Θ: H(ε) = max{ log Card(Θ_ε) : d(θ, θ′) > ε for θ, θ′ ∈ Θ_ε ⊂ S }
Loss assumption, for θ, θ′ ∈ S: ℓ(θ, θ′) ≍ D(P_{Y|θ} ‖ P_{Y|θ′}) ≍ d²(θ, θ′)

25 Statistical Capacity Theorem
For infinite-dimensional Θ, with the metric entropy evaluated at a critical separation ε_n
Statistical Capacity Theorem: Minimax Risk ≍ Info Capacity Rate ≍ Metric Entropy Rate, i.e.
r_n ≍ C_n/n ≍ H(ε_n)/n ≍ ε_n²
(Yang 1997; Yang and Barron 1999; Haussler and Opper 1997)

26 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

27 Two-stage Code Redundancy
Expected codelength minus the target at p_θ*:
Redundancy = E[ min_{θ ∈ Θ} { log 1/p_θ(x) + L(θ) } − log 1/p_θ*(x) ]
Redundancy approximation in smooth families: (d/2) log( n/2π ) + log( |I(θ*)|^{1/2} / w(θ*) )

28 Redundancy and Resolvability
Redundancy = E[ min_{θ ∈ Θ} { log p_θ*(x)/p_θ(x) + L(θ) } ]
Resolvability = min_{θ ∈ Θ} E[ log p_θ*(x)/p_θ(x) + L(θ) ] = min_{θ ∈ Θ} [ D(P_{X|θ*} ‖ P_{X|θ}) + L(θ) ]
Ideal tradeoff of Kullback approximation error and complexity
Population analogue of the two-stage code MDL criterion
Divide by n to express as a rate. In the i.i.d. case: R_n(θ*) = min_{θ ∈ Θ} [ D(θ* ‖ θ) + L(θ)/n ]
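The index of resolvability is simple to compute for a toy countable family (my own construction, not from the slides): dyadic grids of Bernoulli parameters, with roughly 2m bits spent on a grid point at resolution level m.

```python
import math

def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

theta_star = 0.3137                              # target parameter
for n in [10, 100, 1000, 10000]:
    best = float("inf")
    for m in range(1, 16):                       # resolution level of the grid
        L = 2 * m * math.log(2)                  # ~2m bits for level and index, in nats
        for j in range(1, 2 ** m):
            best = min(best, kl_bernoulli(theta_star, j / 2 ** m) + L / n)
    print(n, round(best, 5))                     # resolvability shrinks roughly like log(n)/n
```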

29 Risk of Estimator Based on Two-stage Code
Estimator θ̂ is the choice achieving the minimization min_{θ ∈ Θ} { log 1/p_θ(x) + L(θ) }
Codelengths for θ are L(θ) = 2 L̃(θ), with Σ_{θ ∈ Θ} 2^{−L̃(θ)} ≤ 1
Total loss d_n(θ*, θ̂), with d_n(θ*, θ) = d(P_{X|θ*}, P_{X|θ})
Risk = E[ d_n(θ*, θ̂) ]
Info-theory bound on risk (Barron 1985; Barron and Cover 1991; Jonathan Li 1999): Risk ≤ Redundancy ≤ Resolvability

30 Risk of Estimator Based on Two-stage Code
Estimator θ̂ achieves min_{θ ∈ Θ} { log 1/p_θ(x) + L(θ) }
Codelengths require Σ_{θ ∈ Θ} 2^{−L(θ)} ≤ 1
Risk ≤ Resolvability. Specialized to the i.i.d. case: E d(θ*, θ̂) ≤ min_{θ ∈ Θ} [ D(θ* ‖ θ) + L(θ)/n ]
As n → ∞, tolerate a more complex P_{X|θ} if needed to get near P_{X|θ*}
The rate is 1/n, or close to that rate if the target is simple
Drawback: the code interpretation entails a countable Θ

31 Key to Risk Analysis
Log likelihood-ratio discrepancy at training x and future x′: [ log p_θ*(x)/p_θ(x) ] − d_n(θ*, θ)
The proof shows, for L(θ) satisfying Kraft, that
min_{θ ∈ Θ} { [ log p_θ*(x)/p_θ(x) − d_n(θ*, θ) ] + L(θ) }
has expectation ≥ 0. From this the risk bound follows.

32 Information-theoretically Valid Penalties
Penalized likelihood: min_{θ ∈ Θ} { log 1/p_θ(x) + Pen(θ) }, with possibly uncountable Θ
Yields a data compression interpretation if there exist a countable Θ̃ and L satisfying Kraft such that the above is not less than
min_{θ̃ ∈ Θ̃} { log 1/p_θ̃(x) + L(θ̃) }

33 Data-Compression Valid Penalties
Equivalently, Pen(θ) is valid for penalized likelihood with uncountable Θ to have a data compression interpretation if there are a countable Θ̃ and a Kraft-summable L(θ̃) such that, for every θ in Θ, there is a representor θ̃ in Θ̃ with
Pen(θ) ≥ L(θ̃) + log p_θ(x)/p_θ̃(x)
This is the link between the uncountable and countable cases.

34 Statistical-Risk Valid Penalties
Penalized likelihood: θ̂ = argmin_{θ ∈ Θ} { log 1/p_θ(x) + Pen(θ) }
Again: possibly uncountable Θ
Task: determine a condition on Pen(θ) such that the risk is captured by the population analogue
E d_n(θ*, θ̂) ≤ inf_{θ ∈ Θ} { E log p_θ*(X)/p_θ(X) + Pen(θ) }

35 Statistical-Risk Valid Penalty
For an uncountable Θ and a penalty Pen(θ), θ ∈ Θ, suppose there is a countable Θ̃ and L(θ̃) = 2 L̃(θ̃), where L̃(θ̃) satisfies Kraft, such that, for all x, θ*:
min_{θ ∈ Θ} { [ log p_θ*(x)/p_θ(x) − d_n(θ*, θ) ] + Pen(θ) } ≥ min_{θ̃ ∈ Θ̃} { [ log p_θ*(x)/p_θ̃(x) − d_n(θ*, θ̃) ] + L(θ̃) }
Proof of the risk conclusion: the second expression has expectation ≥ 0, so the first expression does too.
This condition and result are obtained with J. Li and X. Luo (in the Rissanen Festschrift, 2008).

36 Variable Complexity, Variable Distortion Cover
Equivalent statement of the condition: Pen(θ) is a valid penalty if for each θ in Θ there is a representor θ̃ in Θ̃ with complexity L(θ̃) and distortion Δ_n(θ̃, θ) such that
Pen(θ) ≥ L(θ̃) + Δ_n(θ̃, θ)
where the distortion Δ_n(θ̃, θ) is the difference of the discrepancies at θ̃ and θ:
Δ_n(θ̃, θ) = log p_θ(x)/p_θ̃(x) + d_n(θ*, θ) − d_n(θ*, θ̃)

37 A Setting for Regression and Log-density Estimation: Linear Span of a Dictionary
G is a dictionary of candidate basis functions, e.g. wavelets, splines, polynomials, trigonometric terms, sigmoids, explanatory variables and their interactions
Candidate functions in the linear span: f_θ(x) = Σ_{g ∈ G} θ_g g(x)
Weighted l1 norm of the coefficients: ‖θ‖_1 = Σ_g a_g |θ_g|, with weights a_g = ‖g‖_n where ‖g‖_n² = (1/n) Σ_{i=1}^n g²(x_i)
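The weighted l1 norm is computed directly from the empirical norms of the dictionary elements; a small synthetic example (my own, with an arbitrary four-element dictionary):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x = rng.uniform(-1, 1, n)
G = np.column_stack([x, x**2, np.sin(np.pi * x), np.cos(np.pi * x)])  # columns hold g(x_i)

a = np.sqrt(np.mean(G**2, axis=0))         # weights a_g = ||g||_n
theta = np.array([1.5, 0.0, -2.0, 0.5])    # coefficients of f_theta = sum_g theta_g g
print(np.sum(a * np.abs(theta)))           # weighted l1 norm ||theta||_1
```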

38 Example Models
Regression (focus of the current presentation): p_θ(y|x) = Normal(f_θ(x), σ²)
Logistic regression with y ∈ {0, 1}: p_θ(y = 1|x) = Logistic(f_θ(x))
Log-density estimation (focus of the Festschrift paper), e.g. Gaussian graphical models: p_θ(x) = p_0(x) exp{ f_θ(x) } / c_{f_θ}

39 l1 Penalty
pen(θ) = λ ‖θ‖_1, where θ are the coefficients of f_θ(x) = Σ_{g ∈ G} θ_g g(x)
Popular penalty: Chen & Donoho (96) Basis Pursuit; Tibshirani (96) LASSO; Efron et al. (04) LARS; precursors: Jones (92), Barron (90, 93, 94) greedy algorithm and analysis of a combined l1 and l0 penalty
We want to avoid cross-validation in the choice of λ
Data compression: specify a valid λ for the coding interpretation
Risk analysis: specify a valid λ for risk resolvability
Computation analysis: bounds the accuracy of the l1-penalized greedy pursuit algorithm

40 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

41 Regression with l1 Penalty
l1-penalized log-density estimation, i.i.d. case: θ̂ = argmin_θ { (1/n) log 1/p_{f_θ}(x) + λ_n ‖θ‖_1 }
Regression with the Gaussian model, fixed σ²:
min_θ (1/n) { (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (n/2) log 2πσ² } + λ_n ‖θ‖_1 / σ
Valid for λ_n ≥ sqrt( 2 log(2 M_G) / n ), with M_G = Card(G)
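A computational sketch of this fit (assuming scikit-learn is available; the talk's own algorithm is the greedy pursuit mentioned earlier, so this is only an equivalent-criterion illustration). With dictionary columns normalized so that ‖g‖_n = 1, the criterion above is equivalent to (1/(2n)) Σ (Y_i − f_θ(x_i))² + σ λ_n ‖θ‖_1, which matches the Lasso objective with alpha = σ λ_n.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, M, sigma = 200, 500, 1.0
G = rng.normal(size=(n, M))
G /= np.sqrt(np.mean(G**2, axis=0))              # normalize so each ||g||_n = 1
theta_true = np.zeros(M)
theta_true[:3] = [2.0, -1.5, 1.0]
Y = G @ theta_true + rng.normal(0, sigma, n)

lam_n = np.sqrt(2 * np.log(2 * M) / n)           # information-theoretically valid level
fit = Lasso(alpha=sigma * lam_n, fit_intercept=False, max_iter=50000).fit(G, Y)
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
```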

42 Adaptive Risk Bound
For log-density estimation with suitable λ_n: E d(f, f_θ̂) ≤ inf_θ { D(f ‖ f_θ) + λ_n ‖θ‖_1 }
For regression with fixed design points x_i, fixed σ, and λ_n = sqrt( 2 log(2M) / n ):
E ‖f − f_θ̂‖_n² ≤ 4σ² inf_θ { ‖f − f_θ‖_n² / (2σ²) + λ_n ‖θ‖_1 / σ }

43 Adaptive Risk Bound Specialized to Regression
Again for fixed design and λ_n = sqrt( 2 log(2M) / n ), multiplying through by 4σ²:
E ‖f − f_θ̂‖_n² ≤ inf_θ { 2 ‖f − f_θ‖_n² + 4σ λ_n ‖θ‖_1 }
In particular, for all targets f = f_θ* with finite ‖θ*‖_1, the risk bound 4σ λ_n ‖θ*‖_1 is of order sqrt( log M / n )
Details in Barron and Luo (Proceedings of the Workshop on Information Theory Methods in Science & Engineering 2008), Tampere, Finland: last week

44 Comments on Proof
Likelihood discrepancy plus complexity:
(1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ̃(x_i))² − (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + K log(2M)
Representor f_θ̃ of f_θ of the following form, with v near ‖θ‖_1:
f_θ̃(x) = (v/K) Σ_{k=1}^K g_k(x) / ‖g_k‖
with g_1, ..., g_K picked at random from G, independently, where g arises with probability proportional to |θ_g| a_g
Shows there exists a representor with likelihood discrepancy + complexity ≤ n v² / (2K) + K log(2M)
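A simulation of the random representor (my own sketch; the sign of each coefficient is carried along with the drawn term) illustrates the Maurey-type bound behind the n v²/(2K) term: the average squared distance ‖f_θ − f_θ̃‖_n² stays at most v²/K.

```python
import numpy as np

rng = np.random.default_rng(4)
n, M, K = 400, 50, 25
G = rng.normal(size=(n, M))
a = np.sqrt(np.mean(G**2, axis=0))               # a_g = ||g||_n
theta = np.zeros(M)
theta[:5] = [1.0, -0.7, 0.5, 0.3, -0.2]
v = np.sum(a * np.abs(theta))                    # weighted l1 norm
f_theta = G @ theta

probs = a * np.abs(theta) / v                    # draw g with probability prop. to |theta_g| a_g
errs = []
for _ in range(200):                             # Monte Carlo over random representors
    idx = rng.choice(M, size=K, p=probs)
    f_tilde = (v / K) * np.sum(np.sign(theta[idx]) * G[:, idx] / a[idx], axis=1)
    errs.append(np.mean((f_theta - f_tilde) ** 2))
print(round(np.mean(errs), 4), round(v**2 / K, 4))   # the average squared error sits below v^2/K
```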

45 Comments on Proof
Optimizing it over K yields a penalty proportional to the l1 norm
The penalty λ ‖θ‖_1 is valid for both the data compression and the statistical risk requirements for λ ≥ λ_n, where λ_n = sqrt( 2 log(2M) / n )
Especially useful for very large dictionaries
An improvement for small dictionaries gets rid of the log factor: log(2M) may be replaced by log( 2e max{ M/n, 1 } )

46 Comments on Proof
Existence of the representor, shown by a random draw, is a Shannon-like demonstration of the variable cover (code)
A similar approximation appears in the analysis of greedy computation of l1-penalized least squares

47 Aside: Random Design Case
May allow X_i random with distribution P_X, presuming the functions in the library G are uniformly bounded
Analogous risk bounds hold (in a submitted paper by Huang, Cheang, and Barron):
E ‖f − f_θ̂‖² ≤ inf_θ { 2 ‖f − f_θ‖² + c λ_n ‖θ‖_1 }
Allows libraries G of infinite cardinality, replacing log M_G with a metric entropy of G
If the linear span of G is dense in L_2, then for all L_2 functions f the approximation error ‖f − f_θ‖² can be arranged to go to zero as the size of v = ‖θ‖_1 increases

48 Aside: Simultaneous Minimax Rate Optimality
The resolvability bound tends to zero as λ_n = sqrt( 2 log M_G / n ) gets small:
E ‖f − f_θ̂‖² ≤ inf_θ { 2 ‖f − f_θ‖² + c λ_n ‖θ‖_1 }
Current work with Cong Huang shows that, for a broad class of approximation rates, the risk rate obtained from this resolvability bound is minimax optimal for the class of target functions { f : inf_{θ: ‖θ‖_1 = v} ‖f − f_θ‖² ≤ A_v }, simultaneously for all such classes

49 Fixed σ versus Unknown σ
MDL with the l1 penalty for each possible σ. Recall:
min_θ (1/n) { (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (n/2) log 2πσ² } + λ_n ‖θ‖_1 / σ
This provides a family of fits indexed by σ. For unknown σ we suggest optimization over σ as well as θ:
min_{θ,σ} (1/n) { (1/(2σ²)) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (n/2) log 2πσ² } + λ_n ‖θ‖_1 / σ
A slight modification of this does indeed satisfy our condition for an information-theoretically valid penalty and risk bound (details in the WITMSE 2008 proceedings)

50 Best σ
The best σ for each θ solves the quadratic equation
σ² = σ λ_n ‖θ‖_1 + (1/n) Σ_{i=1}^n ( Y_i − f_θ(x_i) )²
By the quadratic formula the solution is
σ = (1/2) λ_n ‖θ‖_1 + [ ( (1/2) λ_n ‖θ‖_1 )² + (1/n) Σ_{i=1}^n ( Y_i − f_θ(x_i) )² ]^{1/2}
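The closed-form best σ is a one-liner; here is a quick check on synthetic residuals (toy numbers, my own example) that it solves the quadratic equation above:

```python
import numpy as np

rng = np.random.default_rng(5)
n, M = 200, 100
residuals = rng.normal(0, 1.3, n)        # stand-in for Y_i - f_theta(x_i)
l1_norm = 4.0                            # stand-in for the weighted ||theta||_1
lam_n = np.sqrt(2 * np.log(2 * M) / n)

mean_sq = np.mean(residuals**2)
sigma_hat = 0.5 * lam_n * l1_norm + np.sqrt((0.5 * lam_n * l1_norm) ** 2 + mean_sq)
print(np.isclose(sigma_hat**2, sigma_hat * lam_n * l1_norm + mean_sq))   # True
```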

51 Outline
1 Some Principles of Information Theory and Statistics: Data Compression; Minimum Description Length Principle
2 This part is saved for Wednesday
3 Two-stage Code Redundancy, Resolvability, and Risk for Continuous Parameters; l1 Penalties are Information-Theoretically Valid
4 Fixed σ² Case; Unknown σ² Case
5

52 Handle penalized likelihoods with continuous domains for θ
Information-theoretically valid penalties: the penalty exceeds the complexity plus distortion of an optimized representor of θ
Yields an MDL interpretation and statistical risk controlled by the resolvability
l0 penalty (dim/2) log n: classically analyzed
l1 penalty σ λ_n ‖θ‖_1 analyzed here: valid in regression for λ_n ≥ sqrt( 2 log(2M) / n )
Can handle fixed or unknown σ
