Theoretical Statistics. Lecture 14.

Theoretical Statistics. Lecture 14.
Peter Bartlett

Metric entropy.

1. Chaining: Dudley's entropy integral.

Recall: Sub-Gaussian processes

Definition: A stochastic process $\theta \mapsto X_\theta$ with indexing set $T$ is sub-Gaussian with respect to a metric $d$ on $T$ if, for all $\theta, \theta' \in T$ and all $\lambda \in \mathbb{R}$,
$$\mathbb{E} \exp\bigl(\lambda (X_\theta - X_{\theta'})\bigr) \le \exp\left(\frac{\lambda^2 d(\theta,\theta')^2}{2}\right).$$

Lemma [Finite Classes]: For $X_\theta$ sub-Gaussian wrt $d$ on $T$, and $A$ a set of pairs from $T$,
$$\mathbb{E} \max_{(\theta,\theta') \in A} (X_\theta - X_{\theta'}) \le \max_{(\theta,\theta') \in A} d(\theta,\theta') \sqrt{2 \log |A|}.$$
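
As a quick sanity check (an illustration, not part of the lecture), here is a Monte Carlo sketch of the Finite Classes Lemma for the canonical Gaussian process $X_\theta = \langle \theta, g \rangle$ with $g \sim N(0, I)$, which is sub-Gaussian wrt Euclidean distance. The index set, dimension, and trial count below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite index set T in R^p; A = all ordered pairs. X_theta = <theta, g>,
# so X_theta - X_theta' ~ N(0, ||theta - theta'||^2).
p, m, trials = 5, 30, 2000
thetas = rng.standard_normal((m, p))

maxes = np.empty(trials)
for t in range(trials):
    X = thetas @ rng.standard_normal(p)          # one draw of the process
    maxes[t] = (X[:, None] - X[None, :]).max()   # max over pairs (theta, theta')

dists = np.linalg.norm(thetas[:, None, :] - thetas[None, :, :], axis=-1)
bound = dists.max() * np.sqrt(2 * np.log(m * (m - 1)))  # |A| = m(m-1)

print(f"Monte Carlo E max = {maxes.mean():.3f} <= lemma bound = {bound:.3f}")
```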

Recall: Covering number bound

Theorem: Consider a zero-mean process $X_\theta$ that is sub-Gaussian wrt the metric $d$ on $T$. Suppose that the diameter of $T$ is $D = \sup_{\theta,\theta'} d(\theta,\theta')$. Then for any $\epsilon > 0$,
$$\mathbb{E} \sup_\theta X_\theta \le 2 \, \mathbb{E} \sup_{d(\theta,\theta') \le \epsilon} (X_\theta - X_{\theta'}) + 2D \sqrt{\log N(\epsilon, T, d)}.$$

Dudley's entropy integral

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then
$$\mathbb{E} \sup_\theta X_\theta \le 8\sqrt{2} \int_0^\infty \sqrt{\log N(\epsilon, T, d)} \, d\epsilon.$$

Note that we can always rewrite the integral as an integral from 0 to the diameter of $T$, since a single point covers $T$ at any larger scale, so $\log N(\epsilon, T, d) = 0$ there.
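
To make the bound concrete, here is a small numerical sketch (my illustration, with a simple covering-number bound, not from the lecture) evaluating Dudley's integral for $T = [0,1]$ with the usual metric, using $N(\epsilon, T, d) \le 1 + 1/(2\epsilon)$ for $\epsilon < 1/2$ and $N = 1$ beyond the radius.

```python
import numpy as np
from scipy.integrate import quad

def sqrt_log_N(eps):
    # one point suffices once eps >= 1/2 (the radius of [0, 1])
    N = 1.0 if eps >= 0.5 else 1.0 + 1.0 / (2.0 * eps)
    return np.sqrt(np.log(N))

# The integrand vanishes beyond the diameter, so stopping at 1 loses nothing.
integral, _ = quad(sqrt_log_N, 0.0, 1.0, points=[0.5])
print(f"Dudley bound: 8*sqrt(2) * {integral:.3f} = {8 * np.sqrt(2) * integral:.3f}")
```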

Dudley's entropy integral: Proof

As before, since the process is zero-mean, for any fixed $\theta'$,
$$\mathbb{E} \sup_\theta X_\theta = \mathbb{E} \sup_\theta (X_\theta - X_{\theta'}) \le \mathbb{E} \sup_{\theta,\theta'} (X_\theta - X_{\theta'}),$$
and choosing $\hat\theta \in \hat T$ (a minimal $\epsilon$-cover) with $d(\hat\theta, \theta) \le \epsilon$ (and similarly $\hat\theta'$ for $\theta'$), we have
$$X_\theta - X_{\theta'} = X_\theta - X_{\hat\theta} + X_{\hat\theta} - X_{\hat\theta'} + X_{\hat\theta'} - X_{\theta'} \le 2 \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + \sup_{\hat\theta, \hat\theta' \in \hat T} (X_{\hat\theta} - X_{\hat\theta'}).$$

Dudley's entropy integral: Proof

Consider bounding $\mathbb{E} \sup_{\hat\theta, \hat\theta'} (X_{\hat\theta} - X_{\hat\theta'})$. Previously, we bounded the supremum over the $\epsilon$-cover $\hat T$ (for which the diameter is that of $T$). Instead, we consider a sequence of progressively better approximations to elements of $\hat T$ (which leads to sets with progressively smaller diameters).

Suppose the diameter of $\hat T$ is $D$. We first define $\hat T_k = \hat T$, and think of it as a $(2^{-k} D)$-cover of $\hat T$, where $k = \lceil \log_2(D/\epsilon) \rceil$ ensures that $2^{-k} D \le \epsilon$. Then we define $\hat T_{i-1}$ as a minimal $(2^{-(i-1)} D)$-cover of $\hat T_i$, for $i$ going from $k$ down to 1. Notice that $\hat T_0$ is a minimal $D$-cover of $\hat T_1$, so $|\hat T_0| = 1$. [PICTURE]
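
The multiscale construction can be made concrete. The sketch below (illustrative only; it uses greedy covers rather than minimal ones, and all parameter values are arbitrary) builds $\hat T_k, \ldots, \hat T_0$ for a random finite subset of $[0,1]$ and prints the shrinking cover sizes.

```python
import numpy as np

def greedy_cover(points, eps):
    """Greedy eps-net: keep a point, discard everything within eps of it.
    Not minimal, but a valid eps-cover."""
    centers, remaining = [], list(points)
    while remaining:
        c = remaining.pop(0)
        centers.append(c)
        remaining = [x for x in remaining if abs(x - c) > eps]
    return centers

rng = np.random.default_rng(0)
D, eps = 1.0, 0.01
k = int(np.ceil(np.log2(D / eps)))             # ensures 2^-k * D <= eps
T_hat = greedy_cover(sorted(rng.uniform(0, 1, 500)), eps)

covers = {k: T_hat}                            # T_k = T_hat
for i in range(k, 0, -1):                      # T_{i-1}: cover of T_i at scale 2^-(i-1) D
    covers[i - 1] = greedy_cover(covers[i], 2.0 ** (-(i - 1)) * D)

for i in range(k + 1):
    print(f"|T_{i}| = {len(covers[i])}")       # |T_0| = 1; sizes grow with i
```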

Dudley's entropy integral: Proof

Pick $\hat\theta_k = \hat\theta$, and then pick $\hat\theta_{i-1} \in \hat T_{i-1}$ as the best approximation of $\hat\theta_i$. We can write $\hat\theta_{i-1} = f_{i-1}(\hat\theta_i)$, where $f_{i-1} : \hat T_i \to \hat T_{i-1}$ is the best approximation operator. Then we can write
$$X_{\hat\theta} = X_{\hat\theta_k} = X_{\hat\theta_0} + \sum_{i=1}^k (X_{\hat\theta_i} - X_{\hat\theta_{i-1}}),$$
and, using the same notation for $\hat\theta'$ (note that $\hat\theta_0 = \hat\theta'_0$, since $|\hat T_0| = 1$), we have
$$X_{\hat\theta} - X_{\hat\theta'} = X_{\hat\theta_k} - X_{\hat\theta'_k} = \sum_{i=1}^k (X_{\hat\theta_i} - X_{\hat\theta_{i-1}}) - \sum_{i=1}^k (X_{\hat\theta'_i} - X_{\hat\theta'_{i-1}}).$$

Dudley's entropy integral: Proof

Thus,
$$\mathbb{E} \sup_{\hat\theta, \hat\theta' \in \hat T} (X_{\hat\theta} - X_{\hat\theta'}) \le 2 \sum_{i=1}^k \mathbb{E} \sup_{\hat\theta_i \in \hat T_i} \bigl(X_{\hat\theta_i} - X_{f_{i-1}(\hat\theta_i)}\bigr).$$
Since $d(\hat\theta_i, \hat\theta_{i-1}) \le 2^{-(i-1)} D$, the Finite Classes Lemma shows that
$$\mathbb{E} \sup_{\hat\theta_i \in \hat T_i} \bigl(X_{\hat\theta_i} - X_{f_{i-1}(\hat\theta_i)}\bigr) \le 2^{-(i-1)} D \sqrt{2 \log |\hat T_i|} \le 2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D, T)}.$$

Dudley's entropy integral: Proof

Finally, since $\sqrt{\log N(2^{-i} D)} \le \sqrt{\log N(u)}$ for $u \le 2^{-i} D$, we can bound the area of the rectangle from $(2^{-(i+1)} D, 0)$ to $(2^{-i} D, \sqrt{2 \log N(2^{-i} D)})$ by the integral under $\sqrt{2 \log N(u)}$ for $u$ in that interval (which has length $2^{-(i+1)} D$):
$$2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D)} = 4 \cdot 2^{-(i+1)} D \sqrt{2 \log N(2^{-i} D)} \le 4 \int_{2^{-(i+1)} D}^{2^{-i} D} \sqrt{2 \log N(u, T)} \, du.$$

Dudley's entropy integral: Proof

Combining, we have
$$\begin{aligned}
\mathbb{E} \sup_\theta X_\theta &\le 2 \, \mathbb{E} \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + 2 \sum_{i=1}^k \mathbb{E} \sup_{\hat\theta_i \in \hat T_i} \bigl(X_{\hat\theta_i} - X_{f_{i-1}(\hat\theta_i)}\bigr) \\
&\le 2 \, \mathbb{E} \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + 2 \sum_{i=1}^k 2^{-(i-1)} D \sqrt{2 \log N(2^{-i} D, T)} \\
&\le 2 \, \mathbb{E} \sup_{d(\theta,\hat\theta) \le \epsilon} (X_\theta - X_{\hat\theta}) + 8\sqrt{2} \int_{2^{-(k+1)} D}^{D/2} \sqrt{\log N(u, T)} \, du.
\end{aligned}$$
When $\epsilon \to 0$, the first term goes to zero and (since $k = \lceil \log_2(D/\epsilon) \rceil$) the second term approaches the integral from 0 to $D/2$, which gives the result.

Dudley's entropy integral

We actually proved the following result:

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then for any $\delta > 0$,
$$\mathbb{E} \sup_\theta X_\theta \le 2 \, \mathbb{E} \sup_{d(\theta,\theta') \le \delta} (X_\theta - X_{\theta'}) + 8\sqrt{2} \int_{\delta/4}^{D/2} \sqrt{\log N(\epsilon, T, d)} \, d\epsilon.$$

When the entropy integral does not exist (because $N(\epsilon, T, d)$ grows too quickly as $\epsilon \to 0$), this can still give a useful bound.

Dudley's entropy integral

When does the entropy integral exist? Suppose $T$ has diameter $D$ and $\log N(\epsilon, T, d) = O(\epsilon^{-d})$. Then
$$\int_0^D \sqrt{\log N(\epsilon, T, d)} \, d\epsilon \le C \int_0^D \epsilon^{-d/2} \, d\epsilon = \frac{C}{1 - d/2} D^{1 - d/2},$$
provided that $d < 2$. The integral does not exist otherwise.
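
A quick symbolic check of this convergence condition (an aside, using sympy):

```python
import sympy as sp

# With log N(eps) = eps**(-d), the entropy integral behaves like the
# integral of eps**(-d/2) near zero: finite iff d < 2.
eps, D = sp.symbols("eps D", positive=True)
for d in (1, 2, 3):
    val = sp.integrate(eps ** (-sp.Rational(d, 2)), (eps, 0, D))
    print(f"d = {d}: integral_0^D eps^(-{d}/2) d eps = {val}")
```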

Entropy Integral: Lipschitz parameterized class

Suppose that $F$ is a parameterized class, $F = \{f(\theta, \cdot) : \theta \in \Theta\}$, where $\Theta = B_2 \subset \mathbb{R}^p$ is the Euclidean unit ball. The parameterization is $L$-Lipschitz wrt Euclidean distance on $\Theta$, so that for all $x$,
$$|f(\theta, x) - f(\theta', x)| \le L \|\theta - \theta'\|_2.$$
Suppose also that $F = -F$ (that is, $F$ is closed under negations).

Theorem:
$$\mathbb{E} R_n F = O\left(\sqrt{p} \, \frac{L}{\sqrt{n}}\right).$$

NB: We've lost the log factor.

Entropy Integral: Lipschitz parameterized class

Recall that
$$n \, \mathbb{E} R_n F = \mathbb{E} \sup_{f \in F} \langle \epsilon, f(X_1^n) \rangle = \mathbb{E} \sup_\theta \langle \epsilon, f(\theta, X_1^n) \rangle,$$
which is sub-Gaussian wrt the Euclidean distance on $\mathbb{R}^n$. Also, recall that
$$N\bigl(\delta, f(\Theta, X_1^n), \|\cdot\|_2\bigr) \le N\bigl(\delta/(L\sqrt{n}), \Theta, \|\cdot\|_2\bigr) \le (1 + 2L\sqrt{n}/\delta)^p.$$

Entropy Integral: Lipschitz parameterized class

Hence,
$$\begin{aligned}
\mathbb{E} R_n F &\le \frac{8\sqrt{2}}{n} \int_0^{2L\sqrt{n}} \sqrt{\log N\bigl(\epsilon/(L\sqrt{n}), \Theta, \|\cdot\|_2\bigr)} \, d\epsilon \\
&= \frac{8\sqrt{2} L}{\sqrt{n}} \int_0^2 \sqrt{\log N(\epsilon, \Theta, \|\cdot\|_2)} \, d\epsilon \\
&\le \frac{8\sqrt{2p} L}{\sqrt{n}} \int_0^2 \sqrt{\log\left(1 + \frac{2}{\epsilon}\right)} \, d\epsilon \\
&\le \frac{8\sqrt{2p} L}{\sqrt{n}} \int_0^2 \sqrt{\log\left(\frac{4}{\epsilon}\right)} \, d\epsilon.
\end{aligned}$$

Entropy Integral: Lipschitz parameterized class

Substituting $\epsilon = 4e^{-y^2}$ and integrating by parts,
$$\begin{aligned}
\mathbb{E} R_n F &\le \frac{8\sqrt{2p} L}{\sqrt{n}} \int_0^2 \sqrt{\log\left(\frac{4}{\epsilon}\right)} \, d\epsilon \\
&= \frac{8\sqrt{2p} L}{\sqrt{n}} \left( \bigl[-4 y e^{-y^2}\bigr]_{\sqrt{\log 2}}^\infty + 4 \int_{\sqrt{\log 2}}^\infty e^{-y^2} \, dy \right) \\
&= \frac{8\sqrt{2p} L}{\sqrt{n}} \left( 2\sqrt{\log 2} + 4 \int_{\sqrt{\log 2}}^\infty e^{-y^2} \, dy \right) \\
&< 28.5 \, \sqrt{p} \, \frac{L}{\sqrt{n}}.
\end{aligned}$$
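
A numerical check of the last two displays (illustrative; scipy's quadrature):

```python
import numpy as np
from scipy.integrate import quad

# I = integral_0^2 sqrt(log(4/eps)) d eps, which multiplies 8*sqrt(2p)*L/sqrt(n).
I, _ = quad(lambda e: np.sqrt(np.log(4.0 / e)), 0.0, 2.0)
tail, _ = quad(lambda y: np.exp(-y * y), np.sqrt(np.log(2.0)), np.inf)
closed = 2.0 * np.sqrt(np.log(2.0)) + 4.0 * tail

print(f"I = {I:.4f}, closed form = {closed:.4f}")     # both ~ 2.513
print(f"8 * sqrt(2) * I = {8 * np.sqrt(2) * I:.1f}")  # ~ 28.4, matching the bound
```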

Entropy Integral: VC-class

Theorem: For $F$ a class of $\{0,1\}$-valued functions with VC-dimension $d$,
$$\mathbb{E} R_n F = O\left(\sqrt{\frac{d}{n}}\right).$$

Compare with the consequence of Sauer's Lemma: $O\bigl(\sqrt{d \log(n/d)/n}\bigr)$. We lose the log factor.

Note: This leads to a faster rate (without the log factor) in the proof of the Glivenko-Cantelli Theorem:
$$\Pr\left( \|F_n - F\|_\infty \ge \frac{c}{\sqrt{n}} + t \right) \le 2 \exp\left( -\frac{n t^2}{8} \right).$$

Entropy Integral: VC-class

We have
$$\begin{aligned}
\mathbb{E} R_n F &\le \frac{8\sqrt{2}}{n} \mathbb{E} \int_0^{\sqrt{n}} \sqrt{\log N\bigl(\epsilon, F(X_1^n), \|\cdot\|_2\bigr)} \, d\epsilon \\
&= \frac{8\sqrt{2}}{n} \mathbb{E} \int_0^{\sqrt{n}} \sqrt{\log N\bigl(\epsilon/\sqrt{n}, F, L_2(P_n)\bigr)} \, d\epsilon \\
&= \frac{8\sqrt{2}}{\sqrt{n}} \mathbb{E} \int_0^1 \sqrt{\log N\bigl(\epsilon, F, L_2(P_n)\bigr)} \, d\epsilon,
\end{aligned}$$
where
$$\|f - g\|_{L_2(P_n)}^2 = \frac{1}{n} \sum_{i=1}^n \bigl(f(X_i) - g(X_i)\bigr)^2.$$
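
The substitution in the middle step uses the identity $\|f(X_1^n) - g(X_1^n)\|_2 = \sqrt{n} \, \|f - g\|_{L_2(P_n)}$; a short check (illustrative, with arbitrary random $\{0,1\}$ values):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
f_vals = rng.integers(0, 2, n).astype(float)   # f(X_1), ..., f(X_n) in {0,1}
g_vals = rng.integers(0, 2, n).astype(float)

euclidean = np.linalg.norm(f_vals - g_vals)            # ||f(X_1^n) - g(X_1^n)||_2
empirical = np.sqrt(np.mean((f_vals - g_vals) ** 2))   # ||f - g||_{L_2(P_n)}
assert np.isclose(euclidean, np.sqrt(n) * empirical)
print(f"{euclidean:.4f} == sqrt(n) * {empirical:.4f}")
```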

Entropy Integral: VC-class

Fact (due to Haussler): $N\bigl(\epsilon, F, L_2(P_n)\bigr) \le c \, d \, (16e)^d \, \epsilon^{-2d}$. Hence
$$\begin{aligned}
\mathbb{E} R_n F &\le \frac{8\sqrt{2}}{\sqrt{n}} \mathbb{E} \int_0^1 \sqrt{\log N\bigl(\epsilon, F, L_2(P_n)\bigr)} \, d\epsilon \\
&\le \frac{8\sqrt{2}}{\sqrt{n}} \int_0^1 \sqrt{\log\bigl(c \, d \, (16e)^d \, \epsilon^{-2d}\bigr)} \, d\epsilon \\
&\le c' \sqrt{\frac{d}{n}}.
\end{aligned}$$
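
A numerical sketch (with Haussler's unknown constant $c$ set to 1 purely for illustration) that the integral under Haussler's bound indeed scales like $\sqrt{d}$:

```python
import numpy as np
from scipy.integrate import quad

# log N(eps) <= log d + d log(16e) + 2 d log(1/eps), taking c = 1.
for d in (1, 5, 20, 100):
    integrand = lambda e, d=d: np.sqrt(
        np.log(d) + d * np.log(16 * np.e) + 2 * d * np.log(1.0 / e))
    I, _ = quad(integrand, 0.0, 1.0)
    print(f"d = {d:4d}: integral = {I:8.3f}, ratio to sqrt(d) = {I / np.sqrt(d):.3f}")
```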

An aside: Generic Chaining

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then for any probability distribution $\mu$ on $T$,
$$\mathbb{E} \sup_\theta X_\theta \le c \sup_{\theta \in T} \int_0^\infty \sqrt{\log \frac{1}{\mu(B(\theta,\epsilon))}} \, d\epsilon.$$

An aside: Generic Chaining

Talagrand's $\gamma_2$:

Theorem: For $X_\theta$ as above and
$$\gamma_2(T, d) = \inf_\mu \sup_{\theta \in T} \int_0^\infty \sqrt{\log \frac{1}{\mu(B(\theta,\epsilon))}} \, d\epsilon,$$
we have
$$\mathbb{E} \sup_\theta X_\theta \le c \, \gamma_2(T, d).$$

Sudakov's Lower Bound

Theorem: For a zero-mean Gaussian process $X_\theta$ defined on $T$, define the variance pseudometric $d(\theta,\theta')^2 = \mathrm{Var}(X_\theta - X_{\theta'})$. Then
$$\mathbb{E} \sup_\theta X_\theta \ge \sup_{\epsilon > 0} \frac{\epsilon}{2} \sqrt{\log M(\epsilon, T, d)}.$$

Sudakov's Lower Bound

Compare with the entropy integral:

Theorem: Let $X_\theta$ be a zero-mean stochastic process that is sub-Gaussian wrt a pseudo-metric $d$ on the indexing set $T$. Then
$$\mathbb{E} \sup_\theta X_\theta \le 8\sqrt{2} \int_0^\infty \sqrt{\log N(\epsilon, T, d)} \, d\epsilon.$$

Suppose that $\mathrm{Var}(X_\theta - X_{\theta'})$ is on the same scale as $d(\theta,\theta')^2$ (think of the Gaussian example of a sub-Gaussian process: there this is precisely the variance). Then, modulo constants, the lower bound is the area of the largest rectangle that can fit under the curve $\bigl(\epsilon, \sqrt{\log N(\epsilon)}\bigr)$, whereas the upper bound is the area under the curve.
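
The rectangle-versus-area picture can be computed for a concrete entropy curve (a hypothetical choice, ignoring constants and the packing-versus-covering distinction): here $\log N(\epsilon) = 1/\epsilon$ on $(0,1]$.

```python
import numpy as np
from scipy.integrate import quad

log_N = lambda e: 1.0 / e                      # hypothetical entropy curve on (0, 1]

area, _ = quad(lambda e: np.sqrt(log_N(e)), 0.0, 1.0)      # Dudley: area under curve
eps_grid = np.linspace(1e-4, 1.0, 10_000)
rect = max(e / 2.0 * np.sqrt(log_N(e)) for e in eps_grid)  # Sudakov: best rectangle

print(f"best rectangle ~ {rect:.3f} <= area under curve ~ {area:.3f}")
```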
