An Introduction to Laws of Large Numbers

An Introduction to Laws of Large Numbers
John, CVGMI Group

Intuition

We're working with random variables. What could we observe? A sequence $\{X_n\}_{n=1}^{\infty}$... More specifically, a Bernoulli sequence, where we look at the probability that the average of the sum of $N$ random events falls in an interval: $P(x_1 < \sum_{i=1}^{N} X_i < x_2)$. Coin flipping, anybody? As $N$ increases, we observe roughly equal proportions of 0s and 1s with probability approaching one. This is intuitive: as the number of samples increases, the average observation should tend toward the theoretical mean.
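To make the coin-flipping intuition concrete, here is a minimal simulation sketch (assuming NumPy is available; the sample sizes and seed are arbitrary choices): it draws fair coin flips and prints how far the running average strays from the theoretical mean 1/2 as N grows.

import numpy as np

rng = np.random.default_rng(0)
flips = rng.integers(0, 2, size=100_000)   # Bernoulli(1/2) coin flips: 0 or 1
for n in (10, 100, 1_000, 10_000, 100_000):
    avg = flips[:n].mean()                 # average of the first n flips
    print(f"N = {n:6d}  average = {avg:.4f}  |average - 1/2| = {abs(avg - 0.5):.4f}")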

Weak Law

Let's try to work out this most basic, fundamental theorem about random variables arising from repeatedly observing a random event. How do we build such a theorem?

In a black-box scenario we want to be able to use independence.

We want to be able to use variance and expectation: so let's require $X_n \in L^2$ for all $n$, and associate with each $X_n$ its mean, which we'll denote by $\mu_n$, and its variance, $\sigma_n^2$.

We will create new random variables from the sequence and work with these; aim for results in terms of them.

This seems like a start, let's try to prove a theorem...

Weak Law

Theorem. Let $\{X_n\}$ be a sequence of independent $L^2$ random variables with means $\mu_n$ and variances $\sigma_n^2$. Then $\sum_{i=1}^{n}(X_i - \mu_i) \to 0$.

So a sequence of functions on $\Omega$, the sample space, is going toward the zero function... Can we prove this?

We haven't used the $(\sigma_n)$: we'll clearly need these, since otherwise the $X_i$ are wildly unpredictable.

IDEA: constrain $\lim_{n \to \infty} \sum_{i=1}^{n} \sigma_i^2 = 0$.

Counterexample

Weak Law: Pitfalls

We haven't specified a mode of convergence: this is a technical point, but if you recall uniform, pointwise, a.e., and normed convergence from analysis, the goal of the proof can change radically. IDEA: we want our rule to hold with high probability, that is, it should hold for essentially all values of $\omega \in \Omega$. Convergence in the normed space is pretty ambitious at this point.

We haven't specified a rate of convergence. The rate of convergence of the $\sigma_n$ should imply something about the rate of convergence of the $X_n - \mu_n$. The $X_n - \mu_n$ should converge more slowly than the $\sigma_n$, and they should converge on average.

IDEA: weaken the constraint to $\lim_{n \to \infty} \frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2 = 0$.

Law of Large Numbers 2.0

Theorem. Let $\{X_n\}$ be a sequence of independent $L^2$ random variables with means $\mu_n$ and variances $\sigma_n^2$. Then $\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_i) \to 0$ in measure if $\lim_{n \to \infty} \frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2 = 0$.

Proof sketch: let $Y_n(\omega) = \frac{1}{n}\sum_{i=1}^{n}(X_i(\omega) - \mu_i)$. Then $E(Y_n) = 0$ and $\sigma^2(Y_n) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma_i^2$ by independence. Now we can use the limit of $\sigma^2(Y_n)$ from the hypothesis, and we need to prove that $P(\{\omega : |Y_n(\omega)| > \epsilon\}) \le \frac{\sigma^2(Y_n)}{\epsilon^2} \to 0$ as $n \to \infty$.
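As a sanity check on this statement, here is a minimal Monte Carlo sketch (assumptions: NumPy, independent standard normal $X_i$ with $\mu_i = 0$ and $\sigma_i^2 = 1$, so $\sigma^2(Y_n) = 1/n$, and an arbitrary $\epsilon = 0.1$): the estimated $P(|Y_n| > \epsilon)$ shrinks with $n$ and stays below the bound $\sigma^2(Y_n)/\epsilon^2$, which can be quite loose.

import numpy as np

rng = np.random.default_rng(1)
eps, trials = 0.1, 10_000

for n in (10, 100, 1_000):
    X = rng.standard_normal((trials, n))       # X_i ~ N(0, 1), so Y_n = (1/n) sum_i (X_i - mu_i)
    Y_n = X.mean(axis=1)
    empirical = np.mean(np.abs(Y_n) > eps)     # Monte Carlo estimate of P(|Y_n| > eps)
    chebyshev = (1.0 / n) / eps**2             # sigma^2(Y_n) / eps^2 with sigma_i^2 = 1
    print(f"n = {n:5d}  P(|Y_n| > eps) ~ {empirical:.4f}  bound = {chebyshev:.2f}")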

Aside: $L^p$ bounds

What does $P(\{\omega : |Y_n(\omega)| > \epsilon\}) \le \frac{\sigma^2(Y_n)}{\epsilon^2} \to 0$ actually say? The measure of the part of $\Omega$ where $|Y_n| > \epsilon$ is bounded by the integral of $Y_n^2$ divided by epsilon squared... Thinking about it this way makes it obvious. Better bounds can be obtained: Chernoff bounds.
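To see why better bounds matter, here is a small comparison sketch (assumptions: fair coin flips, so the per-flip variance is 1/4 and Hoeffding's inequality, a Chernoff-type bound for bounded variables, applies; $\epsilon = 0.05$ is arbitrary): the Chebyshev bound on $P(|\bar{X}_n - 1/2| \ge \epsilon)$ decays like $1/n$, while the exponential bound decays far faster.

import math

eps = 0.05
sigma2 = 0.25                                  # variance of a single fair coin flip
for n in (100, 1_000, 10_000):
    chebyshev = sigma2 / (n * eps**2)          # Chebyshev: sigma^2 / (n eps^2)
    hoeffding = 2 * math.exp(-2 * n * eps**2)  # Hoeffding: 2 exp(-2 n eps^2) for [0, 1] variables
    print(f"n = {n:6d}  Chebyshev = {chebyshev:.3e}  Hoeffding = {hoeffding:.3e}")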

Law of Large Numbers 2.1

Theorem. Let $\{X_n\}$ be a sequence of independent $L^2$ random variables with means $\mu_n$ and variances $\sigma_n^2$. Then $\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_i) \to 0$ in measure if $\lim_{n \to \infty} \frac{1}{n^2}\sum_{i=1}^{n} \sigma_i^2 = 0$.

The mode of convergence can be strengthened; use the maximal inequality $P(\max_{k \le n} |\sum_{i=1}^{k}(X_i - \mu_i)| \ge \epsilon) \le \epsilon^{-2}\sum_{i=1}^{n} \sigma_i^2$.

Theorem. Let $\{X_n\}$ be a sequence of independent $L^2$ random variables with means $\mu_n$ and variances $\sigma_n^2$. Then $\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu_i) \to 0$ a.e. if $\sum_{i=1}^{\infty} \frac{\sigma_i^2}{i^2} < \infty$ (Kolmogorov's condition).

Khinchine's Law

Theorem. Let $\{X_n\}$ be a sequence of independent, identically distributed $L^1$ random variables with mean $\mu$. Then $\frac{1}{n}\sum_{i=1}^{n}(X_i - \mu) \to 0$ a.e. as $n \to \infty$.
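A quick numerical illustration (a sketch, assuming NumPy; the Lomax/Pareto shape parameter 1.5 is a hypothetical choice giving a finite mean of 2 but infinite variance, so only the $L^1$ hypothesis holds): the sample mean still settles toward the true mean, just more slowly than in the coin example.

import numpy as np

rng = np.random.default_rng(5)
a = 1.5                           # Lomax(a) has mean 1/(a - 1) = 2 but infinite variance
true_mean = 1.0 / (a - 1.0)

for n in (1_000, 100_000, 10_000_000):
    x = rng.pareto(a, n)          # i.i.d. L^1 (but not L^2) samples
    print(f"n = {n:9d}  sample mean = {x.mean():.3f}  true mean = {true_mean:.1f}")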

Sample Size in Coding

Now we need to use the laws. Start with an example in source coding. An encoder $C : X(\Omega) \to \Sigma$ encodes events. It is interesting to know how complex $C$ must be to encode $(Y_i)_{i=1}^{N}$.

Entropy: $H(X) = E[-\lg p(X)]$; a representation of how uncertain a random variable is.

Problem: $p(\cdot)$ is unknown, and the distribution needs to be learned. The WLLN can be used to answer: how uncertain is the distribution?

Asymptotic Equipartition Property

Theorem (Asymptotic Equipartition Property). If $X_1, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} p(x)$, then $-\frac{1}{n}\lg p(X_1, \ldots, X_n) \to H(X)$ in probability.

The $X_i \stackrel{\text{i.i.d.}}{\sim} p(x)$, so the $-\lg p(X_i)$ are i.i.d. The weak law of large numbers says that
$$P\Big(\Big|-\frac{1}{n}\sum_i \lg p(X_i) - E[-\lg p(X)]\Big| > \epsilon\Big) \to 0,$$
so the sample entropy approaches the true entropy in probability as the sample size increases.
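A small numerical illustration of the AEP (a sketch, assuming NumPy and a hypothetical biased coin with $p = 0.3$): the per-symbol log-likelihood $-\frac{1}{n}\sum_i \lg p(x_i)$ is compared with the entropy $H(X)$ as the sample grows.

import numpy as np

rng = np.random.default_rng(2)
p = 0.3                                            # hypothetical P(X = 1); P(X = 0) = 0.7
H = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))   # true entropy in bits

for n in (100, 10_000, 1_000_000):
    x = rng.random(n) < p                          # i.i.d. draws from the biased coin
    probs = np.where(x, p, 1 - p)                  # p(x_i) for each observed symbol
    sample_entropy = -np.mean(np.log2(probs))      # -(1/n) lg p(x_1, ..., x_n)
    print(f"n = {n:8d}  sample entropy = {sample_entropy:.4f}  H(X) = {H:.4f}")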

We want to minimize the risk $R(f) = E[l(x, y, f(x))]$, e.g. $l(x, y, f) = \frac{1}{2}|f(x) - y|$. We are stuck minimizing over all $f$, under a distribution we don't know... hopeless...

IDEA: take the law of large numbers and apply it to this framework, and hopefully $R_{\mathrm{emp}}(f) = \frac{1}{m}\sum_i l(x_i, y_i, f(x_i)) \to R(f)$. Use $L^p$ bounds to prove convergence results on the testing error.
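Here is a minimal sketch of that hope (assumptions: NumPy, a toy data model $y = x$ plus Gaussian noise, a fixed hypothetical predictor $f(x) = 0.9x$, and the absolute loss from above): the empirical risk over $m$ samples approaches a large-sample estimate of the true risk $R(f)$.

import numpy as np

rng = np.random.default_rng(3)

def loss(x, y, fx):
    return 0.5 * np.abs(fx - y)                # l(x, y, f(x)) = 1/2 |f(x) - y|

def f(x):
    return 0.9 * x                             # a fixed predictor; we only check R_emp -> R

def sample(m):
    x = rng.uniform(-1, 1, m)
    y = x + 0.1 * rng.standard_normal(m)       # toy data-generating distribution
    return x, y

x_big, y_big = sample(2_000_000)
R_true = loss(x_big, y_big, f(x_big)).mean()   # near-exact risk via a huge sample

for m in (10, 100, 10_000):
    x, y = sample(m)
    R_emp = loss(x, y, f(x)).mean()            # empirical risk on m samples
    print(f"m = {m:6d}  R_emp(f) = {R_emp:.4f}  R(f) ~ {R_true:.4f}")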

Mini: Measure Theory

I won't go into too much detail in this regard. If we've made it this far then great.

$(X, \mathcal{A})$ is a measurable space if $\mathcal{A} \subseteq 2^X$ is closed under countable unions and complements. These correspond to measurable events.

$(X, \mathcal{A})$ together with $\mu$ is a measure space if $\mu : \mathcal{A} \to [0, \infty)$ is countably additive over disjoint sets: $\mu(\bigcup_i A_i) = \sum_i \mu(A_i)$ if $A_i \cap A_j = \emptyset$ for $i \ne j$. More great properties fall out of this quickly: the measure of the empty set is zero, measures are monotone under the containment (partial) ordering, and even dominated and monotone convergence (of sets) follow.

Chebyshev's Inequality: let $f \in L^p(\mathbb{R})$ and denote $E_\alpha = \{x : |f| > \alpha\}$; then $\|f\|_p^p \ge \int_{E_\alpha} |f|^p \ge \alpha^p \mu(E_\alpha)$ by monotonicity and positivity of measures.
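To connect this to the bound used in the weak law proof above, specialize (a standard step, spelled out for completeness): take $\mu$ to be the probability measure, $f = Y_n - E(Y_n)$ and $p = 2$, so that $E_\epsilon = \{\omega : |Y_n(\omega) - E(Y_n)| > \epsilon\}$ and
$$\mu(E_\epsilon) \le \frac{1}{\epsilon^2}\int_{E_\epsilon} |Y_n - E(Y_n)|^2\, d\mu \le \frac{1}{\epsilon^2}\int |Y_n - E(Y_n)|^2\, d\mu = \frac{\sigma^2(Y_n)}{\epsilon^2}.$$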

$L^2$: Definition

Definition. We say that a function $f : X \to \mathbb{R}$ belongs to the function space $L^p(X)$ if $\|f\|_p = \left(\int |f(x)|^p\, dx\right)^{1/p} < \infty$, and say $f$ has finite $L^p$ norm.

Definition. The fundamental object being considered in statistics is the random variable. Given a measure space $(\Omega, \mathcal{B}, \mu)$, $X : \Omega \to \mathbb{R}$ is an $L^2$ random variable if $\mu(\Omega) = 1$, $\{X < r\} \in \mathcal{B}$ for all $r \in \mathbb{R}$, and $\left(\int |X(\omega)|^2\, d\mu(\omega)\right)^{1/2} < \infty$. We may also say that such a random variable has a finite 2nd moment.

$L^2$: Questions

Why $\mathcal{B}$? Physical paradoxes, e.g. Banach-Tarski. We want to talk about physical events, i.e. measurable sets, and $\mathcal{B}$ is as descriptive as we'd like most of the time.

What does $\mu$ do here? $\mu$ weights the events (measurable sets). It puts values on $\{\omega \in \Omega : X(\omega) < r\}$; $\Omega$ is the sample space.

$X$ maps random events (patterns occurring in data) to the reals, which have a good standard notion of measure. So $X$ induces the distribution $P(B) = \mu(X^{-1}(B))$, which gives $P(\{X(\omega) \in B\}) = \mu(X^{-1}(B))$ as expected. This is usually written without the sample variable.

For more sophisticated questions/answers see MAA 6616 or STA 7466, but discussion is encouraged.

Statistics: Functions of Measured Events

Now we can start building.

Theorem. If $f : \mathbb{R} \to \mathbb{R}$ is measurable, then $\int f(x)\, dP = \int f(X(\omega))\, d\mu$, i.e. the random variable pushes forward a measure onto $\mathbb{R}$, and the integrals of measurable functions of random variables are therefore ultimately determined by real integration.

This may be proved using Dirac measures and measure properties. Following these lines we may develop the basics of the theory of probability from measure theory. (OK, homework.)
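A quick numerical check of this change-of-variables idea (a sketch, assuming NumPy and SciPy; taking $X$ standard normal and $f(x) = x^2$ as hypothetical choices): integrating $f$ against the induced distribution $P$ on the real line agrees with averaging $f(X(\omega))$ over the sample space.

import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(4)

def f(x):
    return x**2                                 # a measurable test function

# "Real integration": integral of f against the induced distribution P = N(0, 1) on R
lhs, _ = quad(lambda x: f(x) * norm.pdf(x), -np.inf, np.inf)

# Sample-space side: average of f(X(omega)) d mu, approximated by Monte Carlo draws of omega
rhs = f(rng.standard_normal(1_000_000)).mean()

print(f"integral of f dP = {lhs:.4f}   average of f(X(omega)) dmu ~ {rhs:.4f}")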

Statistics: Moments

Definition. Define the first moment as the linear functional $E(X) = \int t\, dP$. Then the $p$-th central moment is a functional given by $m_p(X) = \int (t - E(X))^p\, dP$. Note that when $X \in L^2$ these are well defined by the theorem above. Pullbacks are omitted.

Since we're talking about $L^2$, let's define the second central moment as $\sigma^2$ for convenience. If we need higher moments, sadly, mathematics says no: $X \in L^2$ alone does not guarantee they exist.

Statistics: Independence

Now that we know what random variables are, let's try to define how a collection of them interact in the most basic way possible.

Definition. Let $\mathcal{X} = \{X_\alpha\}$ be a collection of random variables. Then $\mathcal{X}$ is independent if $P(\bigcap_\alpha X_\alpha^{-1}(B_\alpha)) = P((X_1, \ldots, X_n)^{-1}(B_1 \times \cdots \times B_n)) = (\mu X_1^{-1} \times \cdots \times \mu X_n^{-1})(B_1 \times \cdots \times B_n)$. (Head explodes.)

Really this just means that the $X_\alpha$ are independent if their joint distributions are given by the product measure of the induced measures.

Limits: Repeating Independent Trials

The significance of the belabored definition of independence is that the joint distribution, just a distribution over several random variables, then contains zero information about how the variables are related beyond their individual distributions. If we have an infinite collection of random variables, independent of each other, then - in spite of independence - we would like to be able to infer information about the lower order moments from the higher order moments and vice versa. If we have ideal conditions on the random variables then we should be able to deduce information about moments from a very large sequence. The type of convergence we get should be sensitive to the hypotheses.

We clearly need to specify how the means are converging, right? Why? So how could $\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_i) \to 0$?

In measure: $\lim_{N \to \infty} P(\{|\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_i)| \ge \epsilon\}) = 0$ for every $\epsilon > 0$.

In norm: $\lim_{N \to \infty} \left(\int |\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_i)|^p\, d\mu(\omega)\right)^{1/p} = 0$.

A.E.: $P(\{\lim_{N \to \infty} |\frac{1}{N}\sum_{i=1}^{N}(X_i - \mu_i)| > 0\}) = 0$.

These modes are related: $L^p$ convergence implies convergence in measure $\mu$, and if $f_n \to f$ in measure then some subsequence $f_{n_k} \to f$ a.e.