Probability Background


Namrata Vaswani, Iowa State University
August 24, 2015

Probability recap 1: EE 322 notes.

Quick test of concepts: given random variables $X_1, X_2, \ldots, X_n$, compute the PDF of the second smallest random variable (the 2nd order statistic).

1 Some Topics

1. Chain rule:
   $P(A_1, A_2, \ldots, A_n) = P(A_1) \, P(A_2 \mid A_1) \, P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, A_2, \ldots, A_{n-1})$

2. Total probability: if $B_1, B_2, \ldots, B_n$ form a partition of the sample space, then
   $P(A) = \sum_i P(A \mid B_i) P(B_i)$
   Partition: the events are mutually disjoint and their union is equal to the sample space.

3. Union bound: suppose $P(A_i) \geq 1 - p_i$ for small probabilities $p_i$. Then
   $P(\cap_i A_i) = 1 - P(\cup_i A_i^c) \geq 1 - \sum_i P(A_i^c) \geq 1 - \sum_i p_i$

4. Independence:
   - events $A, B$ are independent iff $P(A, B) = P(A) P(B)$
   - events $A_1, A_2, \ldots, A_n$ are mutually independent iff for every subset $S \subseteq \{1, 2, \ldots, n\}$,
     $P(\cap_{i \in S} A_i) = \prod_{i \in S} P(A_i)$

   - analogous definition for random variables: for mutually independent r.v.'s, the joint pdf of any subset of the r.v.'s is equal to the product of the marginal pdfs.

5. Conditional independence: events $A, B$ are conditionally independent given an event $C$ iff
   $P(A, B \mid C) = P(A \mid C) P(B \mid C)$
   - extends to a set of events as above
   - extends to r.v.'s as above

6. Suppose $X$ is independent of $\{Y, Z\}$. Then,
   - $X$ is independent of $Y$; $X$ is independent of $Z$
   - $X$ is conditionally independent of $Y$ given $Z$
   - $E[XY \mid Z] = E[X \mid Z] \, E[Y \mid Z]$
   - $E[XY \mid Z] = E[X] \, E[Y \mid Z]$

7. Law of iterated expectations:
   $E_{X,Y}[g(X, Y)] = E_Y\left[ E_{X \mid Y}[g(X, Y) \mid Y] \right]$

8. Conditional variance identity:
   $\mathrm{Var}_{X,Y}[g(X, Y)] = E_Y\left[ \mathrm{Var}_{X \mid Y}[g(X, Y) \mid Y] \right] + \mathrm{Var}_Y\left[ E_{X \mid Y}[g(X, Y) \mid Y] \right]$

9. Cauchy-Schwarz inequality:
   (a) For vectors $v_1, v_2$: $(v_1' v_2)^2 \leq \|v_1\|_2^2 \, \|v_2\|_2^2$
   (b) For vectors: $\left( \frac{1}{n} \sum_i x_i' y_i \right)^2 \leq \left( \frac{1}{n} \sum_i \|x_i\|_2^2 \right) \left( \frac{1}{n} \sum_i \|y_i\|_2^2 \right)$
   (c) For matrices: $\left\| \frac{1}{n} \sum_i X_i Y_i' \right\|_2^2 \leq \left\| \frac{1}{n} \sum_i X_i X_i' \right\|_2 \, \left\| \frac{1}{n} \sum_i Y_i Y_i' \right\|_2$
   (d) For scalar r.v.'s $X, Y$: $(E[XY])^2 \leq E[X^2] \, E[Y^2]$
   (e) For random vectors $X, Y$: $(E[X'Y])^2 \leq E[\|X\|_2^2] \, E[\|Y\|_2^2]$

   (f) Proof of (d), (e): follows by using the fact that $E[(X - \alpha Y)^2] \geq 0$ for every $\alpha$. This gives a quadratic in $\alpha$; use the condition (discriminant $\leq 0$) that ensures it is non-negative.
   (g) For random matrices $X, Y$:
       $\|E[XY']\|_2^2 \leq \lambda_{\max}(E[XX']) \, \lambda_{\max}(E[YY']) = \|E[XX']\|_2 \, \|E[YY']\|_2$
       Recall that for a positive semi-definite matrix $M$, $\|M\|_2 = \lambda_{\max}(M)$.
   (h) Proof: use the following definition of $\|M\|_2$:
       $\|M\|_2 = \max_{x, y \,:\, \|x\|_2 = 1, \|y\|_2 = 1} x' M y$,
       and then apply Cauchy-Schwarz for random vectors.

10. Convergence in probability. A sequence of random variables $X_1, X_2, \ldots, X_n, \ldots$ converges to a constant $a$ in probability if for every $\epsilon > 0$,
    $\lim_{n \to \infty} Pr(|X_n - a| > \epsilon) = 0$

11. Convergence in distribution. A sequence of random variables $X_1, X_2, \ldots, X_n, \ldots$ converges to a random variable $Z$ in distribution if
    $\lim_{n \to \infty} F_{X_n}(x) = F_Z(x)$ for almost all points $x$ (more precisely, at all points $x$ where $F_Z$ is continuous).

12. Convergence in probability implies convergence in distribution.

13. Consistent estimator. An estimator $\hat{\theta}_n(X)$ of $\theta$ based on $n$ random variables is consistent if it converges to $\theta$ in probability as $n \to \infty$.

14. Independent and identically distributed (iid) random variables: $X_1, X_2, \ldots, X_n$ are iid iff they are mutually independent and have the same marginal distribution, i.e., for all subsets $S \subseteq \{1, 2, \ldots, n\}$ of size $s$, the following two things hold:
    $F_{X_i, i \in S}(x_1, x_2, \ldots, x_s) = \prod_{i \in S} F_{X_i}(x_i)$ (independent) and $F_{X_i}(x) = F_{X_1}(x)$ (identically distributed).
    Clearly, the above two imply that the joint distribution of any subset of the variables is also the same:
    $F_{X_i, i \in S}(x_1, x_2, \ldots, x_s) = \prod_{i=1}^{s} F_{X_1}(x_i) = F_{X_1, X_2, \ldots, X_s}(x_1, x_2, \ldots, x_s)$

15. Moment generating function (MGF):
    $M_X(u) := E[e^{u^T X}]$

    - It is the two-sided Laplace transform of the pdf of $X$, for continuous r.v.'s $X$.
    - For a scalar r.v. $X$, $M_X(t) := E[e^{tX}]$; differentiating this $i$ times with respect to $t$ and setting $t = 0$ gives the $i$-th moment about the origin.

16. Characteristic function:
    $C_X(u) := M_X(iu) = E[e^{iu^T X}]$
    - $C_X(-u)$ is the Fourier transform of the pdf or pmf of $X$.
    - One can get back the pmf or pdf by the inverse Fourier transform.

17. Union bound (repeated from item 3): suppose $P(A_i) \geq 1 - p_i$ for small probabilities $p_i$. Then
    $P(\cap_i A_i) = 1 - P(\cup_i A_i^c) \geq 1 - \sum_i P(A_i^c) \geq 1 - \sum_i p_i$

18. Hoeffding's lemma: bounds the MGF of a zero-mean and bounded r.v. Suppose $E[X] = 0$ and $P(X \in [a, b]) = 1$; then, for all real $s$,
    $M_X(s) := E[e^{sX}] \leq e^{s^2 (b-a)^2 / 8}$
    Proof: use Jensen's inequality followed by the mean value theorem; see http://www.cs.berkeley.edu/~jduchi/projects/probability_bounds.pdf

19. Markov inequality and its implications (a small numerical comparison of the three bounds is sketched below):
    (a) Markov inequality: for a non-negative r.v., i.e., for $X$ with $P(X < 0) = 0$, and any $a > 0$,
        $P(X > a) \leq \frac{E[X]}{a}$
    (b) Chebyshev inequality: apply Markov to $(Y - \mu_Y)^2$:
        $P((Y - \mu_Y)^2 > a) \leq \frac{\sigma_Y^2}{a}$
        If the variance is small, then w.h.p. $Y$ does not deviate too much from its mean.
    (c) Chernoff bounds: apply Markov to $e^{tX}$ for any $t > 0$:
        $P(X > a) \leq \min_{t > 0} e^{-ta} E[e^{tX}], \qquad P(X < b) \leq \min_{t > 0} e^{tb} E[e^{-tX}]$
        Sometimes one gets a simpler expression by using a specific value of $t > 0$.
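To see how these bounds compare in practice, here is a minimal numerical sketch (not part of the original notes). It assumes $X \sim$ Exponential(1), for which $E[X] = \mathrm{var}(X) = 1$, $E[e^{tX}] = 1/(1-t)$ for $t < 1$, and the exact tail is $P(X > a) = e^{-a}$; the values of $a$ and the grid over $t$ are arbitrary choices.

import numpy as np

a_values = [2.0, 4.0, 8.0]
t_grid = np.linspace(1e-3, 0.999, 2000)   # Chernoff: search over t in (0, 1)

for a in a_values:
    exact = np.exp(-a)                     # true tail P(X > a) for Exponential(1)
    markov = 1.0 / a                       # E[X] / a with E[X] = 1
    chebyshev = 1.0 / (a - 1.0) ** 2       # var(X) / (a - 1)^2, valid for a > 1
    chernoff = np.min(np.exp(-t_grid * a) / (1.0 - t_grid))   # min_t e^{-ta} E[e^{tX}]
    print(f"a={a:4.1f}  exact={exact:.2e}  Markov={markov:.2e}  "
          f"Chebyshev={chebyshev:.2e}  Chernoff={chernoff:.2e}")

As expected, the Chernoff bound decays exponentially in $a$, like the true tail, while the Markov and Chebyshev bounds decay only polynomially.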

20. Using Chernoff bounding to bound the probability that $S_n := \sum_{i=1}^n X_i$ lies outside an interval $[b, a]$, when the $X_i$'s are iid:
    $P(S_n \geq a) \leq \min_{t > 0} e^{-ta} \prod_{i=1}^n E[e^{tX_i}] = \min_{t > 0} e^{-ta} (E[e^{tX_1}])^n =: p_1$
    $P(S_n \leq b) \leq \min_{t > 0} e^{tb} \prod_{i=1}^n E[e^{-tX_i}] = \min_{t > 0} e^{tb} (E[e^{-tX_1}])^n =: p_2$
    Thus, using the union bound with $A_1 = \{S_n < a\}$ and $A_2 = \{S_n > b\}$,
    $P(b < S_n < a) \geq 1 - p_1 - p_2$
    With $b = n(\mu - \epsilon)$ and $a = n(\mu + \epsilon)$, we can conclude that w.h.p. $\bar{X}_n := S_n / n$ lies between $\mu - \epsilon$ and $\mu + \epsilon$.

21. A similar thing can also be done when the $X_i$'s are just independent and not iid. Sometimes one has an upper bound on $E[e^{tX_i}]$ that can be used; for example, Hoeffding's lemma gives one such bound.

22. Hoeffding inequality: a Chernoff bound for sums of independent bounded random variables, followed by using Hoeffding's lemma. Given independent and bounded r.v.'s $X_1, \ldots, X_n$ with $P(X_i \in [a_i, b_i]) = 1$,
    $P(|S_n - E[S_n]| \geq t) \leq 2 \exp\left( -\frac{2 t^2}{\sum_{i=1}^n (b_i - a_i)^2} \right)$
    or, letting $\bar{X}_n := S_n / n$ and $\mu_n := \sum_i E[X_i] / n$,
    $P(|\bar{X}_n - \mu_n| \geq \epsilon) \leq 2 \exp\left( -\frac{2 \epsilon^2 n^2}{\sum_i (b_i - a_i)^2} \right) \leq 2 \exp\left( -\frac{2 n \epsilon^2}{\max_i (b_i - a_i)^2} \right)$
    Proof: use Chernoff bounding followed by Hoeffding's lemma.

23. Various other inequalities: Bernstein inequality, Azuma inequality.

24. Weak Law of Large Numbers (WLLN) for iid scalar random variables $X_1, X_2, \ldots, X_n$ with finite mean $\mu$. Define
    $\bar{X}_n := \frac{1}{n} \sum_{i=1}^n X_i$
    Then for any $\epsilon > 0$,
    $\lim_{n \to \infty} P(|\bar{X}_n - \mu| > \epsilon) = 0$
    Proof: use Chebyshev if $\sigma^2$ is finite; else use the characteristic function.

25. Central Limit Theorem for iid random variables. Given an iid sequence of random variables $X_1, X_2, \ldots, X_n$ with finite mean $\mu$ and finite variance $\sigma^2$, define $\bar{X}_n$ as the sample mean. Then $\sqrt{n} (\bar{X}_n - \mu)$ converges in distribution to a Gaussian r.v. $Z \sim \mathcal{N}(0, \sigma^2)$. (A small simulation sketch of items 22 and 25 follows below.)
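As an illustration of items 22 and 25 (this sketch is not part of the original notes), the following simulates iid Bernoulli(1/2) variables, which are bounded in $[0, 1]$ so that $b_i - a_i = 1$; the sample size $n$, the number of trials, and $\epsilon$ are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, trials, eps = 200, 100_000, 0.1
X = rng.integers(0, 2, size=(trials, n))       # iid Bernoulli(1/2) samples
Xbar = X.mean(axis=1)                          # sample means; here mu = 0.5

# Hoeffding (item 22): P(|Xbar - mu| >= eps) <= 2 exp(-2 n eps^2) since b_i - a_i = 1
empirical = np.mean(np.abs(Xbar - 0.5) >= eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)
print(f"P(|Xbar - mu| >= {eps}): empirical {empirical:.4f} <= Hoeffding {hoeffding:.4f}")

# CLT (item 25): sqrt(n)(Xbar - mu) should be approximately N(0, sigma^2), sigma = 1/2
Z = np.sqrt(n) * (Xbar - 0.5)
print(f"std of sqrt(n)(Xbar - mu): {Z.std():.3f}  (CLT predicts 0.500)")

The empirical tail probability indeed sits below the Hoeffding bound, and the standard deviation of $\sqrt{n}(\bar{X}_n - \mu)$ is close to $\sigma = 1/2$, as the CLT predicts.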

26. Many of the above results also exist for certain types of non-iid r.v.'s; the proofs are much more difficult.

27. Mean value theorem and Taylor series expansion.

28. Delta method: if $\sqrt{N} (X_N - \theta)$ converges in distribution to $Z$, then $\sqrt{N} (g(X_N) - g(\theta))$ converges in distribution to $g'(\theta) Z$, as long as $g'(\theta)$ is well defined and non-zero. Thus, if $Z \sim \mathcal{N}(0, \sigma^2)$, then $g'(\theta) Z \sim \mathcal{N}(0, g'(\theta)^2 \sigma^2)$.

29. If $g'(\theta) = 0$, then one can use what is called the second-order delta method. This is derived by using a second-order Taylor series expansion, or a second-order mean value theorem, to expand $g(X_N)$ around $\theta$.

30. Second-order delta method: given that $\sqrt{N} (X_N - \theta)$ converges in distribution to $Z$ and that $g'(\theta) = 0$, $N (g(X_N) - g(\theta))$ converges in distribution to $\frac{g''(\theta)}{2} Z^2$. If $Z \sim \mathcal{N}(0, \sigma^2)$, then $\frac{g''(\theta)}{2} Z^2 = \frac{\sigma^2 g''(\theta)}{2} \chi^2_1$, where $\chi^2_1$ is a r.v. that has a chi-square distribution with 1 degree of freedom.

31. Slutsky's theorem.

2 Jointly Gaussian Random Variables

First note that a scalar Gaussian r.v. $X$ with mean $\mu$ and variance $\sigma^2$ has the following pdf:
$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2 \sigma^2}}$
Its characteristic function can be computed via the Fourier transform, evaluated at $t$, to get
$C_X(t) = e^{j \mu t} e^{-\sigma^2 t^2 / 2}$

Jointly Gaussian (j G) r.v.'s. Any of the following can be used as a definition of jointly Gaussian.

1. The $n \times 1$ random vector $X$ is jointly Gaussian if and only if the scalar $u^T X$ is Gaussian distributed for all $n \times 1$ vectors $u$.

2. The random vector $X$ is jointly Gaussian if and only if its characteristic function, $C_X(u) := E[e^{i u^T X}]$, can be written as
   $C_X(u) = e^{i u^T \mu} e^{-u^T \Sigma u / 2}$
   where $\mu = E[X]$ and $\Sigma = \mathrm{cov}(X)$.

   Proof: $X$ j G implies that $V = u^T X$ is Gaussian with mean $u^T \mu$ and variance $u^T \Sigma u$. Thus its characteristic function is $C_V(t) = e^{i t u^T \mu} e^{-t^2 u^T \Sigma u / 2}$. But $C_V(t) = E[e^{itV}] = E[e^{i t u^T X}]$. If we set $t = 1$, then this is $E[e^{i u^T X}]$, which is equal to $C_X(u)$. Thus, $C_X(u) = C_V(1) = e^{i u^T \mu} e^{-u^T \Sigma u / 2}$.
   Proof (other side): we are given that the characteristic function of $X$ is $C_X(u) = E[e^{i u^T X}] = e^{i u^T \mu} e^{-u^T \Sigma u / 2}$. Consider $V = u^T X$. Then $C_V(t) = E[e^{itV}] = C_X(tu) = e^{i t u^T \mu} e^{-t^2 u^T \Sigma u / 2}$. Also, $E[V] = u^T \mu$ and $\mathrm{var}(V) = u^T \Sigma u$. Thus $V$ is Gaussian.

3. The random vector $X$ is jointly Gaussian if and only if its joint pdf can be written as
   $f_X(x) = \frac{1}{(\sqrt{2\pi})^n \sqrt{\det(\Sigma)}} e^{-(x - \mu)^T \Sigma^{-1} (x - \mu) / 2}$   (1)
   Proof: follows by computing the characteristic function from the pdf and vice versa.

4. The random vector $X$ is j G if and only if it can be written as an affine function of i.i.d. standard Gaussian r.v.'s. The proof uses 2.
   Proof: suppose $X = AZ + a$ where $Z \sim \mathcal{N}(0, I)$; compute its c.f. and show that it is the c.f. of a j G r.v.
   Proof (other side): suppose $X$ is j G; let $Z := \Sigma^{-1/2} (X - \mu)$ and write out its c.f.; one can show that it is the c.f. of iid standard Gaussians.

5. The random vector $X$ is j G if and only if it can be written as an affine function of jointly Gaussian r.v.'s.
   Proof: suppose $X$ is an affine function of a j G r.v. $Y$, i.e., $X = BY + b$. Since $Y$ is j G, by 4 it can be written as $Y = AZ + a$ where $Z \sim \mathcal{N}(0, I)$ (i.i.d. standard Gaussian). Thus, $X = BAZ + (Ba + b)$, i.e., it is an affine function of $Z$, and thus, by 4, $X$ is j G.
   Proof (other side): $X$ is j G, so by 4 it can be written as $X = BZ + b$, where $Z \sim \mathcal{N}(0, I)$, i.e., $Z$ is a j G r.v.

Properties

1. If $X_1, X_2$ are j G, then the conditional distribution of $X_1$ given $X_2$ is also j G.

2. If the elements of a j G r.v. $X$ are pairwise uncorrelated (i.e., the off-diagonal elements of their covariance matrix are zero), then they are also mutually independent.

3. Any subset of $X$ is also j G.
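The following minimal sketch (not part of the original notes) illustrates definition 4 by generating samples of $X = AZ + a$ from iid standard Gaussians $Z$ and checking the first two moments; the particular $A$, $a$, and $u$ below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(1)
A = np.array([[2.0, 0.0], [1.0, 0.5]])
a = np.array([1.0, -1.0])
Sigma = A @ A.T                                # cov(X) = A A^T, E[X] = a

Z = rng.standard_normal((100_000, 2))          # each row is an iid N(0, I) sample
X = Z @ A.T + a                                # each row is one sample of X = A Z + a

print("empirical mean:", X.mean(axis=0))       # should be close to a
print("empirical cov:\n", np.cov(X.T))         # should be close to Sigma

# Definition 1: any linear combination u^T X should be Gaussian with
# mean u^T a and variance u^T Sigma u; here we check only the moments.
u = np.array([0.3, -1.2])
V = X @ u
print("u^T X: mean", V.mean(), "var", V.var(), "predicted", u @ a, u @ Sigma @ u)

By definition 1, the linear combination $u^T X$ is itself Gaussian; the sketch only checks that its empirical mean and variance match $u^T a$ and $u^T \Sigma u$.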

3 Optimization: basic fact

Claim: $\min_{t_1, t_2} f(t_1, t_2) = \min_{t_1} \left( \min_{t_2} f(t_1, t_2) \right)$

Proof: show that LHS $\geq$ RHS and LHS $\leq$ RHS.

Let $[\hat{t}_1, \hat{t}_2] \in \arg\min_{t_1, t_2} f(t_1, t_2)$ (if the minimizer is not unique, let $\hat{t}_1, \hat{t}_2$ be any one minimizer), i.e.,
$\min_{t_1, t_2} f(t_1, t_2) = f(\hat{t}_1, \hat{t}_2)$

Let $\hat{t}_2(t_1) \in \arg\min_{t_2} f(t_1, t_2)$, i.e.,
$\min_{t_2} f(t_1, t_2) = f(t_1, \hat{t}_2(t_1))$

Let $\tilde{t}_1 \in \arg\min_{t_1} f(t_1, \hat{t}_2(t_1))$ (written with a tilde to distinguish it from the joint minimizer $\hat{t}_1$), i.e.,
$\min_{t_1} f(t_1, \hat{t}_2(t_1)) = f(\tilde{t}_1, \hat{t}_2(\tilde{t}_1))$

Combining the last two equations,
$f(\tilde{t}_1, \hat{t}_2(\tilde{t}_1)) = \min_{t_1} \left( \min_{t_2} f(t_1, t_2) \right)$

Notice that
$f(t_1, t_2) \geq \min_{t_2} f(t_1, t_2) = f(t_1, \hat{t}_2(t_1)) \geq \min_{t_1} f(t_1, \hat{t}_2(t_1)) = f(\tilde{t}_1, \hat{t}_2(\tilde{t}_1))$   (2)

The above holds for all $t_1, t_2$. In particular, use $t_1 \equiv \hat{t}_1$, $t_2 \equiv \hat{t}_2$. Thus,
$\min_{t_1, t_2} f(t_1, t_2) = f(\hat{t}_1, \hat{t}_2) \geq \min_{t_1} f(t_1, \hat{t}_2(t_1)) = \min_{t_1} \left( \min_{t_2} f(t_1, t_2) \right)$   (3)

Thus LHS $\geq$ RHS. Notice also that
$\min_{t_1, t_2} f(t_1, t_2) \leq f(t_1, t_2)$   (4)

and this holds for all $t_1, t_2$. In particular, use $t_1 \equiv \tilde{t}_1$, $t_2 \equiv \hat{t}_2(\tilde{t}_1)$. Then,
$\min_{t_1, t_2} f(t_1, t_2) \leq f(\tilde{t}_1, \hat{t}_2(\tilde{t}_1)) = \min_{t_1} \left( \min_{t_2} f(t_1, t_2) \right)$   (5)

Thus LHS $\leq$ RHS, and this finishes the proof.
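A quick way to sanity-check this claim is to compare the two sides numerically on a finite grid; the sketch below (not part of the original notes) does this for an arbitrarily chosen test function $f$.

import numpy as np

def f(t1, t2):
    # an arbitrary test function of two variables
    return (t1 - 1.0) ** 2 + (t2 + 2.0) ** 2 + 0.5 * t1 * t2

t1_grid = np.linspace(-5, 5, 201)
t2_grid = np.linspace(-5, 5, 201)
T1, T2 = np.meshgrid(t1_grid, t2_grid, indexing="ij")
F = f(T1, T2)

joint_min = F.min()                      # min over (t1, t2) jointly
nested_min = F.min(axis=1).min()         # min over t1 of (min over t2)
print(joint_min, nested_min)             # the two values coincide

On the grid, the joint minimum and the nested minimum agree exactly, which is what the claim asserts; the proof above shows it holds in general.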