CSCI-6971 Lecture Notes: Probability theory

Kristopher R. Beevers
Department of Computer Science
Rensselaer Polytechnic Institute
beevek@cs.rpi.edu
January 31, 2006

1 Properties of probabilities

Let A, B, C be events. Then the following properties hold:

- A \subseteq B \implies P(A) \le P(B)
- P(A \cup B) = P(A) + P(B) - P(A \cap B), so P(A \cup B) \le P(A) + P(B)

Definition 1.1. Conditional probability:

    P(A \mid B) = \frac{P(A \cap B)}{P(B)}    (1)

Definition 1.2. The Law of Total Probability: if A_1, ..., A_n are disjoint events that partition the sample space, then

    P(B) = P(A_1 \cap B) + ... + P(A_n \cap B)    (2)

Definition 1.3. Bayes' Rule: by the definition of conditional probability,

    P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)    (3)

so

    P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}    (4)

and by the Law of Total Probability

    P(A \mid B) = \frac{P(B \mid A) P(A)}{P(A) P(B \mid A) + P(A^c) P(B \mid A^c)}    (5)

Definition 1.4. Independence: A and B are independent iff P(A \cap B) = P(A) P(B), or equivalently P(A \mid B) = P(A).

Definition 1.5. Conditional independence: A and B are independent when conditioned on C iff P(A \cap B \mid C) = P(A \mid C) P(B \mid C). Note that independence and conditional independence do not imply each other.

The primary sources for most of this material are: Introduction to Probability, D.P. Bertsekas and J.N. Tsitsiklis, Athena Scientific, Belmont, MA, 2002; Randomized Algorithms, R. Motwani and P. Raghavan, Cambridge University Press, Cambridge, UK, 1995; and the author's own notes.
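
A small numeric sketch, not part of the original notes, may help make Equations 2 and 5 concrete; all of the numbers below are hypothetical (a diagnostic test for a rare condition):

```python
# Hypothetical numbers: event A = "condition present" with a 1% base rate;
# B = "test positive" with P(B | A) = 0.99 and P(B | not A) = 0.05.
p_a = 0.01
p_b_given_a = 0.99
p_b_given_not_a = 0.05

# Law of Total Probability (Equation 2, with the partition {A, A^c}):
p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' Rule (Equation 5):
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)   # ~0.1667
```

Even with a 99% sensitive test, the low base rate keeps P(A | B) under 17%; this base-rate effect is exactly what the denominator of Equation 5 accounts for.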

2 Random variables

Let X and Y be random variables.

Definition 2.1. A probability density function (PDF) is a function f_X(x) such that:

- For every B \subseteq R, P(X \in B) = \int_B f_X(x) dx
- For all x, f_X(x) \ge 0
- \int_{-\infty}^{\infty} f_X(x) dx = 1

Note that f_X(x) is not the probability of an event; in particular, f_X(x) may be greater than one.

Definition 2.2. A cumulative distribution function (CDF) is defined as:

    F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t) dt    (6)

So a CDF is defined in terms of a PDF, and given a CDF, the PDF can be obtained by differentiating, i.e. f_X(x) = dF_X(x)/dx.

Definition 2.3. The expectation (expected value or mean) of X is defined as:

    E[X] = \int_{-\infty}^{\infty} x f_X(x) dx    (7)

Some properties of the expectation:

- E[\sum_i X_i] = \sum_i E[X_i], regardless of independence
- For \alpha \in R, E[\alpha X] = \alpha E[X]
- E[XY] = E[X] E[Y] if X and Y are independent

Linearity of expectation: given Y = aX + b, a linear function of the random variable X, E[Y] = a E[X] + b, which we show for the discrete case:

    E[Y] = \sum_x (ax + b) f_X(x)    (8)
         = a \sum_x x f_X(x) + b \sum_x f_X(x)    (9)
         = a E[X] + b    (10)

Law of iterated expectations (or law of total expectation): if X and Y are random variables in the same space, then E[E[X | Y]] = E[X], shown as follows:

    E[E[X | Y]] = \sum_y ( \sum_x x P(X = x | Y = y) ) P(Y = y)    (11)
                = \sum_y \sum_x x P(X = x | Y = y) P(Y = y)    (12)
                = \sum_y \sum_x x P(Y = y | X = x) P(X = x)    (13)
                = \sum_x x P(X = x) \sum_y P(Y = y | X = x)    (14)
                = \sum_x x P(X = x)    (15)
                = E[X]    (16)

Note that E[X | Y] is itself a random variable whose value depends on Y; i.e., E[X | Y] is a function of Y.
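
A quick Monte Carlo sketch (my addition, not from the original notes) makes the law of iterated expectations concrete; the two-stage model below is hypothetical, chosen so E[X | Y] has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Hypothetical two-stage model: Y ~ Poisson(4), then X | Y = y ~ Binomial(y, 1/2).
y = rng.poisson(4.0, size=n)
x = rng.binomial(y, 0.5)

# Here E[X | Y] is the random variable 0.5 * Y, so E[E[X | Y]] = E[X] = 2.
print((0.5 * y).mean())   # ~2.0, the outer expectation E[E[X | Y]]
print(x.mean())           # ~2.0, the direct estimate of E[X]
```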

Definition 2.4. The variance of X is defined as:

    var(X) = E[(X - E[X])^2]    (17)

This can be rewritten into the often useful form var(X) = E[X^2] - (E[X])^2, which we will illustrate for the discrete case:

    var(X) = E[(X - E[X])^2]    (18)
           = \sum_x (x - E[X])^2 f_X(x)    (19)
           = \sum_x (x^2 - 2x E[X] + (E[X])^2) f_X(x)    (20)
           = \sum_x x^2 f_X(x) - 2 E[X] \sum_x x f_X(x) + (E[X])^2 \sum_x f_X(x)    (21)
           = E[X^2] - 2 (E[X])^2 + (E[X])^2    (22)
           = E[X^2] - (E[X])^2    (23)

The law of total variance asserts that var(X) = E[var(X | Y)] + var(E[X | Y]), which we can show using the law of iterated expectations:

    var(X) = E[X^2] - (E[X])^2    (24)
           = E[E[X^2 | Y]] - (E[E[X | Y]])^2    (25)
           = E[var(X | Y)] + E[(E[X | Y])^2] - (E[E[X | Y]])^2    (26)
           = E[var(X | Y)] + var(E[X | Y])    (27)

Definition 2.5. The covariance of X and Y is defined as:

    cov(X, Y) = E[(X - E[X])(Y - E[Y])]    (28)

which can be rewritten:

    cov(X, Y) = E[(X - E[X])(Y - E[Y])]    (29)
              = E[XY - E[X] Y - E[Y] X + E[X] E[Y]]    (30)
              = E[XY] - E[E[X] Y] - E[E[Y] X] + E[X] E[Y]    (31)
              = E[XY] - E[X] E[Y]    (32)

Note that if X and Y are independent, E[XY] = E[X] E[Y], so cov(X, Y) = 0.

Definition 2.6. The correlation coefficient of X and Y is obtained from the covariance:

    \rho(X, Y) = \frac{cov(X, Y)}{\sqrt{var(X) var(Y)}}    (33)

The correlation coefficient can be thought of as a normalized measure of the covariance of X and Y. If \rho(X, Y) = 1, X and Y are fully positively correlated; if \rho(X, Y) = -1, they are fully negatively correlated.
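
The identities above are easy to check by simulation. The following sketch (not in the original notes; the distributions are chosen arbitrarily) verifies Equations 23, 32 and 33 with NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(2.0, size=500_000)       # E[X] = 2, var(X) = 4
y = x + rng.normal(0.0, 1.0, size=x.size)    # positively correlated with X

# Equation 23: var(X) = E[X^2] - (E[X])^2
print(np.mean(x**2) - np.mean(x)**2, np.var(x))

# Equation 32: cov(X, Y) = E[XY] - E[X] E[Y]
cov_xy = np.mean(x * y) - np.mean(x) * np.mean(y)
# Equation 33: rho(X, Y) = cov(X, Y) / sqrt(var(X) var(Y))
rho = cov_xy / np.sqrt(np.var(x) * np.var(y))
print(rho, np.corrcoef(x, y)[0, 1])          # the two agree
```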

2.1 The variance of sums of random variables

Let \tilde{X}_i = X_i - E[X_i]. Then

    var(\sum_{i=1}^n X_i) = E[(\sum_{i=1}^n \tilde{X}_i)^2]    (34)
        = E[\sum_{i=1}^n \sum_{j=1}^n \tilde{X}_i \tilde{X}_j]    (35)
        = \sum_{i=1}^n \sum_{j=1}^n E[\tilde{X}_i \tilde{X}_j]    (36)
        = \sum_{i=1}^n E[\tilde{X}_i^2] + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^n E[\tilde{X}_i \tilde{X}_j]    (37)
        = \sum_{i=1}^n var(X_i) + 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^n cov(X_i, X_j)    (38)

2.2 Joint probability density functions

Given two random variables X and Y, their joint PDF is defined so that

    f_{X,Y}(x, y) = P(X = x, Y = y)    (39)

We also define the marginal PDFs f_X(x) and f_Y(y) and the conditional PDFs f_{X|Y}(x|y) and f_{Y|X}(y|x). We can obtain f_X(x) by marginalizing the joint PDF:

    f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y) dy    (40)

The definition of conditional probability can be applied to obtain

    f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}    (41)

Combining these, a different expression for the marginal PDF is

    f_X(x) = \int_{-\infty}^{\infty} f_Y(y) f_{X|Y}(x|y) dy    (42)

2.3 Convolutions

Definition 2.7. Suppose X and Y are independent random variables with PDFs f_X, f_Y, respectively. The PDF f_W representing the distribution of W = X + Y is known as the convolution of f_X and f_Y.

To derive the distribution f_W we start with the CDF:

    P(W \le w | X = x) = P(X + Y \le w | X = x)    (43)
        = P(x + Y \le w | X = x)    (44)
        = P(x + Y \le w)    (45)   [by independence]
        = P(Y \le w - x)    (46)

This is a CDF of Y. Next we differentiate both sides with respect to w to obtain the PDF:

    f_{W|X}(w|x) = f_Y(w - x)    (47)

    f_X(x) f_{W|X}(w|x) = f_X(x) f_Y(w - x)    (48)

    f_{X,W}(x, w) = f_X(x) f_Y(w - x)    (49)   [conditional prob.]

    f_W(w) = \int_{-\infty}^{\infty} f_X(x) f_Y(w - x) dx    (50)   [marginalization]
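
As a sanity check on Equation 50 (my addition, not from the notes): the sum of two independent Uniform(0, 1) variables should have the triangular density obtained by convolving the two uniform densities.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# W = X + Y for independent X, Y ~ Uniform(0, 1). By Equation 50, f_W is the
# triangle density: f_W(w) = w on [0, 1] and 2 - w on [1, 2].
w = rng.uniform(0.0, 1.0, n) + rng.uniform(0.0, 1.0, n)

hist, edges = np.histogram(w, bins=40, range=(0.0, 2.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
triangle = np.where(centers < 1.0, centers, 2.0 - centers)
print(np.max(np.abs(hist - triangle)))   # small deviation, shrinking with n
```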

3 Least squares estimation

Suppose we are given the value of a random variable Y that is somehow related to the value of an unknown variable X. In other words, Y is some form of measurement of X. How can we compute an estimate c of the value of X given Y that minimizes the squared error (X - c)^2?

First, consider an arbitrary c. Then the mean squared error is

    E[(X - c)^2] = var(X - c) + (E[X - c])^2 = var(X) + (E[X] - c)^2    (51)

by Equation 23. If we are given no measurements, we should pick the value of c that minimizes this expression: since var(X) does not depend on c, we choose c = E[X], which eliminates the second term.

Now suppose we are given a measurement Y = y. Then to minimize the conditional mean squared error, we should choose c = E[X | Y = y]. This value is the least squares estimate of X given Y. (The proof is omitted.) Note that we have said nothing yet about the relationship between X and Y. In general, the estimate E[X | Y = y] is a function of y, which we refer to as an estimator.

3.1 Estimation error

Let \hat{X} = E[X | Y] be the least squares estimate of X, and let \tilde{X} = X - \hat{X} be the estimation error. The estimation error exhibits the following properties.

\tilde{X} is zero mean:

    E[\tilde{X} | Y] = E[X - \hat{X} | Y] = E[X | Y] - E[\hat{X} | Y] = \hat{X} - \hat{X} = 0    (52)

(Note that E[\hat{X} | Y] = \hat{X} since \hat{X} is completely determined by Y.)

\tilde{X} and the estimate \hat{X} are uncorrelated; using E[\tilde{X} | Y] = 0,

    cov(\hat{X}, \tilde{X}) = E[(\hat{X} - E[\hat{X}])(\tilde{X} - E[\tilde{X}])]    (53)
        = E[(\hat{X} - E[X]) \tilde{X}]    (54)   [iter. exp.]
        = E[E[(\hat{X} - E[X]) \tilde{X} | Y]]    (55)
        = E[(\hat{X} - E[X]) E[\tilde{X} | Y]]    (56)
        = 0    (57)

Because X = \tilde{X} + \hat{X}, var(X) can be decomposed based on Equation 38:

    var(X) = var(\hat{X}) + var(\tilde{X}) + 2 cov(\hat{X}, \tilde{X}) = var(\hat{X}) + var(\tilde{X})    (58)
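
A brief simulation (not part of the original notes) illustrates that the conditional mean beats any constant estimate; the Gaussian measurement model below is hypothetical, chosen so that E[X | Y] has a simple closed form:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

# Hypothetical measurement model: X ~ N(2, 1) and Y = X + noise,
# with noise ~ N(0, 0.5^2), so Y is a noisy measurement of X.
x = 2.0 + rng.normal(0.0, 1.0, n)
y = x + rng.normal(0.0, 0.5, n)

# Mean squared error of three estimates of X:
mse_zero = np.mean((x - 0.0) ** 2)        # arbitrary constant c = 0
mse_mean = np.mean((x - x.mean()) ** 2)   # c = E[X], the best constant
# For this jointly normal pair, E[X | Y] = 2 + 0.8 * (Y - 2).
mse_cond = np.mean((x - (2.0 + 0.8 * (y - 2.0))) ** 2)
print(mse_zero, mse_mean, mse_cond)       # roughly 5.0, 1.0, 0.2
```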

3.2 Linear least squares

Suppose we restrict ourselves to the linear estimator \hat{X} = aY + b; in other words, the estimate of X is a linear function of the random variable Y. Our goal is to find values for the coefficients a and b that minimize the mean squared estimation error E[(X - aY - b)^2].

First, suppose a is fixed. Then by Equation 51 we choose

    b = E[X - aY] = E[X] - a E[Y]    (59)

Substituting this into our objective and manipulating, we obtain

    E[(X - aY - E[X] + a E[Y])^2] = var(X - aY)    (60)
        = var(X) + a^2 var(Y) - 2 cov(X, aY)    (61)
        = var(X) + a^2 var(Y) - 2a cov(X, Y)    (62)

Our goal is to minimize this quantity with respect to a. Since it is quadratic in a, it is minimized when its derivative with respect to a is zero, i.e.

    0 = 2a var(Y) - 2 cov(X, Y)    (63)

    a = \frac{cov(X, Y)}{var(Y)}    (64)

    a = \rho \sqrt{\frac{var(X)}{var(Y)}}    (65)

The mean squared error of our estimate is then

    var(X) + a^2 var(Y) - 2a cov(X, Y)    (66)
        = var(X) + \rho^2 \frac{var(X)}{var(Y)} var(Y) - 2 \rho \sqrt{\frac{var(X)}{var(Y)}} \rho \sqrt{var(X) var(Y)}    (67)
        = (1 - \rho^2) var(X)    (68)

The basic idea behind the linear least squares estimator is to start with the baseline estimate E[X] for X, and then adjust the estimate by taking into account the value of Y - E[Y] and the correlation between X and Y.
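
The coefficients from Equations 59 and 64 and the error formula of Equation 68 can be checked numerically; the sketch below (my addition; the cubic dependence is arbitrary) fits the linear least squares estimator from samples:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Any joint distribution works; a nonlinear dependence chosen for illustration.
y = rng.uniform(-1.0, 1.0, n)
x = y**3 + rng.normal(0.0, 0.1, n)

# a = cov(X, Y) / var(Y)  (Equation 64),  b = E[X] - a E[Y]  (Equation 59)
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
a = cov_xy / np.var(y)
b = x.mean() - a * y.mean()

# Equation 68: the mean squared error equals (1 - rho^2) var(X).
mse = np.mean((x - (a * y + b)) ** 2)
rho = cov_xy / np.sqrt(np.var(x) * np.var(y))
print(mse, (1.0 - rho**2) * np.var(x))   # the two agree
```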

4 Normal random variables

The univariate Normal distribution with mean \mu and variance \sigma^2, denoted N(\mu, \sigma^2), has the PDF

    f_X(x) = \frac{1}{\sqrt{2\pi} \sigma} e^{-(x - \mu)^2 / 2\sigma^2}    (69)

The Standard Normal distribution is the particular case where \mu = 0 and \sigma = 1, with PDF

    f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}    (70)

The cumulative distribution function of the Standard Normal, denoted \Phi, is thus

    \Phi(y) = P(Y \le y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-t^2/2} dt    (71)

Note that since N(0, 1) is symmetric, \Phi(-y) = 1 - \Phi(y):

    \Phi(-y) = P(Y \le -y) = P(Y \ge y) = 1 - P(Y < y) = 1 - \Phi(y)    (72)

Finally, the CDF of any random variable X ~ N(\mu, \sigma^2) can be expressed in terms of the Standard Normal CDF. First, by simple manipulation,

    P(X \le x) = P\left( \frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma} \right)    (73)

We see that

    E\left[ \frac{X - \mu}{\sigma} \right] = \frac{E[X] - \mu}{\sigma} = 0    (74)

    var\left( \frac{X - \mu}{\sigma} \right) = \frac{var(X)}{\sigma^2} = 1    (75)

So Y = (X - \mu)/\sigma ~ N(0, 1) and the CDF is

    P(X \le x) = \Phi\left( \frac{x - \mu}{\sigma} \right)    (76)

5 Limit theorems

We first examine the asymptotic behavior of sequences of random variables. Let X_1, X_2, ..., X_n be independent and identically distributed, each with mean \mu and variance \sigma^2, and let S_n = \sum_i X_i. Then

    var(S_n) = \sum_i var(X_i) = n\sigma^2    (77)

So as n increases, the variance of S_n does not converge. Instead, consider the sample mean M_n = S_n / n, which converges as follows:

    E[M_n] = \frac{1}{n} \sum_i E[X_i] = \mu    (78)

    var(M_n) = \frac{var(\sum_i X_i)}{n^2} = \frac{1}{n^2} \sum_i var(X_i) = \frac{\sigma^2}{n}    (79)

So lim_{n \to \infty} var(M_n) = 0, i.e. as the number of samples n increases, the sample mean tends to the true mean.

5.1 Central limit theorem

Suppose the X_i are defined as above, and let

    Z_n = \frac{\sum_i X_i - n\mu}{\sigma \sqrt{n}}    (80)

The Central Limit Theorem, which we will not prove, states that as n increases, the CDF of Z_n tends to \Phi(z) (the Standard Normal CDF). In other words, the sum of a large number of random variables is approximately normally distributed.
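
A short simulation (not from the original notes) shows the standardized sum Z_n of Equation 80 behaving like a Standard Normal; it evaluates \Phi via math.erf, and uses Exponential(1) variables so that \mu = \sigma = 1:

```python
import math
import numpy as np

rng = np.random.default_rng(5)

def phi(z):
    # Standard Normal CDF (Equation 71), via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# X_i ~ Exponential(1), so mu = sigma = 1; the summands are far from normal,
# but Z_n (Equation 80) is already close to N(0, 1) at n = 100.
n, trials = 100, 50_000
s = rng.exponential(1.0, size=(trials, n)).sum(axis=1)
z = (s - n * 1.0) / (1.0 * math.sqrt(n))

# Empirical CDF of Z_n vs. Phi at a few points.
for c in (-1.0, 0.0, 1.0):
    print(np.mean(z <= c), phi(c))
```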

5.2 Markov inequality

For a nonnegative random variable X and a > 0, define the random variable Y as follows:

    Y = 0 if X < a, and Y = a otherwise    (81)

Clearly Y \le X, so E[Y] \le E[X]. Furthermore, by the definition of expectation, E[Y] = 0 \cdot P(X < a) + a P(X \ge a), so

    a P(X \ge a) \le E[X]    (82)

    P(X \ge a) \le \frac{E[X]}{a}    (83)

Equation 83 is known as the Markov inequality, which essentially asserts that if a nonnegative random variable has a small mean, the probability that the variable takes a large value is also small.

5.3 Chebyshev inequality

Let X be a random variable with mean \mu and variance \sigma^2. By the Markov inequality,

    P((X - \mu)^2 \ge c^2) \le \frac{E[(X - \mu)^2]}{c^2} = \frac{\sigma^2}{c^2}    (84)

Since P((X - \mu)^2 \ge c^2) = P(|X - \mu| \ge c),

    P(|X - \mu| \ge c) \le \frac{\sigma^2}{c^2}    (85)

Equation 85 is known as the Chebyshev inequality. The Chebyshev inequality is often rewritten as

    P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}    (86)

In other words, the probability that a random variable takes a value more than k standard deviations from its mean is at most 1/k^2.

5.4 Weak law of large numbers

Applying the Chebyshev inequality to the sample mean M_n, and using Equations 78 and 79, we obtain

    P(|M_n - \mu| \ge \epsilon) \le \frac{\sigma^2}{n \epsilon^2}    (87)

In other words, for large n, the bulk of the distribution of M_n is concentrated near \mu. A common application is to fix \epsilon and compute the number of samples needed to guarantee that the sample mean is an accurate estimate.
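
Both inequalities are easy to probe empirically. The sketch below (my addition; the Exponential(1) choice is arbitrary, giving \mu = \sigma = 1) compares the Markov and Chebyshev bounds with observed tail frequencies, showing how loose they can be:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.exponential(1.0, size=1_000_000)   # nonnegative, mu = sigma = 1

# Markov (Equation 83): P(X >= a) <= E[X] / a
a = 5.0
print(np.mean(x >= a), x.mean() / a)              # ~0.0067 vs bound 0.2

# Chebyshev (Equation 86): P(|X - mu| >= k*sigma) <= 1 / k^2
k = 3.0
print(np.mean(np.abs(x - 1.0) >= k), 1.0 / k**2)  # ~0.018 vs bound 0.111
```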

5.5 Jensen's inequality

Let f(x) be a convex function, i.e. d^2 f / dx^2 > 0 for all x. First, note that if f(x) is convex, then the first-order Taylor approximation of f(x) is an underestimate. By the Fundamental Theorem of Calculus,

    f(x) = f(a) + \int_a^x f'(t) dt    (88)

Since a convex function has a nondecreasing derivative (the case x < a is symmetric), replacing f'(t) by f'(a) in the integrand can only decrease the result:

    f(x) \ge f(a) + \int_a^x f'(a) dt    (89)   [Taylor approx.]
         = f(a) + (x - a) f'(a)    (90)

Thus if X is a random variable,

    f(a) + (X - a) f'(a) \le f(X)    (91)

Now, let a = E[X] and take expectations. Then we have

    f(E[X]) + (E[X] - E[X]) f'(E[X]) \le E[f(X)]    (92)

    f(E[X]) \le E[f(X)]    (93)

Equation 93 is known as Jensen's inequality.

5.6 Chernoff bound

Finally we turn to the Chernoff bound, a powerful technique for bounding the probability that a random variable deviates far from its expectation. First, observe that the Chebyshev inequality provides a polynomial bound on the probability that X takes a value in the tails of its density function. The Chernoff-type bounds, on the other hand, are exponential. We define such a bound as follows.

Let X_1, X_2, ..., X_n be independent, identically distributed random variables. Assume that

    E[X_1] = E[X_2] = ... = E[X_n] = \mu < \infty

and that

    var(X_1) = var(X_2) = ... = var(X_n) = \sigma^2 < \infty

Further, let X = \sum_{i=1}^n X_i, so that E[X] = n\mu and var(X) = n\sigma^2. The Chernoff bound states that, for t > 0 and 0 \le X_i \le 1 for all 1 \le i \le n,

    P(|X - n\mu| \ge nt) \le 2 e^{-2nt^2}    (94)

Note that this bound is significantly better than that of the Chebyshev inequality: Chebyshev decreases in a manner inversely proportional to n, whereas the Chernoff bound decreases exponentially with n.
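
The following sketch (not part of the original notes) makes this comparison concrete for Bernoulli(1/2) variables, which satisfy 0 \le X_i \le 1; it contrasts the empirical tail frequency with the Chernoff bound of Equation 94 and the corresponding Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(7)

# X_i ~ Bernoulli(0.5): 0 <= X_i <= 1, mu = 0.5, sigma^2 = 0.25.
n, t, trials = 200, 0.1, 1_000_000
# X = sum of n Bernoulli(0.5) draws, i.e. Binomial(n, 0.5).
x = rng.binomial(n, 0.5, size=trials)

empirical = np.mean(np.abs(x - n * 0.5) >= n * t)
chernoff = 2 * np.exp(-2 * n * t**2)      # Equation 94
chebyshev = 0.25 / (n * t**2)             # Chebyshev applied to X
print(empirical, chernoff, chebyshev)     # ~0.006 <= 0.037 << 0.125
```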

We now prove the bound stated in Equation 94. In particular, we will prove the bound for the case

    P(X - n\mu \ge nt) \le e^{-2nt^2}

The proof for the second case, P(X - n\mu \le -nt) \le e^{-2nt^2}, is very similar, and the complete bound is merely the sum of these two probabilities.

Proof: We first define the function

    f(x) = 1 if x - n\mu \ge nt, and f(x) = 0 if x - n\mu < nt

Note that

    E[f(X)] = P(X - n\mu \ge nt)    (95)

which is exactly the probability we are interested in computing.

Lemma 5.1. For all positive reals h, f(x) \le e^{h(x - n\mu - nt)}.

Proof: If x - n\mu - nt \ge 0, then f(x) = 1 and e^{h(x - n\mu - nt)} \ge 1; if x - n\mu - nt < 0, then f(x) = 0 while the exponential is positive. Note that this argument requires h to be a positive real.

So, we now have that

    E[f(X)] \le E[e^{h(X - n\mu - nt)}]    (96)

We will concentrate on bounding the above expectation, and then minimizing it with respect to h. Let us first manipulate the expectation as follows:

    E[e^{h(X - n\mu - nt)}] = E[e^{h(X_1 + X_2 + ... + X_n) - hn\mu - hnt}]
        = e^{-hnt} E[e^{h(X_1 - \mu) + h(X_2 - \mu) + ... + h(X_n - \mu)}]
        = e^{-hnt} E[\prod_{i=1}^n e^{h(X_i - \mu)}]

So, by independence,

    E[e^{h(X - n\mu - nt)}] = e^{-hnt} \prod_{i=1}^n E[e^{h(X_i - \mu)}]    (97)

Lemma 5.2. Let Y be a random variable such that 0 \le Y \le 1. Then, for any real number h \ge 0,

    E[e^{hY}] \le (1 - E[Y]) + E[Y] e^h

Proof: This follows directly from the definition of convexity: since e^{hy} is convex in y, on [0, 1] it lies below the chord (1 - y) + y e^h joining its endpoints, and taking expectations of both sides gives the claim.

So, using Equation 97 and Lemma 5.2, we have that

    e^{-hnt} \prod_{i=1}^n E[e^{h(X_i - \mu)}] \le e^{-hnt} \prod_{i=1}^n e^{-h\mu} ((1 - \mu) + \mu e^h)

Lemma 5.3.

    e^{-h\mu} ((1 - \mu) + \mu e^h) \le e^{h^2/8}

Proof: First,

    e^{-h\mu} ((1 - \mu) + \mu e^h) = e^{-h\mu + \ln((1 - \mu) + \mu e^h)}

Let

    L(h) = -h\mu + \ln((1 - \mu) + \mu e^h)    (98)

Then

    L'(h) = -\mu + \frac{\mu e^h}{(1 - \mu) + \mu e^h}

    L''(h) = \frac{\mu (1 - \mu) e^h}{((1 - \mu) + \mu e^h)^2} \le \frac{1}{4}
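
Before continuing, a quick numerical check of Lemma 5.3 (my addition, not part of the proof) can build confidence: evaluating L(h) - h^2/8 over a grid of \mu and h should never give a positive value.

```python
import numpy as np

# L(h) = -h*mu + ln((1 - mu) + mu*e^h) from Equation 98; Lemma 5.3 claims
# L(h) <= h^2 / 8 for any mu in [0, 1] and h >= 0.
mus = np.linspace(0.01, 0.99, 99)
hs = np.linspace(0.0, 10.0, 1001)
H, M = np.meshgrid(hs, mus)
L = -H * M + np.log((1.0 - M) + M * np.exp(H))
print(np.max(L - H**2 / 8.0))   # non-positive, up to floating-point error
```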

So, since L(0) = 0, L'(0) = 0, and L''(h) \le 1/4 for all h, the Taylor series expansion gives

    L(h) = L(0) + L'(0) h + L''(\xi) \frac{h^2}{2!} \le \frac{h^2}{8}

which proves Lemma 5.3.

Combining Equations 95, 96, 97 and 98, we have that

    E[f(X)] = P(X - n\mu \ge nt) \le e^{-hnt} \prod_{i=1}^n e^{h^2/8} = e^{-hnt} e^{nh^2/8} = e^{-hnt + nh^2/8}    (99)

Now we minimize this bound over all positive reals h. Taking the derivative of -hnt + nh^2/8 and setting it to zero, we find that e^{-hnt + nh^2/8} is minimized when h = 4t. Substituting this into Equation 99, we see that

    P(X - n\mu \ge nt) \le e^{-2nt^2}    (100)

which is our objective.

5.6.1 Extension of the Chernoff bound

One of the conditions for the Chernoff bound we have just proven is that 0 \le X_i \le 1. We can generalize the bound to relax this constraint. If X_1, X_2, ..., X_n are independent, identically distributed random variables such that E[X_i] = \mu < \infty and var(X_i) = \sigma^2 < \infty for all i, and a_i \le X_i \le b_i for some constants a_i and b_i for all i, then for all t > 0:

    P(|X - n\mu| \ge nt) \le 2 e^{-2n^2 t^2 / \sum_{i=1}^n (b_i - a_i)^2}    (101)

We will not prove this bound here.
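
To close, a simulation sketch (not in the original notes) of the generalized bound; the Uniform(-1, 3) choice is arbitrary, giving \mu = 1 and (b_i - a_i)^2 = 16 for every i:

```python
import numpy as np

rng = np.random.default_rng(8)

# X_i ~ Uniform(-1, 3): a_i = -1, b_i = 3, mu = 1, (b_i - a_i)^2 = 16.
n, t, trials = 100, 0.3, 50_000
x = rng.uniform(-1.0, 3.0, size=(trials, n)).sum(axis=1)

empirical = np.mean(np.abs(x - n * 1.0) >= n * t)
# Equation 101 with sum_i (b_i - a_i)^2 = 16 n.
bound = 2.0 * np.exp(-2.0 * n**2 * t**2 / (16.0 * n))
print(empirical, bound)   # ~0.009 vs bound ~0.65
```

As with the basic Chernoff bound, the exponent scales with n, so tightening t or growing n drives the bound down exponentially.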