ICME Refresher Course: Probability and Statistics
Stanford University

Luyang Chen (lych@stanford.edu)
September 20, 2016

1 Basic Probability Theory

1.1 Probability Spaces

A probability space is a triple (Ω, F, P), where Ω is a set of outcomes, F is a set of events, and P : F → [0, 1] is a function that assigns probabilities to events.

A σ-algebra (or σ-field) F is a collection of subsets of Ω that satisfies:
1. ∅, Ω ∈ F
2. if A ∈ F, then A^c ∈ F
3. if A_i ∈ F is a countable sequence of sets, then ∪_i A_i ∈ F

A measurable space (Ω, F) is a space on which we can put a measure. A measure µ : F → R is a nonnegative countably additive set function that satisfies:
1. µ(A) ≥ µ(∅) = 0 for all A ∈ F
2. if A_i ∈ F is a countable sequence of disjoint sets, then µ(∪_i A_i) = Σ_i µ(A_i)

If µ(Ω) = 1, we call µ a probability measure.

Let µ be a measure on (Ω, F). Then:
1. Monotonicity. If A ⊆ B, then µ(A) ≤ µ(B).
2. Subadditivity. If A ⊆ ∪_{m=1}^∞ A_m, then µ(A) ≤ Σ_{m=1}^∞ µ(A_m).
3. Continuity from below. If A_i ↑ A (i.e., A_1 ⊆ A_2 ⊆ ⋯ and ∪_i A_i = A), then µ(A_i) ↑ µ(A).
4. Continuity from above. If A_i ↓ A (i.e., A_1 ⊇ A_2 ⊇ ⋯ and ∩_i A_i = A), with µ(A_1) < ∞, then µ(A_i) ↓ µ(A).

1.2 Distributions

A random variable X is a real-valued function defined on Ω, such that for every Borel set B ⊆ R, we have X^{-1}(B) = {ω ∈ Ω : X(ω) ∈ B} ∈ F.

A random variable X is discrete if its possible values are finite or countably infinite. A random variable X is continuous if its possible values form an uncountable set and the probability that X equals any such value exactly is zero.

A trivial, but useful, example of a random variable is the indicator function of a set A ∈ F:

    1_A(ω) = 1 if ω ∈ A, and 0 if ω ∉ A.
If X is a random variable, then X induces a probability measure on R, called its distribution, by setting µ(A) = P(X ∈ A) for Borel sets A. The distribution of a random variable X is described by giving its distribution function F(x) = P(X ≤ x).

Any distribution function F has the following properties:
1. F is nondecreasing.
2. lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0.
3. F is right continuous, that is, lim_{y↓x} F(y) = F(x).
4. lim_{y↑x} F(y) = F(x−) = P(X < x).

Any function F satisfying 1–3 above is the distribution function of some random variable.

When the distribution function F(x) has the form

    F(x) = ∫_{−∞}^{x} f(y) dy,

we say that X has density function f.

1.3 Integration & Expected Value

Suppose f and g are integrable functions on (Ω, F, µ).
1. If f ≥ 0 a.e., then ∫ f dµ ≥ 0.
2. For all a ∈ R, ∫ af dµ = a ∫ f dµ.
3. ∫ (f + g) dµ = ∫ f dµ + ∫ g dµ.
4. If g ≤ f a.e., then ∫ g dµ ≤ ∫ f dµ.
5. If g = f a.e., then ∫ g dµ = ∫ f dµ.
6. |∫ f dµ| ≤ ∫ |f| dµ.

If X is a random variable on (Ω, F, P), then we define its expected value to be E[X] = ∫ X dP. E[X] does not always exist.

Jensen's inequality. Suppose φ is convex, and X and φ(X) are both integrable; then φ(E[X]) ≤ E[φ(X)].

Hölder's inequality. If p, q ∈ (1, ∞) with 1/p + 1/q = 1, then E[|XY|] ≤ (E[|X|^p])^{1/p} (E[|Y|^q])^{1/q}. The special case p = q = 2 is called the Cauchy–Schwarz inequality.

Markov's inequality. P(|X| ≥ a) ≤ a^{-1} E[|X|].

Chebyshev's inequality. P(|X| ≥ a) ≤ a^{-2} E[|X|^2].

If k is a positive integer, then E[X^k] is called the kth moment of X. The first moment E[X] is usually called the mean and denoted by µ. If E[X^2] < ∞, then the variance of X is defined to be var(X) = E[(X − µ)^2] = E[X^2] − µ^2.

The covariance of two random variables X and Y is defined as cov(X, Y) = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y.
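The Markov and Chebyshev inequalities are easy to sanity-check by simulation. The following sketch (an illustration added here, not part of the original notes; it uses only the Python standard library) estimates P(X ≥ a) for an Exponential(1) sample and compares it with both bounds:

```python
import random
import statistics

random.seed(0)
# Exponential(1) sample: E[X] = 1, E[X^2] = 2.
xs = [random.expovariate(1.0) for _ in range(100_000)]

a = 3.0
tail = sum(x >= a for x in xs) / len(xs)                      # empirical P(|X| >= a)
markov_bound = statistics.fmean(xs) / a                        # E[|X|] / a
chebyshev_bound = statistics.fmean(x * x for x in xs) / a**2   # E[|X|^2] / a^2

# Both inequalities hold for the empirical distribution as well.
assert tail <= markov_bound
assert tail <= chebyshev_bound
```

Since the empirical distribution of the sample is itself a probability measure, both inequalities are guaranteed to hold here exactly, not just approximately.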
1.4 Integration to the Limit

Dominated Convergence Theorem. If X_n → X a.s., |X_n| ≤ Y for all n, and E[Y] < ∞, then E[X_n] → E[X].

Monotone Convergence Theorem. If 0 ≤ X_n ↑ X, then E[X_n] ↑ E[X].

Fatou's Lemma. If X_n ≥ 0, then lim inf_n E[X_n] ≥ E[lim inf_n X_n].

1.5 Fubini's Theorem

Fubini's theorem. If f ≥ 0 or ∫ |f| dµ < ∞, then

    ∫_X ∫_Y f(x, y) µ_2(dy) µ_1(dx) = ∫_{X×Y} f dµ = ∫_Y ∫_X f(x, y) µ_1(dx) µ_2(dy).

Exercise. Let X be a nonnegative random variable. Show that

    E[X] = ∫_0^∞ P(X > t) dt.

2 Convergence

2.1 Convergence Concepts

Convergence in probability. We say that X_n → X in probability if for any ε > 0, lim_n P(|X_n − X| > ε) = 0.

Convergence in L^p. We say that X_n → X in L^p if lim_n E[|X_n − X|^p] = 0.

Convergence almost surely. We say that X_n → X a.s. if P(lim_n X_n = X) = 1.

Convergence in distribution. We say that X_n → X in distribution if their CDFs converge, i.e., F_n(x) → F(x) at any continuity point x of F.

Note. The following three statements are equivalent:
1. lim_n E[g(X_n)] = E[g(X)] for all bounded and continuous g.
2. lim_n E[e^{iαX_n}] = E[e^{iαX}] pointwise for all α ∈ R.
3. lim_n F_n(x) = F(x) at any continuity point x of F.

2.2 Relationship between Different Convergences

If X_n → X a.s., then X_n → X in probability.

Proof.
    P(∩_{ε>0} ∪_{N>0} ∩_{n≥N} {|X_n − X| < ε}) = 1
    ⟹ P(∪_{ε>0} ∩_{N>0} ∪_{n≥N} {|X_n − X| ≥ ε}) = 0
    ⟹ P(∩_{N>0} ∪_{n≥N} {|X_n − X| ≥ ε}) = 0 for every ε > 0
    ⟹ lim_n P(|X_n − X| ≥ ε) = 0.
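The tail-probability identity in the exercise can be checked numerically for the empirical distribution of a sample. The sketch below (an illustration added here, standard library only) compares the sample mean with a Riemann sum of the empirical tail probability P(X > t):

```python
import bisect
import random

random.seed(1)
n = 20_000
# Sorted Exponential(1) sample; sorting lets us count tail exceedances fast.
xs = sorted(random.expovariate(1.0) for _ in range(n))

# Left-hand side: the sample mean, i.e. E[X] under the empirical distribution.
mean_direct = sum(xs) / n

# Right-hand side: Riemann sum of the empirical tail P(X > t) over t in [0, 30).
dt = 0.01
grid = [i * dt for i in range(3000)]
# bisect_right counts samples <= t, so n minus that count is #{x_i > t}.
mean_tail = sum((n - bisect.bisect_right(xs, t)) / n * dt for t in grid)

# The two sides agree up to the discretization error of the Riemann sum.
assert abs(mean_direct - mean_tail) < 0.05
```

For the empirical measure the identity is exact; the only discrepancy comes from the step size dt and from truncating the integral at t = 30.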
X_n → X in probability does not imply X_n → X a.s.

Counterexample. For k = 0, 1, … and i = 0, 1, …, 2^k − 1, let

    f_{2^k + i}(t) = 1 if i/2^k ≤ t < (i+1)/2^k, and 0 otherwise.

Let X_n = f_n(U), where U is uniformly distributed on [0, 1]. Then X_n converges to 0 in probability (P(X_n ≠ 0) = 2^{−k} → 0), but not a.s., since for every u ∈ [0, 1) we have X_n(u) = 1 infinitely often.

If X_n → X in L^p, then X_n → X in probability.

Proof. P(|X_n − X| ≥ ε) ≤ E[|X_n − X|^p] / ε^p → 0.

X_n → X in probability does not imply X_n → X in L^p.

Counterexample. Let

    f_n(t) = n^{1/p} if 0 ≤ t < 1/n, and 0 otherwise,

and X_n = f_n(U), where U is uniformly distributed on [0, 1]. Then X_n converges to 0 in probability, but not in L^p, since E[|X_n|^p] = n · (1/n) = 1 for all n.

If X_n → X in probability, then X_n → X in distribution. If X_n → a (constant) in distribution, then X_n → a in probability.

2.3 Continuous Mapping Theorem and Slutsky's Theorem

Continuous Mapping Theorem. Suppose g : R → R is a continuous function.
1. If X_n → X in distribution, then g(X_n) → g(X) in distribution.
2. If X_n → X in probability, then g(X_n) → g(X) in probability.
3. If X_n → X a.s., then g(X_n) → g(X) a.s.

Slutsky's Theorem. If X_n → X in distribution and Y_n → a (constant) in probability, then X_n + Y_n → X + a and X_n Y_n → aX in distribution.

2.4 Delta Method

Theorem. Let X_1, X_2, … be a sequence of random variables such that √n (X_n − a) → Z in distribution for some random variable Z and constant a. Let g : R → R be continuously differentiable at a. Then

    √n (g(X_n) − g(a)) → g′(a) Z in distribution.

Proof. By the mean value theorem,

    √n (g(X_n) − g(a)) = g′(X̃_n) · √n (X_n − a),

where X̃_n lies between X_n and a. Since √n (X_n − a) → Z in distribution, X_n → a in probability, so X̃_n → a and hence g′(X̃_n) → g′(a) in probability. Then use Slutsky's Theorem.
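A small simulation illustrates the delta method (an illustrative sketch added here, not from the original notes): with X̄_n the mean of n Uniform(0, 1) draws and g = exp, the theorem predicts that √n (g(X̄_n) − g(1/2)) is approximately N(0, g′(1/2)^2 · 1/12):

```python
import math
import random
import statistics

random.seed(2)
n = 2000      # sample size per replication
reps = 4000   # number of Monte Carlo replications

mu, var = 0.5, 1 / 12   # mean and variance of Uniform(0, 1)
g = math.exp            # a smooth map, with g'(mu) = exp(mu)

vals = []
for _ in range(reps):
    xbar = sum(random.random() for _ in range(n)) / n
    vals.append(math.sqrt(n) * (g(xbar) - g(mu)))

# Delta method prediction: sqrt(n)(g(Xbar_n) - g(mu)) => N(0, g'(mu)^2 * var).
predicted_sd = math.exp(mu) * math.sqrt(var)
empirical_sd = statistics.stdev(vals)
assert abs(empirical_sd - predicted_sd) / predicted_sd < 0.1
```

The 10% tolerance is generous relative to the Monte Carlo error, so the assertion should pass comfortably for these sample sizes.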
2.5 Weak Laws of Large Numbers (WLLN)

Theorem. Let X_1, X_2, … be uncorrelated random variables with E[X_i] = µ and var(X_i) ≤ C < ∞. If S_n = X_1 + ⋯ + X_n, then as n → ∞, S_n/n → µ in L^2 and also in probability.

Proof. E[S_n/n] = µ, and

    E[|S_n/n − µ|^2] = var(S_n/n) = (1/n^2) var(S_n) = (1/n^2) Σ_{i=1}^n var(X_i) ≤ C/n → 0.

Theorem. Let X_1, X_2, … be iid random variables with E[X_i] = µ and E[|X_i|] < ∞. If S_n = X_1 + ⋯ + X_n, then as n → ∞, S_n/n → µ in probability.

Proof. Write

    S_n/n − µ = (1/n) Σ_{i=1}^n (X_i 1_{|X_i|≤n} − E[X_1 1_{|X_1|≤n}])
              + (1/n) Σ_{i=1}^n X_i 1_{|X_i|>n}
              + (E[X_1 1_{|X_1|≤n}] − E[X_1])
              = I + II + III.

For the first term, splitting at ε√n,

    E[|I|^2] = (1/n) E[|X_1 1_{|X_1|≤n} − E[X_1 1_{|X_1|≤n}]|^2] ≤ (1/n) E[|X_1|^2 1_{|X_1|≤n}]
             = (1/n) E[|X_1|^2 1_{|X_1|≤ε√n}] + (1/n) E[|X_1|^2 1_{ε√n<|X_1|≤n}]
             ≤ ε^2 + E[|X_1| 1_{|X_1|>ε√n}],

and the last expectation tends to 0 as n → ∞, for every ε > 0. For the second term,

    E[|II|] ≤ (1/n) Σ_{i=1}^n E[|X_i| 1_{|X_i|>n}] = E[|X_1| 1_{|X_1|>n}] → 0.

For the third term,

    |III| = |E[X_1 1_{|X_1|≤n}] − E[X_1]| ≤ E[|X_1| 1_{|X_1|>n}] → 0.

Note. Neither independence of the X_i nor finiteness of their variance is needed for the validity of the WLLN.

2.6 Strong Laws of Large Numbers (SLLN)

Theorem. Let X_1, X_2, … be iid random variables with E[X_i] = µ and E[|X_i|] < ∞. If S_n = X_1 + ⋯ + X_n, then as n → ∞, S_n/n → µ a.s.

If the iid random variables {X_i} have finite fourth-order moments, E[|X_i|^4] < ∞ or E[|X_i − µ|^4] < ∞, then an application of the Chebyshev inequality with p = 4 gives the needed estimate, and we have the SLLN in this case. Of course, this is only a sufficient condition for its validity. As with the WLLN, it is enough that E[|X_i|] < ∞.

2.7 Central Limit Theorem

Theorem. Let X_1, X_2, … be iid random variables with E[X_i] = µ and var(X_i) = σ^2 < ∞. If S_n = X_1 + ⋯ + X_n, then √n (S_n/n − µ) → N(0, σ^2) in distribution.

Proof.

    E[e^{iα √n (S_n/n − µ)}] = E[e^{i(α/√n) Σ_{j=1}^n (X_j − µ)}] = φ(α/√n)^n,

where φ(α) = E[e^{iα(X_1 − µ)}]. Then φ(0) = 1, φ′(0) = 0, φ″(0) = −σ^2. By Taylor's theorem,

    φ(α/√n) = 1 + φ″(α_n) α^2 / (2n), where 0 < α_n < α/√n,

so

    φ(α/√n)^n → e^{−α^2 σ^2 / 2},

which is the characteristic function of N(0, σ^2).
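The CLT can also be seen empirically (an illustrative sketch added here, standard library only): standardize sums of iid Uniform(0, 1) variables and check that roughly 68.3% of the standardized values fall within one standard deviation, as for N(0, 1):

```python
import math
import random

random.seed(3)
n, reps = 1000, 5000
mu, sigma = 0.5, math.sqrt(1 / 12)   # mean and sd of Uniform(0, 1)

zs = []
for _ in range(reps):
    s = sum(random.random() for _ in range(n))
    # sqrt(n)(S_n/n - mu)/sigma should be approximately N(0, 1) by the CLT.
    zs.append(math.sqrt(n) * (s / n - mu) / sigma)

# For a standard normal, P(|Z| <= 1) ≈ 0.6827.
frac_within_1sd = sum(abs(z) <= 1 for z in zs) / reps
assert abs(frac_within_1sd - 0.6827) < 0.03
```

With n = 1000 the normal approximation error is negligible next to the Monte Carlo noise, so the 0.03 tolerance is safe.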
3 Statistics

3.1 Probability and Statistics

The basic problem of probability is: Given the distribution of the data, what are the properties (e.g. expectation, variance, etc.) of the outcomes?

The basic problem of statistics is: Given the outcomes, what can we say about the distribution of the data? (Given X_1, …, X_n ∼ F, what can we say about F?)

3.2 Fundamental Concepts

Point estimation involves the use of sample data to calculate a single value (known as a statistic) which is to serve as a best guess or best estimate of an unknown (fixed or random) population parameter. Let X_1, …, X_n be iid data points from some distribution F(x; θ_0). A point estimator θ̂_n of parameter θ_0 is some function of X_1, …, X_n: θ̂_n = g(X_1, …, X_n). We introduce the following two methods: Method of Moments and Maximum Likelihood.

In statistics, the bias of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator with zero bias is called unbiased; otherwise the estimator is said to be biased.

Let θ̂_n be an estimate of a parameter θ_0 based on a sample of size n. Then θ̂_n is said to be consistent in probability if θ̂_n converges in probability to θ_0 as n approaches infinity.

A 1 − α confidence interval for a parameter θ is an interval C_n = (a, b), where a = a(X_1, …, X_n) and b = b(X_1, …, X_n) are functions of the data, such that P(θ ∈ C_n) ≥ 1 − α.

3.3 The Method of Moments

The kth moment of a probability law is defined as µ_k = E[X^k], where X is a random variable following that probability law. If X_1, …, X_n are iid random variables from that distribution, the kth sample moment is defined as µ̂_k = (1/n) Σ_{i=1}^n X_i^k. We can view µ̂_k as an estimate of µ_k.

The method of moments estimates parameters by finding expressions for them in terms of the lowest possible order moments and then substituting sample moments into the expressions.

Example. The first and second moments of the normal distribution N(µ, σ^2) are

    µ_1 = E[X] = µ,
    µ_2 = E[X^2] = µ^2 + σ^2.

Therefore, µ = µ_1 and σ^2 = µ_2 − µ_1^2. The corresponding estimates of µ and σ^2 from the sample moments are

    µ̂ = (1/n) Σ_{i=1}^n X_i = X̄,
    σ̂^2 = (1/n) Σ_{i=1}^n X_i^2 − ((1/n) Σ_{i=1}^n X_i)^2 = (1/n) Σ_{i=1}^n (X_i − X̄)^2.
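As a quick illustration of the method of moments (a sketch added here, not in the original notes), the estimates µ̂ = m_1 and σ̂^2 = m_2 − m_1^2 recover the parameters of a simulated normal sample:

```python
import random

random.seed(4)
mu_true, sigma_true = 2.0, 1.5
xs = [random.gauss(mu_true, sigma_true) for _ in range(100_000)]
n = len(xs)

m1 = sum(xs) / n                 # first sample moment
m2 = sum(x * x for x in xs) / n  # second sample moment

mu_hat = m1
sigma2_hat = m2 - m1 ** 2        # equals (1/n) * sum((x_i - xbar)^2)

assert abs(mu_hat - mu_true) < 0.05
assert abs(sigma2_hat - sigma_true ** 2) < 0.1
```

With 100,000 samples the standard errors of both estimates are well below the tolerances used in the assertions.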
Question. Are the two estimators above unbiased? Are they consistent? What are the confidence intervals?

    E[µ̂] = µ,
    σ̂^2 = (1/n) Σ_{i=1}^n (X_i − µ)^2 − (X̄ − µ)^2,
    E[σ̂^2] = σ^2 − (1/n) σ^2 = (1 − 1/n) σ^2.

So µ̂ is unbiased and σ̂^2 is biased. Both µ̂ and σ̂^2 are consistent estimators. A 1 − α confidence interval for µ is

    [µ̂ − (σ/√n) Φ^{-1}(1 − α/2), µ̂ + (σ/√n) Φ^{-1}(1 − α/2)]

(and for σ^2, use n σ̂^2 / σ^2 ∼ χ^2(n − 1)).

3.4 The Method of Maximum Likelihood

Suppose that random variables X_1, …, X_n have a joint density f(x_1, …, x_n | θ). Given observed values X_i = x_i, i = 1, …, n, the likelihood of θ as a function of x_1, …, x_n is defined as L(θ) = f(x_1, …, x_n | θ). If the X_i are assumed to be iid, the likelihood is L(θ) = Π_{i=1}^n f(x_i | θ), and the log likelihood is l(θ) = log L(θ) = Σ_{i=1}^n log f(x_i | θ).

The maximum likelihood estimate (MLE) of θ is the value of θ that maximizes the likelihood, that is, makes the observed data most probable or most likely. The estimates obtained by the method of maximum likelihood are not always the same as those obtained by the method of moments.

Example. If X_1, …, X_n are iid N(µ, σ^2), their joint density is the product of their marginal densities:

    f(x_1, …, x_n | µ, σ) = Π_{i=1}^n (1/√(2πσ^2)) exp(−(1/2) [(x_i − µ)/σ]^2).

The log likelihood is thus

    l(µ, σ) = −n log σ − (n/2) log 2π − (1/(2σ^2)) Σ_{i=1}^n (x_i − µ)^2.

The partials with respect to µ and σ are

    ∂l/∂µ = (1/σ^2) Σ_{i=1}^n (X_i − µ),
    ∂l/∂σ = −n/σ + (1/σ^3) Σ_{i=1}^n (X_i − µ)^2.

Setting them to zero gives

    µ̂_MLE = X̄,
    σ̂_MLE = √((1/n) Σ_{i=1}^n (X_i − X̄)^2).

The following are the good properties of the MLE:
1. Under appropriate smoothness conditions on f, the MLE from an iid sample is consistent.
2. Under appropriate smoothness conditions on f, √n (θ̂ − θ_0) → N(0, 1/I(θ_0)) in distribution.
3. The MLE achieves the Cramér–Rao lower bound.

Fisher Information:

    I(θ) = E_θ[(∂/∂θ log f(X | θ))^2] = −E_θ[∂^2/∂θ^2 log f(X | θ)].
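The closed-form normal MLE above can be verified numerically (an illustrative sketch added here, standard library only): since (µ̂_MLE, σ̂_MLE) maximizes the log likelihood l(µ, σ), any nearby perturbation can only decrease it:

```python
import math
import random

random.seed(5)
mu0, sigma0 = 1.0, 2.0
xs = [random.gauss(mu0, sigma0) for _ in range(10_000)]
n = len(xs)

# Closed-form MLEs from the notes.
mu_mle = sum(xs) / n
sigma_mle = math.sqrt(sum((x - mu_mle) ** 2 for x in xs) / n)

def loglik(mu, sigma):
    """Normal log likelihood l(mu, sigma) for the sample xs."""
    return (-n * math.log(sigma) - n / 2 * math.log(2 * math.pi)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma ** 2))

# The closed form is the exact maximizer, so perturbations can't beat it.
best = loglik(mu_mle, sigma_mle)
for dmu in (-0.01, 0.01):
    for ds in (-0.01, 0.01):
        assert loglik(mu_mle + dmu, sigma_mle + ds) <= best
```

This is a deterministic check: it holds for any sample, because the perturbed points are never the argmax of the likelihood surface.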
3.5 Hypothesis Testing

H_0: the null hypothesis. H_1 (or H_A): the alternative hypothesis.

Rejecting H_0 when it is true is called a type I error. The probability of a type I error is called the significance level of the test and is usually denoted by α. Accepting the null hypothesis when it is false is called a type II error; its probability is usually denoted by β.

The set of values of the test statistic that leads to rejection of the null hypothesis is called the rejection region, and the set of values that leads to acceptance is called the acceptance region. The probability distribution of the test statistic when the null hypothesis is true is called the null distribution. The p-value is the probability of a result as or more extreme than that actually observed, if the null hypothesis were true.

Some familiar hypothesis tests: z-test, Student's t-test, Generalized Likelihood Ratio Test.

Suppose that the observations X = (X_1, …, X_n) have a joint density function f(x_1, …, x_n | θ). H_0 specifies that θ ∈ ω_0 and H_1 specifies that θ ∈ ω_1, where ω_0 ∩ ω_1 = ∅ and Ω = ω_0 ∪ ω_1. The test statistic is

    Λ = max_{θ∈ω_0} L(θ) / max_{θ∈Ω} L(θ).

Under smoothness conditions on the probability density, the null distribution of −2 log Λ tends to a chi-square distribution with degrees of freedom equal to dim Ω − dim ω_0 as the sample size tends to infinity.

3.6 Linear Regression

Consider the following regression model:

    Y = Xβ + ε,

where Y = (y_1, …, y_n)^T, β = (β_1, …, β_p)^T, ε = (ε_1, …, ε_n)^T, and X is the n × p design matrix with rows (x_i1, …, x_ip), i = 1, …, n.

The least squares estimator is

    β̂_LS = arg min_β ||Y − Xβ||_2^2.

Consider the model above with the following assumptions:
1. X is a non-random matrix with full column rank.
2. E[ε] = 0.
3. cov(ε_i, ε_j) = σ^2 δ_ij.
4. ε_i iid N(0, σ^2).

Then β̂_LS = (X^T X)^{-1} X^T Y. Under assumptions 1–2, β̂_LS is an unbiased estimator.
Under assumptions 1–3,

    Cov(β̂_LS) = σ^2 (X^T X)^{-1},

and an unbiased estimator of σ^2 is

    s^2 = RSS/(n − p) = (Y − X β̂_LS)^T (Y − X β̂_LS) / (n − p).

Under assumptions 1 and 4,

    β̂_LS ∼ N(β, σ^2 (X^T X)^{-1}),
    RSS/σ^2 ∼ χ^2_{n−p},
    (β̂_LS,j − β_j) / (s √c_jj) ∼ t_{n−p},

where c_jj is the jth element on the diagonal of (X^T X)^{-1}.
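For a concrete instance of least squares (an illustrative sketch added here, not from the original notes), the normal equations (X^T X) β̂ = X^T Y can be solved by hand for simple linear regression with design columns (1, x); s^2 = RSS/(n − p) then estimates σ^2:

```python
import random

random.seed(6)
n = 5000
beta0, beta1, sigma = 1.0, 3.0, 0.5   # true intercept, slope, noise sd

xs = [random.uniform(0, 1) for _ in range(n)]
ys = [beta0 + beta1 * x + random.gauss(0, sigma) for x in xs]

# Normal equations (X^T X) b = X^T Y for the design X = [1, x], solved via
# the explicit 2x2 inverse.
sx = sum(xs); sxx = sum(x * x for x in xs)
sy = sum(ys); sxy = sum(x * y for x, y in zip(xs, ys))
det = n * sxx - sx * sx
b0_hat = (sxx * sy - sx * sxy) / det
b1_hat = (n * sxy - sx * sy) / det

# Unbiased noise-variance estimate s^2 = RSS / (n - p), here p = 2.
rss = sum((y - b0_hat - b1_hat * x) ** 2 for x, y in zip(xs, ys))
s2 = rss / (n - 2)

assert abs(b0_hat - beta0) < 0.1
assert abs(b1_hat - beta1) < 0.1
assert abs(s2 - sigma ** 2) < 0.05
```

For larger p one would solve the normal equations with a linear-algebra routine (e.g. a QR factorization) rather than an explicit inverse, but the 2 × 2 case makes the estimator transparent.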