Basic of Probability Theory for Ph.D. students in Education, Social Sciences and Business (Shing On LEUNG and Hui Ping WU) (May 2015)


This is a series of 3 talks, respectively on:
A. Probability Theory
B. Hypothesis Testing
C. Bayesian Inference

Lecture 3: Bayesian Inference

(Most statistical details can be found via web search. These lectures emphasize conceptual understanding rather than technical details.)

C. Bayesian approach

Bayes' Theorem (here "/" means "given")
Pr(A/B) = Pr(A & B) / Pr(B) = Pr(B/A) * Pr(A) / Pr(B)
Pr(B) = Pr(B/A) * Pr(A) + Pr(B/not A) * Pr(not A)

For convenience, let A = θ (model) and B = X (observations). Then
Pr(θ/X) = Pr(θ & X) / Pr(X) = Pr(X/θ) * Pr(θ) / Pr(X)

The roles of X and θ are interchanged

Classical, Pr(X/θ): sampling distribution of the data. Frequentists implicitly assume many realizations of the data (e.g. each day it can rain or not rain), but in reality there is only one (each day only one event happens: rain or no rain) (Yuen 2011).

Bayesian, Pr(θ/X): uncertainty over the parameter space. In Bayesian terms, rain yesterday is X (data; it happened, only once); tomorrow's weather is y (predictive), which is governed by the parameter θ.

An easy example:


M1 = 1 black and 9 white balls, θ = θ1 = 0.1
M2 = 9 black and 1 white ball, θ = θ2 = 0.9

Procedure: select a bag at random (Pr(θ1) = Pr(θ2) = 0.5), draw 1 ball, and guess which bag (M1 or M2) it came from.
X = B if the selected ball is black; X = W if it is white.

Pr(X=B) = Pr(X=B/θ1) * Pr(θ1) + Pr(X=B/θ2) * Pr(θ2) = 0.1 * 0.5 + 0.9 * 0.5 = 0.5
Pr(θ1/X=B) = Pr(X=B/θ1) * Pr(θ1) / Pr(X=B) = 0.05 / 0.5 = 0.1
Pr(θ2/X=B) = Pr(X=B/θ2) * Pr(θ2) / Pr(X=B) = 0.45 / 0.5 = 0.9
Pr(θ1/B) vs Pr(θ2/B) = 0.1 vs 0.9
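
The two-bag calculation can be checked in a few lines. This is a minimal Python sketch; the function name `posterior` and the dictionary layout are illustrative choices, not part of the lecture.

```python
def posterior(prior, likelihood):
    """Return Pr(model / X) for each model via Bayes' theorem.

    prior:      {model: Pr(model)}
    likelihood: {model: Pr(X / model)} for the observed X
    """
    # Pr(X) = sum over models of Pr(X / model) * Pr(model)
    pr_x = sum(likelihood[m] * prior[m] for m in prior)
    return {m: likelihood[m] * prior[m] / pr_x for m in prior}


prior = {"M1": 0.5, "M2": 0.5}       # a bag is chosen at random
pr_black = {"M1": 0.1, "M2": 0.9}    # Pr(X=B / model)
pr_white = {"M1": 0.9, "M2": 0.1}    # Pr(X=W / model)

post_black = posterior(prior, pr_black)
post_white = posterior(prior, pr_white)
print("Black ball:", post_black)     # close to M1: 0.1, M2: 0.9
print("White ball:", post_white)     # close to M1: 0.9, M2: 0.1
```

Changing `prior` (for example to 0.9 and 0.1) shows how strongly the guess depends on the prior when only one ball is observed.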

Interpretation

If the ball is black, we guess it comes from M2; otherwise, from M1.
We make inferences about the model (M) based on our observations (X).
X can be medical symptoms (data), M can be the disease.
X can be students' achievements, M can be SES, effort, esteem, attitude, etc.

* Important: Classical Pr(X/θ) vs Bayesian Pr(θ/X)

Prior and Posterior (in continuous form)
p(θ/x) = p(x/θ) * p(θ) / p(x), where p(x) = ∫ p(x/θ) * p(θ) dθ
Pr(M) or p(θ) is the prior.
Pr(M/X) or p(θ/x) is the posterior.

Simple example with Hypothesis Testing (Classical Approach)

H0: The ball is from M1
H1: The ball is from M2

Decision rule: if the ball is black (X=B), we reject H0 and say the ball is from M2.
Under H0, Pr(X=B/M1) = 0.1 > 0.05, so at the 5% level we do not reject H0 (unless, say, M1 held only 5 black balls out of 100).
With p = 0.1, we can use the term "marginally significant". Still, we are protecting H0.

Under H1, Pr(X=B/M2) = 0.9 (the power of the test is good)
Pr(X=W/M2) = 0.1 (this is the type 2 error)
But the test is set up without knowing what happens in M2, as we concentrate on M1. Hence, in general, power can be low and type 2 error can be large.

The classical approach makes no statement on Pr(θ1/B) vs Pr(θ2/B) = 0.1 vs 0.9.

Credible Interval vs Confidence Interval

A credible interval (CreI) is an interval under Pr(θ/X) (the posterior) that specifies the most probable range (say 95%) of a parameter. In a multi-dimensional parameter space, it is a credible region.

A 95% confidence interval (ConI): if we had many samples, 95% of the ConIs would contain the true value (in practice we have only one sample). Here, parameters are fixed and intervals are random.

CreI is not equal to ConI because of (i) the existence of a prior, and (ii) the treatment of nuisance parameters.
For (i), if priors are "reasonably unbiased", the differences are minor.
For (ii), ConI (or the classical approach) takes the mle values of nuisance parameters. The Bayesian approach has to integrate over them (and that is the most difficult task).

Simple Example: t-test

Golf scores for males and females are:
Male: 82, 80, 85, 85, 78, 87, 82
Female: 75, 76, 80, 77, 80, 77, 73

Sample difference: y = x̄1 - x̄2 ≈ 5.86
n = 7 for both sexes (equal sizes are not necessary)
t-test from SPSS, path: Analyze > Compare Means > Independent-Samples T Test
t = 3.83, df = 12, p-value = 0.002, 95% CI (2.52, 9.19)
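
The SPSS figures can be cross-checked by hand. The sketch below uses only the Python standard library and assumes the equal-variance (pooled) form of the test; the critical value 2.179 is the usual table entry for 12 degrees of freedom and is hard-coded rather than computed.

```python
import math
from statistics import mean, variance

male = [82, 80, 85, 85, 78, 87, 82]
female = [75, 76, 80, 77, 80, 77, 73]

n1, n2 = len(male), len(female)
diff = mean(male) - mean(female)      # sample difference y
df = n1 + n2 - 2

# Pooled variance for the equal-variance two-sample t-test
pooled_var = ((n1 - 1) * variance(male) + (n2 - 1) * variance(female)) / df
se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = diff / se

# 97.5% quantile of the t distribution with 12 df (from a standard t table)
T_CRIT_12 = 2.179
ci = (diff - T_CRIT_12 * se, diff + T_CRIT_12 * se)

print(f"y = {diff:.2f}, t = {t_stat:.2f}, df = {df}")
print(f"95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

The p-value is omitted because it needs the t distribution function, which the standard library does not provide.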

Bayesian Inference of this example

x1 ~ N(µ1, σ1²) and x2 ~ N(µ2, σ2²), so x̄1 ~ N(µ1, σ1²/n1) and x̄2 ~ N(µ2, σ2²/n2)
Assume σ1 and σ2 are known (this is relaxed later)
Let θ = µ1 - µ2 be the parameter for the difference, and y = x̄1 - x̄2 be the random variable
y/θ ~ N(θ, σ1²/n1 + σ2²/n2)
p(θ/y) = p(y/θ) * p(θ) / p(y); or p(θ/y) ∝ p(y/θ) * p(θ)
Assume: prior θ ~ N(0, σ3²); mean 0 implies no bias; σ3 is known
If everything is Normal, the posterior p(θ/y) is Normal, and θ/y ~ N(µ, σ²), where
µ = y * σ3² / (σ3² + s²) and σ² = σ3² * s² / (σ3² + s²), with s² = σ1²/n1 + σ2²/n2
Credible Interval is CreI (U1, U2) = (µ - 2*σ, µ + 2*σ)
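
To see how the prior spread moves the posterior, the standard Normal-Normal update can be sketched as follows. The function names, the value `s2 = 2.34` (the squared standard error from the t-test, treated here as known) and the list of σ3 values are all assumptions made for illustration, so the output is not meant to reproduce any table from the lecture.

```python
import math


def normal_posterior(y, s2, prior_sd):
    """Posterior mean and sd of theta when y/theta ~ N(theta, s2)
    and the prior is theta ~ N(0, prior_sd**2)."""
    w = prior_sd**2 / (prior_sd**2 + s2)   # weight given to the data
    post_mean = w * y                      # shrinks y towards the prior mean 0
    post_sd = math.sqrt(w * s2)            # posterior variance is w * s2
    return post_mean, post_sd


def credible_interval(post_mean, post_sd):
    """Approximate 95% credible interval, mean +/- 2 sd."""
    return post_mean - 2 * post_sd, post_mean + 2 * post_sd


y = 41 / 7    # sample difference from the golf example
s2 = 2.34     # assumed known variance of y

for prior_sd in (1.0, 3.0, 10.0, 100.0):   # hypothetical values of sigma_3
    m, sd = normal_posterior(y, s2, prior_sd)
    lo, hi = credible_interval(m, sd)
    print(f"sigma3 = {prior_sd:6.1f}  mean = {m:.2f}  CreI = ({lo:.2f}, {hi:.2f})")
```

A small σ3 pulls the interval towards 0, while a very large σ3 leaves it centred near y with a width governed by s² alone.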

But "everything is Normal" is not always the case

If σ is unknown, p(θ/y) is not Normal even if everything else is Normal.
In general, the posterior p(θ/y) can be very complicated. This is one of the difficulties in using the Bayesian approach.

Prior and Credible Interval

σ3     µ      CreI (U1, U2)
-      -      (0.38, 3.13)
-      -      (0.80, 6.58)
-      -      (1.00, 8.28)
-      -      (1.24, 10.20)
-      -      (1.26, 10.43)
CI     5.86   (2.52, 9.19)

Interpretation:

The final result is a mixture of (i) the prior and (ii) the observations.
The prior mean is 0. When the prior variance, σ3², is small, the effect of the prior is stronger and the final result will be closer to 0.
If σ3 is known and very large, we make no strong assumption about prior belief (unbiased or non-informative prior), and the final result depends largely on the observations.
Generally, a non-informative (or large-variance) prior makes the Bayesian result similar to the classical one.
Another factor: σ3 is assumed to be known, which is unrealistic. Such quantities are called nuisance parameters (or hyper-parameters).
Everyone uses computer packages; later, we introduce a web-based Bayesian t-test calculator. There are many others.

Nuisance parameters (Nuisance but important!)

Classically, we take nuisance parameters, such as σ3, at their mle.
Bayesian evidence: p(θ/x) = p(x/θ) * p(θ) / p(x), or
Posterior = Evidence * Prior / p(x)
p(x/θ) = ∫ p(x, σ3/θ) dσ3 is called the evidence.
To compute the evidence, we need to integrate (∫) over all possible values of σ3 in the parameter space.
"... this averaging automatically controls the complexity of different models..." (Wetzels et al 2011)

The following 6 terms carry the same technical implications:
i. penalty towards extra parameters
ii. over-fitting
iii. complexity of the models
iv. weighted average of the likelihood
v. posterior probability
vi. evidence

That simple sign ∫ can imply 100-dimensional integrals!
In Bayesian terms: more parameters, more ∫, more complex models, larger prediction errors, etc. Eventually, such models are not preferred.
Approximation methods are used, say MCMC, BIC (introduced shortly after AIC), the Laplace method, etc. (web-search)

Bayes Factor

Recall: Posterior = Evidence * Prior / p(x)
To compare two models, M0 and M1, we compare their posteriors:
Posterior of M0 = Evidence(M0) * Prior(M0) / p(x)
Posterior of M1 = Evidence(M1) * Prior(M1) / p(x)
Since (i) p(x) is the same, and (ii) if we further assume the two models are equally likely a priori, comparing posteriors becomes comparing evidence.
So, the Bayes Factor is:
B01 = Evidence(M0) / Evidence(M1), or
B01 = (∫ Pr(x/θ, M0) * Pr(θ/M0) dθ) / (∫ Pr(x/θ, M1) * Pr(θ/M1) dθ)
In the Bayesian approach, there is no preference between M0 and M1, unlike the frequentist approach (which protects H0 and tries to reject it).
But computing Evidence(M0) and Evidence(M1) requires integrating (∫) over all nuisance parameters!
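
As a toy illustration of "evidence as an integral", the sketch below compares M0: θ = 0 with M1: θ ~ N(0, σ3²) for a single observation y with known variance s². This simplified known-variance setup, the function names and the numbers y = 41/7, s² = 2.34 and σ3 = 3 are assumptions for illustration; it is not the Bayesian t-test discussed later.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def evidence_m0(y, s2):
    # M0 fixes theta = 0, so there is nothing to integrate over
    return normal_pdf(y, 0.0, s2)


def evidence_m1(y, s2, prior_sd, steps=20001, width=10.0):
    # M1: integrate p(y/theta) * p(theta) over theta with the trapezoidal
    # rule on [-width * prior_sd, +width * prior_sd]
    lo, hi = -width * prior_sd, width * prior_sd
    h = (hi - lo) / (steps - 1)
    total = 0.0
    for i in range(steps):
        theta = lo + i * h
        f = normal_pdf(y, theta, s2) * normal_pdf(theta, 0.0, prior_sd**2)
        total += f / 2 if i in (0, steps - 1) else f
    return total * h


def bayes_factor_01(y, s2, prior_sd):
    return evidence_m0(y, s2) / evidence_m1(y, s2, prior_sd)


y, s2, prior_sd = 41 / 7, 2.34, 3.0
b01 = bayes_factor_01(y, s2, prior_sd)
# For this Normal case the integral has a closed form: y ~ N(0, s2 + prior_sd**2)
closed_form_m1 = normal_pdf(y, 0.0, s2 + prior_sd**2)
print(f"B01 = {b01:.4f}, B10 = {1 / b01:.1f}")
```

With one parameter the integral is cheap; the same averaging over many nuisance parameters is what makes real Bayes factors hard to compute.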

The Bayes Factor can be applied in many situations, some with many parameters (say Factor Analysis, SEM, etc.) and some with very few parameters (say the Bayesian t-test).
Now, if all nuisance parameters take their mle values (instead of ∫), the Bayes Factor equals the Likelihood Ratio (and we can use the LRT and then the classical approach).

Interpretations of Bayes Factor

B01          Interpretation
< 1/10       Substantially prefer M1, more than 10 times as likely
1/10 ~ 1/3   Slightly prefer M1, between 3 and 10 times as likely
1/3 ~ 3      Indifferent (within 3 times as likely)
3 ~ 10       Slightly prefer M0, between 3 and 10 times as likely
> 10         Substantially prefer M0, more than 10 times as likely

B01 = Evidence(M0) / Evidence(M1); B10 = Evidence(M1) / Evidence(M0); so B01 = 1 / B10
(Because of the reciprocal relationship, the strength of the index is the same for B01 and 1/B01; say, 1/3 and 3 indicate the same strength.)
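
The table can be turned into a small lookup. This is a sketch only: the function name is hypothetical, and the handling of values falling exactly on a boundary (1/10, 1/3, 3, 10) is an assumption, since the table does not specify it.

```python
def interpret_b01(b01):
    """Map a Bayes factor B01 to the verbal categories of the table."""
    if b01 <= 0:
        raise ValueError("A Bayes factor must be positive")
    if b01 < 1 / 10:
        return "Substantially prefer M1"
    if b01 < 1 / 3:
        return "Slightly prefer M1"
    if b01 <= 3:
        return "Indifferent"
    if b01 <= 10:
        return "Slightly prefer M0"
    return "Substantially prefer M0"


for value in (1 / 20, 1 / 5, 1, 5, 20):
    print(f"B01 = {value:.3f}: {interpret_b01(value)}")
```

A factor reported as B10 should be inverted first, for example `interpret_b01(1 / b10)`.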

Bayesian t-test

There are many Bayesian t-tests; Rouder et al (2009) is only one convenient example.
Rouder et al (2009) and Wetzels et al (2011)

Reference:
Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009) Bayesian t-tests for Accepting and Rejecting the Null Hypothesis. Psychonomic Bulletin & Review, 16,
Wetzels, Ruud; Matzke, Dora; Lee, Michael D.; Rouder, Jeffrey N.; Iverson, Geoffrey J. and Wagenmakers, Eric-Jan (2011) Statistical Evidence in Experimental Psychology: An Empirical Comparison Using 855 t Tests. Perspectives on Psychological Science, 6(3)

µ: mean of the difference; σ: standard deviation of the difference
δ = µ/σ: effect size of the difference
M0: δ = 0 (hence µ = 0) (the so-called null)
M1: δ ~ N(0, σδ²), where σδ is the standard deviation of δ (alternative)

Note: effect size δ gives a standard way to compare means of different populations, and researchers have an intrinsic scale about the ranges of effect sizes that applies broadly (Rouder et al, 2009)

For M0, δ = 0, and the evidence under M0 only takes δ = 0.
For M1, δ ~ N(0, σδ²), and the evidence under M1 automatically averages over all δ in the whole range. So, there is one more integration under M1.

In the above example, we only need to input n1 = n2 = 7, t = 3.83, r = 0.707. The results are:
Scaled JZS Bayes Factor = 14.7
Scaled-Information Bayes Factor = 17.7
But we need to take the reciprocal of the above values, because the web calculator calculates B10 instead of B01, so
Scaled JZS Bayes Factor = 1/14.7 ≈ 0.068

Scaled-Information Bayes Factor = 1/17.7 ≈ 0.056

We expect more packages for Bayesian applications in the future, but the principles remain the same.
In most cases, Bayesian t-tests are similar to the usual t-test. But in marginal situations (e.g. p = 0.06), results can be different.
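
For readers without access to the web calculator, the Bayes factor under a Normal prior on the effect size has a closed form, because under M1 the t statistic is a central t variable inflated by sqrt(1 + N*r²). The sketch below is an illustration under stated assumptions, not the calculator's code: the function name is hypothetical, N = n1*n2/(n1+n2) is the effective sample size for two groups, and r = 1.0 is chosen here (the slide lists r = 0.707). The JZS factor needs a numerical integral and is not covered.

```python
import math


def scaled_information_b01(t, n1, n2, r=1.0):
    """B01 for a two-sample t statistic, with delta ~ N(0, r**2) under M1."""
    n_eff = n1 * n2 / (n1 + n2)      # effective sample size for two groups
    nu = n1 + n2 - 2                 # degrees of freedom
    k = 1 + n_eff * r**2             # variance inflation of t under M1
    null_term = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    alt_term = (1 + t**2 / (k * nu)) ** (-(nu + 1) / 2) / math.sqrt(k)
    return null_term / alt_term


b01 = scaled_information_b01(t=3.83, n1=7, n2=7, r=1.0)
b10 = 1 / b01
print(f"B01 = {b01:.3f}, B10 = {b10:.1f}")
```

With r = 1.0 the result for the golf data is a B10 between 17.5 and 18, in line with the scaled-information value quoted above; a smaller r gives a smaller B10.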

Summary (statistical terms differ from daily English)

The Bayesian approach makes inferences based on the data we observed (i.e. Pr(θ/X), the posterior), which is more natural.
In doing so, we need to specify a prior, which is mostly unbiased or non-informative.
Posteriors can have distributions too complicated to be solved analytically, which hinders the Bayesian approach and its packages.
Handling nuisance parameters is the major technical difficulty, but it automatically handles model complexity (penalizing complicated models) and gives direct probability statements (Pr(Model/Data)).

*** Important: Classical Pr(X/θ) vs Bayesian Pr(θ/X)

Q&A

Shing On LEUNG
Hui Ping WU
