Bayesian Inference and Decision Theory
1 Bayesian Inference and Decision Theory
Instructor: Kathryn Blackmond Laskey
Room 2214 ENGR (703)
Office Hours: Thursday 4:00-6:00 PM, or by appointment
Spring 2018
Unit 2: Random Variables, Parametric Models, and Inference from Observation
Unit 2 (v4) - 1 -
2 Learning Objectives for Unit 2
Define and apply the following:
  Event
  Random variable (univariate and multivariate)
  Joint, conditional and marginal distributions
  Independence; conditional independence
  Mass and density functions; cumulative distribution functions
  Measures of central tendency and spread
Be familiar with common parametric statistical models
  Continuous distributions:
    » Normal, Gamma, Exponential, Uniform, Beta, Dirichlet
  Discrete distributions:
    » Bernoulli, Binomial, Multinomial, Poisson, Negative Binomial
Use Bayes rule to find the posterior distribution of a parameter given a set of observations
State de Finetti's Theorem and its relationship to the concept of "true" probabilities
Use R to do Bayesian inference with discrete random variables
3 Events, Partitions and Probability Distributions
Some definitions relating to mathematical models of situations involving uncertainty:
  Sample space - a set S consisting of all possible elementary outcomes
  Event - a subset E ⊂ S
  Partition - a finite or infinite collection {H_1, H_2, ...} of events such that H_i ∩ H_j = ∅ if i ≠ j and ∪_i H_i = S
The sample space S completely characterizes the outcome to the finest level of detail needed for the model
  We say an event E ⊂ S occurs if the actual outcome s is an element of E
  If {H_1, H_2, ...} is a partition of S, then exactly one of the events H_i occurs
A probability distribution is a function from events to real numbers satisfying the axioms of probability
If E is an event and {H_1, H_2, ...} is a partition, then the axioms imply:
  Law of total probability:  P(E) = Σ_k P(E | H_k) P(H_k)
  Bayes Rule:  P(H_k | E) = P(E | H_k) P(H_k) / P(E) = P(E | H_k) P(H_k) / Σ_j P(E | H_j) P(H_j)
Note: This is the standard definition. Hoff's definition (p. 15 of Hoff) is non-standard.
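The two identities above can be checked numerically. A minimal sketch in Python (the course's examples use R); the partition probabilities and likelihoods below are hypothetical numbers chosen for illustration:

```python
# Law of total probability and Bayes Rule over a 3-event partition {H_k}.
# The numbers are hypothetical illustration values.
p_H = [0.5, 0.3, 0.2]          # prior probabilities of the partition events
p_E_given_H = [0.9, 0.5, 0.1]  # likelihood of E under each H_k

# Law of total probability: P(E) = sum_k P(E | H_k) P(H_k)
p_E = sum(pe * ph for pe, ph in zip(p_E_given_H, p_H))

# Bayes Rule: P(H_k | E) = P(E | H_k) P(H_k) / P(E)
p_H_given_E = [pe * ph / p_E for pe, ph in zip(p_E_given_H, p_H)]
```

Because {H_k} is a partition, the posterior probabilities P(H_k | E) sum to 1 by construction.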
4 Random Variables
Random variable (RV) - mathematical representation for uncertain outcome
RV has a set of possible values (also called states or outcomes)
  Categorical RVs - possible values are category labels (e.g., rainy/cloudy/sunny)
  Ordinal RVs - possible values are qualitatively ordered (e.g., high/medium/low)
  Numerical RVs - possible values are numbers
    » Discrete - values drawn from a finite or countable set (e.g., 0, 1, 2, 3, ...)
    » Continuous - values drawn from a continuous set (e.g., real numbers)
Probability distribution quantifies likelihood
For a Bayesian agent the probability distribution:
  Summarizes all information needed for inference or decision
  Is always conditional on the agent's current knowledge
  Changes with new information according to Bayes Rule
5 Random Variable: Formal Definition
A probability space (S, A, P) consists of:
  A set S of elementary outcomes, called the sample space
  A σ-algebra A over S
    » Collection of events (subsets of S) that contains ∅ and S
    » Closed under countable intersection, union and complementation
    » For discrete RVs we typically use the σ-algebra of all subsets; for continuous RVs we use the σ-algebra generated by intervals
  A probability function P mapping events in A to numbers between 0 and 1, such that the probability axioms are satisfied
A random variable X is a function from S to a set X of possible values
  A random variable maps an element of the sample space to an outcome X(s) = x
    » The probability distribution on S induces a probability distribution on X
    » P(Y) is the probability of the subset A of S that maps onto the subset Y of X
    » We usually write X and not X(s)
    » Often the sample space is left implicit (this can cause confusion)
Notation convention: Some statisticians use uppercase letters to denote random variables and lowercase letters to denote values of the sample space; others use only lowercase letters
6 RV from Frequentist View: Example
Question: does a randomly chosen patient have the disease?
Sample space S = {s_1, s_2, ...} represents the population of patients arriving at a medical clinic
  30% of the patients in this population have the disease
  95% of the patients who have the disease test positive
  85% of the patients who do not have the disease test negative
Random variables:
  X maps randomly drawn patient s to X(s) = 0 if the patient does not have the disease and X(s) = 1 if the patient has the disease
  Y maps randomly drawn patient s to Y(s) = 0 if the patient's test result is negative and Y(s) = 1 if the patient's test result is positive
Usually we suppress the argument and write (X, Y) for the two-component random variable representing the condition and test result of a randomly drawn patient
  Pr(X=1) = 0.3
  Pr(Y=1 | X=1) = 0.95
  Pr(Y=0 | X=0) = 0.85
  Pr(X=1 | Y=1) = 0.73 (by Bayes rule)
These probabilities refer to a sampling process and not to any particular patient
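The posterior probability quoted on the slide follows directly from Bayes rule with the three stated numbers. A quick check in Python (the course itself uses R):

```python
# Disease-test example: P(X=1)=0.3, P(Y=1|X=1)=0.95, P(Y=0|X=0)=0.85.
p_disease = 0.30
p_pos_given_disease = 0.95
p_neg_given_healthy = 0.85

# Law of total probability: marginal probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + (1 - p_neg_given_healthy) * (1 - p_disease))

# Bayes rule: P(X=1 | Y=1)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
```

The result rounds to 0.73, matching the slide.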
7 RV from Subjectivist View: Example
Question: does a specific patient (e.g., George Smith) have the disease?
Sample space S = {s_1, s_2, ...} represents possible worlds (or ways the world could be)
  In 30% of these possible worlds, George has the disease
  In 95% of the possible worlds in which George has the disease, the test is positive
  In 85% of the possible worlds in which George does not have the disease, the test is negative
Random variables:
  X maps possible world s to X(s) = 0 if George does not have the disease and X(s) = 1 if George has the disease in this world
  Y maps possible world s to Y(s) = 0 if George's test result is negative and Y(s) = 1 if George's test result is positive in this world
Usually we suppress the argument and write (X, Y) for the two-component random variable representing George's condition and test result
  Pr(X=1) = 0.3
  Pr(Y=1 | X=1) = 0.95
  Pr(Y=0 | X=0) = 0.85
  Pr(X=1 | Y=1) = 0.73 (by Bayes rule)
These probabilities refer to our beliefs about George's specific case
8 Comparison
Frequentists say a probability distribution arises from some random process on the sample space (such as random selection)
  Pr(X=1) = 0.3 means: If patients are chosen at random from the population of clinic patients, 30% of them will have the disease
  Pr(X=1 | Y=1) = 0.73 means: If patients are chosen at random from the population of clinic patients who test positive, 73% of them will have the disease
  The probability George Smith has the disease is either 1 or 0, depending on whether he does or does not have the disease
Subjectivists say a probability distribution represents someone's beliefs about the world
  Pr(X=1) = 0.3 means: If we have no information except that George walked into the clinic, then our degree of belief is 30% that he has the disease
  Pr(X=1 | Y=1) = 0.73 means: If we learn that George tested positive, our belief increases to 73% that he has the disease
  The probability that George has the disease depends on our information about George. If we know only that he went to the clinic, our probability is 30%. If we also learn that he tested positive, our probability becomes 73%. If we receive a definitive diagnosis, then the probability he has the disease is either 1 or 0, depending on whether he does or does not have the disease.
9 Frequentists and Subjectivists: Random Sampling
Suppose we sample a patient at random from a population in which:
  30% of the patients in this population have the disease
  95% of the patients who have the disease test positive
  85% of the patients who do not have the disease test negative
Subjectivist and frequentist will agree that:
  Pr(X=1) = 0.3
  Pr(Y=1 | X=1) = 0.95
  Pr(X=1 | Y=1) = 0.73 (see example from Unit 1 p. 14)
A frequentist will say:
  These probabilities apply to the population and the sampling process
  There is no such thing as the probability that a particular patient has the disease
A subjectivist will say:
  These probabilities apply to any particular patient for whom the information given is our only relevant information about the patient
  If we learn something new (e.g., George's father had the disease) then we need to update our probabilities to reflect the new knowledge
Frequentists apply Bayesian methods when there is a frequency interpretation of the probabilities
In practice, Bayesians and frequentists are often in close agreement on the practical implications of statistical analysis
10 Mass, Density and Cumulative Distribution Functions
Probability mass function (pmf) - Maps each possible value of a discrete random variable to its probability
Probability density function (pdf) - Maps each possible value of a continuous random variable to a nonnegative number representing its likelihood relative to other possible values
Cumulative distribution function (cdf) - Value of the cdf at x is the probability that the random variable is less than or equal to x
11 Discrete Random Variables: pmf and cdf
Probability mass function
  f(x) is the probability that random variable X is equal to x
  Σ_x f(x) = 1
Cumulative distribution function
  F(x) is the probability that random variable X is less than or equal to x
  F(x) = Σ_{x' ≤ x} f(x')
  Increases from zero (at -∞) to 1 (at +∞)
  Step function has a jump at each possible value
  Height of the step at possible value x is equal to the value of the pmf at x
12 Continuous Random Variables: pdf and cdf
Probability density function
  Each individual value has probability zero
  pdf is the derivative of the cdf: f(x) = dF(x)/dx
  f(x)dx is approximately equal to the probability that X lies in [x-dx/2, x+dx/2]
  ∫ f(x)dx = 1
  Area under f(x) between a and b is equal to the difference F(b) - F(a)
Cumulative distribution function
  cdf is the integral of the pdf: F(x) = ∫_{-∞}^x f(x')dx'
  Probability that X lies in interval [a,b] is the difference of cdf values; also the integral of the pdf: P(a < X ≤ b) = F(b) - F(a) = ∫_a^b f(x)dx
[Figures: plots of f(x) versus x and F(x) versus x]
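The identity P(a < X ≤ b) = F(b) - F(a) = ∫_a^b f(x)dx can be verified numerically for a standard normal, using the closed-form cdf (via the error function) against a midpoint-rule integral of the pdf. This is a sketch in Python rather than the course's R:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Standard normal density when mu=0, sigma=1
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal cdf expressed with the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

a, b = -1.0, 1.0

# Difference of cdf values: P(a < X <= b) = F(b) - F(a)
prob_cdf = normal_cdf(b) - normal_cdf(a)

# Midpoint-rule approximation of the integral of the pdf over [a, b]
n = 10000
dx = (b - a) / n
prob_integral = sum(normal_pdf(a + (i + 0.5) * dx) * dx for i in range(n))
```

Both computations give the familiar ~68% probability within one standard deviation of the mean.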
13 Examples: Shape of pdf
[Figures: pdf f(x) and cdf F(x) versus x for three cases]
  Symmetric, unimodal distribution
  Skewed, unimodal distribution
  Bimodal distribution
14 Generalized Probability Density Function (gpdf)
It is useful to have common notation for density and mass functions
  Integral symbol is used to mean integration when the distribution is continuous and summation when the distribution is discrete
  Remember that an integral is a limit of sums
gpdf f(x) is either a mass or a density function:
  ∫ g(x)f(x)dμ(x) ≡ Σ_x g(x)f(x)   if X is discrete
  ∫ g(x)f(x)dμ(x) ≡ ∫ g(x)f(x)dx   if X is continuous
μ(x) stands for measure (counting measure for discrete random variables and uniform measure for continuous random variables)
15 Central Tendency
Measure of central tendency: Summary that indicates the center of the distribution
  Always lies between the largest and smallest value
Mean (expected value E[X]):  E[X] = ∫ x f(x)dμ(x)
  Weight each observation by its probability of occurring
  In skewed distributions the mean is pulled toward the skewed tail
  Sample mean of a random sample converges to the distribution mean as the sample size becomes large
Median (50th percentile x_0.5):  P(X ≤ x_0.5) = 0.5
  Half of the observations are larger, half smaller than the median
  Less influenced by skewness and outliers than the mean
  Sample median of a random sample of observations converges to the population median as sample size becomes large
  Generalization to multivariate distributions?
Mode (x_max):  f(x_max) ≥ f(x) for all x
  Most probable value
  Distributions may have multiple modes (multiple peaks in pmf or pdf)
  In multimodal distributions the mean and median don't say much about the center of the distribution; a single numerical summary may not be appropriate
16 Spread
Measure of spread: Summary that indicates how spread out a distribution is, or how far from the center a typical observation is likely to fall
Variance and standard deviation
  Variance is the expected value of the squared difference between an observation and the mean:  V[X] = ∫ (x - E[X])² f(x)dμ(x)
  Standard deviation is the square root of the variance:  SD[X] = √V[X]
  Standard deviation is in the same units as the variable itself
Median absolute deviation (MAD)
  Median of the absolute value of the difference between an observation and the median:  P(|X - x_0.5| ≤ x_MAD) = 0.5
  Less influenced by outliers than the standard deviation
Credible interval (Bayesian confidence interval)
  Level-α credible interval [r,s]:
    » probability that the RV lies between r and s is α:  P(r ≤ X ≤ s) = α
  Kinds of level-α credible intervals:
    » one-sided credible intervals
    » two-sided credible intervals with equal probability outside the interval on either side
    » two-sided highest-posterior-density intervals
  Which to use depends on the purpose
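The summaries on the last two slides can be computed from a Monte Carlo sample. A sketch in Python (the course uses R) on a right-skewed Exponential(1) sample, where mean > median and the equal-tail interval comes from empirical quantiles:

```python
import random
import statistics

random.seed(1)
# Right-skewed sample: Exponential with rate 1 (mean 1, median ln 2)
sample = [random.expovariate(1.0) for _ in range(100000)]

mean = statistics.mean(sample)
median = statistics.median(sample)
sd = statistics.stdev(sample)

# Median absolute deviation: median of |x - median|
mad = statistics.median(abs(x - median) for x in sample)

# Equal-tail 90% credible-style interval from empirical quantiles
xs = sorted(sample)
lo = xs[int(0.05 * len(xs))]
hi = xs[int(0.95 * len(xs))]
```

For this skewed distribution the mean is pulled toward the right tail, so it exceeds the median, and the MAD is smaller than the standard deviation.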
17 Different Varieties of Credible Interval
  Highest posterior density (HPD)
  Two-sided symmetric tail areas
  One-sided
18 Reporting Measures of Spread
No presentation of statistical results is complete without a measure of spread!
Policy makers sometimes resist presentations that include variability
  Need to educate decision makers on why spread is important
  Need to present variability in ways that make policy implications clear
    » "Forecast is $32 per barrel with a 90% credible interval of [25, 48]" may be hard to grasp
    » Decision makers can more easily grasp: "Best guess forecast is $32 per barrel. The price is very unlikely to drop below $25 per barrel, but there are many low-probability, high-impact scenarios which could result in a significant increase in price above that level, to as much as $48 per barrel," especially if some of the scenarios are described
Graphical representations of spread are important
Taken from Hannah Laskey's science fair entry, February 2011
19 Theoretical and Sample Summary Statistics
Expected value or mean of a distribution:  E[X] = ∫ x f(x)dμ(x)
Variance of a distribution:  Var[X] = ∫ (x - E[X])² f(x)dμ(x)
Sample mean for sample X_1,...,X_n:  X̄ = (1/n) Σ_{i=1}^n X_i
Sample variance for sample X_1,...,X_n:  S² = (1/(n-1)) Σ_{i=1}^n (X_i - X̄)²
  In some definitions, the denominator is n instead of n-1
  Dividing by n-1 gives an unbiased estimate of the distribution variance
  If the sample is iid from a normal distribution then dividing by n gives the maximum likelihood estimate (MLE)
Quantile function:  F⁻¹(q) = min{ x : F(x) ≥ q }
  How does this function behave for a discrete distribution?
    » Between jumps of F(x)?
    » At a jump of F(x)?
  There are several different definitions of quantiles for discrete distributions and samples; this is one of the most commonly used
  Median is the 0.5 quantile (or 50th percentile)
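The quantile definition F⁻¹(q) = min{x : F(x) ≥ q} is easy to implement for a discrete distribution, which also answers the slide's question: between jumps of F the quantile is constant, and at a jump it steps up. A Python sketch with a hypothetical three-point pmf:

```python
# Quantile function F^{-1}(q) = min{ x : F(x) >= q } for a discrete RV.
# The pmf values are hypothetical illustration numbers.
pmf = {0: 0.2, 1: 0.5, 2: 0.3}

def quantile(q):
    # Walk the support in increasing order, accumulating the cdf,
    # and return the first value whose cdf reaches q.
    cum = 0.0
    for x in sorted(pmf):
        cum += pmf[x]
        if cum >= q:
            return x
    return max(pmf)
```

Here F jumps to 0.2 at x=0 and to 0.7 at x=1, so every q in (0.2, 0.7] maps to the same quantile 1, while q=0.2 (exactly at a jump) maps to 0.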
20 Data Summary in R
Data set on 44 babies born in Brisbane, Australia on 18 December 1997
  Dataset provided as part of the UsingR add-on package in R
Screenshot of Console from R Session
  R reads the data into a data frame called babyboom
    - R object for representing a data set
    - List of equal-length vectors, each with a name
    - babyboom$wt is the 3rd column, containing weights of babies
  summary is one of the many R methods for data frames
21 Graphical Summaries in R
There are many ways to customize R graphics
Box extends from lower to upper quartile, with the median shown as a line across the box
  Upper whisker extends to the largest observation within (upper quartile + 1.5 × interquartile range)
  Lower whisker extends to the smallest observation within (lower quartile - 1.5 × interquartile range)
22 Multivariate Distributions
A random vector X is a vector of random variables
  Underscore denotes a vector
  Underscore may be omitted - many authors do not have special notation for multivariate distributions
Joint distribution function
  cdf F(x) = P(X ≤ x)
  gpdf f(x) (density if continuous, mass function if discrete)
Expected value vector:  E[X] = ∫_{R^n} x f(x)dμ(x)
Covariance matrix:  Cov[X] = ∫_{R^n} (x - E[X])(x - E[X])^T f(x)dμ(x)
Conditional independence assumptions are often used to simplify specification and inference for multivariate distributions
23 Covariance and Correlation
The covariance matrix measures how random quantities vary together
  The (i,j)th off-diagonal element is called Cov(X_i,X_j), or the covariance of X_i and X_j
  Cov(X_i,X_j) is equal to plus or minus the product of the standard deviations of X_i and X_j if they are perfectly linearly related
  Cov(X_i,X_j) = 0 if X_i and X_j are independent
The correlation matrix:
  The diagonal elements are equal to 1
  The (i,j)th off-diagonal element Cov(X_i,X_j)/(SD(X_i) SD(X_j)) is the covariance divided by the product of the standard deviations
  The off-diagonal elements are real numbers between -1 and 1
    » Independent RVs have zero correlation (but zero correlation does not by itself imply independence)
    » The closer the correlation is to 1 or -1, the more nearly linearly related the RVs are
24 Example: Joint Density Functions
[Figures: surface plots of two joint densities]
  Independent: f(x,y) = f(x)f(y)
  Dependent: f(x,y) = f(x)f(y | x)
25 Example: Scatterplot of Random Sample from Joint Distribution
26 Marginal and Conditional Distributions
If (X,Y) has joint gpdf f(x,y):
  The marginal gpdf of X is f(x) = ∫ f(x,y)dμ(y)   (y is marginalized out)
  The conditional gpdf of X given Y is f(x | y) = f(x,y) / f(y)
  The joint gpdf of X and Y can be factored into conditional and marginal gpdfs: f(x,y) = f(x | y) f(y) = f(y | x) f(x)
The joint gpdf for n random variables (X_1,..., X_n) can be written as:
  f(x_1,..., x_n) = f(x_1) f(x_2 | x_1) f(x_3 | x_1, x_2) ··· f(x_n | x_1,..., x_{n-1}) = Π_{i=1}^n f(x_i | x_1,..., x_{i-1})
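For a discrete joint distribution, marginalization and the factorization f(x,y) = f(x | y) f(y) can be verified directly on a probability table. A Python sketch with a hypothetical 2×2 joint pmf:

```python
# Hypothetical joint pmf over x in {0,1}, y in {0,1}
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

# Marginal of Y: sum x out of the joint
f_y = {y: sum(p for (xv, yv), p in joint.items() if yv == y) for y in (0, 1)}

# Conditional of X given Y: joint divided by marginal
f_x_given_y = {(x, y): joint[(x, y)] / f_y[y] for (x, y) in joint}

# Check the factorization f(x,y) = f(x | y) f(y) for every cell
factorization_holds = all(
    abs(joint[(x, y)] - f_x_given_y[(x, y)] * f_y[y]) < 1e-12
    for (x, y) in joint
)
```

The same construction, applied repeatedly, gives the chain-rule factorization for n variables.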
27 Distributions in R
Each R distribution has four functions, invoked by adding a prefix to the distribution's base name:
  p for probability - the cumulative distribution function (cdf)
  q for quantile - the inverse cdf
  d for density - the density or mass function
  r for random - generates random numbers from the distribution
Examples: rpois, dgamma
Functions for the Beta-Binomial distribution are available in the VGAM package
28 Independence and Conditional Independence
X and Y are independent if the conditional distribution of X given Y does not depend on Y
  When X and Y are independent, the joint density (or mass) function is equal to the product of the marginal density (or mass) functions: f(x,y) = f(x)f(y)
X and Y are conditionally independent given Z if: f(x,y,z) = f(x | z)f(y | z)f(z)
  General form is f(x,y,z) = f(x | y,z)f(y | z)f(z)
Graphs provide a powerful tool for visualizing and formally modeling dependencies between random variables
  Random variables (RVs) are represented by nodes
  Direct dependencies are represented by edges
  Absence of an edge means no direct dependence
[Figures: graph with nodes X, Y and no edge, P(X,Y) = P(X)P(Y); graph X → W ← Y, W → Z, P(X,Y,W,Z) = P(X)P(Y)P(W | X,Y)P(Z | W)]
29 Independent and Identically Distributed Observations
Statistical models often assume the observations are a random sample from a parameterized distribution
  Mathematically, this is represented as independent and identically distributed (iid) conditional on the parameter θ
The density function for an iid sample X_1,...,X_n is written as a product of factors:
  f(x_1, x_2,..., x_n | θ) = f(x_1 | θ) f(x_2 | θ) ··· f(x_n | θ) = Π_{i=1}^n f(x_i | θ)
A plate represents repeated structure
  The RVs inside the plate are iid conditional on their parents
  In this example, the X_i are iid given Θ
  Joint gpdf for (X, Θ):  g(θ) Π_i f(x_i | θ)
  Conditional gpdf for x given θ:  Π_i f(x_i | θ)
[Figures: graph Θ → X_1, X_2, X_3, and the equivalent plate notation Θ → X_i, i=1,...,3]
30 Graphical Probability Models
Graphical probability models provide a means of specifying complex multivariate distributions in compact form, visualizing dependence relationships, and computing results of inference
  In the past, statistical packages have been limited to models with simple random sampling
  Graphical models can represent more complex kinds of dependencies
  Software is becoming available to estimate parameters of these more complex models from data (such as JAGS / R)
Example:
  Birth rates Λ_k for hospitals k=1,...,n are sampled randomly from a distribution with parameter A:  A ~ h(α),  Λ_k ~ g(λ | α)
  Hourly births X_ki at hospital k are sampled from a Poisson distribution with parameter Λ_k:  X_ki ~ Poisson(λ_k)
[Figure: graph A → Λ_1,...,Λ_n → X_ki, shown in plate notation with i=1,...,m_k and k=1,...,n]
31 Parametric Families of Distributions
Much of statistics is based on constructing models of phenomena using parametric families of distributions
  Simple models typically assume observations X_1,..., X_N are sampled randomly from a distribution with gpdf f(x | Θ)
    » The symbol Θ denotes a parameter
    » Functional form f(x | Θ) is known but the value of the parameter is unknown
  Parameter may be a number Θ or a vector Θ = (Θ_1,...,Θ_p)
Canonical problems:
  Use the sample to construct an estimate (point or interval) for the parameter
  Test a hypothesis about the value of the parameter, e.g.:
    » Is Θ equal to (or greater than) a particular value θ_0?
    » Are two parameters Θ_1 and Θ_2 equal to each other?
  Is the functional form f(x | Θ) adequate as a model for the phenomenon generating the observations?
  Predict features (e.g., mean) of a future sample X_{N+1},..., X_{N+M}
Hierarchical (or multi-level) Bayesian models
  Retain simplicity of parametric models with standard functional form
  Build complex models out of simple building blocks
Many distributions have multiple parameterizations in common use and it is important not to confuse different parameterizations
32 Some Discrete Parametric Families (p. 1)
Binomial distribution with number of trials n and success probability π:
  Sample space: Integers 0, 1,..., n
  pmf: f(x | n,π) = C(n,x) π^x (1-π)^{n-x}
  E[X | n,π] = nπ;  Var[X | n,π] = nπ(1-π)
  Bernoulli distribution is a Binomial distribution with n=1
Multinomial distribution with number of trials n and success probabilities π_1,...,π_p:
  Sample space: Vectors of non-negative integers summing to n
  pmf: f(x_1,...,x_p | n, π_1,...,π_p) = (n! / (x_1! ··· x_p!)) Π_{i=1}^p π_i^{x_i}
  E[X_i | n,π] = nπ_i;  Var[X_i | n,π] = nπ_i(1-π_i)
Poisson distribution with rate λ:
  Sample space: Non-negative integers
  pmf: f(x | λ) = e^{-λ} λ^x / x!
  E[X | λ] = λ;  Var[X | λ] = λ
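The stated mean and variance formulas can be checked by summing directly over the sample space. A Python sketch for the Binomial pmf with hypothetical parameter values n=10, π=0.3:

```python
import math

def binomial_pmf(x, n, pi):
    # C(n, x) * pi^x * (1-pi)^(n-x)
    return math.comb(n, x) * pi**x * (1 - pi)**(n - x)

n, pi = 10, 0.3

# Mean and variance computed by brute-force summation over {0,...,n}
mean = sum(x * binomial_pmf(x, n, pi) for x in range(n + 1))
var = sum((x - mean) ** 2 * binomial_pmf(x, n, pi) for x in range(n + 1))
```

The sums reproduce E[X] = nπ and Var[X] = nπ(1-π) exactly, because the Binomial sample space is finite.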
33 Some Discrete Parametric Families (p. 2)
Negative binomial distribution with size α and probability π:
  Sample space: Non-negative integers
  pmf: f(x | α,π) = C(α+x-1, x) π^α (1-π)^x
  E[X | α,π] = α(1-π)/π;  Var[X | α,π] = α(1-π)/π²
34 Some Continuous Parametric Families (p. 1)
Gamma distribution with shape α and scale β:
  Sample space: Non-negative real numbers
  pdf: f(x | α,β) = (1 / (β^α Γ(α))) x^{α-1} e^{-x/β}
  E[X | α,β] = αβ;  Var[X | α,β] = αβ²
  Gamma distribution is sometimes parameterized with shape and rate r = 1/β
  Exponential distribution is a Gamma distribution with α=1
  Chi-square distribution with δ degrees of freedom is a Gamma distribution with α=δ/2 and β=2
Beta distribution with shape parameters α and β:
  Sample space: Real numbers between 0 and 1
  pdf: f(x | α,β) = (Γ(α+β) / (Γ(α)Γ(β))) x^{α-1} (1-x)^{β-1}
  E[X | α,β] = α/(α+β);  Var[X | α,β] = αβ / ((α+β)²(α+β+1))
  Uniform distribution is a Beta distribution with α=β=1
Univariate normal distribution with mean µ and standard deviation σ:
  Sample space: Real numbers
  pdf: f(x | µ,σ) = (1 / (√(2π) σ)) exp{ -½ ((x-µ)/σ)² }
  E[X | µ,σ] = µ;  Var[X | µ,σ] = σ²
Multivariate normal distribution with mean µ and covariance matrix Σ:
  Sample space: Vectors of real numbers
  pdf: f(x | µ,Σ) = (1 / ((2π)^{p/2} |Σ|^{1/2})) exp{ -½ (x-µ)' Σ⁻¹ (x-µ) }
  E[X | µ,Σ] = µ;  Var[X | µ,Σ] = Σ
Gamma function:  Γ(y) = ∫_0^∞ u^{y-1} e^{-u} du
35 Some Continuous Parametric Families (p. 2)
Dirichlet distribution with shape parameters α_1, α_2,..., α_p:
  Sample space: Non-negative real vectors summing to 1
  pdf: f(x_1,...,x_p | α_1,...,α_p) = (Γ(α_1+···+α_p) / (Γ(α_1)···Γ(α_p))) x_1^{α_1-1} ··· x_p^{α_p-1}
  E[X_i | α_1,...,α_p] = α_i / Σ_j α_j
  Var[X_i | α_1,...,α_p] = α_i (Σ_j α_j - α_i) / ((Σ_j α_j)² (Σ_j α_j + 1))
  Dirichlet distribution is a multivariate generalization of the Beta distribution
Wishart distribution with degrees of freedom α and scale matrix V, where V is a positive definite matrix of dimension p and α > p-1:
  Sample space: Positive definite p-dimensional matrices
  pdf: f(X | α,V) = |X|^{(α-p-1)/2} e^{-tr(V⁻¹X)/2} / (2^{αp/2} |V|^{α/2} Γ_p(α/2))
    » Γ_p is the multivariate Gamma function; tr is the trace function
  E[X | α,V] = αV
  Wishart distribution is a multivariate generalization of the Gamma and Chi-square distributions
36 Sufficient Statistic
A sufficient statistic is a data summary (function of a sample of observations) such that the observations are independent of the parameter given the sufficient statistic
  Example: For a sample of n iid observations from a Poisson distribution, the sum of the observations is a sufficient statistic for the rate parameter
  Example: For a sample of n observations from a Bernoulli distribution, the total number of successes is a sufficient statistic for the success probability
If observations X_1,..., X_N are sampled randomly from a distribution with gpdf f(x | θ) and Y is sufficient for the parameter θ, then the posterior distribution f(θ | X) depends on the observations only through the sufficient statistic
Fisher's factorization theorem: T(X) is sufficient for θ if and only if the conditional probability distribution f(x | θ) can be factored as:
  f(x | θ) = h(x) g_θ(T(x))
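The Poisson example can be demonstrated numerically: two samples with the same length and the same sum yield identical posteriors over a grid of rate values. A Python sketch (grid and flat prior are hypothetical choices):

```python
import math

# Hypothetical discrete grid of rate values with a flat prior
grid = [0.5, 1.0, 1.5, 2.0]
prior = [0.25, 0.25, 0.25, 0.25]

def posterior(data):
    # iid Poisson likelihood: prod_i e^{-lam} lam^{x_i} / x_i!
    # The x_i! factors do not depend on lam, so they cancel on normalization;
    # the likelihood depends on the data only through n and sum(data).
    like = [math.exp(-lam * len(data)) * lam ** sum(data) for lam in grid]
    unnorm = [l * p for l, p in zip(like, prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

post_a = posterior([1, 0, 2, 1])   # n = 4, sum = 4
post_b = posterior([2, 2, 0, 0])   # same n and sum, different observations
```

Because the likelihood factors as h(x) · g_θ(n, Σx), the two posteriors agree exactly: the sum (with the sample size) is sufficient for the rate.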
37 Is a Parametric Distribution a Good Model?
Before applying a parametric model, we should perform exploratory analysis to evaluate whether the model is an adequate representation of the data
A q-q plot is a commonly used diagnostic tool
  Plot quantiles of the data distribution against quantiles of the theoretical distribution
  If the theoretical distribution is correct, the plot should look approximately like the plot of y=x
  Problem: what about unknown parameters?
    » Solution: We can estimate parameters from data
    » Solution: In the case of a location-scale family, we can plot data quantiles against the standard distribution (location parameter = 0 and scale parameter = 1)
      - If the theoretical distribution is correct the plot should look approximately like a straight line
      - Slope and intercept of the line are determined by the scale and location parameters
Another diagnostic tool is to compare empirical and theoretical counts for a discrete RV (or discretized continuous RV)
What other checks for model fit have you encountered?
In your homework I expect you to assess whether the parametric model I give you is a good one (sometimes it won't be!)
38 Are Birth Weights Normally Distributed?
Data on birth weights of babies born in a Brisbane hospital on December 18, 1997
39 Birth Weights of Boys and Girls
40 Are Times Between Births Exponentially Distributed?
The data: times between births of babies born in a Brisbane hospital on December 18, 1997
R notes:
  ppoints function computes probabilities for evaluating quantiles
  diff function computes lagged differences
  lines function adds lines to a plot
  qexp function computes exponential quantiles
41 Looking for Temporal Patterns (1 of 2)
[Figure: time series plot of Time Until Next Birth (minutes) versus Time of Birth (minutes from midnight)]
If birth times are iid, then we should not see any patterns (such as more births at some times of day)
Do you see any patterns in this plot?
plot(birthtim[1:43], birthinterval, main="Time Series Plot of Birth Intervals", ylab="Time Until Next Birth (minutes)", xlab="Time of Birth (minutes from midnight)")
42 Looking for Temporal Patterns (2 of 2)
[Figure: lag plot of birthinterval[2:43] versus birthinterval[1:42]]
The lag plot looks for a dependency between successive birth intervals (such as long intervals being followed by long intervals)
Do you see any pattern in this plot?
plot(birthinterval[1:42], birthinterval[2:43], main="Lag Plot of Birth Intervals")
43 Do Births Per Hour Follow a Poisson Distribution?
Comparing empirical to theoretical counts is a useful tool for evaluating fit (if the expected counts are not too small)
[Table: births per hour with empirical count and expected count for each value, plus a TOTAL row; numeric entries not preserved in transcription]
44 Frequency Plot is Misleading when Expected Counts are Too Small
Blue bars: empirical counts from a sample of size 35 from a Poisson distribution with mean 20 (in bins of 1 and 3 time units)
Pink bars: expected counts for a sample of size 35 from a Poisson distribution with mean 20 (in bins of 1 and 3 time units)
When expected counts are small, plots of observed and expected counts will not look similar
Common rule of thumb: choose bin size so most bins have an expected count of at least 5
45 Inferring the Value of a Parameter
Often we model a set of observations as independent trials from a probability distribution with unknown parameter θ:
  f(x_1,...,x_n | θ) = f(x_1 | θ) ··· f(x_n | θ)
The joint gpdf of x given θ, viewed as a function of θ, is called the likelihood function
  Likelihood principle: The likelihood function captures everything about the observations needed to draw inferences
To draw inferences about θ, a Bayesian statistician must also specify a prior distribution g(θ)
We condition on the sample to obtain the posterior distribution using Bayes Rule:
  g(θ | x_1,...,x_n) = Π_{i=1}^n f(x_i | θ) g(θ) / ∫ Π_{i=1}^n f(x_i | θ) g(θ) dμ(θ) = f(x | θ) g(θ) / f(x)
f(x) is called the marginal likelihood of the observations x = (x_1,...,x_n)
Note: the X_i are independent given θ, but are not independent after integrating over θ
[Figure: plate model Θ → X_i, i=1,...,n]
46 Example: Transmission Errors
Number of transmission errors per hour is distributed as Poisson with parameter λ:
  f(x | λ) = e^{-λ} λ^x / x!
Data on the previous system established the error rate as 1.6 errors per hour
New system has a design goal of cutting this rate in half, to 0.8 errors per hour
Observe the new system for 6 one-hour periods:
  Data: 1, 0, 1, 2, 1, 0
Questions:
  Have we met the design goal?
  Does the new system improve the error rate?
47 The Prior Distribution (discretized)
Error rate can be any positive real number
We use expert judgment to define a prior distribution on a discrete set of values (later we will revisit this problem with a continuous prior distribution)
Experts familiar with the new system design said:
  Meeting the design goal of 0.8 errors per hour is about a 50-50 proposition
  The chance of making things worse than the current rate of 1.6 errors per hour is small but not negligible
Expert agrees that the discretized distribution shown here is a good reflection of his prior knowledge
  Expected value is about 1.0
  Distribution is heavy-tailed on the right
  P(Λ ≤ 0.8) = 0.54
  P(Λ ≤ 1.6) = 0.87
  Values of Λ greater than 3 are unlikely enough to ignore
48 Bayesian Inference for Error Rate
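The discrete-prior update for the transmission-error example can be sketched as follows. The grid of rate values and prior weights below are illustrative stand-ins for the expert's discretized prior (the slide's actual values are not reproduced in the transcription); the data are the six observed hourly counts. Python is used here in place of the course's R:

```python
import math

# Hypothetical discretized prior over the error rate (illustration only)
grid = [0.4, 0.8, 1.2, 1.6, 2.0, 2.4]
prior = [0.15, 0.30, 0.25, 0.17, 0.08, 0.05]

data = [1, 0, 1, 2, 1, 0]   # six one-hour observation periods

def poisson_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

# Likelihood of the whole sample under each candidate rate
likelihood = [math.prod(poisson_pmf(x, lam) for x in data) for lam in grid]

# Bayes rule: posterior proportional to likelihood times prior
unnorm = [l * p for l, p in zip(likelihood, prior)]
marginal = sum(unnorm)                     # marginal likelihood f(x)
posterior = [u / marginal for u in unnorm]

# Posterior probability of meeting the design goal (rate <= 0.8)
p_goal = sum(p for lam, p in zip(grid, posterior) if lam <= 0.8)
```

With the data mean of 5/6 ≈ 0.83 errors per hour, the posterior concentrates near the design goal, and the probability that the rate is at most 0.8 rises above its prior value.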
49 Features of the Posterior Distribution
Central tendency
  Posterior mean of Λ is 0.87
  Prior mean of Λ is 0.97; data mean is 0.83
  Typically posterior central tendency is a compromise between the prior distribution and the center of the data
Variation
  Posterior standard deviation of Λ is about 0.33
  Prior standard deviation of Λ is about 0.62
  Typically variation in the posterior is less than variation in the prior (we have more information)
Meeting the threshold
  Posterior probability of meeting or doing better than the design goal is about 0.58
  Posterior probability that the new system is better than the old system is about 0.96
  Posterior probability that the new system is worse than the old system is less than 0.02
50 Triplot
Visual tool for examining Bayesian belief dynamics
Plot prior distribution, normalized likelihood, and posterior distribution
Normalized likelihood:
  Divide the likelihood by its sum or integral over λ
  Posterior distribution we would obtain if all values were assigned equal prior probability
51 Bayesian Belief Dynamics: Sequential and Batch Processing of Observations
Batch processing:
  Use Bayes rule with prior g(θ) and combined likelihood f(X_1,..., X_n | θ) to find posterior g(θ | X_1,..., X_n)
Sequential processing:
  Use Bayes rule with prior g(θ) and likelihood f(X_1 | θ) to find posterior g(θ | X_1)
  Use Bayes rule with prior g(θ | X_1) and likelihood f(X_2 | θ) to find posterior g(θ | X_1, X_2)
  ...
  Use Bayes rule with prior g(θ | X_1,..., X_{n-1}) and likelihood f(X_n | θ) to find posterior g(θ | X_1,..., X_n)
The posterior distribution after n observations is the same with both methods
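The equivalence of sequential and batch processing can be checked directly. A Python sketch with a Bernoulli likelihood over a hypothetical two-point parameter grid:

```python
import math

# Hypothetical two-point grid of candidate success probabilities, flat prior
grid = [0.3, 0.7]
prior = [0.5, 0.5]

def bern(x, theta):
    # Bernoulli pmf: theta if x = 1, (1 - theta) if x = 0
    return theta if x == 1 else 1 - theta

def update(dist, x):
    # One application of Bayes rule with a single observation
    unnorm = [bern(x, th) * p for th, p in zip(grid, dist)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

data = [1, 1, 0, 1]

# Sequential processing: fold Bayes rule over the observations one at a time
seq = prior
for x in data:
    seq = update(seq, x)

# Batch processing: one update with the combined likelihood
like = [math.prod(bern(x, th) for x in data) for th in grid]
z = sum(l * p for l, p in zip(like, prior))
batch = [l * p / z for l, p in zip(like, prior)]
```

The two posteriors agree, and with three successes in four trials the posterior favors θ = 0.7.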
52 Visualizing Bayesian Belief Dynamics: Hour-by-Hour Triplots for Transmission Error Data Unit 2 (v4)
53 Fundamental Identity of Bayesian Inference f(x | θ)g(θ) = f(x)g(θ | x), where f(x | θ) is the likelihood function, g(θ) is the prior density function, f(x) is the marginal likelihood, and g(θ | x) is the posterior density function The joint pdf of parameter and data can be expressed in two different ways: Prior density for parameter times likelihood function Marginal likelihood times posterior density for parameter» The marginal likelihood is the (conditional) likelihood integrated over the parameter: f(x) = ∫ f(x | θ)g(θ) dθ for a continuous parameter; f(x) = Σθ f(x | θ)g(θ) for a discrete parameter Unit 2 (v4)
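The identity can be verified term by term for a discrete parameter. This sketch uses an illustrative Bernoulli model and prior:

```python
# Numeric check of f(x|theta) g(theta) = f(x) g(theta|x).
thetas = [0.2, 0.5, 0.8]           # possible parameter values
g      = [0.3, 0.4, 0.3]           # prior g(theta)
x = 1                              # one observed Bernoulli success

lik  = [t if x == 1 else 1 - t for t in thetas]        # f(x|theta)
fx   = sum(l * p for l, p in zip(lik, g))              # marginal likelihood f(x)
post = [l * p / fx for l, p in zip(lik, g)]            # posterior g(theta|x)

# Both factorizations of the joint agree term by term.
lhs = [l * p for l, p in zip(lik, g)]                  # f(x|theta) g(theta)
rhs = [fx * q for q in post]                           # f(x) g(theta|x)
```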
54 Marginal Likelihood Before we have seen X, we use the marginal likelihood to predict the value of X When used for predicting X, the marginal likelihood is called the predictive distribution for X After we see X=x, we divide the joint probability f(x | θ)g(θ) by the marginal likelihood f(x) to obtain the posterior probability of θ: g(θ | x) = f(x | θ)g(θ) / f(x) The marginal likelihood f(x) is the normalizing constant in Bayes Rule: we divide by f(x) to ensure that the posterior probabilities sum to 1 The marginal likelihood f(x) = Σθ f(x | θ)g(θ) includes uncertainty about both θ and x given θ Non-Bayesians sometimes predict future observations using a point estimate of θ Predictions using the marginal likelihood are more spread out (include more uncertainty) than predictions using a point estimate of θ Unit 2 (v4)
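The extra spread of predictive distributions over plug-in predictions can be seen numerically. This sketch mixes Poisson pmfs over an illustrative posterior for Λ and compares with a Poisson at the posterior mean:

```python
import math

grid      = [0.5, 1.0, 1.5, 2.0]
posterior = [0.3, 0.4, 0.2, 0.1]     # illustrative posterior weights
xs = range(12)                       # enough support for these rates

def poisson_pmf(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Predictive distribution: average the Poisson pmf over the posterior.
pred = [sum(poisson_pmf(x, lam) * w for lam, w in zip(grid, posterior))
        for x in xs]

# Plug-in prediction at the posterior mean (a point estimate of Lambda).
lam_hat = sum(lam * w for lam, w in zip(grid, posterior))
plug = [poisson_pmf(x, lam_hat) for x in xs]

def variance(pmf):
    m = sum(x * p for x, p in zip(xs, pmf))
    return sum((x - m) ** 2 * p for x, p in zip(xs, pmf))

# The predictive variance exceeds the plug-in variance: the mixture adds
# the uncertainty about Lambda to the Poisson sampling variability.
var_pred, var_plug = variance(pred), variance(plug)
```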
55 Predicting Future Observations Compare Marginal Likelihood with Poisson(E[Λ | observations so far]) Unit 2 (v4)
56 Parameters and Conditioning: Frequentists and Subjectivists Frequentist view of parameters Parameter Θ represents a true but unknown property of a random data generating process It is not appropriate to put a probability distribution on Θ because the value of Θ is fixed, not random Subjectivist view of parameters A subjectivist puts a probability distribution on Θ as well as on Xi given Θ Parametric distributions are a convenient means to specify probability distributions that represent our beliefs about as-yet-unobserved data Many subjectivists don't believe true parameters exist Frequentists and subjectivists on conditioning Frequentists condition on the parameter and base inferences on the data distribution f(X1 | Θ) ··· f(Xn | Θ)» Even after X has been observed it is treated as random Subjectivists condition on knowns and treat unknowns probabilistically» Before observing data X=x, the joint distribution of parameters and data is f(X1 | Θ) ··· f(Xn | Θ)g(Θ)» After observing data, X=x is known and Θ has distribution g(Θ | x1, …, xn) Unit 2 (v4)
57 Connecting Subjective and Frequency Probability: de Finetti's Representation Theorem IF Your beliefs are represented by a probability distribution P over a sequence of events X1, X2, … You believe the sequence is infinitely exchangeable (your probability for any sequence of successes and failures does not depend on the order of successes and failures) THEN You believe with probability 1 that the proportion of successes will tend to a definite limit as the number of trials goes to infinity: Sn/n → p as n → ∞, where Sn = # successes in n trials Your probability distribution for Sn is the same as the distribution you would assess if you believed the observations were random draws from some unknown true value of p with probability density function g(p): P(Sn = k) = ∫₀¹ (n choose k) p^k (1−p)^(n−k) g(p) dp A sequence a frequentist would call random draws from a true distribution is one a Bayesian would call exchangeable Unit 2 (v4)
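The content of the theorem can be illustrated in the other direction: a mixture of i.i.d. Bernoulli sequences is exchangeable, so the probability of a binary sequence depends only on its number of successes. The discrete mixing distribution over p below is illustrative (the theorem's g(p) can be any distribution on [0, 1]):

```python
import math

ps      = [0.2, 0.5, 0.8]          # possible values of p
weights = [0.3, 0.4, 0.3]          # mixing weights g(p)

def seq_prob(seq):
    """P(sequence) = sum over p of g(p) * p^(#successes) * (1-p)^(#failures)."""
    k, n = sum(seq), len(seq)
    return sum(w * p ** k * (1 - p) ** (n - k) for p, w in zip(ps, weights))

# Exchangeability: reordering successes and failures leaves the probability unchanged.
a = seq_prob((1, 1, 0, 0))
b = seq_prob((0, 1, 0, 1))

# P(S_n = k) sums over the C(n, k) equally likely arrangements.
n, k = 4, 2
p_Sn_k = math.comb(n, k) * seq_prob((1,) * k + (0,) * (n - k))
```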
58 Bayesian Inference for Continuous Random Variables Inference for a continuous random variable is the limiting case of inference with a discretized random variable as the number of bins goes to infinity and the width of each bin goes to zero Accuracy tends to increase with more bins Accuracy tends to increase with smaller area per bin Be careful with the tail area of unbounded random variables A closed form solution for the continuous problem exists in special cases (see Unit 3) The prior distribution we used for the transmission error example was obtained by discretizing a Gamma distribution In Unit 3, we will compare with the exact result using the Gamma prior distribution Unit 2 (v4)
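As a preview of that comparison, the sketch below checks that a finely discretized Gamma prior reproduces the exact conjugate Gamma-Poisson posterior mean. The Gamma parameters and data are illustrative, not the lecture's:

```python
import math

a, b = 2.0, 2.0                  # Gamma shape and rate; prior mean a/b = 1.0
data = [1, 0, 2, 1, 0]           # illustrative hourly error counts
n, s = len(data), sum(data)

# Exact conjugate result (Unit 3): posterior is Gamma(a + s, b + n).
exact_mean = (a + s) / (b + n)

# Discretized version: bin midpoints on (0, 10), 1000 bins of width 0.01.
grid = [0.005 + 0.01 * k for k in range(1000)]

def gamma_pdf(lam):
    return b ** a * lam ** (a - 1) * math.exp(-b * lam) / math.gamma(a)

def poisson_lik(lam):
    return math.prod(lam ** x * math.exp(-lam) / math.factorial(x) for x in data)

joint = [gamma_pdf(lam) * poisson_lik(lam) for lam in grid]
total = sum(joint)
approx_mean = sum(lam * j for lam, j in zip(grid, joint)) / total
```

Note the grid must extend far enough to cover the tail; truncating an unbounded random variable too aggressively biases the answer.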
59 Summary and Synthesis A random variable represents an uncertain hypothesis Categorical, ordinal, discrete numerical, continuous numerical A probability distribution maps sets of possible values to numbers Probability of a set indicates how likely it is that the value belongs to the set Probability mass functions, density functions, and cumulative distribution functions are tools for calculating probabilities of sets We examined measures of central tendency and spread Parametric families of distributions are convenient and practical off-the-shelf models for common types of uncertain processes We listed several commonly used parametric families We noted that observations are often modeled as randomly sampled from a parametric family with unknown parameter We applied Bayes rule to infer a posterior distribution for a parameter (using a discretized prior distribution) We contrasted the subjectivist and frequentist approaches to parameter inference We saw that de Finetti's exchangeability theorem provides a connection between the frequentist and subjectivist views of parameters Unit 2 (v4)