School of Computing & Communication, UTS, January 2017
Random variables

Pre-university: a number is just a fixed value. When we talk about probabilities:
When X is a continuous random variable, it has a probability density function (pdf) p(x).
When X is a discrete random variable, it has a probability mass function (pmf) p(x).
p(x) = Pr(X = x) means: the probability that the random variable X is equal to a fixed number x, e.g., the probability that the number of machine learning participants = 20.
Mean or Expectation

Discrete case: \mu = E(X) = \frac{1}{N} \sum_{i=1}^{N} x_i
Continuous case: \mu = E(X) = \int_{x \in S} x \, p(x) \, dx

We can also measure the expectation of a function:
E(f(X)) = \int_{x \in S} f(x) \, p(x) \, dx
For example, E(\cos(X)) = \int_{x \in S} \cos(x) \, p(x) \, dx and E(X^2) = \int_{x \in S} x^2 \, p(x) \, dx

What about f(E(X))? We will discuss this later, when we meet Jensen's inequality in Expectation-Maximization.
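A quick NumPy check of these definitions; a minimal sketch using the discrete distribution that appears in the variance slides below:

```python
import numpy as np

# Discrete pmf: values of X and their probabilities (from the later slides).
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([1/6, 2/6, 2/6, 1/6])   # must sum to 1

mu = np.sum(x * p)                   # E(X) = sum_x x p(x)
e_cos = np.sum(np.cos(x) * p)        # E(cos(X)) = sum_x cos(x) p(x)
e_x2 = np.sum(x**2 * p)              # E(X^2) = sum_x x^2 p(x)
print(mu, e_cos, e_x2)               # 2.5, ...
```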
Variance: an intuitive explanation

You have data X = {2, 1, 3, 2, 3, 4}, i.e., x_1 = 2, x_2 = 1, ..., x_6 = 4.
You have the mean: \mu = (2 + 1 + 3 + 2 + 3 + 4) / 6 = 2.5
The variance is then:
VAR(data) = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
Division by N is intuitive: otherwise, more data would imply more variance.
Also think about what kind of values VAR and \sigma can take; we will look at what kind of distribution is required for them.
Two alternative expressions

People sometimes use: you have data X = {2, 1, 3, 2, 3, 4}, i.e., x_1 = 2, x_2 = 1, ..., x_6 = 4.

VAR(X) = \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
       = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \frac{1}{N} \sum_{i=1}^{N} 2 x_i \mu + \frac{1}{N} \sum_{i=1}^{N} \mu^2
       = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - 2\mu \underbrace{\frac{1}{N} \sum_{i=1}^{N} x_i}_{\mu} + \mu^2
       = \left( \frac{1}{N} \sum_{i=1}^{N} x_i^2 \right) - \mu^2

Other times, people use: you have data X = {1, 2, 3, 4}, and P(X=1) = 1/6, P(X=2) = 2/6, P(X=3) = 2/6 and P(X=4) = 1/6.

Discrete: VAR(X) = \sigma^2 = \sum_{x \in X} (x - \mu)^2 p(x)
Continuous: VAR(X) = \sigma^2 = \int_{x \in X} (x - \mu)^2 p(x) \, dx

VAR(X) = \sum_{x \in X} (x^2 - 2\mu x + \mu^2) p(x)
       = \sum_{x \in X} x^2 p(x) - 2\mu \underbrace{\sum_{x \in X} x p(x)}_{\mu} + \mu^2 \underbrace{\sum_{x \in X} p(x)}_{1}
       = \sum_{x \in X} x^2 p(x) - \mu^2

It is easy to verify that both versions give the same answer.
Numerical example

First version: X = {2, 1, 3, 2, 3, 4}, i.e., x_1 = 2, x_2 = 1, ..., x_6 = 4.
VAR(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
= [(2-2.5)^2 + (1-2.5)^2 + (3-2.5)^2 + (2-2.5)^2 + (3-2.5)^2 + (4-2.5)^2] / 6 = 0.9167

Second version: X = {1, 2, 3, 4}, and P(X=1) = 1/6, P(X=2) = 2/6, P(X=3) = 2/6 and P(X=4) = 1/6.
VAR(X) = \sigma^2 = \sum_{x \in X} \underbrace{(x - \mu)^2}_{f(x)} p(x)
= (1-2.5)^2 (1/6) + (2-2.5)^2 (2/6) + (3-2.5)^2 (2/6) + (4-2.5)^2 (1/6) = 0.9167

Both versions give the same answer. Note that the second version is just E(f(X)) with f(x) = (x - \mu)^2.
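Both versions can also be checked numerically; a minimal NumPy sketch of the two computations above:

```python
import numpy as np

# First version: raw data points.
data = np.array([2, 1, 3, 2, 3, 4], dtype=float)
mu = data.mean()                              # 2.5
var_def = np.mean((data - mu) ** 2)           # (1/N) sum (x_i - mu)^2
var_alt = np.mean(data ** 2) - mu ** 2        # (1/N) sum x_i^2 - mu^2

# Second version: distinct values with probabilities.
x = np.array([1.0, 2.0, 3.0, 4.0])
p = np.array([1/6, 2/6, 2/6, 1/6])
var_pmf = np.sum((x - np.sum(x * p)) ** 2 * p)

print(var_def, var_alt, var_pmf)              # all 0.9167 (= 11/12)
```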
An important fact about the variance

VAR(X) = E[(X - E(X))^2] = \int_{x \in S} (x - \mu)^2 p(x) \, dx
       = \int_{x \in S} x^2 p(x) \, dx - 2\mu \int_{x \in S} x p(x) \, dx + \mu^2 \int_{x \in S} p(x) \, dx
       = E(X^2) - (E(X))^2

Think of VAR(X) as the mean-subtracted second-order moment of the random variable X.
Joint distributions

The following is a table form of the joint distribution Pr(X, Y):

         Y=0     Y=1     Y=2     Total
X=0      0       3/15    3/15    6/15
X=1      2/15    6/15    0       8/15
X=2      1/15    0       0       1/15
Total    3/15    9/15    3/15    1

This table shows Pr(X, Y), or Pr(X = x, Y = y). For example, p(X=1, Y=1) = 6/15.
Exercise: what is the probability that X = 2, Y = 1?
Exercise: what is the probability that X = 1, Y = 2?
Exercise: what is the value of \sum_{i=0}^{2} \sum_{j=0}^{2} Pr(X=i, Y=j)?
Marginal distributions

         Y=0     Y=1     Y=2     Total
X=0      0       3/15    3/15    6/15
X=1      2/15    6/15    0       8/15
X=2      1/15    0       0       1/15
Total    3/15    9/15    3/15    1

Using the sum rule, the marginal distribution tells us that:
Pr(X) = \sum_{y \in S_y} Pr(X, y)   or   p(x) = \int_{y \in S_y} p(x, y) \, dy
For example: Pr(Y=1) = \sum_{i=0}^{2} p(X=i, Y=1) = 3/15 + 6/15 + 0 = 9/15
Exercise: what are Pr(X=2) and Pr(X=1)?
Conditional distributions

         Y=0     Y=1     Y=2     Total
X=0      0       3/15    3/15    6/15
X=1      2/15    6/15    0       8/15
X=2      1/15    0       0       1/15
Total    3/15    9/15    3/15    1

Conditional distribution:
p(X | Y) = p(X, Y) / p(Y) = p(Y | X) p(X) / p(Y) = p(Y | X) p(X) / \sum_X p(Y | X) p(X)
What about p(X | Y = y)? Pick an example:
p(X=1 | Y=1) = p(X=1, Y=1) / p(Y=1) = (6/15) / (9/15) = 2/3
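A small NumPy sketch of the sum rule and the conditional above, storing the joint table as a matrix (row i, column j holds Pr(X=i, Y=j)):

```python
import numpy as np

# Joint table Pr(X=i, Y=j): rows X=0..2, columns Y=0..2.
P = np.array([[0, 3, 3],
              [2, 6, 0],
              [1, 0, 0]]) / 15.0

p_Y = P.sum(axis=0)                # marginals Pr(Y): [3/15, 9/15, 3/15]
p_X = P.sum(axis=1)                # marginals Pr(X): [6/15, 8/15, 1/15]
print(p_X, p_Y)

# Conditional Pr(X | Y=1) = Pr(X, Y=1) / Pr(Y=1).
p_X_given_Y1 = P[:, 1] / p_Y[1]
print(p_X_given_Y1)                # [1/3, 2/3, 0]; Pr(X=1 | Y=1) = 2/3
```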
Conditional distributions: exercise

         Y=0     Y=1     Y=2     Total
X=0      0       3/15    3/15    6/15
X=1      2/15    6/15    0       8/15
X=2      1/15    0       0       1/15
Total    3/15    9/15    3/15    1

The formula for the conditional distribution:
p(X | Y) = p(X, Y) / p(Y) = p(Y | X) p(X) / p(Y) = p(Y | X) p(X) / \sum_X p(Y | X) p(X)
Exercise: what is p(X=2 | Y=1)?
Exercise: what is p(X=1 | Y=2)?
Independence

If X and Y are independent:
p(X | Y) = p(X)
p(X, Y) = p(X) p(Y)
The two statements are consistent when X and Y are independent:
p(X | Y) = p(X, Y) / p(Y) = p(X) p(Y) / p(Y) = p(X)

         Y=0     Y=1     Y=2     Total
X=0      0       3/15    3/15    6/15
X=1      2/15    6/15    0       8/15
X=2      1/15    0       0       1/15
Total    3/15    9/15    3/15    1

X and Y are NOT independent.

         Y=0      Y=1      Y=2      Total
X=0      18/225   54/225   18/225   6/15
X=1      24/225   72/225   24/225   8/15
X=2      3/225    9/225    3/225    1/15
Total    3/15     9/15     3/15     1

X and Y are independent.
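One way to test independence numerically is to compare the joint table against the outer product of its marginals; a sketch for the two tables above:

```python
import numpy as np

P = np.array([[0, 3, 3],
              [2, 6, 0],
              [1, 0, 0]]) / 15.0          # first table: NOT independent
Q = np.array([[18, 54, 18],
              [24, 72, 24],
              [ 3,  9,  3]]) / 225.0      # second table: independent

for J in (P, Q):
    outer = np.outer(J.sum(axis=1), J.sum(axis=0))   # p(X) p(Y)
    print(np.allclose(J, outer))                     # False, then True
```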
Conditional independence

Imagine we have three random variables X, Y and Z. Once we know Z, knowing Y does NOT give us any additional information about X. Therefore:
Pr(X | Y, Z) = Pr(X | Z)
This means that X is conditionally independent of Y given Z.
If Pr(X | Y, Z) = Pr(X | Z), then what about Pr(X, Y | Z)?
Pr(X, Y | Z) = Pr(X, Y, Z) / Pr(Z)
             = Pr(X | Y, Z) Pr(Y, Z) / Pr(Z)
             = Pr(X | Y, Z) Pr(Y | Z)
             = Pr(X | Z) Pr(Y | Z)
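To see the algebra in action, one can build a joint distribution that is conditionally independent by construction, p(x, y, z) = p(z) p(x|z) p(y|z), and check that Pr(X | Y, Z) = Pr(X | Z); a sketch with arbitrary random tables:

```python
import numpy as np

rng = np.random.default_rng(0)
nz, nx, ny = 2, 3, 3
p_z = rng.dirichlet(np.ones(nz))                 # p(z)
p_x_z = rng.dirichlet(np.ones(nx), size=nz)      # row z holds p(x | z)
p_y_z = rng.dirichlet(np.ones(ny), size=nz)      # row z holds p(y | z)

# p(x, y, z) = p(z) p(x|z) p(y|z): conditionally independent by construction.
joint = np.einsum('z,zx,zy->xyz', p_z, p_x_z, p_y_z)

# Pr(X | Y=0, Z=1) should equal Pr(X | Z=1).
p_x_given_yz = joint[:, 0, 1] / joint[:, 0, 1].sum()
p_x_given_z = joint[:, :, 1].sum(axis=1) / joint[:, :, 1].sum()
print(np.allclose(p_x_given_yz, p_x_given_z))    # True
```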
An example of conditional independence

We will study dynamic models later. The graphical model is a chain of hidden states, with one observation attached to each state:
x_{t-1} -> x_t -> x_{t+1}, with emissions x_{t-1} -> y_{t-1}, x_t -> y_t, x_{t+1} -> y_{t+1}
From this model, we can see:
p(x_t | x_1, ..., x_{t-1}, y_1, ..., y_{t-1}) = p(x_t | x_{t-1})
p(y_t | x_1, ..., x_{t-1}, x_t, y_1, ..., y_{t-1}) = p(y_t | x_t)
For now, think about whether a given variable is the only item that blocks the path between two (or more) variables. A simulation sketch follows below.
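As a concrete (assumed) instance, here is a linear-Gaussian state-space simulation in which each x_t depends only on x_{t-1} and each y_t only on x_t; the coefficient 0.9 and the noise scales are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100
x = np.zeros(T)                                 # hidden states
y = np.zeros(T)                                 # observations
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + rng.normal(0, 1.0)  # p(x_t | x_{t-1})
    y[t] = x[t] + rng.normal(0, 0.5)            # p(y_t | x_t)
print(x[:5], y[:5])
```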
Another example: Bayesian linear regression

We have data pairs:
Input: X = x_1, ..., x_N
Output: Y = y_1, ..., y_N
Each pair x_i and y_i is related through the model equation:
y_i = f(x_i; w) + \epsilon, where \epsilon \sim N(0, \sigma^2)
Input alone is not going to tell you the model parameter: p(w | X) = p(w)
Output alone is not going to tell you the model parameter: p(w | Y) = p(w)
Obviously: p(w | X, Y) \neq p(w)
Posterior over the parameter w:
p(w | X, Y) = \frac{p(Y | w, X) p(w | X) p(X)}{p(Y | X) p(X)} = \frac{p(Y | w, X) p(w)}{p(Y | X)} = \frac{p(Y | w, X) p(w)}{\int_w p(Y | X, w) p(w) \, dw}
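For the simplest case f(x_i; w) = w x_i with a Gaussian prior w ~ N(0, 1/alpha), the posterior p(w | X, Y) is available in closed form; a sketch on synthetic data (w_true, sigma and alpha are assumed values for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
w_true, sigma = 1.5, 0.3                        # assumed ground truth
X = rng.uniform(-1, 1, size=50)
Y = w_true * X + rng.normal(0, sigma, size=50)  # y_i = w x_i + noise

alpha = 1.0                                     # prior precision of w
beta = 1.0 / sigma**2                           # noise precision
post_var = 1.0 / (alpha + beta * np.sum(X**2))  # posterior variance of w
post_mean = beta * post_var * np.sum(X * Y)     # posterior mean of w
print(post_mean, post_var)                      # p(w | X, Y) = N(post_mean, post_var)
```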
Expectation of joint probabilities

Given that (X, Y) is a two-dimensional random variable:
Continuous case: E[f(X, Y)] = \int_{y \in S_y} \int_{x \in S_x} f(x, y) p(x, y) \, dx \, dy
Discrete case: E[f(X, Y)] = \sum_{i=1}^{N} \sum_{j=1}^{N} f(X_i, Y_j) p(X_i, Y_j)
Numerical example

p(X, Y):
         Y=1     Y=2     Y=3
X=1      0       6/15    6/15
X=2      1/15    1/15    0
X=3      1/15    0       0

f(X, Y):
         Y=1     Y=2     Y=3
X=1      1       7       8
X=2      2       1       2
X=3      6       8       1

E[f(X, Y)] = \sum_{i} \sum_{j} f(X_i, Y_j) p(X_i, Y_j)
= 1 \cdot 0 + 7 \cdot (6/15) + 8 \cdot (6/15) + 2 \cdot (1/15) + 1 \cdot (1/15) + 2 \cdot 0 + 6 \cdot (1/15) + 8 \cdot 0 + 1 \cdot 0
= 99/15 = 6.6
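The same computation in NumPy, evaluating the double sum as an element-wise product of the two tables above:

```python
import numpy as np

p = np.array([[0, 6, 6],
              [1, 1, 0],
              [1, 0, 0]]) / 15.0   # p(X, Y), rows X=1..3, columns Y=1..3
f = np.array([[1, 7, 8],
              [2, 1, 2],
              [6, 8, 1]], dtype=float)

# E[f(X, Y)] = sum_i sum_j f(X_i, Y_j) p(X_i, Y_j)
print(np.sum(f * p))               # 6.6
```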
Conditional expectation

It is a useful property for later:
E(Y) = \int_X E(Y | X) p(X) \, dx
Proof:
\int_X E(Y | X) p(X) \, dx = \int_X \underbrace{\left( \int_Y y \, p(y | X) \, dy \right)}_{E(Y | X)} p(X) \, dx
= \int_X \int_Y y \, p(y, X) \, dy \, dx
= \int_Y y \left( \int_X p(y, X) \, dx \right) dy
= \int_Y y \, p(y) \, dy = E(Y)
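A Monte Carlo sanity check of this tower property on a toy model (the model Y = 2X + noise is an assumption for illustration, chosen so that E(Y | X) = 2X):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000
X = rng.normal(0.0, 1.0, n)
Y = 2 * X + rng.normal(0.0, 1.0, n)   # for this model, E(Y | X) = 2X

# Tower property: E(Y) = E[E(Y | X)].
print(Y.mean(), (2 * X).mean())       # both close to 0
```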
Bayesian predictive distribution

Let us put marginal distributions and conditional independence to the test. Very often in machine learning, you want to compute the probability of new data y given training data Y, i.e., p(y | Y). You assume there is some model that explains both Y and y, with model parameter \theta:
p(y | Y) = \int_\theta p(y | \theta) p(\theta | Y) \, d\theta
Exercise: explain why the above works.
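A discrete sketch of the predictive integral, replacing it with a sum over a grid of theta values (the coin-flip setup and the counts are assumptions for illustration):

```python
import numpy as np

# Coin with unknown heads probability theta, discretised on a grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / theta.size

# Training data Y: 7 heads out of 10 flips -> posterior p(theta | Y).
likelihood = theta**7 * (1 - theta)**3
posterior = prior * likelihood
posterior /= posterior.sum()

# Predictive: p(y = heads | Y) = sum_theta p(y = heads | theta) p(theta | Y).
print(np.sum(theta * posterior))      # close to (7 + 1) / (10 + 2) = 2/3
```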
Revisit Bayes' theorem

Instead of using arbitrary random-variable symbols, we now use \theta for the model parameter and X = x_1, ..., x_n for the dataset:
\underbrace{p(\theta | X)}_{\text{posterior}} = \frac{\overbrace{p(X | \theta)}^{\text{likelihood}} \; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(X)}_{\text{normalization constant}}} = \frac{p(X | \theta) p(\theta)}{\int_\theta p(X | \theta) p(\theta) \, d\theta}
An Intrusion Detection System (IDS) example

The setting: imagine that out of all the TCP connections (say, millions), 1% are intrusions:
When there is an intrusion, the probability that the system sends an alarm is 87%.
When there is no intrusion, the probability that the system sends an alarm is 6%.
Prior probability (1% of connections are intrusions):
p(\theta = intrusion) = 0.01
p(\theta = no intrusion) = 0.99
Likelihood probability:
Given an intrusion occurs, the probability that the system sends an alarm is 87%:
p(X = alarm | \theta = intrusion) = 0.87
p(X = no alarm | \theta = intrusion) = 0.13
Given there is no intrusion, the probability that the system sends an alarm is 6%:
p(X = alarm | \theta = no intrusion) = 0.06
p(X = no alarm | \theta = no intrusion) = 0.94
Posterior

We are interested in the posterior probability Pr(\theta | X). There are two possible values for the parameter \theta and two possible observations X; therefore, there are four rates we need to compute:
True positive: when the system sends an alarm, the probability that an intrusion occurs: Pr(\theta = intrusion | X = alarm)
False positive: when the system sends an alarm, the probability that there is no intrusion: Pr(\theta = no intrusion | X = alarm)
True negative: when the system sends no alarm, the probability that there is no intrusion: Pr(\theta = no intrusion | X = no alarm)
False negative: when the system sends no alarm, the probability that an intrusion occurs: Pr(\theta = intrusion | X = no alarm)
Question: which two probabilities would you like to maximise?
Apply Bayes' theorem in this setting

Pr(\theta | X) = \frac{Pr(X | \theta) Pr(\theta)}{\sum_\theta Pr(X | \theta) Pr(\theta)}
= \frac{Pr(X | \theta) Pr(\theta)}{Pr(X | \theta = intrusion) Pr(\theta = intrusion) + Pr(X | \theta = no intrusion) Pr(\theta = no intrusion)}
Apply Bayes' theorem in this setting

True positive rate: when the system sends an alarm, what is the probability that an intrusion occurs?
Pr(\theta = intrusion | X = alarm)
= \frac{Pr(X = alarm | \theta = intrusion) Pr(\theta = intrusion)}{Pr(X = alarm | \theta = intrusion) Pr(\theta = intrusion) + Pr(X = alarm | \theta = no intrusion) Pr(\theta = no intrusion)}
= \frac{0.87 \times 0.01}{0.87 \times 0.01 + 0.06 \times 0.99} = 0.1278

False positive rate: when the system sends an alarm, what is the probability that there is no intrusion?
Pr(\theta = no intrusion | X = alarm)
= \frac{Pr(X = alarm | \theta = no intrusion) Pr(\theta = no intrusion)}{Pr(X = alarm | \theta = intrusion) Pr(\theta = intrusion) + Pr(X = alarm | \theta = no intrusion) Pr(\theta = no intrusion)}
= \frac{0.06 \times 0.99}{0.87 \times 0.01 + 0.06 \times 0.99} = 0.8722
Apply Bayes' theorem in this setting

False negative rate: when the system sends no alarm, what is the probability that an intrusion occurs?
Pr(\theta = intrusion | X = no alarm)
= \frac{Pr(X = no alarm | \theta = intrusion) Pr(\theta = intrusion)}{Pr(X = no alarm | \theta = intrusion) Pr(\theta = intrusion) + Pr(X = no alarm | \theta = no intrusion) Pr(\theta = no intrusion)}
= \frac{0.13 \times 0.01}{0.13 \times 0.01 + 0.94 \times 0.99} = 0.0014

True negative rate: when the system sends no alarm, what is the probability that there is no intrusion?
Pr(\theta = no intrusion | X = no alarm)
= \frac{Pr(X = no alarm | \theta = no intrusion) Pr(\theta = no intrusion)}{Pr(X = no alarm | \theta = intrusion) Pr(\theta = intrusion) + Pr(X = no alarm | \theta = no intrusion) Pr(\theta = no intrusion)}
= \frac{0.94 \times 0.99}{0.13 \times 0.01 + 0.94 \times 0.99} = 0.9986
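All four rates in a few lines of Python, as a direct transcription of the numbers in this example:

```python
p_intrusion = 0.01                     # prior
p_alarm_given_intrusion = 0.87         # likelihoods
p_alarm_given_no_intrusion = 0.06

# p(alarm) by the sum rule; p(no alarm) = 1 - p(alarm).
p_alarm = (p_alarm_given_intrusion * p_intrusion
           + p_alarm_given_no_intrusion * (1 - p_intrusion))

tp = p_alarm_given_intrusion * p_intrusion / p_alarm                       # 0.1278
fp = p_alarm_given_no_intrusion * (1 - p_intrusion) / p_alarm              # 0.8722
fn = (1 - p_alarm_given_intrusion) * p_intrusion / (1 - p_alarm)           # 0.0014
tn = (1 - p_alarm_given_no_intrusion) * (1 - p_intrusion) / (1 - p_alarm)  # 0.9986
print(tp, fp, fn, tn)
```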
A statistical way to think about posterior inference

Posterior inference finds the best q(\theta) in a family Q to approximate p(\theta | X), such that:

\inf_{q(\theta) \in Q} \left\{ KL(q(\theta) \| p(\theta)) - E_{\theta \sim q(\theta)}[\ln p(X | \theta)] \right\}
= \inf_{q(\theta) \in Q} \left\{ \int_\theta q(\theta) \ln \frac{q(\theta)}{p(\theta)} - \int_\theta q(\theta) \ln p(X | \theta) \right\}
= \inf_{q(\theta) \in Q} \int_\theta q(\theta) \left[ \ln q(\theta) - (\ln p(\theta) + \ln p(X | \theta)) \right]
= \inf_{q(\theta) \in Q} \int_\theta q(\theta) \ln \frac{q(\theta)}{p(\theta) p(X | \theta)}
= \inf_{q(\theta) \in Q} \left\{ \int_\theta q(\theta) \ln \frac{q(\theta)}{p(\theta | X)} - \ln p(X) \right\}
= \inf_{q(\theta) \in Q} \left\{ KL(q(\theta) \| p(\theta | X)) \right\}

(The second-to-last step uses p(\theta) p(X | \theta) = p(\theta | X) p(X); the last step drops \ln p(X), which does not depend on q.)
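A tiny discrete demonstration that minimising KL(q || prior) − E_q[ln p(X | θ)] recovers the exact posterior; the two-valued θ and the toy numbers are assumptions:

```python
import numpy as np

prior = np.array([0.5, 0.5])           # p(theta) for a two-valued theta
like = np.array([0.2, 0.8])            # p(X | theta), toy numbers
post = prior * like / np.sum(prior * like)

def objective(q):
    # KL(q || prior) - E_q[ln p(X | theta)]; equals KL(q || post) - ln p(X).
    return np.sum(q * np.log(q / prior)) - np.sum(q * np.log(like))

qs = np.linspace(0.01, 0.99, 99)
best = qs[np.argmin([objective(np.array([q, 1 - q])) for q in qs])]
print(best, post[0])                   # minimiser (0.2) matches the posterior
```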