Introduction to Probability for Graphical Models
CSC 412
Kaustav Kundu
Thursday, January 14, 2016
*Most slides based on Kevin Swersky's slides, Inmar Givoni's slides, Danny Tarlow's slides, Jasper Snoek's slides, Sam Roweis's review of probability, Bishop's book, and some images from Wikipedia
Outline
- Basics
- Probability rules
- Exponential family models
- Maximum likelihood
- Conjugate Bayesian inference (time permitting)
Why Represent Uncertainty?
The world is full of uncertainty: What will the weather be like today? Will I like this movie? Is there a person in this image?
We're trying to build systems that understand, and possibly interact with, the real world.
We often can't prove something is true, but we can still ask how likely different outcomes are, or ask for the most likely explanation.
Sometimes probability gives a concise description of an otherwise complex phenomenon.
Why Use Probability to Represent Uncertainty?
Write down simple, reasonable criteria that you'd want from a system of uncertainty (common-sense stuff), and you always get probability: the Cox axioms (Cox 1946; see Bishop, Section 1.2.3).
We will restrict ourselves to a relatively informal discussion of probability theory.
Notation
A random variable X represents outcomes or states of the world. We will write p(x) to mean Probability(X = x).
Sample space: the space of all possible outcomes (may be discrete, continuous, or mixed).
p(x) is the probability mass (density) function. It assigns a number to each point in sample space; it is non-negative and sums (integrates) to 1.
Intuitively: how often does x occur, how much do we believe in x.
Joint Probability Distribution
p(x, y) = Prob(X = x, Y = y): the probability of X = x and Y = y.
Conditional Probability Distribution
p(x | y) = Prob(X = x | Y = y): the probability of X = x given Y = y.
p(x | y) = p(x, y) / p(y)
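To make these definitions concrete, here is a minimal Python sketch (not from the original slides) using a hypothetical 2x2 joint table; the numbers are invented for illustration.

```python
# Joint and conditional probabilities for two binary variables,
# using a made-up joint table p(x, y).
import numpy as np

# Rows index x in {0, 1}, columns index y in {0, 1}; entries sum to 1.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

p_y = p_xy.sum(axis=0)          # marginal p(y), summing out x
p_x_given_y = p_xy / p_y        # conditional p(x | y) = p(x, y) / p(y)

print(p_y)                      # [0.5 0.5]
print(p_x_given_y[:, 1])        # p(x | y=1) = [0.2 0.8]
```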
The Rules of Probability
Sum rule (marginalization / summing out):
p(x) = Σ_y p(x, y)
Product/chain rule:
p(x, y) = p(x) p(y | x)
p(x_1, ..., x_N) = p(x_1) p(x_2 | x_1) p(x_3 | x_1, x_2) ... p(x_N | x_1, ..., x_{N-1})
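A small sketch (not in the original slides) that checks the chain rule numerically on a randomly generated three-variable joint: factoring into p(x1) p(x2 | x1) p(x3 | x1, x2) and multiplying back recovers the joint exactly.

```python
# Verify the chain rule on a random joint over three binary variables.
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()                     # normalize to a valid joint p(x1, x2, x3)

p_x1 = p.sum(axis=(1, 2))                        # sum rule: marginal p(x1)
p_x2_given_x1 = p.sum(axis=2) / p_x1[:, None]    # p(x2 | x1)
p_x3_given_x12 = p / p.sum(axis=2, keepdims=True)  # p(x3 | x1, x2)

# Reassemble: p(x1) p(x2|x1) p(x3|x1,x2) should equal the joint.
reassembled = p_x1[:, None, None] * p_x2_given_x1[:, :, None] * p_x3_given_x12
assert np.allclose(reassembled, p)
```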
Bayes' Rule
One of the most important formulas in probability theory. It gives us a way of reversing conditional probabilities:
p(y | x) = p(x | y) p(y) / p(x) = p(x | y) p(y) / Σ_{y'} p(x | y') p(y')
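A hedged worked example (all numbers made up for illustration): a test for a rare condition, showing how the denominator sums over both possible causes.

```python
# Bayes' rule with illustrative numbers, not real medical data.
p_disease = 0.01                     # prior p(y)
p_pos_given_disease = 0.95           # likelihood p(x | y)
p_pos_given_healthy = 0.05           # likelihood p(x | not y)

# Denominator: p(x) = sum over y' of p(x | y') p(y')
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: p(y | x) = p(x | y) p(y) / p(x)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)           # ~0.16: still unlikely despite a positive test
```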
Independence
Two random variables are said to be independent iff their joint distribution factors:
p(x, y) = p(x) p(y)
Two random variables are conditionally independent given a third if they are independent after conditioning on the third:
p(x, y | z) = p(x | z) p(y | z)
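A quick numerical check of the factorization criterion (a sketch with made-up tables): a joint is independent iff it equals the outer product of its marginals.

```python
# Checking independence numerically on two hypothetical 2x2 joints.
import numpy as np

p_indep = np.outer([0.4, 0.6], [0.7, 0.3])   # independent by construction
p_dep = np.array([[0.3, 0.1], [0.2, 0.4]])   # a dependent joint

def is_independent(p):
    # p(x, y) == p(x) p(y) for all entries?
    return np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0)))

print(is_independent(p_indep), is_independent(p_dep))  # True False
```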
Continuous Random Variables
Outcomes are real values. Probability density functions define distributions, e.g.
p(x | μ, σ²) = 1/√(2πσ²) exp{−(x − μ)² / (2σ²)}
Continuous joint distributions: replace sums with integrals, and everything else holds, e.g. marginalization and conditional probability:
p(x, z) = ∫ p(x, y, z) dy
p(x | z) = p(x, z) / p(z)
Summarizing Probability Distributions
It is often useful to give summaries of distributions without defining the whole distribution, e.g. the mean and variance.
Mean: E[x] = ∫ x p(x) dx
Variance: var(x) = E[(x − E[x])²] = E[x²] − E[x]² = ∫ (x − E[x])² p(x) dx
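A sketch that checks the mean and variance formulas by brute-force numerical integration, assuming (purely for illustration) a standard normal density.

```python
# Numerically integrate E[x] and var(x) for a standard normal.
import numpy as np

x = np.linspace(-10, 10, 200001)
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
dx = x[1] - x[0]

mean = np.sum(x * p) * dx                    # E[x] = integral of x p(x) dx
var = np.sum((x - mean)**2 * p) * dx         # integral of (x - E[x])^2 p(x) dx
print(mean, var)                             # ~0.0 and ~1.0
```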
Exponential Family
A family of probability distributions. Many of the standard distributions belong to this family: Bernoulli, Binomial/Multinomial, Poisson, Normal (Gaussian), Beta/Dirichlet, ...
They share many important properties, e.g. they have a conjugate prior (we'll get to that later), which is important for Bayesian statistics.
Definition
The exponential family of distributions over x, given parameter η (eta), is the set of distributions of the form
p(x | η) = h(x) g(η) exp{η^T u(x)}
x: scalar/vector, discrete/continuous
η: natural parameters
u(x): some function of x (the sufficient statistic)
g(η): normalizer, satisfying g(η) ∫ h(x) exp{η^T u(x)} dx = 1
h(x): base measure, often constant
Sufficient Statistics
Vague definition: called so because they completely summarize a distribution.
Less vague: they are the only part of the distribution that interacts with the parameters, and are therefore sufficient to estimate the parameters.
Example 1: Bernoulli
Binary random variable x ∈ {0, 1} (x = 1 is heads): a coin toss.
p(x | μ) = μ^x (1 − μ)^(1−x), μ ∈ [0, 1]
Example 1: Bernoulli
p(x | μ) = μ^x (1 − μ)^(1−x)
= exp{x ln μ + (1 − x) ln(1 − μ)}
= (1 − μ) exp{x ln(μ / (1 − μ))}
Matching p(x | η) = h(x) g(η) exp{η^T u(x)}:
η = ln(μ / (1 − μ)), so μ = σ(η) = 1 / (1 + e^(−η)) (the logistic sigmoid)
u(x) = x, h(x) = 1, g(η) = σ(−η) = 1 / (1 + e^η)
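A short sketch (not in the slides) verifying this exponential-family form numerically for a made-up μ: the direct Bernoulli mass and the h(x) g(η) exp{η u(x)} form agree at both outcomes.

```python
# Check the Bernoulli exponential-family parameterization.
import numpy as np

mu = 0.3
eta = np.log(mu / (1 - mu))        # natural parameter
g = 1.0 / (1.0 + np.exp(eta))      # g(eta) = sigma(-eta) = 1 - mu

for x in (0, 1):
    direct = mu**x * (1 - mu)**(1 - x)
    expfam = g * np.exp(eta * x)   # h(x) = 1
    assert np.isclose(direct, expfam)
```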
Example 2: Multinomial
For a single observation (e.g. a die toss), sometimes called Categorical:
p(x | μ) = Π_k μ_k^(x_k), where x is a one-of-K vector (x_k = 1 iff the value is k), μ_k ∈ [0, 1], Σ_k μ_k = 1.
For multiple observations (integer counts m_k over N trials):
P(m_1, ..., m_M | μ, N) = (N! / (m_1! ⋯ m_M!)) Π_k μ_k^(m_k)
E.g. the probability that 1 came out 3 times, 2 came out once, ..., and 6 came out 7 times if I tossed a die N times.
Example 2: Multinomial
For a single observation:
p(x | μ) = Π_k μ_k^(x_k) = exp{Σ_k x_k ln μ_k}
Matching p(x | η) = h(x) g(η) exp{η^T u(x)}:
η = (ln μ_1, ..., ln μ_M)^T, u(x) = x, h(x) = 1, g(η) = 1
The parameters are not independent due to the constraint that the μ_k sum to 1; there's a slightly more involved notation to address that, see Bishop 2.4.
Example 3: Normal (Gaussian)
Gaussian (Normal) distribution:
p(x | μ, σ²) = 1/√(2πσ²) exp{−(x − μ)² / (2σ²)}
Example 3: Normal (Gaussian)
p(x | μ, σ²) = 1/√(2πσ²) exp{−(x − μ)² / (2σ²)}
μ is the mean, σ² is the variance.
Can verify these by computing integrals, e.g.
∫ x · 1/√(2πσ²) exp{−(x − μ)² / (2σ²)} dx = μ
Example 3: Normal (Gaussian)
Multivariate Gaussian:
p(x | μ, Σ) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp{−(1/2)(x − μ)^T Σ⁻¹ (x − μ)}
Example 3: Normal (Gaussian)
p(x | μ, Σ) = 1 / ((2π)^(D/2) |Σ|^(1/2)) exp{−(1/2)(x − μ)^T Σ⁻¹ (x − μ)}
x is now a vector, μ is the mean vector, Σ is the covariance matrix.
Important Properties of Gaussians
All marginals of a Gaussian are again Gaussian.
Any conditional of a Gaussian is Gaussian.
The product of two Gaussian densities is (proportional to) a Gaussian.
The sum of two independent Gaussian random variables is a Gaussian.
Beyond the scope of this tutorial, but very important: the marginalization and conditioning rules for multivariate Gaussians.
Gaussian marginalization visualization
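A quick empirical sketch of the last closure property above (parameters made up): summing two independent Gaussian samples gives a Gaussian whose mean and variance are the sums of the components'.

```python
# Sum of independent Gaussians: means add, variances add.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(loc=1.0, scale=2.0, size=1_000_000)   # N(1, 4)
b = rng.normal(loc=-3.0, scale=1.0, size=1_000_000)  # N(-3, 1)

s = a + b
print(s.mean(), s.var())   # ~-2.0 and ~5.0, i.e. N(1 + (-3), 4 + 1)
```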
Exponential Family Representation of the Gaussian
p(x | μ, σ²) = 1/√(2πσ²) exp{−x²/(2σ²) + μx/σ² − μ²/(2σ²)}
Matching p(x | η) = h(x) g(η) exp{η^T u(x)}:
η = (μ/σ², −1/(2σ²))^T
u(x) = (x, x²)^T
h(x) = (2π)^(−1/2)
g(η) = (−2η₂)^(1/2) exp{η₁² / (4η₂)}
Example: Maximum Likelihood for a 1D Gaussian
Suppose we are given a data set of samples of a Gaussian random variable X, D = {x₁, ..., x_N}, and told that the variance of the data is σ².
What is our best guess of μ?
*We need to assume the data are independent and identically distributed (i.i.d.).
Example: Maximum Likelihood for a 1D Gaussian
What is our best guess of μ? We can write down the likelihood function:
p(D | μ, σ²) = Π_{i=1..N} p(x_i | μ, σ²) = Π_{i=1..N} 1/√(2πσ²) exp{−(x_i − μ)² / (2σ²)}
We want to choose the μ that maximizes this expression.
Take the log, then basic calculus: differentiate w.r.t. μ, set the derivative to 0, and solve for μ to get the sample mean:
μ_ML = (1/N) Σ_{i=1..N} x_i
Example: Maximum Likelihood for a 1D Gaussian
[Figure: the N data points with the maximum likelihood Gaussian fit, mean μ_ML and variance σ²]
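A minimal sketch of the estimator just derived, on simulated data (the true mean and variance here are invented): with known variance, μ_ML is just the sample mean.

```python
# Maximum likelihood estimate of a Gaussian mean with known variance.
import numpy as np

rng = np.random.default_rng(2)
true_mu, sigma = 5.0, 1.5
D = rng.normal(true_mu, sigma, size=100)   # i.i.d. samples

mu_ml = D.mean()                           # mu_ML = (1/N) sum_i x_i
print(mu_ml)                               # close to 5.0
```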
ML Estimation of Model Parameters for the Exponential Family
p(D | η) = p(x_1, ..., x_N | η) = (Π_n h(x_n)) g(η)^N exp{η^T Σ_n u(x_n)}
Take ln p(D | η), differentiate w.r.t. η, set to 0, and solve for η:
−∇ ln g(η_ML) = (1/N) Σ_n u(x_n)
This can in principle be solved to get an estimate for η.
The solution for the ML estimator depends on the data only through Σ_n u(x_n), which is therefore called the sufficient statistic: it is all we need to store in order to estimate the parameters.
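A sketch of the sufficient-statistic point for the Gaussian (data simulated with made-up parameters): to fit the mean and variance by maximum likelihood we only need N, Σx, and Σx², never the raw data again.

```python
# Fitting a Gaussian from its sufficient statistics alone.
import numpy as np

rng = np.random.default_rng(3)
D = rng.normal(2.0, 3.0, size=10_000)

N, s1, s2 = len(D), D.sum(), (D**2).sum()  # sufficient statistics sum_n u(x_n)

mu_ml = s1 / N
var_ml = s2 / N - mu_ml**2                 # E[x^2] - E[x]^2
print(mu_ml, var_ml)                       # ~2.0 and ~9.0
```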
Bayesian Probabilities
p(θ | d) = p(d | θ) p(θ) / p(d)
p(d | θ) is the likelihood function.
p(θ) is the prior probability of θ, or our prior belief over θ: our beliefs over which models are likely or not before seeing any data.
p(d) = ∫ p(d | θ) p(θ) dθ is the normalization constant, or partition function.
p(θ | d) is the posterior distribution: a readjustment of our prior beliefs in the face of data.
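A sketch of this recipe on a grid (not from the slides; the coin-flip counts are made up): posterior = likelihood × prior, normalized numerically, for a single parameter θ.

```python
# Grid-based Bayesian inference for a coin's bias theta.
import numpy as np

theta = np.linspace(0, 1, 1001)            # grid over theta
prior = np.ones_like(theta)                # flat prior p(theta)
likelihood = theta**7 * (1 - theta)**3     # 7 heads, 3 tails (illustrative data)

post = likelihood * prior
post /= post.sum() * (theta[1] - theta[0]) # normalize: divide by p(d)
print(theta[post.argmax()])                # posterior mode, 0.7 here
```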
Example: Bayesian Inference for a 1D Gaussian
Suppose we have a prior belief that the mean of some random variable X is μ₀, and the variance of our belief is σ₀².
We are then given a data set of samples of X, d = {x₁, ..., x_N}, and somehow know that the variance of the data is σ².
What is the posterior distribution over our belief about the value of μ?
Example: Bayesian Inference for a 1D Gaussian
[Figure: the N observed data samples]
Example: Bayesian Inference for a 1D Gaussian
[Figure: prior belief, a Gaussian N(μ | μ₀, σ₀²)]
Example: Bayesian Inference for a 1D Gaussian
Remember from earlier:
p(μ | d) = p(d | μ) p(μ) / p(d)
p(d | μ) is the likelihood function; p(μ) is the prior probability of μ, or our prior belief over μ.
p(d | μ) = Π_{i=1..N} 1/√(2πσ²) exp{−(x_i − μ)² / (2σ²)}
p(μ) = 1/√(2πσ₀²) exp{−(μ − μ₀)² / (2σ₀²)}
Example: Bayesian Inference for a 1D Gaussian
p(μ | D) ∝ p(D | μ) p(μ) = N(μ | μ_N, σ_N²), where
μ_N = (σ² / (N σ₀² + σ²)) μ₀ + (N σ₀² / (N σ₀² + σ²)) μ_ML
1/σ_N² = 1/σ₀² + N/σ²
with μ_ML = (1/N) Σ_i x_i.
Example: Bayesian Inference for a 1D Gaussian
[Figures: build-up showing the prior belief N(μ | μ₀, σ₀²), the maximum likelihood estimate μ_ML, and the posterior distribution N(μ | μ_N, σ_N²) concentrating between the prior mean and μ_ML]
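A small sketch implementing the closed-form update above (all numbers illustrative): a N(μ₀, σ₀²) prior, known noise variance σ², and simulated observations.

```python
# Closed-form posterior for a Gaussian mean with known variance.
import numpy as np

rng = np.random.default_rng(4)
mu0, sigma0 = 0.0, 2.0                # prior belief about mu
sigma = 1.0                           # known data variance
D = rng.normal(3.0, sigma, size=20)   # simulated observations

N, mu_ml = len(D), D.mean()
mu_N = (sigma**2 * mu0 + N * sigma0**2 * mu_ml) / (N * sigma0**2 + sigma**2)
var_N = 1.0 / (1.0 / sigma0**2 + N / sigma**2)
print(mu_N, var_N)                    # posterior mean near 3.0, small variance
```

Note how μ_N interpolates between the prior mean μ₀ and the data's μ_ML, with the data dominating as N grows.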
Conjugate Priors
Notice in the Gaussian parameter-estimation example that the functional form of the posterior was that of the prior (Gaussian). Priors that lead to that form are called conjugate priors.
For any member of the exponential family there exists a conjugate prior that can be written like
p(η | χ, ν) = f(χ, ν) g(η)^ν exp{ν η^T χ}
Multiply by the likelihood to obtain the posterior (up to normalization) of the form
p(η | D, χ, ν) ∝ g(η)^(ν+N) exp{η^T (Σ_n u(x_n) + ν χ)}
Notice the addition to the sufficient statistic: ν is the effective number of pseudo-observations.
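A sketch of conjugacy in action using the Beta-Bernoulli pair (counts made up): a Beta(a, b) prior on a coin's bias is conjugate to the Bernoulli likelihood, so the posterior update is just count bookkeeping, with a and b playing the role of pseudo-observations.

```python
# Beta-Bernoulli conjugate update: posterior parameters are counts.
a, b = 2.0, 2.0                    # prior pseudo-counts of heads and tails
data = [1, 0, 1, 1, 0, 1, 1]       # illustrative coin flips

a_post = a + sum(data)             # add observed heads to the statistic
b_post = b + len(data) - sum(data) # add observed tails
print(a_post / (a_post + b_post))  # posterior mean of the bias, 7/11 here
```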
Conjugate Priors - Examples
Beta for Bernoulli/Binomial
Dirichlet for Categorical/Multinomial
Normal for Normal (for the mean, with known variance)