Random Variables and Densities
Random Variables and Densities
Review: Probability and Statistics
Sam Roweis
October 2, 2002

Random Variables

Random variables X represent outcomes or states of the world. Instantiations of variables are usually written in lower case: x. We will write p(x) to mean probability(X = x).
Sample Space: the space of all possible outcomes/states. (May be discrete, continuous, or mixed.)
Probability mass (density) function p(x) >= 0: assigns a non-negative number to each point in sample space. Sums (integrates) to unity: Σ_x p(x) = 1 or ∫ p(x) dx = 1. Intuitively: how often does x occur, how much do we believe in x.
Ensemble: random variable + sample space + probability function.

Probability

We use probabilities p(x) to represent our beliefs B(x) about the states x of the world. There is a formal calculus for manipulating uncertainties represented by probabilities. Any consistent set of beliefs obeying the Cox Axioms can be mapped into probabilities:
1. Rationally ordered degrees of belief: if B(x) > B(y) and B(y) > B(z) then B(x) > B(z).
2. Belief in x and its negation x̄ are related: B(x) = f[B(x̄)].
3. Belief in a conjunction depends only on conditionals: B(x and y) = g[B(x), B(y|x)] = g[B(y), B(x|y)].

Expectations, Moments

The expectation of a function a(x) is written E[a] or ⟨a⟩:
E[a] = ⟨a⟩ = Σ_x p(x) a(x)
e.g. mean = Σ_x x p(x), variance = Σ_x (x − E[x])² p(x).
Moments are expectations of higher order powers. (The mean is the first moment; the autocorrelation is the second moment.) Centralized moments have lower moments subtracted away (e.g. variance, skew, kurtosis).
Deep fact: knowledge of moments of all orders completely defines the entire distribution.
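The expectation definitions above can be checked numerically. A minimal sketch, using a fair six-sided die as a made-up example distribution (not one from the notes):

```python
# Expectation and variance of a discrete random variable,
# directly from the definition E[a] = sum_x p(x) a(x).
p = {1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6}

assert abs(sum(p.values()) - 1.0) < 1e-12  # p must sum to unity

def expectation(p, a=lambda x: x):
    """E[a(x)] = sum_x p(x) a(x)."""
    return sum(px * a(x) for x, px in p.items())

mean = expectation(p)                                  # first moment
second_moment = expectation(p, lambda x: x * x)        # autocorrelation
variance = expectation(p, lambda x: (x - mean) ** 2)   # centralized second moment
```

For the fair die this gives mean 3.5 and variance 35/12, matching the hand calculation.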
Joint Probability

Key concept: two or more random variables may interact. Thus, the probability of one taking on a certain value depends on which value(s) the others are taking. We call this a joint ensemble and write
p(x, y) = prob(X = x and Y = y).

Conditional Probability

If we know that some event has occurred, it changes our belief about the probability of other events. This is like taking a "slice" through the joint table:
p(x|y) = p(x, y)/p(y).

Marginal Probabilities

We can "sum out" part of a joint distribution to get the marginal distribution of a subset of variables:
p(x) = Σ_y p(x, y).
This is like adding slices of the table together.

Bayes' Rule

Manipulating the basic definition of conditional probability gives one of the most important formulas in probability theory:
p(x|y) = p(y|x) p(x) / p(y) = p(y|x) p(x) / Σ_x' p(y|x') p(x').
This gives us a way of "reversing" conditional probabilities. Thus, all joint probabilities can be factored by selecting an ordering for the random variables and using the "chain rule":
p(x, y, z, ...) = p(x) p(y|x) p(z|x, y) p(...|x, y, z).
Another equivalent definition: p(x, y) = p(x|y) p(y).
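Marginalization, conditioning, and Bayes' rule can all be demonstrated on a tiny joint table. The joint distribution below is a made-up example, not one from the notes:

```python
# A joint table p(x, y) over two binary variables (made-up numbers).
joint = {('x0', 'y0'): 0.1, ('x0', 'y1'): 0.3,
         ('x1', 'y0'): 0.2, ('x1', 'y1'): 0.4}

def marginal_x(joint, x):
    """p(x) = sum_y p(x, y): add slices of the table together."""
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(joint, y):
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_x_given_y(joint, x, y):
    """p(x|y) = p(x, y) / p(y): a slice through the joint table."""
    return joint[(x, y)] / marginal_y(joint, y)

def bayes(joint, x, y):
    """p(x|y) via Bayes' rule: p(y|x) p(x) / p(y)."""
    p_y_given_x = joint[(x, y)] / marginal_x(joint, x)
    return p_y_given_x * marginal_x(joint, x) / marginal_y(joint, y)
```

Direct conditioning and the Bayes'-rule route necessarily agree, since both are rearrangements of the same definition.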
Independence & Conditional Independence

Two variables are independent iff their joint factors:
p(x, y) = p(x) p(y).
Two variables are conditionally independent given a third one if, for all values of the conditioning variable, the resulting slice factors:
p(x, y|z) = p(x|z) p(y|z)  for all z.

Entropy

Measures the amount of ambiguity or uncertainty in a distribution:
H(p) = −Σ_x p(x) log p(x).
This is the expected value of −log p(x) (a function which itself depends on p(x)!). H(p) > 0 unless only one outcome is possible, in which case H(p) = 0. The maximal value is attained when p is uniform. It tells you the expected cost if each event costs −log p(event).

Be Careful!

Define random variables and sample spaces carefully: e.g. the Prisoner's paradox. Watch the context: e.g. Simpson's paradox.

Cross Entropy (KL Divergence)

An asymmetric measure of the "distance" between two distributions:
KL[p‖q] = Σ_x p(x) [log p(x) − log q(x)].
KL > 0 unless p = q, in which case KL = 0. It tells you the extra cost if events were generated by p(x) but instead of charging under p(x) you charged under q(x).
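The entropy and KL formulas translate directly into code. A sketch over small made-up distributions, with the convention 0 log 0 = 0:

```python
from math import log

def entropy(p):
    """H(p) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    return -sum(px * log(px) for px in p if px > 0)

def kl(p, q):
    """KL[p||q] = sum_x p(x) (log p(x) - log q(x))."""
    return sum(px * (log(px) - log(qx)) for px, qx in zip(p, q) if px > 0)

uniform = [0.25] * 4                  # maximal-entropy distribution on 4 outcomes
peaked  = [0.85, 0.05, 0.05, 0.05]   # low-entropy, made-up distribution
```

As the notes state, entropy is maximal for the uniform distribution (log 4 here), and KL is nonnegative, vanishing only when p = q.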
Statistics

Probability: inferring probabilistic quantities for data given fixed models (e.g. probabilities of events, marginals, conditionals, etc.).
Statistics: inferring a model given fixed data observations (e.g. clustering, classification, regression).
Many approaches to statistics: frequentist, Bayesian, decision theory, ...

(Conditional) Probability Tables

For discrete (categorical) quantities, the most basic parametrization is the probability table, which lists p(x_i = k-th value). Since PTs must be nonnegative and sum to 1, for k-ary variables there are k − 1 free parameters. If a discrete variable is conditioned on the values of some other discrete variables, we make one table for each possible setting of the parents: these are called conditional probability tables, or CPTs.

Some (Conditional) Probability Functions

Probability density functions p(x) (for continuous variables) or probability mass functions p(x = k) (for discrete variables) tell us how likely it is to get a particular value for a random variable (possibly conditioned on the values of some other variables). We can consider various types of variables: binary/discrete (categorical), continuous, interval, and integer counts. For each type we'll see some basic probability models which are parametrized families of distributions.

Exponential Family

For a (continuous or discrete) random variable x,
p(x|η) = h(x) exp{ηᵀ T(x) − A(η)} = (1/Z(η)) h(x) exp{ηᵀ T(x)}
is an exponential family distribution with natural parameter η. The function T(x) is a sufficient statistic. The function A(η) = log Z(η) is the log normalizer. Key idea: all you need to know about the data is captured in the summarizing function T(x).
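The exponential-family form h(x) exp{η T(x) − A(η)} can be checked concretely. A sketch instantiating it for the Bernoulli distribution (a standard example) and verifying it against the familiar π^x (1 − π)^(1−x) form:

```python
from math import exp, log

def expfam_bernoulli(x, eta):
    """Bernoulli density in exponential-family form h(x) exp{eta*T(x) - A(eta)}."""
    h = 1.0                      # base measure h(x) = 1
    T = x                        # sufficient statistic T(x) = x
    A = log(1.0 + exp(eta))      # log normalizer A(eta) = log(1 + e^eta)
    return h * exp(eta * T - A)

pi = 0.3
eta = log(pi / (1 - pi))         # natural parameter (log-odds)
```

Evaluating at x = 1 and x = 0 recovers π and 1 − π exactly.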
Bernoulli

For a binary random variable x with p(heads) = π:
p(x|π) = π^x (1 − π)^(1−x) = exp{ x log(π/(1 − π)) + log(1 − π) }.
Exponential family with:
η = log(π/(1 − π))
T(x) = x
A(η) = −log(1 − π) = log(1 + e^η)
h(x) = 1
The logistic function relates the natural parameter and the chance of heads:
π = 1/(1 + e^(−η)).

Multinomial

For a set of integer counts x_i on k trials:
p(x|π) = (k!/(x_1! x_2! ⋯ x_n!)) π_1^{x_1} π_2^{x_2} ⋯ π_n^{x_n} = h(x) exp{ Σ_i x_i log π_i }.
But the parameters are constrained: Σ_i π_i = 1, so we define the last one as π_n = 1 − Σ_{i=1}^{n−1} π_i:
p(x|π) = h(x) exp{ Σ_{i=1}^{n−1} x_i log(π_i/π_n) + k log π_n }.
Exponential family with:
η_i = log π_i − log π_n
T(x_i) = x_i
A(η) = −k log π_n = k log Σ_i e^{η_i}
h(x) = k!/(x_1! x_2! ⋯ x_n!)
The softmax function relates the basic and natural parameters:
π_i = e^{η_i} / Σ_j e^{η_j}.

Poisson

For an integer count variable with rate λ:
p(x|λ) = λ^x e^{−λ}/x! = (1/x!) exp{ x log λ − λ }.
Exponential family with:
η = log λ
T(x) = x
A(η) = λ = e^η
h(x) = 1/x!
e.g. the number of photons that arrive at a pixel during a fixed interval given mean intensity λ.
Other count densities: binomial, exponential.
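The logistic and softmax link functions described above are a few lines each. A sketch showing the round trip between mean parameters and natural parameters:

```python
from math import exp, log

def logistic(eta):
    """pi = 1 / (1 + e^-eta): natural parameter -> chance of heads."""
    return 1.0 / (1.0 + exp(-eta))

def natural_param(pi):
    """eta = log(pi / (1 - pi)): the log-odds, inverse of the logistic."""
    return log(pi / (1.0 - pi))

def softmax(etas):
    """pi_i = e^{eta_i} / sum_j e^{eta_j}: natural -> basic multinomial parameters."""
    zs = [exp(e) for e in etas]
    total = sum(zs)
    return [z / total for z in zs]
```

The logistic inverts the log-odds exactly, and softmax always returns a valid probability vector, uniform when all natural parameters are equal.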
Gaussian (Normal)

For a continuous univariate random variable:
p(x|μ, σ²) = (1/(√(2π) σ)) exp{ −(x − μ)²/(2σ²) }
           = (1/√(2π)) exp{ μx/σ² − x²/(2σ²) − μ²/(2σ²) − log σ }.
Exponential family with:
η = [μ/σ² ; −1/(2σ²)]
T(x) = [x ; x²]
A(η) = log σ + μ²/(2σ²)
h(x) = 1/√(2π)
Note: a univariate Gaussian is a two-parameter distribution with a two-component vector of sufficient statistics.

Important Gaussian Facts

All marginals of a Gaussian are again Gaussian. Any conditional of a Gaussian is again Gaussian.

Multivariate Gaussian Distribution

For a continuous vector random variable:
p(x|μ, Σ) = |2πΣ|^(−1/2) exp{ −(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ) }.
Exponential family with:
η = [Σ⁻¹μ ; −(1/2)Σ⁻¹]
T(x) = [x ; xxᵀ]
A(η) = log|Σ|/2 + μᵀΣ⁻¹μ/2
h(x) = (2π)^(−n/2)
Sufficient statistics: mean vector and correlation matrix.
Other densities: Student-t, Laplacian. For non-negative values use exponential, Gamma, log-normal.

Gaussian Marginals/Conditionals

Finding these parameters is mostly linear algebra. Let z = [x ; y] be normally distributed according to
z = [x ; y] ~ N( [a ; b] ; [A C ; Cᵀ B] )
where C is the (non-symmetric) cross-covariance matrix between x and y, which has as many rows as the size of x and as many columns as the size of y. The marginal distributions are
x ~ N(a ; A)    y ~ N(b ; B)
and the conditional distributions are
x|y ~ N(a + C B⁻¹ (y − b) ; A − C B⁻¹ Cᵀ)
y|x ~ N(b + Cᵀ A⁻¹ (x − a) ; B − Cᵀ A⁻¹ C).
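In the scalar case the conditional formulas above reduce to simple arithmetic, since the matrix inverses become divisions. A sketch with made-up means and (co)variances:

```python
# Scalar instance of the Gaussian conditional formula:
# x | y ~ N(a + C B^{-1} (y - b), A - C B^{-1} C).
# All numbers below are made-up illustration values.
a, b = 1.0, -2.0          # means of x and y
A, B, C = 2.0, 1.0, 0.5   # var(x), var(y), cov(x, y)

def conditional_x_given_y(y):
    """Return (mean, variance) of x given y = y."""
    mean = a + (C / B) * (y - b)
    var = A - C * C / B
    return mean, var

m, v = conditional_x_given_y(0.0)
```

Observing y shifts the mean of x toward its correlated value and always shrinks the variance (here from 2.0 to 1.75), which is exactly what conditioning on correlated information should do.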
Moments

For continuous variables, moment calculations are important. We can easily compute moments of an exponential family distribution by taking derivatives of the log normalizer A(η). The q-th derivative gives the q-th centred moment:
dA(η)/dη = mean
d²A(η)/dη² = variance
When the sufficient statistic is a vector, partial derivatives need to be considered.

Generalized Linear Models (GLMs)

In a generalized linear model, p(y|x) is exponential family with conditional mean μ = f(θᵀx). The function f is called the response function. If we choose f to be the inverse of the mapping between the conditional mean and the natural parameters, it is called the canonical response function:
η = ψ(μ)    f(·) = ψ⁻¹(·).

Parameterizing Conditionals

When the variable(s) being conditioned on (parents) are discrete, we just have one density for each possible setting of the parents, e.g. a table of natural parameters in exponential models or a table of tables for discrete models. When the variable being conditioned on is continuous, its value sets some of the parameters for the other variables. A very common instance of this for regression is the "linear-Gaussian": p(y|x) = gauss(θᵀx ; Σ). For discrete children and continuous parents, we often use a Bernoulli/multinomial whose parameters are some function f(θᵀx).

Potential Functions

We can be even more general and define distributions by arbitrary "energy" functions proportional to the log probability:
p(x) ∝ exp{ −Σ_k H_k(x) }.
A common choice is to use pairwise terms in the energy:
H(x) = Σ_i a_i x_i + Σ_{pairs ij} w_ij x_i x_j.
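The derivatives-of-the-log-normalizer fact is easy to verify numerically. A sketch using the Bernoulli log normalizer A(η) = log(1 + e^η) and finite differences, checking that A'(η) recovers the mean π and A''(η) the variance π(1 − π):

```python
from math import exp, log

def A(eta):
    """Bernoulli log normalizer A(eta) = log(1 + e^eta)."""
    return log(1.0 + exp(eta))

eta = 0.7                                 # an arbitrary natural parameter
pi = 1.0 / (1.0 + exp(-eta))              # implied mean parameter

h = 1e-5
mean = (A(eta + h) - A(eta - h)) / (2 * h)            # central difference: dA/deta
h2 = 1e-4
var = (A(eta + h2) - 2 * A(eta) + A(eta - h2)) / h2**2  # second difference: d2A/deta2
```

The numerical derivatives agree with π and π(1 − π) to within finite-difference error, illustrating that moments fall out of A(η) without any explicit summation over outcomes.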
Special Variables

If certain variables are always observed, we may not want to model their density: for example, inputs in regression or classification. This leads to conditional density estimation. If certain variables are always unobserved, they are called hidden or latent variables. They can always be marginalized out, but they can make density modeling of the observed variables easier. (We'll see more on this later.)

Likelihood Function

So far we have focused on the (log) probability function p(x|θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ. But in learning we turn this on its head: we have some fixed data and we want to find parameters. Think of p(x|θ) as a function of θ for fixed x:
L(θ; x) = p(x|θ)
ℓ(θ; x) = log p(x|θ)
This function is called the (log) likelihood. Choose θ to maximize some cost function c(θ) which includes ℓ(θ):
c(θ) = ℓ(θ; D)            maximum likelihood (ML)
c(θ) = ℓ(θ; D) + r(θ)     maximum a posteriori (MAP)/penalized ML
(also cross-validation, Bayesian estimators, BIC, AIC, ...).

Multiple Observations, Complete Data, IID Sampling

A single observation of the data X is rarely useful on its own. Generally we have data including many observations, which creates a set of random variables:
D = {x_1, x_2, ..., x_M}
Two very common assumptions:
1. Observations are independently and identically distributed according to the joint distribution of the graphical model: IID samples.
2. We observe all random variables in the domain on each observation: complete data.

Maximum Likelihood

For IID data:
p(D|θ) = Π_m p(x_m|θ)
ℓ(θ; D) = Σ_m log p(x_m|θ)
The idea of maximum likelihood estimation (MLE) is to pick the setting of parameters most likely to have generated the data we saw:
θ*_ML = argmax_θ ℓ(θ; D)
Very commonly used in statistics. Often leads to "intuitive", "appealing", or "natural" estimators.
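Viewing the iid log likelihood as a function of θ for fixed data can be made concrete with a brute-force grid search. A sketch for Bernoulli data (the flip sequence is made up); the grid maximizer should land on the fraction of heads:

```python
from math import log

data = [1, 1, 0, 1, 0, 1, 1, 0]          # 1 = heads, 0 = tails (made-up data)

def loglik(theta, data):
    """l(theta; D) = sum_m log p(x_m | theta) for iid Bernoulli data."""
    return sum(log(theta) if x == 1 else log(1.0 - theta) for x in data)

grid = [i / 1000 for i in range(1, 1000)]          # avoid theta in {0, 1}
theta_hat = max(grid, key=lambda t: loglik(t, data))
```

With 5 heads in 8 flips, the grid search picks theta_hat = 0.625, matching the closed-form answer that the next page derives analytically.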
Example: Bernoulli Trials

We observe M iid coin flips: D = H, H, T, H, ...
Model: p(H) = θ, p(T) = 1 − θ.
Likelihood:
ℓ(θ; D) = log p(D|θ)
        = log Π_m θ^{x_m} (1 − θ)^{1−x_m}
        = log θ Σ_m x_m + log(1 − θ) Σ_m (1 − x_m)
        = N_H log θ + N_T log(1 − θ)
Take derivatives and set to zero:
∂ℓ/∂θ = N_H/θ − N_T/(1 − θ) = 0  ⇒  θ*_ML = N_H/(N_H + N_T).

Example: Univariate Normal

We observe M iid real samples: D = 1.18, −0.25, 0.78, ...
Model: p(x) = (2πσ²)^(−1/2) exp{ −(x − μ)²/(2σ²) }.
Likelihood (using the probability density):
ℓ(θ; D) = log p(D|θ) = −(M/2) log(2πσ²) − (1/2) Σ_m (x_m − μ)²/σ²
Take derivatives and set to zero:
∂ℓ/∂μ = (1/σ²) Σ_m (x_m − μ)
∂ℓ/∂σ² = −M/(2σ²) + (1/(2σ⁴)) Σ_m (x_m − μ)²
⇒ μ_ML = (1/M) Σ_m x_m
  σ²_ML = (1/M) Σ_m x_m² − μ²_ML

Example: Multinomial

We observe M iid die rolls (K-sided): D = 3, 1, K, 2, ...
Model: p(k) = θ_k, with Σ_k θ_k = 1.
Likelihood (using binary indicators [x_m = k]):
ℓ(θ; D) = log p(D|θ) = log Π_m θ_1^{[x_m=1]} ⋯ θ_K^{[x_m=K]}
        = Σ_k log θ_k Σ_m [x_m = k] = Σ_k N_k log θ_k
Take derivatives and set to zero (enforcing the constraint Σ_k θ_k = 1):
∂ℓ/∂θ_k = N_k/θ_k − M  ⇒  θ*_k = N_k/M.
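The closed-form estimators from these examples are one-liners. A sketch computing the Bernoulli and univariate-normal MLEs; the flip string is made up, while the three real samples are the ones shown in the notes:

```python
# Bernoulli MLE: fraction of heads.
flips = "HHTHTTHHHT"                     # made-up data
N_H = flips.count("H")
N_T = flips.count("T")
theta_ml = N_H / (N_H + N_T)

# Univariate normal MLEs: sample mean and (biased) sample variance.
xs = [1.18, -0.25, 0.78]                 # the samples listed in the notes
M = len(xs)
mu_ml = sum(xs) / M
sigma2_ml = sum(x * x for x in xs) / M - mu_ml ** 2   # E[x^2] - mean^2 form
```

The mean-of-squares-minus-squared-mean form of the variance agrees exactly with the centered form (1/M) Σ (x − μ_ML)², which is a useful internal consistency check.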
Example: Linear Regression

In linear regression, some inputs (covariates, parents) x and all outputs (responses, children) y are continuous-valued variables. For each child and setting of discrete parents we use the model:
p(y|x, θ) = gauss(θᵀx ; σ²)
The likelihood is the familiar "squared error" cost:
ℓ(θ; D) = −(1/(2σ²)) Σ_m (y_m − θᵀx_m)²
The ML parameters can be solved for using linear least-squares:
∂ℓ/∂θ = −Σ_m (y_m − θᵀx_m) x_m = 0  ⇒  θ*_ML = (XᵀX)⁻¹ XᵀY.

Sufficient Statistics

A statistic is a function of a random variable. T(X) is a "sufficient statistic" for X if
T(x_1) = T(x_2)  ⇒  L(θ; x_1) = L(θ; x_2)  for all θ.
Equivalently (by the Neyman factorization theorem) we can write:
p(x|θ) = h(x, T(x)) g(T(x), θ).
Example: exponential family models, p(x|θ) = h(x) exp{ηᵀT(x) − A(η)}.

Sufficient Statistics Are Sums

In the examples above, the sufficient statistics were merely sums (counts) of the data:
Bernoulli: number of heads and tails
Multinomial: number of each type
Gaussian: mean, mean-square
Regression: correlations
As we will see, this is true for all exponential family models: sufficient statistics are average natural parameters. Only exponential family models have simple sufficient statistics.
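The normal equations θ = (XᵀX)⁻¹XᵀY can be solved by hand in the single-input-plus-bias case, since XᵀX is then a 2×2 matrix. A sketch on made-up points lying exactly on y = 2x + 1:

```python
# Linear least squares via the normal equations, solved explicitly for
# a design matrix X = [[x_m, 1], ...] (slope and intercept).
xs = [0.0, 1.0, 2.0, 3.0]        # made-up inputs
ys = [1.0, 3.0, 5.0, 7.0]        # exactly y = 2x + 1

n = len(xs)
Sxx = sum(x * x for x in xs)     # entries of X^T X
Sx = sum(xs)
Sxy = sum(x * y for x, y in zip(xs, ys))   # entries of X^T Y
Sy = sum(ys)

det = Sxx * n - Sx * Sx          # determinant of the 2x2 system
slope = (n * Sxy - Sx * Sy) / det
intercept = (Sxx * Sy - Sx * Sxy) / det
```

Because the data are noiseless, the least-squares solution recovers the generating line exactly: slope 2, intercept 1.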
MLE for Exponential Family Models

Recall the probability function for exponential models:
p(x|θ) = h(x) exp{ηᵀT(x) − A(η)}
For iid data, the sufficient statistic is Σ_m T(x_m):
ℓ(η; D) = log p(D|η) = (Σ_m log h(x_m)) − M A(η) + ηᵀ Σ_m T(x_m)
Take derivatives and set to zero:
∂ℓ/∂η = Σ_m T(x_m) − M ∂A(η)/∂η = 0
⇒ ∂A(η)/∂η |_{η_ML} = (1/M) Σ_m T(x_m)
recalling that the natural moments of an exponential distribution are the derivatives of the log normalizer.

Fundamental Operations with Distributions

Generate data: draw samples from the distribution. This often involves generating a uniformly distributed variable in the range [0, 1] and transforming it. For more complex distributions it may involve an iterative procedure that takes a long time to produce a single sample (e.g. Gibbs sampling, MCMC).
Compute log probabilities: when all variables are either observed or marginalized, the result is a single number which is the log probability of the configuration.
Inference: compute expectations of some variables given others which are observed or marginalized.
Learning: set the parameters of the density functions given some (partially) observed data to maximize likelihood or penalized likelihood.

Basic Statistical Problems

Let's remind ourselves of the basic problems we discussed on the first day: density estimation, clustering, classification, and regression. Density estimation is the hardest. If we can do joint density estimation then we can always condition to get what we want:
Regression: p(y|x) = p(y, x)/p(x)
Classification: p(c|x) = p(c, x)/p(x)
Clustering: p(c|x) = p(c, x)/p(x), with c unobserved

Learning

In AI the bottleneck is often knowledge acquisition. Human experts are rare, expensive, unreliable, and slow. But we have lots of data. We want to build systems automatically based on data and a small amount of prior information (from experts).
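The "generate data by transforming a uniform variable" operation can be illustrated with inverse-CDF sampling. A sketch for the exponential distribution (a standard choice, not the notes' own example), whose CDF 1 − e^{−λx} inverts in closed form:

```python
import random
from math import log

def sample_exponential(lam, rng):
    """Inverse-transform sampling: u ~ Uniform[0,1), x = -log(1-u)/lam."""
    u = rng.random()
    return -log(1.0 - u) / lam     # has CDF 1 - e^{-lam * x}

rng = random.Random(0)             # fixed seed for reproducibility
lam = 2.0
samples = [sample_exponential(lam, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)   # should approach 1/lam = 0.5
```

The empirical mean of the transformed uniforms converges to 1/λ, and every sample is nonnegative, as the exponential distribution requires.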
Known Models

Many systems we build will be essentially probability models. Assume the prior information we have specifies the type and structure of the model, as well as the form of the (conditional) distributions or potentials. In this case learning ≡ setting parameters. It is also possible to do "structure learning" to learn the model itself.

Jensen's Inequality

For any concave function f(x) and any distribution on x,
E[f(x)] ≤ f(E[x]).
e.g. log(x) and √x are concave. This allows us to bound expressions like
log p(x) = log Σ_z p(x, z).
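Jensen's inequality is easy to see numerically for the concave log. A sketch on a made-up two-point distribution:

```python
from math import log

# Jensen's inequality for concave f = log: E[log x] <= log E[x].
xs = [1.0, 4.0]        # made-up outcomes
ps = [0.5, 0.5]        # made-up probabilities

E_x = sum(p * x for p, x in zip(ps, xs))          # E[x] = 2.5
E_log = sum(p * log(x) for p, x in zip(ps, xs))   # E[log x] = 0.5 * log 4 = log 2
```

Here E[log x] = log 2 ≈ 0.693 while log E[x] = log 2.5 ≈ 0.916, so the inequality holds strictly, which is exactly the slack exploited when bounding log Σ_z p(x, z).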
A Appendix: Proofs The proofs of Theore 1-3 are along the lines of Wied and Galeano (2013) Proof of Theore 1 Let D[d 1, d 2 ] be the space of càdlàg functions on the interval [d 1, d 2 ] equipped with
More informationPhysics 18 Spring 2011 Homework 3 - Solutions Wednesday February 2, 2011
Phsics 18 Spring 2011 Hoework 3 - s Wednesda Februar 2, 2011 Make sure our nae is on our hoework, and please bo our final answer. Because we will be giving partial credit, be sure to attept all the probles,
More informationGaussians. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University.
Note to other teachers and users of these slides. Andrew would be delighted if you found this source aterial useful in giing your own lectures. Feel free to use these slides erbati, or to odify the to
More informationAarti Singh. Lecture 2, January 13, Reading: Bishop: Chap 1,2. Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell
Machine Learning 0-70/5 70/5-78, 78, Spring 00 Probability 0 Aarti Singh Lecture, January 3, 00 f(x) µ x Reading: Bishop: Chap, Slides courtesy: Eric Xing, Andrew Moore, Tom Mitchell Announcements Homework
More informationSupport Vector Machines. Machine Learning Series Jerry Jeychandra Blohm Lab
Support Vector Machines Machine Learning Series Jerry Jeychandra Bloh Lab Outline Main goal: To understand how support vector achines (SVMs) perfor optial classification for labelled data sets, also a
More informationTEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES
TEST OF HOMOGENEITY OF PARALLEL SAMPLES FROM LOGNORMAL POPULATIONS WITH UNEQUAL VARIANCES S. E. Ahed, R. J. Tokins and A. I. Volodin Departent of Matheatics and Statistics University of Regina Regina,
More informationLearnability of Gaussians with flexible variances
Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007
More informationMulti-Scale/Multi-Resolution: Wavelet Transform
Multi-Scale/Multi-Resolution: Wavelet Transfor Proble with Fourier Fourier analysis -- breaks down a signal into constituent sinusoids of different frequencies. A serious drawback in transforing to the
More informationDerivative at a point
Roberto s Notes on Differential Calculus Capter : Definition of derivative Section Derivative at a point Wat you need to know already: Te concept of liit and basic etods for coputing liits. Wat you can
More informationSampling How Big a Sample?
C. G. G. Aitken, 1 Ph.D. Sapling How Big a Saple? REFERENCE: Aitken CGG. Sapling how big a saple? J Forensic Sci 1999;44(4):750 760. ABSTRACT: It is thought that, in a consignent of discrete units, a certain
More informationImproving Ground Based Telescope Focus through Joint Parameter Estimation. Maj J. Chris Zingarelli USAF AFIT/ENG
Iproving Ground Based Telescope Focus through Joint Paraeter Estiation Maj J Chris Zingarelli USAF AFIT/ENG Lt Col Travis Blake DARPA/TTO - Space Systes Dr Stephen Cain USAF AFIT/ENG Abstract-- Space Surveillance
More informationMonte Carlo integration
Monte Carlo integration Eample of a Monte Carlo sampler in D: imagine a circle radius L/ within a square of LL. If points are randoml generated over the square, what s the probabilit to hit within circle?
More informationPOSITIVITY CONSTRAINT IMPLEMENTATIONS FOR MULTIFRAME BLIND DECONVOLUTION RECONSTRUCTIONS
POSITIVITY CONSTRAINT IMPLEMENTATIONS FOR MULTIFRAME BLIND DECONVOLUTION RECONSTRUCTIONS A DISSERTATION SUBMITTED TO THE GRADUATE DIVISION OF THE UNIVERSITY OF HAWAI I AT MĀNOA IN PARTIAL FULFILLMENT OF
More informationINF Introduction to classifiction Anne Solberg Based on Chapter 2 ( ) in Duda and Hart: Pattern Classification
INF 4300 151014 Introduction to classifiction Anne Solberg anne@ifiuiono Based on Chapter 1-6 in Duda and Hart: Pattern Classification 151014 INF 4300 1 Introduction to classification One of the most challenging
More informationIntelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines
Intelligent Systes: Reasoning and Recognition Jaes L. Crowley osig 1 Winter Seester 2018 Lesson 6 27 February 2018 Outline Perceptrons and Support Vector achines Notation...2 Linear odels...3 Lines, Planes
More informationAn Approximate Model for the Theoretical Prediction of the Velocity Increase in the Intermediate Ballistics Period
An Approxiate Model for the Theoretical Prediction of the Velocity... 77 Central European Journal of Energetic Materials, 205, 2(), 77-88 ISSN 2353-843 An Approxiate Model for the Theoretical Prediction
More informationSection 3.1. ; X = (0, 1]. (i) f : R R R, f (x, y) = x y
Paul J. Bruillard MATH 0.970 Problem Set 6 An Introduction to Abstract Mathematics R. Bond and W. Keane Section 3.1: 3b,c,e,i, 4bd, 6, 9, 15, 16, 18c,e, 19a, 0, 1b Section 3.: 1f,i, e, 6, 1e,f,h, 13e,
More information1 Generalization bounds based on Rademacher complexity
COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges
More informationA Note on the Applied Use of MDL Approximations
A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationGEE ESTIMATORS IN MIXTURE MODEL WITH VARYING CONCENTRATIONS
ACTA UIVERSITATIS LODZIESIS FOLIA OECOOMICA 3(3142015 http://dx.doi.org/10.18778/0208-6018.314.03 Olesii Doronin *, Rostislav Maiboroda ** GEE ESTIMATORS I MIXTURE MODEL WITH VARYIG COCETRATIOS Abstract.
More informationTracking using CONDENSATION: Conditional Density Propagation
Tracking using CONDENSATION: Conditional Density Propagation Goal Model-based visual tracking in dense clutter at near video frae rates M. Isard and A. Blake, CONDENSATION Conditional density propagation
More informationPhys101 Lectures 13, 14 Momentum and Collisions
Phs0 Lectures 3, 4 Moentu and ollisions Ke points: Moentu and ipulse ondition for conservation of oentu and wh How to solve collision probles entre of ass Ref: 7-,,3,4,5,6,7,8,9,0. Page Moentu is a vector:
More informationThis model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.
CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when
More informationSome Probability and Statistics
Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my
More informationINF Introduction to classifiction Anne Solberg
INF 4300 8.09.17 Introduction to classifiction Anne Solberg anne@ifi.uio.no Introduction to classification Based on handout from Pattern Recognition b Theodoridis, available after the lecture INF 4300
More informationBinomial and Poisson Probability Distributions
Binoial and Poisson Probability Distributions There are a few discrete robability distributions that cro u any ties in hysics alications, e.g. QM, SM. Here we consider TWO iortant and related cases, the
More informationPattern Recognition and Machine Learning. Artificial Neural networks
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lessons 7 20 Dec 2017 Outline Artificial Neural networks Notation...2 Introduction...3 Key Equations... 3 Artificial
More informationIntroduction to Machine Learning. Recitation 11
Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,
More informationSimple Expressions for Success Run Distributions in Bernoulli Trials
Siple Epressions for Success Run Distributions in Bernoulli Trials Marco Muselli Istituto per i Circuiti Elettronici Consiglio Nazionale delle Ricerche via De Marini, 6-16149 Genova, Ital Eail: uselli@ice.ge.cnr.it
More informationSummary of Random Variable Concepts March 17, 2000
Summar of Random Variable Concepts March 17, 2000 This is a list of important concepts we have covered, rather than a review that devives or eplains them. Tpes of random variables discrete A random variable
More informationPattern Recognition and Machine Learning. Artificial Neural networks
Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016/2017 Lessons 9 11 Jan 2017 Outline Artificial Neural networks Notation...2 Convolutional Neural Networks...3
More informationSharp Time Data Tradeoffs for Linear Inverse Problems
Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used
More informationTail Estimation of the Spectral Density under Fixed-Domain Asymptotics
Tail Estiation of the Spectral Density under Fixed-Doain Asyptotics Wei-Ying Wu, Chae Young Li and Yiin Xiao Wei-Ying Wu, Departent of Statistics & Probability Michigan State University, East Lansing,
More information