CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning


1 CS242: Probabilistic Graphical Models Lecture 4A: MAP Estimation & Graph Structure Learning Professor Erik Sudderth Brown University Computer Science October 4, 2016 Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning

2 CS242: Lecture 4A Outline
- MAP Estimation and Gaussian Priors
- Graph Structure Learning via Sparse Regression

3 Degeneracies in ML Estimation
$$\hat{\theta} = \arg\max_\theta \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta) = \arg\max_\theta \sum_{\ell=1}^L \log p(x^{(\ell)} \mid \theta)$$
- The theory justifying maximum likelihood (ML) estimates is asymptotic: good properties as $L$ becomes very large.
- But they can have poor properties with small datasets. Example: the ML estimate of a Bernoulli with no observed heads,
$$\hat{\mu} = \frac{1}{L} \sum_{\ell=1}^L x^{(\ell)} = 0 \quad \text{if } x^{(\ell)} = 0 \text{ for all } \ell,$$
assumes observing heads in the future is impossible (see the sketch below)!
- More generally, ML estimation can often give parameter values that are too extreme (too large or too small).
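To make the degeneracy concrete, here is a minimal numerical sketch (the five-toss dataset is an invented example, not from the lecture):

```python
import numpy as np

# Invented toy data: L = 5 Bernoulli observations, all tails.
x = np.zeros(5)

mu_ml = x.mean()              # ML estimate: observed fraction of heads
print(mu_ml)                  # 0.0

# Log-likelihood the fitted model assigns to a future head:
with np.errstate(divide="ignore"):
    print(np.log(mu_ml))      # -inf: heads are deemed impossible
```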

4 Bayesian Parameter Estimation
- Suppose I have $L$ independent observations sampled from some unknown probability distribution: $x = \{x^{(1)}, \ldots, x^{(L)}\}$
- We have a likelihood model with unknown parameters: $p(x \mid \theta) = \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta)$
- We have a prior distribution on parameters (possible models): $p(\theta)$
- Posterior distribution on parameters, given data: $p(\theta \mid x) = \frac{1}{p(x)} p(\theta) \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta)$

5 Bayesian Parameter Estimation
- Posterior distribution on parameters, given data: $p(\theta \mid x) = \frac{1}{p(x)} p(\theta) \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta)$
- Maximum a posteriori (MAP) parameter estimate: choose the parameters with largest posterior probability. $\hat{\theta} = \arg\max_\theta p(\theta \mid x) = \arg\max_\theta p(\theta) \prod_{\ell=1}^L p(x^{(\ell)} \mid \theta)$
- Conditional expectation parameter estimate: set the parameters to the mean of the posterior distribution. $\hat{\theta} = \mathbb{E}[\theta \mid x] = \int \theta \, p(\theta \mid x) \, d\theta$

6 Priors for Discrete Exponential Families
$$p(x \mid \theta) = \exp\{\theta^T \phi(x) - \Phi(\theta)\}, \qquad \Phi(\theta) = \log \sum_{x \in \mathcal{X}} \exp\{\theta^T \phi(x)\}$$
For discrete variables, features are indicator functions: $\phi(x) \in \{0,1\}^d$. Assuming a finite number of variables, the distribution is normalizable for any parameters: $\theta \in \mathbb{R}^d = \Theta$. First derivatives of the log normalization constant are event probabilities:
$$\nabla \Phi(\theta) = \mathbb{E}_\theta[\phi(x)] = \mu \in [0,1]^d, \qquad \mu_k = \Pr_\theta[\phi_k(x) = 1]$$
What priors $p(\theta)$ on exponential family parameters would favor simple discrete distributions?

7 Example: Bernoulli Distribution
Bernoulli distribution: single toss of a (possibly biased) coin.
$$\mathrm{Ber}(x \mid \mu) = \mu^x (1-\mu)^{1-x}, \qquad x \in \{0,1\}, \qquad \mathbb{E}[x \mid \mu] = P[x=1] = \mu, \qquad 0 \le \mu \le 1$$
Exponential family form: $\mathrm{Ber}(x \mid \theta) = \exp\{\theta x - \Phi(\theta)\}$, $\Theta = \mathbb{R}$.
Logistic function: $\mu = \frac{1}{1 + e^{-\theta}}$, $\theta = \log \frac{\mu}{1 - \mu}$.
If $\theta = 0$ then $\mu = 0.5$. If $\theta \ll 0$ then $\mu \to 0$. If $\theta \gg 0$ then $\mu \to 1$.

8 Example: Pair of Binary Variables
$$p(x) = \exp\{\theta_1 x_1 + \theta_2 x_2 + \theta_{12} x_1 x_2 - \Phi(\theta)\}, \qquad x_i \in \{0,1\}$$
Represent an arbitrary joint distribution on two bits with three statistics:
$$\mu_1 = \mathbb{E}[x_1] = \Pr[x_1 = 1], \quad \mu_2 = \mathbb{E}[x_2] = \Pr[x_2 = 1], \quad \mu_{12} = \mathbb{E}[x_1 x_2] = \Pr[x_1 = 1, x_2 = 1]$$
Note that the bits are independent when $\theta_{12} = 0$ (verified numerically below):
$$p(x) \propto \exp\{\theta_1 x_1\} \exp\{\theta_2 x_2\} \propto p(x_1) p(x_2)$$
The degree of dependence (positive or negative correlation) becomes large when $|\theta_{12}| \gg 0$. The set of realizable moments $(\mu_1, \mu_2, \mu_{12})$ is $\mathrm{conv}\{(0,0,0), (1,0,0), (0,1,0), (1,1,1)\}$.
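Because the state space has only four configurations, these claims can be checked by brute-force enumeration. A small sketch (the parameter values are arbitrary choices for illustration):

```python
import numpy as np
from itertools import product

def moments(th1, th2, th12):
    """Enumerate all 4 states of (x1, x2) and return (mu1, mu2, mu12)."""
    states = np.array(list(product([0, 1], repeat=2)))
    logp = th1 * states[:, 0] + th2 * states[:, 1] + th12 * states[:, 0] * states[:, 1]
    p = np.exp(logp - np.log(np.exp(logp).sum()))   # normalize: subtract Phi(theta)
    mu1 = p[states[:, 0] == 1].sum()                # Pr[x1 = 1]
    mu2 = p[states[:, 1] == 1].sum()                # Pr[x2 = 1]
    mu12 = p[(states[:, 0] == 1) & (states[:, 1] == 1)].sum()
    return mu1, mu2, mu12

mu1, mu2, mu12 = moments(0.5, -0.3, 0.0)
print(np.isclose(mu12, mu1 * mu2))   # True: independent when theta_12 = 0

mu1, mu2, mu12 = moments(0.5, -0.3, 3.0)
print(mu12, mu1 * mu2)               # mu12 >> mu1 * mu2: strong positive correlation
```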

9 Factor Graphs & Exponential Families
$$p(x) = \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_f \mid \theta_f)$$
$\mathcal{F}$ = set of hyperedges linking subsets of nodes $f \subseteq \mathcal{V}$; $\mathcal{V}$ = set of $N$ nodes or vertices, $\{1, 2, \ldots, N\}$.
A factor graph is created from non-negative potentials:
$$\psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}, \qquad \phi_f(x_f) \in \mathbb{R}^{d_f}$$
By setting exponential family parameters to zero, we remove factors and simplify the graphical model: if $\theta_f = 0$ then $\psi_f(x_f \mid \theta_f = 0) = 1$ for all $x_f$.

10 MAP Learning with Gaussian Priors
$$p(\theta) = \mathcal{N}(\theta \mid 0, \lambda^{-1} I)$$
Objective to minimize:
$$-\log p(\theta) = \frac{\lambda}{2} \theta^T \theta + \text{constant}$$
$$f(\theta) = -\log p(x \mid \theta) = -\sum_{\ell=1}^L \log p(x^{(\ell)} \mid \theta)$$
$$\tilde{f}(\theta) = -\log p(x \mid \theta) - \log p(\theta) = f(\theta) + \frac{\lambda}{2} \theta^T \theta \quad \text{(plus constants)}$$
Gradient: $\nabla \tilde{f}(\theta) = \nabla f(\theta) + \lambda \theta$. Hessian: $\nabla^2 \tilde{f}(\theta) = \nabla^2 f(\theta) + \lambda I$.
- Including a Gaussian prior on parameters, or equivalently adding $L_2$ regularization, is a simple modification to the gradient vector & Hessian matrix for any model (see the sketch below).
- Biases the exponential family towards weak factors, unless strong dependence is justified by the data.
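As a minimal sketch of the first bullet (the wrapper below and its argument `nll_grad_hess`, a hypothetical user-supplied routine returning the negative log-likelihood with its gradient and Hessian, are not part of the lecture):

```python
import numpy as np

def map_grad_hess(theta, nll_grad_hess, lam):
    """Turn any (f, grad, Hessian) routine into its L2-regularized version."""
    f, g, H = nll_grad_hess(theta)
    f_tilde = f + 0.5 * lam * theta @ theta    # + (lambda/2) theta^T theta
    g_tilde = g + lam * theta                  # gradient picks up lambda * theta
    H_tilde = H + lam * np.eye(theta.size)     # Hessian picks up lambda * I
    return f_tilde, g_tilde, H_tilde
```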

11 MAP Learning for Undirected Models
An undirected graph encodes dependencies within a single training example:
$$p(x \mid \theta) = \prod_{\ell=1}^L \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_f^{(\ell)} \mid \theta_f), \qquad x = \{x^{(1)}, \ldots, x^{(L)}\}$$
Given $L$ independent, identically distributed, completely observed samples:
$$\log p(\theta \mid x) = C + \sum_{\ell=1}^L \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_f^{(\ell)}) - L \Phi(\theta) - \frac{\lambda}{2} \theta^T \theta$$
Take the gradient with respect to the parameters of a single factor:
$$\nabla_{\theta_f} \log p(\theta \mid x) = \sum_{\ell=1}^L \phi_f(x_f^{(\ell)}) - L \mathbb{E}_\theta[\phi_f(x_f)] - \lambda \theta_f$$

12 MAP for Bernoulli with Gaussian Prior
$$p(\theta) = \mathcal{N}(\theta \mid 0, \lambda^{-1}), \qquad p(x \mid \theta) = \mathrm{Ber}(x \mid \sigma(\theta))$$
Goal: maximize the log-posterior distribution
$$\mu = \sigma(\theta) = \frac{1}{1 + e^{-\theta}}, \qquad \mathcal{L}(\theta) = s\theta - L\Phi(\theta) - \frac{\lambda}{2}\theta^2, \qquad s = \sum_{\ell=1}^L x^{(\ell)}$$
Gradient:
$$\frac{d\mathcal{L}(\theta)}{d\theta} = s - L\sigma(\theta) - \lambda\theta$$
Gradient ascent update (with step size $\alpha$):
$$\theta^{(k+1)} = \theta^{(k)} + \alpha \nabla \mathcal{L}(\theta^{(k)}) = \theta^{(k)} - \alpha\lambda\theta^{(k)} + \alpha\big(s - L\sigma(\theta^{(k)})\big)$$
A runnable version follows below.
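Here is a direct sketch of this update (the toy data, step size, and prior strength are my choices, not from the lecture):

```python
import numpy as np

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

x = np.array([0, 0, 0, 1, 0])      # L = 5 tosses, one head
L, s = len(x), x.sum()
lam, alpha, theta = 1.0, 0.1, 0.0

for _ in range(200):
    grad = s - L * sigmoid(theta) - lam * theta   # dL/dtheta
    theta = theta + alpha * grad                  # gradient ascent step

print(sigmoid(theta))   # MAP estimate of mu
```

With one head in five tosses, the ML estimate would be $\mu = 0.2$; the Gaussian prior on $\theta$ pulls the MAP estimate toward $\mu = 0.5$ (here to roughly $\mu \approx 0.33$).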

13 What about Incomplete Data?
Known: fixed graph structure and exponential family features:
$$p(x \mid \theta) = \exp\Big\{\sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_f) - \Phi(\theta)\Big\}$$
$x_i^{(\ell)}$ = variable at node $i$ for observation $\ell$.
Unknown: numeric values of the parameters.
Types of data used for parameter learning:
(Figure: two data matrices over $N$ node variables and $L$ observations; complete data has every $x_i^{(\ell)}$ observed, incomplete (partial) data has some entries missing.)

14 Reminder: EM for ML Learning
$$\ln p(x \mid \theta) = \ln \sum_z p(x, z \mid \theta) \ge \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) \triangleq \mathcal{L}(q, \theta)$$
Initialization: Randomly select starting parameters $\theta^{(0)}$.
E-Step: Given parameters, infer the posterior of the hidden data: $q^{(t)} = \arg\max_q \mathcal{L}(q, \theta^{(t-1)}) = p(z \mid x, \theta^{(t-1)})$.
M-Step: Given posterior distributions, learn parameters that fit the data: $\theta^{(t)} = \arg\max_\theta \mathcal{L}(q^{(t)}, \theta) = \arg\max_\theta \sum_z q^{(t)}(z) \ln p(x, z \mid \theta)$.
Iteration: Alternate E-step & M-step until convergence.

15 EM for MAP Learning
Up to a constant independent of $\theta$,
$$\ln p(\theta \mid x) = \ln p(\theta) + \ln p(x \mid \theta) = \ln p(\theta) + \ln \sum_z p(x, z \mid \theta)$$
$$\ge \ln p(\theta) + \sum_z q(z) \ln p(x, z \mid \theta) - \sum_z q(z) \ln q(z) \triangleq \mathcal{L}(q, \theta)$$
- Initialization: Randomly select starting parameters $\theta^{(0)}$.
- E-Step: Given parameters, infer the posterior of the hidden data: $q^{(t)}(z) = p(z \mid x, \theta^{(t-1)})$ (posterior distribution, exactly as in EM for ML).
- M-Step: Given posterior distributions, learn parameters that fit the data: $\theta^{(t)} = \arg\max_\theta \log p(\theta) + \sum_z q^{(t)}(z) \ln p(x, z \mid \theta)$ (weighted MAP learning; a sketch follows below).
- Iteration: Alternate E-step & M-step until convergence.
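As a hedged sketch of this recipe on a toy model of my choosing (a 50/50 mixture of two biased coins, each flipped n = 10 times per observation, with Beta(2, 2) priors on the biases; none of these specifics are from the lecture), note how the prior changes only the M-step:

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n, L = 10, 200
z_true = rng.integers(0, 2, size=L)                     # hidden coin choices
heads = rng.binomial(n, np.where(z_true == 0, 0.8, 0.3))

a = b = 2.0                                             # Beta(a, b) prior on each bias
mu = np.array([0.6, 0.4])                               # starting parameters

for _ in range(50):
    # E-step: q(z) = p(z | x, theta), with mixing weights fixed at 50/50
    lik = np.stack([binom.pmf(heads, n, m) for m in mu])   # shape (2, L)
    resp = lik / lik.sum(axis=0)
    # M-step (MAP): weighted counts plus Beta pseudocounts
    mu = (resp @ heads + a - 1) / (resp.sum(axis=1) * n + a + b - 2)

print(mu)   # close to (0.8, 0.3), up to label swapping
```

Setting a = b = 1 recovers the pure ML M-step; larger pseudocounts pull both biases toward 0.5.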

16 CS242: Lecture 4A Outline
- MAP Estimation and Gaussian Priors
- Graph Structure Learning via Sparse Regression

17 Learning Graphical Model Structures
Over-Fitting: Maximum likelihood always prefers fully-connected graphs.
- Strategy 1: Place a hard limit on the graph's structural complexity. Need to balance expressiveness with learning & inference tractability. Key example: optimize over all (pairwise) tree-structured distributions.
- Strategy 2: Define a penalized likelihood which encourages simpler graphs. Interpretable as assigning a prior on models, and finding the posterior mode. Classic approach: search over graph structures. Modern approach: optimization with penalties that encourage sparsity.
- Strategy 3: Bayesian model selection via marginal likelihoods of data. Better in principle than simple penalties, but often intractable. We revisit this later in the course, once we've developed more sophisticated algorithms for approximate learning and inference.

18 Factor Graphs & Exponential Families
$$p(x) = \frac{1}{Z(\theta)} \prod_{f \in \mathcal{F}} \psi_f(x_f \mid \theta_f)$$
Graph structure selection is feature selection! We want to determine which factors (features) should be used, and which should be discarded (assigned zero weight).
A factor graph is created from non-negative potentials:
$$\psi_f(x_f \mid \theta_f) = \exp\{\theta_f^T \phi_f(x_f)\}, \qquad \phi_f(x_f) \in \mathbb{R}^{d_f}$$
By setting exponential family parameters to zero, we remove factors and simplify the graphical model: if $\theta_f = 0$ then $\psi_f(x_f \mid \theta_f = 0) = 1$ for all $x_f$.

19 Laplace Distribution
$$\mathrm{Lap}(\theta \mid 0, \lambda) = \frac{\lambda}{2} \exp(-\lambda |\theta|)$$
(Figure: probability densities and log probability densities of the Gaussian and Laplace distributions.)
When used as a zero-mean prior on vectors of model parameters:
- Compared to a Gaussian, a stronger bias that many parameters are near zero.
- When we find the MAP estimate, some weights are exactly zero.
- Learning is harder than for a Gaussian prior, but still convex.

20 Constrained Optimization
Laplacian prior / $L_1$ regularization / lasso regression:
$$p(w) = \prod_{j=1}^D \mathrm{Lap}(w_j \mid 0, \lambda), \qquad f(w) = \|y - Xw\|_2^2 \text{ subject to } \|w\|_1 \le t$$
Gaussian prior / $L_2$ regularization / ridge regression:
$$p(w) = \prod_{j=1}^D \mathrm{Norm}(w_j \mid 0, \sigma^2), \qquad f(w) = \|y - Xw\|_2^2 \text{ subject to } \|w\|_2^2 \le t$$
Where do level sets of the quadratic regression cost function first intersect the constraint set?
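A small synthetic sketch of this contrast (the data and regularization strengths are arbitrary choices), using scikit-learn's off-the-shelf solvers:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only 2 relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)

print(Ridge(alpha=10.0).fit(X, y).coef_.round(2))   # all entries shrunk, none zero
print(Lasso(alpha=0.1).fit(X, y).coef_.round(2))    # irrelevant entries exactly 0.0
```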

21 Gradient-Based Optimization
Laplacian prior / $L_1$ regularization / lasso regression: $p(\theta) = \prod_{j=1}^D \mathrm{Lap}(\theta_j \mid 0, \lambda)$
Gaussian prior / $L_2$ regularization / ridge regression: $p(\theta) = \prod_{j=1}^D \mathrm{Norm}(\theta_j \mid 0, \sigma^2)$
Objective function: $\tilde{f}(\theta) = f(\theta) - \log p(\theta)$. (Figure: negative gradient of each penalty as a function of $\theta_j$.)
(Informal intuition: the gradient of the $L_1$ objective is not defined at zero; see the sketch below.)
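In practice the non-differentiability at zero is handled by replacing the gradient step on the penalty with its proximal operator, soft-thresholding, which is what drives weights exactly to zero. A minimal sketch (not from the slides):

```python
import numpy as np

def soft_threshold(w, t):
    """Prox of t * ||.||_1: shrink toward zero; entries within t become exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

print(soft_threshold(np.array([3.0, -0.2, 0.5]), 0.4))   # [ 2.6 -0.   0.1]
```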

22 Generalized Norms: Bridge Regression
$$\min_\theta \; -\log p(x \mid \theta) + \lambda \sum_j |\theta_j|^b, \qquad \mathrm{ExpPower}(\theta \mid \mu, a, b) = \frac{b}{2a\,\Gamma(1/b)} \exp\Big\{-\Big(\frac{|\theta - \mu|}{a}\Big)^b\Big\}$$
(Figure: exponential power densities for b = 2, b = 1, b = 0.3.)
- Convex objective function (true norm): $b \ge 1$
- Encourages sparse solutions (cusp at zero): $b \le 1$
- Lasso/Laplacian (convex & sparsifying): $b = 1$
- Ridge/Gaussian (classical, closed-form solutions): $b = 2$
- Sparsity via discrete feature count (greedy search): $b \to 0$

23 Bayesian Regression: 0 Observations
$$p(y \mid x, w) = \mathcal{N}(y \mid w_0 + w_1 x, \sigma^2)$$
(Figure: prior over $(w_0, w_1)$, posterior, and corresponding data-space samples before any data is observed.)

24 Bayesian Regression: 1 Observation
$$p(y \mid x, w) = \mathcal{N}(y \mid w_0 + w_1 x, \sigma^2)$$
(Figure: likelihood of the single observation, updated posterior over $(w_0, w_1)$, and data-space samples.)

25 Regression Posteriors with Sparse Priors
$$\min_\theta \; -\log p(x \mid \theta) + \lambda \sum_j |\theta_j|^b, \qquad \mathrm{ExpPower}(\theta \mid \mu, a, b) = \frac{b}{2a\,\Gamma(1/b)} \exp\Big\{-\Big(\frac{|\theta - \mu|}{a}\Big)^b\Big\}$$
(Figure: priors and corresponding regression posteriors for b = 2, b = 1, b = 0.4.)

26 Regularization Paths
Prostate cancer dataset with N = 67 and D = 8 features: lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45.
(Figure: two panels of weight profiles, one for ridge with a bound on the $L_2$ norm, one for lasso with a bound on the $L_1$ norm.)
- The horizontal axis increases the bound on the weights (less regularization).
- For each bound, we plot the values of the estimated feature weights.
- Vertical lines are models chosen by cross-validation. (A path-tracing sketch follows below.)
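Paths like these can be traced with standard software. A sketch using scikit-learn's `lasso_path` on synthetic data of the same shape (the prostate data itself is not reproduced here):

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.normal(size=(67, 8))
y = X @ np.array([3.0, 1.0, 0, 0, 0.5, 0, 0, 0]) + 0.5 * rng.normal(size=67)

alphas, coefs, _ = lasso_path(X, y)
# coefs has shape (8, len(alphas)): one weight trajectory per feature, from
# heavy regularization (all weights exactly zero) toward the least-squares fit.
print(coefs.shape)
print((coefs[:, 0] == 0).sum())    # 8: everything pruned at the largest alpha
print((coefs[:, -1] == 0).sum())   # few or none pruned at the smallest alpha
```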

27 Optimization: Projected Gradient
- A generic method based on gradient & projection operators (sketch below).
- Projection onto the non-negativity constraint is easy: $\theta_k = \max(w_k, 0)$, where $w$ is the unconstrained gradient update.
- Guaranteed to converge to the global minimum of any convex function on a convex set.
- Extensions modify descent directions for faster convergence.
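A minimal sketch of the method on a toy problem (the quadratic objective, step size, and iteration count are my choices): minimize $f(\theta) = \frac{1}{2}\theta^T A \theta - b^T \theta$ subject to $\theta \ge 0$.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 4.0])
grad = lambda th: A @ th - b

theta = np.zeros(2)
for _ in range(500):
    theta = theta - 0.1 * grad(theta)     # unconstrained gradient step
    theta = np.maximum(theta, 0.0)        # project back onto {theta_k >= 0}

print(theta)   # approx [0, 2]: the unconstrained optimum has theta_0 < 0,
               # so the constraint pins theta_0 to the boundary
```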

28 Sparse Learning for Undirected Models
$$\log p(x \mid \theta) = \sum_{\ell=1}^L \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_f^{(\ell)}) - L \Phi(\theta)$$
$$\log p(\theta \mid x) = C + \Big[\sum_{\ell=1}^L \sum_{f \in \mathcal{F}} \theta_f^T \phi_f(x_f^{(\ell)}) - L \Phi(\theta)\Big] - \lambda \|\theta\|_1$$
Standard software packages for $L_1$-regularized learning assume we can evaluate the objective function and its gradient in closed form. This is possible assuming inference is tractable in the model with all features.
Pseudo-likelihood estimators & variational estimators approximate the true likelihood, but can scale to features where exact inference is intractable.
Can replace $L_1$ with fancier penalties that encourage blocks of parameters to simultaneously be set to zero.
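Putting the pieces together, here is a hedged end-to-end sketch: proximal gradient ascent on the $L_1$-penalized average log-likelihood of the two-binary-variable model from slide 8, where exact inference is just a sum over four states. All settings are illustrative choices, not from the lecture.

```python
import numpy as np
from itertools import product

states = np.array(list(product([0, 1], repeat=2)))                  # 4 joint states
phi = np.column_stack([states[:, 0], states[:, 1], states[:, 0] * states[:, 1]])

def model_moments(theta):
    """E_theta[phi(x)] by exact enumeration of the 4 states."""
    logp = phi @ theta
    p = np.exp(logp - np.log(np.exp(logp).sum()))                   # subtract Phi(theta)
    return p @ phi

rng = np.random.default_rng(0)
x1 = rng.binomial(1, 0.8, size=500)                                 # independent bits,
x2 = rng.binomial(1, 0.3, size=500)                                 # so true theta_12 = 0
emp = np.array([x1.mean(), x2.mean(), (x1 * x2).mean()])            # empirical moments

lam, alpha, theta = 0.05, 0.5, np.zeros(3)
for _ in range(300):
    w = theta + alpha * (emp - model_moments(theta))                # gradient ascent step
    theta = np.sign(w) * np.maximum(np.abs(w) - alpha * lam, 0.0)   # prox of lam*||.||_1

print(theta)   # typically theta_1 > 0, theta_2 < 0, and theta_12 exactly 0.0
```

Since the bits are sampled independently, the penalty should prune the pairwise factor: $\theta_{12}$ is soft-thresholded to exactly zero, removing that edge from the learned graph, while $\theta_1$ and $\theta_2$ survive (slightly shrunk).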

29 Example: Word Usage in Newsgroups
$L_1$ regularization ($\lambda = 256$; isolated nodes are not plotted).
(Figure: learned dependency graph over words, linking related terms such as bible/christian/jesus/god, disk/drive/memory/scsi, nasa/shuttle/launch/space, and baseball/hockey/games/season.)
Schmidt 2010 PhD thesis: presence of 100 words across 16,242 postings to 20 newsgroups.

30 Example: Word Usage in Newsgroups
$L_1$ regularization ($\lambda = 512$; isolated nodes are not plotted).
(Figure: a sparser learned graph; fewer word dependencies survive the stronger penalty.)
Schmidt 2010 PhD thesis: presence of 100 words across 16,242 postings to 20 newsgroups.
