Lecture 2: Parameter Learning in Fully Observed Graphical Models. Sam Roweis. Monday July 24, 2006 Machine Learning Summer School, Taiwan


2 Likelihood Functions So far we have focused on the (log) probability function p(x | θ), which assigns a probability (density) to any joint configuration of variables x given fixed parameters θ. But in learning we turn this on its head: we have some fixed data and we want to find parameters. Think of p(x | θ) as a function of θ for fixed x: L(θ; x) = p(x | θ), l(θ; x) = log p(x | θ). This function is called the (log) likelihood. Choose θ to maximize some objective function C(θ) which includes l(θ): C(θ) = l(θ; D) gives maximum likelihood (ML); C(θ) = l(θ; D) + log p(θ) gives maximum a posteriori (MAP) / penalized ML. (Other choices include cross-validation, Bayesian estimators, BIC, AIC, ...)

3 Maximum Likelihood For IID data: p(D | θ) = ∏_m p(x^(m) | θ), so l(θ; D) = Σ_m log p(x^(m) | θ). Idea of maximum likelihood estimation (MLE): pick the setting of parameters most likely to have generated the data we saw: θ_ML = argmax_θ l(θ; D). Commonly used as a baseline model in statistics. Often leads to intuitive, appealing, or natural estimators. For a start, the IID assumption makes the log likelihood into a sum, so its derivative can be easily taken term by term.

4 Example: Bernoulli Trials We observe M iid coin flips: D = H, H, T, H, ... Model: p(H) = θ, p(T) = 1 − θ. Likelihood: l(θ; D) = log p(D | θ) = log ∏_m θ^(x^(m)) (1 − θ)^(1 − x^(m)) = log θ Σ_m x^(m) + log(1 − θ) Σ_m (1 − x^(m)) = N_H log θ + N_T log(1 − θ). Take derivatives and set to zero: ∂l/∂θ = N_H/θ − N_T/(1 − θ) = 0, giving θ_ML = N_H / (N_H + N_T).
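As a concrete check of the closed-form result, here is a minimal numerical sketch (not from the original slides) that computes the Bernoulli MLE by counting heads and verifies it against a brute-force grid search over the log likelihood; the data and variable names are made up for illustration.

```python
import numpy as np

# Coin flips encoded as 1 = heads, 0 = tails.
flips = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

# Closed-form MLE: theta = N_H / (N_H + N_T).
n_heads = flips.sum()
n_tails = len(flips) - n_heads
theta_closed_form = n_heads / (n_heads + n_tails)

# Brute-force check: evaluate l(theta; D) on a grid and take the argmax.
thetas = np.linspace(0.001, 0.999, 999)
log_lik = n_heads * np.log(thetas) + n_tails * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]

print(theta_closed_form, theta_grid)  # both close to 0.7 for this toy data
```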

5 Example: Univariate Normal We observe M iid real samples: D = 1.18, −0.25, 0.78, ... Model: p(x) = (2πσ²)^(−1/2) exp{−(x − µ)² / 2σ²}. Likelihood (using the probability density): l(θ; D) = log p(D | θ) = −(M/2) log(2πσ²) − (1/2σ²) Σ_m (x^(m) − µ)². Take derivatives and set to zero: ∂l/∂µ = (1/σ²) Σ_m (x^(m) − µ); ∂l/∂σ² = −M/(2σ²) + (1/2σ⁴) Σ_m (x^(m) − µ)². Setting these to zero gives µ_ML = (1/M) Σ_m x^(m) and σ²_ML = (1/M) Σ_m (x^(m))² − µ²_ML.
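The same exercise for the univariate Gaussian, as a hedged sketch: compute µ and σ² from the closed-form ML formulas and compare against numpy's built-in (biased, 1/M) moment estimates. The data values here are invented.

```python
import numpy as np

x = np.array([1.18, -0.25, 0.78, 0.34, -1.02, 0.65])
M = len(x)

# ML estimates obtained by setting the gradients to zero.
mu_ml = x.sum() / M
sigma2_ml = (x ** 2).sum() / M - mu_ml ** 2   # equivalently ((x - mu_ml)**2).mean()

# numpy's mean and (biased) variance agree with the ML formulas.
assert np.isclose(mu_ml, x.mean())
assert np.isclose(sigma2_ml, x.var())         # note: ddof=0, not the unbiased 1/(M-1) version
print(mu_ml, sigma2_ml)
```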

6 Example: Linear Regression [figure: scatter plot of inputs X against outputs Y with a fitted regression line]

7 Example: Linear Regression At a linear regression node, some parents (covariates/inputs x) and all children (responses/outputs y) are continuous valued variables. For each child and each setting of the discrete parents we use the model: p(y | x, θ) = gauss(y | θᵀx, σ²). The log likelihood is the familiar squared error cost: l(θ; D) = −(1/2σ²) Σ_m (y^(m) − θᵀx^(m))². The ML parameters can be solved for using linear least-squares: ∂l/∂θ = Σ_m (y^(m) − θᵀx^(m)) x^(m) = 0, giving θ_ML = (XᵀX)^(−1) XᵀY. The sufficient statistics are the input correlation matrix and the input-output cross-correlation vector.
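A small sketch (synthetic setup, not from the slides) of solving the normal equations θ_ML = (XᵀX)^(−1) XᵀY; in practice one solves the linear system rather than forming the explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 200, 3
X = rng.normal(size=(M, d))
X = np.hstack([X, np.ones((M, 1))])          # append a constant column for the bias
true_theta = np.array([2.0, -1.0, 0.5, 0.3])
y = X @ true_theta + 0.1 * rng.normal(size=M)

# Sufficient statistics: input correlation matrix and input-output cross-correlation.
XtX = X.T @ X
Xty = X.T @ y

theta_ml = np.linalg.solve(XtX, Xty)          # safer than computing (X^T X)^{-1} explicitly
print(theta_ml)                               # close to true_theta
```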

8 MLE for Directed GMs For a directed GM, the likelihood function has a nice form: log p(D | θ) = Σ_m log ∏_i p(x_i^(m) | x_(π_i)^(m), θ_i) = Σ_m Σ_i log p(x_i^(m) | x_(π_i)^(m), θ_i). The parameters decouple, so we can maximize the likelihood independently for each node's function by setting θ_i. We only need the values of x_i and its parents in order to estimate θ_i. In general, for fully observed data, if we know how to estimate the parameters at a single node we can do it for the whole network. [figure: (a) a small DAG over X1, ..., X4; (b) the same network split into its per-node families]
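To illustrate the decoupling, here is a hedged sketch that fits the CPTs of a tiny, hypothetical fully observed DAG (X1 → X2, X1 → X3) purely by counting, one node at a time; both the structure and the data are invented for illustration.

```python
from collections import Counter

# Each row is a fully observed joint sample (x1, x2, x3); all variables are binary.
data = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1), (0, 0, 0), (1, 0, 0)]

# Node X1 has no parents: estimate p(x1) from marginal counts.
c1 = Counter(x1 for x1, _, _ in data)
p_x1 = {v: c1[v] / len(data) for v in (0, 1)}

# Each child's CPT is estimated from (parent, child) counts, independently of the other nodes.
def cpt(child_idx, parent_idx):
    joint = Counter((row[parent_idx], row[child_idx]) for row in data)
    parent = Counter(row[parent_idx] for row in data)
    return {(pa, ch): joint[(pa, ch)] / parent[pa] for pa in (0, 1) for ch in (0, 1)}

p_x2_given_x1 = cpt(1, 0)
p_x3_given_x1 = cpt(2, 0)
print(p_x1, p_x2_given_x1, p_x3_given_x1)
```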

9 Reminder: Classification Given examples of a discrete class label y and some features x. Goal: compute the label y for new inputs x. Two approaches: Generative: model p(x, y) = p(y)p(x | y); use Bayes rule to infer the conditional p(y | x). Discriminative: model discriminants f(y | x) directly and take the max. The generative approach is related to conditional density estimation, while the discriminative approach is closer to regression.

10 Probabilistic Classification: Bayes Classifiers Generative model: p(x, y) = p(y)p(x | y). p(y) are called class priors. p(x | y) are called class conditional feature distributions. For the prior we use a Bernoulli or multinomial: p(y = k | π) = π_k with Σ_k π_k = 1. Classification rules: ML: argmax_y p(x | y) (can behave badly with skewed priors); MAP: argmax_y p(y | x) = argmax_y [log p(x | y) + log p(y)] (safer). Fitting: maximize Σ_n log p(x^(n), y^(n)) = Σ_n [log p(x^(n) | y^(n)) + log p(y^(n))]. 1) Sort data into batches by class label. 2) Estimate p(y) by counting the size of the batches (plus regularization). 3) Estimate p(x | y) separately within each batch using ML (also with regularization).
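A compact sketch of that three-step fitting recipe for a generative classifier, here assuming Gaussian class-conditionals with a per-class diagonal covariance; the data shapes, smoothing constant, and variance floor are assumptions for illustration, not part of the slides.

```python
import numpy as np

def fit_generative(X, y, n_classes, alpha=1.0, var_floor=1e-6):
    """Step 1: sort by class; step 2: count class sizes; step 3: fit p(x|y) per class."""
    priors, means, variances = [], [], []
    for k in range(n_classes):
        Xk = X[y == k]
        priors.append((len(Xk) + alpha) / (len(X) + alpha * n_classes))  # smoothed prior
        means.append(Xk.mean(axis=0))
        variances.append(Xk.var(axis=0) + var_floor)                     # diagonal covariance
    return np.array(priors), np.array(means), np.array(variances)

def map_predict(x, priors, means, variances):
    # argmax_y [log p(x|y) + log p(y)] -- the MAP rule from the slide.
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum((x - means) ** 2 / variances, axis=1))
    return np.argmax(log_post)
```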

11 Three Key Regularization Ideas To avoid overfitting, we can put priors on the parameters of the class prior and class conditional feature distributions. We can also tie some parameters together so that fewer of them are estimated using more data. Finally, we can make factorization or independence assumptions about the distributions. In particular, for the class conditional distributions we can assume the features are fully dependent, partly dependent, or independent (!). [figure: panels (a), (b), (c) showing three class-conditional structures for features X1, ..., Xm given Y, ranging from fully dependent to fully independent]

12 Gaussian Class-Conditional Distributions If all features are continuous, a popular choice is a Gaussian class-conditional: p(x | y = k, θ) = |2πΣ|^(−1/2) exp{−(1/2)(x − µ_k)ᵀ Σ^(−1) (x − µ_k)}. Fitting: use the following amazing and useful fact. The maximum likelihood fit of a Gaussian to some data is the Gaussian whose mean is equal to the data mean and whose covariance is equal to the sample covariance. [Try to prove this as an exercise in understanding likelihood, algebra, and calculus all at once!] Seems easy. And it works amazingly well. But we can do even better with some simple regularization...

13 Regularized Gaussians Idea 1: assume all the class covariances are the same (tie parameters). This is exactly Fisher's linear discriminant analysis. [figure: panels (a) and (b) comparing class-conditional Gaussians with separate vs. tied covariances] Idea 2: make independence assumptions to get diagonal or identity-multiple covariances. (Or sparse inverse covariances.) More on this in a few minutes... Idea 3: add a bit of the identity matrix to each sample covariance. This fattens it up in directions where there wasn't enough data. Equivalent to using a Wishart prior on the covariance matrix.
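Idea 1 and Idea 3 together in a short sketch: pool the per-class scatter into one tied covariance and then shrink it toward a multiple of the identity; the shrinkage weight lam is a made-up hyperparameter, not something specified in the slides.

```python
import numpy as np

def tied_shrunk_covariance(X, y, n_classes, lam=0.1):
    """Shared (tied) covariance across classes, regularized toward a multiple of I."""
    d = X.shape[1]
    scatter = np.zeros((d, d))
    for k in range(n_classes):
        Xk = X[y == k]
        diff = Xk - Xk.mean(axis=0)          # center each class around its own mean
        scatter += diff.T @ diff             # pooled within-class scatter
    sigma = scatter / len(X)
    # Idea 3: fatten thin directions by mixing in a bit of the (scaled) identity.
    return (1 - lam) * sigma + lam * np.trace(sigma) / d * np.eye(d)
```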

14 Gaussian Bayes Classifier Maximum likelihood estimates for the parameters: priors π_k: use the observed frequencies of the classes (plus smoothing); means µ_k: use the class means; covariance Σ: use data from a single class, or pooled data (x^(m) − µ_{y^(m)}), to estimate full/diagonal covariances. Compute the posterior via Bayes rule: p(y = k | x, θ) = p(x | y = k, θ) p(y = k | π) / Σ_j p(x | y = j, θ) p(y = j | π) = exp{µ_kᵀ Σ^(−1) x − µ_kᵀ Σ^(−1) µ_k / 2 + log π_k} / Σ_j exp{µ_jᵀ Σ^(−1) x − µ_jᵀ Σ^(−1) µ_j / 2 + log π_j} = e^(β_kᵀ x) / Σ_j e^(β_jᵀ x) = exp{β_kᵀ x} / Z, where β_k = [Σ^(−1) µ_k ; −µ_kᵀ Σ^(−1) µ_k / 2 + log π_k] and we have augmented x with a constant component always equal to 1 (a bias term).
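A sketch of the log-linear form above: build β_k from Σ^(−1)µ_k plus the bias term, then classify with a softmax over β_kᵀ[x; 1]. The fitted means, shared covariance, and priors are assumed to come from a procedure like the earlier fitting sketches.

```python
import numpy as np

def gaussian_bayes_betas(means, cov, priors):
    """means: (K, d), cov: (d, d) shared, priors: (K,). Returns beta of shape (K, d+1)."""
    prec = np.linalg.inv(cov)
    w = means @ prec                                        # row k is (Sigma^{-1} mu_k)^T
    b = -0.5 * np.sum(w * means, axis=1) + np.log(priors)   # -mu_k^T Sigma^{-1} mu_k / 2 + log pi_k
    return np.hstack([w, b[:, None]])

def posterior(x, beta):
    scores = beta @ np.append(x, 1.0)                       # beta_k^T [x; 1]
    scores -= scores.max()                                  # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```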

15 Softmax/Logit The squashing functions are known as the softmax and the logit (sigmoid): φ_k(z) = e^(z_k) / Σ_j e^(z_j); g(η) = 1 / (1 + e^(−η)). They are invertible (up to a constant): z_k = log φ_k + c; η = log(g / (1 − g)). Derivatives are easy: ∂φ_k/∂z_j = φ_k(δ_kj − φ_j); dg/dη = g(1 − g). [figure: plot of the logistic squashing function φ(z) versus z]
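The two squashing functions and their derivative identities as a quick sketch with finite-difference checks; nothing here goes beyond the formulas on the slide.

```python
import numpy as np

def logistic(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

# dg/d eta = g (1 - g): check by central finite differences.
eta, h = 0.3, 1e-6
g = logistic(eta)
assert np.isclose((logistic(eta + h) - logistic(eta - h)) / (2 * h), g * (1 - g))

# d phi_k / d z_j = phi_k (delta_kj - phi_j): full Jacobian check.
z = np.array([0.2, -0.5, 1.1])
phi = softmax(z)
jac = np.diag(phi) - np.outer(phi, phi)
num = np.array([(softmax(z + h * np.eye(3)[j]) - softmax(z - h * np.eye(3)[j])) / (2 * h)
                for j in range(3)]).T
assert np.allclose(jac, num, atol=1e-5)
```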

16 Log Linear Geometry Taking the ratio of any two posteriors (the odds) shows that the contours of equal pairwise probability are linear surfaces in the feature space: p(y = k | x, θ) / p(y = j | x, θ) = exp{(β_k − β_j)ᵀ x}. The pairwise discrimination contours p(y_k | x) = p(y_j | x) are orthogonal to the differences of the means in feature space when Σ = σI. For a general Σ shared between all classes, the same is true in the transformed feature space w = Σ^(−1) x. The priors do not change the geometry; they only shift the operating point on the logit by the log-odds log(π_k / π_j). Thus, for equal class-covariances, we obtain a linear classifier. If we use different covariances, the decision surfaces are conic sections and we have a quadratic classifier.

17 Discrete Bayesian Classifier If the inputs are discrete (categorical), what should we do? The simplest class conditional model is a joint multinomial (table): p(x_1 = a, x_2 = b, ... | y = c) = η^c_ab... This is conceptually correct, but there's a big practical problem. Fitting: the ML parameters are observed counts: η^c_ab... = Σ_n [y^(n) = c][x_1^(n) = a][x_2^(n) = b]... / Σ_n [y^(n) = c]. Consider 16×16 digit images at 256 gray levels. How many entries in the table? (256^256 per class.) How many will be zero? What happens at test time? Doh! We obviously need some regularization. Smoothing will not help much here. Unless we know about the relationships between inputs beforehand, sharing parameters is hard also. But what about independence?

18 Naive (Idiot's) Bayes Classifier Assumption: conditioned on the class, the attributes are independent: p(x | y) = ∏_i p(x_i | y). [figure: class node Y with independent children X1, X2, ..., Xm] Sounds crazy, right? Right! But it works. Algorithm: sort data cases into bins according to y^(n). Compute the marginal probabilities p(y = c) using frequencies. For each class, estimate the distribution of the i-th variable: p(x_i | y = c). At test time, compute argmax_c p(c | x) using c(x) = argmax_c p(c | x) = argmax_c [log p(x | c) + log p(c)] = argmax_c [log p(c) + Σ_i log p(x_i | c)].

19 Discrete (Multinomial) Naive Bayes Discrete features x_i, assumed independent given the class label y. Model: p(x_i = j | y = k) = η_ijk, so p(x | y = k, η) = ∏_i ∏_j η_ijk^[x_i = j]. Classification rule: p(y = k | x, η) = π_k ∏_i ∏_j η_ijk^[x_i = j] / Σ_q π_q ∏_i ∏_j η_ijq^[x_i = j] = e^(β_kᵀ x̃) / Σ_q e^(β_qᵀ x̃). The ML parameters are class-conditional frequency counts: η*_ijk = Σ_m [x_i^(m) = j][y^(m) = k] / Σ_m [y^(m) = k], with β_k = [log η_11k; ...; log η_1jk; ...; log η_ijk; ...; log π_k] and x̃ = [[x_1 = 1]; [x_1 = 2]; ...; [x_i = j]; ...; 1] (one-hot indicators plus a bias). Log-linear!
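A minimal sketch of the counting estimator η_ijk and the resulting classification rule; the data layout, number of feature values, and the Laplace smoothing constant are assumptions for illustration.

```python
import numpy as np

def fit_discrete_naive_bayes(X, y, n_values, n_classes, alpha=1.0):
    """X: (M, D) integer feature matrix, y: (M,) labels. Returns log pi and log eta."""
    M, D = X.shape
    log_pi = np.log(np.bincount(y, minlength=n_classes) + alpha) - np.log(M + alpha * n_classes)
    log_eta = np.zeros((D, n_values, n_classes))
    for k in range(n_classes):
        Xk = X[y == k]
        for i in range(D):
            counts = np.bincount(Xk[:, i], minlength=n_values) + alpha   # smoothed eta*_ijk counts
            log_eta[i, :, k] = np.log(counts) - np.log(counts.sum())
    return log_pi, log_eta

def predict(x, log_pi, log_eta):
    # argmax_k [log pi_k + sum_i log eta_{i, x_i, k}] -- the log-linear rule.
    scores = log_pi + np.array([log_eta[np.arange(len(x)), x, k].sum()
                                for k in range(len(log_pi))])
    return np.argmax(scores)
```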

20 Gaussian Naive Bayes This is just a Gaussian Bayes classifier with a separate diagonal covariance matrix for each class. Equivalent to fitting a one-dimensional Gaussian to each input dimension for each possible class. Decision surfaces are quadratics, not linear...

21 Discriminative Models Parametrize p(y | x) directly; forget p(x, y) and Bayes rule. As long as p(y | x) or the discriminants f(y | x) are linear functions of x (or monotone transforms thereof), the decision surfaces will be piecewise linear. We don't need to model the density of the features: some density models have lots of parameters, and many densities give the same linear classifier. But we cannot generate new labeled data. We optimize a cost function closer to the one we use at test time.

22 Logistic/Softmax Regression Model: y is a multinomial random variable whose posterior is the softmax of linear functions of any feature vector x: p(y = k | x, θ) = e^(θ_kᵀ x) / Σ_j e^(θ_jᵀ x). Fitting: now we optimize the conditional likelihood: l(θ; D) = Σ_{m,k} [y^(m) = k] log p(y = k | x^(m), θ) = Σ_{m,k} y_k^(m) log p_k^(m). By the chain rule, ∂l/∂θ_k = Σ_{m,j} (∂l_j^(m)/∂p_j^(m)) (∂p_j^(m)/∂z_k^(m)) (∂z_k^(m)/∂θ_k) = Σ_{m,j} (y_j^(m)/p_j^(m)) p_j^(m) (δ_jk − p_k^(m)) x^(m) = Σ_m (y_k^(m) − p_k^(m)) x^(m).
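The gradient Σ_m (y_k^(m) − p_k^(m)) x^(m) translates directly into a few lines of batch gradient ascent; this is a sketch with invented data shapes and learning rate, not an optimized implementation.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, Y, lr=0.1, n_steps=500):
    """X: (M, d) features, Y: (M, K) one-hot labels. Maximizes the conditional likelihood."""
    M, d = X.shape
    K = Y.shape[1]
    theta = np.zeros((K, d))
    for _ in range(n_steps):
        P = softmax_rows(X @ theta.T)        # P[m, k] = p(y = k | x^(m), theta)
        grad = (Y - P).T @ X                 # grad[k] = sum_m (y_k^(m) - p_k^(m)) x^(m)
        theta += lr * grad / M               # gradient *ascent* on the log likelihood
    return theta
```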

23 Chains: Markov Models If variables have some temporal/spatial order, we can model their joint distribution as a dynamical/diffusion system. Simple idea: the next output depends only on the k previous outputs: y_t = f[y_{t−1}, y_{t−2}, ..., y_{t−k}]; k is called the order of the Markov model. [figure: a chain y_1, y_2, ..., y_T in which each node's parents are its k predecessors] Add noise to make the system probabilistic: p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}).

24 Learning Markov Models The ML parameter estimates for a simple Markov model are easy: p(y_1, y_2, ..., y_T) = p(y_1, ..., y_k) ∏_{t=k+1}^T p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}), so log p({y}) = log p(y_1, ..., y_k) + Σ_{t=k+1}^T log p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Each window of k + 1 outputs is a training case for the model p(y_t | y_{t−1}, y_{t−2}, ..., y_{t−k}). Example: for discrete outputs (symbols) and a 2nd-order Markov model we can use the multinomial model: p(y_t = m | y_{t−1} = a, y_{t−2} = b) = α_mab. The maximum likelihood values for α are: α*_mab = num[t s.t. y_t = m, y_{t−1} = a, y_{t−2} = b] / num[t s.t. y_{t−1} = a, y_{t−2} = b].
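A hedged sketch of the counting estimate α_mab for a 2nd-order chain over a small symbol alphabet; the sequence is made up and no smoothing is applied.

```python
import numpy as np
from collections import Counter

seq = list("abbabababbbaabab")          # toy symbol sequence
symbols = sorted(set(seq))
idx = {s: i for i, s in enumerate(symbols)}
S = len(symbols)

# Count windows (y_{t-2}, y_{t-1}, y_t) and their contexts (y_{t-2}, y_{t-1}).
triples = Counter((seq[t - 2], seq[t - 1], seq[t]) for t in range(2, len(seq)))
pairs = Counter((seq[t - 2], seq[t - 1]) for t in range(2, len(seq)))

# alpha[m, a, b] = p(y_t = m | y_{t-1} = a, y_{t-2} = b); zero where a context never occurs.
alpha = np.zeros((S, S, S))
for (b, a, m), c in triples.items():
    alpha[idx[m], idx[a], idx[b]] = c / pairs[(b, a)]
print(alpha[:, idx['a'], idx['b']])     # next-symbol distribution given y_{t-1}='a', y_{t-2}='b'
```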

25 Maximum Entropy Markov Models We can extend this idea to a "logistic regression through time" type of conditional model called a maximum entropy Markov model. [figure: a chain of states s_1, ..., s_T, each conditioned on the previous state and on features f_t of the input sequence] The joint distribution is now a conditional model: p(s_1^T | x_1^T) = ∏_t p(s_t | s_{t−1}, f_t(x_1^T)). The features f_t can be very nonlocal functions of the underlying input sequence; for example, they can consult things in the past and in the future.

26 Directed Tree Graphical Models Directed trees are DAGMs in which each variable x_i has exactly one other variable as its parent x_{π_i}, except the root x_root, which has no parents. Thus, the probability of a variable taking on a certain value depends only on the value of its parent: p(x) = p(x_root) ∏_{i ≠ root} p(x_i | x_{π_i}). Trees are the next step up from assuming independence: instead of considering variables in isolation, we consider them in pairs. NB: each node (except the root) has exactly one parent, but nodes may have more than one child. [figure: a small directed tree over nodes x_{1,n}, ..., x_{6,n}]

27 Likelihood function Notation: y_i denotes the pair of a node x_i and its single parent x_{π_i}; V_i is the set of joint configurations of node i and its parent π_i (for the root, y_root = x_root and V_root = v_root). Directed model likelihood: l(θ; D) = Σ_n log p(x^(n)) = Σ_n log p_r(x_r^(n)) + Σ_n Σ_{i ≠ r} log p(x_i^(n) | x_{π_i}^(n)) = Σ_i Σ_{v ∈ V_i} Σ_n [y_i^(n) = v] log p_i(v)  (indicator trick) = Σ_i Σ_{v ∈ V_i} N_i(v) log p_i(v), where N_i(v) = Σ_n [y_i^(n) = v] and p_i(v_i) = p(x_i | x_{π_i}). Trees are in the exponential family with the y_i as sufficient statistics.

28 Maximum Likelihood Parameters Given Structure Trees are just a special case of fully observed graphical models. For discrete data x_i with values v_i, each node stores a conditional probability table (CPT) over its values given its parent's value. The ML parameter estimates are just the empirical histograms of each node's values given its parent: p*(x_i = v_i | x_{π_i} = v_j) = N(x_i = v_i, x_{π_i} = v_j) / Σ_{v_i} N(x_i = v_i, x_{π_i} = v_j) = N_i(y_i) / N_{π_i}(v_j), except for the root, which uses the marginal counts N_r(v_r)/N. For continuous data, the most common model is a two-dimensional Gaussian at each node. The ML parameters just set the mean of p_i(y_i) to the sample mean of [x_i; x_{π_i}] and the covariance matrix to the sample covariance. In practice we should use some kind of smoothing/regularization.
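For the continuous case mentioned above, a minimal sketch of fitting the per-edge two-dimensional Gaussian: stack the child and parent values and take the sample mean and covariance (the ML fit, per the earlier Gaussian fact). The data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
parent = rng.normal(size=500)
child = 0.8 * parent + 0.3 * rng.normal(size=500)   # child correlated with its parent

# Per-edge sufficient statistics: mean and covariance of [x_i; x_parent].
pair = np.stack([child, parent], axis=1)            # shape (N, 2)
mean_ml = pair.mean(axis=0)
cov_ml = (pair - mean_ml).T @ (pair - mean_ml) / len(pair)   # ML (1/N) covariance

print(mean_ml)   # near [0, 0]
print(cov_ml)    # the off-diagonal term captures the child-parent dependence
```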

29 [figure: an example tree-structured model over word-occurrence variables from newsgroup data, with words such as "nasa", "space", "hockey", "windows", "car", "god", "israel", ... grouped along the branches]

30 Unobserved Variables We have been assuming that we observe all the random variables in our model at training time, and all the inputs at test time. But certain variables Q in our models may be unobserved, either some of the time or always, and either at training time or at test time. [figure: latent nodes Q1, ..., Q6 with observed children X1, ..., X6] (Graphically, we will use shading to indicate observation.)

31 Partially Unobserved (Missing) Variables If variables are occasionally unobserved they are missing data, e.g. undefined inputs, missing class labels, erroneous target values. In this case, we can still model the joint distribution, but we define a new cost function in which we sum out or marginalize the missing values at training or test time: l(θ; D) = Σ_complete log p(x^c, y^c | θ) + Σ_missing log p(x^m | θ) = Σ_complete log p(x^c, y^c | θ) + Σ_missing log Σ_y p(x^m, y | θ). [Recall that p(x) = Σ_q p(x, q).]
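A small sketch of the marginalized term log Σ_y p(x, y | θ) for a case with a missing class label, computed with a log-sum-exp for numerical stability; the model pieces (log priors and per-class log densities) are assumed, here a toy 1-D Gaussian class-conditional model with made-up parameters.

```python
import numpy as np

def marginal_loglik(x, log_priors, class_loglik):
    """log sum_y p(x, y | theta), where class_loglik(x, k) = log p(x | y = k, theta)."""
    terms = np.array([log_priors[k] + class_loglik(x, k) for k in range(len(log_priors))])
    m = terms.max()
    return m + np.log(np.exp(terms - m).sum())   # stable log-sum-exp

# Toy 1-D Gaussian class-conditional model (means, variances, priors are invented).
means, variances, log_priors = np.array([-1.0, 2.0]), np.array([1.0, 0.5]), np.log([0.4, 0.6])

def class_loglik(x, k):
    return -0.5 * np.log(2 * np.pi * variances[k]) - 0.5 * (x - means[k]) ** 2 / variances[k]

print(marginal_loglik(0.3, log_priors, class_loglik))
```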
