Probabilistic learning


Charles Elkan

November 8, 2012

Important: These lecture notes are based closely on notes written by Lawrence Saul. Text may be copied directly from his notes, or paraphrased. Also, these typeset notes lack illustrations. See the classroom lectures for figures and diagrams.

1 Learning in a Bayesian network

A Bayesian network is a directed graph with a CPT (conditional probability table) for each node. This section explains how to learn the CPTs from training data. As explained before, the training data are a matrix where each row is an instance and each column is a feature. Instances are also called examples, while features are also called nodes, random variables, and attributes. One entry in the matrix is one value of one feature, that is, one outcome of one random variable.

We consider first the scenario where each instance is complete, that is, the outcome of every node is observed for every instance. In this scenario, nothing is unknown, or in other words, there is no missing data. This scenario is also called fully visible, or no hidden nodes, or no latent variables. We also assume that the graph of the Bayesian network is known, that is, the nodes X_1 to X_n constitute a finite set, and that each node is a random variable with a discrete finite set of alternative values.

In this scenario, what we need to learn is the CPT of each node. A single entry in one CPT is

p(X_i = x | pa(X_i) = π)

where x is a specific outcome of X_i and π is a specific set of outcomes, also called a configuration, of the parent nodes of X_i.

The training data are T instances x̄_t, each of which is a complete configuration of X_1 to X_n. We write x̄_t = (x_{t1}, ..., x_{tn}). Remember the convention that the first subscript refers to rows while the second subscript refers to columns.

To make learning feasible, we need a basic assumption about the training data.

Assumption. Each example is an independent and identically distributed (IID) random sample from the joint distribution defined by the Bayesian network.

This assumption has two parts. First, each x̄_t being identically distributed means that each sample is generated using the same CPTs. Second, being independent means that probabilities can be multiplied: p(x̄_s, x̄_t) = p(x̄_s) p(x̄_t).

With the IID assumption, we are ready to begin to derive a learning procedure. The probability of the training data is

P = ∏_{t=1}^{T} p(X_1 = x_{t1}, ..., X_n = x_{tn}).

The probability of example t is

p(X_1 = x_{t1}, ..., X_n = x_{tn}) = ∏_{i=1}^{n} p(X_i = x_{ti} | X_1 = x_{t1}, ..., X_{i−1} = x_{t,i−1})
                                  = ∏_{i=1}^{n} p(X_i = x_{ti} | pa(X_i) = pa_{ti}).

The first equation above follows from the chain rule of probabilities, while the second follows from conditional independence in the Bayesian network.

Learning means choosing values, based on the available data, for the aspects of the model that are unknown. Here, the model is the probability distribution specified by the Bayesian network. Its graph is known but the parameters inside its CPTs are unknown. The principle of maximum likelihood says that we should choose values for unknown parameters in such a way that the overall probability of the training data is maximized. This principle is not a theorem that can be proved. It is simply a sensible guideline. One way to argue that the principle is sensible is to notice that, essentially, it says that we should assume that the training data are the most typical possible, that is, that the observed data are the mode of the distribution to be learned.

The principle of maximum likelihood says that we should choose values for the parameters of the CPTs that make P as large as possible. Let these parameters be called w. The principle says that we should choose ŵ = argmax_w P. Because the logarithm function is monotone strictly increasing, this is equivalent to choosing ŵ = argmax_w log P. It is convenient to maximize the log because the log of

a product is a sum, and dealing with sums is easier. So, the goal is to maximize

L = log ∏_{t=1}^{T} ∏_{i=1}^{n} p(X_i = x_{ti} | pa(X_i) = pa_{ti})
  = Σ_{t=1}^{T} Σ_{i=1}^{n} log p(X_i = x_{ti} | pa(X_i) = pa_{ti}).

Swapping the order of the summations gives

L = Σ_{i=1}^{n} Σ_{t=1}^{T} log p(X_i = x_{ti} | pa(X_i) = pa_{ti}).    (1)

Now, notice that each inner sum over t involves a different CPT. These CPTs have parameters whose values can be chosen completely separately. Therefore, L can be maximized by maximizing each inner sum separately. We can decompose the task of maximizing L into n separate subtasks: maximize

M_i = Σ_{t=1}^{T} log p(X_i = x_{ti} | pa(X_i) = pa_{ti})

for i = 1 to i = n.

Consider one of these subtasks. The sum over t treats each training example separately. To make progress, we group the training examples into equivalence classes. Each class consists of all examples among the T that have the same outcome for X_i and the same outcome for the parents of X_i. Let x range over the outcomes of X_i and let π range over the outcomes of the parents of X_i. Let count(x, π) be how many of the T examples have the value x and the configuration π. Note that

T = Σ_x Σ_π count(x, π).

We can write

M_i = Σ_x Σ_π count(x, π) log p(X_i = x | pa(X_i) = π).

We want to choose parameter values for the CPT for node i to maximize this expression. These parameter values are the probabilities p(X_i = x | pa(X_i) = π). These values are constrained by the fact that for each π

Σ_x p(X_i = x | pa(X_i) = π) = 1.
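The grouping into counts count(x, π), together with the per-π normalization that the derivation below shows is the maximum likelihood solution, can be sketched in Python (the data layout and all names here are invented for illustration, not from the notes):

```python
from collections import Counter

def mle_cpt(examples, child, parents):
    """Maximum likelihood CPT for one node: count (x, pi) pairs in the
    training data and normalize within each parent configuration pi."""
    joint = Counter()    # count(x, pi)
    marg = Counter()     # count(pi)
    for ex in examples:  # each ex maps a node name to its observed outcome
        x = ex[child]
        pi = tuple(ex[p] for p in parents)
        joint[(x, pi)] += 1
        marg[pi] += 1
    # Normalizing the counts within each pi gives the CPT entries
    return {(x, pi): c / marg[pi] for (x, pi), c in joint.items()}

# Tiny fully observed training set for a two-node network A -> B
data = [{"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 0, "B": 1}, {"A": 1, "B": 1}]
cpt = mle_cpt(data, child="B", parents=["A"])
# p(B=1 | A=0) = 2/3 and p(B=1 | A=1) = 1, matching the observed frequencies
```

Note that, exactly as the text observes, an outcome never seen with some configuration π simply gets no entry, that is, an estimated probability of zero.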

However, there is no constraint connecting the values for different π. Therefore, we can swap the order of the summations inside the expression for M_i and obtain a separate subtask for each π. Write w_x = p(X_i = x | pa(X_i) = π) and c_x = count(x, π). The problem to solve is

maximize Σ_x c_x log w_x subject to w_x ≥ 0 and Σ_x w_x = 1.

This problem can be solved using Lagrange multipliers. The solution is

w_x = c_x / Σ_{x'} c_{x'}.

In words, the maximum likelihood estimate of the probability that X_i = x, given that the parents of X_i are observed to be π, is

p(X_i = x | pa(X_i) = π) = count(X_i = x, pa(X_i) = π) / Σ_{x'} count(X_i = x', pa(X_i) = π)
                        = count(X_i = x, pa(X_i) = π) / count(pa(X_i) = π)
                        = Σ_t I(x = x_{ti}, π = pa_{ti}) / Σ_t I(π = pa_{ti})

where the counts are with respect to the training data.

These estimates make sense intuitively. Each estimated probability is proportional to the corresponding frequency observed in the training data. If the value x is never observed for some combination π, then its conditional probability is estimated to be zero. Although the estimates are intuitively sensible, only a formal derivation like the one above can show that they are correct (and unique). The derivation uses several mathematical manipulations that are common in similar arguments. These manipulations include changing products into sums, swapping the order of summations, and arguing that maximization subtasks are separate.

End of the lecture on Thursday October 25.

2 Markov models of language

Many applications involving natural language need a model that assigns probabilities to sentences. For example, the most successful translation systems nowadays

for natural language are based on probabilistic models. Let F be a random variable whose values are sentences written in French, and let E be a similar random variable ranging over English sentences. Given a specific French sentence f, the machine translation task is to find

e* = argmax_e p(E = e | F = f).

One way to decompose the task into subtasks is to use Bayes' rule and write

e* = argmax_e p(F = f | E = e) p(E = e) / p(F = f) = argmax_e p(F = f | E = e) p(E = e).

The denominator p(F = f) can be ignored because it is the same for all e. Although creating a model of p(F = f | E = e) is presumably just as difficult as creating a model directly of p(E = e | F = f), the model of p(E = e) can overcome some errors in p(F = f | E = e). For example, regardless of the original sentence in the foreign language, the English sentence "Colorless green ideas sleep furiously" should not be a high-probability translation.

This section explains how to learn basic models of p(E = e). Clearly the probability of a sentence depends on the words in it, and also on the order of the words. Consider a sentence that consists of the words w_1 to w_L in order. Let these words be the outcomes of random variables W_1 to W_L. The chain rule of probabilities says that

p(W_1, W_2, ..., W_L) = p(W_1) p(W_2 | W_1) ⋯ p(W_L | W_{L−1}, ..., W_1).

Words that occur a long way before w_l in the sentence presumably influence the probability of w_l less, so to simplify this expression it is reasonable to fix a number n of previous words and write

∏_{l=1}^{L} p(W_l | W_{l−n}, ..., W_{l−2}, W_{l−1})

with each word depending only on the previous n words. In the special case where n = 0, each word is independent of the previous words. A model of this type is called a Markov model of order n. A unigram model has order n = 0, a bigram model has order n = 1, and a trigram model has order n = 2.

A bigram model corresponds to a Bayesian network with nodes W_1 to W_L and an edge from each node W_l to W_{l+1}. Importantly, the same CPT p(W_{l+1} = j | W_l = i) is used at each node W_{l+1}. Fixing the entries in different CPTs to be the same is called tying.
Notice that technically we have a different Bayesian network for each different length L, but tying CPTs lets us treat all these networks as the same.
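With tying, estimating the shared bigram CPT reduces to counting consecutive word pairs in a corpus, which is the maximum likelihood estimate derived just below. A minimal sketch (the toy corpus and names are invented for illustration):

```python
from collections import Counter

def train_bigram(tokens):
    """MLE bigram CPT: p(next = j | current = i) = c_ij / c_i."""
    pair = Counter(zip(tokens, tokens[1:]))  # c_ij: times word i is followed by word j
    uni = Counter(tokens[:-1])               # c_i: times word i is followed by anything
    return {(i, j): c / uni[i] for (i, j), c in pair.items()}

corpus = "the cat sat on the mat".split()
p = train_bigram(corpus)
# "the" is followed once by "cat" and once by "mat", so p[("the", "cat")] = 0.5
```

Word pairs absent from the corpus get no entry at all, which is exactly the zero-probability problem for unseen bigrams discussed next.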

How can we learn the shared CPT? Each node W_l is a discrete random variable, but one with a very large set of values. The cardinality of this set is the size of the vocabulary, typically between 10^4 and 10^5 in applications. Since most words never follow each other, a document collection of size smaller than (10^5)^2 words can be adequate for training. Fortunately, nowadays it is easy to assemble and process collections of 10^8 and more words. The maximum likelihood estimate of the CPT parameters is

p(W_l = j | W_{l−1} = i) = c_{ij} / c_i

where c_i is the number of times that word i occurs followed by any other word, and c_{ij} is the number of times that word i occurs followed by word j.

A note on notation: it is convenient to assume that each word is an integer between 1 and the vocabulary size. Notation such as w_i instead of i for the ith word causes two difficulties: it leads to double subscripts, and it suggests that strings are mathematical objects.

Some issues occur with n-gram models. The first issue is that they do not handle novel words in an intelligent way. Typically we convert each word not in the predefined vocabulary into a special fixed token such as UNK, and then treat this as an ordinary word. The second issue is that all sequences of words not seen in the training collection are assigned zero probability. For example, the bigram "pink flies" may be so uncommon that it occurs zero times in the training collection, but that does not mean it is impossible. Its probability should be small, but above zero. The higher the order of the n-gram model is, the more important this second issue is.

3 Linear regression

Linear regression is perhaps the most widely used method of modeling data in classical statistics. Here we see how it fits into the paradigm of learning the parameters of a Bayesian network via the principle of maximum likelihood. We have independent nodes X_1 to X_d and a dependent node Y, with an edge X_i → Y for each i. Intuitively, the value of Y is a linear function of the values of X_1 to X_d, plus some random noise. Assuming that the noise has mean zero, we can write

E[Y] = Σ_{i=1}^{d} w_i x_i = w̄ · x̄
Assuming ha he noise has mean zero, we can wrie d E[Y ] = w i x i = w x i=1 6

where w_1 to w_d are parameters describing the linear dependence. The standard choice to model the random noise is a Gaussian distribution with mean zero and variance σ². The probability density function of this distribution is

p(z) = (1/√(2πσ²)) exp(−z²/(2σ²)).

Combining this with the expression for E[Y] gives

p(Y = y | X = x̄) = (1/√(2πσ²)) exp(−(y − w̄ · x̄)²/(2σ²)).

End of the lecture on Tuesday October 30.

To learn the parameters w_1 to w_d we have training examples (x̄_t, y_t) for t = 1 to t = T. Assume that each x̄_t is a column vector. Given that these examples are IID, the log likelihood is

L = Σ_{t=1}^{T} log p(y_t | x̄_t) = Σ_{t=1}^{T} [ −(1/2) log(2πσ²) − (1/(2σ²)) (y_t − w̄ · x̄_t)² ].

We can maximize this expression in two stages: first find the optimal w_i values, and then find the optimal σ² value. The first subproblem is to minimize (not maximize)

S = Σ_{t=1}^{T} (y_t − w̄ · x̄_t)².

We can solve this by setting the partial derivatives of S to zero. We get the equations

∂S/∂w_i = Σ_{t=1}^{T} −2 (y_t − w̄ · x̄_t) x_{ti} = 0

for i = 1 to i = d, where we write x_{ti} because x̄_t is a column vector. These yield the system of d linear equations

Σ_{t=1}^{T} y_t x_{ti} = Σ_{t=1}^{T} (w̄ · x̄_t) x_{ti}.

Note that each of the d equations involves all of the unknowns w_1 to w_d. In matrix notation, the system of equations is b̄ = A w̄. Here, b̄ is the column vector of

length d whose ith entry is b_i = Σ_t y_t x_{ti}, that is, b̄ = Σ_t y_t x̄_t. The right side is Σ_{t=1}^{T} x̄_t (x̄_t^T w̄), where the superscript T means transpose and the dot product has been written as a matrix product. This yields

b̄ = Σ_{t=1}^{T} x̄_t (x̄_t^T w̄) = ( Σ_{t=1}^{T} x̄_t x̄_t^T ) w̄ = A w̄

where the d × d square matrix A = Σ_t x̄_t x̄_t^T. The row i, column j entry of A is A_{ij} = Σ_t x_{ti} x_{tj}.

Mathematically, the solution to the system A w̄ = b̄ is w̄ = A⁻¹ b̄. Computationally, evaluating the inverse A⁻¹ of A is more expensive than just solving the system of equations once for a specific vector b̄. In practice, in Matlab one uses the backslash operator, and other programming environments have a similar feature.

The inverse of A is not well-defined when A does not have full rank. Since A is the sum of T matrices of rank one, this happens when T < d, and can happen when the input vectors x̄_t are not linearly independent. One way of overcoming this issue is to choose the solution w̄ with minimum norm such that A w̄ = b̄. Such a w̄ always exists and is unique. Concretely, this solution is w̄ = A⁺ b̄ where A⁺ is the Moore-Penrose pseudoinverse of A, which always exists, and can be computed via the singular value decomposition (SVD) of A.

We said above that we can maximize the log likelihood in two stages, first finding the best w_i values, and then finding the best σ² value. The second stage is left as an exercise for the reader.

4 The general EM algorithm

Suppose that, in the data available for training, the outcomes of some random variables are unknown for some examples. These outcomes are called hidden or latent, and the examples are called incomplete or partial. Conceptually, it is not the case that the hidden outcomes do not exist. Rather, they do exist, but they have been concealed from the observer.

Let X be the set of all nodes of the Bayesian network. As before, suppose that there are T training examples, which are independent and identically distributed. For the tth training example, let V_t be the set of visible nodes and let H_t be the set of hidden nodes, so X = V_t ∪ H_t. Note that different examples may have different hidden nodes.
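Before continuing with EM, the least-squares computation from Section 3 can be made concrete. A sketch using NumPy (the data and parameter values are invented for illustration), comparing a direct solve of A w̄ = b̄ with the SVD-based route, which `numpy.linalg.lstsq` uses and which returns the minimum-norm solution when A is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 3
X = rng.normal(size=(T, d))                  # row t is the input vector x_t
w_true = np.array([1.0, -2.0, 0.5])          # hypothetical true parameters
y = X @ w_true + 0.1 * rng.normal(size=T)    # linear signal plus Gaussian noise

# Normal equations: A w = b with A = sum_t x_t x_t^T and b = sum_t y_t x_t
A = X.T @ X
b = X.T @ y

# Solve the system directly rather than forming the inverse of A
w_solve = np.linalg.solve(A, b)

# lstsq works from X and y and also handles rank-deficient problems
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

assert np.allclose(w_solve, w_lstsq)
```

Solving the system (or calling `lstsq`) plays the role of Matlab's backslash operator mentioned above.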

As before, we want to maximize the log likelihood of the observed data:

L = Σ_t log p(V_t = v_t)
  = Σ_t log Σ_{h_t} p(V_t = v_t, H_t = h_t)
  = Σ_t log Σ_{h_t} ∏_{i=1}^{n} p(X_i = x_{ti} | pa(X_i) = pa_{ti}).

In the last expression above, each X_i belongs to either V_t or H_t. Because of the sum over h_t, we cannot move the logarithm inside the product, and we do not get a separate optimization subproblem for each node X_i. Expectation-maximization (EM) is the name for an approach to solving the combined optimization problem.

To simplify notation, assume initially that there is just one training example, with one observed random variable X = x and one hidden random variable Z. Let θ be all the parameters of the joint model p(X = x, Z = z; θ). Following the principle of maximum likelihood, the goal is to choose θ to maximize the log likelihood function, which is L(θ; x) = log p(x; θ). As noted before, p(x; θ) = Σ_z p(x, z; θ). Suppose we have a current estimate θ' for the parameters. Multiplying inside this sum by p(z | x; θ') / p(z | x; θ') gives that the log likelihood is

D = log p(x; θ) = log Σ_z [ p(x, z; θ) / p(z | x; θ') ] p(z | x; θ').

Note that Σ_z p(z | x; θ') = 1 and p(z | x; θ') ≥ 0 for all z. Therefore D is the logarithm of a weighted sum, so we can apply Jensen's inequality, [1] which says

[1] The mathematical fact on which the EM algorithm is based is known as Jensen's inequality. It is the following lemma. Lemma: Suppose the weights w_j are nonnegative and sum to one, and let each x_j be any real number for j = 1 to j = n. Let f : R → R be any concave function. Then f(Σ_{j=1}^{n} w_j x_j) ≥ Σ_{j=1}^{n} w_j f(x_j). Proof: The proof is by induction on n. For the base case n = 2, the definition of being concave says that f(wa + (1−w)b) ≥ w f(a) + (1−w) f(b). The logarithm function is concave, so Jensen's inequality applies to it.

log Σ_j w_j v_j ≥ Σ_j w_j log v_j, given Σ_j w_j = 1 and each w_j ≥ 0. Here, we let the sum range over the values z of Z, with the weight w_j being p(z | x; θ'). We get

D ≥ E = Σ_z p(z | x; θ') log [ p(x, z; θ) / p(z | x; θ') ].

Separating the fraction inside the logarithm to obtain two sums gives

E = ( Σ_z p(z | x; θ') log p(x, z; θ) ) − ( Σ_z p(z | x; θ') log p(z | x; θ') ).

Since E ≤ D and we want to maximize D, consider maximizing E. The weights p(z | x; θ') do not depend on θ, so we only need to maximize the first sum, which is

Σ_z p(z | x; θ') log p(x, z; θ).

In general, the E step of an EM algorithm is to compute p(z | x; θ') for all z. The M step is then to find θ to maximize Σ_z p(z | x; θ') log p(x, z; θ).

How do we know that maximizing E actually leads to an improvement in the likelihood? With θ = θ',

E = Σ_z p(z | x; θ') log [ p(x, z; θ') / p(z | x; θ') ] = Σ_z p(z | x; θ') log p(x; θ') = log p(x; θ')

which is the log likelihood at θ'. So any θ that maximizes E must lead to a likelihood that is better than the likelihood at θ'.

5 EM with independent training examples

The EM algorithm derived above can be extended to the case where we have a training set {x_1, ..., x_n} such that each x_i is independent, and they all share the same parameters θ. In this case the log likelihood is D = Σ_i log p(x_i; θ). Let the auxiliary random variables be a set {Z_1, ..., Z_n} such that the distribution of each Z_i is a function only of the corresponding x_i and θ. Note that Z_i may be different for each i. By an argument similar to above,

D = Σ_i log Σ_{z_i} [ p(x_i, z_i; θ) / p(z_i | x_i; θ') ] p(z_i | x_i; θ').
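Before Jensen's inequality is applied term by term, the lemma from the footnote above can be sanity-checked numerically. A small sketch with arbitrary weights and positive values (illustrative only):

```python
import math

w = [0.2, 0.3, 0.5]   # nonnegative weights that sum to one
v = [1.0, 4.0, 9.0]   # arbitrary positive values

# Jensen for the concave logarithm: log(sum_j w_j v_j) >= sum_j w_j log(v_j)
lhs = math.log(sum(wi * vi for wi, vi in zip(w, v)))
rhs = sum(wi * math.log(vi) for wi, vi in zip(w, v))
assert lhs >= rhs
```

In the EM derivation the v_j are the ratios p(x, z; θ) / p(z | x; θ') and the weights are the posteriors p(z | x; θ'), so the same inequality lower-bounds the log likelihood.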

Using Jensen's inequality separately for each i gives

D ≥ E = Σ_i Σ_{z_i} p(z_i | x_i; θ') log [ p(x_i, z_i; θ) / p(z_i | x_i; θ') ].

As before, to maximize E we want to maximize the sum

Σ_i Σ_{z_i} p(z_i | x_i; θ') log p(x_i, z_i; θ).

The E step is to compute p(z_i | x_i; θ') for all z_i for each i. The M step is then to find the new parameter values

θ_new = argmax_θ Σ_i Σ_{z_i} p(z_i | x_i; θ') log p(x_i, z_i; θ).

End of the lecture on Thursday November 1.

6 EM for Bayesian networks

Let θ⁰ be the current estimate of the parameters of a Bayesian network. For training example t, let v_t be the observed values of the visible nodes. The M step of EM is to choose new parameter values θ that maximize

F = Σ_t Σ_{h_t} p(h_t | v_t; θ⁰) log p(h_t, v_t; θ)

where the inner sum is over all possible combinations h_t of outcomes of the nodes that are hidden in the tth training example. We shall show that instead of summing explicitly over all possible combinations h_t, we can have a separate summation for each hidden node. The advantage of this is that separate summations are far more efficient computationally.

By the definition of a Bayesian network,

F = Σ_t Σ_{h_t} p(h_t | v_t; θ⁰) log ∏_i p(X_i = x_{ti} | pa(X_i) = pa_{ti}; θ)

where each x_{ti} and each value in pa_{ti} is part of either v_t or h_t. Converting the log product into a sum of logs, then moving this sum to the outside, gives

F = Σ_i Σ_t Σ_{h_t} p(h_t | v_t; θ⁰) log p(x_{ti} | pa_{ti}; θ).

For each i, the sum over h_t can be replaced by a sum over the alternative values x of X_i and π of the parents of X_i, yielding

F = Σ_i Σ_t Σ_{x,π} p(X_i = x, pa(X_i) = π | v_t; θ⁰) log p(x | π; θ).

Note that summing over alternative values for X_i and its parents makes sense even if some of these random variables are observed. If X_i happens to be observed for training example t, let its observed value be x_{ti}. In this case, p(X_i = x, pa(X_i) = π | v_t; θ⁰) = 0 for all values x ≠ x_{ti}. A similar observation is true for parents of X_i that are observed.

Changing the order of the sums again gives

F = Σ_i Σ_{x,π} [ Σ_t p(x, π | v_t; θ⁰) ] log p(x | π; θ).

For comparison, the log likelihood in Equation (1) for the fully observed case can be rewritten as

Σ_i Σ_{x,π} [ Σ_t I(x = x_{ti}, π = pa_{ti}) ] log p(x | π; θ).

The argument following Equation (1) says that the solution that maximizes this expression is

p(X_i = x | pa(X_i) = π) = Σ_t I(x = x_{ti}, π = pa_{ti}) / Σ_t I(π = pa_{ti}).

A similar argument can be applied here to give that the solution for the new parameter values θ, in the partially observed case, is

p(x | π; θ) = p(X_i = x | pa(X_i) = π) = Σ_t p(X_i = x, pa(X_i) = π | v_t; θ⁰) / Σ_t p(pa(X_i) = π | v_t; θ⁰).

To appreciate the meaning of this result, remember that θ is shorthand for all the parameters of the Bayesian network, that is, all the CPTs of the network. A single one of these parameters is one number in one CPT, written p(x | π; θ). In the special case where X_i and its parents are fully observed, their values x_{ti} and pa_{ti} are part of v_t, and p(X_i = x, pa(X_i) = π | v_t; θ⁰) = I(x = x_{ti}, π = pa_{ti}). The maximum likelihood estimation method for θ explained at the end of Section 1 above is a special case of the expectation-maximization method described here.
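The update above replaces observed counts with expected counts. Given the posterior marginals from an E step (computed by whatever inference routine is available), the M step is weighted counting and normalization. A minimal sketch, in which the `posteriors` input and its format are invented for illustration:

```python
from collections import defaultdict

def m_step_cpt(posteriors):
    """posteriors: a list over training examples t of dicts mapping
    (x, pi) -> p(X_i = x, pa(X_i) = pi | v_t; theta0).
    Returns the updated CPT p(x | pi): expected counts normalized per pi."""
    joint = defaultdict(float)   # sum_t p(X_i = x, pa = pi | v_t)
    marg = defaultdict(float)    # sum_t p(pa = pi | v_t)
    for post in posteriors:
        for (x, pi), p in post.items():
            joint[(x, pi)] += p
            marg[pi] += p
    return {(x, pi): p / marg[pi] for (x, pi), p in joint.items()}

# Two examples; in the second, X_i is hidden, so its posterior is spread over x
posteriors = [
    {(0, "pi1"): 1.0},                      # fully observed: an indicator
    {(0, "pi1"): 0.25, (1, "pi1"): 0.75},   # hidden: fractional "soft" counts
]
cpt = m_step_cpt(posteriors)
# p(x=0 | pi1) = (1.0 + 0.25) / 2.0 = 0.625
```

When every example is fully observed, each posterior is an indicator and this reduces to the counting estimator of Section 1, exactly as the text states.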

7 Applying EM to modeling language

Section 2 above described n-gram models of language. A major issue with these models is that unigram models underfit the available data, while higher-order models tend to overfit. This section shows how to use expectation-maximization to fit a model with intermediate complexity, which can trade off between underfitting and overfitting.

The central idea is to introduce a hidden random variable called Z between the random variable W for a word and the variable W' for the following word. Specifically, the Bayesian network has edges W → Z and Z → W'. The alternative values of the variable Z can be any discrete set. Intuitively, each of these values identifies a different possible linguistic context. Each context has a certain probability depending on the previous word, and each following word has a certain probability depending on the context. We can say that the previous word triggers each context with a word-specific probability, while each context suggests following words with word-specific probabilities.

Let the number of alternative contexts be c. Marginalizing out the variable Z gives

p(w' | w) = Σ_{z=1}^{c} p(w' | z) p(z | w).

This context model has m(c − 1) + c(m − 1) parameters, where m is the size of the vocabulary. If c = 1, the model reduces to the unigram model, while if c = m, the model has a quadratic number of parameters, like the bigram model.

End of the lecture on Tuesday November 6.

The following M step derivation is the same as in the quiz. The goal for training is to maximize the log likelihood of the training data, which is Σ_t log p(w_t, w'_t). (We ignore the complication that training examples are not independent if they are taken from consecutive text.) Training the model means estimating p(z | w) and p(w' | z). Consider the former task first. The M step of EM is to perform the update

p(Z = z | W = w) := Σ_t p(Z = z, W = w | W = w_t, W' = w'_t) / Σ_t p(W = w | W = w_t, W' = w'_t)
                 = Σ_t I(w_t = w) p(Z = z | W = w_t, W' = w'_t) / Σ_t I(w_t = w)
                 = Σ_{t : w_t = w} p(Z = z | W = w_t, W' = w'_t) / count(W = w).

This M step is intuitively reasonable. First, the denominator says that the probability of context z given current word w depends only on training examples which have this word. Second, the numerator says that this probability should be high if z is compatible with the following word as well as with the current word.

The E step is to evaluate p(z | w, w') for all z, for each pair of consecutive words w and w'. By Bayes' rule this is

p(z | w, w') = p(w' | z, w) p(z | w) / Σ_{z'} p(w' | z', w) p(z' | w)
            = p(w' | z) p(z | w) / Σ_{z'} p(w' | z') p(z' | w).

This result is also intuitively reasonable. It says that the weight of a context z is proportional to its probability given w and to the probability of w' given it.

Finally, consider estimating p(w' | z). The M step for this is to perform the update

p(W' = w' | Z = z) := Σ_t p(W' = w', Z = z | W = w_t, W' = w'_t) / Σ_t p(Z = z | W = w_t, W' = w'_t)
                   = Σ_t I(w'_t = w') p(Z = z | W = w_t, W' = w'_t) / Σ_t p(Z = z | W = w_t, W' = w'_t)
                   = Σ_{t : w'_t = w'} p(Z = z | W = w_t, W' = w'_t) / Σ_t p(Z = z | W = w_t, W' = w'_t).

The denominator here says that the update is based on all training examples, but each one is weighted according to the probability of the context z. The numerator selects, with the same weights, just those training examples for which the second word is w'. The E step is actually the same as above: to evaluate p(z | w, w') for all z, for each pair of consecutive words w and w'.

8 Mixture models

Suppose that we have alternative models p(x; θ_j) for j = 1 to j = k that are applicable to the same data points x. The linear combination

p(x) = Σ_{j=1}^{k} λ_j p(x; θ_j)

is a valid probability distribution if λ_j ≥ 0 and Σ_{j=1}^{k} λ_j = 1. The combined model is interesting because it is more flexible than any individual model. It is often called a mixture model with k components, but it can also be called an interpolation model, or a cluster model with k clusters.

We can formulate the task of learning the coefficients from training examples using a Bayesian network that has an observed node X, an unobserved node Z, and one edge Z → X. The CPT for Z is simply p(Z = j) = λ_j, while the CPT for X is p(x | z) = p(x; θ_z). The goal is to maximize the log likelihood of training examples x_1 to x_T. Marginalizing over Z, then using the product rule, shows that

p(x) = Σ_z p(x, z) = Σ_z p(z) p(x | z) = Σ_{j=1}^{k} λ_j p(x; θ_j)

which is the same mixture model. The CPT of the node Z can be learned using EM. The E step is to compute p(Z = j | x_t) for all j, for each training example x_t. Using Bayes' rule, this is

p(Z = j | x_t) = p(x_t | Z = j) p(Z = j) / p(x_t) = p(x_t; θ_j) λ_j / Σ_{i=1}^{k} λ_i p(x_t; θ_i).

The general M step for Bayesian networks is

p(X_i = x | pa_i = π) := Σ_t p(X_i = x, pa_i = π | v_t) / Σ_t Σ_{x'} p(X_i = x', pa_i = π | v_t).

For the application here, X_i is Z and the parents of X_i are the empty set. We get the update

p(Z = j) = λ_j := Σ_t p(Z = j | x_t) / Σ_{i=1}^{k} Σ_t p(Z = i | x_t) = (1/T) Σ_t p(Z = j | x_t)

where T is the number of training examples.

End of the lecture on Thursday November 8.

9 Interpolating language models

As a special case of training a mixture model, consider a linear combination of language models of different orders:

p(w_l | w_{l−1}, w_{l−2}) = λ_1 p_1(w_l) + λ_2 p_2(w_l | w_{l−1}) + λ_3 p_3(w_l | w_{l−1}, w_{l−2})

where all three component models are trained on the same corpus A. What is a principled way to estimate the interpolation weights λ_i? The first important point is that the weights should be trained using a different corpus, say C. Specifically, we can choose the weights to optimize the log likelihood of C. If the weights are estimated on A, the result will always be λ_n = 1 and λ_i = 0 for i < n, where n indicates the highest-order model, because this model fits the A corpus the most closely. When testing the final combined model, we must use a third corpus B, since the weights will overfit C, at least slightly.

We can formulate the task of learning the λ_i weights using a Bayesian network. The network has nodes W_{l−2}, W_{l−1}, W_l, and Z, with edges W_{l−2} → W_l, W_{l−1} → W_l, and Z → W_l. The CPT for Z is simply p(Z = i) = λ_i, while the CPT for W_l is

p(w_l | w_{l−1}, w_{l−2}, z) = p_1(w_l) if z = 1
                             = p_2(w_l | w_{l−1}) if z = 2
                             = p_3(w_l | w_{l−1}, w_{l−2}) if z = 3.

The goal is to maximize the log likelihood of the tuning corpus C. Marginalizing over Z, then using the product rule and conditional independence, shows that

p(w_l | w_{l−1}, w_{l−2}) = λ_1 p_1(w_l) + λ_2 p_2(w_l | w_{l−1}) + λ_3 p_3(w_l | w_{l−1}, w_{l−2})

as above. To learn values for the parameters λ_i = p(Z = i), the E step is to compute the posterior probability p(Z = i | w_l, w_{l−1}, w_{l−2}). Using Bayes' rule, and writing p_i for the ith component model, this is

p(Z = i | w_l, w_{l−1}, w_{l−2}) = λ_i p_i(w_l | w_{l−1}, w_{l−2}) / Σ_{j=1}^{3} λ_j p_j(w_l | w_{l−1}, w_{l−2}).

The M step is to update the λ_i values. The general M step for Bayesian networks is

p(X_i = x | pa_i = π) := Σ_t p(X_i = x, pa_i = π | v_t) / Σ_t Σ_{x'} p(X_i = x', pa_i = π | v_t).

For the application here, training example number t is the word triple ending in w_l, X_i is Z, and the parents of X_i are the empty set. We get the update

p(Z = i) := Σ_l p(Z = i | w_l, w_{l−1}, w_{l−2}) / Σ_{j=1}^{3} Σ_l p(Z = j | w_l, w_{l−1}, w_{l−2}) = (1/L) Σ_l p(Z = i | w_l, w_{l−1}, w_{l−2})

where L is the number of words in the corpus.
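The E and M steps for the interpolation weights fit in a few lines of code. A sketch, in which the per-position component probabilities p_1, p_2, p_3 are assumed precomputed on the tuning corpus C (the toy data and names are invented for illustration):

```python
def em_interpolation(position_probs, n_iters=100):
    """position_probs: list over tuning-corpus positions l of the triple
    (p1(w_l), p2(w_l | w_{l-1}), p3(w_l | w_{l-1}, w_{l-2})).
    Returns the interpolation weights lambda_1..lambda_3."""
    lam = [1 / 3, 1 / 3, 1 / 3]
    for _ in range(n_iters):
        post_sums = [0.0, 0.0, 0.0]
        for probs in position_probs:
            mix = sum(l * p for l, p in zip(lam, probs))
            for i in range(3):  # E step: posterior p(Z = i | w_l and its history)
                post_sums[i] += lam[i] * probs[i] / mix
        total = sum(post_sums)  # equals the number of positions L
        lam = [s / total for s in post_sums]  # M step: average the posteriors
    return lam

# Toy tuning data: precomputed component probabilities at three positions
data = [(0.1, 0.3, 0.6), (0.2, 0.2, 0.1), (0.05, 0.4, 0.5)]
weights = em_interpolation(data)
assert abs(sum(weights) - 1.0) < 1e-9
```

Each iteration averages the posteriors over the corpus, which is exactly the update p(Z = i) := (1/L) Σ_l p(Z = i | w_l, w_{l−1}, w_{l−2}) derived above; in the smoothing literature this procedure is known as deleted interpolation.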