XII.3 The EM (Expectation-Maximization) Algorithm

Toshinori Munakata 3/7/06

The EM algorithm is a technique for dealing with various types of incomplete data or hidden variables. It can be applied to a wide range of learning problems, including Bayesian networks.

Basics of the EM algorithm

Assume that we have an incomplete data set. It is incomplete because data for some or all instances of certain attributes are missing, or because some characteristic values are unknown. The data and associated parameters in the EM algorithm are classified into three categories: X, Z and θ. Let $X = \{x_1, \ldots, x_m\}$ be a set of observed data, $Z = \{z_1, \ldots, z_m\}$ be a set of unobserved (i.e., hidden) data, and $Y = X \cup Z$ be the entire set of data. There is a one-to-one correspondence between $x_i$ in X and $z_i$ in Z. Also let θ be a set of unknown parameters that characterizes Y. Given X, our problem is to determine Z and θ. We note that the problem is doubly complicated because we have to determine both Z and θ.

Depending on the application, X, Z and θ can take a variety of forms. For example, each of $x_i$ and $z_i$ can be a scalar or a vector having components such as $z_{i1}, z_{i2}, \ldots$ Each parameter in θ can also be a scalar or a vector.

The basic steps of the EM algorithm can be described as follows:

0. Initialization. Assign values to the parameters in θ (arbitrarily, at random, or based on some knowledge).

Repeat (iterate) the following two steps until the solution for Z and θ converges.

1. Expectation Step (E-Step in short). Assuming the current parameter values are correct, compute expected values of Z.

2. Maximization Step (M-Step in short). Assuming that $Y = X \cup Z$ is a truly observed data set (although the Z part is not), calculate new parameter values of θ.

Step 2 is called the maximization step because we search for a maximum likelihood hypothesis in terms of the parameters with the data set $Y = X \cup Z$. As with any other so-called hill-climbing algorithm, EM can get stuck in a local maximum, particularly when substantial data are missing.
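The three steps above can be sketched as a small loop. This is only a minimal illustration, not the notes' own code: the name `run_em`, the scalar θ, and the toy problem (estimating a mean when two of five values are missing) are all assumptions made for the example.

```python
def run_em(theta, e_step, m_step, tol=1e-10, max_iter=1000):
    """Alternate E and M steps until theta changes by less than tol.

    Assumes theta is a single number; a real application would compare
    parameter vectors instead.
    """
    z = None
    for _ in range(max_iter):
        z = e_step(theta)          # E-step: expected hidden data given theta
        new_theta = m_step(z)      # M-step: re-estimate theta from X and Z
        done = abs(new_theta - theta) < tol
        theta = new_theta
        if done:
            break
    return z, theta


# Toy problem (hypothetical data): three observed values, two missing ones.
observed = [1.0, 2.0, 3.0]   # X
n_missing = 2                # size of the hidden part Z


def e_step(mean):
    # Expected value of each missing entry is the current mean estimate.
    return [mean] * n_missing


def m_step(z):
    # Treat X and Z together as if fully observed and take the mean.
    return (sum(observed) + sum(z)) / (len(observed) + len(z))


z, theta = run_em(0.0, e_step, m_step)   # step 0: initialize theta to 0.0
print(theta)
```

In this toy case the estimate settles near the mean of the observed values, since the missing entries carry no extra information.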
Prelude: Simple Illustrations

As discussed in elementary statistics textbooks, the normal distribution f(x) with mean µ and standard deviation σ is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{1}$$

f(x) represents the probability density function, or the probability distribution, of the normal distribution. The area under the curve f(x) is 1 because of the normalizing factor $1/(\sigma\sqrt{2\pi})$ (see Fig. 6). Many types of data follow the normal distribution with appropriate scaling factors.
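As a quick check of equation (1), the sketch below evaluates the density and estimates the area under the curve numerically. The function names and the values µ = 5, σ = 2 are assumptions chosen for illustration only.

```python
import math


def normal_pdf(x, mu, sigma):
    """Equation (1): density of the normal distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def area_under_curve(mu, sigma, half_width=10.0, steps=20000):
    """Midpoint-rule estimate of the area over mu +/- half_width * sigma."""
    a = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    return sum(normal_pdf(a + (i + 0.5) * h, mu, sigma) for i in range(steps)) * h


peak = normal_pdf(5.0, 5.0, 2.0)   # height at the mean: 1 / (sigma * sqrt(2*pi))
area = area_under_curve(5.0, 2.0)  # should be very close to 1
print(peak, area)
```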

Figure 6. A normal distribution with mean µ and standard deviation σ.

Example. A 10-point quiz in a class of 100 students. The only given data is $X = \{x_i\}$, i = 1, ..., 100, where $x_i$ represents the score received by the ith student. For example, $X = \{x_1, x_2, x_3, x_4, \ldots, x_{100}\} = \{0, 1, 1, 2, \ldots, 10\}$. (Note. Usually the normal distribution is used for continuous values of x such as x = 5.47. We are applying the distribution to discrete values of x as a simple illustration.) If we assume for simplicity that the $x_i$ are sorted, the data indicate that one student received 0 points, two students received 1 point, etc. The following Fig. 7 shows parts of X for $x_1$, $x_2$, $x_3$, $x_{99}$, and $x_{100}$. Each circle on the x-axis represents a data instance.

Figure 7. Points received by students for a 10-point quiz in a class of 100 students. Only the points for five out of 100 students are shown: the leftmost circle represents $x_1 = 0$, the next two circles represent $x_2 = 1$ and $x_3 = 1$, and the two rightmost circles $x_{99} = 10$ and $x_{100} = 10$.

In general, the data need not be sorted. For example, $x_1$ for Adams may be 8, $x_2$ for Smith may be 3, etc. It is common practice to tally the data to see the point distribution more easily. The following is such a tally for our example:

x: 0 1 2 3 4 5 6 7 8 9 10
f(x): 6 7 0 9 7 3
Total = 100

Here x represents the quiz points, and f(x) the number, i.e., the frequency, of students who received score x. The tally makes it convenient to see the distribution, i.e., how many students received how many points. But the original information in the raw data, the points received by each student, is lost. The above tally can be depicted by a graph, by adding to Fig. 7 an ordinate representing the number of students and dropping the circles representing students, resulting in the following Fig. 8. We see that a graph such as Fig. 8 can be approximated by a normal distribution like equation (1) and Fig. 6 (with an appropriate scaling factor).
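The tally and the normal approximation can be sketched as follows. The scores below are hypothetical (they are not the quiz data of the example), and using the number of students m as the scaling factor assumes score bins of width 1.

```python
import math
from collections import Counter

# Hypothetical raw scores for a 10-point quiz (illustration only).
scores = [0, 1, 1, 2, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 7, 7, 8, 10]

tally = Counter(scores)            # f(x): number of students with score x
m = len(scores)
mu = sum(scores) / m               # sample mean
sigma = math.sqrt(sum((x - mu) ** 2 for x in scores) / m)  # standard deviation


def normal_pdf(x, mu, sigma):
    """Equation (1): density of the normal distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


# Compare each tallied frequency with the scaled normal curve m * f(x).
for x in range(0, 11):
    expected = m * normal_pdf(x, mu, sigma)
    print(x, tally[x], round(expected, 2))
```

A `Counter` returns 0 for scores nobody received, so the loop can cover the whole range 0 to 10 without special cases.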

Figure 8. Point distribution for a 10-point quiz in a class of 100 students.

In certain cases, data may not fit a normal distribution well. Instead, it may be more natural to fit the data to a mixture of normal distributions. Suppose the point distribution is given as follows:

x: 0 1 2 3 4 5 6 7 8 9 10
f(x): 0 7 6 4 3 7 8 6 5
Total = 100

Fig. 9 depicts this distribution. We see two peaks, at x = 3 and x = 8. It may be natural to consider this distribution as a mixture of two normal distributions with different means, standard deviations, and heights. In general, data can be a mixture of two, three, ..., normal distributions. This is the type of problem the EM algorithm addresses: given data, some of which are observable while some are not, we are to determine the underlying mixture of normal distributions.

Figure 9. Data that fit a mixture of two normal distributions.

Case Study: A Mixture of k Distinct Normal Distributions

The following Fig. 10 shows an example of two normal distributions, $f_1(x)$ with $\mu_1$ and $\sigma_1$ and $f_2(x)$ with $\mu_2$ and $\sigma_2$. The area under each of the two distributions is 1. In our problem, we want to fit each data instance x to a probability distribution f(x) that is a mixture of k distinct normal distributions multiplied by weights, where k is a known positive integer (for example, k = 2 in Fig. 10), as:

$$f(x) = \sum_{j=1}^{k} p_j\, f_j(x; \mu_j, \sigma_j) \tag{2}$$

where $f_j$ is the jth normal distribution for a data instance x, with mean $\mu_j$ and standard deviation $\sigma_j$. For specific values of x, $\mu_j$ and $\sigma_j$, the constituent function $f_j(x; \mu_j, \sigma_j)$ can be evaluated by substituting these values into equation (1). $p_j$ is the weight, or the probability, of component j contributing to the total distribution f(x). $p_j$ is associated only with j and does not depend on a specific x; $\sum_{j=1}^{k} p_j = 1$ holds. The probability distribution $f(x_i)$ for a specific data instance $x_i$ can be represented by simply replacing x with $x_i$ in equation (2).

Figure 10. A mixture of k = 2 normal distributions, $f_1(x)$ and $f_2(x)$. Each circle on the x-axis represents a data instance.

Example. Suppose that our probability distribution f(x) is a mixture of the k = 2 normal distributions $f_1(x)$ and $f_2(x)$ given in Fig. 10, with $p_1 = 0.8$ and $p_2 = 0.2$. We note that $p_1 + p_2 = 1$. Then

$$f(x) = 0.8\, f_1(x; \mu_1, \sigma_1) + 0.2\, f_2(x; \mu_2, \sigma_2) = \frac{0.8}{\sigma_1\sqrt{2\pi}}\, e^{-\frac{(x-\mu_1)^2}{2\sigma_1^2}} + \frac{0.2}{\sigma_2\sqrt{2\pi}}\, e^{-\frac{(x-\mu_2)^2}{2\sigma_2^2}}$$

A graph for this f(x) can be obtained from Fig. 10 as follows. Contract (flatten) $f_1(x)$ and $f_2(x)$ along the ordinate direction by multiplying them by 0.8 and 0.2, respectively, then add the two graphs (Fig. 11).

Figure 11. Graph of f(x) obtained by superposing k = 2 normal distributions, $0.8 f_1(x)$ and $0.2 f_2(x)$.
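The mixture of equation (2) with weights 0.8 and 0.2 can be sketched as below. The means and standard deviations are not given in the text, so the values used here (µ = 3 and 8, σ = 1 and 1.5) are assumptions for illustration, as are the function names.

```python
import math


def normal_pdf(x, mu, sigma):
    """Equation (1): density of the normal distribution at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def mixture_pdf(x, weights, mus, sigmas):
    """Equation (2): weighted sum of k normal densities."""
    return sum(p * normal_pdf(x, mu, s) for p, mu, s in zip(weights, mus, sigmas))


weights = [0.8, 0.2]   # p_1, p_2 from the example
mus = [3.0, 8.0]       # assumed means
sigmas = [1.0, 1.5]    # assumed standard deviations

# Because the weights sum to 1, the mixture should still have area 1.
# Midpoint rule over [-10, 20], which covers both components.
h = 0.001
area = sum(
    mixture_pdf(-10.0 + (i + 0.5) * h, weights, mus, sigmas) for i in range(30000)
) * h
print(mixture_pdf(3.0, weights, mus, sigmas), area)
```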

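The EM algorithm

Let $X = \{x_1, \ldots, x_m\}$ be a set of observed data instances. Our problem is to determine a set of unobservable data Z and a set of parameters θ that characterizes $Y = X \cup Z$. More specifically, Z and θ are:

$Z = \{p_{ij}\}$, i = 1, ..., m and j = 1, ..., k, where $p_{ij}$ represents the probability that $x_i$ belongs to the jth component. Here each $z_i$ in $Z = \{z_1, \ldots, z_m\}$ is a vector having k components, $z_{i1} = p_{i1}, \ldots, z_{ik} = p_{ik}$.

$\theta = \{p_1, \ldots, p_k, \mu_1, \ldots, \mu_k, \sigma_1, \ldots, \sigma_k\}$, the parameter vector.

In the Fig. 10 example, where k = 2, $\{x_1, \ldots, x_m\}$ are represented by small circles on the abscissa. They are the only data that are observable. Our problem is to determine two distributions like $f_1(x)$ and $f_2(x)$ through the parameter vector θ, together with $\{p_{ij}\}$, a measure of which distribution generated each data instance. Our EM algorithm can be performed as follows:

Step 0. Initialization. Assign appropriate values to $\theta = \{p_1, \ldots, p_k, \mu_1, \ldots, \mu_k, \sigma_1, \ldots, \sigma_k\}$.

E-step. Assuming the current value of θ, compute $Z = \{p_{ij}\}$, i = 1, ..., m and j = 1, ..., k, as follows:

$$p_{ij} = P(j \mid x_i) = \frac{p_j\, f_j(x_i; \mu_j, \sigma_j)}{f(x_i)} = \frac{(\text{weight}) \cdot (j\text{th distribution})}{(\text{total distribution})} \tag{3}$$

This result can also be obtained by employing Bayes' rule:

$$P(j \mid x_i) = \frac{P(x_i \mid j)\, P(j)}{P(x_i)} = \frac{p_j\, f_j(x_i; \mu_j, \sigma_j)}{f(x_i)}$$

In the above, we used $P(x_i \mid j) = f_j(x_i; \mu_j, \sigma_j)$, $P(j) = p_j$, and $P(x_i) = f(x_i)$. We note that

$$\sum_{j=1}^{k} p_{ij} = \frac{1}{f(x_i)} \sum_{j=1}^{k} p_j\, f_j(x_i; \mu_j, \sigma_j) = \frac{f(x_i)}{f(x_i)} = 1,$$

i.e., the probabilities add up to 1 for each $x_i$, and $\sum_{i=1}^{m} \sum_{j=1}^{k} p_{ij} = m$. $f_j(x_i; \mu_j, \sigma_j)$ can be determined by equation (1) and $f(x_i)$ can be determined by equation (2).

M-step. From the above, we can estimate new parameters of θ as follows.

$$p_j = \frac{1}{m} \sum_{i=1}^{m} p_{ij} \tag{4}$$

This averages the probability for the jth component over the m data points.

$$\mu_j = \frac{1}{m\, p_j} \sum_{i=1}^{m} p_{ij}\, x_i \tag{5}$$

This averages $x_i$ over the m data points with weight factors $p_{ij}$. Here $p_j$ is the new value from equation (4), so that $m\, p_j = \sum_{i=1}^{m} p_{ij}$.

$$\sigma_j^2 = \frac{1}{m\, p_j} \sum_{i=1}^{m} p_{ij}\, (x_i - \mu_j)^2 \tag{6}$$

This is the standard deviation version of equation (5) for $\mu_j$.

After initialization, iterations are performed for the E, M, E, M, ... steps until Z and θ converge.

Example. A simple special case: k = 2 and $\sigma_1 = \sigma_2 = \sigma$ is known, so that $\theta = \{p_1, p_2, \mu_1, \mu_2\}$.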

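$$f(x) = p_1\, f_1(x; \mu_1, \sigma) + p_2\, f_2(x; \mu_2, \sigma) \tag{2'}$$

E-step. For j = 1, 2:

$$p_{ij} = \frac{p_j\, f_j(x_i; \mu_j, \sigma)}{f(x_i)} = \frac{p_j\, e^{-\frac{(x_i-\mu_j)^2}{2\sigma^2}}}{p_1\, e^{-\frac{(x_i-\mu_1)^2}{2\sigma^2}} + p_2\, e^{-\frac{(x_i-\mu_2)^2}{2\sigma^2}}} \tag{3'}$$

M-step. Compute new $p_j$, $\mu_j$ for j = 1, 2:

$$p_j = \frac{1}{m} \sum_{i=1}^{m} p_{ij} \tag{4'}$$

$$\mu_j = \frac{1}{m\, p_j} \sum_{i=1}^{m} p_{ij}\, x_i \tag{5'}$$

To perform iterations, initialize $\theta = \{p_1, p_2, \mu_1, \mu_2\}$. Then repeat the E and M steps, starting from the E-step.

In the above, each data instance $x_i$ is assumed to be a single scalar value. As an extension, each data instance can be a vector $\mathbf{x}_i$ when there are multiple independent variables. For example, when there are three independent variables, each data instance would be a vector $\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3})$. The above discussion can be extended by replacing scalar quantities such as $x_i$, $z_i$, and the parameters by vector quantities.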
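The E-step and M-step for a one-dimensional mixture of k normal distributions, with all of p, µ and σ estimated, can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a definitive implementation: the function name `em_mixture`, the fixed iteration count in place of a convergence test, the small floor on σ, and the synthetic data are all choices made for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std sigma at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))


def em_mixture(xs, p, mu, sigma, iters=100):
    """EM for a 1-D mixture of k normals; p, mu, sigma are initial guesses."""
    p, mu, sigma = list(p), list(mu), list(sigma)   # do not modify the inputs
    m, k = len(xs), len(p)
    resp = []
    for _ in range(iters):
        # E-step: p_ij = p_j f_j(x_i) / f(x_i)
        resp = []
        for x in xs:
            parts = [p[j] * normal_pdf(x, mu[j], sigma[j]) for j in range(k)]
            total = sum(parts)                       # f(x_i), the mixture density
            resp.append([part / total for part in parts])
        # M-step: re-estimate p_j, mu_j, sigma_j from the p_ij
        for j in range(k):
            nj = sum(r[j] for r in resp)             # equals m * (new p_j)
            p[j] = nj / m
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigma[j] = max(math.sqrt(var), 1e-6)     # floor is an added safeguard
    return p, mu, sigma, resp


# Synthetic data (hypothetical): 800 points near 3 and 200 points near 8.
random.seed(0)
xs = [random.gauss(3.0, 1.0) for _ in range(800)]
xs += [random.gauss(8.0, 1.5) for _ in range(200)]

p, mu, sigma, resp = em_mixture(xs, p=[0.5, 0.5], mu=[2.0, 9.0], sigma=[1.0, 1.0])
print(p, mu, sigma)
```

With well-separated components such as these, the estimates should land close to the values used to generate the data. A different initialization can lead to a different local maximum.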