CSE 455/555 Spring 2013
Homework 7: Parametric Techniques
Jason J. Corso, Computer Science and Engineering, SUNY at Buffalo
jcorso@buffalo.edu
Solutions by Yingbo Zhou

This assignment does not need to be submitted and will not be graded, but students are advised to work through the problems to ensure they understand the material. You are both allowed and encouraged to work in groups on this and other homework assignments in this class. These are challenging topics, and working together will both make them easier to decipher and help you ensure that you truly understand them.

1. Maximum Likelihood of a Binary/Multinomial Variable

Suppose we have a single binary variable x ∈ {0, 1}, where x = 1 denotes heads and x = 0 denotes tails of the outcome of flipping a coin. We make no assumption about the fairness of the coin; instead, we let the probability of x = 1 be given by a parameter µ, so that p(x = 1 | µ) = µ, where 0 ≤ µ ≤ 1.

(a) Write down the probability distribution of x.

Since p(x = 0 | µ) = 1 − µ, we have

    p(x | µ) = µ^x (1 − µ)^(1−x),

which is known as the Bernoulli distribution.

(b) Show that this is a proper probability distribution, i.e. that the probabilities sum to 1. What are the expectation and variance of this distribution?

    Σ_{x ∈ {0,1}} p(x | µ) = p(x = 0 | µ) + p(x = 1 | µ) = 1 − µ + µ = 1

    E(x) = Σ_{x ∈ {0,1}} x p(x | µ) = 0 · p(x = 0 | µ) + 1 · p(x = 1 | µ) = µ

    var(x) = E(x²) − E(x)² = Σ_{x ∈ {0,1}} x² p(x | µ) − µ² = 0² · p(x = 0 | µ) + 1² · p(x = 1 | µ) − µ² = µ − µ²

(c) Now suppose we have a set of observed values X = {x_1, x_2, ..., x_N} of x. Write the likelihood function and estimate the maximum likelihood parameter µ_ML.

    p(X | µ) = Π_{i=1}^N p(x_i | µ) = Π_{i=1}^N µ^{x_i} (1 − µ)^{1−x_i}

The log-likelihood can be written as

    l(X | µ) = ln p(X | µ) = Σ_{i=1}^N { x_i ln µ + (1 − x_i) ln(1 − µ) }.

Taking the derivative with respect to µ and setting it to zero gives

    Σ_{i=1}^N { x_i / µ − (1 − x_i) / (1 − µ) } = 0
    (1 − µ) Σ_i x_i = µ Σ_i (1 − x_i)
    Σ_i x_i − µ Σ_i x_i = µN − µ Σ_i x_i

    µ_ML = (1/N) Σ_{i=1}^N x_i

If we observe m heads, then µ_ML = m/N.
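As a quick numerical check of this result: the ML estimate is simply the sample mean of the binary observations, i.e. the fraction m/N of heads. A minimal sketch (the "true" parameter, seed, and sample size below are made-up illustration values, not part of the assignment):

```python
import numpy as np

# Minimal sketch: the Bernoulli ML estimate is the sample mean of the flips.
# The "true" parameter and sample size are made up for illustration.
rng = np.random.default_rng(0)
true_mu = 0.3
X = rng.binomial(1, true_mu, size=1000)   # N = 1000 simulated flips (1 = heads)

mu_ml = X.mean()                          # mu_ML = (1/N) sum_i x_i = m / N
print(f"m = {X.sum()}, N = {X.size}, mu_ML = {mu_ml:.3f}")  # close to 0.3
```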

(d) Now suppose we are rolling a K-sided die; in other words, we have data D = {x_1, x_2, ..., x_N} which can take on K values. Assume the generation of each value k ∈ {1, ..., K} is governed by a parameter θ_k ≥ 0, so that p(x = k | θ_k) = θ_k and Σ_k θ_k = 1. Write down the likelihood function.

First we introduce a convenient representation called the 1-of-K scheme, in which the variable is represented by a K-dimensional binary vector exactly one element of which takes the value one. This is exactly the case for this question, since each draw can take on only one particular value. Therefore, similar to the Bernoulli form, the distribution is

    p(x | θ) = Π_{k=1}^K θ_k^{x_k}

where Σ_{k=1}^K x_k = 1, θ_k ≥ 0 and Σ_{k=1}^K θ_k = 1. Now we can write the likelihood function as

    p(D | θ) = Π_{n=1}^N Π_{k=1}^K θ_k^{x_{nk}} = Π_{k=1}^K θ_k^{Σ_n x_{nk}} = Π_{k=1}^K θ_k^{N_k}

where N_k = Σ_n x_{nk} is the number of data points that take on value k, and the log-likelihood is

    l(D | θ) = Σ_{n=1}^N Σ_{k=1}^K x_{nk} ln θ_k.

(e) Write the maximum likelihood solution for θ_k^{ML}.

We have to maximize the likelihood subject to the constraint Σ_k θ_k = 1, so we introduce a Lagrange multiplier; the new objective function is

    F(θ, λ) = Σ_{n=1}^N Σ_{k=1}^K x_{nk} ln θ_k + λ ( Σ_{k=1}^K θ_k − 1 ).

Taking the derivative with respect to θ_j and setting it to zero gives

    ∂F/∂θ_j = Σ_n x_{nj} / θ_j + λ = N_j / θ_j + λ = 0,  so  θ_j = −N_j / λ.

Substituting this back into the constraint Σ_k θ_k = 1 gives Σ_k θ_k = −Σ_k N_k / λ = −N / λ = 1, hence λ = −N, and therefore

    θ_k^{ML} = N_k / N.
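The closed form θ_k^{ML} = N_k / N says the estimate is just the empirical frequency of each face, which is easy to verify numerically. A minimal sketch (the face probabilities and number of rolls are illustrative assumptions):

```python
import numpy as np

# Minimal sketch: the ML estimate for a K-sided die is theta_k = N_k / N,
# the empirical frequency of each face. Probabilities below are made up.
rng = np.random.default_rng(1)
true_theta = np.array([0.1, 0.2, 0.3, 0.4])               # K = 4 faces
D = rng.choice(len(true_theta), size=5000, p=true_theta)  # simulated rolls

N_k = np.bincount(D, minlength=len(true_theta))           # counts per face
theta_ml = N_k / N_k.sum()                                # theta_k^ML = N_k / N
print(theta_ml)                                           # roughly [0.1 0.2 0.3 0.4]
```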

2. Naive Bayes

In naive Bayes, we assume that the presence of a particular feature of a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be a watermelon if it is green, round, and more than 10 pounds; in naive Bayes we treat these three features as independent of one another. Let our features x_i, i ∈ [1, d], be binary valued, i.e. x_i ∈ {0, 1}, and let the input feature vector be x = [x_1 x_2 ... x_d]^T. For each training sample, the target value y ∈ {0, 1} is also a binary-valued variable. Then our model is parameterized by φ_{i|y=0} = p(x_i = 1 | y = 0), φ_{i|y=1} = p(x_i = 1 | y = 1), and φ_y = p(y = 1), with

    p(y) = (φ_y)^y (1 − φ_y)^{1−y}

    p(x | y = 0) = Π_{i=1}^d p(x_i | y = 0) = Π_{i=1}^d (φ_{i|y=0})^{x_i} (1 − φ_{i|y=0})^{1−x_i}

    p(x | y = 1) = Π_{i=1}^d p(x_i | y = 1) = Π_{i=1}^d (φ_{i|y=1})^{x_i} (1 − φ_{i|y=1})^{1−x_i}

(a) Write down the joint log-likelihood function l(θ) = log Π_{n=1}^N p(x^{(n)}, y^{(n)}; θ) in terms of the model parameters given above. Here x^{(n)} denotes the nth data point, and θ represents all the parameters, i.e. {φ_y, φ_{i|y=0}, φ_{i|y=1}, i = 1, ..., d}.

    l(θ) = log Π_{n=1}^N p(x^{(n)}, y^{(n)}; θ)
         = log Π_{n=1}^N p(x^{(n)} | y^{(n)}; θ) p(y^{(n)}; θ)
         = Σ_{n=1}^N ( log p(y^{(n)}; θ) + Σ_{i=1}^d log p(x_i^{(n)} | y^{(n)}; θ) )
         = Σ_{n=1}^N { y^{(n)} log φ_y + (1 − y^{(n)}) log(1 − φ_y) + Σ_{i=1}^d [ x_i^{(n)} log φ_{i|y^{(n)}} + (1 − x_i^{(n)}) log(1 − φ_{i|y^{(n)}}) ] }

(b) Estimate the parameters using maximum likelihood, i.e. find solutions for the parameters φ_y, φ_{i|y=0} and φ_{i|y=1}. Taking derivatives with respect to these three parameters, we get:

    φ_y = (1/N) Σ_{n=1}^N y^{(n)}

    φ_{i|y=0} = Σ_{n=1}^N (1 − y^{(n)}) x_i^{(n)} / Σ_{n=1}^N (1 − y^{(n)})

    φ_{i|y=1} = Σ_{n=1}^N y^{(n)} x_i^{(n)} / Σ_{n=1}^N y^{(n)}
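In code, these three estimates amount to the empirical class frequency and the per-class feature frequencies. A minimal sketch on synthetic binary data (the data-generating probabilities, shapes, and variable names are illustrative assumptions, not from the assignment):

```python
import numpy as np

# Minimal sketch: the naive Bayes ML estimates are empirical frequencies.
# X is an (N, d) binary feature matrix, y an (N,) binary label vector;
# both are synthetic and made up for illustration.
rng = np.random.default_rng(2)
N, d = 500, 3
y = rng.binomial(1, 0.4, size=N)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.8, 0.2), size=(N, d))

phi_y = y.mean()                 # phi_y       = p(y = 1)
phi_y0 = X[y == 0].mean(axis=0)  # phi_{i|y=0} = p(x_i = 1 | y = 0), shape (d,)
phi_y1 = X[y == 1].mean(axis=0)  # phi_{i|y=1} = p(x_i = 1 | y = 1), shape (d,)
print(phi_y, phi_y0, phi_y1)
```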

(c) When a new sample point x arrives, we make a prediction based on the most likely class under our model. Show that the hypothesis returned by naive Bayes is linear, i.e. if p(y = 0 | x) and p(y = 1 | x) are the class probabilities returned by our model, show that there exists some α such that p(y = 1 | x) ≥ p(y = 0 | x) if and only if α^T x̂ ≥ 0, where α = [α_0 α_1 ... α_d]^T and x̂ = [1 x_1 x_2 ... x_d]^T.

    p(y = 1 | x) ≥ p(y = 0 | x)
    ⟺ p(x | y = 1) p(y = 1) ≥ p(x | y = 0) p(y = 0)
    ⟺ φ_y Π_{i=1}^d (φ_{i|y=1})^{x_i} (1 − φ_{i|y=1})^{1−x_i} ≥ (1 − φ_y) Π_{i=1}^d (φ_{i|y=0})^{x_i} (1 − φ_{i|y=0})^{1−x_i}
    ⟺ log φ_y + Σ_{i=1}^d x_i log φ_{i|y=1} + Σ_{i=1}^d (1 − x_i) log(1 − φ_{i|y=1}) ≥ log(1 − φ_y) + Σ_{i=1}^d x_i log φ_{i|y=0} + Σ_{i=1}^d (1 − x_i) log(1 − φ_{i|y=0})
    ⟺ Σ_{i=1}^d x_i log [ φ_{i|y=1} (1 − φ_{i|y=0}) / ( φ_{i|y=0} (1 − φ_{i|y=1}) ) ] + log [ φ_y / (1 − φ_y) ] + Σ_{i=1}^d log [ (1 − φ_{i|y=1}) / (1 − φ_{i|y=0}) ] ≥ 0
    ⟺ α^T x̂ ≥ 0

where

    α_0 = log [ φ_y / (1 − φ_y) ] + Σ_{i=1}^d log [ (1 − φ_{i|y=1}) / (1 − φ_{i|y=0}) ]

    α_i = log [ φ_{i|y=1} (1 − φ_{i|y=0}) / ( φ_{i|y=0} (1 − φ_{i|y=1}) ) ],  i = 1, ..., d.
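The algebra above can be sanity-checked numerically: the linear score α_0 + Σ_i α_i x_i should equal the difference of the two class log-joint probabilities for every binary input. A minimal sketch (the parameter values below are arbitrary made-up numbers):

```python
import numpy as np

# Minimal sketch: the naive Bayes decision p(y=1|x) >= p(y=0|x) reduces to the
# linear rule alpha^T [1, x] >= 0. Parameter values below are made up.
phi_y = 0.4
phi_y0 = np.array([0.2, 0.5, 0.1])  # p(x_i = 1 | y = 0)
phi_y1 = np.array([0.7, 0.6, 0.3])  # p(x_i = 1 | y = 1)

alpha = np.log(phi_y1 * (1 - phi_y0)) - np.log(phi_y0 * (1 - phi_y1))
alpha_0 = np.log(phi_y / (1 - phi_y)) + np.log((1 - phi_y1) / (1 - phi_y0)).sum()

def log_joint_diff(x):
    # log p(x, y=1) - log p(x, y=0), computed directly from the model
    lp1 = np.log(phi_y) + np.sum(x * np.log(phi_y1) + (1 - x) * np.log(1 - phi_y1))
    lp0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_y0) + (1 - x) * np.log(1 - phi_y0))
    return lp1 - lp0

for bits in np.ndindex(2, 2, 2):            # every binary feature vector
    x = np.array(bits)
    assert np.isclose(alpha_0 + alpha @ x, log_joint_diff(x))
print("linear score matches the log-joint difference for all inputs")
```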

3. Gaussian Distribution

Please familiarize yourself with the maximum likelihood estimation for the Gaussian distribution in the class notes and answer the following questions.

(a) What is the joint probability distribution p(X; µ, σ²) of the samples?

    p(X; µ, σ²) = Π_{i=1}^N N(x_i | µ, σ²)

(b) What is the maximum likelihood (ML) estimation of the parameters, if both µ and σ² are unknown?

Please refer to the lecture notes.

(c) Show that the ML estimation of the variance is biased, i.e. show that

    E(σ²_ML) = ((N − 1)/N) σ².

Hint: you can use the fact that the expectation of a random variable from a Gaussian distribution is µ, i.e. E(x_i) = µ, and var(x_i) = σ² = E(x_i²) − E(x_i)².

    E(σ²_ML) = E( (1/N) Σ_i (x_i − µ_ML)² )
    = E( (1/N) Σ_i { x_i² + µ_ML² − 2 x_i µ_ML } )
    = E( (1/N) Σ_i { x_i² + ( (1/N) Σ_j x_j )² − 2 x_i (1/N) Σ_j x_j } )
    = E( (1/N) Σ_i { x_i² + (1/N²) Σ_j x_j² + (1/N²) Σ_{j≠k} x_j x_k − (2/N) x_i² − (2/N) Σ_{j≠i} x_i x_j } )
    = (1/N) Σ_i { E(x_i²) + (1/N²) Σ_j E(x_j²) + (1/N²) Σ_{j≠k} E(x_j x_k) − (2/N) E(x_i²) − (2/N) Σ_{j≠i} E(x_i x_j) }
    = (1/N) Σ_i { (σ² + µ²) + (σ² + µ²)/N + ((N − 1)/N) µ² − 2(σ² + µ²)/N − 2((N − 1)/N) µ² }
    = (1/N) Σ_i { σ² + µ² + σ²/N + µ²/N + µ² − µ²/N − 2σ²/N − 2µ²/N − 2µ² + 2µ²/N }
    = (1/N) Σ_i ( σ² − σ²/N )
    = σ² − (1/N) σ²
    = ((N − 1)/N) σ²

(d) Please write down the objective function of the maximum a posteriori (MAP) estimation of the parameters, if we assume that only µ is unknown and that it follows a Gaussian distribution with mean µ_0 and variance σ_0².

We know that p(µ) = N(µ | µ_0, σ_0²), so the posterior probability of the parameter µ is

    p(µ | X) ∝ p(X | µ) p(µ) = { Π_{i=1}^N N(x_i | µ, σ²) } N(µ | µ_0, σ_0²).

(e) Please write down the Bayesian formulation of this problem, if all other assumptions stay the same as in question (d).

    p(x | X) = ∫ p(x | θ) p(θ | X) dθ

where x is a new data point for which we want to make a prediction and θ is the set of parameters of the model; in this case θ = {µ}, since we assume µ is the only unknown parameter.
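The bias result in (c) can also be checked empirically by averaging σ²_ML over many independent samples of size N. A minimal Monte Carlo sketch (the sample size, true variance, and number of trials are arbitrary illustration choices):

```python
import numpy as np

# Minimal sketch: Monte Carlo check that E(sigma^2_ML) ~ ((N - 1)/N) * sigma^2.
# N, sigma^2, and the number of trials are arbitrary illustration values.
rng = np.random.default_rng(4)
N, sigma2, trials = 10, 4.0, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1, keepdims=True)        # per-trial mu_ML
var_ml = np.mean((samples - mu_ml) ** 2, axis=1)   # per-trial sigma^2_ML (uses 1/N)

print(var_ml.mean())          # empirical E(sigma^2_ML), close to 3.6
print((N - 1) / N * sigma2)   # theoretical ((N - 1)/N) * sigma^2 = 3.6
```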