Advanced Introduction to Machine Learning Homework 1 Solutions October 6, 2014



1 Regression (Samy)

1.1 Multi-Task Regression

1. The cost function can be decomposed as

$$J_0(\Theta) = \sum_{j} \left\| Y_{j,:} - \Theta_{j,:} X \right\|^2,$$

where $Y_{j,:}$, $\Theta_{j,:}$ refer to the $j$-th rows of $Y$, $\Theta$ respectively. Since this essentially decouples the parameters involved with each task, we can solve them separately.

2. (a) Independent: Yes. Convex: Yes. This is the usual $L_2$ regularization to control the variance.

(b) Independent: No. Convex: Yes. Here we are trying to model all our outputs as a function of a sparse subset of the covariates.

(c) Independent: No. Convex: No. Here, by encouraging $\Theta$ to be low rank we are trying to create linear dependence across multiple tasks. For example, say we are trying to predict precipitation in different regions based on different weather features. We want different models for each region since a universal model may not be suitable. However, all these tasks are likely to be related and so we want to encourage dependence. In doing so, we reduce the sample complexity of learning all tasks, since data from one region will be useful in estimating the parameters of another region. Some of you also pointed out that a rank penalty is intractable. This is true. A commonly used convex relaxation is to use a nuclear norm penalty.

1.2 Shrinkage in Ridge Regression

1. The solution to the ridge regression problem is $\beta = (X^\top X + \lambda I)^{-1} X^\top y$. Using the SVD $X = U \Sigma V^\top$, we have $\beta = V (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top y$. Hence, with $x = V z$,

$$x^\top \beta = z^\top V^\top \beta = z^\top (\Sigma^2 + \lambda I)^{-1} \Sigma U^\top y = \sum_{i=1}^{d} z_i \frac{\sigma_i}{\sigma_i^2 + \lambda} u_i^\top y.$$

2. Since $X$ has zero mean, the directions $v_1, \dots, v_n$ are the eigenvectors of the empirical covariance matrix. The expression $z_i \frac{\sigma_i}{\sigma_i^2 + \lambda}$ indicates that the directions along which the empirical covariance is lowest are shrunk the most. In the directions where the data is more spread out (empirical covariance high) we can estimate the gradients of our linear function well, since the estimate is less susceptible to noise. In the directions where there is less spread, there is high variance in the estimate of the gradient. Ridge regression helps us control the variance by imposing different penalties along different principal axes. Some of you also made the equivalent argument that if $X$ were poorly conditioned then it would blow up the variance in the directions in which $\sigma_i$ is small. The penalty prevents this from happening.

1.3 Local/Weighted Linear Regression

1. Using the given notation, we can express $\hat\beta$ as follows:

$$\hat\beta = \operatorname*{argmin}_{\beta} \left\| W^{1/2} (y - X\beta) \right\|^2.$$

By setting its gradient to zero we get $\hat\beta = (X^\top W X)^{-1} X^\top W y$. Substituting into $\hat f(x) = x^\top \hat\beta$ yields the required answer.

2. By setting $\beta = \theta$, $x = 1$, $X = \mathbf{1} \in \mathbb{R}^n$ and $\hat f(x) = \beta^\top x = \theta$ we get the same problem as above. Then $X^\top W X = \sum_i w_i$ and $X^\top W y = \sum_i w_i y_i$, which yields

$$\hat f(x) = \frac{\sum_i w_i(x) y_i}{\sum_i w_i(x)} = \sum_i \alpha_i y_i, \qquad \alpha_i = \frac{w_i}{\sum_j w_j}.$$

For $w_i(x) = k((x - x_i)/h)$ we get precisely the Nadaraya-Watson estimator. Since the prediction at any point is a convex combination of the observed labels, it always lies between the minimum and the maximum label.

1.4 Least Norm Solution

1. The solution may be obtained by solving the problem

$$\text{minimize } \|\beta\|^2 \quad \text{subject to } X\beta = y.$$

The Lagrangian for the problem is $L(\nu, \beta) = \beta^\top \beta + \nu^\top (X\beta - y)$. By setting $\nabla_\beta L = 0$ and then substituting back we get $\beta_{ln} = X^\top (X X^\top)^{-1} y$.

2. Let $J(\beta) = \|y - X\beta\|^2$ and $\beta^{\dagger} = X^{\dagger} y$. Then $\nabla_\beta J = 2 X^\top X \beta - 2 X^\top y$. When $\beta = \beta^{\dagger}$, using $X^\top X X^{\dagger} = X^\top$,

$$\nabla_\beta J = 2 X^\top X X^{\dagger} y - 2 X^\top y = 2 X^\top y - 2 X^\top y = 0.$$

Since $J$ is convex in $\beta$ and $\beta^{\dagger}$ satisfies the stationarity condition, $\beta^{\dagger}$ is a least squares solution. Let $\beta' = \beta^{\dagger} + \delta$ be any other least squares solution, so that $\|y - X\beta'\| = \|y - X\beta^{\dagger}\|$. Then

$$\|y - X\beta'\|^2 = \|y - X\beta^{\dagger} - X\delta\|^2 = \|y - X\beta^{\dagger}\|^2 + \|X\delta\|^2.$$

The last step follows by observing that $(y - X\beta^{\dagger})^\top X\delta = y^\top X\delta - y^\top (X^{\dagger})^\top X^\top X \delta = 0$. Hence $X\delta = 0$, that is, $\delta \in N(X)$, and $\beta^{\dagger} \perp \delta$ because $\beta^{\dagger}$ lies in the row space of $X$. Therefore $\|\beta'\|^2 = \|\beta^{\dagger}\|^2 + \|\delta\|^2 \ge \|\beta^{\dagger}\|^2$. Some of you presented alternative arguments, mostly based on the SVD characterization of the MP inverse.
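The least norm formula of Section 1.4 can be illustrated with a minimal sketch for the special case of a single equation ($n = 1$), where $X X^\top$ is a scalar and no matrix inversion is needed. The function names and the numbers below are illustrative assumptions, not part of the original solutions.

```python
def least_norm_single_equation(x_row, y):
    """Return X^T (X X^T)^{-1} y when X consists of the single row x_row."""
    gram = sum(v * v for v in x_row)  # X X^T is a scalar when n = 1
    if gram == 0.0:
        raise ValueError("x_row must be nonzero so that X X^T is invertible")
    return [v * y / gram for v in x_row]


def squared_norm(beta):
    """Squared Euclidean norm of a vector given as a list."""
    return sum(b * b for b in beta)


# Illustrative data: the single constraint beta_1 + 2 * beta_2 = 5.
x_row, y = [1.0, 2.0], 5.0
beta_ln = least_norm_single_equation(x_row, y)

# delta = (2, -1) lies in the null space of X, so beta_ln + delta
# satisfies the constraint as well, but with a larger norm.
delta = [2.0, -1.0]
beta_other = [b + d for b, d in zip(beta_ln, delta)]

print(beta_ln, squared_norm(beta_ln))
print(beta_other, squared_norm(beta_other))
```

In this toy case the least norm solution is $(1, 2)$ with squared norm 5, while the alternative $(3, 1)$ has squared norm $10 = 5 + \|\delta\|^2$, matching the Pythagorean decomposition used in the proof.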

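2 Pólya Discriminant Analysis (Samy)

2.1 Model

1. Conditioned on $y = k$ and $m$, the distribution of $x$ corresponds to a Dirichlet-Multinomial with parameters $m, \alpha_k$. Omitting the multinomial coefficient $m! / \prod_s x_s!$, which does not depend on $\alpha$, its mass function and the logarithm are

$$p_{dm}(x; \alpha) = \frac{\Gamma(A)}{\Gamma(m + A)} \prod_{s=1}^{V} \frac{\Gamma(x_s + \alpha_s)}{\Gamma(\alpha_s)},$$

$$\log p_{dm}(x; \alpha) = \log \Gamma(A) - \log \Gamma(m + A) + \sum_{s=1}^{V} \left[ \log \Gamma(x_s + \alpha_s) - \log \Gamma(\alpha_s) \right],$$

where $A = \sum_{s=1}^{V} \alpha_s$ and $m = \sum_{s=1}^{V} x_s$. The likelihood and log likelihood of the data $D = \{(x_i, y_i)\}_{i=1}^{n}$ are then given by

$$p(D; \theta, \alpha_1, \dots, \alpha_K) = \prod_{i=1}^{n} \prod_{k=1}^{K} \left( \theta_k \, p_{dm}(x_i; \alpha_k) \right)^{\mathbb{1}(y_i = k)},$$

$$\ell(\theta, \alpha_1, \dots, \alpha_K; D) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(y_i = k) \left( \log \theta_k + \log p_{dm}(x_i; \alpha_k) \right).$$

2. We need to maximize the above log likelihood w.r.t. $\theta$ subject to the constraint $\sum_k \theta_k = 1$. The corresponding Lagrangian is

$$L = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(y_i = k) \log \theta_k + \nu \left( \sum_{k=1}^{K} \theta_k - 1 \right).$$

Solving this for $\theta$ yields the MLE

$$\hat\theta_k := \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(y_i = k) = \frac{\#\text{ training instances in class } k}{n}.$$

3. The solution to this part is based on ideas from [Min00]. Write $\alpha_{ks}$ for the $s$-th component of $\alpha_k$, $A_k = \sum_s \alpha_{ks}$, and $m_i = \sum_s x_{is}$. The first and second partial derivatives of the log likelihood are

$$\frac{\partial \ell}{\partial \alpha_{ks}} = \sum_{i=1}^{n} \mathbb{1}(y_i = k) \left[ \Psi(A_k) - \Psi(m_i + A_k) + \Psi(x_{is} + \alpha_{ks}) - \Psi(\alpha_{ks}) \right],$$

$$\frac{\partial^2 \ell}{\partial \alpha_{ks}^2} = \sum_{i=1}^{n} \mathbb{1}(y_i = k) \left[ \Psi'(A_k) - \Psi'(m_i + A_k) + \Psi'(x_{is} + \alpha_{ks}) - \Psi'(\alpha_{ks}) \right],$$

$$\frac{\partial^2 \ell}{\partial \alpha_{ks} \, \partial \alpha_{kt}} = \sum_{i=1}^{n} \mathbb{1}(y_i = k) \left[ \Psi'(A_k) - \Psi'(m_i + A_k) \right], \qquad s \ne t,$$

where $\Psi, \Psi'$ are the digamma and trigamma functions respectively. The gradient $g \in \mathbb{R}^V$ for optimizing $\alpha_k$ is given by $g_s = \partial \ell / \partial \alpha_{ks}$. The Hessian $H \in \mathbb{R}^{V \times V}$ can be written as $H = D + z \mathbf{1}\mathbf{1}^\top$, where the diagonal matrix $D$ and $z \in \mathbb{R}$ are given by

$$D_{ss} = \sum_{i \in [k]} \Psi'(x_{is} + \alpha_{ks}) - n_k \Psi'(\alpha_{ks}), \qquad z = n_k \Psi'(A_k) - \sum_{i \in [k]} \Psi'(m_i + A_k).$$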

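Here $[k]$ refers to the set of training instances in class $k$ and $n_k = |[k]|$. By the Sherman-Morrison formula, $H^{-1}$ can be computed as

$$H^{-1} = D^{-1} - \frac{D^{-1} \mathbf{1}\mathbf{1}^\top D^{-1}}{1/z + \mathbf{1}^\top D^{-1} \mathbf{1}}.$$

The Newton's method update is then given by $\alpha^{new} = \alpha^{old} - H^{-1} g$. To analyse the complexity, note that we first need to compute and store $g$, $D$ and $z$. This requires only $\tilde O(V)$ time and space complexity. Since we can write

$$[H^{-1} g]_i = \frac{g_i}{D_{i,i}} - \frac{1}{D_{i,i}} \cdot \frac{\sum_j g_j / D_{j,j}}{1/z + \sum_j 1/D_{j,j}},$$

the inversion can be done in $\tilde O(V)$ time.

4. We choose the class that maximizes the posterior $p(y \mid x) = p(x \mid y) p(y) / p(x)$:

$$\hat y = \operatorname*{argmax}_{k \in \{1, \dots, K\}} p(y = k \mid x) = \operatorname*{argmax}_{k \in \{1, \dots, K\}} p(x \mid y = k) \, p(y = k) = \operatorname*{argmax}_{k \in \{1, \dots, K\}} p_{dm}(x; \alpha_k) \, \theta_k.$$

5. In the given Bayesian formulation, we can write the joint and log-joint probability as

$$p(D, \theta, \alpha_1, \dots, \alpha_K) = p(\theta) \prod_{k=1}^{K} p(\alpha_k) \; p(D \mid \theta, \alpha_1, \dots, \alpha_K)$$

$$= \frac{\Gamma\left(\sum_k \theta^0_k\right)}{\prod_k \Gamma(\theta^0_k)} \prod_{k=1}^{K} \theta_k^{\theta^0_k - 1} \cdot \prod_{k=1}^{K} \frac{1}{(2\pi)^{V/2} (1/(2\lambda))^{V/2}} \exp\left(-\lambda \|\alpha_k\|^2\right) \cdot p(D \mid \theta, \alpha_1, \dots, \alpha_K),$$

$$\ell(D, \theta, \alpha_1, \dots, \alpha_K) = \sum_{k=1}^{K} (\theta^0_k - 1) \log \theta_k - \lambda \sum_{k=1}^{K} \|\alpha_k\|^2 + \ell(\theta, \alpha_1, \dots, \alpha_K; D) + C(\theta^0, \lambda),$$

where $C(\theta^0, \lambda)$ is a constant term. As before, by writing out the Lagrangian and optimizing for $\theta$ we get

$$\hat\theta_k := \frac{n_k + \theta^0_k - 1}{n + \sum_j \theta^0_j - K}.$$

As for $\alpha_k$, write $\tilde\ell$ for the log-joint and $\ell$ for the log likelihood. The partial derivatives are

$$\frac{\partial \tilde\ell}{\partial \alpha_{ks}} = \frac{\partial \ell}{\partial \alpha_{ks}} - 2\lambda \alpha_{ks}, \qquad \frac{\partial^2 \tilde\ell}{\partial \alpha_{ks}^2} = \frac{\partial^2 \ell}{\partial \alpha_{ks}^2} - 2\lambda, \qquad \frac{\partial^2 \tilde\ell}{\partial \alpha_{ks} \, \partial \alpha_{kt}} = \frac{\partial^2 \ell}{\partial \alpha_{ks} \, \partial \alpha_{kt}} \quad (s \ne t).$$

We can perform the Newton step efficiently using the same trick by setting $z$ to be the same as before and

$$g_s = n_k \Psi(A_k) - \sum_{i \in [k]} \Psi(m_i + A_k) + \sum_{i \in [k]} \Psi(x_{is} + \alpha_{ks}) - n_k \Psi(\alpha_{ks}) - 2\lambda \alpha_{ks},$$

$$D_{ss} = \sum_{i \in [k]} \Psi'(x_{is} + \alpha_{ks}) - n_k \Psi'(\alpha_{ks}) - 2\lambda.$$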
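The $\tilde O(V)$ solve based on the Sherman-Morrison formula can be sanity-checked on a tiny example. The following is a minimal sketch under stated assumptions: the function names and the values of `g`, `d` and `z` are made up for illustration and are not derived from any dataset. It computes $H^{-1} g$ without forming $H$ and then verifies the result by multiplying back with $H = D + z \mathbf{1}\mathbf{1}^\top$.

```python
def newton_direction(g, d, z):
    """Return H^{-1} g for H = diag(d) + z * 1 1^T using O(V) operations."""
    inv_d = [1.0 / di for di in d]
    g_over_d = [gi * wi for gi, wi in zip(g, inv_d)]
    # Scalar shared by all coordinates in the Sherman-Morrison correction.
    correction = sum(g_over_d) / (1.0 / z + sum(inv_d))
    return [god - wi * correction for god, wi in zip(g_over_d, inv_d)]


def apply_hessian(v, d, z):
    """Return H v for H = diag(d) + z * 1 1^T; used only to check the solve."""
    total = sum(v)
    return [di * vi + z * total for di, vi in zip(d, v)]


g = [1.0, 2.0]    # illustrative gradient
d = [-2.0, -4.0]  # illustrative diagonal entries of D
z = 1.0           # illustrative rank-one coefficient

step = newton_direction(g, d, z)
residual = [hv - gi for hv, gi in zip(apply_hessian(step, d, z), g)]
print(step, residual)
```

The residual $H (H^{-1} g) - g$ should vanish up to rounding; the formula breaks down only when $1/z + \sum_j 1/D_{j,j} = 0$, that is, when $H$ is singular.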

2.2 Experiment

This is our Matlab implementation.

    function [theta, alpha] = trainPDA(X, y, theta0, lambda)
      % Prelims
      K = numel(unique(y));
      V = size(X, 2);
      numData = size(X, 1);
      % MAP for theta
      table = tabulate(y);
      adjustedFreqs = table(:,2) + theta0 - 1;
      theta = adjustedFreqs/sum(adjustedFreqs);
      % MAP for alpha
      alpha = zeros(K, V);
      for k = 1:K
        Xk = X(y == k, :);
        alpha(k, :) = newtonRaphsonPDA(Xk, lambda);
      end

    function [alpha_k] = newtonRaphsonPDA(X, lambda)
      % Prelims
      numNRIters = 10; % Use a fixed number of NR iterations
      n = size(X, 1); % number of training data in this class
      m = sum(X, 2); % number of words in each document
      initPt = sum(X, 1); initPt = initPt/sum(initPt); % Initialization
      % Now perform Newton's method
      alpha_k = initPt; % alpha in the current iteration
      for nrIter = 1:numNRIters
        % Compute the following
        A = sum(alpha_k);
        XplusAlpha = bsxfun(@plus, X, alpha_k);
        % The gradient
        g = n * psi(A) - sum(psi(m + A)) + sum(psi(XplusAlpha), 1) ...
            - n * psi(alpha_k) - 2 * lambda * alpha_k;
        % The value z (see solutions)
        z = n * psi(1, A) - sum(psi(1, m + A));
        % The diagonal of the Hessian
        D = sum(psi(1, XplusAlpha), 1) - n * psi(1, alpha_k) - 2*lambda;
        % Newton's step update
        Hinvg = g./D - (1./D) * sum(g./D) / (1/z + sum(1./D));
        alpha_k = alpha_k - 1*Hinvg;
      end

    function logL = classLogLikelihoods(X, alpha)
      % Prelims
      A = sum(alpha);
      m = sum(X, 2); % number of words in each document
      XplusAlpha = bsxfun(@plus, X, alpha);
      % Compute the log likelihood
      logL = gammaln(A) - gammaln(m + A) + ...
             sum(gammaln(XplusAlpha), 2) - sum(gammaln(alpha));

    function [preds, classLogJoints] = predictPDA(X, theta, alpha)
      % Prelims
      n = size(X, 1);
      K = numel(theta);
      % First obtain the class log joint probabilities
      classLogLs = zeros(n, K);
      for k = 1:K
        classLogLs(:, k) = classLogLikelihoods(X, alpha(k, :));
      end
      classLogJoints = bsxfun(@plus, classLogLs, log(theta)');
      % Finally obtain the predictions
      [~, preds] = max(classLogJoints, [], 2);

3 Duality

3.1 Weak Duality

1. $L(x, \lambda, u) = f(x) + \lambda h_1(x) + u h_2(x)$.

2. $g(\lambda, u) = \inf_{x \in \mathbb{R}^d} L(x, \lambda, u)$.

3. Let $P$ denote the feasible region of the primal. If $x \in P$, that is, $h_1(x) \le 0$ and $h_2(x) = 0$, then for any $\lambda \ge 0$, $u \in \mathbb{R}$, we have

$$f(x) \ge f(x) + \lambda h_1(x) + u h_2(x) = L(x, \lambda, u).$$

Taking infima over $x \in P$,

$$\inf_{x \in P} f(x) \ge \inf_{x \in P} L(x, \lambda, u) \ge \inf_{x \in \mathbb{R}^d} L(x, \lambda, u) = g(\lambda, u).$$

The last inequality holds because $P \subseteq \mathbb{R}^d$. The required result follows from the observation that the above inequality holds for all $\lambda \ge 0$ and all $u$.
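Weak duality can be seen concretely on a one-dimensional toy problem. The sketch below assumes the illustrative primal "minimize $x^2$ subject to $h_1(x) = 1 - x \le 0$" with no equality constraint; this problem is not from the homework and is chosen only because its dual function has the closed form $g(\lambda) = \lambda - \lambda^2/4$ and its primal optimum is 1, attained at $x = 1$.

```python
def lagrangian(x, lam):
    """L(x, lam) = f(x) + lam * h_1(x) with f(x) = x^2 and h_1(x) = 1 - x."""
    return x * x + lam * (1.0 - x)


def dual_value(lam):
    """g(lam) = inf_x L(x, lam); the quadratic in x is minimized at x = lam / 2."""
    return lagrangian(lam / 2.0, lam)


primal_optimum = 1.0  # f(1) = 1 is the smallest feasible objective value

multipliers = [0.0, 0.5, 1.0, 2.0, 3.0, 10.0]
duality_gaps = [primal_optimum - dual_value(lam) for lam in multipliers]

for lam, gap in zip(multipliers, duality_gaps):
    print(lam, dual_value(lam), gap)
```

Every gap is nonnegative, as weak duality requires, and in this particular example the gap closes at $\lambda = 2$.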

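3.2 Optimal Coding

1. Let $P$ denote the feasible region. It is sufficient to show that $x_\alpha = \alpha x + (1 - \alpha) y \in P$ given $x, y \in P$, for $\alpha \in [0, 1]$. Using the weighted AM-GM inequality and the feasibility of $x, y$, we can write

$$\sum_i 2^{-(\alpha x_i + (1 - \alpha) y_i)} \le \sum_i \left( \alpha 2^{-x_i} + (1 - \alpha) 2^{-y_i} \right) = \alpha \sum_i 2^{-x_i} + (1 - \alpha) \sum_i 2^{-y_i} \le \alpha + (1 - \alpha) = 1.$$

So $x_\alpha$ satisfies the first inequality constraint. Further, it is clear that $x, y \ge 0 \Rightarrow x_\alpha \ge 0$. So $x_\alpha \in P$, which proves the convexity of $P$.

2. Suppose for the purpose of contradiction that an optimal solution $x^*$ satisfies the strict inequality, that is, $\sum_{i=1}^{n} 2^{-x_i^*} < 1$. There exists $j$ with $x_j^* > 0$, because otherwise $\sum_{i=1}^{n} 2^{-x_i^*} > 1$. So one of the $x_j^*$'s can be reduced so that the objective is reduced while still maintaining feasibility. This means $x^*$ is not an optimal solution, which is a contradiction.

3. For $\lambda \in \mathbb{R}$, $u \ge 0$, the Lagrange function is

$$L(x, \lambda, u) = p^\top x + \lambda \left( \sum_i 2^{-x_i} - 1 \right) - u^\top x. \qquad (1)$$

Let $x^*, \lambda^*, u^*$ satisfy the KKT conditions. From complementary slackness, $\forall i \in [n]$, we have $u_i^* x_i^* = 0$. As shown in the previous part, $x_i > 0$ for any feasible point, which means $u_i^* = 0$. $L(x, \lambda, u)$ is convex in $x$, as the first and third terms in (1) are linear and the Hessian of the second term is positive definite. So the stationarity condition $0 \in \partial L(x^*)$ becomes $0 = \nabla L(x^*)$ when $L$ is treated as a function of $x$ alone. $\forall i \in [n]$, as $u_i^* = 0$,

$$0 = \frac{\partial L}{\partial x_i} = p_i - \lambda^* 2^{-x_i^*} \log 2 - u_i^* = p_i - \lambda^* 2^{-x_i^*} \log 2, \quad \text{that is,} \quad p_i = \lambda^* \log 2 \cdot 2^{-x_i^*}. \qquad (2)$$

Summing over $i$, and noting that $\sum_i 2^{-x_i^*} = 1$ and $\sum_i p_i = 1$, we get

$$\sum_i p_i = \lambda^* \log 2 \sum_i 2^{-x_i^*} \;\Rightarrow\; 1 = \lambda^* \log 2.$$

So $\lambda^* = 1/\log 2$ and from (2), we have $p_i = 2^{-x_i^*}$ and hence $x_i^* = -\log_2 p_i$. It is easy to verify that $x^* = -\log_2 p$, $\lambda^* = 1/\log 2$, $u^* = 0$ satisfy the KKT conditions and hence $x^*$ is the optimum.

4 SVM and Perceptron (Veeru)

4.1 Start with the primal and write the KKT conditions. For notation, I will use equations from Chris Burges' tutorial on SVM http:// ramanath/svm.pdf. The numbers above the arrows refer to equations in that tutorial.

$$\alpha_i = 0 \;\stackrel{(50)}{\Longrightarrow}\; \mu_i = C \;\stackrel{(56)}{\Longrightarrow}\; \xi_i = 0 \;\stackrel{(51)}{\Longrightarrow}\; y_i f(x_i) \ge 1.$$

$$0 < \alpha_i < C \;\stackrel{(50),(55)}{\Longrightarrow}\; \mu_i > 0,\; y_i f(x_i) - 1 + \xi_i = 0 \;\stackrel{(56)}{\Longrightarrow}\; \xi_i = 0,\; y_i f(x_i) - 1 + \xi_i = 0 \;\Longrightarrow\; y_i f(x_i) = 1.$$

$$\alpha_i = C \;\stackrel{(55)}{\Longrightarrow}\; y_i f(x_i) - 1 + \xi_i = 0 \;\stackrel{(52)}{\Longrightarrow}\; y_i f(x_i) \le 1.$$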
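The three cases of Section 4.1 amount to a lookup from the dual variable $\alpha_i$ to the condition that the KKT conditions impose on the functional margin $y_i f(x_i)$. A minimal sketch of that lookup follows; the function name `margin_condition` and the tolerance `tol` are hypothetical choices made here, since a numerical solver rarely returns exact values of 0 or $C$.

```python
def margin_condition(alpha, C, tol=1e-9):
    """Map a soft-margin SVM dual variable to the implied margin condition.

    alpha -- dual variable of one training point, expected in [0, C]
    C     -- box constraint of the soft-margin SVM
    tol   -- hypothetical numerical tolerance for testing alpha against 0 and C
    """
    if alpha < -tol or alpha > C + tol:
        raise ValueError("alpha must lie in [0, C]")
    if alpha <= tol:
        return "y*f(x) >= 1"  # alpha = 0: on or outside the margin
    if alpha >= C - tol:
        return "y*f(x) <= 1"  # alpha = C: on or inside the margin
    return "y*f(x) == 1"      # 0 < alpha < C: exactly on the margin


for a in (0.0, 0.3, 1.0):
    print(a, margin_condition(a, C=1.0))
```

Points with $0 < \alpha_i < C$ are the ones usually used to recover the bias term, since for them $y_i f(x_i) = 1$ holds with equality.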

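4.2 Let me know if you have any difficulty with this.

4.3 Mistake bound for Perceptron

Let $(x_k, y_k)$ be the data point on which the perceptron fails in the $k$-th step, $k \in \mathbb{N}$. That is, $\langle w_{k-1}, y_k x_k \rangle < 0$. We have $w_k = w_{k-1} + y_k x_k$ from the algorithm.

1. Using this, and the fact that $\forall i \in [n]$, $y_i \langle x_i, w^* \rangle \ge \delta$, we can write

$$\langle w_k, w^* \rangle = \langle w_{k-1}, w^* \rangle + y_k \langle x_k, w^* \rangle \ge \langle w_{k-1}, w^* \rangle + \delta.$$

Telescoping and using $w_0 = 0$, we get $\langle w_k, w^* \rangle \ge k\delta$.

2. $$\|w_k\|^2 = \|w_{k-1} + y_k x_k\|^2 = \|w_{k-1}\|^2 + 2 \langle w_{k-1}, y_k x_k \rangle + \|y_k x_k\|^2 \le \|w_{k-1}\|^2 + \|x_k\|^2 \le \|w_{k-1}\|^2 + M^2.$$

We used $\langle w_{k-1}, y_k x_k \rangle < 0$ and $y_k = \pm 1$ to get the first inequality. Again telescoping and using $w_0 = 0$, we arrive at $\|w_k\|^2 \le k M^2$.

3. $$k M^2 \ge \|w_k\|^2 \ge \langle w_k, w^* \rangle^2 \ge k^2 \delta^2.$$

We used the second part in the first inequality and the first part in the third inequality. The second inequality is obtained by noting that $\|w^*\| = 1$ and using the Cauchy-Schwarz inequality. From $k M^2 \ge k^2 \delta^2$, it easily follows that $k \le M^2 / \delta^2$.

References

[Min00] Thomas P. Minka. Estimating a Dirichlet Distribution. Technical report.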
