CSE 546 Midterm Exam, Fall 2014 (with Solution)
1. Personal info: Name: UW NetID: Student ID:
2. There should be 14 numbered pages in this exam (including this cover sheet).
3. You can use any material you brought: any book, class notes, your print-outs of class materials that are on the class website, including my annotated slides and relevant readings. You cannot use materials brought by other students.
4. If you need more room to work out your answer to a question, use the back of the page and clearly mark on the front of the page if we are to look at what is on the back.
5. Work efficiently. Some questions are easier, some more difficult. Be sure to give yourself time to answer all of the easy ones, and avoid getting bogged down in the more difficult ones before you have answered the easier ones.
6. You have 80 minutes.
7. Good luck!

Question | Topic               | Max score | Score
1        | Short Answer        | 24        |
2        | Decision Trees      | 16        |
3        | Logistic Regression | 20        |
4        | Boosting            | 16        |
5        | Elastic Net         | 24        |
         | Total               | 100       |
1 [24 points] Short Answer

1. [6 points] On the 2D dataset below, draw the decision boundaries learned by the following algorithms (using the features x/y). Be sure to mark which regions are labeled positive or negative, and assume that ties are broken arbitrarily.

(a) Logistic regression (λ = 0)
(b) 1-NN
(c) 3-NN

2. [6 points] A random variable follows an exponential distribution with parameter λ (λ > 0) if it has the following density:

p(t) = λ e^(−λt),  t ∈ [0, ∞)

This distribution is often used to model waiting times between events. Imagine you are given i.i.d. data T = (t_1, ..., t_n) where each t_i is modeled as being drawn from an exponential distribution with parameter λ.

(a) [3 points] Compute the log-probability of T given λ. (Turn products into sums when possible.)
ln p(T) = ln Π_i p(t_i)
ln p(T) = Σ_i ln p(t_i)
ln p(T) = Σ_i ln(λ e^(−λ t_i))
ln p(T) = Σ_i (ln λ − λ t_i)
ln p(T) = n ln λ − λ Σ_i t_i

(b) [3 points] Solve for λ̂_MLE.

∂/∂λ (n ln λ − λ Σ_i t_i) = 0
n/λ − Σ_i t_i = 0
λ̂_MLE = n / Σ_i t_i

3. [6 points]

A | B | C | Y
F | F | F | F
T | F | T | T
T | T | F | T
T | T | T | F

(a) [3 points] Using the dataset above, we want to build a decision tree which classifies Y as T/F given the binary variables A, B, C. Draw the tree that would be learned by the greedy algorithm with zero training error. You do not need to show any computation.
(b) [3 points] Is this tree optimal (i.e. does it get zero training error with minimal depth)? Explain in less than two sentences. If it is not optimal, draw the optimal tree as well.

Although we get a better information gain by first splitting on A, Y is just a function of B/C: Y = B xor C. Thus, we can build a tree of depth 2 which classifies correctly, and is optimal.

4. [6 points] In class we used decision trees and ensemble methods for classification, but we can use them for regression as well (i.e. learning a function from features to real values). Let's imagine that our data has 3 binary features A, B, C, which take values 0/1, and we want to learn a function which counts the number of features which have value 1.

(a) [2 points] Draw the decision tree which represents this function. How many leaf nodes does it have?
We can represent the function with a decision tree containing 8 leaf nodes.

(b) [2 points] Now represent this function as a sum of decision stumps (e.g. sign(A)). How many terms do we need?

f(x) = sign(A) + sign(B) + sign(C)

Using a sum of decision stumps, we can represent this function using 3 terms.

(c) [2 points] In the general case, imagine that we have d binary features, and we want to count the number of features with value 1. How many leaf nodes would a decision tree need to represent this function? If we used a sum of decision stumps, how many terms would be needed? No explanation is necessary.

Using a decision tree, we will need 2^d leaf nodes. Using a sum of decision stumps, we will need d terms.
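As an illustration of the contrast above (not part of the original exam), the short sketch below checks that a sum of d stumps reproduces the counting function on all 2^d binary inputs, while a correct tree must dedicate one leaf to each input; the function name is made up for this example.

```python
from itertools import product

d = 3  # binary features A, B, C

def count_ones_stumps(x):
    # Sum of d decision stumps: each stump contributes 1 exactly when its feature is 1.
    return sum(1 if xi == 1 else 0 for xi in x)

# Every path of a correct tree must test all d features (flipping any untested
# feature would change the count), so the tree needs one leaf per input: 2^d.
for x in product([0, 1], repeat=d):
    assert count_ones_stumps(x) == sum(x)
print("leaves needed by a tree:", 2 ** d, "| stump terms needed:", d)
```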
2 [16 points] Decision Trees

We will use the dataset below to learn a decision tree which predicts if people pass machine learning (Yes or No), based on their previous GPA (High, Medium, or Low) and whether or not they studied.

GPA | Studied | Passed
L   | F       | F
L   | T       | T
M   | F       | F
M   | T       | T
H   | F       | T
H   | T       | T

For this problem, you can write your answers using log_2, but it may be helpful to note that log_2 3 ≈ 1.6.

1. [4 points] What is the entropy H(Passed)?

H(Passed) = −(2/6 log_2(2/6) + 4/6 log_2(4/6))
H(Passed) = −(1/3 log_2(1/3) + 2/3 log_2(2/3))
H(Passed) = log_2 3 − 2/3 ≈ 0.92

2. [4 points] What is the entropy H(Passed | GPA)?

H(Passed | GPA) = 1/3 (−1/2 log_2(1/2) − 1/2 log_2(1/2)) + 1/3 (−1/2 log_2(1/2) − 1/2 log_2(1/2)) + 1/3 (−1 log_2 1)
H(Passed | GPA) = 1/3 (1) + 1/3 (1) + 1/3 (0)
H(Passed | GPA) = 2/3

3. [4 points] What is the entropy H(Passed | Studied)?
H(Passed | Studied) = 1/2 (−1/3 log_2(1/3) − 2/3 log_2(2/3)) + 1/2 (−1 log_2 1)
H(Passed | Studied) = 1/2 (log_2 3 − 2/3)
H(Passed | Studied) = 1/2 log_2 3 − 1/3 ≈ 0.46

4. [4 points] Draw the full decision tree that would be learned for this dataset. You do not need to show any calculations.

We want to split first on the variable which maximizes the information gain H(Passed) − H(Passed | A). This is equivalent to minimizing H(Passed | A), so we should split on Studied first.
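The entropy values above are easy to verify numerically. Below is a minimal sketch (not part of the original exam) that recomputes H(Passed), H(Passed | GPA), and H(Passed | Studied) from the six rows of the table; the helper names are made up for this example.

```python
import math

# (GPA, Studied, Passed) for the six training examples in the table.
data = [('L', False, False), ('L', True, True), ('M', False, False),
        ('M', True, True), ('H', False, True), ('H', True, True)]

def entropy(labels):
    n = len(labels)
    h = 0.0
    for v in set(labels):
        p = labels.count(v) / n
        h -= p * math.log2(p)
    return h

def cond_entropy(rows, attr):
    # H(Passed | A) = sum over values v of P(A = v) * H(Passed | A = v)
    n = len(rows)
    h = 0.0
    for v in set(r[attr] for r in rows):
        subset = [r[2] for r in rows if r[attr] == v]
        h += len(subset) / n * entropy(subset)
    return h

print(entropy([r[2] for r in data]))  # H(Passed)           ~ 0.918
print(cond_entropy(data, 0))          # H(Passed | GPA)     = 2/3
print(cond_entropy(data, 1))          # H(Passed | Studied) ~ 0.459
```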
3 [20 points] Logistic Regression for Sparse Data

In many real-world scenarios our data has millions of dimensions, but a given example has only hundreds of non-zero features. For example, in document analysis with word counts for features, our dictionary may have millions of words, but a given document has only hundreds of unique words. In this question we will make ℓ2-regularized SGD efficient when our input data is sparse.

Recall that in ℓ2-regularized logistic regression, we want to maximize the following objective (in this problem we have excluded w_0 for simplicity):

F(w) = (1/N) Σ_{j=1}^N l(x^(j), y^(j), w) − (λ/2) Σ_{i=1}^d w_i^2

where l(x^(j), y^(j), w) is the logistic objective function

l(x^(j), y^(j), w) = y^(j) (Σ_{i=1}^d w_i x_i^(j)) − ln(1 + exp(Σ_{i=1}^d w_i x_i^(j)))

and the remaining sum is our regularization penalty. When we do stochastic gradient descent on point (x^(j), y^(j)), we are approximating the objective function as

F(w) ≈ l(x^(j), y^(j), w) − (λ/2) Σ_{i=1}^d w_i^2

Definition of sparsity: Assume that our input data has d features, i.e. x^(j) ∈ R^d. In this problem, we will consider the scenario where x^(j) is sparse. Formally, let s be the average number of nonzero elements in each example. We say the data is sparse when s << d. In the following questions, your answer should take the sparsity of x^(j) into consideration when possible. Note: when we use a sparse data structure, we can iterate over the non-zero elements in O(s) time, whereas a dense data structure requires O(d) time.

1. [2 points] Let us first consider the case when λ = 0. Write down the SGD update rule for w_i when λ = 0, using step size η, given the example (x^(j), y^(j)).

The update rule can be written as

w_i^(t+1) ← w_i^(t) + η x_i^(j) ( y^(j) − exp(Σ_k w_k x_k^(j)) / (1 + exp(Σ_k w_k x_k^(j))) )

2. [4 points] If we use a dense data structure, what is the average time complexity to update w_i when λ = 0? What if we use a sparse data structure? Justify your answer in one or two sentences.
The time complexity to calculate Σ_k w_k x_k^(j) is O(d) when the data structure is dense, and O(s) when the data structure is sparse. Note that even if we update w_i for all i, we only need to calculate Σ_k w_k x_k^(j) once, and then update only the w_i such that x_i^(j) ≠ 0. So the answer is O(d) for the dense case, and O(s) for the sparse case.

3. [2 points] Now let us consider the general case when λ > 0. Write down the SGD update rule for w_i when λ > 0, using step size η, given the example (x^(j), y^(j)).

w_i^(t+1) ← w_i^(t) − ηλ w_i^(t) + η x_i^(j) ( y^(j) − exp(Σ_k w_k x_k^(j)) / (1 + exp(Σ_k w_k x_k^(j))) )

4. [2 points] If we use a dense data structure, what is the average time complexity to update w when λ > 0?

The time complexity is O(d).

5. [4 points] Let w^(t) be the weight vector after the t-th update. Now imagine that we perform k SGD updates on w using examples (x^(t+1), y^(t+1)), ..., (x^(t+k), y^(t+k)), where x_i^(j) = 0 for every example in the sequence (i.e. the i-th feature is zero for all of the examples in the sequence). Express the new weight w_i^(t+k) in terms of w_i^(t), k, η, and λ.

When x_i^(j) = 0, w_i^(t+1) = w_i^(t) − ηλ w_i^(t) = w_i^(t) (1 − ηλ), so the answer is

w_i^(t+k) = w_i^(t) (1 − ηλ)^k

6. [6 points] Using your answer in the previous part, come up with an efficient algorithm for regularized SGD when we use a sparse data structure. What is the average time complexity per example? (Hint: when do you need to update w_i?)
Algorithm 1: Sparse SGD Algorithm for Logistic Regression with Regularization

Initialize c_i ← 0 for i ∈ {1, 2, ..., d}
for j ∈ {1, ..., n} do
    p̂ ← 1 / (1 + exp(−Σ_k w_k x_k^(j)))
    for i such that x_i^(j) ≠ 0 do
        k ← j − c_i ; auxiliary variable c_i holds the index of the last time we saw x_i^(j) ≠ 0
        w_i ← w_i (1 − ηλ)^k ; apply all the regularization updates we skipped
        w_i ← w_i + η x_i^(j) (y^(j) − p̂) ; regularization was done in the previous step
        c_i ← j ; remember the last time we saw x_i^(j) ≠ 0
    end
end

The idea is to only update w_i when x_i^(j) ≠ 0. Before we do the update, we apply all the regularization updates we skipped before, using the answer from the previous question. See Algorithm 1 for details. Using this trick, each update takes O(s) time. (Note: the same trick applies to SGD with ℓ1 regularization.)
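To make the lazy-update idea concrete, here is a runnable sketch of Algorithm 1 (not part of the original exam). It assumes the ℓ2 penalty (λ/2) Σ_i w_i^2 as in the objective above, so each skipped step multiplies w_i by (1 − ηλ); the function name and the dict-based sparse format are choices made for this example.

```python
import numpy as np

def sparse_sgd_epoch(X_sparse, y, w, eta, lam):
    """One pass of the lazy-regularization SGD from Algorithm 1.

    X_sparse: list of examples, each a dict {feature index: value} holding only
    the nonzero entries, so the work per example is O(s) rather than O(d).
    """
    last_seen = np.zeros(len(w), dtype=int)       # c_i: step at which feature i was last nonzero
    for j, (x, yj) in enumerate(zip(X_sparse, y), start=1):
        score = sum(w[i] * v for i, v in x.items())
        p_hat = 1.0 / (1.0 + np.exp(-score))      # predicted P(Y = 1 | x, w)
        for i, v in x.items():
            k = j - last_seen[i]
            w[i] *= (1.0 - eta * lam) ** k        # apply the k skipped shrink steps at once
            w[i] += eta * v * (yj - p_hat)        # gradient step on the data term
            last_seen[i] = j
    return w

# Hypothetical toy usage: three sparse examples over d = 5 features.
w = sparse_sgd_epoch([{0: 1.0, 3: 2.0}, {1: 1.0}, {0: 0.5, 4: 1.0}],
                     y=[1, 0, 1], w=np.zeros(5), eta=0.1, lam=0.01)
```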
4 [16 points] Boosting

Recall that AdaBoost learns a classifier H using a weighted sum of weak learners h_t as follows:

H(x) = sign( Σ_{t=1}^T α_t h_t(x) )

In this question we will use decision trees as our weak learners, which classify a point as {−1, +1} based on a sequence of threshold splits on its features (here x, y). In the questions below, be sure to mark which regions are marked positive/negative, and assume that ties are broken arbitrarily.

COMMENT: The solutions below are one of several possible answers. You got full credit as long as you were consistent and found a boundary with the lowest training error for h_1, h_2.

1. [2 points] Assume that our weak learners are decision trees of depth 1 (i.e. decision stumps), which minimize the weighted training error. Using the dataset below, draw the decision boundary learned by h_1.

2. [3 points] On the dataset below, circle the point(s) with the highest weights on the second iteration, and draw the decision boundary learned by h_2.

The points with the highest weight will be those that were classified incorrectly by h_1.
3. [3 points] On the dataset below, draw the decision boundary of H = sign(α_1 h_1 + α_2 h_2). (Hint: you do not need to explicitly compute the α's.)

Although h_1 and h_2 misclassify the same number of points, the points which h_2 gets wrong have been downweighted. Thus, we have ε_1 > ε_2, so α_2 > α_1, and h_2 will dominate over h_1 whenever there is a conflict. In other words, H has the same boundary as h_2.

4. [2 points] Now assume that our weak learners are decision trees of maximum depth 2, which minimize the weighted training error. Using the dataset below, draw the decision boundary learned by h_1.
5. [3 points] On the dataset below, circle the point(s) with the highest weights on the second iteration, and draw the decision boundary learned by h_2.

Again, the points with the highest weight will be those that were classified incorrectly by h_1.

6. [3 points] On the dataset below, draw the decision boundary of H = sign(α_1 h_1 + α_2 h_2). (Hint: you do not need to explicitly compute the α's.)
Again, we have α_2 > α_1, so h_2 will dominate over h_1 whenever there is a conflict.
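The arguments above about which points get the highest weight, and why α_2 > α_1 when ε_2 < ε_1, follow directly from the AdaBoost update rules; a minimal sketch of those rules (not part of the original exam) is below. The weak_learn argument is a placeholder for the depth-1 or depth-2 tree learner.

```python
import numpy as np

def adaboost(X, y, weak_learn, T):
    """Minimal AdaBoost sketch with labels y in {-1, +1}.

    weak_learn(X, y, D) should return a classifier h, with h(X) in {-1, +1},
    trained to minimize the weighted training error under the weights D.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.full(n, 1.0 / n)                    # point weights, initially uniform
    hs, alphas = [], []
    for _ in range(T):
        h = weak_learn(X, y, D)
        pred = h(X)
        eps = np.sum(D * (pred != y))          # weighted training error of h_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # smaller error -> larger vote alpha_t
        D = D * np.exp(-alpha * y * pred)      # misclassified points get upweighted
        D = D / D.sum()                        # renormalize to a distribution
        hs.append(h)
        alphas.append(alpha)
    # Final classifier: H(x) = sign(sum_t alpha_t h_t(x))
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```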
5 [24 points] Elastic Net

In class we discussed two types of regularization, ℓ1 and ℓ2. Both are useful, and sometimes it is helpful to combine them, giving the objective function below (in this problem we have excluded w_0 for simplicity):

F(w) = (1/2) Σ_{j=1}^n ( y^(j) − Σ_{i=1}^d w_i x_i^(j) )^2 + α Σ_{i=1}^d |w_i| + (λ/2) Σ_{i=1}^d w_i^2    (1)

Here (x^(j), y^(j)) is the j-th example in the training data, w is a d-dimensional weight vector, λ is a regularization parameter for the ℓ2 norm of w, and α is a regularization parameter for the ℓ1 norm of w. This approach is called the Elastic Net, and you can see that it is a generalization of Ridge and Lasso regression: it reverts to Lasso when λ = 0, and it reverts to Ridge when α = 0. In this question, we are going to derive the coordinate descent (CD) update rule for this objective.

Let g, h, c be real constants, and consider the function of x

f_1(x) = c + gx + (1/2) h x^2    (h > 0)    (2)

1. [2 points] What is the x that minimizes f_1(x)? (i.e. calculate x* = arg min_x f_1(x))

Take the gradient of f_1(x) and set it to 0:
g + hx = 0
x* = −g/h

Let α be an additional real constant, and consider another function of x:

f_2(x) = c + gx + (1/2) h x^2 + α|x|    (h > 0, α > 0)    (3)

This is a piecewise function, composed of two quadratic functions:

f_−(x) = c + gx + (1/2) h x^2 − αx
and
f_+(x) = c + gx + (1/2) h x^2 + αx

Let x_− = arg min_{x∈R} f_−(x) and x_+ = arg min_{x∈R} f_+(x). (Note: the arg min is taken over (−∞, +∞) here.)
(a) x_− > 0, x_+ > 0: the minimum is at x_+.  (b) x_− < 0, x_+ < 0: the minimum is at x_−.  (c) x_− > 0, x_+ < 0: the minimum is at 0.
Figure 1: Example picture of the three cases.

2. [3 points] What are x_+ and x_−? Show that x_− ≥ x_+.

Using the answer from part 1, we get x_+ = −(g + α)/h and x_− = −(g − α)/h. Since x_− − x_+ = 2α/h ≥ 0, we have x_− ≥ x_+.

3. [6 points] Draw a picture of f_2(x) in each of the three cases below: (a) x_− > 0, x_+ > 0; (b) x_− < 0, x_+ < 0; (c) x_− > 0, x_+ < 0. For each case, mark the minimum as either 0, x_−, or x_+. (You do not need to draw perfect curves, just get the rough shape and the relative locations of the minima to the x-axis.)

An example of the curves is shown in Figure 1. To understand the answer, note the following: f_+(0) = f_−(0), which means the two pieces of f_2 match each other at x = 0. When x_− ≥ 0, the minimum of the left part is at 0, i.e. arg min_{x ∈ (−∞, 0]} f_−(x) = 0. When x_− ≤ 0, the minimum of the left part is at x_−, i.e. arg min_{x ∈ (−∞, 0]} f_−(x) = x_−. Similar rules hold for the right part of the curve.

4. [4 points] Write x*, the minimum of f_2(x), as a piecewise function of g. (Hint: what is g in the cases above?)
Summarizing the result from the previous question, we have

x* = x_+  if x_+ > 0
x* = x_−  if x_− < 0
x* = 0    if x_+ ≤ 0, x_− ≥ 0

This is equivalent to the soft-threshold function

x* = −(g + α)/h  if g < −α
x* = 0           if g ∈ [−α, α]
x* = −(g − α)/h  if g > α

5. [4 points] Now let's derive the update rule for w_k. Fixing all parameters but w_k, express the objective function in the form of Eq. (3) (here w_k is our x). You do not need to explicitly write constant factors; you can put them all together as c. What are g and h?

g = Σ_{j=1}^n x_k^(j) ( Σ_{i≠k} w_i x_i^(j) − y^(j) )    (4)
h = λ + Σ_{j=1}^n ( x_k^(j) )^2    (5)

6. [5 points] Putting it all together, write down the coordinate descent algorithm for the Elastic Net (again excluding w_0). You can write the update for w_k in terms of the g and h you calculated in the previous question.

Algorithm 2: Coordinate Descent for Elastic Net

while not converged do
    for k ∈ {1, ..., d} do
        g ← Σ_{j=1}^n x_k^(j) ( Σ_{i≠k} w_i x_i^(j) − y^(j) )
        h ← λ + Σ_{j=1}^n ( x_k^(j) )^2
        w_k ← −(g + α)/h if g < −α ; 0 if g ∈ [−α, α] ; −(g − α)/h if g > α
    end
end

The algorithm is shown in Algorithm 2. You can compare it to the CD algorithm for Lasso; the only difference is adding λ to h.
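For reference, here is a compact NumPy sketch of Algorithm 2 (not part of the original exam). It takes the objective exactly as written in Eq. (1), so h = λ + Σ_j (x_k^(j))^2 and the w_k update is the soft-threshold rule from question 4; the function name and the toy data are made up for this example.

```python
import numpy as np

def elastic_net_cd(X, y, alpha, lam, n_iters=100):
    """Coordinate descent for Eq. (1): 1/2 * sum of squared residuals
    plus alpha * ||w||_1 plus (lambda/2) * ||w||_2^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for k in range(d):
            pred_without_k = X @ w - X[:, k] * w[k]   # sum_{i != k} w_i x_i^(j)
            g = X[:, k] @ (pred_without_k - y)
            h = lam + X[:, k] @ X[:, k]
            if g < -alpha:                            # soft-threshold update from question 4
                w[k] = -(g + alpha) / h
            elif g > alpha:
                w[k] = -(g - alpha) / h
            else:
                w[k] = 0.0
    return w

# Hypothetical toy usage on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
print(elastic_net_cd(X, y, alpha=1.0, lam=1.0))
```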