10-701/15-781 Machine Learning Mid-term Exam Solution
10-701/15-781 Machine Learning Mid-term Exam Solution

Your Name:

Your Andrew ID:
1 True or False (Give one sentence explanation) (20%)

1. (F) For a continuous random variable x and its probability distribution function p(x), it holds that 0 <= p(x) <= 1 for all x.
2. (F) A decision tree is learned by minimizing information gain.
3. (F) The linear regression estimator has the smallest variance among all unbiased estimators.
4. (T) The coefficients \alpha assigned to the classifiers assembled by AdaBoost are always non-negative.
5. (F) Maximizing the likelihood of the logistic regression model yields multiple local optima.
6. (F) No classifier can do better than a naive Bayes classifier if the distribution of the data is known.
7. (F) The back-propagation algorithm learns a globally optimal neural network with hidden layers.
8. (F) The VC dimension of a line should be at most 2, since I can find at least one case of 3 points that cannot be shattered by any line.
9. (F) Since the VC dimension for an SVM with a Radial Basis Kernel is infinite, such an SVM must be worse than an SVM with a polynomial kernel, which has a finite VC dimension.
10. (F) A two-layer neural network with linear activation functions is essentially a weighted combination of linear separators, trained on a given dataset; the boosting algorithm built on linear separators also finds a combination of linear separators, therefore these two algorithms will give the same result.
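Statement 1 can be checked concretely, since a valid density may exceed 1 pointwise even though it integrates to 1. A minimal sketch in Python (toy numbers, not from the exam; p(x) read as a density):

```python
# Uniform density on [a, b] with b - a < 1: the pdf is constant and > 1,
# yet it still integrates to 1, so "0 <= p(x) <= 1" is false for densities.
a, b = 0.0, 0.1
pdf = 1.0 / (b - a)        # constant density value on [a, b]
integral = pdf * (b - a)   # total probability mass over the support
print(pdf, integral)       # pdf is 10.0 > 1, integral is 1.0
```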
2 Linear Regression (10%)

We are interested here in a particular 1-dimensional linear regression problem. The dataset corresponding to this problem has n examples (x_1, y_1), ..., (x_n, y_n), where x_i and y_i are real numbers for all i. Let w = [w_0, w_1]^T be the least squares solution we are after. In other words, w minimizes

    J(w) = \sum_{i=1}^n (y_i - w_0 - w_1 x_i)^2.

You can assume for our purposes here that the solution is unique.

1. (5%) Check each statement that must be true if w = [w_0, w_1]^T is indeed the least squares solution.

    ( ) \sum_{i=1}^n (y_i - w_0 - w_1 x_i) y_i = 0
    ( ) \sum_{i=1}^n (y_i - w_0 - w_1 x_i)(y_i - \bar{y}) = 0
    ( ) \sum_{i=1}^n (y_i - w_0 - w_1 x_i)(x_i - \bar{x}) = 0
    ( ) \sum_{i=1}^n (y_i - w_0 - w_1 x_i)(w_0 + w_1 x_i) = 0

where \bar{x} and \bar{y} are the sample means based on the same dataset. (hint: take the derivative of J(w) with respect to w_0 and w_1)

(sol.) Taking the derivative with respect to w_1 and w_0 gives us the following conditions of optimality:

    \partial J(w) / \partial w_0 = -2 \sum_{i=1}^n (y_i - w_0 - w_1 x_i) = 0
    \partial J(w) / \partial w_1 = -2 \sum_{i=1}^n (y_i - w_0 - w_1 x_i) x_i = 0

This means that the prediction error (y_i - w_0 - w_1 x_i) does not co-vary with any linear function of the inputs (it has zero mean and does not co-vary with the inputs). Since (x_i - \bar{x}) and (w_0 + w_1 x_i) are both linear functions of the inputs, the third and fourth statements must be true.

2. (5%) There are several numbers (statistics) computed from the data that we can use to estimate w. These are

    \bar{x} = (1/n) \sum_{i=1}^n x_i
    \bar{y} = (1/n) \sum_{i=1}^n y_i
    C_xx = \sum_{i=1}^n (x_i - \bar{x})^2
    C_xy = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})
    C_yy = \sum_{i=1}^n (y_i - \bar{y})^2

Suppose we only care about the value of w_1. We'd like to determine w_1 on the basis of ONLY two numbers (statistics) listed above. Which two numbers do we need for this? (hint: use the answers to the previous question)
(sol.) We need C_xx (spread of x) and C_xy (linear dependence between x and y). No justification was necessary, as these basic points have appeared in the course. If we want to derive this more mathematically, we can, for example, look at one of the answers to the previous question:

    \sum_{i=1}^n (y_i - w_0 - w_1 x_i)(x_i - \bar{x}) = 0,

which we can rewrite as

    [\sum_{i=1}^n y_i (x_i - \bar{x})] - w_0 [\sum_{i=1}^n (x_i - \bar{x})] - w_1 [\sum_{i=1}^n x_i (x_i - \bar{x})] = 0

By using the fact that \sum_i (x_i - \bar{x}) = 0, we see that

    \sum_{i=1}^n y_i (x_i - \bar{x}) = \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = C_xy
    \sum_{i=1}^n x_i (x_i - \bar{x}) = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x}) = C_xx

Substituting these back into our equation above gives

    C_xy - w_1 C_xx = 0, i.e., w_1 = C_xy / C_xx.
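The identity C_xy - w_1 C_xx = 0 derived above is easy to check numerically, together with both optimality conditions from part 1. A minimal sketch (the dataset is made up for illustration, not from the exam):

```python
# Small synthetic 1-D dataset (illustrative, not from the exam).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.2, 2.9, 5.1, 7.0, 8.8]
n = len(xs)

# Statistics from part 2 (a common 1/n scaling would cancel in the ratio).
x_bar = sum(xs) / n
y_bar = sum(ys) / n
C_xx = sum((x - x_bar) ** 2 for x in xs)
C_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

w1 = C_xy / C_xx              # slope from the two statistics
w0 = y_bar - w1 * x_bar       # intercept from the first optimality condition

# Check both optimality conditions from part 1 directly on the residuals.
r = [y - w0 - w1 * x for x, y in zip(xs, ys)]
assert abs(sum(r)) < 1e-9                                 # dJ/dw0 = 0
assert abs(sum(ri * x for ri, x in zip(r, xs))) < 1e-9    # dJ/dw1 = 0
print(w0, w1)
```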
3 AdaBoost (15%)

Consider building an ensemble of decision stumps G_m with the AdaBoost algorithm,

    f(x) = sign( \sum_{m=1}^M \alpha_m G_m(x) ).

Figure 1 displays a few labeled points in two dimensions as well as the first stump we have chosen. A stump predicts binary +/-1 values, and depends only on one coordinate value (the split point). The little arrow in the figure is the normal to the stump decision boundary, indicating the positive side where the stump predicts +1. All the points start with uniform weights.

[Figure 1: Labeled points and the first decision stump. The arrow points in the positive direction from the stump decision boundary.]

1. (5%) Circle all the point(s) in Figure 1 whose weight will increase as a result of incorporating the first stump (the weight update due to the first stump).

(sol.) The only misclassified negative sample.

2. (5%) Draw in the same figure a possible stump that we could select at the next boosting iteration. You need to draw both the decision boundary and its positive orientation.

(sol.) The second stump will also be a vertical split, between the second positive sample (from left to right) and the misclassified negative sample, as drawn in the figure.

3. (5%) Will the second stump receive a higher coefficient in the ensemble than the first? In other words, will \alpha_2 > \alpha_1? Briefly explain your answer. (No calculation should be necessary.)

(sol.) \alpha_2 > \alpha_1, because the point that the second stump misclassifies will have a smaller relative weight, since it is classified correctly by the first stump.
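The reasoning in parts 1 and 3 can be made concrete with the standard AdaBoost weight-update and coefficient formulas. A minimal sketch, assuming a hypothetical configuration of five uniformly weighted points of which the first stump misclassifies exactly one (not the exam's exact figure):

```python
import math

# Hypothetical setup: n points with uniform initial weights, where the
# first stump misclassifies exactly one point.
n = 5
w = [1.0 / n] * n
miss1 = [False, False, False, True, False]       # first stump's mistakes

eps1 = sum(wi for wi, m in zip(w, miss1) if m)   # weighted error of stump 1
alpha1 = 0.5 * math.log((1 - eps1) / eps1)       # its ensemble coefficient

# AdaBoost re-weighting: up-weight mistakes, down-weight correct points.
w = [wi * math.exp(alpha1 if m else -alpha1) for wi, m in zip(w, miss1)]
Z = sum(w)
w = [wi / Z for wi in w]                         # renormalize

# The previously misclassified point now carries weight 1/2; each other
# point carries 1/8. A second stump that errs only on a point the first
# stump got right therefore has smaller weighted error, so alpha2 > alpha1.
miss2 = [True, False, False, False, False]
eps2 = sum(wi for wi, m in zip(w, miss2) if m)
alpha2 = 0.5 * math.log((1 - eps2) / eps2)
print(eps1, alpha1, eps2, alpha2)
```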
4 Neural Nets (15%)

Consider a neural net for a binary classification which has one hidden layer as shown in the figure. We use a linear activation function h(z) = cz at the hidden units and a sigmoid activation function g(z) = 1/(1 + e^{-z}) at the output unit to learn the function for P(y = 1 | x, w), where x = (x_1, x_2) and w = (w_1, w_2, ..., w_9).

[Figure: a network with inputs x_1, x_2, two hidden units, and one output unit. w_1, w_2 are the hidden-unit bias weights; w_3, ..., w_6 are the input-to-hidden weights; w_7 is the output bias weight; w_8, w_9 are the hidden-to-output weights.]

1. (5%) What is the output P(y = 1 | x, w) from the above neural net? Express it in terms of x_i, c, and the weights w_i. What is the final classification boundary?

(sol.)

    g(w_7 + w_8 h(w_1 + w_3 x_1 + w_5 x_2) + w_9 h(w_2 + w_4 x_1 + w_6 x_2))
    = 1 / (1 + exp(-(w_7 + c w_8 w_1 + c w_9 w_2 + (c w_8 w_3 + c w_9 w_4) x_1 + (c w_8 w_5 + c w_9 w_6) x_2)))

The classification boundary is:

    w_7 + c w_8 w_1 + c w_9 w_2 + (c w_8 w_3 + c w_9 w_4) x_1 + (c w_8 w_5 + c w_9 w_6) x_2 = 0

2. (5%) Draw a neural net with no hidden layer which is equivalent to the given neural net, and write the weights w' of this new neural net in terms of c and w_i.

(sol.) [Figure omitted in this transcription: a single sigmoid output unit with bias weight w_7 + c w_8 w_1 + c w_9 w_2 and input weights c w_8 w_3 + c w_9 w_4 for x_1 and c w_8 w_5 + c w_9 w_6 for x_2, matching the boundary derived in part 1.]

3. (5%) Is it true that any multi-layered neural net with linear activation functions at the hidden layers can be represented as a neural net without any hidden layer? Briefly explain your answer.

(sol.) Yes. If linear activation functions are used for all the hidden units, the output from the hidden units can be written as a linear combination of the input features. Since these intermediate outputs serve as the input for the final output layer, we can always find an equivalent neural net which does not have any hidden layer, as seen in the example above.
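The collapse argued in parts 1-3 is easy to verify numerically: with linear hidden activations, the two-layer net and the derived single-layer net produce identical outputs. A minimal sketch (the weight values are arbitrary, chosen only for the check):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Weights follow the exam's numbering: w1, w2 hidden biases; w3..w6
# input-to-hidden; w7 output bias; w8, w9 hidden-to-output.
c = 0.7
w = {i: 0.1 * i for i in range(1, 10)}   # arbitrary values w[1]=0.1 ... w[9]=0.9

def two_layer(x1, x2):
    h1 = c * (w[1] + w[3] * x1 + w[5] * x2)   # linear hidden unit 1
    h2 = c * (w[2] + w[4] * x1 + w[6] * x2)   # linear hidden unit 2
    return sigmoid(w[7] + w[8] * h1 + w[9] * h2)

# Equivalent net with no hidden layer, weights as derived in part 1.
b  = w[7] + c * w[8] * w[1] + c * w[9] * w[2]
u1 = c * w[8] * w[3] + c * w[9] * w[4]
u2 = c * w[8] * w[5] + c * w[9] * w[6]

def one_layer(x1, x2):
    return sigmoid(b + u1 * x1 + u2 * x2)

for x1, x2 in [(0.0, 0.0), (1.0, -2.0), (3.5, 0.25)]:
    assert abs(two_layer(x1, x2) - one_layer(x1, x2)) < 1e-9
print("outputs match")
```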
5 Kernel Method (20%)

Suppose we have six training points from two classes as in Figure 1(a). Note that we have four points from class 1: (0.2, 0.4), (0.4, 0.8), (0.4, 0.2), (0.8, 0.4), and two points from class 2: (0.4, 0.4), (0.8, 0.8). Unfortunately, the points in Figure 1(a) cannot be separated by a linear classifier. The kernel trick is to find a mapping of x to some feature vector \phi(x) such that there is a function K, called a kernel, which satisfies K(x, x') = \phi(x)^T \phi(x'). And we expect the points \phi(x) to be linearly separable in the feature space. Here, we consider the following normalized kernel:

    K(x, x') = x^T x' / (||x|| ||x'||)

1. (5%) What is the feature vector \phi(x) corresponding to this kernel? Draw \phi(x) for each training point x in Figure 1(b), and specify from which point it is mapped.

(sol.) \phi(x) = x / ||x||, i.e., each point is projected onto the unit circle.

2. (5%) You now see that the feature vectors are linearly separable in the feature space. The maximum-margin decision boundary in the feature space will be a line in R^2, which can be written as w_1 x + w_2 y + c = 0. What are the values of the coefficients w_1 and w_2? (Hint: you don't need to compute them.)

(sol.) (w_1, w_2) = (1, 1)

3. (3%) Circle the points corresponding to the support vectors in Figure 1(b).

(sol.) All three distinct feature points: collinear inputs map to the same point on the unit circle, leaving two class-1 feature points and one class-2 feature point, and all of them are support vectors.

4. (7%) Draw the decision boundary in the original input space resulting from the normalized linear kernel in Figure 1(a). Briefly explain your answer.

(sol.) Since \phi(x) depends only on the direction of x, the linear boundary in feature space maps back to two rays from the origin, one on each side of the 45-degree direction; inputs whose direction lies between the rays are classified as class 2, and the rest as class 1.
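Assuming the feature map \phi(x) = x/||x|| implied by the normalized kernel, the separability claimed in part 2 can be checked directly on the six training points. A minimal sketch:

```python
import math

# The six training points from the problem.
class1 = [(0.2, 0.4), (0.4, 0.8), (0.4, 0.2), (0.8, 0.4)]
class2 = [(0.4, 0.4), (0.8, 0.8)]

def phi(p):
    # Feature map for K(x, x') = x^T x' / (||x|| ||x'||):
    # project each input onto the unit circle.
    norm = math.hypot(p[0], p[1])
    return (p[0] / norm, p[1] / norm)

# Collinear inputs collapse to the same unit vector, and the value of
# phi_1 + phi_2 separates the two classes, as a line w = (1, 1) would.
s1 = [sum(phi(p)) for p in class1]
s2 = [sum(phi(p)) for p in class2]
print(sorted(set(round(s, 6) for s in s1)),
      sorted(set(round(s, 6) for s in s2)))
assert max(s1) < min(s2)   # the line phi_1 + phi_2 = const separates them
```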
6 VC Dimension and PAC Learning (10%)

The VC dimension, VC(H), of a hypothesis space H defined over an instance space X is the largest number of points (in some configuration) that can be shattered by H. Suppose that with probability (1 - \delta), a PAC learner outputs a hypothesis within error \epsilon of the best possible hypothesis in H. It can be shown that the lower bound on the number of training examples m sufficient for successful learning, stated in terms of VC(H), is

    m >= (1/\epsilon) (4 \log_2(2/\delta) + 8 VC(H) \log_2(13/\epsilon)).

Consider a learning problem in which X = R is the set of real numbers, and the hypothesis space is the set of intervals H = {(a < x < b) | a, b \in R}. Note that a hypothesis labels points inside the interval as positive, and negative otherwise.

1. (5%) What is the VC dimension of H?

(sol.) VC(H) = 2. Suppose we have two points x_1 and x_2, with x_1 < x_2. They can always be shattered by H, no matter how they are labeled:

(a) if x_1 is positive and x_2 negative, choose a < x_1 < b < x_2;
(b) if x_1 is negative and x_2 positive, choose x_1 < a < x_2 < b;
(c) if both x_1 and x_2 are positive, choose a < x_1 < x_2 < b;
(d) if both x_1 and x_2 are negative, choose a < b < x_1 < x_2.

However, if we have three points x_1 < x_2 < x_3 and they are labeled x_1 (positive), x_2 (negative), and x_3 (positive), then they cannot be shattered by H.

2. (5%) What is the probability that a hypothesis consistent with m examples will have error at least \epsilon?

(sol.) Use the above result. Substituting VC(H) = 2 into the inequality m >= (1/\epsilon)(4 \log_2(2/\delta) + 16 \log_2(13/\epsilon)), we have

    \epsilon m >= 4 \log_2(2/\delta) + 16 \log_2(13/\epsilon)
    \epsilon m - 16 \log_2(13/\epsilon) >= 4 \log_2(2/\delta)
    2^{\epsilon m / 4} (13/\epsilon)^{-4} >= 2/\delta
    \delta >= 2 (13/\epsilon)^4 2^{-\epsilon m / 4}

so the probability that a consistent hypothesis has error at least \epsilon is at most 2 (13/\epsilon)^4 2^{-\epsilon m / 4}.
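The bound and its inversion in part 2 can be evaluated numerically. A minimal sketch (the eps and delta values are arbitrary illustrations, not from the exam):

```python
import math

def sample_bound(eps, delta, vc):
    # m >= (1/eps) * (4*log2(2/delta) + 8*vc*log2(13/eps))
    return (1.0 / eps) * (4 * math.log2(2 / delta)
                          + 8 * vc * math.log2(13 / eps))

def failure_prob(eps, m):
    # Inverting the bound with VC(H) = 2, as in part 2:
    # delta >= 2 * (13/eps)^4 * 2^(-eps*m/4)
    return 2 * (13 / eps) ** 4 * 2 ** (-eps * m / 4)

eps, delta = 0.1, 0.05
m = sample_bound(eps, delta, vc=2)
print(math.ceil(m))                      # examples sufficient for (eps, delta)
print(failure_prob(eps, math.ceil(m)))   # plugging m back in gives <= delta
```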
7 Logistic Regression (10%)

We consider the following models of logistic regression for a binary classification with a sigmoid function g(z) = 1/(1 + e^{-z}):

    Model 1: P(Y = 1 | X, w_1, w_2) = g(w_1 X_1 + w_2 X_2)
    Model 2: P(Y = 1 | X, w_1, w_2) = g(w_0 + w_1 X_1 + w_2 X_2)

We have three training examples:

    x^(1) = [1, 1]^T    x^(2) = [1, 0]^T    x^(3) = [0, 0]^T
    y^(1) = y^(2) = y^(3) = 1

1. (5%) Does it matter how the third example is labeled in Model 1? I.e., would the learned value of w = (w_1, w_2) be different if we changed the label of the third example to -1? Does it matter in Model 2? Briefly explain your answer. (Hint: think of the decision boundary on the 2D plane.)

(sol.) It does not matter in Model 1, because x^(3) = (0, 0) makes w_1 x_1 + w_2 x_2 always zero, and hence the likelihood of this example does not depend on the value of w. But it does matter in Model 2.

2. (5%) Now, suppose we train the logistic regression model (Model 2) based on the n training examples x^(1), ..., x^(n) and labels y^(1), ..., y^(n) by maximizing the penalized log-likelihood of the labels:

    \sum_i log P(y^(i) | x^(i), w) - (\lambda/2) ||w||^2 = \sum_i log g(y^(i) w^T x^(i)) - (\lambda/2) ||w||^2

For large \lambda (strong regularization), the log-likelihood terms will behave as linear functions of w:

    log g(y^(i) w^T x^(i)) \approx (1/2) y^(i) w^T x^(i)   (up to an additive constant)

Express the penalized log-likelihood using this approximation (with Model 1), and derive the expression for the MLE \hat{w} in terms of \lambda and the training data {x^(i), y^(i)}. Based on this, explain how w behaves as \lambda increases. (We assume each x^(i) = (x_1^(i), x_2^(i))^T and y^(i) is either 1 or -1.)

(sol.)

    log l(w) \approx \sum_i (1/2) y^(i) w^T x^(i) - (\lambda/2) ||w||^2
    \partial log l(w) / \partial w_1 = \sum_i (1/2) y^(i) x_1^(i) - \lambda w_1 = 0
    \partial log l(w) / \partial w_2 = \sum_i (1/2) y^(i) x_2^(i) - \lambda w_2 = 0
    \hat{w} = (1/(2\lambda)) \sum_i y^(i) x^(i)

Thus as \lambda increases, \hat{w} shrinks toward zero at rate 1/(2\lambda), while its direction (along \sum_i y^(i) x^(i)) stays fixed.
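The closed form \hat{w} = (1/(2\lambda)) \sum_i y^(i) x^(i) can be checked on the example's inputs. A minimal sketch (labels as transcribed; any +/-1 labels show the same 1/\lambda shrinkage):

```python
# The three training inputs from the problem, with labels as transcribed.
X = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0)]
y = [1.0, 1.0, 1.0]

def w_strong_reg(lam):
    # Closed form from the linear approximation:
    # w = (1/(2*lam)) * sum_i y_i * x_i
    s1 = sum(yi * x1 for yi, (x1, x2) in zip(y, X))
    s2 = sum(yi * x2 for yi, (x1, x2) in zip(y, X))
    return (s1 / (2.0 * lam), s2 / (2.0 * lam))

w_small_lam = w_strong_reg(1.0)    # lambda = 1
w_big_lam = w_strong_reg(10.0)     # lambda = 10: same direction, 10x smaller
print(w_small_lam, w_big_lam)
```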
More informationCarleton College, Winter 2017 Math 121, Practice Final Prof. Jones. Note: the exam will have a section of true-false questions, like the one below.
Carleto College, Witer 207 Math 2, Practice Fial Prof. Joes Note: the exam will have a sectio of true-false questios, like the oe below.. True or False. Briefly explai your aswer. A icorrectly justified
More informationPattern recognition systems Laboratory 10 Linear Classifiers and the Perceptron Algorithm
Patter recogitio systems Laboratory 10 Liear Classifiers ad the Perceptro Algorithm 1. Objectives his laboratory sessio presets the perceptro learig algorithm for the liear classifier. We will apply gradiet
More informationEcon 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara
Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio
More informationMATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4
MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.
More informationProblem Cosider the curve give parametrically as x = si t ad y = + cos t for» t» ß: (a) Describe the path this traverses: Where does it start (whe t =
Mathematics Summer Wilso Fial Exam August 8, ANSWERS Problem 1 (a) Fid the solutio to y +x y = e x x that satisfies y() = 5 : This is already i the form we used for a first order liear differetial equatio,
More information( ) (( ) ) ANSWERS TO EXERCISES IN APPENDIX B. Section B.1 VECTORS AND SETS. Exercise B.1-1: Convex sets. are convex, , hence. and. (a) Let.
Joh Riley 8 Jue 03 ANSWERS TO EXERCISES IN APPENDIX B Sectio B VECTORS AND SETS Exercise B-: Covex sets (a) Let 0 x, x X, X, hece 0 x, x X ad 0 x, x X Sice X ad X are covex, x X ad x X The x X X, which
More informationSequences A sequence of numbers is a function whose domain is the positive integers. We can see that the sequence
Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece 1, 1, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet
More informationComplex Numbers Solutions
Complex Numbers Solutios Joseph Zoller February 7, 06 Solutios. (009 AIME I Problem ) There is a complex umber with imagiary part 64 ad a positive iteger such that Fid. [Solutio: 697] 4i + + 4i. 4i 4i
More informationCS434a/541a: Pattern Recognition Prof. Olga Veksler. Lecture 5
CS434a/54a: Patter Recogitio Prof. Olga Veksler Lecture 5 Today Itroductio to parameter estimatio Two methods for parameter estimatio Maimum Likelihood Estimatio Bayesia Estimatio Itroducto Bayesia Decisio
More informationChapter 9: Numerical Differentiation
178 Chapter 9: Numerical Differetiatio Numerical Differetiatio Formulatio of equatios for physical problems ofte ivolve derivatives (rate-of-chage quatities, such as velocity ad acceleratio). Numerical
More informationAlgebra of Least Squares
October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal
More informationChapter 10: Power Series
Chapter : Power Series 57 Chapter Overview: Power Series The reaso series are part of a Calculus course is that there are fuctios which caot be itegrated. All power series, though, ca be itegrated because
More informationCS 2750 Machine Learning. Lecture 22. Concept learning. CS 2750 Machine Learning. Concept Learning
Lecture 22 Cocept learig Milos Hauskrecht milos@cs.pitt.edu 5329 Seott Square Cocept Learig Outlie: Learig boolea fuctios Most geeral ad most specific cosistet hypothesis. Mitchell s versio space algorithm
More informationNUMERICAL METHODS FOR SOLVING EQUATIONS
Mathematics Revisio Guides Numerical Methods for Solvig Equatios Page 1 of 11 M.K. HOME TUITION Mathematics Revisio Guides Level: GCSE Higher Tier NUMERICAL METHODS FOR SOLVING EQUATIONS Versio:. Date:
More informationCS 2750 Machine Learning. Lecture 23. Concept learning. CS 2750 Machine Learning. Concept Learning
Lecture 3 Cocept learig Milos Hauskrecht milos@cs.pitt.edu Cocept Learig Outlie: Learig boolea fuctios Most geeral ad most specific cosistet hypothesis. Mitchell s versio space algorithm Probably approximately
More informationMachine Learning Brett Bernstein
Machie Learig Brett Berstei Week Lecture: Cocept Check Exercises Starred problems are optioal. Statistical Learig Theory. Suppose A = Y = R ad X is some other set. Furthermore, assume P X Y is a discrete
More informationA sequence of numbers is a function whose domain is the positive integers. We can see that the sequence
Sequeces A sequece of umbers is a fuctio whose domai is the positive itegers. We ca see that the sequece,, 2, 2, 3, 3,... is a fuctio from the positive itegers whe we write the first sequece elemet as
More information