Massachusetts Institute of Technology

6.867 Machine Learning, Fall 2006
Problem Set : Solutions

1. (a) (5 points) From the lecture notes (Eq 4, Lecture 5), the optimal parameter values for linear regression, given the matrix of training examples $X$ and the corresponding response variables $y$, are:

$$\theta = (X^T X)^{-1} X^T y$$

The quantity $(X^T X)^{-1} X^T$ is also known as the pseudo-inverse of $X$, and often occurs when dealing with linear systems of equations. When $X$ is a square, invertible matrix, it is exactly the inverse of $X$.

MATLAB provides many ways of achieving our desired goal. We can directly write out the expression above or use the function pinv. Here are the functions linear_regress and linear_pred:

    function theta = linear_regress(y,X)
    theta = pinv(X)*y;

    function y_hat = linear_pred(theta,X)
    y_hat = X*theta;

In linear_regress, note that we are not calculating $\theta_0$ separately. This differs from the description in the lecture notes, where the training examples were explicitly padded with 1s, allowing us to introduce an offset $\theta_0$. Instead, we will use a feature mapping to achieve the same effect.

(b) ( points) Before we describe the solution, we first describe how the dataset was created. This may help you appreciate why some feature mappings may work better than others. The x values were created by sampling each coordinate uniformly at random from (-,):

    X_in = rand(,3)* -

Given a particular x, the corresponding y_true and y_noisy values were created as follows:

    y_true = log x 5 log x 7.5 log x + 3
    y_noisy = y_true + ε,    ε ~ N(, )

To evaluate the performance of linear regression on given training and test sets, we created the function test_linear_reg, which combines the regression, prediction, and evaluation steps.
You may, of course, do it in some other way:

    function err = test_linear_reg(y_train, X_train, y_test, X_test)
    % train linear regression using X_train and y_train
    % evaluate the mean squared prediction error on X_test and y_test
    theta = linear_regress(y_train, X_train);
    y_hat = linear_pred(theta, X_test);
    yd = y_hat - y_test;
    err = sum(yd.^2)/length(yd);   % err = Mean Squared Prediction Error

Cite as: Tommi Jaakkola, course materials for 6.867 Machine Learning, Fall 2006. MIT OpenCourseWare
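The regression, prediction, and error steps above can also be sketched outside MATLAB. The following is a minimal Python illustration, not the course's code: it solves the normal equations $(X^T X)\theta = X^T y$ by Gaussian elimination instead of calling pinv, so it assumes $X^T X$ is invertible. The helper names `solve_linear_system` and `mspe` and the toy data are hypothetical.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular, so X^T X has no inverse")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def linear_regress(y, X):
    """Least-squares theta from the normal equations (X^T X) theta = X^T y."""
    d = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(d)] for a in range(d)]
    Xty = [sum(row[a] * yt for row, yt in zip(X, y)) for a in range(d)]
    return solve_linear_system(XtX, Xty)


def linear_pred(theta, X):
    """Predictions X * theta, one per row of X."""
    return [sum(t * x for t, x in zip(theta, row)) for row in X]


def mspe(y_hat, y):
    """Mean squared prediction error."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


# Hypothetical toy data: y = 2*x + 1. The constant feature plays the role of
# the offset, added through the feature mapping rather than inside the solver.
xs = [0.0, 1.0, 2.0, 3.0]
X = [[x, 1.0] for x in xs]
y = [2.0 * x + 1.0 for x in xs]
theta = linear_regress(y, X)
err = mspe(linear_pred(theta, X), y)
```

On this noise-free toy data the fit recovers the slope and offset and the error is numerically zero. With noisy responses the same calls apply unchanged.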

Using this, we can now calculate the mean squared prediction error (MSPE) for the two feature mappings:

    >> X = feature_mapping(X_in,);
    >> test_linear_reg(y_noisy, X, y_true, X)
    ans = .5736e+3
    >> X = feature_mapping(X_in,);
    >> test_linear_reg(y_noisy, X, y_true, X)
    ans = .56

The two sets of errors are (φ_1) and .56 (φ_2), respectively. Unsurprisingly, the mapping φ_2 performs much better than φ_1: it is exactly the space in which the relationship between X and y is linear.

(c) ( points) Recall from the notes (Eq 8, lecture 6) that the desired quantity we need to maximize is

$$\frac{v^T A A v}{1 + v^T A v}, \quad \text{where } A = (X^T X)^{-1}.$$

In the notes, an offset parameter is explicitly assumed, so that $v = [x^T, 1]^T$. However, in our case this is not necessary, and so $v = x$.

Active learning may, in general, select the same point x to be observed repeatedly. Each of these observations corresponds to a different y_noisy. However, due to practical limitations, we had supplied you with only one set of y_noisy values. Thus, if some x_i occurs repeatedly in idx, you will need to use the same y_noisy,i for each occurrence of x_i. Alternatively, you could change your code so as to disallow repetitions in idx. This is the option we have chosen here.

Given any feature space, the criterion above will aim to find points far apart in that space. However, these points may not be far apart in the feature space where classification actually occurs. This is of particular concern when the latter feature space might not be easily accessible (e.g., when using a kernel like the radial basis function kernel).

Here's the active learning code:

    function idx = active_learn(X,k1,k2)
    idx = 1:k1;
    N = size(X,1);
    for i=1:k2
      var_reduction = zeros(N,1);
      X1 = X(idx,:);              % points selected so far
      A = inv(X1'*X1);
      AA = A*A;
      for j=1:N
        if ismember(j,idx)        % this is the part where we disallow repetitions
          continue;
        end
        v = X(j,:);
        a = v*AA*v' / (1 + (v*A*v'));
        var_reduction(j) = a;
      end
      [a, aidx] = max(var_reduction);
      idx(end+1) = aidx;
    end
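The greedy selection rule can be illustrated in a few lines of Python. This is a sketch under stated assumptions rather than a port of the course code: indices are 0-based, and instead of forming $A = (X_1^T X_1)^{-1}$ explicitly it solves $(X_1^T X_1)u = v$, so that $v^T A v = v \cdot u$ and, because $A$ is symmetric, $v^T A A v = u \cdot u$. The helpers `solve` and `dot` and the toy points are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def solve(G, b):
    """Solve G u = b by Gaussian elimination with partial pivoting."""
    n = len(G)
    M = [[float(v) for v in G[i]] + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("X1^T X1 is singular for the current selection")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    u = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (M[i][n] - s) / M[i][i]
    return u


def active_learn(X, k1, k2):
    """Start from the first k1 rows of X, then greedily add up to k2 more."""
    idx = list(range(k1))
    d = len(X[0])
    for _ in range(k2):
        rows = [X[i] for i in idx]
        # G = X1^T X1 for the points selected so far.
        G = [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]
        best_j, best_score = None, -1.0
        for j, v in enumerate(X):
            if j in idx:  # disallow repetitions
                continue
            u = solve(G, v)                        # u = A v
            score = dot(u, u) / (1.0 + dot(v, u))  # v^T A A v / (1 + v^T A v)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is None:  # every point is already selected
            break
        idx.append(best_j)
    return idx


# Hypothetical toy points in a two-dimensional feature space.
X = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [3.0, 0.0], [0.0, 2.0]]
chosen = active_learn(X, 2, 2)
```

Starting from the two unit vectors, the criterion picks the two points farthest from the origin and skips the point near it, which matches the remark that the criterion favors points that are far apart in the chosen feature space.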

Using it, we can now compute the desired quantities:

    >> idx = active_learn(X,5,)
    idx =
    >> test_linear_reg(y_noisy(idx), X(idx,:), y_true, X)
    ans = .5734e+3
    >> idx = active_learn(X,5,)
    idx =
    >> test_linear_reg(y_noisy(idx), X(idx,:), y_true, X)
    ans =

Thus, using φ_1 for active learning chooses points that are not well placed (for this particular dataset); φ_2 performs much better. Note that the feature mapping used to perform the regression (and evaluation) is the same for both (φ_2), so the performance difference is due to the points chosen by active learning. The answers change when repetitions are allowed (MSPE for φ_1 = 7.47; for φ_2 = 5.5), but they still illustrate our concept. For completeness' sake, the answers for the original version of the problem are: (φ : 68.9 (repeats), 68.6 (no repeats)) and (φ : (no repeats) and (repeats)).

(d) (4 points) This error will vary, depending upon the number of iterations you perform and the random points selected. In my simulations, the value of error (using mapping φ_2 for regression and evaluation) was . When this simulation was run for runs, the error was close to . Clearly, it seems to be much better to perform random sampling than to perform active learning in φ_1's space. This may seem surprising at first, but is completely understandable: in the space where regression is performed (φ_2), the points chosen by performing active learning in the φ_1 space are not far apart at all, and are thus particularly bad points to sample. For completeness' sake, the answer for the original version of the problem is: (φ : 8.9, φ : ).

(e) (4 points) The figures are shown in Fig. 1. For clarity's sake, we have only plotted the last points of idx (since the first 5 are the same across all cases). In the original feature space (Fig. 1a), the points selected by active learning in the φ_1 space are spread far apart, as expected. However, as part (b) showed, a better fit to the data is obtained by using φ_2. In this space (Fig. 1b), the points selected by active learning in the φ_1 space are very closely bunched together.
The points selected by active learning in the φ_2 space are, in contrast, spread far apart. This helps explain why the points learned using active learning on the φ_1 space did not lead to good performance in the regression step.

2. (a) 5 points The function $f(t) = -\beta t/2$ monotonically decreases with $t$ (for $\beta > 0$). The function $g(t) = e^t$ monotonically increases with $t$. Thus, the function $g \circ f(t) = h(t) = e^{-\beta t/2}$ monotonically decreases with $t$. As such, the RBF kernel $K(x, x') = e^{-\frac{\beta}{2}\|x - x'\|^2}$ defines a Gram matrix that satisfies the conditions of Micchelli's theorem and is hence invertible.

(b) 5 points Let $A = (\lambda I + K)^{-1} y$. Then,

$$\lim_{\lambda \to 0} A = K^{-1} y$$

Since $K$ is always invertible, this limit is always well-defined. Now, we have $\hat{\alpha}_t = \lambda A_t$, where $A_t$ is the $t$-th element of $A$.

Figure 1: (a) In φ_1 space. (b) In φ_2 space. The red circles correspond to points chosen by performing active learning on the φ space and the blue ones correspond to those chosen by performing active learning on the φ space. Only the last points of idx are shown for each; the first 5 points are the same.

We then have,

$$y(x) = \sum_{t=1}^{n} (\hat{\alpha}_t/\lambda) K(x_t, x) = \sum_{t=1}^{n} (\lambda A_t/\lambda) K(x_t, x) = \sum_{t=1}^{n} A_t K(x_t, x)$$

$$\lim_{\lambda \to 0} y(x) = \sum_{t=1}^{n} \left(\lim_{\lambda \to 0} A_t\right) K(x_t, x) = \sum_{t=1}^{n} B_t K(x_t, x), \quad \text{where } B = K^{-1} y$$

Thus, in the required limit, the function is $y(x) = \sum_{t=1}^{n} B_t K(x_t, x)$.

(c) points To prove that the training error is zero, we need to prove that $y(x_t) = y_t$ for $t = 1, \ldots, n$. From part (b), we have

$$y(x_t) = \sum_{i=1}^{n} B_i K(x_i, x_t)$$
$$= \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} y_j K^{-1}(i, j) \Big) K(x_i, x_t)$$
$$= \sum_{j=1}^{n} y_j \sum_{i=1}^{n} K^{-1}(i, j) K(x_i, x_t)$$
$$= \sum_{j=1}^{n} y_j \, \delta(j, t)$$
$$= y_t$$

where $K^{-1}(i, j)$ is the $(i, j)$-th entry of $K^{-1}$ and

$$\delta(i, j) = \begin{cases} 1 & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases}$$

Here, we made use of the fact that $K(x_i, x_t) = K(x_t, x_i)$ and that if $A = B^{-1}$ then $\sum_i A(t, i) B(i, j) = \delta(t, j)$.

(d) 5 points Sample code for this problem is shown below:

    Ntrain = size(Xtrain,1);
    Ntest = size(Xtest,1);

    for i=1:length(lambda),
      lmb = lambda(i);
      alpha = lmb * ((lmb*eye(Ntrain) + K)^-1) * Ytrain;
      % each row of Atrain holds alpha'/lmb, so the row sums give the predictions
      Atrain = (1/lmb) * repmat(alpha', Ntrain, 1);
      yhat_train = sum(Atrain.*K, 2);
      Atest = (1/lmb) * repmat(alpha', Ntest, 1);
      yhat_test = sum(Atest.*(Ktrain_test'), 2);
      E(i,:) = [mean((yhat_train-Ytrain).^2), mean((yhat_test-Ytest).^2)];
    end;

Figure 2: Training and test error for Prob #2 (e). (Legend: Train Error, Test Error; axis label: β.)

(e) The resulting plot is shown in Fig. 2. As can be seen, the training error is zero at λ = 0 and increases as λ increases. The test error initially decreases, reaches a minimum around ., and then increases again. This is exactly as we would expect. λ ≈ 0 results in over-fitting (the model is too powerful): our regression function has a low bias but high variance. By increasing λ we constrain the model, thus increasing the training error. While the regularization increases bias, the variance decreases faster, and we generalize better. High values of λ result in under-fitting (high bias, low variance), and both training and test errors are high.

3. (a) Observe that θ is only a sum of $y_t \phi(x_t)$'s, so we can just store the coefficients:

$$\theta = \sum_{t=1}^{n} w_t y_t \phi(x_t).$$

We can update the $w_t$'s by incrementing $w_t$ by one when a mistake is made on example $t$. Classifying new examples means evaluating

$$\theta^T \phi(x) = \sum_{t=1}^{n} w_t y_t K(x_t, x),$$

which only involves kernel operations.
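The bookkeeping described in 3(a) can be sketched in Python as follows. This is an illustration under assumptions, not the course solution: the function names, the `max_epochs` guard, the kernel parameterizations, and the XOR toy data are all hypothetical choices made for the example.

```python
import math


def poly_kernel(x, z, d=2):
    """Polynomial kernel (1 + x.z)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** d


def rbf_kernel(x, z, s=1.0):
    """Radial basis function kernel; the 1/(2 s^2) scaling is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * s * s))


def train_kernel_perceptron(X, y, kernel, max_epochs=1000):
    """Return mistake counts w_t; theta = sum_t w_t y_t phi(x_t) is never formed."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    w = [0] * n
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # theta^T phi(x_i), written with kernel evaluations only.
            f = sum(w[t] * y[t] * K[t][i] for t in range(n))
            if y[i] * f <= 0:   # mistake on example i
                w[i] += 1       # increment its coefficient by one
                mistakes += 1
        if mistakes == 0:       # a full pass without mistakes: done
            break
    return w


def discriminant(w, X, y, kernel, x_new):
    """theta^T phi(x_new) = sum_t w_t y_t K(x_t, x_new)."""
    return sum(w[t] * y[t] * kernel(X[t], x_new) for t in range(len(X)))


# Hypothetical XOR-style data: not linearly separable in the input space,
# but separable with a degree-2 polynomial kernel.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [1, 1, -1, -1]
w = train_kernel_perceptron(X, y, poly_kernel)
preds = [discriminant(w, X, y, poly_kernel, x) for x in X]
```

Only the integer counts in `w` are stored, and both training and prediction touch the data solely through kernel evaluations. Swapping in `rbf_kernel` requires no other change.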

(b) The most straightforward proof: use the regression argument to show that you can fit the points exactly, so that you not only achieve the correct sign, but the value of the discriminant function can be made ±1 for every training example.

(c) Here is my solution.

    % kernel = '(1 + transpose(x_i)*x_j)^d'; d = 5;   % polynomial kernel
    % kernel = 'exp(-transpose(x_i-x_j)*(x_i-x_j)/(2*s^2))'; s = 3;   % radial basis function kernel
    % The kernel string is evaluated inside the functions below, so it must
    % contain literal parameter values rather than the names d or s.

    function alpha = train_kernel_perceptron(X, y, kernel)
    n = size(X, 1);
    K = zeros(n, n);
    for i = 1:n
      x_i = X(i,:)';
      for j = 1:n
        x_j = X(j,:)';
        K(i,j) = eval(kernel);
      end
    end
    alpha = zeros(n, 1);
    mistakes = 1;
    while mistakes > 0
      mistakes = 0;
      for i = 1:n
        if alpha'*K(:,i)*y(i) <= 0     % mistake on example i
          alpha(i) = alpha(i) + y(i);
          mistakes = mistakes + 1;
        end
      end
    end

    function f = discriminant_function(alpha, X, kernel, X_test)
    n = size(X, 1);
    K = zeros(n, 1);
    for i = 1:n
      x_i = X(i,:)';
      x_j = X_test';
      K(i) = eval(kernel);
    end
    f = alpha'*K;

(d) The original dataset requires d = 4; the new dataset requires d = . An RBF will easily separate either dataset.

    load p3_a     % X and y loaded.
    kernel = '(1 + transpose(x_i)*x_j)^ ';
    alpha = train_kernel_perceptron(X, y, kernel);
    figure
    hold on
    plot(X( :, ), X( :, ), 'rs')
    plot(X( :, ), X( :, ), 'bo')
    plot_dec_boundary(alpha, X, kernel, [ 4, ], [.5,.5], [, 4])

    kernel = 'exp(-transpose(x_i-x_j)*(x_i-x_j)/8)';
    alpha = train_kernel_perceptron(X, y, kernel);
    figure
    hold on
    plot(X( :, ), X( :, ), 'rs')
    plot(X( :, ), X( :, ), 'bo')
    plot_dec_boundary(alpha, X, kernel, [ 4, ], [.5,.5], [, 4])


6.867 Machine learning, lecture 7 (Jaakkola)

Lecture topics: Kernel form of linear regression; kernels, examples, construction, properties

Linear regression and kernels

Consider a slightly simpler model where we omit