CS 189 Fall 2016 Introduction to Machine Learning Final

- Do not open the exam before you are instructed to do so.
- The exam is closed book, closed notes except your one-page cheat sheet.
- Usage of electronic devices is forbidden. If we see you using an electronic device (phone, laptop, etc.), you will get a zero.
- You have 3 hours.
- Write your initials at the top right of each page (e.g., write BR if you are Ben Recht).
- Mark your answers on the exam itself in the space provided. Do not attach any extra sheets.

First name: ______  Last name: ______  SID: ______
First and last name of student to your left: ______
First and last name of student to your right: ______
Q1 [10 pts] Inverse Gaussian Distribution

The inverse Gaussian distribution, denoted IG(µ, λ), is a continuous distribution supported on (0, ∞) and parameterized by two scalars µ > 0 and λ > 0. For any x ∈ (0, ∞), the probability density function f(x; µ, λ) is given by

    f(x; µ, λ) = sqrt(λ / (2πx³)) · exp(−λ(x − µ)² / (2µ²x)).

(a) [3 pts] Suppose we draw n independent samples x₁, …, xₙ from IG(µ, λ). Write down the log-likelihood L(x₁, …, xₙ; µ, λ).

Solution: We have that for x ∈ (0, ∞),

    log f(x; µ, λ) = (1/2) log λ − (1/2) log(2πx³) − λ(x − µ)² / (2µ²x).

Hence,

    L(x₁, …, xₙ; µ, λ) = Σᵢ log f(xᵢ; µ, λ) = Σᵢ [ (1/2) log λ − (1/2) log(2πxᵢ³) − λ(xᵢ − µ)² / (2µ²xᵢ) ].

(b) [2 pts] Is the function φ : (0, ∞) → R defined as φ(λ) = L(x₁, …, xₙ; µ, λ) convex, concave, or both? No need to justify your answer.

Solution: Concave. φ(λ) = a log λ + bλ + c for scalars a, b, c with a > 0, and log λ is concave on (0, ∞).

(c) [5 pts] Now assume that µ is known. Derive the maximum likelihood estimator λ̂ given n independent samples x₁, …, xₙ from IG(µ, λ).

Solution: Solving for the stationary point of (d/dλ) L(x₁, …, xₙ; µ, λ) = 0,

    0 = (d/dλ) L(x₁, …, xₙ; µ, λ) = n/(2λ) − Σᵢ (xᵢ − µ)² / (2µ²xᵢ)
    ⟹  λ̂ = n / Σᵢ (xᵢ − µ)² / (µ²xᵢ).

Since the function φ(λ) is concave on (0, ∞) and the proposed maximizer is positive with probability one, the solution to (d/dλ) L = 0 is both a necessary and sufficient condition for a global maximum.
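The estimator from part (c) is easy to sanity-check numerically. A minimal sketch (not part of the exam), using the fact that numpy's Wald sampler draws from the inverse Gaussian IG(mean, scale = λ); the parameter values are made up:

```python
import numpy as np

# Draw from IG(mu, lam) via numpy's Wald sampler and recompute lambda
# from the stationarity condition of the log-likelihood (part (c)).
rng = np.random.default_rng(0)
mu, lam, n = 2.0, 5.0, 200_000
x = rng.wald(mu, lam, size=n)

# lambda_hat = n / sum_i (x_i - mu)^2 / (mu^2 x_i)
lam_hat = n / np.sum((x - mu) ** 2 / (mu ** 2 * x))
```

Since λ(xᵢ − µ)²/(µ²xᵢ) is χ²-distributed with one degree of freedom, λ̂ concentrates around the true λ at the usual O(n^(−1/2)) rate, so with n = 200,000 samples the relative error is a fraction of a percent.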
Q2 [25 pts] Multivariate Gaussians

Let x be a d-dimensional random vector distributed as a Gaussian with mean µ ∈ R^d and positive definite covariance matrix Σ ∈ R^{d×d}: that is, x ~ N(µ, Σ).

(a) [5 pts] Let v ∈ R^d. What are the mean and variance of vᵀx?

Solution: vᵀµ and vᵀΣv.

(b) [5 pts] What is the distribution of vᵀx?

Solution: N(vᵀµ, vᵀΣv).

(c) [5 pts] Suppose we get to choose v from the unit sphere (i.e. ‖v‖₂ = 1). Write down a (unit-norm) choice of v that maximizes the variance of vᵀx.

Solution: The leading eigenvector of Σ.

(d) [5 pts] Let z ~ N(0, I_d). Find a matrix A and a vector b such that Az + b has the same distribution as x.

Solution: A = Σ^{1/2}, b = µ.

(e) [5 pts] Now find a matrix A and a vector b such that Ax + b has the same distribution as z.

Solution: A = Σ^{−1/2}, b = −Σ^{−1/2}µ.
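Parts (d) and (e) can be checked empirically. A quick sketch (not part of the exam) with a made-up 2-dimensional µ and Σ, computing the symmetric square root via an eigendecomposition:

```python
import numpy as np

# If z ~ N(0, I), then Sigma^{1/2} z + mu ~ N(mu, Sigma)  (part (d)),
# and Sigma^{-1/2}(x - mu) recovers a standard Gaussian     (part (e)).
rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])   # positive definite

# Symmetric square root and inverse square root via eigendecomposition
evals, evecs = np.linalg.eigh(Sigma)
Sigma_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
Sigma_half_inv = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T

z = rng.standard_normal((200_000, 2))
x = z @ Sigma_half + mu              # rows distributed as N(mu, Sigma)
z_rec = (x - mu) @ Sigma_half_inv    # whitening: back to N(0, I)
```

Because Sigma_half is symmetric, right-multiplying the rows of z by it gives samples with covariance Sigma_halfᵀ · I · Sigma_half = Σ, and the empirical mean and covariance of x (and of the whitened z_rec) match the targets up to sampling noise.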
Q3 [20 pts] Data Geometry

Consider the following two-dimensional data set: [plot omitted in transcription; the data points all lie on the line x₁ = x₂]. Note that the mean of this data set is 0.

(a) [5 pts] Suppose someone assigns a label yᵢ = +1 or yᵢ = −1 to every data point and we solve a Support Vector Machine (SVM) classification problem:

    minimize over w ∈ R², b ∈ R:  (1/n) Σᵢ max(1 − yᵢ(wᵀxᵢ + b), 0) + λ‖w‖₂².

Here, b is an unregularized bias term, and λ > 0. Show that both components of w must be equal to each other.

Solution: The optimal w must lie in the span of the data. Therefore, since all of the data points have both components equal to each other, w will as well.

(b) [5 pts] Suppose λ = 0. Given the information provided, what can be said about the possible values for w?

Solution: The optimal w could be any direction in R², based on the information provided. Without regularization, w can roam free.

(c) [5 pts] Draw the principal components of this data set on the plot.

Solution: [plot omitted in transcription.]

(d) [5 pts] What is the ratio of the smallest eigenvalue to the largest eigenvalue of the covariance matrix of this data?

Solution: 0.
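The span argument in part (a) can be illustrated by actually running subgradient descent on the regularized hinge objective. A sketch (not part of the exam) with made-up data whose two components are equal: starting from w = 0, every subgradient is a combination of data points plus a multiple of w, so every iterate stays in the span of the data and the two components of w remain exactly equal.

```python
import numpy as np

# Subgradient descent on (1/n) sum_i max(1 - y_i (w^T x_i + b), 0) + lam ||w||^2
# with data on the line x1 = x2; w[0] == w[1] holds at every iterate.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam, lr, n = 0.1, 0.05, len(y)

w, b = np.zeros(2), 0.0
for _ in range(2000):
    margins = y * (X @ w + b)
    active = margins < 1.0                                 # nonzero hinge loss
    gw = -(y[active, None] * X[active]).sum(axis=0) / n + 2 * lam * w
    gb = -y[active].sum() / n
    w -= lr * gw
    b -= lr * gb
```

Here the positive points sit on the positive half of the line, so the learned w also points in the (1, 1) direction.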
Q4 [20 pts] Residual Neural Network

Consider the following neural network, which operates on scalars:

    x →(w₁) z₁ →ReLU a₁ →(w₂) z₂ →ReLU a₂;   z₃ = a₁ + a₂;   ŷ = w₃z₃.

In this network, w₁, w₂, and w₃ are scalars. The network takes the scalar x as input, and computes z₁ = w₁x. The ReLU nonlinearity is then applied: a₁ = ReLU(z₁). Next, the network computes z₂ = w₂a₁ and applies the ReLU nonlinearity: a₂ = ReLU(z₂). Next, z₃ = a₁ + a₂. Finally, ŷ = w₃z₃. We will use the mean squared error loss function L = (1/2)(y − ŷ)² to train this network. You may use R(x) and R′(x) to denote ReLU and the derivative of ReLU respectively.

(a) [5 pts] Find dL/dw₃.

Solution: dL/dw₃ = −(y − ŷ)z₃.

(b) [5 pts] Find dL/dw₂.

Solution: dL/dw₂ = −(y − ŷ)w₃ · dR(w₂a₁)/dw₂ = −(y − ŷ)w₃R′(z₂)a₁.

(c) [5 pts] Find dL/dw₁.

Solution:

    dL/dw₁ = −(y − ŷ)w₃ · d(R(w₁x) + R(w₂R(w₁x)))/dw₁
           = −(y − ŷ)w₃ (R′(w₁x)x + R′(w₂R(w₁x))w₂R′(w₁x)x)
           = −(y − ŷ)w₃R′(z₁)x(1 + R′(z₂)w₂).

(d) [5 pts] How would the network change if we added a ReLU nonlinearity to unit z₃, such that a₃ = ReLU(z₃) and ŷ = w₃a₃? Briefly explain your reasoning.

Solution: We already have that a₁ ≥ 0 and a₂ ≥ 0, so z₃ = a₁ + a₂ ≥ 0 and it would be redundant to apply ReLU to z₃.
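The gradient formulas in parts (a)-(c) can be verified against central finite differences. A sketch (not part of the exam) at a made-up point chosen so that neither ReLU sits exactly at its kink; note z₂ < 0 here, so the second ReLU is "off" and dL/dw₂ = 0:

```python
# Finite-difference check of the backprop formulas from parts (a)-(c).
def forward(w1, w2, w3, x):
    z1 = w1 * x
    a1 = max(z1, 0.0)            # a1 = ReLU(z1)
    z2 = w2 * a1
    a2 = max(z2, 0.0)            # a2 = ReLU(z2)
    z3 = a1 + a2                 # residual connection
    return z1, a1, z2, z3, w3 * z3

def loss(w1, w2, w3, x, y):
    return 0.5 * (y - forward(w1, w2, w3, x)[4]) ** 2

x, y = 1.5, 2.0
w1, w2, w3 = 0.7, -0.3, 1.1      # chosen so z1 > 0 and z2 < 0
z1, a1, z2, z3, yhat = forward(w1, w2, w3, x)

relu_grad = lambda z: 1.0 if z > 0 else 0.0   # R'(z)

# Analytic gradients from the solutions
g3 = -(y - yhat) * z3
g2 = -(y - yhat) * w3 * relu_grad(z2) * a1
g1 = -(y - yhat) * w3 * relu_grad(z1) * x * (1 + relu_grad(z2) * w2)

# Central finite differences
h = 1e-6
fd3 = (loss(w1, w2, w3 + h, x, y) - loss(w1, w2, w3 - h, x, y)) / (2 * h)
fd2 = (loss(w1, w2 + h, w3, x, y) - loss(w1, w2 - h, w3, x, y)) / (2 * h)
fd1 = (loss(w1 + h, w2, w3, x, y) - loss(w1 - h, w2, w3, x, y)) / (2 * h)
```

All three analytic gradients agree with the finite-difference estimates to numerical precision.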
Q5 [35 pts] Feature Engineering

In this class, we have tried to emphasize the importance of good features. In this question, we'll take this a step further and see that good features need to be predictive on the training set, but must also generalize well to unseen data.

Let n denote the number of UC Berkeley students, and let us assign an arbitrary ordering to all students. Let y₁, y₂, …, yₙ ∈ R denote the ages of all the students. Our goal is to predict the age y of a new transfer student on campus based on some list of features. A natural feature choice could be, for example, year of study. But suppose we only have access to student ID (SID) numbers. You are a bit unsure how to use this information, so you ask some of your friends.

Note: For this question, assume that n < 10⁹.

One-hot encoding. You first ask your friend from Stanford how to encode a student ID into a feature vector. "Use one-hot encoding," your friend responds. You are a bit skeptical, but you proceed onwards. Since there are 9 digits in an SID, you one-hot encode each SID into an element of {0, 1}^d where d = 10⁹. Let x₁, …, xₙ ∈ R^d denote the encoded features of the training set stacked row-wise into a matrix X ∈ R^{n×d}, and Y = (y₁, …, yₙ) ∈ R^n denote the age vector of students in the training set (note that n < d). You proceed to use regularized least squares and solve

    minimize over w ∈ R^d:  (1/2)‖Xw − Y‖₂² + (λ/2)‖w‖₂².   (1)

(a) [3 pts] How does the regularization term ((λ/2)‖w‖₂²) ensure a unique solution to (1)?
Solution: The matrix X is of dimension n × 10⁹ with n < 10⁹, so X has a non-trivial null-space and the unregularized least-squares problem has infinitely many minimizers. With the regularizer, the Hessian of the objective is XᵀX + λI, which is positive definite for λ > 0; the objective is therefore strictly convex and the minimizer is unique.

(b) [5 pts] Compute the solution w ∈ R^{10⁹} to (1), assuming that xᵢ = eᵢ ∈ R^d, where eᵢ denotes the i-th standard basis vector in R^d. That is, someone sorted your data such that the i-th student has SID i. Your answer should only be expressed in terms of Y and λ. Hint: You may perform this calculation either directly, or using the kernel trick.

Solution (direct): We know that w = (XᵀX + λI_d)⁻¹XᵀY. Hence,

    X = [I_n  0_{n,d−n}],    XᵀX = Σᵢ eᵢeᵢᵀ = blockdiag(I_n, 0_{d−n,d−n}).

Therefore,

    w = (XᵀX + λI_d)⁻¹XᵀY = blockdiag( (1/(1+λ)) I_n, (1/λ) I_{d−n} ) · (Y; 0_{d−n}) = ( (1/(1+λ)) Y; 0_{d−n} ).

Solution (kernel trick): It is not hard to see that the Gram matrix is XXᵀ = I_n. Recall that w = Xᵀα, where α = (XXᵀ + λI_n)⁻¹Y. Clearly then α = (1/(1+λ))Y and

    w = Xᵀα = ( (1/(1+λ)) Y; 0_{d−n} ).

(c) [3 pts] Compute predict(xᵢ) = ⟨w, xᵢ⟩ for xᵢ in the training set.

Solution: predict(xᵢ) = yᵢ / (1 + λ).

(d) [3 pts] Suppose x is the feature vector for the transfer student who is not in the training set. Compute predict(x) = ⟨w, x⟩ for x not in the training set.
Solution: Since x is not in the training set, its one-hot encoding is orthogonal to every feature vector in the training set. Thus predict(x) = 0.

Logical features. Unsatisfied with the performance of one-hot encoding, you ask another friend from MIT what features to use. Your friend tells you to use the feature that is equal to 1 if the student goes to Berkeley, and 0 otherwise. You are even more skeptical, but you proceed onwards.

(e) [5 pts] Using the features xᵢ = 1 for all i, write the solution w to the following unregularized problem:

    minimize over w ∈ R:  (1/2)‖Xw − Y‖₂².

Solution: Our features in this case are scalars. We know that X = 1_n and hence XᵀX = 1_nᵀ1_n = n. The solution is simply w = (1/n) Σᵢ yᵢ.

(f) [3 pts] Now compute predict(xᵢ) = ⟨w, xᵢ⟩ using the w from part (e) for xᵢ in the training set.

Solution: predict(xᵢ) = (1/n) Σᵢ yᵢ.

(g) [3 pts] Since the transfer student is a Berkeley student, the value of their feature will be x = 1. Compute predict(x) = ⟨w, x⟩ using this w when x is the transfer student who is not in the training set.

Solution: Again, predict(x) = (1/n) Σᵢ yᵢ.
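The one-hot ridge predictions from parts (b)-(d) are easy to reproduce on a small instance. A sketch (not part of the exam) with made-up sizes (d = 8, n = 4) and made-up ages:

```python
import numpy as np

# Ridge regression with one-hot features x_i = e_i: training predictions
# shrink to y_i / (1 + lam), and an unseen SID predicts exactly 0.
n, d, lam = 4, 8, 0.5
X = np.zeros((n, d))
X[np.arange(n), np.arange(n)] = 1.0            # rows e_1, ..., e_n
Y = np.array([20.0, 21.0, 19.0, 22.0])         # made-up ages

# w = (X^T X + lam I)^{-1} X^T Y
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

train_preds = X @ w                            # should equal Y / (1 + lam)
unseen_pred = w[n]                             # an SID outside the training set
```

The shrinkage by 1/(1 + λ) on the training set, and the useless prediction of 0 for any new student, fall straight out of the block-diagonal structure of XᵀX.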
Combining features. Still unsatisfied with the previous suggestions, we consider combining both the one-hot encoded features and the constant feature.

(h) [10 pts] We now consider xᵢ = (eᵢ, 1) ∈ R^{d+1}, where eᵢ ∈ R^d as before. For simplicity, we set λ = 0, and consider the following problem:

    minimize over w ∈ R^{d+1}:  (1/2)‖Xw − Y‖₂².

Compute a minimizer w of this problem, and predict(x) for x in and not in the training set. Hint: The Sherman–Morrison formula states that for invertible A and column vectors u, v such that 1 + vᵀA⁻¹u ≠ 0,

    (A + uvᵀ)⁻¹ = A⁻¹ − (A⁻¹uvᵀA⁻¹) / (1 + vᵀA⁻¹u).

Solution: Since we are allowed to compute any minimizer, it is easiest to compute the minimum-norm one by employing the kernel trick. Observe that

    X = [I_n  0_{n,d−n}  1_n],    XXᵀ = I_n + 1_n1_nᵀ.

By the Sherman–Morrison formula, we have that

    (XXᵀ)⁻¹ = (I_n + 1_n1_nᵀ)⁻¹ = I_n − (1_n1_nᵀ) / (1 + n).

Hence,

    α = (XXᵀ)⁻¹Y = ( I_n − (1_n1_nᵀ)/(1 + n) ) Y.

Furthermore, the minimum-norm minimizer w is given by

    w = Xᵀα = ( (I_n − (1_n1_nᵀ)/(1+n)) Y ;  0_{d−n} ;  (1_nᵀY)/(n+1) ),

where the last (constant-feature) entry uses 1_nᵀα = 1_nᵀY − n·(1_nᵀY)/(1+n) = (1_nᵀY)/(n+1).

We are now in a position to compute the predict(x) function. On the training set, the one-hot and constant contributions combine to give

    predict(xᵢ) = ( yᵢ − (1_nᵀY)/(n+1) ) + (1_nᵀY)/(n+1) = yᵢ.

For x not in the training set, only the constant feature contributes, so

    predict(x) = (1/(n+1)) Σᵢ yᵢ.
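Part (h) can also be checked directly on a small instance: the minimum-norm least-squares solution is obtained from the pseudoinverse, and both the Sherman–Morrison identity and the two predictions come out as derived. A sketch (not part of the exam) with made-up sizes and ages:

```python
import numpy as np

# Combined features (e_i, 1): min-norm least squares interpolates the
# training ages, and an unseen student gets sum(y) / (n + 1).
n, d = 5, 8
X = np.hstack([np.eye(n), np.zeros((n, d - n)), np.ones((n, 1))])
Y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # made-up ages

w = np.linalg.pinv(X) @ Y                      # minimum-norm minimizer

# Sherman-Morrison check: (I + 11^T)^{-1} = I - 11^T / (n + 1)
G = np.eye(n) + np.ones((n, n))                # this is X X^T
G_inv = np.eye(n) - np.ones((n, n)) / (n + 1)

# Unseen student: one-hot part misses every training SID, so only the
# constant feature fires.
x_new = np.zeros(d + 1)
x_new[n] = 1.0                                 # an SID outside the training set
x_new[-1] = 1.0
pred_new = x_new @ w
```

Since X has full row rank, the min-norm solution fits the training set exactly, while the unseen prediction collapses to the (slightly shrunk) average age, matching the closed form above.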
Q6 [20 pts] Calibration

Suppose we attempt to solve a binary classification task with binary feature vectors. Specifically, we are given n data points {(xᵢ, yᵢ)}ᵢ₌₁ⁿ with xᵢ ∈ {0, 1}^d (i.e. each of the d feature values is either 0 or 1) and yᵢ ∈ {0, 1}. Suppose we try to fit the standard linear logistic regression model:

    P(y = 1 | x, w, b) = g(wᵀx + b) = 1 / (1 + exp(−wᵀx − b)).

We say that a model is calibrated if

    (1/n) Σᵢ P(y = 1 | xᵢ, w, b) = (1/n) Σᵢ yᵢ.

That is, if the average probability of labeling the data in class 1 is equal to the average number of occurrences of class 1 on the training set. Calibrated models are interesting because their predicted probabilities are more useful as an indicator of confidence. Consider training the logistic regression model where we solve the optimization problem

    maximize over w ∈ R^d, b ∈ R:  Σᵢ log P(yᵢ | xᵢ, w, b).

Let w*, b* denote the maximizing parameters.

(a) [5 pts] Show that the logistic model corresponding to the optimal w*, b* is calibrated.

Solution: Note that the log-likelihood can be written as

    Σᵢ [ yᵢ(wᵀxᵢ + b) − log(1 + exp(wᵀxᵢ + b)) ].

Taking a derivative with respect to b shows that b* must satisfy

    0 = Σᵢ { yᵢ − exp(w*ᵀxᵢ + b*) / (1 + exp(w*ᵀxᵢ + b*)) } = Σᵢ yᵢ − Σᵢ P(y = 1 | xᵢ, w*, b*).

Another (simpler) method: Denote L as the log-likelihood and µᵢ = P(y = 1 | xᵢ, w, b). Reformulate wᵀxᵢ + b as βᵀx̃ᵢ, with β = (w, b) and x̃ᵢ := (xᵢ, 1) ∈ R^{d+1}. From lecture/homework we know:

    ∇_β L = Σᵢ (yᵢ − µᵢ)x̃ᵢ = 0.

Noting that x̃ᵢ,d+1 = 1 (the bias component appended to the end of xᵢ), we have Σᵢ (yᵢ − µᵢ) = 0, i.e. Σᵢ yᵢ = Σᵢ µᵢ.

Many students made the mistake of claiming that under the optimal w*, b*, P(yᵢ | xᵢ, w, b) = yᵢ. This is only true if the data is linearly separable, and even then it is only true in the limit as ‖w‖ → ∞.

(b) [5 pts] Let S_j denote the set of indices i where xᵢⱼ = 1. We say a model is conditionally calibrated with respect to feature j if

    (1/|S_j|) Σ_{i ∈ S_j} P(y = 1 | xᵢ, w, b) = (1/|S_j|) Σ_{i ∈ S_j} yᵢ.
Show that the logistic model corresponding to the optimal w*, b* is conditionally calibrated with respect to every feature.

Solution: Taking a derivative with respect to w_j shows that w*_j must satisfy

    0 = Σᵢ { yᵢxᵢⱼ − xᵢⱼ exp(w*ᵀxᵢ + b*) / (1 + exp(w*ᵀxᵢ + b*)) }
      = Σᵢ yᵢxᵢⱼ − Σᵢ xᵢⱼ P(y = 1 | xᵢ, w*, b*)
      = Σ_{i ∈ S_j} yᵢ − Σ_{i ∈ S_j} P(y = 1 | xᵢ, w*, b*).

Here, the last line follows because xᵢⱼ = 1 only if i ∈ S_j, and otherwise xᵢⱼ = 0.

Another (simpler) method: Using the same notation as the simpler method in the previous part, we see that

    Σᵢ (yᵢ − µᵢ)x̃ᵢ = 0  ⟹  Σᵢ (yᵢ − µᵢ)xᵢⱼ = 0 for all j  ⟹  Σ_{i ∈ S_j} (yᵢ − µᵢ) = 0.

(c) [5 pts] Suppose we add a regularizer for the w parameters to the optimization problem:

    maximize over w ∈ R^d, b ∈ R:  Σᵢ log P(yᵢ | xᵢ, w, b) − λ‖w‖₂².

Is the logistic model corresponding to the optimal w*, b* calibrated? Is it calibrated with respect to every feature?

Solution: The model remains calibrated, as the gradient for b does not change, and hence the same calculation as in part (a) applies. However, for w, we now have

    ∂cost/∂w_j = Σ_{i ∈ S_j} yᵢ − Σ_{i ∈ S_j} P(y = 1 | xᵢ, w*, b*) − 2λw*_j,

which will not be calibrated with respect to feature j when w*_j ≠ 0.

(d) [5 pts] Suppose someone gives you a classifying function f : R^d → R that inputs x and outputs a real number. We can produce a probability from this classifier by setting

    P(y = 1 | x, f, α, β) = 1 / (1 + exp(αf(x) + β)).

Show that one can always find values α and β such that the resulting probability estimates are calibrated.

Solution: If one fits the log-likelihood over the data, the optimality conditions with respect to β will guarantee calibration. Another way to think about this: you can think of f as essentially a feature transformation. Replace your entire training set with f(xᵢ). Then train logistic regression on this training set. The optimality conditions we already proved guarantee calibration.
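The calibration identities in parts (a) and (b) can be observed empirically. A sketch (not from the exam): fit an unregularized logistic model by gradient ascent on made-up binary data (labels drawn from a hypothetical "true" model, so the data is noisy and the MLE is finite), then check overall and per-feature calibration.

```python
import numpy as np

# Fit logistic regression by gradient ascent; at the optimum the gradient
# sum_i (y_i - mu_i) x~_i vanishes, which is exactly calibration.
rng = np.random.default_rng(2)
n, d = 400, 3
X = rng.integers(0, 2, size=(n, d)).astype(float)
Xt = np.hstack([X, np.ones((n, 1))])           # append the bias feature

beta_true = np.array([1.0, -1.0, 0.5, -0.2])   # hypothetical generating model
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-Xt @ beta_true))).astype(float)

beta = np.zeros(d + 1)
for _ in range(20_000):
    mu = 1.0 / (1.0 + np.exp(-Xt @ beta))
    beta += 0.5 * Xt.T @ (y - mu) / n          # ascent on average log-likelihood

mu = 1.0 / (1.0 + np.exp(-Xt @ beta))          # fitted probabilities
```

The bias coordinate of the vanishing gradient gives overall calibration (part (a)), and each feature coordinate gives conditional calibration on S_j (part (b)), matching the derivations above.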
More informationLecture 3: Probability Distributions
Lecture 3: Probablty Dstrbutons Random Varables Let us begn by defnng a sample space as a set of outcomes from an experment. We denote ths by S. A random varable s a functon whch maps outcomes nto the
More informationAPPROXIMATE PRICES OF BASKET AND ASIAN OPTIONS DUPONT OLIVIER. Premia 14
APPROXIMAE PRICES OF BASKE AND ASIAN OPIONS DUPON OLIVIER Prema 14 Contents Introducton 1 1. Framewor 1 1.1. Baset optons 1.. Asan optons. Computng the prce 3. Lower bound 3.1. Closed formula for the prce
More informationTHE ROYAL STATISTICAL SOCIETY 2006 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE
THE ROYAL STATISTICAL SOCIETY 6 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER I STATISTICAL THEORY The Socety provdes these solutons to assst canddates preparng for the eamnatons n future years and for
More information8/25/17. Data Modeling. Data Modeling. Data Modeling. Patrice Koehl Department of Biological Sciences National University of Singapore
8/5/17 Data Modelng Patrce Koehl Department of Bologcal Scences atonal Unversty of Sngapore http://www.cs.ucdavs.edu/~koehl/teachng/bl59 koehl@cs.ucdavs.edu Data Modelng Ø Data Modelng: least squares Ø
More informationNatural Language Processing and Information Retrieval
Natural Language Processng and Informaton Retreval Support Vector Machnes Alessandro Moschtt Department of nformaton and communcaton technology Unversty of Trento Emal: moschtt@ds.untn.t Summary Support
More informationOnline Classification: Perceptron and Winnow
E0 370 Statstcal Learnng Theory Lecture 18 Nov 8, 011 Onlne Classfcaton: Perceptron and Wnnow Lecturer: Shvan Agarwal Scrbe: Shvan Agarwal 1 Introducton In ths lecture we wll start to study the onlne learnng
More informationMIMA Group. Chapter 2 Bayesian Decision Theory. School of Computer Science and Technology, Shandong University. Xin-Shun SDU
Group M D L M Chapter Bayesan Decson heory Xn-Shun Xu @ SDU School of Computer Scence and echnology, Shandong Unversty Bayesan Decson heory Bayesan decson theory s a statstcal approach to data mnng/pattern
More information15-381: Artificial Intelligence. Regression and cross validation
15-381: Artfcal Intellgence Regresson and cross valdaton Where e are Inputs Densty Estmator Probablty Inputs Classfer Predct category Inputs Regressor Predct real no. Today Lnear regresson Gven an nput
More informationTHE ARIMOTO-BLAHUT ALGORITHM FOR COMPUTATION OF CHANNEL CAPACITY. William A. Pearlman. References: S. Arimoto - IEEE Trans. Inform. Thy., Jan.
THE ARIMOTO-BLAHUT ALGORITHM FOR COMPUTATION OF CHANNEL CAPACITY Wllam A. Pearlman 2002 References: S. Armoto - IEEE Trans. Inform. Thy., Jan. 1972 R. Blahut - IEEE Trans. Inform. Thy., July 1972 Recall
More informationLinear Classification, SVMs and Nearest Neighbors
1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush
More informationLearning from Data 1 Naive Bayes
Learnng from Data 1 Nave Bayes Davd Barber dbarber@anc.ed.ac.uk course page : http://anc.ed.ac.uk/ dbarber/lfd1/lfd1.html c Davd Barber 2001, 2002 1 Learnng from Data 1 : c Davd Barber 2001,2002 2 1 Why
More informationThe Prncpal Component Transform The Prncpal Component Transform s also called Karhunen-Loeve Transform (KLT, Hotellng Transform, oregenvector Transfor
Prncpal Component Transform Multvarate Random Sgnals A real tme sgnal x(t can be consdered as a random process and ts samples x m (m =0; ;N, 1 a random vector: The mean vector of X s X =[x0; ;x N,1] T
More informationLecture 10 Support Vector Machines. Oct
Lecture 10 Support Vector Machnes Oct - 20-2008 Lnear Separators Whch of the lnear separators s optmal? Concept of Margn Recall that n Perceptron, we learned that the convergence rate of the Perceptron
More informationFinite Mixture Models and Expectation Maximization. Most slides are from: Dr. Mario Figueiredo, Dr. Anil Jain and Dr. Rong Jin
Fnte Mxture Models and Expectaton Maxmzaton Most sldes are from: Dr. Maro Fgueredo, Dr. Anl Jan and Dr. Rong Jn Recall: The Supervsed Learnng Problem Gven a set of n samples X {(x, y )},,,n Chapter 3 of
More informationBOUNDEDNESS OF THE RIESZ TRANSFORM WITH MATRIX A 2 WEIGHTS
BOUNDEDNESS OF THE IESZ TANSFOM WITH MATIX A WEIGHTS Introducton Let L = L ( n, be the functon space wth norm (ˆ f L = f(x C dx d < For a d d matrx valued functon W : wth W (x postve sem-defnte for all
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More informationThe conjugate prior to a Bernoulli is. A) Bernoulli B) Gaussian C) Beta D) none of the above
The conjugate pror to a Bernoull s A) Bernoull B) Gaussan C) Beta D) none of the above The conjugate pror to a Gaussan s A) Bernoull B) Gaussan C) Beta D) none of the above MAP estmates A) argmax θ p(θ
More informationCSCE 790S Background Results
CSCE 790S Background Results Stephen A. Fenner September 8, 011 Abstract These results are background to the course CSCE 790S/CSCE 790B, Quantum Computaton and Informaton (Sprng 007 and Fall 011). Each
More informationFirst Year Examination Department of Statistics, University of Florida
Frst Year Examnaton Department of Statstcs, Unversty of Florda May 7, 010, 8:00 am - 1:00 noon Instructons: 1. You have four hours to answer questons n ths examnaton.. You must show your work to receve
More informationQuantum Mechanics I - Session 4
Quantum Mechancs I - Sesson 4 Aprl 3, 05 Contents Operators Change of Bass 4 3 Egenvectors and Egenvalues 5 3. Denton....................................... 5 3. Rotaton n D....................................
More informationFisher Linear Discriminant Analysis
Fsher Lnear Dscrmnant Analyss Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan Fsher lnear
More informationLecture 4. Instructor: Haipeng Luo
Lecture 4 Instructor: Hapeng Luo In the followng lectures, we focus on the expert problem and study more adaptve algorthms. Although Hedge s proven to be worst-case optmal, one may wonder how well t would
More informationConvexity preserving interpolation by splines of arbitrary degree
Computer Scence Journal of Moldova, vol.18, no.1(52), 2010 Convexty preservng nterpolaton by splnes of arbtrary degree Igor Verlan Abstract In the present paper an algorthm of C 2 nterpolaton of dscrete
More informationDr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur
Analyss of Varance and Desgn of Experment-I MODULE VII LECTURE - 3 ANALYSIS OF COVARIANCE Dr Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur Any scentfc experment s performed
More information