6.867 Machine learning, lecture 7 (Jaakkola)



Lecture topics:
- Kernel form of linear regression
- Kernels, examples, construction, properties

Linear regression and kernels

Consider a slightly simpler model where we omit the offset parameter $\theta_0$, reducing the model to $y = \theta^T \phi(x) + \epsilon$, where $\phi(x)$ is a particular feature expansion (e.g., polynomial). Our goal here is to turn both the estimation problem and the subsequent prediction task into forms that involve only inner products between the feature vectors.

We have already emphasized that regularization is necessary in conjunction with mapping examples to higher dimensional feature vectors. The regularized least squares objective to be minimized, with parameter $\lambda$, is given by

$$J(\theta) = \sum_{t=1}^{n} \left( y_t - \theta^T \phi(x_t) \right)^2 + \lambda \|\theta\|^2 \tag{1}$$

This form can be derived from penalized log-likelihood estimation (see previous lecture notes). The effect of the regularization penalty is to pull all the parameters towards zero. So any linear dimensions in the parameters that the training feature vectors do not pertain to are set explicitly to zero. We would therefore expect the optimal parameters to lie in the span of the feature vectors corresponding to the training examples. This is indeed the case.

As before, the optimality condition for $\theta$ follows from setting the gradient to zero:

$$\frac{dJ(\theta)}{d\theta} = -2 \sum_{t=1}^{n} \underbrace{\left( y_t - \theta^T \phi(x_t) \right)}_{\alpha_t} \phi(x_t) + 2\lambda\theta = 0 \tag{2}$$

We can therefore construct the optimal $\theta$ in terms of prediction differences $\alpha_t$ and the feature vectors:

$$\theta = \frac{1}{\lambda} \sum_{t=1}^{n} \alpha_t \phi(x_t) \tag{3}$$

The implication is that the optimal $\theta$ (however high dimensional) will lie in the span of the feature vectors corresponding to the training examples. This is due to the regularization penalty we added. But how do we set $\alpha_t$? The values for $\alpha_t$ can be found by insisting that they indeed can be interpreted as prediction differences:

$$\alpha_t = y_t - \theta^T \phi(x_t) = y_t - \frac{1}{\lambda} \sum_{t'=1}^{n} \alpha_{t'} \phi(x_{t'})^T \phi(x_t) \tag{4}$$

Thus $\alpha_t$ depends only on the actual responses $y_t$ and the inner products between the training examples, the Gram matrix:

$$K = \begin{bmatrix} \phi(x_1)^T \phi(x_1) & \cdots & \phi(x_1)^T \phi(x_n) \\ \vdots & \ddots & \vdots \\ \phi(x_n)^T \phi(x_1) & \cdots & \phi(x_n)^T \phi(x_n) \end{bmatrix} \tag{5}$$

In vector form,

$$a = [\alpha_1, \ldots, \alpha_n]^T \tag{6}$$

$$y = [y_1, \ldots, y_n]^T \tag{7}$$

$$a = y - \frac{1}{\lambda} K a \tag{8}$$

the solution is

$$\hat{a} = \lambda \left( \lambda I + K \right)^{-1} y \tag{9}$$

Note that finding the estimates $\hat{\alpha}_t$ requires inverting an $n \times n$ matrix. This is the cost of dealing with inner products as opposed to handling feature vectors directly. In some cases, the benefit is substantial since the feature vectors in the inner products may be infinite dimensional but never needed explicitly.

As a result of finding $\hat{\alpha}_t$ we can cast the predictions for new examples also in terms of inner products:

$$y = \hat{\theta}^T \phi(x) = \sum_{t=1}^{n} (\hat{\alpha}_t / \lambda)\, \phi(x_t)^T \phi(x) = \sum_{t=1}^{n} (\hat{\alpha}_t / \lambda)\, K(x_t, x) \tag{10}$$

where we view $K(x_t, x)$ as a kernel function, a function of two arguments $x_t$ and $x$.

Kernels

So we have now successfully turned a regularized linear regression problem into a kernel form. This means that we can simply substitute different kernel functions $K(x, x')$ into the estimation/prediction equations. This gives us easy access to a wide range of possible regression functions. Here are a couple of standard examples of kernels:

Polynomial kernel

$$K(x, x') = (1 + x^T x')^p, \quad p = 1, 2, \ldots \tag{11}$$

Radial basis kernel

$$K(x, x') = \exp\left( -\frac{\beta}{2} \|x - x'\|^2 \right), \quad \beta > 0 \tag{12}$$

We have already discussed the feature vectors corresponding to the polynomial kernel. The components of these feature vectors were polynomial terms up to degree $p$ with specifically chosen coefficients. The restricted choice of coefficients was necessary in order to collapse the inner product calculations.

The feature vectors corresponding to the radial basis kernel are infinite dimensional! The components of these vectors are indexed by $z \in R^d$ where $d$ is the dimension of the original input $x$. More precisely, the feature vectors are functions:

$$\phi_z(x) = c(\beta, d)\, N(z; x, 1/(2\beta)) \tag{13}$$

where $N(z; x, 1/(2\beta))$ is a normal pdf over $z$ and $c(\beta, d)$ is a constant. Roughly speaking, the radial basis kernel measures the probability that you would get the same sample $z$ (in the same small region) from two normal distributions with means $x$ and $x'$ and a common variance $1/(2\beta)$. This is a reasonable measure of similarity between $x$ and $x'$ and kernels are often defined from this perspective. The inner product giving rise to the radial basis kernel is defined through integration

$$K(x, x') = \int \phi_z(x)\, \phi_z(x')\, dz \tag{14}$$

We can also construct various types of kernels from simpler ones. Here are a few rules to guide us. Assume $K_1(x, x')$ and $K_2(x, x')$ are valid kernels (correspond to inner products of some feature vectors), then

1. $K(x, x') = f(x) K_1(x, x') f(x')$ for any function $f(x)$,
2. $K(x, x') = K_1(x, x') + K_2(x, x')$,
3. $K(x, x') = K_1(x, x') K_2(x, x')$

are all valid kernels. While simple, these rules are quite powerful. Let's first understand these rules from the point of view of the implicit feature vectors. For each rule, let $\phi(x)$ be the feature vector corresponding to $K$ and $\phi^{(1)}(x)$ and $\phi^{(2)}(x)$ the feature vectors associated with $K_1$ and $K_2$, respectively. The feature mapping for the first rule is given simply by multiplying with the scalar function $f(x)$:

$$\phi(x) = f(x)\, \phi^{(1)}(x) \tag{15}$$

so that $\phi(x)^T \phi(x') = f(x)\, \phi^{(1)}(x)^T \phi^{(1)}(x')\, f(x') = f(x) K_1(x, x') f(x')$. The second rule, adding kernels, corresponds to just concatenating the feature vectors

$$\phi(x) = \begin{bmatrix} \phi^{(1)}(x) \\ \phi^{(2)}(x) \end{bmatrix} \tag{16}$$

The third and the last rule is a little more complicated but not much. Suppose we use a double index $i, j$ to index the components of $\phi(x)$ where $i$ ranges over the components of $\phi^{(1)}(x)$ and $j$ refers to the components of $\phi^{(2)}(x)$. Then

$$\phi_{i,j}(x) = \phi^{(1)}_i(x)\, \phi^{(2)}_j(x) \tag{17}$$

It is now easy to see that

$$K(x, x') = \phi(x)^T \phi(x') \tag{18}$$

$$= \sum_{i,j} \phi_{i,j}(x)\, \phi_{i,j}(x') \tag{19}$$

$$= \sum_{i,j} \phi^{(1)}_i(x)\, \phi^{(2)}_j(x)\, \phi^{(1)}_i(x')\, \phi^{(2)}_j(x') \tag{20}$$

$$= \Big[ \sum_i \phi^{(1)}_i(x)\, \phi^{(1)}_i(x') \Big] \Big[ \sum_j \phi^{(2)}_j(x)\, \phi^{(2)}_j(x') \Big] \tag{21}$$

$$= \big[ \phi^{(1)}(x)^T \phi^{(1)}(x') \big] \big[ \phi^{(2)}(x)^T \phi^{(2)}(x') \big] \tag{22}$$

$$= K_1(x, x')\, K_2(x, x') \tag{23}$$

These construction rules can also be used to verify that something is a valid kernel. As an example, let's figure out why a radial basis kernel

$$K(x, x') = \exp\{ -\tfrac{1}{2} \|x - x'\|^2 \} \tag{24}$$

is a valid kernel.

$$\exp\{ -\tfrac{1}{2} \|x - x'\|^2 \} = \exp\{ -\tfrac{1}{2} x^T x + x^T x' - \tfrac{1}{2} x'^T x' \} \tag{25}$$

$$= \underbrace{\exp\{ -\tfrac{1}{2} x^T x \}}_{f(x)} \cdot \exp\{ x^T x' \} \cdot \underbrace{\exp\{ -\tfrac{1}{2} x'^T x' \}}_{f(x')} \tag{26}$$

Here $\exp\{x^T x'\}$ is a sum of simple products $x^T x'$ (expand the exponential as the series $\sum_{k \ge 0} (x^T x')^k / k!$, whose coefficients are non-negative) and is therefore a kernel based on the second and third rules; the first rule allows us to incorporate $f(x)$ and $f(x')$.

String kernels. It is often necessary to make predictions (classify, assess risk, determine user ratings) on the basis of more complex objects such as variable length sequences or graphs that do not necessarily permit a simple description as points in $R^d$. The idea of kernels extends to such objects as well. Consider, for example, the case where the inputs $x$ are variable length sequences (e.g., documents or biosequences) with elements from some common alphabet $A$ (e.g., letters or protein residues). One way to compare such sequences is to consider subsequences that they may share. Let $u \in A^k$ denote a length $k$ sequence from this alphabet and $i$ a sequence of $k$ indexes. So, for example, we can say that $u = x[i]$ if $u_1 = x_{i_1}, u_2 = x_{i_2}, \ldots, u_k = x_{i_k}$. In other words, $x$ contains the elements of $u$ in positions $i_1 < i_2 < \cdots < i_k$. If the elements of $u$ are found in successive positions in $x$, then $i_k - i_1 = k - 1$. A simple string kernel corresponds to feature vectors with counts of occurrences of length $k$ subsequences:

$$\phi_u(x) = \sum_{i:\, u = x[i]} \delta(i_k - i_1,\, k - 1) \tag{27}$$

In other words, the components are indexed by subsequences $u$ and the value of the $u$-component is the number of times $x$ contains $u$ as a contiguous subsequence. For example,

$$\phi_{\text{on}}(\text{the common construct}) = 2 \tag{28}$$

The number of components in such feature vectors is very large (exponential in $k$). Yet, the inner product

$$\sum_{u \in A^k} \phi_u(x)\, \phi_u(x') \tag{29}$$

can be computed efficiently (there are only a limited number of possible contiguous subsequences in $x$ and $x'$). The reason for this difference, and the argument in favor of kernels more generally, is that the feature vectors have to aggregate the information necessary to compare any two sequences while the inner product is evaluated for two specific sequences.

We can also relax the requirement that matches must be contiguous. To this end, we define the length of the window of $x$ where $u$ appears as $l(i) = i_k - i_1$. The feature vectors in a weighted gapped substring kernel are given by

$$\phi_u(x) = \sum_{i:\, u = x[i]} \lambda^{l(i)} \tag{30}$$

where the parameter $\lambda \in (0, 1)$ specifies the penalty for non-contiguous matches to $u$. The resulting kernel

$$K(x, x') = \sum_{u \in A^k} \phi_u(x)\, \phi_u(x') = \sum_{u \in A^k} \sum_{i:\, u = x[i]} \sum_{i':\, u = x'[i']} \lambda^{l(i)} \lambda^{l(i')} \tag{31}$$

can be computed recursively. It is often useful to normalize such a kernel so as to remove any immediate effect from the sequence length:

$$\tilde{K}(x, x') = \frac{K(x, x')}{\sqrt{K(x, x)\, K(x', x')}} \tag{32}$$

Appendix (optional): Kernel linear regression with offset

Given a feature expansion specified by $\phi(x)$ we try to minimize

$$J(\theta, \theta_0) = \sum_{t=1}^{n} \left( y_t - \theta^T \phi(x_t) - \theta_0 \right)^2 + \lambda \|\theta\|^2 \tag{33}$$

where we have chosen not to regularize $\theta_0$ to preserve the similarity to classification discussed later on. Not regularizing $\theta_0$ means, e.g., that we do not care whether all the responses have a constant added to them; the value of the objective, after optimizing $\theta_0$, would remain the same with or without such a constant. Setting the derivatives with respect to $\theta_0$ and $\theta$ to zero gives the following optimality conditions:

$$\frac{dJ(\theta, \theta_0)}{d\theta_0} = -2 \sum_{t=1}^{n} \left( y_t - \theta^T \phi(x_t) - \theta_0 \right) = 0 \tag{34}$$

$$\frac{dJ(\theta, \theta_0)}{d\theta} = 2\lambda\theta - 2 \sum_{t=1}^{n} \underbrace{\left( y_t - \theta^T \phi(x_t) - \theta_0 \right)}_{\alpha_t} \phi(x_t) = 0 \tag{35}$$

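We can therefore construct the optimal $\theta$ in terms of prediction differences $\alpha_t$ and the feature vectors as before:

$$\theta = \frac{1}{\lambda} \sum_{t=1}^{n} \alpha_t \phi(x_t) \tag{36}$$

Using this form of the solution for $\theta$ and Eq. (34) we can also express the optimal $\theta_0$ as a function of the prediction differences $\alpha_t$:

$$\theta_0 = \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \theta^T \phi(x_t) \right) = \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \frac{1}{\lambda} \sum_{t'=1}^{n} \alpha_{t'} \phi(x_{t'})^T \phi(x_t) \right) \tag{37}$$

We can now constrain $\alpha_t$ to take on values that can indeed be interpreted as prediction differences:

$$\alpha_i = y_i - \theta^T \phi(x_i) - \theta_0 \tag{38}$$

$$= y_i - \frac{1}{\lambda} \sum_{t'=1}^{n} \alpha_{t'} \phi(x_{t'})^T \phi(x_i) - \theta_0 \tag{39}$$

$$= y_i - \frac{1}{\lambda} \sum_{t'=1}^{n} \alpha_{t'} \phi(x_{t'})^T \phi(x_i) - \frac{1}{n} \sum_{t=1}^{n} \left( y_t - \frac{1}{\lambda} \sum_{t'=1}^{n} \alpha_{t'} \phi(x_{t'})^T \phi(x_t) \right) \tag{40}$$

$$= y_i - \frac{1}{n} \sum_{t=1}^{n} y_t - \frac{1}{\lambda} \sum_{t'=1}^{n} \alpha_{t'} \left( \phi(x_{t'})^T \phi(x_i) - \frac{1}{n} \sum_{t=1}^{n} \phi(x_{t'})^T \phi(x_t) \right) \tag{41}$$

With the same matrix notation as before, and letting $\mathbf{1} = [1, \ldots, 1]^T$, we can rewrite the above condition as

$$a = \underbrace{(I - \mathbf{1}\mathbf{1}^T / n)}_{C}\, y - \frac{1}{\lambda} (I - \mathbf{1}\mathbf{1}^T / n) K a \tag{42}$$

where $C = I - \mathbf{1}\mathbf{1}^T / n$ is a centering matrix. Any solution to the above equation has to satisfy $\mathbf{1}^T a = 0$ (just left multiply the equation with $\mathbf{1}^T$). Note that this is exactly the optimality condition for $\theta_0$ in Eq. (34). Using this "summing to zero" property of the solution we can rewrite the above equation as

$$a = C y - \frac{1}{\lambda} C K C a \tag{43}$$

where we have introduced an additional centering operation on the right hand side. This cannot change the solution since $C a = a$ whenever $\mathbf{1}^T a = 0$. The solution $\hat{a}$ is then

$$\hat{a} = \lambda \left( \lambda I + C K C \right)^{-1} C y \tag{44}$$

Once we have $\hat{a}$ we can reconstruct $\hat{\theta}_0$ from Eq. (37). $\hat{\theta}^T \phi(x)$ reduces to the kernel form as before.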
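The appendix solution can be checked numerically. The following is a minimal sketch of Eqs. (44) and (37), assuming a radial basis kernel and a tiny made-up one-dimensional data set. The function names (`rbf_kernel`, `solve`, `fit_with_offset`, `predict`), the data, and the value of λ are illustrative choices, not part of the lecture notes. A small hand-written Gaussian elimination routine stands in for a linear algebra library so that the sketch runs with the Python standard library only.

```python
import math

def rbf_kernel(x, z, beta=1.0):
    """Radial basis kernel exp(-(beta/2) * ||x - z||^2), cf. Eq. (12)."""
    return math.exp(-0.5 * beta * sum((a - b) ** 2 for a, b in zip(x, z)))

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][c] * v[c] for c in range(r + 1, n))
        v[r] = (M[r][n] - tail) / M[r][r]
    return v

def fit_with_offset(X, y, lam, kernel=rbf_kernel):
    """Return (alpha_hat, theta0_hat) following Eqs. (44) and (37)."""
    n = len(X)
    K = [[kernel(xi, xj) for xj in X] for xi in X]  # Gram matrix
    row_mean = [sum(K[i]) / n for i in range(n)]
    total_mean = sum(row_mean) / n
    # (CKC)_ij = K_ij - mean_i - mean_j + total mean (K is symmetric);
    # lam is added on the diagonal to form lam*I + CKC.
    A = [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
          + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    y_mean = sum(y) / n
    Cy = [yt - y_mean for yt in y]                   # centered responses
    alpha = [lam * v for v in solve(A, Cy)]          # Eq. (44)
    Ka = [sum(K[t][s] * alpha[s] for s in range(n)) for t in range(n)]
    theta0 = sum(y[t] - Ka[t] / lam for t in range(n)) / n  # Eq. (37)
    return alpha, theta0

def predict(x, X, alpha, theta0, lam, kernel=rbf_kernel):
    """Kernel-form prediction: sum_t (alpha_t / lam) K(x_t, x) + theta0."""
    return sum(a * kernel(xt, x) for a, xt in zip(alpha, X)) / lam + theta0

# Illustrative data (made up for this sketch).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 2.0, 5.0]
lam = 0.5
alpha, theta0 = fit_with_offset(X, y, lam)
shifted_alpha, shifted_theta0 = fit_with_offset(X, [v + 10.0 for v in y], lam)
```

In this sketch the fitted coefficients sum to zero up to rounding, as required by Eq. (34), and each `alpha[i]` equals the training residual `y[i] - predict(X[i], ...)`. Adding a constant to all responses leaves the coefficients unchanged and shifts the offset by that constant, which is the invariance the appendix attributes to leaving $\theta_0$ unregularized.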
