6.891: Lecture 24 (December 8th, 2003) Kernel Methods


1 6.891: Lecture 24 (December 8th, 2003) Kernel Methods

2 Overview
- Recap: global linear models
- New representations from old representations
- A computational trick
- Kernels for NLP structures
- Conclusions: 10 Ideas from the Course

3 Three Components of Global Linear Models
- Φ is a function that maps a structure (x, y) to a feature vector Φ(x, y) ∈ R^d
- GEN is a function that maps an input x to a set of candidates GEN(x)
- W is a parameter vector (also a member of R^d)
- Training data is used to set the value of W

4 Component 1: Φ
- Φ maps a candidate to a feature vector ∈ R^d
- Φ defines the representation of a candidate
[Figure: a parse tree for "She announced a program to promote safety in trucks and vans" is mapped by Φ to the vector ⟨1, 0, 2, 0, 0, 15, 5⟩]

5 Component 2: GEN
- GEN enumerates a set of candidates for a sentence
[Figure: the sentence "She announced a program to promote safety in trucks and vans" is mapped by GEN to a set of candidate parse trees]

6 Component 2: GEN
- GEN enumerates a set of candidates for an input x
- Some examples of how GEN(x) can be defined:
  - Parsing: GEN(x) is the set of parses for x under a grammar
  - Any task: GEN(x) is the top N most probable parses under a history-based model
  - Tagging: GEN(x) is the set of all possible tag sequences with the same length as x
  - Translation: GEN(x) is the set of all possible English translations for the French sentence x

7 Component 3: W
- W is a parameter vector ∈ R^d
- Φ and W together map a candidate to a real-valued score
[Figure: the parse tree for "She announced a program to promote safety in trucks and vans" is mapped by Φ to ⟨1, 0, 2, 0, 0, 15, 5⟩, and then by Φ·W to the score
  ⟨1, 0, 2, 0, 0, 15, 5⟩ · ⟨1.9, −0.3, 0.2, 1.3, 0, 1.0, −2.3⟩ = 5.8]

8 Putting it all Together
- X is the set of sentences, Y is the set of possible outputs (e.g. trees)
- Need to learn a function F : X → Y
- GEN, Φ, W define
  F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
  Choose the highest scoring candidate as the most plausible structure
- Given examples (x_i, y_i), how to set W?

9 [Figure: the sentence "She announced a program to promote safety in trucks and vans" is mapped by GEN to a set of candidate parses; each candidate is mapped by Φ to a feature vector (e.g. ⟨1, 1, 3, 5⟩, ⟨2, 0, 0, 5⟩, ⟨1, 0, 1, 5⟩, ⟨0, 0, 3, 0⟩, ⟨0, 1, 0, 5⟩, ⟨0, 0, 1, 5⟩); each vector is scored by Φ·W, and the argmax over candidates selects the highest-scoring parse]

10 Variant of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 … n
Initialization: W = 0
Define: F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
Algorithm: For t = 1 … T, i = 1 … n
    z_i = F(x_i)
    If (z_i ≠ y_i)  W = W + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: Parameters W
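As a minimal illustration, the algorithm above can be written in a few lines of Python. This is only a sketch: gen, phi, the dimension d, and the training pairs are assumed to be supplied by the caller (they are not defined in the slides), and feature vectors are NumPy arrays.

```python
import numpy as np

def train_perceptron(examples, gen, phi, d, T=5):
    """Structured perceptron of slide 10.

    examples: list of (x, y) pairs
    gen:      gen(x) -> iterable of candidate outputs for x
    phi:      phi(x, y) -> NumPy array of length d
    """
    w = np.zeros(d)
    for t in range(T):
        for x, y in examples:
            # F(x) = argmax over candidates of phi(x, y) . w
            z = max(gen(x), key=lambda cand: np.dot(phi(x, cand), w))
            if z != y:
                # mistake: move towards the gold structure, away from the prediction
                w += phi(x, y) - phi(x, z)
    return w
```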

11 Theory Underlying the Algorithm
- Definition: GEN̄(x_i) = GEN(x_i) − {y_i}
- Definition: The training set is separable with margin δ if there is a vector U ∈ R^d with ||U|| = 1 such that
  ∀i, ∀z ∈ GEN̄(x_i):   U · Φ(x_i, y_i) − U · Φ(x_i, z) ≥ δ

12 GEOMETRIC INTUITION BEHIND SEPARATION
[Figure: the correct candidate and the incorrect candidates plotted as points in feature space]

13 GEOMETRIC INTUITION BEHIND SEPARATION
[Figure: the same points, with the direction U separating the correct candidate from the incorrect candidates]

14 ALL EXAMPLES ARE SEPARATED
[Figure: the direction U separates the correct candidate from the incorrect candidates for every training example]

15 THEORY UNDERLYING THE ALGORITHM
Theorem: For any training sequence (x_i, y_i) which is separable with margin δ, for the perceptron algorithm
  Number of mistakes ≤ R² / δ²
where R is a constant such that ∀i, ∀z ∈ GEN̄(x_i):  ||Φ(x_i, y_i) − Φ(x_i, z)|| ≤ R
Proof: Direct modification of the proof for the classification case.

16 Proof: Let W^k be the weights before the k-th mistake, with W^1 = 0.
If the k-th mistake is made at the i-th example, with z_i = argmax_{y ∈ GEN(x_i)} Φ(y) · W^k, then
  W^{k+1} = W^k + Φ(y_i) − Φ(z_i)
  U · W^{k+1} = U · W^k + U · Φ(y_i) − U · Φ(z_i) ≥ U · W^k + δ
  ⇒ U · W^{k+1} ≥ kδ, and hence ||W^{k+1}|| ≥ kδ
Also,
  ||W^{k+1}||² = ||W^k||² + ||Φ(y_i) − Φ(z_i)||² + 2 W^k · (Φ(y_i) − Φ(z_i)) ≤ ||W^k||² + R²
  (the cross term is ≤ 0, since z_i maximizes Φ(y) · W^k)
  ⇒ ||W^{k+1}||² ≤ kR²
Combining,  k²δ² ≤ ||W^{k+1}||² ≤ kR²  ⇒  k ≤ R² / δ²

17 Overview
- Recap: global linear models
- New representations from old representations
- A computational trick
- Kernels for NLP structures

18 New Representations from Old Representations
- Say we have an existing representation Φ(x, y)
- Our global linear model will learn parameters W such that
  F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
- This is a linear model: but perhaps the linearity assumption is bad?

19 New Representations from Old Representations
- Say we have an existing representation of size d = 2:
  Φ(x, y) = {Φ_1(x, y), Φ_2(x, y)}
- We define a new representation Φ'(x, y) of dimension d' = O(d²) that contains every quadratic term in Φ(x, y):
  Φ'(x, y) = {Φ_1(x, y), Φ_2(x, y), Φ_1(x, y)², Φ_2(x, y)², Φ_1(x, y) Φ_2(x, y)}
- A global linear model under the Φ' representation,
  Φ'(x, y) · W' = W'_1 Φ_1(x, y) + W'_2 Φ_2(x, y) + W'_3 Φ_1(x, y)² + W'_4 Φ_2(x, y)² + W'_5 Φ_1(x, y) Φ_2(x, y)
  is linear in the new space Φ', but non-linear in the old space Φ
- Basic idea: explicitly form new feature vectors Φ' from Φ, and run the perceptron in the new space Φ'

20 More Generally
- Say we have an existing representation Φ (writing Φ_i instead of Φ_i(x, y) for brevity):
  Φ(x, y) = {Φ_1, Φ_2, …, Φ_d}
- We define a new representation Φ'(x, y) of dimension d' = O(d²) that contains every quadratic term in Φ(x, y):
  Φ'(x, y) = {Φ'_1, Φ'_2, …, Φ'_{d'}}
           = {Φ_1, Φ_2, …, Φ_d,
              Φ_1², Φ_2², …, Φ_d²,
              Φ_1 Φ_2, Φ_1 Φ_3, …, Φ_1 Φ_d,
              Φ_2 Φ_1, Φ_2 Φ_3, …, Φ_2 Φ_d,
              …,
              Φ_d Φ_1, Φ_d Φ_2, …, Φ_d Φ_{d−1}}
- Problem: the size of Φ' quickly gets very large ⇒ computational efficiency becomes a real problem
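To make the blow-up concrete, here is a small Python sketch (not from the slides) that builds the explicit quadratic map Φ' of slide 20 from a base vector; d base features become d² + d expanded features.

```python
import numpy as np

def quadratic_expand(phi_vec):
    """Explicit quadratic expansion of slide 20: the original features,
    their squares, and all pairwise products Phi_i * Phi_j with i != j."""
    phi_vec = np.asarray(phi_vec, dtype=float)
    d = len(phi_vec)
    cross = [phi_vec[i] * phi_vec[j] for i in range(d) for j in range(d) if i != j]
    return np.concatenate([phi_vec, phi_vec ** 2, np.array(cross)])

# d = 100 base features already give 100 + 100 + 100*99 = 10100 expanded features
print(len(quadratic_expand(np.ones(100))))
```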

21 Overview
- Recap: global linear models
- New representations from old representations
- A computational trick
- Kernels for NLP structures

22 Computational Trick (Part 1)
- Now, take feature vectors for a first example (x, y) and for a second example (v, w):
  Φ(x, y) = {Φ_1, Φ_2}    Φ(v, w) = {ρ_1, ρ_2}
- Consider a function K:
  K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))² = (1 + Φ_1 ρ_1 + Φ_2 ρ_2)²
- For example, if
  Φ(x, y) = {1, 3}    Φ(v, w) = {2, 4}
  then
  K((x, y), (v, w)) = (1 + 1·2 + 3·4)² = 225

23 Computational Trick (Part 1)
- Consider a function K:
  K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))² = (1 + Φ_1 ρ_1 + Φ_2 ρ_2)²
- Key point: it can be shown that
  K((x, y), (v, w)) = Φ'(x, y) · Φ'(v, w)
  where
  Φ'(x, y) = {1, √2 Φ_1, √2 Φ_2, Φ_1², Φ_2², √2 Φ_1 Φ_2}
  Φ'(v, w) = {1, √2 ρ_1, √2 ρ_2, ρ_1², ρ_2², √2 ρ_1 ρ_2}
- So: K is an inner product in a new space Φ' that contains all quadratic terms in the original space Φ

24 Proof:
  K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))²
                    = (1 + Φ_1 ρ_1 + Φ_2 ρ_2)²
                    = 1 + Φ_1² ρ_1² + Φ_2² ρ_2² + 2 Φ_1 ρ_1 Φ_2 ρ_2 + 2 Φ_1 ρ_1 + 2 Φ_2 ρ_2
                    = {1, √2 Φ_1, √2 Φ_2, Φ_1², Φ_2², √2 Φ_1 Φ_2} · {1, √2 ρ_1, √2 ρ_2, ρ_1², ρ_2², √2 ρ_1 ρ_2}
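This identity is easy to verify numerically. The sketch below (assuming the two-dimensional vectors of slide 22) compares the kernel value against the explicit inner product in the expanded space; both come out to 225.

```python
import math
import numpy as np

def K(phi, rho):
    """Quadratic kernel (1 + phi . rho)^2."""
    return (1.0 + np.dot(phi, rho)) ** 2

def expand2(v):
    """Explicit map for d = 2: {1, sqrt(2) v1, sqrt(2) v2, v1^2, v2^2, sqrt(2) v1 v2}."""
    s = math.sqrt(2.0)
    return np.array([1.0, s * v[0], s * v[1], v[0] ** 2, v[1] ** 2, s * v[0] * v[1]])

phi = np.array([1.0, 3.0])   # Phi(x, y) from slide 22
rho = np.array([2.0, 4.0])   # Phi(v, w) from slide 22
print(K(phi, rho))                    # 225.0
print(expand2(phi) @ expand2(rho))    # 225.0 (up to floating point)
```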

25 More Generally
- Say we have an existing representation Φ(x, y) = {Φ_1, Φ_2, …, Φ_d} and we take
  K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))²
- Then it can be shown that K((x, y), (v, w)) = Φ'(x, y) · Φ'(v, w), where
  Φ'(x, y) = {Φ'_1, Φ'_2, …, Φ'_{d'}}
           = {1, √2 Φ_1, √2 Φ_2, …, √2 Φ_d,
              Φ_1², Φ_2², …, Φ_d²,
              √2 Φ_1 Φ_2, √2 Φ_1 Φ_3, …, √2 Φ_1 Φ_d,
              √2 Φ_2 Φ_1, √2 Φ_2 Φ_3, …, √2 Φ_2 Φ_d,
              …,
              √2 Φ_d Φ_1, √2 Φ_d Φ_2, …, √2 Φ_d Φ_{d−1}}

26 Variant of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 … n
Initialization: W = 0
Define: F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
Algorithm: For t = 1 … T, i = 1 … n
    z_i = F(x_i)
    If (z_i ≠ y_i)  W = W + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: Parameters W

27 Computational Trick (Part 2)
- In the standard perceptron, we store a parameter vector W, and
  F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
- In the dual form perceptron, we store weights α_{i,y} for all i and for all y ∈ GEN(x_i), and assume the equivalence
  W = Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z)
- We then have
  F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
       = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z) · Φ(x, y)

28 Dual Form of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 … n
Initialization: α_{i,y} = 0 for all i, for all y ∈ GEN(x_i)
Define: F(x) = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z) · Φ(x, y)
Algorithm: For t = 1 … T, i = 1 … n
    z_i = F(x_i)
    If (z_i ≠ y_i)  α_{i,y_i} = α_{i,y_i} + 1,  α_{i,z_i} = α_{i,z_i} − 1
Output: Parameters α_{i,y}
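A minimal Python sketch of the dual form follows, mirroring the pseudocode above. It is only an illustration: gen and a kernel function K((x1, z1), (x2, z2)) are assumed to be supplied by the caller, candidates are assumed hashable so they can key a dictionary, and only the non-zero α values are stored.

```python
def train_dual_perceptron(examples, gen, K, T=5):
    """Dual (kernel) form of the perceptron (slides 28 and 30).

    examples: list of (x, y) pairs
    gen:      gen(x) -> iterable of candidates
    K:        K((x1, z1), (x2, z2)) -> kernel value
    """
    alpha = {}  # (example index i, candidate z) -> alpha_{i,z}

    def score(x, y):
        # sum over stored alphas of alpha_{i,z} * K((x_i, z), (x, y))
        return sum(a * K((examples[i][0], z), (x, y))
                   for (i, z), a in alpha.items() if a != 0)

    def F(x):
        return max(gen(x), key=lambda cand: score(x, cand))

    for t in range(T):
        for i, (x, y) in enumerate(examples):
            z = F(x)
            if z != y:
                alpha[(i, y)] = alpha.get((i, y), 0) + 1
                alpha[(i, z)] = alpha.get((i, z), 0) - 1
    return alpha
```

Plugging in K((x1, z1), (x2, z2)) = (1 + Φ(x1, z1) · Φ(x2, z2))² gives the quadratic-feature model of slide 31 without ever building Φ' explicitly.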

29 Equivalence: W = Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z)

Original Form
  Initialization: W = 0
  Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W
  Algorithm: For t = 1 … T, i = 1 … n
      z_i = F(x_i)
      If (z_i ≠ y_i)  W = W + Φ(y_i) − Φ(z_i)

Dual Form
  Initialization: α_{i,y} = 0 for all i, for all y ∈ GEN(x_i)
  Define: F(x) = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z) · Φ(x, y)
  Algorithm: For t = 1 … T, i = 1 … n
      z_i = F(x_i)
      If (z_i ≠ y_i)  α_{i,y_i} = α_{i,y_i} + 1,  α_{i,z_i} = α_{i,z_i} − 1

30 Dual (Kernel) Form of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 … n
Initialization: α_{i,y} = 0 for all i, for all y ∈ GEN(x_i)
Define: F(x) = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} K((x_i, z), (x, y))
Algorithm: For t = 1 … T, i = 1 … n
    z_i = F(x_i)
    If (z_i ≠ y_i)  α_{i,y_i} = α_{i,y_i} + 1,  α_{i,z_i} = α_{i,z_i} − 1
Output: Parameters α_{i,y}

31 Dual (Kernel) Form of the Perceptron Algorithm
- For example, if we choose
  K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))²
  then the kernel form learns a global linear model
  F(x) = argmax_{y ∈ GEN(x)} Φ'(x, y) · W
  where Φ' is a representation that contains all quadratic terms of the original representation Φ

32 The algorithm returns coefficients α_{i,y} which implicitly define
  W = Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ'(x_i, z)
and
  F(x) = argmax_{y ∈ GEN(x)} Φ'(x, y) · W
       = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} K((x_i, z), (x, y))
We never have to manipulate parameter vectors W or representations Φ'(x, y) directly: everything in training and testing is computed indirectly, through inner products or kernels

33 Computational Efficiency
- Say I is the time taken to calculate K((x, y), (v, w)), and N = Σ_i |GEN(x_i)| is the size of the training set
- In taking T passes over the training set, at most 2Tn values of α_{i,y} can take values other than 0, so computing
  Σ_{i, z ∈ GEN(x_i)} α_{i,z} K((x_i, z), (x, y))
  takes O(nTI) time
- And T passes over the training set take O(nT²IN) time

34 Kernels
- A kernel K is a function of two objects, for example, two sentence/tree pairs (x1, y1) and (x2, y2):
  K((x1, y1), (x2, y2))
- Intuition: K((x1, y1), (x2, y2)) is a measure of the similarity between (x1, y1) and (x2, y2)
- Formally: K((x1, y1), (x2, y2)) is a kernel if it can be shown that there is some feature vector mapping Φ(x, y) such that for all x1, y1, x2, y2:
  K((x1, y1), (x2, y2)) = Φ(x1, y1) · Φ(x2, y2)

35 (Trivial) Example of a Kernel
- Given an existing feature vector representation Φ, define
  K((x1, y1), (x2, y2)) = Φ(x1, y1) · Φ(x2, y2)

36 A More Interesting Kernel
- Given an existing feature vector representation Φ, define
  K((x1, y1), (x2, y2)) = (1 + Φ(x1, y1) · Φ(x2, y2))²
  This can be shown to be an inner product in a new space Φ', where Φ' contains all quadratic terms of Φ
- More generally,
  K((x1, y1), (x2, y2)) = (1 + Φ(x1, y1) · Φ(x2, y2))^p
  can be shown to be an inner product in a new space Φ', where Φ' contains all polynomial terms of Φ up to degree p
- Question: can we come up with specialized kernels for NLP structures?

37 Overview
- Recap: global linear models
- New representations from old representations
- A computational trick
- Kernels for NLP structures

38 NLP Structures
- Trees
  [Figure: a parse tree for "John saw Mary"]
- Tagged sequences, e.g., named entity tagging:
    S        C         N   N      N  S
    Napoleon Bonaparte was exiled to Elba
  S = Start entity, C = Continue entity, N = Not an entity

39 Feature Vectors: Φ
- Φ maps a structure to a feature vector ∈ R^d
- Φ defines the representation of a structure
[Figure: a parse tree for "She announced a program to promote safety in trucks and vans" is mapped by Φ to ⟨1, 0, 2, 0, 0, 15, 5⟩]

40 Features
- A feature is a function on a structure, e.g.,
  h(x) = Number of times a particular subtree is seen in x
[Figure: two example trees T1 and T2 over non-terminals A, B, C, … and terminals a, b, c, …; the subtree in question occurs once in T1 and twice in T2, so h(T1) = 1 and h(T2) = 2]

41 Feature Vectors
- A set of functions h1 … h_d defines a feature vector
  Φ(x) = ⟨h1(x), h2(x), …, h_d(x)⟩
[Figure: for the two example trees, Φ(T1) = ⟨1, 0, 0, 3⟩ and Φ(T2) = ⟨2, 0, 1, 1⟩]

42 All Subtrees Representation [Bod, 1998]
- Given: non-terminal symbols {A, B, C, …} and terminal symbols {a, b, c, …}
- An infinite set of subtrees
  [Figure: example subtrees built from these symbols]
- An infinite set of features, e.g.,
  h3(x, y) = Number of times a particular subtree is seen in (x, y)

43 All Sub-fragments for Tagged Sequences
- Given: terminal symbols {a, b, c, …} and state symbols {S, C, N}
- An infinite set of sub-fragments
  [Figure: example sub-fragments, such as a single state S, a state S emitting the word a, a bigram of states, a bigram of states with an emitted word b, …]
- An infinite set of features, e.g.,
  h3(x) = Number of times the fragment [S emitting the word b] is seen in x

44 Inner Products
- Φ(x) = ⟨h1(x), h2(x), …, h_d(x)⟩
- Inner product ("kernel") between two structures T1 and T2:
  Φ(T1) · Φ(T2) = Σ_{i=1}^{d} h_i(T1) h_i(T2)
[Figure: for the two example trees, Φ(T1) = ⟨1, 0, 0, 3⟩, Φ(T2) = ⟨2, 0, 1, 1⟩, and
  Φ(T1) · Φ(T2) = 1·2 + 0·0 + 0·1 + 3·1 = 5]

45 All Subtrees Representation
- Given: non-terminal symbols {A, B, C, …} and terminal symbols {a, b, c, …}
- An infinite set of subtrees
- Step 1: choose an (arbitrary) mapping from subtrees to integers, and define
  h_i(x) = Number of times subtree i is seen in x
  Φ(x) = ⟨h1(x), h2(x), h3(x), …⟩

46 All Subtrees Representation
- Φ is now huge
- But the inner product Φ(T1) · Φ(T2) can be computed efficiently using dynamic programming.

47 Computing the Inner Product
Define: N1 and N2 are the sets of nodes in T1 and T2 respectively, and
  I_i(n) = 1 if the i-th subtree is rooted at node n, 0 otherwise.
It follows that h_i(T1) = Σ_{n1 ∈ N1} I_i(n1) and h_i(T2) = Σ_{n2 ∈ N2} I_i(n2), so
  Φ(T1) · Φ(T2) = Σ_i h_i(T1) h_i(T2)
                = Σ_i (Σ_{n1 ∈ N1} I_i(n1)) (Σ_{n2 ∈ N2} I_i(n2))
                = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Σ_i I_i(n1) I_i(n2)
                = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)
where Δ(n1, n2) = Σ_i I_i(n1) I_i(n2) is the number of common subtrees rooted at n1 and n2

48 An Example
[Figure: two trees, T1 = A(B(D(d), E(e)), C(F(f), G(g))) and T2 = A(B(D(d), E(e)), C(F(h), G(i)))]
  Φ(T1) · Φ(T2) = Δ(A, A) + Δ(A, B) + … + Δ(B, A) + Δ(B, B) + … + Δ(G, G)
- Most of these terms are 0 (e.g. Δ(A, B)).
- Some are non-zero, e.g. Δ(B, B) = 4: the common subtrees rooted at B are
  [B D E], [B [D d] E], [B D [E e]] and [B [D d] [E e]]

49 Recursive Definition of Δ(n1, n2)
- If the productions at n1 and n2 are different, Δ(n1, n2) = 0
- Else if n1, n2 are pre-terminals, Δ(n1, n2) = 1
- Else
  Δ(n1, n2) = Π_{j=1}^{nc(n1)} (1 + Δ(ch(n1, j), ch(n2, j)))
  where nc(n1) is the number of children of node n1, and ch(n1, j) is the j-th child of n1.
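The recursion turns into a short dynamic program. The following Python sketch is illustrative only: it assumes a hypothetical tree representation in which each node carries a label and a list of children, with words stored as plain strings, and it memoizes Δ so that each node pair is computed once.

```python
from functools import lru_cache

class Node:
    """Hypothetical tree node: a label plus a list of children (words are strings)."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def production(n):
    # A node's production: its label plus the labels/words of its children.
    return (n.label, tuple(c.label if isinstance(c, Node) else c for c in n.children))

def nodes(t):
    yield t
    for c in t.children:
        if isinstance(c, Node):
            yield from nodes(c)

@lru_cache(maxsize=None)
def delta(n1, n2):
    """Number of common subtrees rooted at n1 and n2 (recursion of slide 49)."""
    if production(n1) != production(n2):
        return 0
    if all(not isinstance(c, Node) for c in n1.children):
        return 1  # pre-terminal: the node dominates only words
    result = 1
    for c1, c2 in zip(n1.children, n2.children):
        if isinstance(c1, Node) and isinstance(c2, Node):
            result *= 1 + delta(c1, c2)  # word children add no further choices
    return result

def tree_kernel(t1, t2):
    """Phi(T1) . Phi(T2) = sum over node pairs of delta(n1, n2) (slide 47)."""
    return sum(delta(n1, n2) for n1 in nodes(t1) for n2 in nodes(t2))
```

With memoization the kernel costs time proportional to the number of node pairs |N1| · |N2|, which is the dynamic-programming point of slide 46.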

50 Illustration of the Recursion
[Figure: the two trees from slide 48]
How many subtrees do the root nodes A (in T1) and A (in T2) have in common, i.e., what is Δ(A, A)?
  Δ(B, B) = 4   (the four common subtrees rooted at B: [B D E], [B [D d] E], [B D [E e]], [B [D d] [E e]])
  Δ(C, C) = 1   (the single common subtree [C F G])
  Δ(A, A) = (Δ(B, B) + 1) (Δ(C, C) + 1) = 10

51 [Figure: the ten subtrees that T1 and T2 have in common at the root node A, built by combining the choices at B (bare, or any of the four common B subtrees above) with the choices at C (bare, or [C F G])]

52 The Inner Product for Tagged Sequences
- Define N1 and N2 to be the sets of states in T1 and T2 respectively.
- By a similar argument,
  Φ(T1) · Φ(T2) = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)
  where Δ(n1, n2) is the number of common sub-fragments at n1, n2
- e.g., for the tagged sequences T1 = A/a B/b C/c D/d and T2 = A/a B/b C/e E/e:
  Φ(T1) · Φ(T2) = Δ(A, A) + Δ(A, B) + … + Δ(B, A) + Δ(B, B) + … + Δ(D, E)
  and e.g. Δ(B, B) = 4 (the common sub-fragments at B: the state B alone, B emitting b, and each of these followed by the state C)

53 The Recursive Definition for Tagged Sequences
- Define N(n) = the state following n, and W(n) = the word at state n
- Define π[W(n1), W(n2)] = 1 iff W(n1) = W(n2), and 0 otherwise
- Then, if the labels at n1 and n2 are the same,
  Δ(n1, n2) = (1 + π[W(n1), W(n2)]) (1 + Δ(N(n1), N(n2)))
- e.g., for T1 = A/a B/b C/c D/d and T2 = A/a B/b C/e E/e:
  Δ(A, A) = (1 + π[a, a]) (1 + Δ(B, B)) = (1 + 1) (1 + 4) = 10
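The same idea in code, as a hypothetical sketch: a tagged sequence is represented as a list of (state label, word) pairs, the recursion follows the definition above (taking Δ of a missing following state to be 0, which the slides leave implicit), and a lam argument anticipates the downweighting refinement of slide 55 below (lam = 1 gives the plain count).

```python
def tagging_kernel(seq1, seq2, lam=1.0):
    """All sub-fragments kernel for tagged sequences (slides 52, 53 and 55).

    seq1, seq2: lists of (state label, word) pairs
    lam:        downweighting factor for larger sub-fragments (1.0 = no downweighting)
    """
    def pi(w1, w2):
        return 1.0 if w1 == w2 else 0.0

    def delta(i, j):
        # (weighted) number of common sub-fragments starting at seq1[i] and seq2[j]
        (label1, w1), (label2, w2) = seq1[i], seq2[j]
        if label1 != label2:
            return 0.0
        nxt = delta(i + 1, j + 1) if i + 1 < len(seq1) and j + 1 < len(seq2) else 0.0
        return (1.0 + lam * pi(w1, w2)) * (1.0 + lam * nxt)

    return sum(delta(i, j) for i in range(len(seq1)) for j in range(len(seq2)))

# The worked example of slide 53:
t1 = [("A", "a"), ("B", "b"), ("C", "c"), ("D", "d")]
t2 = [("A", "a"), ("B", "b"), ("C", "e"), ("E", "e")]
# tagging_kernel(t1, t2) sums Delta over all state pairs; at the pair of A states
# the recursion gives (1 + 1) * (1 + 4) = 10, matching Delta(A, A) on the slide.
```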

54 Refinements of the Kernels
- Include the log probability from the baseline model:
  Φ(T1) is the representation under the all sub-fragments kernel
  L(T1) is the log probability under the baseline model
  New representation Φ' where
  Φ'(T1) · Φ'(T2) = β L(T1) L(T2) + Φ(T1) · Φ(T2)
  (this includes L(T1) as an additional component with weight √β)
- Allows the perceptron to use the original ranking as a default
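In code this refinement is a one-line combination; a sketch, assuming a base kernel K and a baseline log-probability function L supplied by the caller:

```python
def combined_kernel(t1, t2, K, L, beta=1.0):
    # beta * L(T1) * L(T2) corresponds to adding the baseline log probability
    # as one extra feature with weight sqrt(beta); K is the base kernel.
    return beta * L(t1) * L(t2) + K(t1, t2)
```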

55 Refinements of the Kernels
- Downweighting larger sub-fragments:
  Σ_{i=1}^{d} λ^{SIZE_i} h_i(T1) h_i(T2)
  where 0 < λ ≤ 1, and SIZE_i is the number of states/rules in the i-th fragment
- This is a simple modification to the recursive definitions, e.g.,
  Δ(n1, n2) = (1 + λ π[W(n1), W(n2)]) (1 + λ Δ(N(n1), N(n2)))

56 Refinement of the Tagging Kernel
(Recall: Δ(n1, n2) = (1 + λ π[W(n1), W(n2)]) (1 + λ Δ(N(n1), N(n2))))
- Make sub-fragments sensitive to spelling features (e.g., capitalization)
- Define π[x, y] = 1 if x and y are identical, and π[x, y] = 0.5 if x and y share the same capitalization features
  [Figure: the fragment N N S over "exiled to Elba" matches the fragment N N S over "exiled to [Cap]"]
- Sub-fragments now include capitalization features
  [Figure: fragments such as N N S over "[No cap] to [Cap]" and N N S over "[No cap] [No cap] [Cap]"]
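A hypothetical sketch of the modified π, assuming a crude helper that reduces a word to a capitalization signature (the slides do not specify the exact feature set):

```python
def cap_pattern(word):
    # crude capitalization signature, e.g. "Elba" -> "Xx", "to" -> "x", "USA" -> "X"
    return "".join(sorted(set("X" if ch.isupper() else "x" for ch in word)))

def pi(w1, w2):
    if w1 == w2:
        return 1.0                           # identical words
    if cap_pattern(w1) == cap_pattern(w2):
        return 0.5                           # same capitalization features, different words
    return 0.0
```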

57 Experimental Results

Parsing, Wall Street Journal (≤ 100 words, 2416 sentences):

  MODEL        LR      LP      CBs   0 CBs   ≤2 CBs
  CO99         —       88.3%   —     —       85.1%
  Perceptron   88.6%   88.9%   —     —       86.3%

This gives a 5.1% relative reduction in error (CO99 = my thesis parser)

Named Entity Tagging on Web Data:

  MODEL         P       R       F
  Max-Ent       84.4%   86.3%   85.3%
  Perc.         86.1%   89.1%   87.6%
  Improvement   10.9%   20.4%   15.6%

This gives a 15.6% relative reduction in error

58 Summary
- For any representation Φ(x), efficient computation of Φ(x) · Φ(y) ⇒ efficient learning through the kernel form of the perceptron
- Dynamic programming can be used to calculate Φ(x) · Φ(y) under all sub-fragments representations
- Several refinements of the inner products:
  - including probabilities from the baseline model
  - downweighting larger sub-fragments
  - sensitivity to spelling features

59 Conclusions: 10 Ideas from the Course
1. Smoothed estimation
2. Probabilistic Context-Free Grammars, and history-based models ⇒ lexicalized parsing
3. Feature-vector representations, and log-linear models ⇒ log-linear models for tagging, parsing
4. The EM algorithm: hidden structure
5. Machine translation: making use of the EM algorithm
6. Global linear models: new representations (global features)
7. Global linear models: new learning algorithms (perceptron, boosting)
8. Partially supervised methods: applications to word sense disambiguation, named entity recognition, and relation extraction
9. Structured models for information extraction, and dialogue systems
10. A final representational trick: kernels
