6.891: Lecture 24 (December 8th, 2003) Kernel Methods
1 6.891: Lecture 24 (December 8th, 2003) Kernel Methods
2 Overview
• Recap: global linear models
• New representations from old representations
• A computational trick
• Kernels for NLP structures
• Conclusions: 10 ideas from the course
3 Three Components of Global Linear Models
• Φ is a function that maps a structure (x, y) to a feature vector Φ(x, y) ∈ R^d
• GEN is a function that maps an input x to a set of candidates GEN(x)
• W is a parameter vector (also a member of R^d)
• Training data is used to set the value of W
4 Component 1: Φ
• Φ maps a candidate to a feature vector ∈ R^d
• Φ defines the representation of a candidate
[Figure: a parse tree for "She announced a program to promote safety in trucks and vans", mapped by Φ to the feature vector ⟨1, 0, 2, 0, 0, 15, 5⟩]
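To make the idea of Φ concrete, here is a minimal Python sketch of one possible feature mapping (not the representation used in the lecture): it assumes candidate parses are encoded as nested tuples and uses only context-free rule counts as features.

    from collections import Counter

    def rules(tree):
        # Yield the context-free rules used in a tree encoded as nested tuples,
        # e.g. ("S", ("NP", "She"), ("VP", ("V", "announced"), ...)).
        label, *children = tree
        if children and isinstance(children[0], tuple):
            yield (label, tuple(child[0] for child in children))
            for child in children:
                yield from rules(child)

    def phi(x, y):
        # A toy Phi(x, y): counts of the rules in the candidate tree y,
        # returned as a sparse vector (a dict from feature name to count).
        return Counter(rules(y))

    t = ("S", ("NP", "She"),
              ("VP", ("V", "announced"),
                     ("NP", ("DT", "a"), ("N", "program"))))
    print(phi("She announced a program", t))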
5 Component 2: GEN
• GEN enumerates a set of candidates for a sentence
[Figure: the sentence "She announced a program to promote safety in trucks and vans" is mapped by GEN to a set of six candidate parse trees]
6 Component 2: GEN
• GEN enumerates a set of candidates for an input x
• Some examples of how GEN(x) can be defined:
  - Parsing: GEN(x) is the set of parses for x under a grammar
  - Any task: GEN(x) is the top N most probable parses under a history-based model
  - Tagging: GEN(x) is the set of all possible tag sequences with the same length as x
  - Translation: GEN(x) is the set of all possible English translations for the French sentence x
7 Component 3: W
• W is a parameter vector ∈ R^d
• Φ and W together map a candidate to a real-valued score
[Figure: the parse tree for "She announced a program to promote safety in trucks and vans" is mapped by Φ to ⟨1, 0, 2, 0, 0, 15, 5⟩, and then scored as
    ⟨1, 0, 2, 0, 0, 15, 5⟩ · ⟨1.9, −0.3, 0.2, 1.3, 0, 1.0, −2.3⟩ = 5.8]
8 Putting it all Together
• X is the set of sentences, Y is the set of possible outputs (e.g. trees)
• Need to learn a function F : X → Y
• GEN, Φ, W define
    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
• Choose the highest scoring candidate as the most plausible structure
• Given examples (x_i, y_i), how do we set W?
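In code, decoding with a global linear model is just an argmax over the candidates. The sketch below assumes GEN(x) returns an explicit (small) list of candidates and that Φ returns sparse dictionaries, as in the toy example above.

    def score(phi_xy, W):
        # Inner product Phi(x, y) . W for sparse feature vectors stored as dicts.
        return sum(count * W.get(feat, 0.0) for feat, count in phi_xy.items())

    def F(x, GEN, phi, W):
        # F(x) = argmax over y in GEN(x) of Phi(x, y) . W
        return max(GEN(x), key=lambda y: score(phi(x, y), W))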
9 [Figure: putting it all together for the sentence "She announced a program to promote safety in trucks and vans": GEN produces six candidate parses, each candidate is mapped by Φ to a feature vector, each vector is scored by the inner product Φ · W, and the argmax over the candidates is returned]
10 A Variant of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 ... n
Initialization: W = 0
Define: F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
Algorithm: For t = 1 ... T, i = 1 ... n
    z_i = F(x_i)
    If (z_i ≠ y_i) then W = W + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: Parameters W
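The training loop translates almost line for line into code. This sketch reuses score and F from above, and again assumes sparse dictionary feature vectors; none of these representational choices are fixed by the algorithm itself.

    from collections import defaultdict

    def train_perceptron(examples, GEN, phi, T=5):
        # Variant of the perceptron: whenever the current best candidate z_i
        # differs from the correct output y_i, update W by
        # Phi(x_i, y_i) - Phi(x_i, z_i).
        W = defaultdict(float)                     # W = 0
        for t in range(T):
            for x_i, y_i in examples:
                z_i = F(x_i, GEN, phi, W)          # current highest-scoring candidate
                if z_i != y_i:
                    for feat, count in phi(x_i, y_i).items():
                        W[feat] += count
                    for feat, count in phi(x_i, z_i).items():
                        W[feat] -= count
        return W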
11 Theory Underlying the Algorithm
• Definition: GEN'(x_i) = GEN(x_i) − {y_i}
• Definition: The training set is separable with margin δ if there is a vector U ∈ R^d with ||U|| = 1 such that
    ∀i, ∀z ∈ GEN'(x_i):  U · Φ(x_i, y_i) − U · Φ(x_i, z) ≥ δ
12 GEOMETRIC INTUITION BEHIND SEPARATION
[Figure: feature vectors in the feature space, with the correct candidate marked differently from the incorrect candidates]
13 GEOMETRIC INTUITION BEHIND SEPARATION
[Figure: the same picture, with the vector U separating the correct candidate from the incorrect candidates]
14 ALL EXAMPLES ARE SEPARATED
[Figure: the correct candidates from two training examples are all separated by U from the incorrect candidates of both examples]
15 THEORY UNDERLYING THE ALGORITHM
Theorem: For any training sequence (x_i, y_i) which is separable with margin δ, for the perceptron algorithm
    Number of mistakes ≤ R²/δ²
where R is a constant such that ∀i, ∀z ∈ GEN'(x_i): ||Φ(x_i, y_i) − Φ(x_i, z)|| ≤ R
Proof: Direct modification of the proof for the classification case.
16 Proof: Let W^k be the weight vector before the k-th mistake is made; W^1 = 0.
Suppose the k-th mistake is made at the i-th example, and write Φ(y) for Φ(x_i, y). If z = argmax_{y ∈ GEN(x_i)} Φ(y) · W^k, then
    W^{k+1} = W^k + Φ(y_i) − Φ(z)
    U · W^{k+1} = U · W^k + U · Φ(y_i) − U · Φ(z) ≥ U · W^k + δ
    ⟹ U · W^{k+1} ≥ kδ, and hence ||W^{k+1}|| ≥ kδ (since ||U|| = 1)
Also,
    ||W^{k+1}||² = ||W^k||² + ||Φ(y_i) − Φ(z)||² + 2 W^k · (Φ(y_i) − Φ(z)) ≤ ||W^k||² + R²
(the last term is ≤ 0 because z was the highest-scoring candidate under W^k)
    ⟹ ||W^{k+1}||² ≤ kR²
Combining the two bounds: k²δ² ≤ ||W^{k+1}||² ≤ kR², so k ≤ R²/δ²
17 Overview
• Recap: global linear models
• New representations from old representations
• A computational trick
• Kernels for NLP structures
18 New Representations from Old Representations
• Say we have an existing representation Φ(x, y)
• Our global linear model will learn parameters W such that
    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
• This is a linear model: but perhaps the linearity assumption is a bad one?
19 New Representations from Old Representations
• Say we have an existing representation of size d = 2:
    Φ(x, y) = {Φ_1(x, y), Φ_2(x, y)}
• We define a new representation Φ'(x, y) of dimension d' = O(d²) that contains every quadratic term in Φ(x, y):
    Φ'(x, y) = {Φ_1(x, y), Φ_2(x, y), Φ_1(x, y)², Φ_2(x, y)², Φ_1(x, y)Φ_2(x, y)}
• A global linear model under the Φ' representation is linear in the new space, but non-linear in the old space Φ:
    Φ'(x, y) · W' = W'_1 Φ_1(x, y) + W'_2 Φ_2(x, y) + W'_3 Φ_1(x, y)² + W'_4 Φ_2(x, y)² + W'_5 Φ_1(x, y)Φ_2(x, y)
• Basic idea: explicitly form the new feature vectors Φ' from Φ, and run the perceptron in the new space
20 More Generally
• Say we have an existing representation Φ (writing Φ_i instead of Φ_i(x, y) for brevity):
    Φ(x, y) = {Φ_1, Φ_2, ..., Φ_d}
• We define a new representation Φ'(x, y) = {Φ'_1, Φ'_2, ..., Φ'_{d'}} of dimension d' = O(d²) that contains every quadratic term in Φ(x, y):
    Φ'(x, y) = {Φ_1, Φ_2, ..., Φ_d,
                Φ_1², Φ_2², ..., Φ_d²,
                Φ_1Φ_2, Φ_1Φ_3, ..., Φ_1Φ_d,
                Φ_2Φ_1, Φ_2Φ_3, ..., Φ_2Φ_d,
                ...,
                Φ_dΦ_1, Φ_dΦ_2, ..., Φ_dΦ_{d−1}}
• Problem: the size of Φ' quickly gets very large ⟹ computational efficiency becomes a real problem
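To see why this is a problem, here is a sketch of the explicit expansion for a dense feature vector: the output has d + d² entries, so even a modest d = 1000 already produces about a million features.

    def expand_quadratic(phi_vec):
        # Explicitly form Phi' from a dense feature vector Phi: the original
        # features plus all pairwise products (which include the squares).
        d = len(phi_vec)
        expanded = list(phi_vec)                              # Phi_1 ... Phi_d
        expanded += [phi_vec[i] * phi_vec[j]
                     for i in range(d) for j in range(d)]     # Phi_i * Phi_j
        return expanded

    print(len(expand_quadratic([0.0] * 1000)))   # 1001000 features from d = 1000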
21 Overview
• Recap: global linear models
• New representations from old representations
• A computational trick
• Kernels for NLP structures
22 A Computational Trick (Part 1)
• Now take the feature vectors for a first example (x, y) and for a second example (v, w):
    Φ(x, y) = {Φ_1, Φ_2}      Φ(v, w) = {ρ_1, ρ_2}
• Consider a function K:
    K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))² = (1 + Φ_1ρ_1 + Φ_2ρ_2)²
• For example, if Φ(x, y) = {1, 3} and Φ(v, w) = {2, 4} then
    K((x, y), (v, w)) = (1 + 2 + 12)² = 225
23 A Computational Trick (Part 1)
• Consider a function K:
    K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))² = (1 + Φ_1ρ_1 + Φ_2ρ_2)²
• Key point: it can be shown that
    K((x, y), (v, w)) = Φ'(x, y) · Φ'(v, w)
where
    Φ'(x, y) = {1, √2 Φ_1, √2 Φ_2, Φ_1², Φ_2², √2 Φ_1Φ_2}
    Φ'(v, w) = {1, √2 ρ_1, √2 ρ_2, ρ_1², ρ_2², √2 ρ_1ρ_2}
• So K is an inner product in a new space that contains all quadratic terms of the original space Φ
24 Proof:
    K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))²
                      = (1 + Φ_1ρ_1 + Φ_2ρ_2)²
                      = 1 + Φ_1²ρ_1² + Φ_2²ρ_2² + 2Φ_1ρ_1Φ_2ρ_2 + 2Φ_1ρ_1 + 2Φ_2ρ_2
                      = {1, √2 Φ_1, √2 Φ_2, Φ_1², Φ_2², √2 Φ_1Φ_2} · {1, √2 ρ_1, √2 ρ_2, ρ_1², ρ_2², √2 ρ_1ρ_2}
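The identity is easy to check numerically; the snippet below verifies it on the example from slide 22 (a sanity check, not part of the lecture).

    import math

    def K(u, v):
        # K = (1 + u . v)^2 for two 2-dimensional feature vectors
        return (1 + u[0] * v[0] + u[1] * v[1]) ** 2

    def phi_prime(u):
        # The expanded representation from the slide:
        # {1, sqrt(2)*u1, sqrt(2)*u2, u1^2, u2^2, sqrt(2)*u1*u2}
        s = math.sqrt(2)
        return [1.0, s * u[0], s * u[1], u[0] ** 2, u[1] ** 2, s * u[0] * u[1]]

    u, v = [1.0, 3.0], [2.0, 4.0]
    lhs = K(u, v)                                                 # 225, as on slide 22
    rhs = sum(a * b for a, b in zip(phi_prime(u), phi_prime(v)))  # also 225, up to rounding
    print(lhs, rhs)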
25 More Generally
• Say we have an existing representation Φ(x, y) = {Φ_1, Φ_2, ..., Φ_d}, and we take
    K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))²
• Then it can be shown that K((x, y), (v, w)) = Φ'(x, y) · Φ'(v, w), where Φ'(x, y) = {Φ'_1, Φ'_2, ..., Φ'_{d'}}:
    Φ'(x, y) = {1,
                √2 Φ_1, √2 Φ_2, ..., √2 Φ_d,
                Φ_1², Φ_2², ..., Φ_d²,
                √2 Φ_1Φ_2, √2 Φ_1Φ_3, ..., √2 Φ_1Φ_d,
                √2 Φ_2Φ_1, √2 Φ_2Φ_3, ..., √2 Φ_2Φ_d,
                ...,
                √2 Φ_dΦ_1, √2 Φ_dΦ_2, ..., √2 Φ_dΦ_{d−1}}
26 A Variant of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 ... n
Initialization: W = 0
Define: F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
Algorithm: For t = 1 ... T, i = 1 ... n
    z_i = F(x_i)
    If (z_i ≠ y_i) then W = W + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: Parameters W
27 A Computational Trick (Part 2)
• In the standard perceptron, we store a parameter vector W, and
    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
• In the dual form perceptron, we store weights α_{i,y} for all i and for all y ∈ GEN(x_i), and assume the equivalence
    W = Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z)
• We then have
    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
         = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z) · Φ(x, y)
28 Dual Form of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 ... n
Initialization: α_{i,y} = 0 for all i, for all y ∈ GEN(x_i)
Define: F(x) = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z) · Φ(x, y)
Algorithm: For t = 1 ... T, i = 1 ... n
    z_i = F(x_i)
    If (z_i ≠ y_i) then α_{i,y_i} = α_{i,y_i} + 1,  α_{i,z_i} = α_{i,z_i} − 1
Output: Parameters α_{i,y}
29 Equivalence: W = Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z)
Original form:
    Initialization: W = 0
    Define: F(x) = argmax_{y ∈ GEN(x)} Φ(y) · W
    Algorithm: For t = 1 ... T, i = 1 ... n
        z_i = F(x_i)
        If (z_i ≠ y_i) then W = W + Φ(y_i) − Φ(z_i)
Dual form:
    Initialization: α_{i,y} = 0 for all i, for all y ∈ GEN(x_i)
    Define: F(x) = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ(x_i, z) · Φ(x, y)
    Algorithm: For t = 1 ... T, i = 1 ... n
        z_i = F(x_i)
        If (z_i ≠ y_i) then α_{i,y_i} = α_{i,y_i} + 1,  α_{i,z_i} = α_{i,z_i} − 1
30 Dual (Kernel) Form of the Perceptron Algorithm
Inputs: Training set (x_i, y_i) for i = 1 ... n
Initialization: α_{i,y} = 0 for all i, for all y ∈ GEN(x_i)
Define: F(x) = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} K((x_i, z), (x, y))
Algorithm: For t = 1 ... T, i = 1 ... n
    z_i = F(x_i)
    If (z_i ≠ y_i) then α_{i,y_i} = α_{i,y_i} + 1,  α_{i,z_i} = α_{i,z_i} − 1
Output: Parameters α_{i,y}
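A sketch of the kernel form in code, assuming candidate structures are hashable (e.g. tuples) so they can serve as dictionary keys; any kernel K over (input, candidate) pairs can be plugged in, including the tree and sequence kernels defined later in the lecture.

    from collections import defaultdict

    def train_kernel_perceptron(examples, GEN, K, T=5):
        # Dual (kernel) form: store integer weights alpha[(i, z)] instead of W,
        # and score candidates only through the kernel K((x_i, z), (x, y)).
        alpha = defaultdict(int)                  # alpha_{i,y} = 0 for all i, y

        def F(x):
            def s(y):
                return sum(a * K((examples[i][0], z), (x, y))
                           for (i, z), a in alpha.items() if a != 0)
            return max(GEN(x), key=s)

        for t in range(T):
            for i, (x_i, y_i) in enumerate(examples):
                z_i = F(x_i)
                if z_i != y_i:
                    alpha[(i, y_i)] += 1          # alpha_{i, y_i} += 1
                    alpha[(i, z_i)] -= 1          # alpha_{i, z_i} -= 1
        return alpha, F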
31 Dual (Kernel) Form of the Perceptron Algorithm
• For example, if we choose
    K((x, y), (v, w)) = (1 + Φ(x, y) · Φ(v, w))²
then the kernel form learns a global linear model
    F(x) = argmax_{y ∈ GEN(x)} Φ'(x, y) · W
where Φ' is a representation that contains all quadratic terms of the original representation Φ
32 • The algorithm returns coefficients α_{i,y} which implicitly define
    W = Σ_{i, z ∈ GEN(x_i)} α_{i,z} Φ'(x_i, z)
and
    F(x) = argmax_{y ∈ GEN(x)} Φ'(x, y) · W
         = argmax_{y ∈ GEN(x)} Σ_{i, z ∈ GEN(x_i)} α_{i,z} K((x_i, z), (x, y))
• We never have to manipulate the parameter vector W or the representations Φ'(x, y) directly: everything in training and testing is computed indirectly, through inner products or kernels
33 Say I is the time taken to calculate K((x, y), (v, w))
Say N = Σ_i |GEN(x_i)| is the size of the training set
• Computational efficiency: in taking T passes over the training set, at most 2Tn values of α_{i,y} can take values other than 0, so computing
    Σ_{i, z ∈ GEN(x_i)} α_{i,z} K((x_i, z), (x, y))
takes O(nTI) time, and T passes over the training set take O(nT²IN) time
34 Kernels
• A kernel K is a function of two objects, for example, two sentence/tree pairs (x1, y1) and (x2, y2):
    K((x1, y1), (x2, y2))
• Intuition: K((x1, y1), (x2, y2)) is a measure of the similarity between (x1, y1) and (x2, y2)
• Formally: K((x1, y1), (x2, y2)) is a kernel if it can be shown that there is some feature vector mapping Φ(x, y) such that for all x1, y1, x2, y2
    K((x1, y1), (x2, y2)) = Φ(x1, y1) · Φ(x2, y2)
35 A (Trivial) Example of a Kernel
• Given an existing feature vector representation Φ, define
    K((x1, y1), (x2, y2)) = Φ(x1, y1) · Φ(x2, y2)
36 A More Interesting Kernel
• Given an existing feature vector representation Φ, define
    K((x1, y1), (x2, y2)) = (1 + Φ(x1, y1) · Φ(x2, y2))²
• This can be shown to be an inner product in a new space Φ', where Φ' contains all quadratic terms of Φ
• More generally,
    K((x1, y1), (x2, y2)) = (1 + Φ(x1, y1) · Φ(x2, y2))^p
can be shown to be an inner product in a new space Φ', where Φ' contains all polynomial terms of Φ up to degree p
• Question: can we come up with specialized kernels for NLP structures?
37 Overview
• Recap: global linear models
• New representations from old representations
• A computational trick
• Kernels for NLP structures
38 NLP Structures
• Trees, e.g.,
    [Figure: a parse tree for "John saw Mary"]
• Tagged sequences, e.g., named entity tagging:
    S        C         N   N      N  S
    Napoleon Bonaparte was exiled to Elba
    (S = Start entity, C = Continue entity, N = Not an entity)
39 Feature Vectors: Φ
• Φ maps a structure to a feature vector ∈ R^d
• Φ defines the representation of a structure
[Figure: the parse tree for "She announced a program to promote safety in trucks and vans", mapped by Φ to ⟨1, 0, 2, 0, 0, 15, 5⟩]
40 Features
• A feature is a function on a structure, e.g.,
    h(x) = Number of times a particular tree fragment is seen in x
[Figure: two example trees T1 and T2; for the fragment in question, h(T1) = 1 and h(T2) = 2]
41 Feature Vectors
• A set of functions h_1 ... h_d defines a feature vector
    Φ(x) = ⟨h_1(x), h_2(x), ..., h_d(x)⟩
[Figure: two example trees T1 and T2, with Φ(T1) = ⟨1, 0, 0, 3⟩ and Φ(T2) = ⟨2, 0, 1, 1⟩]
42 All Subtrees Representation [Bod, 1998]
• Given: non-terminal symbols {A, B, C, ...} and terminal symbols {a, b, c, ...}
• An infinite set of subtrees:
    [Figure: examples of subtrees]
• An infinite set of features, e.g.,
    h_3(x, y) = Number of times a particular subtree is seen in (x, y)
43 All Sub-fragments for Tagged Sequences
• Given: terminal symbols {a, b, c, ...} and state symbols {S, C, N}
• An infinite set of sub-fragments:
    [Figure: examples of sub-fragments, e.g. a single state S, a state S with its word a, a pair of adjacent states, ...]
• An infinite set of features, e.g.,
    h_3(x) = Number of times the fragment [state S over the word b] is seen in x
44 Inner Products
• Φ(x) = ⟨h_1(x), h_2(x), ..., h_d(x)⟩
• Inner product ("kernel") between two structures T1 and T2:
    Φ(T1) · Φ(T2) = Σ_{i=1}^{d} h_i(T1) h_i(T2)
[Figure: the two example trees again, with Φ(T1) = ⟨1, 0, 0, 3⟩, Φ(T2) = ⟨2, 0, 1, 1⟩, and
    Φ(T1) · Φ(T2) = 1·2 + 0·0 + 0·1 + 3·1 = 5]
45 All Subtrees Representation
• Given: non-terminal symbols {A, B, ...} and terminal symbols {a, b, c, ...}
• An infinite set of subtrees:
    [Figure: examples of subtrees]
• Step 1: choose an (arbitrary) mapping from subtrees to integers, and define
    h_i(x) = Number of times subtree i is seen in x
    Φ(x) = ⟨h_1(x), h_2(x), h_3(x), ...⟩
46 All Subtrees Representation
• Φ is now huge
• But the inner product Φ(T1) · Φ(T2) can be computed efficiently using dynamic programming
47 Computing the Inner Product
Define: N1 and N2 are the sets of nodes in T1 and T2 respectively, and
    I_i(x) = 1 if the i-th subtree is rooted at x, 0 otherwise
It follows that h_i(T1) = Σ_{n1 ∈ N1} I_i(n1) and h_i(T2) = Σ_{n2 ∈ N2} I_i(n2), so
    Φ(T1) · Φ(T2) = Σ_i h_i(T1) h_i(T2)
                  = Σ_i (Σ_{n1 ∈ N1} I_i(n1)) (Σ_{n2 ∈ N2} I_i(n2))
                  = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Σ_i I_i(n1) I_i(n2)
                  = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)
where Δ(n1, n2) = Σ_i I_i(n1) I_i(n2) is the number of common subtrees rooted at n1 and n2
48 An Example
    T1 = [A [B [D d] [E e]] [C [F f] [G g]]]
    T2 = [A [B [D d] [E e]] [C [F h] [G i]]]
    Φ(T1) · Φ(T2) = Δ(A, A) + Δ(A, B) + ... + Δ(B, A) + Δ(B, B) + ... + Δ(G, G)
• Most of these terms are 0 (e.g. Δ(A, B))
• Some are non-zero, e.g. Δ(B, B) = 4; the common subtrees rooted at B are
    [B D E], [B [D d] E], [B D [E e]], [B [D d] [E e]]
49 Recursive Definition of Δ(n1, n2)
• If the productions at n1 and n2 are different: Δ(n1, n2) = 0
• Else if n1 and n2 are pre-terminals: Δ(n1, n2) = 1
• Else
    Δ(n1, n2) = Π_{j=1}^{nc(n1)} (1 + Δ(ch(n1, j), ch(n2, j)))
where nc(n1) is the number of children of node n1, and ch(n1, j) is the j-th child of n1
50 Illustration of the Recursion
    T1 = [A [B [D d] [E e]] [C [F f] [G g]]]
    T2 = [A [B [D d] [E e]] [C [F h] [G i]]]
How many subtrees do the two A nodes have in common? i.e., what is Δ(A, A)?
    Δ(B, B) = 4:   [B D E], [B [D d] E], [B D [E e]], [B [D d] [E e]]
    Δ(C, C) = 1:   [C F G]
    Δ(A, A) = (Δ(B, B) + 1) × (Δ(C, C) + 1) = 5 × 2 = 10
51 [Figure: the 10 common subtrees rooted at A, i.e. [A B C] with B left unexpanded or expanded in any of the 4 ways above (5 choices), and C left unexpanded or expanded as [C F G] (2 choices)]
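The whole dynamic program fits in a few lines. This is a sketch, with trees encoded as nested tuples (an encoding chosen here for convenience; it assumes each node's children are either all words or all subtrees), and it reproduces the Δ(A, A) = 10 computation above. As written it recomputes Δ recursively; a real implementation would memoize Δ over node pairs.

    def nodes(tree):
        # All non-leaf nodes (including pre-terminals) of a tree encoded as
        # nested tuples, e.g. ("A", ("B", ("D", "d"), ("E", "e")), ...).
        if isinstance(tree, str):        # a word
            return []
        return [tree] + [n for child in tree[1:] for n in nodes(child)]

    def production(node):
        # The rule at a node: (label, sequence of child labels / words).
        return (node[0], tuple(c if isinstance(c, str) else c[0] for c in node[1:]))

    def is_preterminal(node):
        return all(isinstance(c, str) for c in node[1:])

    def delta(n1, n2):
        # Number of common subtrees rooted at n1 and n2 (the recursion above).
        if production(n1) != production(n2):
            return 0
        if is_preterminal(n1):
            return 1
        result = 1
        for c1, c2 in zip(n1[1:], n2[1:]):
            result *= 1 + delta(c1, c2)
        return result

    def tree_kernel(T1, T2):
        # Phi(T1) . Phi(T2) = sum of delta(n1, n2) over all node pairs.
        return sum(delta(n1, n2) for n1 in nodes(T1) for n2 in nodes(T2))

    T1 = ("A", ("B", ("D", "d"), ("E", "e")), ("C", ("F", "f"), ("G", "g")))
    T2 = ("A", ("B", ("D", "d"), ("E", "e")), ("C", ("F", "h"), ("G", "i")))
    print(delta(T1, T2))        # 10: Delta(A, A), as computed above
    print(tree_kernel(T1, T2))  # 17 = Delta(A,A) + Delta(B,B) + Delta(C,C) + Delta(D,D) + Delta(E,E)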
52 The Inner Product for Tagged Sequences
• Define N1 and N2 to be the sets of states in T1 and T2 respectively
• By a similar argument,
    Φ(T1) · Φ(T2) = Σ_{n1 ∈ N1} Σ_{n2 ∈ N2} Δ(n1, n2)
where Δ(n1, n2) is the number of common sub-fragments at n1 and n2
• e.g., for the tagged sequences
    T1 = states A B C D over the words a b c d
    T2 = states A B C E over the words a b e e
    Φ(T1) · Φ(T2) = Δ(A, A) + Δ(A, B) + ... + Δ(B, A) + Δ(B, B) + ... + Δ(D, E)
    e.g., Δ(B, B) = 4 (the common sub-fragments at the two B states include the fragment [B over the word b])
53 The Recursive Definition for Tagged Sequences
• Define N(n) = the state following n, and W(n) = the word at state n
• Define π[W(n1), W(n2)] = 1 iff W(n1) = W(n2), 0 otherwise
• Then, if the state labels at n1 and n2 are the same,
    Δ(n1, n2) = (1 + π[W(n1), W(n2)]) × (1 + Δ(N(n1), N(n2)))
e.g., for T1 = A B C D over a b c d and T2 = A B C E over a b e e:
    Δ(A, A) = (1 + π[a, a]) × (1 + Δ(B, B)) = (1 + 1) × (1 + 4) = 10
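The analogous sketch for tagged sequences, representing a sequence as a list of (state, word) pairs; it reproduces the Δ(A, A) = 10 value from the example above.

    def seq_delta(T1, T2, i, j):
        # Number of common sub-fragments starting at state i of T1 and state j
        # of T2, following the recursion on the slide.
        if i >= len(T1) or j >= len(T2):
            return 0
        (s1, w1), (s2, w2) = T1[i], T2[j]
        if s1 != s2:                      # state labels differ
            return 0
        pi = 1 if w1 == w2 else 0         # pi[W(n1), W(n2)]
        return (1 + pi) * (1 + seq_delta(T1, T2, i + 1, j + 1))

    def seq_kernel(T1, T2):
        # Phi(T1) . Phi(T2) = sum of delta over all pairs of states.
        return sum(seq_delta(T1, T2, i, j)
                   for i in range(len(T1)) for j in range(len(T2)))

    T1 = [("A", "a"), ("B", "b"), ("C", "c"), ("D", "d")]
    T2 = [("A", "a"), ("B", "b"), ("C", "e"), ("E", "e")]
    print(seq_delta(T1, T2, 0, 0))   # 10, as in the example above
    print(seq_kernel(T1, T2))        # 15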
54 Refinements of the Kernels
• Include the log probability from the baseline model:
    Φ(T1) is the representation under the all sub-fragments kernel
    L(T1) is the log probability of T1 under the baseline model
    New representation Φ' where
        Φ'(T1) · Φ'(T2) = β L(T1) L(T2) + Φ(T1) · Φ(T2)
    (this includes L(T1) as an additional component of Φ', with weight √β)
• Allows the perceptron to use the original ranking as a default
55 Refinements of the Kernels
• Downweight larger sub-fragments:
    Σ_{i=1}^{d} λ^{SIZE_i} h_i(T1) h_i(T2)
where 0 < λ ≤ 1 and SIZE_i is the number of states/rules in the i-th fragment
• This is a simple modification to the recursive definitions, e.g.,
    Δ(n1, n2) = λ (1 + π[W(n1), W(n2)]) × (1 + Δ(N(n1), N(n2)))
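One way to fold the downweighting into the sequence recursion (an interpretation of the modification the slide refers to, with λ multiplied in once per matched state):

    LAM = 0.5    # an example value; any 0 < lambda <= 1 works

    def seq_delta_down(T1, T2, i, j):
        # Same recursion as before, but with a factor of lambda for each
        # matched state, so a common fragment with k states is weighted by
        # lambda ** k.
        if i >= len(T1) or j >= len(T2):
            return 0
        (s1, w1), (s2, w2) = T1[i], T2[j]
        if s1 != s2:
            return 0
        pi = 1 if w1 == w2 else 0
        return LAM * (1 + pi) * (1 + seq_delta_down(T1, T2, i + 1, j + 1))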
56 Refinement of the Tagging Kernel
    Δ(n1, n2) = (1 + π[W(n1), W(n2)]) × (1 + Δ(N(n1), N(n2)))
• Sub-fragments sensitive to spelling features (e.g., capitalization)
• Define π[x, y] = 1 if x and y are identical, and π[x, y] = 0.5 if x and y share the same capitalization features
[Figure: a named-entity fragment with tags N N S over "exiled to Elba", and the same fragment with the word Elba replaced by the spelling feature "Cap"]
• Sub-fragments now include capitalization features such as "No cap" and "Cap"
57 Experimental Results
Parsing the Wall Street Journal (sentences of ≤ 100 words; 2,416 sentences):

    MODEL        LR      LP      CBs    0 CBs   ≤ 2 CBs
    CO99         88.1%   88.3%   ...    ...     85.1%
    Perceptron   88.6%   88.9%   ...    ...     86.3%

This gives a 5.1% relative reduction in error (CO99 = my thesis parser).

Named Entity Tagging on Web Data:

    MODEL         P       R       F
    Max-Ent       84.4%   86.3%   85.3%
    Perc.         86.1%   89.1%   87.6%
    Improvement   10.9%   20.4%   15.6%

This gives a 15.6% relative reduction in error.
58 Summary
• For any representation Φ(x), efficient computation of Φ(x) · Φ(y) ⟹ efficient learning through the kernel form of the perceptron
• Dynamic programming can be used to calculate Φ(x) · Φ(y) under the all sub-fragments representations
• Several refinements of the inner products:
    - Including probabilities from the baseline model
    - Downweighting larger sub-fragments
    - Sensitivity to spelling features
59 Conclusions: 10 Ideas from the Course
1. Smoothed estimation
2. Probabilistic Context-Free Grammars, and history-based models ⟹ lexicalized parsing
3. Feature-vector representations, and log-linear models ⟹ log-linear models for tagging and parsing
4. The EM algorithm: hidden structure
5. Machine translation: making use of the EM algorithm
6. Global linear models: new representations (global features)
7. Global linear models: new learning algorithms (perceptron, boosting)
8. Partially supervised methods: applications to word sense disambiguation, named entity recognition, and relation extraction
9. Structured models for information extraction and dialogue systems
10. A final representational trick: kernels