
2 Where we are. Experiments with a hash-trick implementation of logistic regression. Next question: how do you parallelize SGD, or more generally, this kind of streaming algorithm? Each example affects the next prediction ⇒ order matters ⇒ parallelization changes the behavior. We will step back to perceptrons and then step forward to parallel perceptrons, then another nice parallel learning algorithm, then a midterm.

3 Recap: perceptrons

4 The perceptron. A sends an instance x_i. B computes ŷ_i = sign(v_k · x_i). A reveals y_i. If mistake: v_{k+1} = v_k + y_i x_i.

5 The perceptron. A sends an instance x_i. B computes ŷ_i = sign(v_k · x_i). A reveals y_i. If mistake: v_{k+1} = v_k + y_i x_i. A lot like the SGD update for logistic regression! Mistake bound: k ≤ (R/γ)². (Figure: positive and negative examples separated by the target vector u with margin 2γ.)
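To make the update concrete, here is a minimal sketch of the perceptron above, assuming NumPy and a hypothetical `examples` list of (x, y) pairs with y ∈ {−1, +1}:

```python
import numpy as np

def perceptron(examples, n_features, epochs=1):
    """Mistake-driven perceptron: v <- v + y_i * x_i on each mistake.

    `examples` is assumed to be a list of (x, y) pairs with x a dense
    NumPy vector and y in {-1, +1}.
    """
    v = np.zeros(n_features)
    mistakes = 0
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1.0 if v @ x > 0 else -1.0
            if y_hat != y:          # mistake: move v toward y * x
                v += y * x
                mistakes += 1
    return v, mistakes
```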

6 On-line to batch learning. 1. Pick a v_k at random according to m_k/m, the fraction of examples it was used for. 2. Predict using the v_k you just picked. 3. (Actually, use some sort of deterministic approximation to this.) (Figure: m_1 = 3, m_2 = 4, …, m = 10.)

7 Predict using sign(v* · x). 1. Pick a v_k at random according to m_k/m, the fraction of examples it was used for. 2. Predict using the v_k you just picked. 3. (Actually, use some sort of deterministic approximation to this.) (Figure: m_1 = 3, m_2 = 4, …, m = 10.)

8 Predict using sign(v* · x). Also: there's a sparsification trick that makes learning the averaged perceptron fast. (Figure: last perceptron vs. averaging/voting.)
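One common form of that trick keeps a running weighted sum of the v_k's instead of storing each one; a sketch, with the same assumed `examples` format as before:

```python
import numpy as np

def averaged_perceptron(examples, n_features, epochs=1):
    """Averaged perceptron: predict with the survival-weighted mean of
    the v_k's, a deterministic stand-in for picking v_k at random with
    probability m_k / m."""
    v = np.zeros(n_features)
    v_sum = np.zeros(n_features)   # sum over examples of the v used on each
    m = 0
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1.0 if v @ x > 0 else -1.0
            if y_hat != y:
                v = v + y * x
            v_sum += v             # every example "votes" for the current v
            m += 1
    return v_sum / m               # v*: the average perceptron
```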

9 KERNELS AND PERCEPTRONS

10 The kernel perceptron. A sends an instance x_i. Instead of computing ŷ_i = sign(v_k · x_i) and updating v_{k+1} = v_k + y_i x_i on a mistake, B computes ŷ_i = Σ_{x_k+ ∈ FN} x_i · x_k+ − Σ_{x_k− ∈ FP} x_i · x_k−. If false-negative mistake (prediction too low): add x_i to FN. If false-positive mistake (prediction too high): add x_i to FP. Mathematically the same as before, but allows use of the kernel trick.

11 The kernel perceptron. With K(x, x') ≡ x · x', the prediction becomes ŷ_i = Σ_{x_k+ ∈ FN} K(x_i, x_k+) − Σ_{x_k− ∈ FP} K(x_i, x_k−). If false-negative mistake (too low): add x_i to FN; if false-positive mistake (too high): add x_i to FP. Mathematically the same as before, but allows use of the kernel trick. Other kernel methods (SVM, Gaussian processes) aren't constrained to a limited set (+1/−1/0) of weights on the K(x, x_k) values.
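A sketch of the mistake-set version, assuming `K` is any kernel function and `examples` yields (x, y) pairs with y ∈ {−1, +1}:

```python
def kernel_score(x, FN, FP, K):
    """Score x from the mistake sets: sum of K(x, x_k+) over false
    negatives minus sum of K(x, x_k-) over false positives."""
    return sum(K(x, xk) for xk in FN) - sum(K(x, xk) for xk in FP)

def kernel_perceptron(examples, K, epochs=1):
    """Kernel perceptron: on a mistake, store the example instead of
    updating an explicit weight vector."""
    FN, FP = [], []                 # false-negative / false-positive mistakes
    for _ in range(epochs):
        for x, y in examples:
            y_hat = 1.0 if kernel_score(x, FN, FP, K) > 0 else -1.0
            if y_hat != y:
                (FN if y > 0 else FP).append(x)   # too low -> FN, too high -> FP
    return FN, FP
```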

12 Some common kernels. Linear kernel: K(x, x') ≡ x · x'. Polynomial kernel: K(x, x') ≡ (x · x' + 1)^d. Gaussian kernel: K(x, x') ≡ e^(−‖x − x'‖²/σ).

13 Some common kernels. Polynomial kernel: K(x, x') ≡ (x · x' + 1)^d. For d = 2 and x = (x_1, x_2): (x · x' + 1)² = (x_1 x'_1 + x_2 x'_2 + 1)² = 1 + x_1² x'_1² + x_2² x'_2² + 2 x_1 x'_1 + 2 x_2 x'_2 + 2 x_1 x_2 x'_1 x'_2 = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²) · (1, √2 x'_1, √2 x'_2, x'_1², √2 x'_1 x'_2, x'_2²).

14 Some common kernels. Polynomial kernel: K(x, x') ≡ (x · x' + 1)^d. For d = 2: with φ(x) = (1, √2 x_1, √2 x_2, x_1², √2 x_1 x_2, x_2²) we have K(x, x') = φ(x) · φ(x'). Similarity under the kernel on x is equivalent to dot-product similarity on a transformed feature vector φ(x).
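A quick numerical check of this equivalence for d = 2; the vectors `x` and `xp` are arbitrary illustrative values:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the d=2 polynomial kernel on x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([0.5, -1.0])
xp = np.array([2.0, 0.25])
lhs = (x @ xp + 1.0) ** 2     # K(x, x') = (x . x' + 1)^2
rhs = phi(x) @ phi(xp)        # phi(x) . phi(x')
assert np.isclose(lhs, rhs)   # the two similarities agree
```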

15 Kernels 101. Duality: two ways to look at this. Observation about the perceptron: w = Σ_{x_k+ ∈ FN} x_k+ − Σ_{x_k− ∈ FP} x_k−, and ŷ = x · w. (1) Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space (RKHS): w = Σ_{x_k+ ∈ FN} φ(x_k+) − Σ_{x_k− ∈ FP} φ(x_k−) and ŷ = φ(x) · w. (2) Implicitly map from x to φ(x) by changing the kernel function: ŷ = Σ_{x_k+ ∈ FN} K(x, x_k+) − Σ_{x_k− ∈ FP} K(x, x_k−), where K(x, x_k) ≡ φ(x) · φ(x_k). This generalization of the perceptron has the same behavior, but compute time/space are different. Further generalization: add weights to the sums for w.

16 Kernels 101: Duality. Gram matrix K: k_ij = K(x_i, x_j). K(x, x') = K(x', x) ⇒ the Gram matrix is symmetric. K(x, x) > 0 ⇒ the diagonal of K is positive. K is positive semi-definite ⇒ zᵀ K z ≥ 0 for all z.
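These properties are easy to check numerically; a sketch using randomly generated data and the d = 2 polynomial kernel:

```python
import numpy as np

# Hypothetical data: the Gram matrix of any valid kernel should be
# symmetric and positive semi-definite (eigenvalues >= 0, up to
# floating-point noise).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = (X @ X.T + 1.0) ** 2        # d=2 polynomial kernel Gram matrix

assert np.allclose(K, K.T)      # K(x, x') = K(x', x)
assert np.linalg.eigvalsh(K).min() > -1e-8   # z^T K z >= 0 for all z
```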

17 A FAMILIAR KERNEL

18 Learning as optimization for regularized logistic regression + hashes. Algorithm: initialize arrays W, A of size R and set k = 0. For each iteration t = 1, …, T: for each example (x_i, y_i): let V be a hash table; for each feature j with x_i^j > 0, increment V[hash(j) % R] by x_i^j; compute p_i = σ(Σ_h W[h] V[h]); k++; then for each hash value h with V[h] > 0: W[h] *= (1 − 2λμ)^(k − A[h]); W[h] = W[h] + λ(y_i − p_i) V[h]; A[h] = k.
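A sketch of this algorithm in Python, assuming examples arrive as (feature-dict, label) pairs with labels in {0, 1}; the names `lam` (learning rate λ) and `mu` (regularization μ) are mine:

```python
import math

def hashed_logreg(examples, R, T=1, lam=0.1, mu=1e-4):
    """SGD for L2-regularized logistic regression with the hash trick.

    W[h] is the weight for hash bucket h; A[h] records the step at which
    W[h] was last regularized, so the (1 - 2*lam*mu) decay is applied
    lazily, only to the buckets touched by the current example.
    """
    W, A, k = [0.0] * R, [0] * R, 0
    for _ in range(T):
        for feats, y in examples:       # feats: dict feature -> value, y in {0, 1}
            V = {}                      # hashed version of this example
            for j, xj in feats.items():
                if xj != 0:
                    h = hash(j) % R
                    V[h] = V.get(h, 0.0) + xj
            score = sum(W[h] * v for h, v in V.items())
            p = 1.0 / (1.0 + math.exp(-max(min(score, 30.0), -30.0)))
            k += 1
            for h, v in V.items():
                W[h] *= (1.0 - 2.0 * lam * mu) ** (k - A[h])  # catch-up decay
                W[h] += lam * (y - p) * v
                A[h] = k
    return W
```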

20 Some details. Slightly different hash to avoid systematic bias: φ[h] = Σ_{j : hash(j) % m == h} ξ(j) x_j, where ξ(j) ∈ {−1, +1} (compare V[h] = Σ_{j : hash(j) % R == h} x_i^j above). Here m is the number of buckets you hash into (R in my discussion).

21 Some details. Slightly different hash to avoid systematic bias: φ[h] = Σ_{j : hash(j) % m == h} ξ(j) x_j, where ξ(j) ∈ {−1, +1}. I.e., for large feature sets the variance should be low.
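A sketch of the signed hashing map, assuming features arrive as a dict; Python's built-in `hash` stands in for real hash functions, and the sign hash here is an ad-hoc stand-in for a second independent hash ξ:

```python
import numpy as np

def hashed_features(feats, m):
    """Signed feature hashing: phi[h] = sum of xi(j) * x_j over features j
    with hash(j) % m == h, xi(j) in {-1, +1}. The random signs make
    collisions cancel in expectation rather than add up."""
    phi = np.zeros(m)
    for j, xj in feats.items():
        h = hash(j) % m
        xi = 1.0 if hash((j, "sign")) % 2 == 0 else -1.0   # cheap sign hash
        phi[h] += xi * xj
    return phi
```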

22 Some details. I.e., a hashed vector is probably close to the original vector.

23 Some details. I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work.

24 The Voted Perceptron for Ranking and Structured Classification William Cohen

25 The voted perceptron for ranking. A sends B a set of instances x_1, …, x_n. B computes ŷ_i = v_k · x_i and returns b*, the index of the best x_i. A reveals b, the index of the right answer. If mistake: v_{k+1} = v_k + x_b − x_{b*}.
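A sketch of one round of this protocol, assuming `xs` is the list of instance vectors and `b` the index of the right answer:

```python
import numpy as np

def ranking_perceptron_step(v, xs, b):
    """One round of the ranking perceptron: B picks the top-scoring
    instance b*; if it differs from the right answer b, move v toward
    x_b and away from x_b*."""
    scores = [v @ x for x in xs]
    b_star = int(np.argmax(scores))    # index of the best x_i under v
    if b_star != b:
        v = v + xs[b] - xs[b_star]     # v_{k+1} = v_k + x_b - x_b*
    return v, b_star
```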

26 Ranking some x's with the target vector u. (Figure: instances ranked along u, with margin γ.)

27 Ranking some x's with some guess vector v, part 1. (Figure: instances ranked along v vs. the target u.)

28 Ranking some x's with some guess vector v, part 2. The purple-circled x is x_{b*}, the one the learner has chosen to rank highest. The green-circled x is x_b, the right answer.

29 Correcting v by adding x_b − x_{b*}. (Figure: u, v, margin γ.)

30 Correcting v by adding x_b − x_{b*}, part 2. (Figure: v_k and the corrected v_{k+1}.)

31 (3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2, so v_2 · u > γ. (Figure: u, v_1, v_2, margin 2γ.)

32 (3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2, so v_2 · u > γ. (Figure: u, v_1, v_2, x_3, margin 2γ.)

33 (3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2, so v_2 · u > γ. (Figure: u, v_1, v_2, x_3, margin 2γ.)

34 (3a) The guess v_2 after the two positive examples: v_2 = v_1 + x_2. Notice this doesn't depend at all on the number of x's being ranked. Neither proof depends on the dimension of the x's. (Figure: u, v_1, v_2, margin 2γ.)

35 Ranking perceptrons ⇒ structured perceptrons. The API: A sends B a (maybe huge) set of items to rank; B finds the single best one according to the current weight vector; A tells B which one was actually best. Structured classification on a sequence: input is a list of words x = (w_1, …, w_n); output is a list of labels y = (y_1, …, y_n). If there are K classes, there are K^n possible labelings for x.

36 Borkar et al.'s HMMs for segmentation. Example: addresses, bib records. Problem: some DBs may split records up differently (e.g. no mail-stop field, combine address and apt #, …) or not at all. Solution: learn to segment the textual form of records. Example record (Author / Year / Title / Journal / Volume / Page): P.P. Wangikar, T.P. Graycar, D.A. Estell, D.S. Clark, J.S. Dordick (1993) Protein and Solvent Engineering of Subtilisin BPN' in Nearly Anhydrous Organic Media. J. Amer. Chem. Soc. 115,

37 IE with Hidden Markov Models. (Figure: an HMM with states Author, Year, Journal, Title; transition probabilities between them, e.g. 0.9, 0.5, 0.2; emission probabilities over tokens such as Smith, Cohen, Jordan; dddd, dd; Learning, Convex; Comm., Trans., Chemical.)

38 Inference for linear-chain CRFs. Example sentence: "When will prof Cohen post the notes". Idea 1: features are properties of two adjacent tokens and the pair of labels assigned to them (Begin, Inside, Outside): (y(i)==B or y(i)==I) and (token(i) is capitalized); (y(i)==I and y(i−1)==B) and (token(i) is hyphenated); (y(i)==B and y(i−1)==B), e.g. "tell Rose William is on the way". Idea 2: construct a graph where each path is a possible sequence labeling.

39 Inference for a linear-chain CRF. Example sentence: "When will prof Cohen post the notes". (Figure: a trellis with one column per token and rows B, I, O.) Inference: find the highest-weight path given a weighting of features. This can be done efficiently using dynamic programming (Viterbi).
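A sketch of Viterbi over this trellis, assuming a hypothetical `score(i, prev, cur)` that returns the summed weights of the features firing when token i gets label `cur` and token i−1 got `prev` (prev is None at i = 0):

```python
def viterbi(n_tokens, labels, score):
    """Highest-weight path through the label trellis, e.g.
    viterbi(7, ["B", "I", "O"], score)."""
    best = {y: score(0, None, y) for y in labels}   # best path score ending in y
    back = []                                       # backpointers per position
    for i in range(1, n_tokens):
        new_best, ptrs = {}, {}
        for y in labels:
            cands = [(best[yp] + score(i, yp, y), yp) for yp in labels]
            new_best[y], ptrs[y] = max(cands)
        best, back = new_best, back + [ptrs]
    # follow backpointers from the best final label
    y = max(best, key=best.get)
    path = [y]
    for ptrs in reversed(back):
        y = ptrs[y]
        path.append(y)
    return list(reversed(path))
```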

40 Ranking perceptrons ⇒ structured perceptrons. The API: A sends B a (maybe huge) set of items to rank; B finds the single best one according to the current weight vector; A tells B which one was actually best. Structured classification on a sequence: input is a list of words x = (w_1, …, w_n); output is a list of labels y = (y_1, …, y_n). If there are K classes, there are K^n possible labelings for x.

41 Ranking perceptrons ⇒ structured perceptrons. New API: A sends B the word sequence x; B finds the single best y according to the current weight vector, using Viterbi; A tells B which y was actually best. This is equivalent to ranking pairs g = (x, y'). Structured classification on a sequence: input is a list of words x = (w_1, …, w_n); output is a list of labels y = (y_1, …, y_n). If there are K classes, there are K^n possible labelings for x.

42 The voted perceptron for ranking. A sends B a set of instances x_1, …, x_n. B computes ŷ_i = v_k · x_i and returns b*, the index of the best x_i. A reveals b, the index of the right answer. If mistake: v_{k+1} = v_k + x_b − x_{b*}. Change number one is notation: replace x with g.

43 The voted perceptron for structured classification tasks. A sends B instances g_1, g_2, g_3, g_4, …. B computes ŷ_i = v_k · g_i and returns b*, the index of the best g_i. A reveals b. If mistake: v_{k+1} = v_k + g_b − g_{b*}. In detail: 1. A sends B feature functions, and instructions for creating the instances g: A sends a word vector x_i. Then B could create the instances g_1 = F(x_i, y_1), g_2 = F(x_i, y_2), …, but instead B just returns the y* that gives the best score for the dot product v_k · F(x_i, y*), by using Viterbi. 2. A sends B the correct label sequence y_i. 3. On errors, B sets v_{k+1} = v_k + g_b − g_{b*} = v_k + F(x_i, y_i) − F(x_i, y*).
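A sketch of the resulting training loop, where `F` and `viterbi_decode` are stand-ins for the feature functions and the Viterbi search described above:

```python
import numpy as np

def structured_perceptron(train, F, viterbi_decode, n_features, epochs=5):
    """Structured perceptron training loop.

    `F(x, y)` maps a sentence and a label sequence to a feature vector;
    `viterbi_decode(v, x)` returns the argmax label sequence under v.
    Both are assumed, standing in for the machinery on the slide.
    """
    v = np.zeros(n_features)
    for _ in range(epochs):
        for x, y in train:
            y_star = viterbi_decode(v, x)      # B's best guess
            if y_star != y:                    # mistake on the structure
                v += F(x, y) - F(x, y_star)    # v <- v + F(x,y) - F(x,y*)
    return v
```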

44 Results from the original paper. EMNLP 2002, Best paper

45 Collins' experiments. POS tagging; NP chunking (words and POS tags from Brill's tagger as features) with BIO output tags. Compared logistic regression methods (MaxEnt) and voted-perceptron-trained HMMs, with and without averaging, with and without feature selection (count > 5).

46 Collins results

47 Where we are. Experiments with a hash-trick implementation of logistic regression. Next question: how do you parallelize SGD, or more generally, this kind of streaming algorithm? Each example affects the next prediction ⇒ order matters ⇒ parallelization changes the behavior. We will step back to perceptrons and then step forward to parallel perceptrons, then another nice parallel learning algorithm, then a midterm.

More information