CS395T: Structured Models for NLP Lecture 2: Machine Learning Review

1 CS395T: Structured Models for NLP Lecture 2: Machine Learning Review Greg Durrett Some slides adapted from Vivek Srikumar, University of Utah

2 Administrivia
Course enrollment
Lecture slides posted on website

3 This Lecture
Linear classification fundamentals
Naive Bayes, maximum likelihood in generative models
Three discriminative models: logistic regression, perceptron, SVM
Different motivations but very similar update rules / inference!

4 Classification
Datapoint x with label y ∈ {0, 1}
Embed datapoint in a feature space: f(x) ∈ R^n (but in this lecture f(x) and x are interchangeable)
Linear decision rule: w^T f(x) + b > 0
Can delete the bias if we augment the feature space with a constant: f(x) = [0.5, 1.6, 0.3] becomes [0.5, 1.6, 0.3, 1], and the rule is just w^T f(x) > 0

5 Linear functions are powerful!
[figure: 2D datapoints separable by a linear boundary]

6 Linear functions are powerful!
[figure: data that is not linearly separable under f(x) = [x_1, x_2] becomes separable with quadratic features f(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2]]
Kernel trick does this for free, but is too expensive to use in NLP applications: training is O(n^2) in the number of examples instead of O(n · (num feats))

7 Classification: Sentiment Analysis
this movie was great! would watch again → Positive
this movie was not really very enjoyable → Negative
Doing well at this is going to require structure, but let's start with simple approaches

8 Text Classification: Ham or Spam
Hi, I just wanted to send over the latest results from training the LSTM model. In the attachment. What do you think of the performance? → Ham
hi i have very valuable business proposition for you. you make lots of $$$ I just need you to send a small amount of funds → Spam
Surface cues can basically tell you what's going on here
Machine learning is good at this! Lots of data, simple pattern recognition task, hard to write rules by hand

9 Text Classification: Ham or Spam
[same ham and spam examples as slide 8]
Why do we think this? Conditional probabilities (the chance of spam given $$$ is high)

10 Text Classification: Ham or Spam
[same ham and spam examples as slide 8]
Feature representation: Indicator[doc contains $$$], Indicator[doc contains training], Indicator[doc contains send], ...
Convert a document to a vector: [1, 0, 1, ...] (~50,000 long)
Requires indexing the features (mapping them to axes)
Very high dimensional space! How do we learn feature weights?
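To make the feature indexing concrete, here is a minimal Python sketch (the Indexer class, helper names, and toy documents are illustrative assumptions, not from the lecture): each word maps to an axis, and a document becomes the set of active indices, i.e. a sparse 0/1 vector.

```python
# A minimal bag-of-words featurizer: map each word to an integer axis,
# then represent a document by its set of active feature indices
# (equivalent to a ~50,000-dimensional 0/1 vector, stored sparsely).

class Indexer:
    """Maps feature names (here, words) to contiguous integer axes."""
    def __init__(self):
        self.feature_to_index = {}

    def index_of(self, feature, add=True):
        if add and feature not in self.feature_to_index:
            self.feature_to_index[feature] = len(self.feature_to_index)
        return self.feature_to_index.get(feature, -1)

def featurize(doc, indexer, add=True):
    """Indicator features: one active index per distinct word in the doc."""
    return {indexer.index_of(w, add) for w in doc.lower().split()} - {-1}

indexer = Indexer()
train_doc = "i just need you to send a small amount of funds"
print(featurize(train_doc, indexer))                     # active feature indices
print(featurize("send funds now", indexer, add=False))   # unseen words dropped
```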

11 Naive Bayes
Data point x = (x_1, ..., x_n), label y ∈ {0, 1}
Formulate a probabilistic model that places a distribution P(x, y)
Compute P(y|x) and then label an example with argmax_y P(y|x)
Bayes' rule: P(y|x) = P(y) P(x|y) / P(x); P(x) is constant, irrelevant for finding the max, so P(y|x) ∝ P(y) P(x|y)
Naive assumption: P(x|y) = ∏_{i=1}^n P(x_i | y) → a linear model!
argmax_y P(y|x) = argmax_y log P(y|x) = argmax_y [ log P(y) + Σ_{i=1}^n log P(x_i | y) ]

12 Text Classification: Ham or Spam
[same ham and spam examples as slide 8]
argmax_y log P(y|x) = argmax_y [ log P(y) + Σ_{i=1}^n log P(x_i | y) ]
P(x_funds = 0 | spam) = 0.9    P(x_funds = 1 | spam) = 0.1
P(x_funds = 0 | ham) = 0.99    P(x_funds = 1 | ham) = 0.01
→ spam gets more points in the final posterior when "funds" appears
Note that these are P(x_i | y), not P(y|x): not the probability of ham given the word

13 Maximum Likelihood Estimation
Data points (x_j, y_j) provided (j indexes over examples)
Find values of P(y), P(x_i | y) that maximize the data likelihood (generative):
∏_{j=1}^m P(y_j, x_j) = ∏_{j=1}^m [ P(y_j) ∏_{i=1}^n P(x_{ji} | y_j) ]
(j ranges over the m data points, i over the n features; x_{ji} is the ith feature of the jth example)
Equivalent to maximizing the logarithm of the data likelihood:
Σ_{j=1}^m log P(y_j, x_j) = Σ_{j=1}^m [ log P(y_j) + Σ_{i=1}^n log P(x_{ji} | y_j) ]

14 Maximum Likelihood Estimation
Imagine a coin flip which is heads with probability p
Observe (H, H, H, T) and maximize the log likelihood:
Σ_j log P(y_j) = 3 log p + log(1 − p)
[figure: log likelihood as a function of p, maximized at p = 3/4]
Maximum likelihood parameters for a multinomial = read counts off of the data
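Working the maximization out explicitly (a standard calculus step, not spelled out on the slide): set the derivative of the log likelihood to zero.

```latex
\frac{d}{dp}\left[\, 3\log p + \log(1-p) \,\right]
  = \frac{3}{p} - \frac{1}{1-p} = 0
\;\Longrightarrow\; 3(1-p) = p
\;\Longrightarrow\; p = \frac{3}{4}
```

That is exactly the count of heads divided by the total count, which is what "read counts off of the data" means.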

15 Maximum Likelihood for Naive Bayes
[same ham and spam examples as slide 8]
P(y = ham) = 0.5
P(x_funds = 1 | spam) = 1,  P(x_funds = 0 | spam) = 0
Smoothing: add very small counts for each entry to avoid zeroes (a bias-variance tradeoff):
P(x_funds = 1 | spam) = 0.99,  P(x_funds = 0 | spam) = 0.01

16 Naive Bayes: Summary
Model: P(x, y) = P(y) ∏_{i=1}^n P(x_i | y)
Inference: argmax_y log P(y|x) = argmax_y [ log P(y) + Σ_{i=1}^n log P(x_i | y) ]
Alternatively: log P(y = spam | x) − log P(y = ham | x) > 0
⟺ log [ P(y = spam) / P(y = ham) ] + Σ_{i=1}^n log [ P(x_i | y = spam) / P(x_i | y = ham) ] > 0
Learning: maximize P(x, y) by reading counts off the data
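A compact sketch of the whole pipeline under this setup: learning reads (smoothed) counts off the data, inference sums log probabilities. The smoothing constant, function names, toy data, and the choice to score only the features that fire (a common simplification of the full indicator model above) are assumptions for illustration, not from the lecture.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=0.1):
    """Learning = reading counts off the data; alpha smooths away zeroes."""
    label_counts = Counter(labels)
    word_counts = {y: Counter() for y in label_counts}
    for doc, y in zip(docs, labels):
        word_counts[y].update(set(doc.split()))      # indicator: word present?
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {y: math.log(n / len(labels)) for y, n in label_counts.items()}
    # log P(x_w = 1 | y), smoothed; P(x_w = 0 | y) would be the complement
    log_lik = {y: {w: math.log((word_counts[y][w] + alpha)
                               / (label_counts[y] + 2 * alpha))
                   for w in vocab}
               for y in label_counts}
    return log_prior, log_lik, vocab

def predict_nb(doc, log_prior, log_lik, vocab):
    """Inference: argmax_y [log P(y) + sum over active features of log P(x_i|y)]."""
    words = set(doc.split()) & vocab
    scores = {y: log_prior[y] + sum(log_lik[y][w] for w in words)
              for y in log_prior}
    return max(scores, key=scores.get)

log_prior, log_lik, vocab = train_nb(
    ["send me your funds $$$", "results from training the model"],
    ["spam", "ham"])
print(predict_nb("send funds", log_prior, log_lik, vocab))   # -> spam
```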

17 Problems with Naive Bayes
Features are correlated:
P(x_funds = 1 | spam) = 0.1    P(x_funds = 1 | ham) = 0.01
P(x_transfer = 1 | spam) = 0.1    P(x_transfer = 1 | ham) = 0.01
Hi, in order to close on the house we need you to transfer the requested funds to the escrow account. → actually Ham
This one sentence will make the probability of spam very high!
Bad independence assumption in NB: these words are not independent!
Solution: a better model; algorithms that explicitly minimize loss rather than maximizing data likelihood

18 Generative vs. Discriminative Models
Generative models: P(x, y)
Bayes nets / graphical models
Some of the model capacity goes to explaining the distribution of x; prediction uses Bayes' rule post hoc
Can sample new instances (x, y)
Discriminative models: P(y|x)
SVMs, logistic regression, CRFs, most neural networks
Model is trained to be good at prediction, but doesn't model x
We'll come back to this distinction throughout this class
Break!

19 Logistic Regression
P(y = spam | x) = logistic(w^T x) = exp( Σ_{i=1}^n w_i x_i ) / ( 1 + exp( Σ_{i=1}^n w_i x_i ) )
How to set the weights w? (Stochastic) gradient ascent to maximize the log likelihood:
L(x_j, y_j = spam) = log P(y_j = spam | x_j) = Σ_{i=1}^n w_i x_{ji} − log( 1 + exp( Σ_{i=1}^n w_i x_{ji} ) )

20 Logistic Regression
L(x_j, y_j = spam) = log P(y_j = spam | x_j) = Σ_{i=1}^n w_i x_{ji} − log( 1 + exp( Σ_{i=1}^n w_i x_{ji} ) )
∂L(x_j, y_j)/∂w_i = x_{ji} − ∂/∂w_i log( 1 + exp( Σ_{i=1}^n w_i x_{ji} ) )
= x_{ji} − [ 1 / ( 1 + exp( Σ_{i=1}^n w_i x_{ji} ) ) ] · x_{ji} exp( Σ_{i=1}^n w_i x_{ji} )   (deriv of log, then deriv of exp)
= x_{ji} − x_{ji} exp( Σ_{i=1}^n w_i x_{ji} ) / ( 1 + exp( Σ_{i=1}^n w_i x_{ji} ) )
= x_{ji} ( 1 − P(y_j = spam | x_j) )

21 Logistic Regression
Gradient of w_i on a positive example: x_{ji} ( 1 − P(y_j = spam | x_j) )
If P(spam) is close to 1, make very little update; otherwise make w_i look more like x_{ji}, which will increase P(spam)
Gradient of w_i on a negative example: −x_{ji} P(y_j = spam | x_j)
If P(spam) is close to 0, make very little update; otherwise make w_i look less like x_{ji}, which will decrease P(spam)
Final gradient (both cases): x_j ( y_j − P(y_j = 1 | x_j) )
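Putting the pieces together, a minimal sketch of stochastic gradient ascent using the final gradient x_j (y_j − P(y_j = 1 | x_j)); the learning rate, epoch count, and toy data are arbitrary choices, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, epochs=20, lr=0.1):
    """X: (m, n) feature matrix; y: (m,) labels in {0, 1}.
    Stochastic gradient ascent on the log likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            p = sigmoid(w @ x_j)           # P(y = 1 | x_j)
            w += lr * x_j * (y_j - p)      # gradient: x_j (y_j - P(y=1|x_j))
    return w

X = np.array([[1., 0., 1.], [0., 1., 1.]])   # toy docs with a bias feature
y = np.array([1, 0])
w = train_logistic_regression(X, y)
print(sigmoid(X @ w))                         # probabilities after training
```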

22 Regularization
Can end up making extreme updates to fit the training data:
w_funds = w_transfer = −900,  w_send = +742,  w_the = +203
All examples have P(correct) > 0.999, but the classifier does crazy things on new examples

23 Regularization
Can end up making extreme updates to fit the training data
Rather than optimizing likelihood alone, impose a penalty on the norm of the weight vector (can also view this as a Gaussian prior):
Maximize Σ_{j=1}^m L(x_j, y_j) − λ ‖w‖_2^2
[figure: ℓ1 and ℓ2 penalty contours in (w_1, w_2) space]
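With the ℓ2 penalty, each stochastic update just gains a shrinkage term from the gradient of −λ‖w‖_2^2, namely −2λw. A sketch of one penalized update (λ, the learning rate, and the toy inputs are illustrative hyperparameters, not values from the lecture):

```python
import numpy as np

def regularized_update(w, x_j, y_j, lr=0.1, lam=1e-3):
    """One SGD step on L(x_j, y_j) - lam * ||w||_2^2: the usual logistic
    gradient plus a -2 * lam * w shrinkage term that pulls weights toward
    zero, preventing the extreme values above."""
    p = 1.0 / (1.0 + np.exp(-(w @ x_j)))          # P(y = 1 | x_j)
    return w + lr * (x_j * (y_j - p) - 2 * lam * w)

w = regularized_update(np.zeros(3), np.array([1.0, 0.0, 1.0]), y_j=1)
print(w)   # a small step toward the positive example
```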

24 Logistic Regression: Summary
Model: P(y = spam | x) = exp( Σ_{i=1}^n w_i x_i ) / ( 1 + exp( Σ_{i=1}^n w_i x_i ) )
Inference: argmax_y P(y|x); similar to Naive Bayes, but with a different model/learning
P(y = 1 | x) ≥ 0.5  ⟺  w^T x ≥ 0
Learning: gradient ascent on the (regularized) discriminative log likelihood

25 Perceptron
Simple error-driven learning approach, similar to logistic regression
Decision rule: w^T f(x) > 0
If incorrect: if positive, w ← w + x; if negative, w ← w − x
Compare logistic regression: w ← w + x ( 1 − P(y = 1 | x) ) on positives, w ← w − x P(y = 1 | x) on negatives
Guaranteed to eventually separate the data if the data are separable, but does it learn a good boundary?
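A minimal sketch of this update rule in Python (epoch count and toy data are illustrative; real NLP features would be the sparse indicator vectors from earlier):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """X: (m, n); y: (m,) labels in {0, 1}. Error-driven updates:
    only touch w when the current decision rule w @ x > 0 is wrong."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            pred = 1 if w @ x_j > 0 else 0
            if pred != y_j:
                w += x_j if y_j == 1 else -x_j   # w <- w + x  or  w <- w - x
    return w

X = np.array([[1., 0., 1.], [0., 1., 1.]])
y = np.array([1, 0])
print(train_perceptron(X, y))
```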

26 Support Vector Machines
Many separating hyperplanes: is there a best one?

27 Support Vector Machines
Many separating hyperplanes: is there a best one?
[figure: the max-margin separator, with the margin marked]

28 Support Vector Machines
Constraint formulation: find w via the following quadratic program:
Minimize ‖w‖_2^2   (minimizing the norm ⟺ maximizing the margin)
s.t. ∀j:  w^T x_j ≥ 1 if y_j = 1,  w^T x_j ≤ −1 if y_j = 0
As a single constraint: ∀j:  y_j (w^T x_j) + (1 − y_j)(−w^T x_j) ≥ 1, i.e. ∀j: (2 y_j − 1)(w^T x_j) ≥ 1
What's wrong with this quadratic program for real data? Data is generally non-separable: need slack!

29 N-Slack SVMs
Minimize λ ‖w‖_2^2 + Σ_{j=1}^m ξ_j   s.t.  ∀j: (2 y_j − 1)(w^T x_j) ≥ 1 − ξ_j,  ∀j: ξ_j ≥ 0
The ξ_j are a fudge factor (slack) to make all constraints satisfied
(Sub-)gradient descent: focus on the second part:
∂ξ_j/∂w_i = 0 if ξ_j = 0
∂ξ_j/∂w_i = −(2 y_j − 1) x_{ji} if ξ_j > 0, i.e. −x_{ji} if y_j = 1, x_{ji} if y_j = 0
Looks like the perceptron! But updates more frequently
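A sketch of the resulting stochastic subgradient descent: unlike the perceptron, it updates on every example whose margin constraint is violated (ξ_j > 0), not only misclassified ones, and it also applies the norm penalty. λ, the step size, and the toy data are illustrative assumptions:

```python
import numpy as np

def train_svm_subgradient(X, y, epochs=20, lr=0.1, lam=1e-2):
    """X: (m, n); y: (m,) in {0, 1}. Stochastic subgradient descent on
    lam * ||w||_2^2 + sum_j xi_j, with xi_j the hinge loss of example j."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            sign = 2 * y_j - 1                        # map {0, 1} -> {-1, +1}
            if sign * (w @ x_j) < 1:                  # xi_j > 0: margin violated
                w += lr * (sign * x_j - 2 * lam * w)  # perceptron-like update
            else:
                w -= lr * 2 * lam * w                 # only the norm penalty fires
    return w

X = np.array([[1., 0., 1.], [0., 1., 1.]])
y = np.array([1, 0])
print(train_svm_subgradient(X, y))
```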

30 Loss Functions
[figure: loss as a function of the margin: 0-1 (ideal), hinge (SVM), logistic, perceptron]

31 Optimization (next time)
Haven't talked about optimization at all
Range of techniques from simple gradient descent (works pretty well) to more complex methods (can work better)

32 Sentiment Analysis
Classify a sentence as positive or negative sentiment
this movie was great! would watch again → Positive
the movie was gross and overwrought, but I liked it → Positive
this movie was not really very enjoyable → Negative
Bag-of-words doesn't seem sufficient (discourse structure, negation)
There are some ways around this: extract a bigram feature for "not X" for all X following the "not"
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)

33 Sentiment Analysis
Simple feature sets can do pretty well!
[table of results from Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002) omitted]

34 Sentiment Analysis
Wang and Manning (2012): Naive Bayes is doing well!
Ng and Jordan (2002): NB can be better for small data
Before neural nets had taken off, results weren't that great; two years later, Kim (2014) did better with neural networks

35 Recap
Logistic regression: P(y = 1 | x) = exp( Σ_{i=1}^n w_i x_i ) / ( 1 + exp( Σ_{i=1}^n w_i x_i ) )
Decision rule: P(y = 1 | x) ≥ 0.5  ⟺  w^T x ≥ 0
Gradient (unregularized): x ( y − P(y = 1 | x) )
SVM:
Decision rule: w^T x ≥ 0
(Sub)gradient (unregularized): 0 if correct with a margin of 1, else x (2y − 1)

36 Recap
Logistic regression, SVM, and perceptron are closely related
SVM and perceptron inference require taking maxes; logistic regression has a similar update but is softer due to its probabilistic nature
All gradient updates: make it look more like the right thing and less like the wrong thing
