Statistical Methods for NLP

1 Statistical Methods for NLP
Text Categorization, Support Vector Machines
Sameer Maskey

2 Announcement
Reading assignments
- Will be posted online tonight
Homework 1
- Assigned and available from the course website
- Due in 2 weeks (Feb 16, 4pm)
- 2 programming assignments

3 Project Proposals
- Reminder to think about projects
- Proposals due in 3 weeks (Feb 23)

4 Topics for Today
- Naïve Bayes classifier for text
- Smoothing
- Support Vector Machines
- Paper review session

5 Naïve Bayes Classifier for Text
$$P(y_k, X_1, X_2, \ldots, X_N) = P(y_k) \prod_i P(X_i \mid y_k)$$
- $P(y_k)$: prior probability of the class
- $P(X_i \mid y_k)$: conditional probability of a feature given the class
- Here $N$ is the number of words, not to be confused with the total vocabulary size

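6 Naïve Bayes Classifier for Text
$$P(y = y_k \mid X_1, X_2, \ldots, X_N) = \frac{P(y = y_k)\, P(X_1, X_2, \ldots, X_N \mid y = y_k)}{\sum_j P(y = y_j)\, P(X_1, X_2, \ldots, X_N \mid y = y_j)}$$
$$= \frac{P(y = y_k) \prod_i P(X_i \mid y = y_k)}{\sum_j P(y = y_j) \prod_i P(X_i \mid y = y_j)}$$
$$y \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(X_i \mid y = y_k)$$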

7 Naïve Bayes Classifier for Text
Given the training data, what are the parameters to be estimated?
$P(y)$
  Diabetes: 0.8
  Hepatitis: 0.2
$P(X \mid y_1)$
  the:
  diabetic: 0.02
  blood:
  sugar: 0.02
  weight:
$P(X \mid y_2)$
  the:
  diabetic:
  water:
  fever: 0.01
  weight:
$$y \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(X_i \mid y = y_k)$$

8 Estimating Parameters
- Maximum likelihood estimates
  - Relative frequency counts
- For a new document
  - Find which class gives the higher posterior probability
  - Log ratio
  - Thresholding
  - Classify accordingly
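
The estimation and classification steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's reference implementation: the function names (`train_naive_bayes`, `log_posterior`, `classify`) and the four toy documents are hypothetical, and only the class names are borrowed from the previous slide. Parameters are plain relative frequency counts, and a document is assigned by thresholding the log ratio of the two class scores.

```python
import math
from collections import Counter

def train_naive_bayes(docs):
    """Estimate P(y) and P(X_i | y) by relative frequency (MLE).

    docs is a list of (label, tokens) pairs.
    """
    class_counts = Counter(label for label, _ in docs)
    word_counts = {label: Counter() for label in class_counts}
    for label, tokens in docs:
        word_counts[label].update(tokens)
    priors = {c: k / len(docs) for c, k in class_counts.items()}
    cond = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        cond[c] = {w: k / total for w, k in counts.items()}
    return priors, cond

def log_posterior(tokens, prior, cond_c):
    """log P(y) + sum_i log P(X_i | y), i.e. the posterior up to a shared normalizer."""
    score = math.log(prior)
    for w in tokens:
        p = cond_c.get(w, 0.0)
        if p == 0.0:
            return float("-inf")  # unseen word: the unsmoothed MLE gives probability zero
        score += math.log(p)
    return score

def classify(tokens, priors, cond, pos, neg, threshold=0.0):
    """Threshold the log ratio of the two class scores."""
    log_ratio = (log_posterior(tokens, priors[pos], cond[pos])
                 - log_posterior(tokens, priors[neg], cond[neg]))
    # If both scores are -inf the ratio is NaN and the comparison falls through to neg.
    return pos if log_ratio > threshold else neg

# Hypothetical toy training set (made up for illustration).
docs = [
    ("diabetes", ["blood", "sugar", "weight"]),
    ("diabetes", ["sugar", "diabetic", "blood"]),
    ("hepatitis", ["fever", "water", "blood"]),
    ("hepatitis", ["fever", "weight", "sugar"]),
]
priors, cond = train_naive_bayes(docs)
print(classify(["blood", "sugar"], priors, cond, "diabetes", "hepatitis"))
print(classify(["fever", "weight"], priors, cond, "diabetes", "hepatitis"))
```

Note that a single word unseen in a class drives that class score to minus infinity, which is the zero-count problem that smoothing addresses.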

9 Smoothing
- MLE for Naïve Bayes (relative frequency counts) may not generalize well
  - Zero counts
- Smoothing
  - With less evidence, believe in the prior more
  - With more evidence, believe in the data more

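10 Laplace Smoothing
- Assume we have one more count for each element
- Zero counts become 1
$$P_{smooth}(w) = \frac{c(w) + 1}{\sum_{w'} \{c(w') + 1\}} = \frac{c(w) + 1}{N + V}$$
where $N = \sum_{w'} c(w')$ is the total count and $V$ is the vocabulary size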
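
A small sketch of add-one smoothing as defined on this slide. The helper name `laplace_probs` and the three-word vocabulary are hypothetical, and the sketch assumes every observed token belongs to the vocabulary, so that the smoothed probabilities sum to one.

```python
from collections import Counter

def laplace_probs(tokens, vocab):
    """Add-one smoothing: P_smooth(w) = (c(w) + 1) / (N + V)."""
    counts = Counter(tokens)
    n = len(tokens)  # N: total number of observed tokens
    v = len(vocab)   # V: vocabulary size
    return {w: (counts[w] + 1) / (n + v) for w in vocab}

# Hypothetical example: "fever" never occurs but still gets non-zero probability.
vocab = ["sugar", "blood", "fever"]
probs = laplace_probs(["sugar", "sugar", "blood"], vocab)
print(probs)
```

With three observed tokens and a vocabulary of three, the denominator is 6, so the unseen word "fever" receives 1/6 instead of zero.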

11 Back to Discriminative Classification
$$f(x) = w^T x + b$$
[Figure labels: b, w]

12 Linear Classification
If we have linearly separable data, we can find $w$ such that
$$y_i (w^T x_i + b) > 0 \quad \forall i$$

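13 Margin
Let us have hyperplanes such that
$$w^T x_i + b \geq +1 \quad \text{if } y_i = +1$$
$$w^T x_i + b \leq -1 \quad \text{if } y_i = -1$$
$$y_i (w^T x_i + b) - 1 \geq 0 \quad \forall i$$
[Figure labels: d+, d-]
Total margin is the sum of d+ and d-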

14 Maximizing Margin
- Distance between H and H+ is $\frac{1}{\|w\|}$
- Distance between H+ and H- is $\frac{2}{\|w\|}$
- To maximize the margin we need to minimize the denominator $\|w\|$, or equivalently minimize
$$\frac{1}{2} \|w\|^2$$

15 Maximizing Margin with Constraints
We can combine the two inequalities to get
$$y_i (w^T x_i + b) - 1 \geq 0 \quad \forall i$$
Problem formulation
Minimize
$$\frac{\|w\|^2}{2}$$
Subject to
$$y_i (w^T x_i + b) - 1 \geq 0 \quad \forall i$$
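
To make the formulation concrete, the sketch below evaluates a candidate hyperplane against the constraints and reports the objective and the resulting margin. Everything in it is an illustrative assumption: the helper `check_margin`, the four toy points, and the two candidate hyperplanes are made up for this example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def check_margin(X, y, w, b):
    """Return (feasible, objective, margin) for a candidate hyperplane (w, b)."""
    # Constraint: y_i (w^T x_i + b) - 1 >= 0 for every i.
    gaps = [yi * (dot(w, xi) + b) - 1.0 for xi, yi in zip(X, y)]
    feasible = all(g >= -1e-12 for g in gaps)
    objective = 0.5 * dot(w, w)           # ||w||^2 / 2
    margin = 2.0 / math.sqrt(dot(w, w))   # distance between H+ and H-
    return feasible, objective, margin

# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

tight = check_margin(X, y, w=(0.5, 0.5), b=-1.0)
loose = check_margin(X, y, w=(1.0, 1.0), b=-2.0)
print(tight)
print(loose)
```

Both candidates satisfy the constraints, but the one with the smaller norm of w has the smaller objective and the wider margin, which is the trade-off the minimization exploits.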

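16 Solving with Lagrange Multipliers
Solve by introducing Lagrange multipliers $\alpha_i$ for the constraints
Minimize
$$J(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{n} \alpha_i \{ y_i (w^T x_i + b) - 1 \}$$
For given $\alpha_i$
$$\nabla_w J(w, b, \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i$$
$$\frac{\partial}{\partial b} J(w, b, \alpha) = -\sum_{i=1}^{n} \alpha_i y_i$$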

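17 Dual Problem
Solve the dual problem instead
Maximize
$$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
subject to the constraints
$$\alpha_i \geq 0 \quad \forall i$$
$$\sum_{i=1}^{n} \alpha_i y_i = 0$$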
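
The step from the Lagrangian on the previous slide to this dual is not spelled out on the slides. A worked version, using the same symbols and assuming the standard approach of setting both gradients to zero, might read:

```latex
\begin{align*}
\nabla_w J = 0 \;&\Rightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i,
\qquad
\frac{\partial J}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0 \\
J(w, b, \alpha)
&= \frac{1}{2} w^T w
 - \sum_{i=1}^{n} \alpha_i y_i \, w^T x_i
 - b \sum_{i=1}^{n} \alpha_i y_i
 + \sum_{i=1}^{n} \alpha_i \\
&= \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
 - \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
 - 0
 + \sum_{i=1}^{n} \alpha_i \\
&= \sum_{i=1}^{n} \alpha_i
 - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)
\end{align*}
```

The training points enter the result only through the dot products between pairs of points, which is what later allows a kernel to be substituted.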

18 Quadratic Programming Problem
- Minimize $f(x)$ such that $g(x) = k$
- Where $f(x)$ is quadratic and $g(x)$ are linear constraints
- Constrained optimization problem
- Saw the example before

19 SVM Solution
- Linear combination of weighted training examples
$$\hat{w} = \sum_{i=1}^{n} \hat{\alpha}_i y_i x_i$$
- Sparse solution, why?
  - Weights are zero for non-support vectors
$$f(x) = \sum_{i \in SV} \alpha_i y_i (x_i \cdot x) + b$$
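
The sketch below illustrates the sparsity point: the weight vector is built from the alphas, and the decision value computed from the support vectors alone matches the value computed from the weight vector. The data, the alphas, and the bias are hypothetical values chosen by hand to be consistent with each other (they satisfy the constraint that the alpha-weighted labels sum to zero); the helper names are likewise assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight_vector(alpha, y, X):
    """w = sum_i alpha_i y_i x_i."""
    dim = len(X[0])
    return [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(dim)]

def decision_dual(x, alpha, y, X, b):
    """Sum over support vectors only (alpha_i > 0) of alpha_i y_i (x_i . x), plus b."""
    return sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X) if a > 0) + b

# Hypothetical toy problem: only the first and third points are support vectors.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]
b = -1.0

w = weight_vector(alpha, y, X)
x_new = (1.0, 3.0)
primal_value = dot(w, x_new) + b
dual_value = decision_dual(x_new, alpha, y, X, b)
print(w, primal_value, dual_value)
```

Points with a zero alpha contribute nothing to either form, so they could be discarded after training without changing any prediction.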

20 Sequential Minimal Optimization (SMO) Algorithm
- The weights are just linear combinations of training vectors weighted with alphas
- We still have not answered how we get the alphas
- Coordinate ascent
    Do until converged
        select a pair alpha(i) and alpha(j)
        reoptimize W(alpha) with respect to alpha(i) and alpha(j), holding all other alphas constant
    done
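
A heavily simplified sketch of the pairwise coordinate ascent loop described above. It is not the full SMO algorithm: pairs are visited by brute-force sweeps instead of SMO's selection heuristics, only the linear kernel is used, and the bias is recovered at the end by averaging over support vectors. The function name `pairwise_ascent`, the optional upper bound `C` (infinite by default, so that only the non-negativity constraint on the alphas applies), the tolerances, and the toy data are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pairwise_ascent(X, y, C=float("inf"), tol=1e-8, max_sweeps=100):
    n = len(X)
    K = [[dot(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n

    def error(k):
        # f(x_k) - y_k without the bias; the bias cancels in error(i) - error(j).
        return sum(alpha[m] * y[m] * K[m][k] for m in range(n)) - y[k]

    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                # Interval for alpha_j that keeps sum_i alpha_i y_i = 0 and 0 <= alpha <= C.
                if y[i] != y[j]:
                    lo = max(0.0, alpha[j] - alpha[i])
                    hi = min(C, C + alpha[j] - alpha[i])
                else:
                    lo = max(0.0, alpha[i] + alpha[j] - C)
                    hi = min(C, alpha[i] + alpha[j])
                eta = K[i][i] + K[j][j] - 2.0 * K[i][j]
                if hi - lo < 1e-12 or eta < 1e-12:
                    continue
                # Exact maximizer of the dual along this pair, clipped to the interval.
                new_j = alpha[j] + y[j] * (error(i) - error(j)) / eta
                new_j = min(hi, max(lo, new_j))
                if abs(new_j - alpha[j]) < tol:
                    continue
                alpha[i] += y[i] * y[j] * (alpha[j] - new_j)
                alpha[j] = new_j
                changed = True
        if not changed:
            break

    w = [sum(alpha[m] * y[m] * X[m][d] for m in range(n)) for d in range(len(X[0]))]
    # Assumes every point with alpha > 0 lies exactly on the margin (true when C is infinite).
    support = [m for m in range(n) if alpha[m] > 1e-8]
    b = sum(y[m] - dot(w, X[m]) for m in support) / len(support)
    return alpha, w, b

# Hypothetical, linearly separable toy data.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
alpha, w, b = pairwise_ascent(X, y)
print(alpha, w, b)
```

Updating two alphas at a time is the smallest move that can keep the equality constraint of the dual satisfied, which is why the loop works on pairs instead of single coordinates.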

21 Not Linearly Separable

22 Transformation
[Figure: transformation h( )]

23 Non-Linear SVMs
- Map data to a higher dimension where linear separation is possible
- We can get a longer feature vector by adding dimensions
$$x \rightarrow (x^2, x)$$
$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1)$$

24 Kernels
Given a feature mapping $\phi(x)$, define
$$K(x, z) = \phi(x)^T \phi(z)$$
$$\phi(x)^T \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x \cdot z + 1)^2$$
May not need to explicitly transform
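
A quick numeric check of this identity for two-dimensional inputs. The sketch assumes the six-dimensional mapping carries square-root-of-two factors on its cross and linear terms, which is what the coefficients of 2 in the expansion above require; the function names and the test points are hypothetical.

```python
import math

def phi(x):
    """Explicit six-dimensional feature map for a 2-D input."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """(x . z + 1)^2 computed directly in the original 2-D space."""
    return (dot(x, z) + 1.0) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = dot(phi(x), phi(z))  # inner product in the transformed space
implicit = poly2_kernel(x, z)   # same value without building phi
print(explicit, implicit)
```

The kernel needs one dot product in two dimensions, while the explicit route builds two six-dimensional vectors first, which is the saving the last line of the slide refers to.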

25 Example of Kernel Functions
- Linear kernel: $K(x, z) = x \cdot z$
- Polynomial kernel: $K(x, z) = (x \cdot z + 1)^p$
- Gaussian kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
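
The three kernels can be written directly from their formulas. This is a plain-Python sketch with hypothetical function names and default parameter values (`p=2`, `sigma=1.0`); the `gram_matrix` helper is an added convenience showing the matrix of pairwise kernel values that the dual problem works with.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, p=2):
    return (dot(x, z) + 1.0) ** p

def gaussian_kernel(x, z, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gram_matrix(points, kernel):
    """Matrix of kernel values K(x_i, x_j) for every pair of points."""
    return [[kernel(a, b) for b in points] for a in points]

points = [(0.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
G = gram_matrix(points, gaussian_kernel)
print(G)
```

Replacing every dot product in the dual with one of these functions gives a non-linear classifier without changing the optimization procedure.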

26 Non-separable case
- Some data sets may not be linearly separable
- Introduce slack variables $\xi_i$
  - Also helps regularization
  - Less sensitive to outliers
Minimize
$$\frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i$$
Subject to
$$y_i (w^T x_i + b) \geq 1 - \xi_i \quad \forall i$$
$$\xi_i \geq 0 \quad \forall i$$
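
For a fixed hyperplane, the smallest slack that satisfies both constraints is the larger of zero and one minus the signed score of the point, so the objective can be evaluated directly. The sketch below does this for a hypothetical toy set with one deliberately mislabeled point; the helper name, the data, the hyperplane, and the value of `C` are all assumptions for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def soft_margin_objective(X, y, w, b, C):
    """Return (objective, slack) with slack_i = max(0, 1 - y_i (w^T x_i + b))."""
    slack = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    objective = 0.5 * dot(w, w) + C * sum(slack)
    return objective, slack

# Hypothetical data: the last point is labeled -1 but sits on the positive side.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (1.5, 1.5)]
y = [1, 1, -1, -1, -1]

objective, slack = soft_margin_objective(X, y, w=(0.5, 0.5), b=-1.0, C=1.0)
print(objective, slack)
```

Only the outlier pays a penalty, and `C` sets how much that penalty weighs against a wide margin: a larger `C` pushes the solution toward fitting the outlier, a smaller one toward ignoring it.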

27 Summary
[Figure: Features X, PREDICT, CLASS1 / CLASS2]
Linear classification methods
- Fisher's Linear Discriminant
- Perceptron
- Support Vector Machines

