Statistical Methods for NLP
Text Categorization, Support Vector Machines
Sameer Maskey

Announcements
Reading assignments will be posted online tonight. Homework 1 is assigned and available from the course website; it is due in 2 weeks (Feb 16, 4pm) and consists of 2 programming assignments.

Project Proposals
A reminder to think about projects. Proposals are due in 3 weeks (Feb 23).

Topics for Today
Naïve Bayes classifier for text
Smoothing
Support Vector Machines
Paper review session

Naïve Bayes Classifier for Text
$$P(y_k, X_1, X_2, \ldots, X_N) = P(y_k) \prod_i P(X_i \mid y_k)$$
$P(y_k)$ is the prior probability of the class, and $P(X_i \mid y_k)$ is the conditional probability of a feature given the class. Here $N$ is the number of words in the document, not to be confused with the total vocabulary size.

Naïve Bayes Classifier for Text
$$P(y = y_k \mid X_1, X_2, \ldots, X_N) = \frac{P(y = y_k)\, P(X_1, X_2, \ldots, X_N \mid y = y_k)}{\sum_j P(y = y_j)\, P(X_1, X_2, \ldots, X_N \mid y = y_j)} = \frac{P(y = y_k) \prod_i P(X_i \mid y = y_k)}{\sum_j P(y = y_j) \prod_i P(X_i \mid y = y_j)}$$
$$\hat{y} \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(X_i \mid y = y_k)$$

Naïve Bayes Classifier for Text
Given the training data, what are the parameters to be estimated?
$P(y)$: Diabetes: 0.8, Hepatitis: 0.2
$P(X \mid y_1)$ (Diabetes): the: 0.001, diabetic: 0.02, blood: 0.0015, sugar: 0.02, weight: 0.018
$P(X \mid y_2)$ (Hepatitis): the: 0.001, diabetic: 0.0001, water: 0.0118, fever: 0.01, weight: 0.008
$$\hat{y} \leftarrow \arg\max_{y_k} P(y = y_k) \prod_i P(X_i \mid y = y_k)$$

Estimating Parameters
Maximum likelihood estimates are just relative frequency counts. For a new document, find which class gives the higher posterior probability, e.g., by thresholding the log ratio of the two posteriors, and classify accordingly.
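
A minimal sketch of this pipeline, assuming bag-of-words documents and the two classes from the previous slide; the function names (train_nb, classify) are illustrative, not from the lecture. Unseen words get a tiny probability floor here; the next slides replace that hack with proper smoothing.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """MLE: class priors and per-class word probabilities (relative frequency counts)."""
    priors = Counter(labels)
    word_counts = {y: Counter() for y in priors}
    for doc, y in zip(docs, labels):
        word_counts[y].update(doc)
    totals = {y: sum(c.values()) for y, c in word_counts.items()}
    return ({y: n / len(labels) for y, n in priors.items()},
            {y: {w: c / totals[y] for w, c in wc.items()} for y, wc in word_counts.items()})

def classify(doc, priors, cond):
    """Pick the class with the higher log posterior (sums of logs avoid underflow)."""
    def log_post(y):
        return math.log(priors[y]) + sum(math.log(cond[y].get(w, 1e-10)) for w in doc)
    return max(priors, key=log_post)

docs = [["blood", "sugar", "weight"], ["water", "fever"]]
labels = ["Diabetes", "Hepatitis"]
priors, cond = train_nb(docs, labels)
print(classify(["sugar", "weight"], priors, cond))  # -> Diabetes
```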

Smoothing
MLE for Naïve Bayes (relative frequency counts) may not generalize well: a single zero count zeroes out the entire product. Smoothing fixes this by balancing prior belief against the data: with less evidence, believe the prior more; with more evidence, believe the data more.

Laplace Smoothing
Assume we have one more count for each element, so zero counts become 1:
$$P_{\text{smooth}}(w) = \frac{c_w + 1}{\sum_{w'} \{c(w') + 1\}} = \frac{c_w + 1}{N + V}$$
where $N$ is the total number of tokens and $V$ is the vocabulary size.
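
As a sketch, the smoothed estimate drops into the classifier above in place of the raw relative frequencies (laplace_smoothed is an illustrative name):

```python
from collections import Counter

def laplace_smoothed(counts: Counter, vocab_size: int):
    """Add-one smoothing: every word, seen or unseen, gets one pseudo-count."""
    n = sum(counts.values())
    return lambda w: (counts[w] + 1) / (n + vocab_size)

p = laplace_smoothed(Counter({"sugar": 2, "blood": 1}), vocab_size=5)
print(p("sugar"), p("fever"))  # 3/8 and 1/8: the unseen word no longer gets zero
```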

Back to Discriminative Classification
$$f(x) = w^T x + b$$
[Figure: a separating hyperplane with weight vector $w$ and offset $b$.]

Linear Classification
If we have linearly separable data, we can find $w$ such that
$$y_i (w^T x_i + b) > 0 \quad \forall i$$

Margin
Let us scale the hyperplane such that
$$w^T x_i + b \ge +1 \text{ if } y_i = +1, \qquad w^T x_i + b \le -1 \text{ if } y_i = -1,$$
i.e., $y_i (w^T x_i + b) - 1 \ge 0 \;\forall i$.
The total margin is the sum of $d_+$ and $d_-$, the distances from the separating hyperplane to the closest positive and negative examples.

Maximizing Margin
The distance between $H$ and $H_+$ is $\frac{1}{\|w\|}$, so the distance between $H_+$ and $H_-$ is $\frac{2}{\|w\|}$. In order to maximize the margin we need to minimize the denominator $\|w\|$, or equivalently minimize $\frac{1}{2}\|w\|^2$.
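
Writing out the distance step the slide leaves implicit (the standard point-to-hyperplane distance, with $x_+$ any support vector on $H_+$):

```latex
% distance from a point x_0 to the hyperplane H: w^T x + b = 0
d(x_0, H) = \frac{|w^T x_0 + b|}{\|w\|}
% for x_+ on H_+ we have w^T x_+ + b = 1, hence
d_+ = \frac{1}{\|w\|}, \qquad d_- = \frac{1}{\|w\|} \text{ by symmetry}, \qquad
d_+ + d_- = \frac{2}{\|w\|}
```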

Maximizing Margin with Constraints
We can combine the two inequalities to get $y_i (w^T x_i + b) - 1 \ge 0 \;\forall i$. Problem formulation:
$$\min_{w,b} \; \frac{\|w\|^2}{2} \quad \text{subject to} \quad y_i (w^T x_i + b) - 1 \ge 0 \;\forall i$$

Solving with Lagrange Multipliers
Solve by introducing Lagrange multipliers for the constraints. Minimize
$$J(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{n} \alpha_i \{ y_i (w^T x_i + b) - 1 \}$$
For given $\alpha_i$:
$$\nabla_w J(w, b, \alpha) = w - \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial}{\partial b} J(w, b, \alpha) = -\sum_{i=1}^{n} \alpha_i y_i$$

Dual Problem
Solve the dual problem instead. Maximize
$$J(\alpha) = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$$
subject to the constraints
$$\alpha_i \ge 0 \;\forall i, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0$$
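
A sketch of solving this dual with a generic constrained optimizer (not the dedicated QP or SMO solvers a real SVM package would use); the toy data and names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def solve_dual(X, y):
    """Maximize J(alpha) by minimizing -J(alpha) under the dual constraints."""
    n = len(y)
    Q = (y[:, None] * X) @ (y[:, None] * X).T   # Q_ij = y_i y_j (x_i . x_j)
    neg_J = lambda a: 0.5 * a @ Q @ a - a.sum()
    res = minimize(neg_J, np.zeros(n), method="SLSQP",
                   bounds=[(0, None)] * n,                            # alpha_i >= 0
                   constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0
    return res.x

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = solve_dual(X, y)
w = (alpha * y) @ X                  # w-hat = sum_i alpha_i y_i x_i (next slides)
sv = alpha > 1e-6                    # only the support vectors have nonzero alpha
b = np.mean(y[sv] - X[sv] @ w)       # recover b from the support vectors
print(alpha.round(3), w.round(3), round(b, 3))
```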

Quadratic Programming Problem
Minimize $f(x)$ such that $g(x) = k$, where $f(x)$ is quadratic and the $g(x)$ are linear constraints. This is a constrained optimization problem; we saw an example of this before.

SVM Solution
The weight vector is a linear combination of the weighted training examples:
$$\hat{w} = \sum_{i=1}^{n} \hat{\alpha}_i y_i x_i$$
It is a sparse solution. Why? The weights are zero for non-support vectors, so the decision function only involves the support vectors:
$$f(x) = \sum_{i \in SV} \alpha_i y_i (x_i \cdot x) + b$$

Sequential Minimal Optimization (SMO) Algorithm
The weights are just linear combinations of the training vectors weighted with alphas, but we still have not answered how we get the alphas. SMO uses coordinate ascent:
do until converged:
    select a pair alpha(i), alpha(j)
    reoptimize W(alpha) with respect to alpha(i) and alpha(j), holding all other alphas constant
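
Below is a heavily compressed sketch in the spirit of the "simplified SMO" often used for teaching; it assumes a linear kernel, picks the second alpha at random, uses the soft-margin box constraint $0 \le \alpha_i \le C$ from the non-separable case a few slides below, and uses a crude update for $b$. A production solver would use heuristic pair selection and a more careful update.

```python
import numpy as np

def smo_simplified(X, y, C=1.0, tol=1e-4, max_passes=20):
    n = len(y)
    K = X @ X.T                               # Gram matrix K_ij = x_i . x_j
    alpha, b = np.zeros(n), 0.0
    passes = 0
    while passes < max_passes:
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]        # prediction error on x_i
            if (y[i] * E_i < -tol and alpha[i] < C) or (y[i] * E_i > tol and alpha[i] > 0):
                j = np.random.choice([k for k in range(n) if k != i])
                E_j = (alpha * y) @ K[:, j] + b - y[j]
                a_i_old, a_j_old = alpha[i], alpha[j]
                if y[i] != y[j]:              # box constraints keep 0 <= alpha <= C
                    L, H = max(0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
                else:
                    L, H = max(0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:
                    continue
                alpha[j] = np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H)
                if abs(alpha[j] - a_j_old) < 1e-5:
                    continue
                alpha[i] += y[i] * y[j] * (a_j_old - alpha[j])   # keeps sum_i alpha_i y_i = 0
                b -= E_i + y[i] * (alpha[i] - a_i_old) * K[i, i] \
                         + y[j] * (alpha[j] - a_j_old) * K[i, j]
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    return alpha, b
```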

Not Linearly Separable

Transformation
[Figure: a nonlinear transformation $h(\cdot)$ maps the data into a space where it becomes linearly separable.]

Non-Linear SVMs
Map data to a higher dimension where linear separation is possible. We can get a longer feature vector by adding dimensions:
$$x \mapsto (x^2, x)$$
$$\phi(x) = (x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2,\; 1)$$

Kernels
Given a feature mapping $\phi(x)$, define $K(x, z) = \phi(x)^T \phi(z)$:
$$\phi(x)^T \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = (x \cdot z + 1)^2$$
So we may not need to explicitly transform the data.

Examples of Kernel Functions
Linear kernel: $K(x, z) = x \cdot z$
Polynomial kernel: $K(x, z) = (x \cdot z + 1)^p$
Gaussian kernel: $K(x, z) = \exp\left(-\frac{\|x - z\|^2}{2\sigma^2}\right)$
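
The three kernels as code, with a numeric check of the previous slide's identity that the degree-2 polynomial kernel equals an inner product under the explicit map $\phi$:

```python
import numpy as np

linear = lambda x, z: x @ z
poly = lambda x, z, p=2: (x @ z + 1) ** p
gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def phi(x):
    """Explicit degree-2 feature map for 2-d inputs (from the Non-Linear SVMs slide)."""
    s2 = np.sqrt(2)
    return np.array([x[0]**2, x[1]**2, s2 * x[0] * x[1], s2 * x[0], s2 * x[1], 1.0])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly(x, z), phi(x) @ phi(z))  # both 4.0: the kernel never computes phi
```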

Non-Separable Case
Some data sets may not be linearly separable, so introduce slack variables $\xi_i$. This also helps regularization: the classifier becomes less sensitive to outliers.
$$\min_{w, b, \xi} \; \frac{\|w\|^2}{2} + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i (w^T x_i + b) \ge 1 - \xi_i \;\forall i, \qquad \xi_i \ge 0 \;\forall i$$
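
This soft-margin problem is what off-the-shelf SVM packages solve. A quick sketch with scikit-learn (assuming it is installed), showing how $C$ trades margin width against slack:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian clouds: not linearly separable, so slack is needed.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.2, (50, 2)), rng.normal(1, 1.2, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # small C tolerates slack (wide margin, many support vectors);
    # large C penalizes slack (narrow margin, fewer support vectors)
    print(C, len(clf.support_), clf.score(X, y))
```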

Summary
[Figure: features $X$ feed a classifier that predicts CLASS1 or CLASS2.]
Linear classification methods: Fisher's Linear Discriminant, the Perceptron, Support Vector Machines.

References Tutorials on www.svms.org/tutorial