Natural Language Processing and Information Retrieval

Natural Language Processing and Information Retrieval. Support Vector Machines. Alessandro Moschitti, Department of Information and Communication Technology, University of Trento. Email: moschitti@disi.unitn.it

Summary: Support Vector Machines; Hard-margin SVMs; Soft-margin SVMs

Which hyperplane should we choose?

Classifier with a Maximum Margin. IDEA 1: Select the hyperplane with the maximum margin. (Figure: two classes plotted on axes Var 1 / Var 2, with the margin drawn on both sides of the separating hyperplane.)

Support Vectors. (Figure: the examples lying on the margin boundaries, on axes Var 1 / Var 2, are the support vectors.)

Support Vector Machine Classifiers. The margin is equal to $\frac{2k}{\|w\|}$. (Figure: the hyperplanes $w \cdot x + b = k$, $w \cdot x + b = -k$ and $w \cdot x + b = 0$ on axes Var 1 / Var 2.)

Support Vector Machines. The margin is equal to $\frac{2k}{\|w\|}$, so we need to solve: $\max \frac{2k}{\|w\|}$ subject to $w \cdot x_i + b \ge +k$ if $x_i$ is positive, and $w \cdot x_i + b \le -k$ if $x_i$ is negative. (Figure: the hyperplanes $w \cdot x + b = \pm k$ and $w \cdot x + b = 0$.)

Support Vector Machines. There is a scale for which k = 1. The problem transforms into: $\max \frac{2}{\|w\|}$ subject to $w \cdot x_i + b \ge +1$ if $x_i$ is positive, and $w \cdot x_i + b \le -1$ if $x_i$ is negative. (Figure: the hyperplanes $w \cdot x + b = \pm 1$ and $w \cdot x + b = 0$.)

Final Formulation. $\max \frac{2}{\|w\|}$ s.t. $w \cdot x_i + b \ge +1$ for $y_i = 1$, $w \cdot x_i + b \le -1$ for $y_i = -1$ $\Rightarrow$ $\max \frac{2}{\|w\|}$ s.t. $y_i(w \cdot x_i + b) \ge 1$ $\Rightarrow$ $\min \frac{\|w\|}{2}$ s.t. $y_i(w \cdot x_i + b) \ge 1$ $\Rightarrow$ $\min \frac{\|w\|^2}{2}$ s.t. $y_i(w \cdot x_i + b) \ge 1$

Optimization Problem. Optimal hyperplane: minimize $\tau(w) = \frac{1}{2}\|w\|^2$ subject to $y_i((w \cdot x_i) + b) \ge 1$, $i = 1, \ldots, m$. The dual problem is simpler.
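
As a small aside (not part of the original slides), the primal problem above can be handed directly to a generic convex solver. A minimal sketch using cvxpy on an invented, linearly separable toy dataset:

# Hard-margin SVM primal (sketch; assumes the toy data below is linearly separable)
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy points
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels in {-1, +1}

m, d = X.shape
w = cp.Variable(d)
b = cp.Variable()

# minimize (1/2)||w||^2  subject to  y_i (w . x_i + b) >= 1
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)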

Lagrangian Definition

Dual Optimization Problem

Dual Transformation. Given the Lagrangian associated with our problem, to solve the dual problem we need to evaluate it at its stationary point: let us set the derivatives to 0, with respect to $w$

Dual Transformation (cont'd): and with respect to $b$. Then we substitute them into the Lagrange function.

Final Dual Problem
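
The formulas on these three slides were images and did not survive the transcription; for reference, the standard hard-margin Lagrangian and the dual it leads to (consistent with the primal above) are:

\begin{aligned}
L(w, b, \alpha) &= \tfrac{1}{2}\|w\|^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0 \\
\text{maximize}\;& \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j)
\quad \text{subject to}\; \alpha_i \ge 0,\; \sum_{i=1}^{m} \alpha_i y_i = 0
\end{aligned}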

Kuhn-Tucker Theorem. Necessary and sufficient conditions for optimality: $\nabla_{w} L(w, \alpha, \beta) = 0$; $\frac{\partial}{\partial b} L(w, \alpha, \beta) = 0$; $\alpha_i g_i(w) = 0$, $i = 1, \ldots, m$; $g_i(w) \ge 0$, $i = 1, \ldots, m$; $\alpha_i \ge 0$, $i = 1, \ldots, m$.

Properties coming from the constraints. Lagrange constraints: $\sum_{i=1}^{m} \alpha_i y_i = 0$ and $w = \sum_{i=1}^{m} \alpha_i y_i x_i$. Karush-Kuhn-Tucker constraints: $\alpha_i \left[ y_i (x_i \cdot w + b) - 1 \right] = 0$, $i = 1, \ldots, m$. Support vectors have non-null $\alpha_i$. To evaluate $b$, we can apply the following equation.
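
The equation itself is missing from the transcription; the standard choice, which follows from the KKT condition above for any support vector $x_i$ (i.e. $\alpha_i > 0$), is

b = y_i - w \cdot x_i = y_i - \sum_{j=1}^{m} \alpha_j y_j \,(x_j \cdot x_i)

and in practice it is often averaged over all support vectors for numerical stability.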

Warning! In the graphical examples we always consider normalized hyperplanes (hyperplanes with a normalized gradient); in this case $b$ is exactly the distance of the hyperplane from the origin. So if we have a non-normalized equation, we may have $x \cdot w' + b = 0$ with $x = (x, y)$ and $w' = (1, 1)$, and $b$ is not that distance.

Warning! Let us consider a normalized gradient $w = (1/\sqrt{2}, 1/\sqrt{2})$: $(x, y) \cdot (1/\sqrt{2}, 1/\sqrt{2}) + b = 0 \Rightarrow x/\sqrt{2} + y/\sqrt{2} = -b \Rightarrow y = -x - b\sqrt{2}$. Now we see that $-b$ is exactly the distance: for $x = 0$ we have the intersection with the $y$-axis at $-b\sqrt{2}$, and this distance projected onto $w$ is $-b$.

Soft Margin SVMs. Slack variables $\xi_i$ are added: some errors are allowed, but they should penalize the objective function. (Figure: the hyperplanes $w \cdot x + b = 1$, $w \cdot x + b = -1$ and $w \cdot x + b = 0$ on axes Var 1 / Var 2, with slack for the points violating the margin.)

Soft Margin SVMs. The new constraints are $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, $\forall x_i$, where $\xi_i \ge 0$. The objective function penalizes the incorrectly classified examples: $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$. $C$ is the trade-off between the margin and the error. (Figure: the hyperplanes $w \cdot x + b = \pm 1$ and $w \cdot x + b = 0$.)

Dual formulation. $\min \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i^2$ subject to $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, $i = 1, \ldots, m$, and $\xi_i \ge 0$, $i = 1, \ldots, m$. $L(w, b, \xi, \alpha) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{m} \xi_i^2 - \sum_{i=1}^{m} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right]$. We proceed by taking the derivatives with respect to $w$, $\xi$ and $b$.

Partial Derivatives

Substitution in the objective function ($\delta_{ij}$ is the Kronecker delta)

Final dual optimization problem
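
The derivations on these slides were also images; assuming the 2-norm-slack primal as reconstructed above, the standard steps and the resulting dual are:

\begin{aligned}
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i x_i, \qquad
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{m} \alpha_i y_i = 0, \qquad
\frac{\partial L}{\partial \xi_i} = 0 \;\Rightarrow\; \xi_i = \frac{\alpha_i}{2C} \\
\text{maximize}\;& \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j y_i y_j \Big( x_i \cdot x_j + \frac{\delta_{ij}}{2C} \Big)
\quad \text{subject to}\; \alpha_i \ge 0,\; \sum_{i=1}^{m} \alpha_i y_i = 0
\end{aligned}

where $\delta_{ij}$ is the Kronecker delta mentioned above.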

Soft Margin Support Vector Machines. The algorithm tries to keep the $\xi_i$ low and maximize the margin. NB: the number of errors is not directly minimized (that would be an NP-complete problem); the distances from the hyperplane are minimized instead. If $C \to \infty$, the solution tends to the one of the hard-margin algorithm. Attention: if $C = 0$ we get $w = 0$, since $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ with constraints $y_i(w \cdot x_i + b) \ge 1 - \xi_i$, $\forall x_i$, $\xi_i \ge 0$ reduces to the constraints $y_i b \ge 1 - \xi_i$, $\forall x_i$, which the slacks alone can satisfy at no cost. If $C$ increases, the number of errors decreases; when $C$ tends to infinity, the number of errors must be 0, i.e., the hard-margin formulation.

Robustness of Soft vs. Hard Margin SVMs. (Figure: the separating hyperplane $w \cdot x + b = 0$ learned on the same data, axes Var 1 / Var 2, by the Soft Margin SVM (left, absorbing an outlier through its slack $\xi_i$) and by the Hard Margin SVM (right).)

Soft vs. Hard Margin SVMs. Soft-margin always has a solution. Soft-margin is more robust to odd examples. Hard-margin does not require parameters.

Parameters. $\min \frac{1}{2}\|w\|^2 + C \sum_i \xi_i = \min \frac{1}{2}\|w\|^2 + C^{+} \sum_{i \in pos} \xi_i + C^{-} \sum_{i \in neg} \xi_i = \min \frac{1}{2}\|w\|^2 + C \big( J \sum_{i \in pos} \xi_i + \sum_{i \in neg} \xi_i \big)$. $C$: trade-off parameter; $J$: cost factor.
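
A minimal sketch (not from the slides) of how these two knobs are usually exposed in practice: in scikit-learn's SVC, C is the trade-off parameter and class_weight plays the role of the cost factor J by rescaling C per class; the data below is invented for illustration.

# Soft-margin linear SVM: C = margin/error trade-off, class_weight ~ cost factor J
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_pos = rng.randn(20, 2) + [2.0, 2.0]     # toy positive examples
X_neg = rng.randn(60, 2) - [2.0, 2.0]     # toy negative examples (imbalanced)
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 20 + [-1] * 60)

J = 3.0  # penalize slack on positive examples J times more than on negatives
clf = SVC(kernel="linear", C=1.0, class_weight={1: J, -1: 1.0})
clf.fit(X, y)

print("w =", clf.coef_, "b =", clf.intercept_)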

Theoretical Justification

Definition of Training Set Error. Training data: $(x_1, y_1), \ldots, (x_m, y_m) \in \mathbb{R}^N \times \{\pm 1\}$ and a classifier $f : \mathbb{R}^N \to \{\pm 1\}$. Empirical risk (error): $R_{emp}[f] = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left| f(x_i) - y_i \right|$. Risk (error): $R[f] = \int \frac{1}{2} \left| f(x) - y \right| \, dP(x, y)$.

Error Characterization (part 1). From PAC-learning theory (Vapnik): $R(\alpha) \le R_{emp}(\alpha) + \Phi\big(\frac{d}{m}, \frac{\log \delta}{m}\big)$, with $\Phi\big(\frac{d}{m}, \frac{\log \delta}{m}\big) = \sqrt{\frac{d\,(\log\frac{2m}{d} + 1) - \log\frac{\delta}{4}}{m}}$, where $d$ is the VC-dimension, $m$ is the number of examples, $\delta$ is a bound on the probability of getting such an error, and $\alpha$ is a classifier parameter.

There are many versions for different bounds.

Error Characterization (part 2)

Ranking, Regression and Multiclassification

The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002]. The aim is to classify instance pairs as correctly ranked or incorrectly ranked; this turns an ordinal regression problem back into a binary classification problem. We want a ranking function $f$ such that $x_i > x_j$ iff $f(x_i) > f(x_j)$, or at least one that tries to do this with minimal error. Suppose that $f$ is a linear function: $f(x_i) = w \cdot x_i$.

The Ranking SVM (Sec. 15.4.2). Ranking model: $f(x_i)$.

The Ranking SVM (Sec. 15.4.2). Then (combining the two equations on the last slide): $x_i > x_j$ iff $w \cdot x_i - w \cdot x_j > 0$, i.e., $x_i > x_j$ iff $w \cdot (x_i - x_j) > 0$. Let us then create a new instance space from such pairs: $z_k = x_i - x_k$, with $y_k = +1, -1$ as $x_i >, < x_k$.

Support Vector Ranking. $\min \frac{1}{2}\|w\|^2 + C \sum_{k=1}^{m^2} \xi_k^2$ subject to $y_k (w \cdot (x_i - x_j) + b) \ge 1 - \xi_k$, $\xi_k \ge 0$, $k = 1, \ldots, m^2$, $i, j = 1, \ldots, m$, where $y_k = 1$ if $rank(x_i) > rank(x_j)$, $-1$ otherwise, and $k = i \cdot m + j$. Given two examples, we build one example $(x_i, x_j)$.
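
A minimal sketch (not from the slides) of the pair construction above: build one difference vector per ordered pair and train a standard soft-margin linear SVM on it. Here scikit-learn's LinearSVC is used as a stand-in for the QP above; the data and relevance scores are invented.

# Ranking SVM sketch: pairwise difference vectors + a binary linear SVM
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.randn(30, 5)                                   # toy feature vectors
true_w = np.array([1.0, -0.5, 2.0, 0.0, 0.3])
scores = X @ true_w + 0.1 * rng.randn(30)              # toy relevance scores

# one training example per ordered pair: z = x_i - x_j, y = sign of the rank difference
Z, y = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if i == j:
            continue
        Z.append(X[i] - X[j])
        y.append(1 if scores[i] > scores[j] else -1)

ranker = LinearSVC(C=1.0)                              # soft-margin stand-in for the QP above
ranker.fit(np.array(Z), np.array(y))

w = ranker.coef_.ravel()                               # w defines the ranking function f(x) = w . x
print("learned ranking weights:", w)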

Support Vector Regression (SVR). $f(x) = w \cdot x + b$. (Figure: the regression function with an $\varepsilon$-insensitive tube between $+\varepsilon$ and $-\varepsilon$.) Minimize: $\frac{1}{2} w^{T} w$. Constraints: $y_i - w^{T} x_i - b \le \varepsilon$; $w^{T} x_i + b - y_i \le \varepsilon$.

Support Vector Regression (SVR). $f(x) = w \cdot x + b$. (Figure: the $\varepsilon$-tube with slack variables $\xi_i$ above it and $\xi_i^{*}$ below it.) Minimize: $\frac{1}{2} w^{T} w + C \sum_{i=1}^{N} (\xi_i + \xi_i^{*})$. Constraints: $y_i - w^{T} x_i - b \le \varepsilon + \xi_i$; $w^{T} x_i + b - y_i \le \varepsilon + \xi_i^{*}$; $\xi_i, \xi_i^{*} \ge 0$.

Support Vector Regression. $y_i$ is not $-1$ or $1$ anymore; it is now a real value. $\varepsilon$ is the tolerance on our function value.
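
A minimal sketch (not from the slides), assuming scikit-learn's SVR as the solver: epsilon is the tube width $\varepsilon$ and C penalizes points that fall outside the tube; the data is invented.

# epsilon-SVR sketch: epsilon is the tube width, C penalizes points outside the tube
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, size=(40, 1)), axis=0)   # toy inputs
y = 0.8 * X.ravel() + 0.2 * rng.randn(40)                # toy real-valued targets

reg = SVR(kernel="linear", C=1.0, epsilon=0.1)
reg.fit(X, y)

print("predicted f(0) =", reg.predict([[0.0]])[0])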

From Binary to Multiclass Classifiers. Three different approaches. ONE-vs-ALL (OVA): given the example sets {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, the binary classifiers {b1, b2, b3, ...} are built. For b1, E1 is the set of positives and E2 ∪ E3 ∪ ... is the set of negatives, and so on. For testing: given a classification instance x, the category is the one associated with the maximum margin among all binary classifiers.

From Binary to Multiclass Classifiers. ALL-vs-ALL (AVA): given the examples {E1, E2, E3, ...} for the categories {C1, C2, C3, ...}, build the binary classifiers {b1_2, b1_3, ..., b1_n, b2_3, b2_4, ..., b2_n, ..., bn-1_n} by learning on E1 (positives) and E2 (negatives), on E1 (positives) and E3 (negatives), and so on. For testing: given an example x, the votes of all classifiers are collected, where b_E1E2 = 1 means a vote for C1 and b_E1E2 = -1 is a vote for C2. Select the category that gets the most votes.

From Binary to Multiclass Classifiers. Error Correcting Output Codes (ECOC): the training set is partitioned according to binary sequences (codes) associated with category sets. For example, 10101 indicates that the sets of examples of C1, C3 and C5 are used to train the C_10101 classifier; the data of the other categories, i.e. C2 and C4, will be the negative examples. In testing, the code-classifiers are used to decode one of the original classes, e.g. C_10101 = 1 and C_11010 = 1 indicates that the instance belongs to C1, that is, the only class consistent with both codes.
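
A minimal sketch (not from the slides) showing how the three schemes map onto scikit-learn's meta-estimators wrapped around a binary linear SVM; the dataset is synthetic and the code_size value is illustrative.

# OVA, AVA and ECOC wrappers around a binary SVM (toy data; illustrative sketch)
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)

ova = OneVsRestClassifier(LinearSVC(C=1.0)).fit(X, y)      # one binary SVM per class
ava = OneVsOneClassifier(LinearSVC(C=1.0)).fit(X, y)       # one SVM per pair of classes
ecoc = OutputCodeClassifier(LinearSVC(C=1.0),              # random binary codes per class
                            code_size=2, random_state=0).fit(X, y)

for name, clf in [("OVA (one-vs-all)", ova), ("AVA (all-vs-all)", ava), ("ECOC", ecoc)]:
    print(name, "training accuracy:", clf.score(X, y))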

SVM-light: an implementation of SVMs. Implements the soft margin. Contains the procedures for solving the optimization problems. Binary classifier. Examples and descriptions on the web site: http://www.joachims.org/ (http://svmlight.joachims.org/)

Structured Output

Multi-classification

References:
A Tutorial on Support Vector Machines for Pattern Recognition, Chris Burges (downloadable article).
The Vapnik-Chervonenkis Dimension and the Learning Capability of Neural Nets (downloadable presentation).
Computational Learning Theory, Sally A. Goldman, Washington University, St. Louis, Missouri (downloadable article).
An Introduction to Support Vector Machines (and other kernel-based learning methods), N. Cristianini and J. Shawe-Taylor, Cambridge University Press (check our library).
The Nature of Statistical Learning Theory, Vladimir Naumovich Vapnik, Springer Verlag (December 1999) (check our library).

Exercise. 1. The equations of SVMs for Classification, Ranking and Regression (you can get them from my slides). 2. The perceptron algorithm for Classification, Ranking and Regression (the last two you have to provide by looking at what you wrote in point (1)). 3. The same as point (2), using kernels (write the kernel definition as the introduction of this section).