
Machine Learning. Ludovic Samper, Antidot. September 1st, 2015.

Antidot. Software vendor since 1999. Paris, Lyon, Aix-en-Provence. 45 employees. Founders: Fabrice Lacroix (CEO), Stéphane Loesel (CTO), Jérôme Mainka (Chief Scientist Officer). Software products and solutions: Antidot Finder Suite (AFS), a search engine; Antidot Information Factory (AIF), a pipe & filters framework. SaaS, Hosted License, On-site License. 50% of the revenue invested in R&D.

Antidot Machine Learning. Automatic text document classification, Named Entity Extraction, Compound Splitter (for German words), Clustering algorithm (for news aggregation). Open Data, Semantic Web: http://www.rechercheisidore.fr/ a Social Sciences and Humanities research platform, enriched with open resources; https://github.com/antidot/db2triples/ an open-source library to export a database as RDF. Antidot is a partner organization in the WDAqua project.

Tutorial. Study a classical task in Machine Learning: text classification. Show scikit-learn.org, a Python machine learning library. Follow the Working with text data tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html Additional material on http://blog.antidot.net/

Summary of the tutorial. 1 Problem definition: Supervised classification; Evaluation metrics. 2 Extracting features from text files: Bag of words model; Term frequency inverse document frequency (tfidf). 3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters (Cross validation, Grid search). 4 Conclusion: Methodology.

Outline. 1 Problem definition: Supervised classification; Evaluation metrics. 2 Extracting features from text files. 3 Algorithms for classification. 4 Conclusion.

20 newsgroups dataset. http://qwone.com/~jason/20newsgroups/ Newsgroup documents collected in the 90s; the label is the newsgroup the document belongs to. A popular collection: 18846 documents, 11314 in train and 7532 in test. wiss-ml.ipynb#the-20-newsgroups-dataset
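A minimal sketch, assuming scikit-learn's standard fetch_20newsgroups loader, of how the dataset can be loaded; the exact calls in wiss-ml.ipynb may differ:

```python
# Sketch: load the 20 newsgroups train/test splits with scikit-learn
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')   # 11314 documents
test = fetch_20newsgroups(subset='test')     # 7532 documents

print(len(train.data), len(test.data))       # raw texts
print(train.target_names[train.target[0]])   # newsgroup label of the first document
```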

Classification. Problem statement: each document has one label; the goal is to automatically determine the label of an unseen document, given a set of documents and their labels. This is a supervised classification problem. Training: from the set of documents and their labels, build a model. Inference: given a new document, use the model to predict its label.

Precision and Recall I. Binary classification for a class C: documents labeled C that belong to C are True Positives (TP); documents labeled C that do not belong to C are False Positives (FP); documents not labeled C that belong to C are False Negatives (FN); documents not labeled C that do not belong to C are True Negatives (TN). Precision = P(e ∈ C | e labeled C) = TP / (TP + FP). Recall = P(e labeled C | e ∈ C) = TP / (TP + FN).

Precision and Recall II. F1 = 2PR / (P + R), the harmonic mean of Precision and Recall. Accuracy = (TP + TN) / (TP + TN + FP + FN).

Multiclass I. N_C = number of classes. Macro average: B_macro = (1/N_C) Σ_{k=1..N_C} B_binary(TP_k, FP_k, TN_k, FN_k). The measure is averaged by class: large classes count as much as small ones. Micro average: B_micro = B_binary(Σ_{k=1..N_C} TP_k, Σ_{k=1..N_C} FP_k, Σ_{k=1..N_C} TN_k, Σ_{k=1..N_C} FN_k). The measure is averaged by instance.

Multiclass II. Micro average in single-label multiclass: each misclassified document counts as one false negative (for its true class) and one false positive (for its predicted class), so Σ_{k=1..N_C} FN_k = Σ_{k=1..N_C} FP_k. Then Precision_micro = Recall_micro = Accuracy = (Σ_{k=1..N_C} TP_k) / Nb_doc.
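To make the macro/micro distinction concrete, a small sketch with scikit-learn's metrics on made-up labels (the y_true/y_pred values are purely illustrative):

```python
# Sketch: macro vs micro averaging on a toy multiclass example
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

macro = precision_recall_fscore_support(y_true, y_pred, average='macro')
micro = precision_recall_fscore_support(y_true, y_pred, average='micro')

print(macro[:3])                       # precision, recall, F1 averaged per class
print(micro[:3])                       # averaged per instance: P = R = F1 here
print(accuracy_score(y_true, y_pred))  # equals the micro-averaged scores (4/6)
```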

Outline. 1 Problem definition. 2 Extracting features from text files: Bag of words model; Term frequency inverse document frequency (tfidf). 3 Algorithms for classification. 4 Conclusion.

Bag of words. From text to features: count the number of occurrences of each word in the text; it is a bag because word positions aren't taken into account. Extensions: remove stop words; remove too frequent words (max_df); lowercase the text; n-grams (ngram_range), i.e. tokenize n-grams instead of single words, useful to partially take word positions into account. wiss-ml.ipynb#bag-of-words
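The options above map onto scikit-learn's CountVectorizer parameters; a small sketch on two made-up documents (the parameter values are illustrative, not the ones used in the tutorial):

```python
# Sketch: bag-of-words extraction with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate the cat"]

vectorizer = CountVectorizer(
    lowercase=True,         # lowercase the text before tokenizing
    stop_words='english',   # remove English stop words
    max_df=0.95,            # drop terms occurring in more than 95% of documents
    ngram_range=(1, 2),     # unigrams and bigrams, to recover some word order
)
X = vectorizer.fit_transform(docs)   # sparse document-term count matrix

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```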

Term frequency inverse document frequency (tfidf) I. Intuition: take into account the relative importance of each word with respect to the whole dataset. If a word occurs in every document, it doesn't carry any information.

Term frequency inverse document frequency (tfidf) II. Definition: tfidf(w, d) = tf(w, d) * idf(w), where tf(w, d) is the term frequency of word w in document d and idf(w) = log(N_doc / doc_freq(w)). In scikit-learn: tfidf(w, d) = tf(w, d) * (idf(w) + 1), so that terms occurring in all documents (idf = 0) are not ignored.

Term frequency inverse document frequency (tfidf) III. Options. Normalisation: each document vector is normalised to 1; e.g. with the L2 norm, Σ_{w in d} tfidf(w, d)² = 1. Smoothing: add one to the document frequencies, as if an extra document contained every term of the collection exactly once: idf(w) = log((N_doc + 1) / (doc_freq(w) + 1)). Example: show the most significant words of a document, wiss-ml.ipynb#tfidf
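A sketch of the same weighting with TfidfVectorizer; smooth_idf=True and norm='l2' correspond to the smoothing and normalisation described above (the example documents are made up):

```python
# Sketch: tf-idf weighting with L2 normalisation and smoothed idf
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate the cat"]

tfidf = TfidfVectorizer(norm='l2', smooth_idf=True)
X = tfidf.fit_transform(docs)

print(tfidf.idf_)                      # idf(w) + 1 for each term of the vocabulary
print(np.linalg.norm(X[0].toarray()))  # each document vector has L2 norm 1
```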

Outline. 1 Problem definition. 2 Extracting features from text files. 3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters (Cross validation, Grid search). 4 Conclusion.

Supervised classification problem I. Notations: x = (x_1, ..., x_n) ∈ R^n is a feature vector and n is the dimension of the feature space. {(x_d, y_d)}_{0 ≤ d < D} is the training set: x_d is the feature vector of document d, y_d ∈ {1, ..., N_C} is the class of document d, and N_C is the number of classes. ŷ denotes a class prediction: for a new vector x, ŷ is the predicted class of x.

Supervised classification problem II. Goal: find a function F : R^n → {1, ..., N_C}, x ↦ ŷ.

In 20 newsgroups I. Values in 20 newsgroups: n = 130107 features (number of unique terms), D = 11314 training samples, N_C = 20 different classes. Goal: find a function F that, given a new document, predicts its class.

Naïve Bayes Algorithm I. Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).

Naïve Bayes Algorithm II. Posterior probability of class C: P(C|x) = P(x|C) P(C) / P(x). Since P(x) does not depend on C, P(C|x) ∝ P(x|C) P(C). Naïve Bayes independence assumption: each feature i is conditionally independent of every other feature j, so P(C|x) ∝ P(C) Π_{i=1..n} P(x_i|C).

Naïve Bayes Algorithm III. Classifier from the probability model: ŷ = argmax_{k ∈ {1,...,N_C}} P(y = k) Π_{i=1..n} P(x_i | y = k).

Parameter estimation in the Naïve Bayes classifier. Prior of a class: P(y = k) = (nb samples in class k) / (total nb samples). It can also be uniform: P(y = k) = 1 / N_C.

Multinomial Naïve Bayes I. Naïve Bayes: P(x | y = k) = Π_{i=1..n} P(x_i | y = k). Multinomial distribution: the event "the word is i" follows a multinomial distribution with parameters (p_1, ..., p_n), where p_i = P(word = i) and Σ_i p_i = 1, so P(x_1, ..., x_n) ∝ Π_{i=1..n} p_i^{x_i}. One distribution for each class y.

Multinomial Naïve Bayes II. One multinomial distribution per class: P(i | y = k) = (sum of occurrences of word i in class k) / (total nb of words in class k) = (Σ_{d in k} x_i^d) / (Σ_{0 ≤ j < n} Σ_{d in k} x_j^d), where x_i^d is the number of occurrences of word i in document d. With smoothing: P(i | y = k) = (Σ_{d in k} x_i^d + α) / (Σ_{0 ≤ j < n} Σ_{d in k} x_j^d + α n).

Multinomial Naïve Bayes III. Inference in Multinomial Naïve Bayes: ŷ = argmax_k P(y = k | x) = argmax_k P(y = k) Π_{0 ≤ i < n} P(i | y = k)^{x_i} = argmax_k ( log P(y = k) + Σ_{0 ≤ i < n} x_i log P(i | y = k) ).

Multinomial Naïve Bayes IV. A linear model: in log space, (log P(y = k | x))_k = W_0 + W^T x (up to an additive constant), where W_0 is the vector of log priors, W_0k = log P(y = k), and W is the matrix of log distributions, W = (w_ik), i ∈ [1, n], k ∈ [1, N_C], with w_ik = log P(i | y = k).

Multinomial Naïve Bayes V. Example step-by-step: http://www.antidot.net/wiss2015/wiss-ml.html#naive-bayes
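For reference, a sketch of the whole Naïve Bayes chain on 20 newsgroups with a scikit-learn Pipeline; alpha plays the role of the smoothing parameter α of the previous slides, and the linked notebook may use different settings:

```python
# Sketch: tf-idf features + Multinomial Naive Bayes on 20 newsgroups
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

clf = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', MultinomialNB(alpha=1.0)),   # alpha = additive smoothing
])
clf.fit(train.data, train.target)
print(accuracy_score(test.target, clf.predict(test.data)))
```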

Outline. 1 Problem definition. 2 Extracting features from text files. 3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters (Cross validation, Grid search). 4 Conclusion.

A linear classifier (a sequence of figure slides).

Support Vector Machine, notations. Problem: S, the training set, is {(x_i, y_i)} with x_i ∈ R^n and y_i ∈ {-1, 1}, for i = 1..D. Find a linear function ⟨w, x⟩ + b such that sign(⟨w, x_i⟩ + b) = y_i.

SVM, maximum margin classifier.

Margin. distance(x_+, x_-) = ⟨w/||w||, x_+ - x_-⟩ = (1/||w||) (⟨w, x_+⟩ - ⟨w, x_-⟩) = (1/||w||) ((⟨w, x_+⟩ + b) - (⟨w, x_-⟩ + b)) = (1/||w||) (1 - (-1)) = 2/||w||.

SVM, maximum margin classifier.

Solving an optimization problem using the Lagrangian. Primal problem: minimize_{w,b} f(w, b) under the constraints h_i(w, b) ≥ 0. Lagrange function: L(w, b, α) = f(w, b) - Σ_i α_i h_i(w, b). Let g(α) = inf_{w,b} L(w, b, α); then for all w, b, g(α) ≤ L(w, b, α). Moreover, for feasible (w, b) and α_i ≥ 0, L(w, b, α) ≤ f(w, b). Thus, for α_i ≥ 0, g(α) ≤ min_{w,b} f(w, b). And under the Karush-Kuhn-Tucker (KKT) optimality condition, α_i h_i(w, b) = 0, we have max_α g(α) = min_{w,b} f(w, b).

Support Vector Machine, problem. Primal problem: minimize_{w,b} ||w||²/2 under the constraints y_i(⟨w, x_i⟩ + b) ≥ 1 for 0 < i ≤ D. Lagrange function: L(w, b, α) = (1/2)||w||² - Σ_i α_i (y_i(⟨w, x_i⟩ + b) - 1). Dual problem: maximize L(w, b, α) with α_i ≥ 0; optimality in w, b is a saddle point with α.

Support Vector Machine, problem. The derivatives in w and b must vanish: ∂_w L(w, b, α) = w - Σ_i α_i y_i x_i = 0; ∂_b L(w, b, α) = -Σ_i α_i y_i = 0. Dual problem: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i under the constraints Σ_i α_i y_i = 0 and α_i ≥ 0.

Support Vectors. Support vectors: w = Σ_i y_i α_i x_i. Karush-Kuhn-Tucker (KKT) optimality condition (Lagrange multiplier times constraint equals zero): α_i (y_i(⟨w, x_i⟩ + b) - 1) = 0. Thus, either α_i = 0, or α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1.
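A sketch on toy 2-D data showing how the support vectors and the expansion w = Σ_i y_i α_i x_i can be read off a fitted model; the data and the large C value are illustrative (a very large C approximates the hard-margin problem):

```python
# Sketch: linear SVM on toy data, inspecting the support vectors
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])

svm = SVC(kernel='linear', C=1e6)   # large C ~ hard margin
svm.fit(X, y)

print(svm.support_vectors_)          # the x_i with alpha_i > 0
print(svm.dual_coef_)                # y_i * alpha_i for the support vectors
print(svm.coef_, svm.intercept_)     # w = sum_i y_i alpha_i x_i and b
```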

Experiments with separable space: SVMvaryingC.ipynb

What happens if the space is not separable?

Adding slack variables. The problem was: minimize_{w,b} ||w||²/2 with y_i(w·x_i + b) ≥ 1. With slack: minimize_{w,b} ||w||²/2 + C Σ_i ξ_i with y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0.

Support Vector Machine, without slack. Primal problem: minimize_{w,b} ||w||²/2 with y_i(w·x_i + b) ≥ 1. Lagrange function: L(w, b, α) = (1/2)||w||² - Σ_i α_i (y_i(⟨w, x_i⟩ + b) - 1). Dual problem: maximize L(w, b, α); optimality in w, b is a saddle point with α.

Support Vector Machine, with slack. Primal problem: minimize_{w,b} ||w||²/2 + C Σ_i ξ_i with y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0. Lagrange function: L(w, b, ξ, α, η) = (1/2)||w||² + C Σ_i ξ_i - Σ_i α_i (y_i(⟨x_i, w⟩ + b) + ξ_i - 1) - Σ_i η_i ξ_i. Dual problem: maximize L(w, b, ξ, α, η); optimality in w, b, ξ is a saddle point with α, η.

Support Vector Machine, problem. The derivatives in w, b, ξ must vanish: ∂_w L(w, b, ξ, α, η) = w - Σ_i α_i y_i x_i = 0; ∂_b L(w, b, ξ, α, η) = -Σ_i α_i y_i = 0; ∂_{ξ_i} L(w, b, ξ, α, η) = C - α_i - η_i = 0, i.e. η_i = C - α_i. Dual problem: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i under the constraints Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C.

Support Vectors. Support vectors: w = Σ_i y_i α_i x_i. Karush-Kuhn-Tucker (KKT) optimality conditions (Lagrange multiplier times constraint equals zero): α_i (y_i(⟨w, x_i⟩ + b) + ξ_i - 1) = 0 and η_i ξ_i = 0, i.e. (C - α_i) ξ_i = 0. Thus: α_i = 0 implies y_i(⟨w, x_i⟩ + b) ≥ 1; 0 < α_i < C implies y_i(⟨w, x_i⟩ + b) = 1; α_i = C implies y_i(⟨w, x_i⟩ + b) ≤ 1.

Support Vector Machine, loss functions. Primal problem: minimize_{w,b} ||w||²/2 + C Σ_i ξ_i with y_i(w·x_i + b) ≥ 1 - ξ_i and ξ_i ≥ 0. Written with a loss function: minimize_{w,b} ||w||²/2 + C Σ_i max(0, 1 - y_i(w·x_i + b)); here, loss(x_i, y_i) = max(0, 1 - y_i(w·x_i + b)) = max(0, 1 - y_i f(x_i)).

Support Vector Machine, common loss functions. Hinge loss (L1 loss): max(0, 1 - y_i(w·x_i + b)). Squared hinge (L2 loss): max(0, 1 - y_i(w·x_i + b))². Logistic loss: log(1 + exp(-y_i(w·x_i + b))).
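As a small illustration, the three losses written as functions of the margin m = y_i (w·x_i + b), in plain NumPy:

```python
# Sketch: hinge, squared hinge and logistic losses as functions of the margin
import numpy as np

def hinge(m):            # L1 hinge loss
    return np.maximum(0.0, 1.0 - m)

def squared_hinge(m):    # L2 (squared) hinge loss
    return np.maximum(0.0, 1.0 - m) ** 2

def logistic(m):         # logistic loss
    return np.log(1.0 + np.exp(-m))

m = np.linspace(-2.0, 3.0, 6)
print(np.column_stack([m, hinge(m), squared_hinge(m), logistic(m)]))
```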


Experiments with different values for C: SVMvaryingC.ipynb#Varying-C-parameter

Non linearly separable data.

Non linearly separable data, Φ(x) = (x, x²) (figure slides).

Linear case. Primal problem: minimize_{w,b} (1/2)||w||² + C Σ_i ξ_i subject to y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i and ξ_i ≥ 0. Dual problem: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C. Support vector expansion: f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b.

With a transformation Φ : x ↦ Φ(x). Primal problem: minimize_{w,b} (1/2)||w||² + C Σ_i ξ_i subject to y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 - ξ_i and ξ_i ≥ 0. Dual problem: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j ⟨Φ(x_i), Φ(x_j)⟩ + Σ_i α_i subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C. Support vector expansion: f(x) = Σ_i α_i y_i ⟨Φ(x_i), Φ(x)⟩ + b.

The kernel trick. Kernel function: k(x, x') = ⟨Φ(x), Φ(x')⟩; we only need to compute the dot product in the new space. Dual problem: maximize_α -(1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) + Σ_i α_i subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C. Support vector expansion: f(x) = Σ_i α_i y_i k(x_i, x) + b.

Kernels. Kernel functions: linear, k(x, x') = ⟨x, x'⟩; polynomial, k(x, x') = (γ⟨x, x'⟩ + r)^d; RBF, k(x, x') = exp(-γ ||x - x'||²).
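A sketch of the three kernels computed with scikit-learn's pairwise helpers, with one RBF entry recomputed by hand; the data and the values of γ, r and d are illustrative:

```python
# Sketch: linear, polynomial and RBF kernel matrices
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
gamma, r, d = 0.5, 1.0, 3

K_lin = linear_kernel(X)                                       # <x, x'>
K_poly = polynomial_kernel(X, gamma=gamma, coef0=r, degree=d)  # (gamma <x, x'> + r)^d
K_rbf = rbf_kernel(X, gamma=gamma)                             # exp(-gamma ||x - x'||^2)

# the same RBF entry computed by hand
print(K_rbf[0, 1], np.exp(-gamma * np.sum((X[0] - X[1]) ** 2)))
```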

The RBF kernel implies an infinite-dimensional space. Here we are in dimension 1, x ∈ R: k(x, x') = exp(-(x - x')²) = exp(-x²) exp(-x'²) exp(2xx'). With a Taylor expansion of exp(2xx'), k(x, x') = exp(-x²) exp(-x'²) Σ_{k=0..∞} (2^k x^k x'^k) / k! = ⟨(..., sqrt(2^k / k!) exp(-x²) x^k, ...), (..., sqrt(2^k / k!) exp(-x'²) x'^k, ...)⟩.

Experiments with different kernels: www.antidot.net/wiss2015/svmvaryingc.html#non-linear-kernels

SVM in multiclass. One-vs-the-rest: N_C binary classifiers (but each involving the whole dataset); at prediction time, choose the class with the maximum decision value. One-vs-one: N_C (N_C - 1) / 2 binary classifiers; at prediction time, vote.

SVM in scikit-learn. SVC: Support Vector Classification. sklearn.svm.LinearSVC: based on the liblinear library; strategy: one-vs-the-rest; only the linear kernel; loss can be hinge or squared hinge. sklearn.svm.SVC: based on libsvm; multiclass strategy: one-vs-one; kernel can be linear, polynomial, RBF, sigmoid, or precomputed; only the hinge loss.
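A sketch of LinearSVC on tf-idf features of 20 newsgroups; C is the slack-penalty parameter of the primal problem, and the value 1.0 is just the default, not a tuned choice:

```python
# Sketch: LinearSVC (one-vs-the-rest, liblinear) on tf-idf features
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

vec = TfidfVectorizer()
X_train = vec.fit_transform(train.data)
X_test = vec.transform(test.data)

linear_svm = LinearSVC(C=1.0).fit(X_train, train.target)
print(linear_svm.score(X_test, test.target))

# sklearn.svm.SVC would allow non-linear kernels (one-vs-one strategy),
# but is much slower on a dataset of this size.
```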

Outline. 1 Problem definition. 2 Extracting features from text files. 3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters (Cross validation, Grid search). 4 Conclusion.

Cross validation I. http://scikit-learn.org/stable/modules/cross_validation.html Overfitting: estimating parameters on the test set can lead to overfitting: the parameters are the best for this test set but not in the general case. Train, test and validation datasets: a solution is to tune the parameters on a separate validation dataset and keep the test set for the final evaluation; the drawback is that only few data remain in the training dataset.

Cross validation II. k-fold cross validation: split the training data into k partitions of the same size; train the model on k - 1 partitions, then evaluate on the k-th partition; repeat so that each partition is used once for evaluation.
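In current scikit-learn versions this is available as cross_val_score; a sketch running 5-fold cross validation of the Naïve Bayes pipeline on the training set only:

```python
# Sketch: k-fold cross validation (k=5) on the 20 newsgroups training set
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset='train')
clf = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

scores = cross_val_score(clf, train.data, train.target, cv=5)
print(scores.mean(), scores.std())   # mean and spread over the 5 folds
```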

Cross validation III.

Grid Search. http://scikit-learn.org/stable/modules/grid_search.html Grid search: test each value of each parameter, a brute-force algorithm to find the best value for each parameter. In scikit-learn: it automatically runs the k cross-validation trainings for each combination of parameter values and keeps the best model. Demo with scikit-learn: http://www.antidot.net/wiss2015/grid_search_20newsgroups.html
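A sketch with GridSearchCV (current module path sklearn.model_selection); the parameter grid below is a small illustrative one, not the grid of the linked demo:

```python
# Sketch: grid search over tf-idf and Naive Bayes parameters with 5-fold CV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

train = fetch_20newsgroups(subset='train')
pipeline = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])

param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],   # unigrams vs unigrams + bigrams
    'nb__alpha': [0.01, 0.1, 1.0],            # smoothing parameter
}

search = GridSearchCV(pipeline, param_grid, cv=5)  # one CV run per combination
search.fit(train.data, train.target)
print(search.best_params_, search.best_score_)
```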

Outline. 1 Problem definition. 2 Extracting features from text files. 3 Algorithms for classification. 4 Conclusion: Methodology.

1 Problem definition: Supervised classification; Evaluation metrics. 2 Extracting features from text files: Bag of words model; Term frequency inverse document frequency (tfidf). 3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters (Cross validation, Grid search). 4 Conclusion: Methodology.

Methodology. To solve a problem using Machine Learning, you have to: 1 understand the data; 2 choose an evaluation measure; 3 be able to test the model; 4 find the main features; 5 try the algorithms, with different parameters.

Conclusion. Machine Learning has a lot of applications. With libraries like scikit-learn, there is no need to implement the algorithms yourself.

Questions?

References. Machine Learning in Python: http://scikit-learn.org Alex Smola's very good lecture on Machine Learning at CMU: http://alex.smola.org/teaching/10-701-15/ Kernels: https://www.youtube.com/watch?v=0nis-omlbds SVM: https://www.youtube.com/watch?v=bsbpqnikqzu

Bernoulli Naïve Bayes. Features: x_i = 1 iff word i is present in the document, else x_i = 0; the number of occurrences of word i doesn't matter. Bernoulli: for each feature i, P(x_i | y = k) = P(i | y = k) x_i + (1 - P(i | y = k))(1 - x_i); the absence of a feature is explicitly taken into account. Estimation of P(i | y = k): P(i | y = k) = (1 + nb of documents in k that contain word i) / (nb of documents in k).
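A sketch of the Bernoulli variant in scikit-learn; binarize=0.0 turns the counts into presence/absence indicators, and the documents and labels are made up:

```python
# Sketch: Bernoulli Naive Bayes on binary presence/absence features
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["the cat sat", "the dog barked", "a cat and a dog"]
labels = [0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(docs)

clf = BernoulliNB(binarize=0.0, alpha=1.0)   # absence of a word is modelled explicitly
clf.fit(X, labels)
print(clf.predict(vec.transform(["the cat and the dog"])))
```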