Machine Learning Ludovic Samper Antidot September 1st, 2015 Ludovic Samper (Antidot) Machine Learning September 1st, 2015 1 / 77
Antidot Software vendor since 1999 Paris, Lyon, Aix-en-Provence 45 employees Founders: Fabrice Lacroix CEO, Stéphane Loesel CTO, Jérôme Mainka Chief Scientist Officer Software products and solutions: Antidot Finder Suite (AFS) search engine, Antidot Information Factory (AIF) a pipe & filters framework SaaS, Hosted License, On-site License 50% of the revenue invested in R&D Ludovic Samper (Antidot) Machine Learning September 1st, 2015 2 / 77
Antidot Machine Learning Automatic text document classification Named Entity Extraction Compound Splitter (for German words) Clustering algorithm (for news aggregation) Open Data, Semantic Web http://www.rechercheisidore.fr/ Social Sciences and Humanities research platform, enriched with open resources https://github.com/antidot/db2triples/ open source library to export a database to RDF Antidot is a partner organization in the WDAqua project Ludovic Samper (Antidot) Machine Learning September 1st, 2015 3 / 77
Tutorial Study a classical task in Machine Learning: text classification Showcase scikit-learn.org, a Python machine learning library Follow the Working with text data tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/ working_with_text_data.html Additional material on http://blog.antidot.net/ Ludovic Samper (Antidot) Machine Learning September 1st, 2015 4 / 77
Summary of the tutorial 1 Problem definition Supervised classification Evaluation metrics 2 Extracting features from text files Bag of words model Term frequency inverse document frequency (tfidf) 3 Algorithms for classification Naïve Bayes Support Vector Machine (SVM) Tuning parameters Cross validation Grid search 4 Conclusion Methodology Ludovic Samper (Antidot) Machine Learning September 1st, 2015 5 / 77
Contents 1 Problem definition Supervised classification Evaluation metrics 2 Extracting features from text files 3 Algorithms for classification 4 Conclusion Ludovic Samper (Antidot) Machine Learning September 1st, 2015 6 / 77
20 newsgroups dataset http://qwone.com/~jason/20newsgroups/ 20 newsgroups documents collected in the 1990s The label is the newsgroup the document belongs to A popular collection: 18846 documents (11314 in train, 7532 in test) wiss-ml.ipynb#the-20-newsgroups-dataset Ludovic Samper (Antidot) Machine Learning September 1st, 2015 7 / 77
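A minimal sketch of loading this collection with scikit-learn's fetch_20newsgroups helper (the one the linked tutorial uses); the exact counts printed depend on the download options:

```python
from sklearn.datasets import fetch_20newsgroups

# Download (or load from the local cache) the official train and test splits
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

print(len(train.data), len(test.data))   # expected: 11314 7532
print(train.target_names[:5])            # newsgroup labels, e.g. 'alt.atheism', ...
print(train.target[:10])                 # class indices of the first documents
```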
Classification
Problem statement: one label per document; given a set of documents and their labels, automatically determine the label of an unseen document. A supervised classification problem.
Training: from the set of documents and their labels, build a model
Inference: given a new document, use the model to predict its label
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 8 / 77
Precision and Recall I
Binary classification, class C:
                  e ∈ C                  e ∉ C
Labeled C         TP (True Positive)     FP (False Positive)
Not labeled C     FN (False Negative)    TN (True Negative)
Precision = P(e ∈ C | e labeled C) = TP / (TP + FP)
Recall = P(e labeled C | e ∈ C) = TP / (TP + FN)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 9 / 77
Precision and Recall II
F1 = 2·P·R / (P + R), the harmonic mean of Precision and Recall
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 10 / 77
Multiclass I
N_C = number of classes
Macro average: B_macro = (1/N_C) Σ_{k=1..N_C} B_binary(TP_k, FP_k, TN_k, FN_k)
Average measure by class: large classes count as much as small ones.
Micro average: B_micro = B_binary(Σ_{k=1..N_C} TP_k, Σ_{k=1..N_C} FP_k, Σ_{k=1..N_C} TN_k, Σ_{k=1..N_C} FN_k)
Average measure by instance.
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 11 / 77
Multiclass II
Micro average in single-label multiclass: every misclassified document counts as one FN (for its true class) and one FP (for the predicted class), so
Σ_{k=1..N_C} FN_k = Σ_{k=1..N_C} FP_k
Then,
Precision_micro = Recall_micro = Accuracy = Σ_{k=1..N_C} TP_k / Nb_doc
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 12 / 77
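As a small illustration (the toy predictions below are invented for the example), the macro and micro averages can be computed with sklearn.metrics; note how the micro scores coincide with the accuracy, as stated above:

```python
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

# Toy single-label multiclass example with 3 classes
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Macro: average the per-class scores (small classes weigh as much as large ones)
print(precision_score(y_true, y_pred, average='macro'))
print(recall_score(y_true, y_pred, average='macro'))

# Micro: pool TP/FP/FN over all classes; equals the accuracy in single-label multiclass
print(precision_score(y_true, y_pred, average='micro'))
print(recall_score(y_true, y_pred, average='micro'))
print(f1_score(y_true, y_pred, average='micro'))
print(accuracy_score(y_true, y_pred))   # same value, 4/6 here
```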
Contents 1 Problem definition 2 Extracting features from text files Bag of words model Term frequency inverse document frequency (tfidf) 3 Algorithms for classification 4 Conclusion Ludovic Samper (Antidot) Machine Learning September 1st, 2015 13 / 77
Bag of words
From text to features: count the number of occurrences of each word in the text; a "bag" because word positions aren't taken into account.
Extensions:
Remove stop words
Remove too frequent words (max_df)
Lowercase
N-grams (ngram_range): tokenize n-grams instead of single words, useful to capture some local word order
wiss-ml.ipynb#bag-of-words
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 14 / 77
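A minimal CountVectorizer sketch; the options in parentheses above are the actual scikit-learn parameter names, while the two-document corpus is invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat", "The dog ate the cat"]   # toy corpus

# Bag of words: lowercased by default, English stop words removed,
# terms appearing in more than 95% of documents dropped (max_df),
# unigrams and bigrams kept (ngram_range)
vectorizer = CountVectorizer(stop_words='english', max_df=0.95, ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)       # sparse matrix: documents x terms

# get_feature_names_out() in recent scikit-learn (get_feature_names() in older versions)
print(vectorizer.get_feature_names_out())
print(X.toarray())                       # raw occurrence counts
```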
Term frequency inverse document frequency (tfidf) I Intuition: take into account the relative importance of each word with respect to the whole dataset. If a word occurs in every document, it doesn't carry any information. Ludovic Samper (Antidot) Machine Learning September 1st, 2015 15 / 77
Term frequency inverse document frequency (tfidf) II
Definition:
tfidf(w, d) = tf(w, d) · idf(w)
tf(w, d) = term frequency of word w in document d
idf(w) = log(N_doc / doc_freq(w))
In scikit-learn: tfidf(w, d) = tf(w, d) · (idf(w) + 1), so terms that occur in all documents (idf = 0) are not ignored.
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 16 / 77
Term frequency inverse document frequency (tfidf) III
Options:
Normalization: each document vector has norm 1; e.g. for the L2 norm, Σ_{w∈d} tfidf(w, d)² = 1
Smoothing: add one to document frequencies, as if an extra document contained every term of the collection exactly once:
idf(w) = log((N_doc + 1) / (doc_freq(w) + 1))
Example: show the most significant words of a document, wiss-ml.ipynb#tfidf
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 17 / 77
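A short sketch of the tf-idf step with TfidfVectorizer (which chains CountVectorizer and TfidfTransformer); the three-document corpus is again a toy one, and "most significant words" simply means the highest tf-idf weights in a row:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "the bird flew away"]

# smooth_idf=True applies the "extra document" smoothing; norm='l2' gives unit-length rows
vectorizer = TfidfVectorizer(norm='l2', smooth_idf=True)
X = vectorizer.fit_transform(docs)

# Most significant words of the first document: highest tf-idf weights
terms = vectorizer.get_feature_names_out()
row = X[0].toarray().ravel()
top = np.argsort(row)[::-1][:3]
print([(terms[i], round(row[i], 3)) for i in top])

# Each row has L2 norm 1, as per the normalization option
print(np.linalg.norm(row))   # ~1.0
```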
Contents 1 Problem definition 2 Extracting features from text files 3 Algorithms for classification Naïve Bayes Support Vector Machine (SVM) Tuning parameters Cross validation Grid search 4 Conclusion Ludovic Samper (Antidot) Machine Learning September 1st, 2015 18 / 77
Supervised classification problem I
Notations:
x = (x_1, …, x_n) = (x_i)_{1≤i≤n} ∈ R^n, a feature vector; n is the dimension of the feature space
{(x_d, y_d)}_{0≤d<D}, the training set: x_d is the feature vector of document d and y_d ∈ {1, …, N_C} its class, N_C being the number of classes
ŷ: class prediction. For a new vector x, ŷ is the predicted class of x.
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 19 / 77
Supervised classification problem II
Goal: find a function F : R^n → {1, …, N_C}, x ↦ ŷ
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 20 / 77
In 20newsgroups I Values in 20 newsgroups n = 130107 nb features (number of unique terms) D = 11314 training samples N C = 20 different classes Goal Find a function F that given a new document predicts its class Ludovic Samper (Antidot) Machine Learning September 1st, 2015 21 / 77
Naïve Bayes Algorithm I
Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 22 / 77
Naïve Bayes Algorithm II
Posterior probability of class C:
P(C|x) = P(x|C) P(C) / P(x)
P(x) does not depend on C, so P(C|x) ∝ P(x|C) P(C)
Naïve Bayes independence assumption: each feature i is conditionally independent of every other feature j, hence
P(C|x) ∝ P(C) · Π_{i=1..n} P(x_i|C)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 23 / 77
Naïve Bayes Algorithm III
Classifier from the probability model:
ŷ = argmax_{k ∈ {1,…,N_C}} P(y = k) · Π_{i=1..n} P(x_i | y = k)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 24 / 77
Parameter estimation in Naïve Bayes classifier
Prior of a class: P(y = k) = (nb samples in class k) / (total nb samples)
Can also be uniform: P(y = k) = 1 / N_C
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 25 / 77
Multinomial Naïve Bayes I
Naïve Bayes: P(x | y = k) = Π_{i=1..n} P(x_i | y = k)
Multinomial distribution: the event "the word is i" follows a multinomial distribution with parameters (p_1, …, p_n), where p_i = P(word = i) and Σ_i p_i = 1:
P(x_1, …, x_n) ∝ Π_{i=1..n} p_i^{x_i}
One distribution for each class y.
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 26 / 77
Multinomial Naïve Bayes II
One multinomial distribution for each class:
P(i | y = k) = (sum of occurrences of word i in class k) / (total nb of words in class k) = Σ_{d∈k} x_i^d / Σ_{0≤j<n} Σ_{d∈k} x_j^d
With smoothing,
P(i | y = k) = (Σ_{d∈k} x_i^d + α) / (Σ_{0≤j<n} Σ_{d∈k} x_j^d + α·n)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 27 / 77
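To make the smoothed estimate concrete, here is a small numpy sketch (the word counts are toy values invented for illustration) computing P(i | y = k) with smoothing parameter α:

```python
import numpy as np

# Toy counts for one class k: total occurrences of each of n = 4 words
counts_in_class_k = np.array([3, 0, 5, 2])
alpha = 1.0
n = len(counts_in_class_k)

# Smoothed multinomial parameters: (count_i + alpha) / (total_count + alpha * n)
p = (counts_in_class_k + alpha) / (counts_in_class_k.sum() + alpha * n)
print(p, p.sum())   # the probabilities sum to 1, and the unseen word gets a non-zero weight
```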
Multinomial Naïve Bayes III
Inference in Multinomial Naïve Bayes:
ŷ = argmax_k P(y = k | x)
  = argmax_k P(y = k) · Π_{0≤i<n} P(i | y = k)^{x_i}
  = argmax_k ( log P(y = k) + Σ_{0≤i<n} x_i · log P(i | y = k) )
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 28 / 77
Multinomial Naïve Bayes IV
A linear model: in the log space,
(log P(y = k | x))_k ∝ W_0 + Wᵀ·x
W_0 is the vector of priors: W_0 = (log P(y = k))_k
W is the matrix of distributions: W = (w_ik), i ∈ [1, n], k ∈ [1, N_C], with w_ik = log P(i | y = k)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 29 / 77
Multinomial Naïve Bayes V Example step-by-step http://www.antidot.net/wiss2015/wiss-ml.html#naive-bayes Ludovic Samper (Antidot) Machine Learning September 1st, 2015 30 / 77
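Alongside the step-by-step notebook, a hedged end-to-end sketch in the spirit of the scikit-learn tutorial this talk follows: counts, tf-idf weighting, then MultinomialNB; the exact accuracy obtained will vary with the library version and the vectorizer options:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

clf = Pipeline([
    ('vect', CountVectorizer()),        # bag of words
    ('tfidf', TfidfTransformer()),      # tf-idf weighting
    ('nb', MultinomialNB(alpha=1.0)),   # smoothed multinomial Naive Bayes
])
clf.fit(train.data, train.target)
pred = clf.predict(test.data)
print(accuracy_score(test.target, pred))
```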
Contents 1 Problem definition 2 Extracting features from text files 3 Algorithms for classification Naïve Bayes Support Vector Machine (SVM) Tuning parameters Cross validation Grid search 4 Conclusion Ludovic Samper (Antidot) Machine Learning September 1st, 2015 31 / 77
A linear classifier (sequence of figure slides) Ludovic Samper (Antidot) Machine Learning September 1st, 2015 32-36 / 77
Support Vector Machine, notations
Problem: S, training set {(x_i, y_i)}, x_i ∈ R^n, y_i ∈ {−1, 1}, i ∈ 0..D
Find a linear function ⟨w, x⟩ + b such that sign(⟨w, x_i⟩ + b) = y_i
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 37 / 77
SVM, maximum margin classifier Ludovic Samper (Antidot) Machine Learning September 1st, 2015 38 / 77
Margin
distance(x₊, x₋) = ⟨w/‖w‖, x₊ − x₋⟩
  = (1/‖w‖) (⟨w, x₊⟩ − ⟨w, x₋⟩)
  = (1/‖w‖) ((⟨w, x₊⟩ + b) − (⟨w, x₋⟩ + b))
  = (1/‖w‖) (1 − (−1))
  = 2/‖w‖
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 39 / 77
SVM, maximum margin classifier Ludovic Samper (Antidot) Machine Learning September 1st, 2015 40 / 77
Solving an optimization problem using the Lagrangian
Primal problem: minimize_{w,b} f(w, b) under the constraints h_i(w, b) ≥ 0
Lagrange function: L(w, b, α) = f(w, b) − Σ_i α_i h_i(w, b)
Let g(α) = inf_{(w,b)} L(w, b, α); then for all w, b: g(α) ≤ L(w, b, α)
Moreover, for feasible (w, b) and α_i ≥ 0, L(w, b, α) ≤ f(w, b)
Thus, for α_i ≥ 0, g(α) ≤ min_{w,b} f(w, b)
And with the Karush-Kuhn-Tucker (KKT) optimality condition α_i h_i(w, b) = 0,
max_α g(α) = min_{w,b} f(w, b)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 41 / 77
Support Vector Machine, problem
Primal problem: minimize_{(w,b)} ‖w‖²/2
under the constraints, for 0 < i ≤ D, y_i(⟨w, x_i⟩ + b) ≥ 1
Lagrange function: L(w, b, α) = ½‖w‖² − Σ_i α_i (y_i(⟨w, x_i⟩ + b) − 1)
Dual problem: maximize L(w, b, α) with α_i ≥ 0
Optimality in w, b is a saddle point with α
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 42 / 77
Support Vector Machine, problem
Derivatives in w, b need to vanish:
∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0
∂_b L(w, b, α) = −Σ_i α_i y_i = 0
Dual problem:
maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
under the constraints, Σ_i α_i y_i = 0 and α_i ≥ 0
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 43 / 77
Support Vectors
Support vectors: w = Σ_i y_i α_i x_i
Karush-Kuhn-Tucker (KKT) optimality condition: Lagrange multiplier times constraint equals zero,
α_i (y_i(⟨w, x_i⟩ + b) − 1) = 0
Thus, either α_i = 0, or α_i > 0 and y_i(⟨w, x_i⟩ + b) = 1
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 44 / 77
Experiments with separable space SVMvaryingC.ipynb Ludovic Samper (Antidot) Machine Learning September 1st, 2015 45 / 77
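As a complement to the notebook, a minimal sketch of a linear SVM on separable toy data (the blobs are generated here only for illustration; a very large C approximates the no-slack, hard-margin case):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters -> a linearly separable toy problem
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

clf = SVC(kernel='linear', C=1e6)   # very large C ~ hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the separating hyperplane
print(clf.support_vectors_)         # the few points with alpha_i > 0
```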
What happens if space is not separable Ludovic Samper (Antidot) Machine Learning September 1st, 2015 46 / 77
Adding slack variables
Problem was: minimize_{(w,b)} ‖w‖²/2  with  y_i(w·x_i + b) ≥ 1
With slack: minimize_{(w,b)} ‖w‖²/2 + C Σ_i ξ_i  with  y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 47 / 77
Support Vector Machine, without slack
Primal problem: minimize_{(w,b)} ‖w‖²/2  with  y_i(w·x_i + b) ≥ 1
Lagrange function: L(w, b, α) = ½‖w‖² − Σ_i α_i (y_i(⟨w, x_i⟩ + b) − 1)
Dual problem: maximize L(w, b, α)
Optimality in w, b is a saddle point with α
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 48 / 77
Support Vector Machine, with slack
Primal problem: minimize_{(w,b)} ‖w‖²/2 + C Σ_i ξ_i  with  y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Lagrange function: L(w, b, ξ, α, η) = ½‖w‖² + C Σ_i ξ_i − Σ_i α_i (y_i(⟨x_i, w⟩ + b) + ξ_i − 1) − Σ_i η_i ξ_i
Dual problem: maximize L(w, b, ξ, α, η)
Optimality in w, b, ξ is a saddle point with α, η
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 49 / 77
Support Vector Machine, problem
Derivatives in w, b, ξ need to vanish:
∂_w L(w, b, ξ, α, η) = w − Σ_i α_i y_i x_i = 0
∂_b L(w, b, ξ, α, η) = −Σ_i α_i y_i = 0
∂_{ξ_i} L(w, b, ξ, α, η) = C − α_i − η_i = 0, hence η_i = C − α_i
Dual problem:
maximize_α  −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i
under the constraints, Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 50 / 77
Support Vectors
Support vectors: w = Σ_i y_i α_i x_i
Karush-Kuhn-Tucker (KKT) optimality condition: Lagrange multiplier times constraint equals zero,
α_i (y_i(⟨w, x_i⟩ + b) + ξ_i − 1) = 0
η_i ξ_i = 0, i.e. (C − α_i) ξ_i = 0
Thus,
α_i = 0       ⟹ y_i(⟨w, x_i⟩ + b) ≥ 1
0 < α_i < C   ⟹ y_i(⟨w, x_i⟩ + b) = 1
α_i = C       ⟹ y_i(⟨w, x_i⟩ + b) ≤ 1
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 51 / 77
Support Vector Machine, loss functions
Primal problem: minimize_{(w,b)} ‖w‖²/2 + C Σ_i ξ_i  with  y_i(w·x_i + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Equivalently, with a loss function:
minimize_{(w,b)} ‖w‖²/2 + C Σ_i max(0, 1 − y_i(w·x_i + b))
Here, loss(x_i, y_i) = max(0, 1 − y_i(w·x_i + b)) = max(0, 1 − y_i f(x_i))
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 52 / 77
Support Vector Machine, common loss functions
hinge loss (L1 loss): max(0, 1 − y_i(w·x_i + b))
squared hinge (L2 loss): max(0, 1 − y_i(w·x_i + b))²
logistic loss: log(1 + exp(−y_i(w·x_i + b)))
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 53 / 77
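A short numpy sketch evaluating the three losses on the margin value m = y_i(w·x_i + b); the margin values below are invented for illustration:

```python
import numpy as np

m = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])    # margins y_i * (w.x_i + b)

hinge = np.maximum(0, 1 - m)                # L1 hinge loss
squared_hinge = np.maximum(0, 1 - m) ** 2   # L2 (squared) hinge loss
logistic = np.log(1 + np.exp(-m))           # logistic loss

print(hinge, squared_hinge, logistic, sep="\n")
# Points with m >= 1 (outside the margin) get zero hinge loss,
# while the logistic loss never reaches exactly zero
```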
Experiments with different values for C SVMvaryingC.ipynb#Varying-C-parameter Ludovic Samper (Antidot) Machine Learning September 1st, 2015 55 / 77
Non linearly separable data Ludovic Samper (Antidot) Machine Learning September 1st, 2015 56 / 77
Non linearly separable data, Φ(x) = (x, x²) (figure slides) Ludovic Samper (Antidot) Machine Learning September 1st, 2015 57-58 / 77
Linear case
Primal problem: minimize_{w,b} ½‖w‖² + C Σ_i ξ_i  subject to  y_i(⟨w, x_i⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Dual problem: maximize_α −½ Σ_{i,j} α_i α_j y_i y_j ⟨x_i, x_j⟩ + Σ_i α_i  subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
Support vector expansion: f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 59 / 77
With a transformation Φ : x ↦ Φ(x)
Primal problem: minimize_{w,b} ½‖w‖² + C Σ_i ξ_i  subject to  y_i(⟨w, Φ(x_i)⟩ + b) ≥ 1 − ξ_i and ξ_i ≥ 0
Dual problem: maximize_α −½ Σ_{i,j} α_i α_j y_i y_j ⟨Φ(x_i), Φ(x_j)⟩ + Σ_i α_i  subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
Support vector expansion: f(x) = Σ_i α_i y_i ⟨Φ(x_i), Φ(x)⟩ + b
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 60 / 77
The kernel trick
Kernel function: k(x, x′) = ⟨Φ(x), Φ(x′)⟩; we only need to compute the dot product in the new space
Dual problem: maximize_α −½ Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) + Σ_i α_i  subject to  Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C
Support vector expansion: f(x) = Σ_i α_i y_i k(x_i, x) + b
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 61 / 77
Kernels
Kernel functions:
linear: k(x, x′) = ⟨x, x′⟩
polynomial: k(x, x′) = (γ⟨x, x′⟩ + r)^d
rbf: k(x, x′) = exp(−γ‖x − x′‖²)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 62 / 77
The RBF kernel implies an infinite-dimensional space
Here we are in dimension 1, x ∈ R:
k(x, x′) = exp(−(x − x′)²) = exp(−x²) exp(−x′²) exp(2xx′)
With a Taylor expansion of exp(2xx′),
k(x, x′) = exp(−x²) exp(−x′²) Σ_{k=0..∞} (2^k x^k x′^k) / k!
        = ⟨(…, √(2^k/k!) exp(−x²) x^k, …), (…, √(2^k/k!) exp(−x′²) x′^k, …)⟩
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 63 / 77
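A quick numerical check of the expansion above (1-D case, γ = 1): truncating the Taylor feature map at a few dozen terms already reproduces the RBF kernel value; the particular x, x′ values are arbitrary:

```python
import numpy as np
from math import factorial

def phi(x, K=20):
    # Truncated feature map: sqrt(2^k / k!) * exp(-x^2) * x^k for k = 0..K-1
    return np.array([np.sqrt(2.0**k / factorial(k)) * np.exp(-x**2) * x**k
                     for k in range(K)])

x, xp = 0.7, -0.3
exact = np.exp(-(x - xp) ** 2)    # rbf kernel with gamma = 1
approx = phi(x).dot(phi(xp))      # finite-dimensional approximation
print(exact, approx)              # nearly identical
```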
Experiments with different kernels www.antidot.net/wiss2015/svmvaryingc.html#non-linear-kernels Ludovic Samper (Antidot) Machine Learning September 1st, 2015 64 / 77
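A complementary sketch, not taken from the linked notebook: SVC with linear, polynomial and RBF kernels on a toy non-linearly-separable problem (concentric circles generated here for illustration; degree 2 is chosen because a quadratic map suffices for circles):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not separable by any hyperplane in the original space
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel, degree=2, C=1.0)
    clf.fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))   # rbf and degree-2 poly should beat linear here
```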
SVM in multiclass
one-vs-the-rest: N_C binary classifiers (but each is trained on the whole dataset); at prediction time, choose the class with the maximum decision value
one-vs-one: N_C(N_C − 1)/2 binary classifiers; at prediction time, vote
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 65 / 77
SVM in scikit-learn
SVC: Support Vector Classification
sklearn.svm.LinearSVC: based on the Liblinear library; strategy: one-vs-the-rest; only linear kernel; loss can be hinge or squared hinge
sklearn.svm.SVC: based on libsvm; multiclass strategy: one-vs-one; kernel can be linear, polynomial, RBF, sigmoid, or precomputed; only hinge loss
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 66 / 77
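A hedged comparison sketch of the two classes on tf-idf vectors; the restriction to four newsgroups is only to keep libsvm's one-vs-one training fast, and the scores will vary with version and preprocessing:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC, SVC

cats = ['sci.space', 'rec.autos', 'comp.graphics', 'talk.politics.mideast']
train = fetch_20newsgroups(subset='train', categories=cats)
test = fetch_20newsgroups(subset='test', categories=cats)

vect = TfidfVectorizer()
X_tr = vect.fit_transform(train.data)
X_te = vect.transform(test.data)

# Liblinear, one-vs-the-rest, squared hinge loss by default
lin = LinearSVC(C=1.0).fit(X_tr, train.target)
print('LinearSVC', lin.score(X_te, test.target))

# libsvm, one-vs-one; noticeably slower as the number of documents grows
svc = SVC(kernel='linear', C=1.0).fit(X_tr, train.target)
print('SVC', svc.score(X_te, test.target))
```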
Contents 1 Problem definition 2 Extracting features from text files 3 Algorithms for classification Naïve Bayes Support Vector Machine (SVM) Tuning parameters Cross validation Grid search 4 Conclusion Ludovic Samper (Antidot) Machine Learning September 1st, 2015 67 / 77
Cross validation I
http://scikit-learn.org/stable/modules/cross_validation.html
Overfitting: tuning the parameters on the test set can lead to overfitting: the parameters are the best for this test set but not in the general case.
Train, test and validation datasets. A solution: tweak the parameters on a separate validation dataset and keep the test set for the final evaluation; drawback: only few data remain in the training dataset.
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 68 / 77
Cross validation II Cross validation k-fold cross validation Split training data in k partitions of the same size train the model on k 1 partitions then, evaluate on the kth partition Ludovic Samper (Antidot) Machine Learning September 1st, 2015 69 / 77
Cross validation III Ludovic Samper (Antidot) Machine Learning September 1st, 2015 70 / 77
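A minimal k-fold sketch (k = 5) reusing the Naive Bayes pipeline from before; cross_val_score lives in sklearn.model_selection in current versions (it was in sklearn.cross_validation at the time of the talk):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

train = fetch_20newsgroups(subset='train')
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 5-fold cross validation on the training set only; the test set stays untouched
scores = cross_val_score(clf, train.data, train.target, cv=5)
print(scores, scores.mean())
```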
Grid Search
http://scikit-learn.org/stable/modules/grid_search.html
Grid search: test every value of every parameter; a brute-force algorithm to find the best value for each parameter.
In scikit-learn: automatically runs one cross-validated training per combination of parameter values and keeps the best model.
Demo with scikit-learn: http://www.antidot.net/wiss2015/grid_search_20newsgroups.html
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 71 / 77
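A hedged GridSearchCV sketch over a few of the parameters discussed earlier; the grid itself is only an example and not necessarily the one used in the linked demo:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

train = fetch_20newsgroups(subset='train')

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],   # unigrams vs unigrams + bigrams
    'tfidf__use_idf': [True, False],
    'nb__alpha': [0.01, 0.1, 1.0],            # smoothing strength
}

# Exhaustive search: one cross-validated training per parameter combination
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_, search.best_score_)
```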
Contents 1 Problem definition 2 Extracting features from text files 3 Algorithms for classification 4 Conclusion Methodology Ludovic Samper (Antidot) Machine Learning September 1st, 2015 72 / 77
1 Problem definition Supervised classification Evaluation metrics 2 Extracting features from text files Bag of words model Term frequency inverse document frequency (tfidf) 3 Algorithms for classification Naïve Bayes Support Vector Machine (SVM) Tuning parameters Cross validation Grid search 4 Conclusion Methodology Ludovic Samper (Antidot) Machine Learning September 1st, 2015 73 / 77
Methodology To solve a problem using Machine Learning, you have to : 1 Understand the data 2 Choose an evaluation measure 3 Be able to test the model 4 Find the main features 5 Try the algorithms, with different parameters Ludovic Samper (Antidot) Machine Learning September 1st, 2015 73 / 77
Conclusion Machine Learning has a lot of applications With libraries like scikit-learn, no need to implement algorithms yourself Ludovic Samper (Antidot) Machine Learning September 1st, 2015 74 / 77
Questions? Ludovic Samper (Antidot) Machine Learning September 1st, 2015 75 / 77
References
Machine Learning in Python: http://scikit-learn.org
Alex Smola's very good lecture on Machine Learning at CMU: http://alex.smola.org/teaching/10-701-15/
Kernels: https://www.youtube.com/watch?v=0nis-omlbds
SVM: https://www.youtube.com/watch?v=bsbpqnikqzu
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 76 / 77
Bernoulli Naïve Bayes
Features: x_i = 1 iff word i is present in the document, else x_i = 0. The number of occurrences of word i doesn't matter.
Bernoulli: for each feature i,
P(x_i | y = k) = P(i | y = k) x_i + (1 − P(i | y = k)) (1 − x_i)
The absence of a feature is explicitly taken into account.
Estimation of P(i | y = k):
P(i | y = k) = (1 + nb of documents in class k that contain word i) / (nb of documents in class k)
Ludovic Samper (Antidot) Machine Learning September 1st, 2015 77 / 77
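For completeness, a short BernoulliNB sketch with binary occurrence features; the three documents and their labels are toy values invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

docs = ["free money now", "meeting tomorrow morning", "free meeting invite"]
labels = [1, 0, 1]   # toy labels, invented for the example

# binary=True keeps only presence/absence of each word, as in the Bernoulli model
X = CountVectorizer(binary=True).fit_transform(docs)
clf = BernoulliNB(alpha=1.0).fit(X, labels)
print(clf.predict(X))
```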