Machine Learning. Ludovic Samper, Antidot. September 1st, 2015.


1 Machine Learning. Ludovic Samper, Antidot. September 1st, 2015.

2 Antidot
Software vendor since 1999. Paris, Lyon, Aix-en-Provence. 45 employees.
Founders: Fabrice Lacroix CEO, Stéphane Loesel CTO, Jérôme Mainka Chief Scientist Officer.
Software products and solutions: Antidot Finder Suite (AFS), a search engine; Antidot Information Factory (AIF), a pipe & filters framework.
SaaS, Hosted License, On-site License.
50% of the revenue invested in R&D.

3 Antidot Machine Learning
Automatic text document classification. Named Entity Extraction. Compound splitter (for German words). Clustering algorithm (for news aggregation).
Open Data, Semantic Web: a Social Sciences and Humanities research platform, enriched with open resources; an open source library to export a database in RDF. Antidot is a partner organization in the WDAqua project.

4 Tutorial
Study a classical task in Machine Learning: text classification. Show scikit-learn.org, a Python machine learning library. Follow the Working with text data tutorial: working_with_text_data.html. Additional material on:

5 Summary of the tutorial
1 Problem definition: Supervised classification; Evaluation metrics
2 Extracting features from text files: Bag of words model; Term frequency inverse document frequency (tfidf)
3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters: Cross validation, Grid search
4 Conclusion: Methodology

6 Contents
1 Problem definition: Supervised classification; Evaluation metrics
2 Extracting features from text files
3 Algorithms for classification
4 Conclusion

7 20 newsgroups dataset
20 newsgroups: documents collected from 20 Usenet newsgroups in the 1990s. The label is the newsgroup the document belongs to. A popular collection: 11314 documents in train, 7532 in test. wiss-ml.ipynb#the-20-newsgroups-dataset
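A minimal sketch of loading the dataset with scikit-learn (fetch_20newsgroups downloads the standard by-date train/test split):

    from sklearn.datasets import fetch_20newsgroups

    # Load (and cache) the standard train/test split
    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    print(len(train.data), len(test.data))  # documents in each split
    print(train.target_names)               # the 20 newsgroup labels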

8 Classification
Problem statement: one label per document; automatically determine the label of an unseen document, given a set of documents and their labels. A supervised classification problem.
Training: from a set of documents and their labels, build a model.
Inference: given a new document, use the model to predict its label.

9 Precision and Recall I
Binary classification (confusion table for class $C$):
                   $e \in C$                $e \notin C$
Labeled $C$        TP (True Positive)       FP (False Positive)
Not labeled $C$    FN (False Negative)      TN (True Negative)
$\mathrm{Precision} = P(e \in C \mid e \text{ labeled } C) = \frac{TP}{TP + FP}$
$\mathrm{Recall} = P(e \text{ labeled } C \mid e \in C) = \frac{TP}{TP + FN}$

10 Precision and Recall II
$F_1 = \frac{2 P R}{P + R}$, the harmonic mean of Precision and Recall.
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
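All of these metrics are available in scikit-learn; a small sketch on hypothetical toy labels:

    from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # gold labels (toy data)
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # classifier output

    print(precision_score(y_true, y_pred))  # TP / (TP + FP)
    print(recall_score(y_true, y_pred))     # TP / (TP + FN)
    print(f1_score(y_true, y_pred))         # harmonic mean of P and R
    print(accuracy_score(y_true, y_pred))   # (TP + TN) / total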

11 Multiclass I
$N_C$ = number of classes.
Macro average:
$B_{macro} = \frac{1}{N_C} \sum_{k=1}^{N_C} B_{binary}(TP_k, FP_k, TN_k, FN_k)$
Average measure by class: large classes count as much as small ones.
Micro average:
$B_{micro} = B_{binary}\left(\sum_{k=1}^{N_C} TP_k, \sum_{k=1}^{N_C} FP_k, \sum_{k=1}^{N_C} TN_k, \sum_{k=1}^{N_C} FN_k\right)$
Average measure by instance.

12 Multiclass II
Micro average in single-label multiclass: every misclassified document counts as a false negative for its true class and as a false positive for its predicted class, so
$\sum_{k=1}^{N_C} FN_k = \sum_{k=1}^{N_C} FP_k$
Then,
$\mathrm{Precision}_{micro} = \mathrm{Recall}_{micro} = \mathrm{Accuracy} = \frac{\sum_{k=1}^{N_C} TP_k}{\text{Nb doc}}$

13 Contents
1 Problem definition
2 Extracting features from text files: Bag of words model; Term frequency inverse document frequency (tfidf)
3 Algorithms for classification
4 Conclusion

14 Bag of words
From text to features: count the number of occurrences of each word in the text. A "bag" because word positions are not taken into account.
Extensions: remove stop words; remove too frequent words (max_df); lowercase; n-grams (ngram_range): tokenize n-grams instead of single words, useful to take word positions into account. wiss-ml.ipynb#bag-of-words
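A minimal bag-of-words sketch with scikit-learn's CountVectorizer; the option names in parentheses above are its parameters (recent versions expose get_feature_names_out, older ones get_feature_names):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat on the mat",
            "the dog ate my homework"]

    vectorizer = CountVectorizer(lowercase=True, stop_words='english',
                                 max_df=0.95, ngram_range=(1, 1))
    X = vectorizer.fit_transform(docs)        # sparse matrix (n_docs, n_terms)
    print(vectorizer.get_feature_names_out())
    print(X.toarray())                        # raw occurrence counts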

15 Term frequency inverse document frequency (tfidf) I
Intuition: take into account the relative importance of each word with respect to the whole dataset. If a word occurs in every document, it doesn't carry any information.

16 Term frequency inverse document frequency (tfidf) II
Definition:
$\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)$
$\mathrm{tf}(w, d)$ = term frequency of word $w$ in doc $d$
$\mathrm{idf}(w) = \log\left(\frac{N_{doc}}{\mathrm{doc\_freq}(w)}\right)$
In scikit-learn: $\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot (\mathrm{idf}(w) + 1)$, so terms that occur in all documents ($\mathrm{idf} = 0$) are not completely ignored.

17 Term frequency inverse document frequency (tfidf) III
Options:
Normalisation: $\lVert doc \rVert = 1$. E.g., for the $L_2$ norm, $\sum_{w \in d} \mathrm{tfidf}(w, d)^2 = 1$.
Smoothing: add one to document frequencies, as if an extra doc contained every term in the collection exactly once:
$\mathrm{idf}(w) = \log\left(\frac{N_{doc} + 1}{\mathrm{doc\_freq}(w) + 1}\right)$
Show the most significant words of a doc: wiss-ml.ipynb#tfidf
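A sketch of tfidf weighting and of reading off the most significant words of a document (TfidfVectorizer's norm and smooth_idf parameters map to the options above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np

    docs = ["the cat sat on the mat",
            "the dog ate my homework",
            "the cat ate the dog food"]

    vectorizer = TfidfVectorizer(norm='l2', smooth_idf=True)
    X = vectorizer.fit_transform(docs)

    # Most significant words of the first document: highest tfidf weights
    terms = vectorizer.get_feature_names_out()
    row = X[0].toarray().ravel()
    for i in np.argsort(row)[::-1][:3]:
        print(terms[i], row[i])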

18 Contents
1 Problem definition
2 Extracting features from text files
3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters: Cross validation, Grid search
4 Conclusion

19 Supervised classification problem I
Notations:
$x = (x_1, \dots, x_n) = (x_i)_{0 \le i < n}$: feature vector, $x \in \mathbb{R}^n$; $n$ is the dimension of the feature space.
$\{(x_d, y_d)\}_{0 \le d < D}$: the training set; $x_d$ is the feature vector of document $d$.
$y_d \in \{1, \dots, N_C\}$: the class of document $d$; $N_C$ is the number of classes.
$\hat{y}$: class prediction. For a new vector $x$, $\hat{y}$ is the predicted class of $x$.

20 Supervised classification problem II
Goal: find a function
$F : \mathbb{R}^n \to \{1, \dots, N_C\},\ x \mapsto \hat{y}$

21 In 20newsgroups I
Values in 20 newsgroups: $n$ = number of features (number of unique terms), $D$ = number of training samples, $N_C = 20$ different classes.
Goal: find a function $F$ that, given a new document, predicts its class.

22 Naïve Bayes Algorithm I
Bayes theorem:
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$

23 Naïve Bayes Algorithm II
Posterior probability of class $C$:
$P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}$
$P(x)$ does not depend on $C$, so $P(C \mid x) \propto P(x \mid C)\, P(C)$.
Naïve Bayes independence assumption: each feature $i$ is conditionally independent of every other feature $j$:
$P(C \mid x) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$

24 Naïve Bayes Algorithm III
Classifier from the probability model:
$\hat{y} = \arg\max_{k \in \{1, \dots, N_C\}} P(y = k) \prod_{i=1}^{n} P(x_i \mid y = k)$

25 Parameter estimation in Naïve Bayes classifier
Prior of a class:
$P(y = k) = \frac{\text{nb samples in class } k}{\text{total nb samples}}$
Can also be uniform: $P(y = k) = \frac{1}{N_C}$

26 Multinomial Naïve Bayes I
Naïve Bayes: $P(x \mid y = k) = \prod_{i=1}^{n} P(x_i \mid y = k)$
Multinomial distribution: the event "word is $i$" follows a multinomial distribution with parameters $(p_1, \dots, p_n)$, where $p_i = P(\text{word} = i)$ and $\sum_i p_i = 1$:
$P(x_1, \dots, x_n) \propto \prod_{i=1}^{n} p_i^{x_i}$
One distribution for each class $y$.

27 Multinomial Naïve Bayes II
Multinomial Naïve Bayes: one multinomial distribution for each class,
$P(i \mid y = k) = \frac{\text{sum of occurrences of word } i \text{ in class } k}{\text{total nb words in class } k} = \frac{\sum_{d \in k} x_i^{(d)}}{\sum_{0 \le j < n} \sum_{d \in k} x_j^{(d)}}$
With smoothing,
$P(i \mid y = k) = \frac{\sum_{d \in k} x_i^{(d)} + \alpha}{\sum_{0 \le j < n} \sum_{d \in k} x_j^{(d)} + \alpha n}$

28 Multinomial Naïve Bayes III
Inference in Multinomial Naïve Bayes:
$\hat{y} = \arg\max_k P(y = k \mid x) = \arg\max_k P(y = k) \prod_{0 \le i < n} P(i \mid y = k)^{x_i} = \arg\max_k \left( \log P(y = k) + \sum_{0 \le i < n} x_i \log P(i \mid y = k) \right)$

29 Multinomial Naïve Bayes IV
A linear model: in the log space,
$(\log P(y = k \mid x))_k \propto W_0 + W^T x$
$W_0$ is the vector of priors: $W_0 = (\log P(y = k))_k$
$W$ is the matrix of distributions: $W = (w_{ik})$, $i \in [1, n]$, $k \in [1, N_C]$, with $w_{ik} = \log P(i \mid y = k)$

30 Multinomial Naïve Bayes V
Example step-by-step
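In the same spirit as the notebook's step-by-step example, a compact end-to-end sketch on 20 newsgroups (alpha is the smoothing parameter above; the exact score depends on the scikit-learn version):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import f1_score

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    vec = TfidfVectorizer()
    clf = MultinomialNB(alpha=0.01)            # smoothed multinomial model
    clf.fit(vec.fit_transform(train.data), train.target)
    pred = clf.predict(vec.transform(test.data))
    print(f1_score(test.target, pred, average='macro'))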

31 Contents
1 Problem definition
2 Extracting features from text files
3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters: Cross validation, Grid search
4 Conclusion

32-36 A linear classifier (figure slides)

37 Support Vector Machine, notations
Problem: $S$, training set $\{(x_i, y_i)\}_{0 \le i < D}$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$.
Find a linear function $\langle w, x \rangle + b$ such that $\mathrm{sign}(\langle w, x_i \rangle + b) = y_i$.

38 SVM, maximum margin classifier

39 Margin
$\mathrm{distance}(x_+, x_-) = \left\langle \frac{w}{\lVert w \rVert}, x_+ - x_- \right\rangle = \frac{1}{\lVert w \rVert} (\langle w, x_+ \rangle - \langle w, x_- \rangle) = \frac{1}{\lVert w \rVert} ((\langle w, x_+ \rangle + b) - (\langle w, x_- \rangle + b)) = \frac{1}{\lVert w \rVert} (1 - (-1)) = \frac{2}{\lVert w \rVert}$

40 SVM, maximum margin classifier

41 Solving an optimization problem using the Lagrangian
Primal problem: minimize $f(w, b)$ over $(w, b)$ under the constraints $h_i(w, b) \ge 0$.
Lagrange function: $L(w, b, \alpha) = f(w, b) - \sum_i \alpha_i h_i(w, b)$
Let $g(\alpha) = \inf_{(w,b)} L(w, b, \alpha)$; then $\forall w, b,\ g(\alpha) \le L(w, b, \alpha)$.
Moreover, for feasible $(w, b)$ and $\alpha_i \ge 0$, $L(w, b, \alpha) \le f(w, b)$.
Thus, $\forall \alpha_i \ge 0,\ g(\alpha) \le \min_{w,b} f(w, b)$.
And with the Karush-Kuhn-Tucker (KKT) optimality condition $\alpha_i h_i(w, b) = 0$,
$\max_\alpha g(\alpha) = \min_{w,b} f(w, b)$

42 Support Vector Machine, problem
Primal problem: minimize $\frac{\lVert w \rVert^2}{2}$ over $(w, b)$ under the constraints $y_i(\langle w, x_i \rangle + b) \ge 1$, $0 \le i < D$.
Lagrange function: $L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i (y_i (\langle w, x_i \rangle + b) - 1)$
Dual problem: maximize $L(w, b, \alpha)$ over $\alpha$ with $\alpha_i \ge 0$; the optimum in $(w, b)$ is a saddle point with $\alpha$.

43 Support Vector Machine, problem
Derivatives in $w$, $b$ need to vanish:
$\partial_w L(w, b, \alpha) = w - \sum_i \alpha_i y_i x_i = 0$
$\partial_b L(w, b, \alpha) = -\sum_i \alpha_i y_i = 0$
Dual problem: maximize over $\alpha$
$-\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
under the constraints $\sum_i \alpha_i y_i = 0$ and $\alpha_i \ge 0$.

44 Support Vectors
Support vectors: $w = \sum_i y_i \alpha_i x_i$
Karush-Kuhn-Tucker (KKT) optimality condition: Lagrange multiplier times constraint equals zero,
$\alpha_i (y_i (\langle w, x_i \rangle + b) - 1) = 0$
Thus, either $\alpha_i = 0$, or $\alpha_i > 0$ and $y_i (\langle w, x_i \rangle + b) = 1$.

45 Experiments with separable space: SVMvaryingC.ipynb

46 What happens if the space is not separable

47 Adding slack variable
Problem was: minimize $\frac{\lVert w \rVert^2}{2}$ over $(w, b)$, with $y_i(w \cdot x_i + b) \ge 1$.
With slack: minimize $\frac{\lVert w \rVert^2}{2} + C \sum_i \xi_i$ over $(w, b)$, with $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.

48 Support Vector Machine, without slack
Primal problem: minimize $\frac{\lVert w \rVert^2}{2}$ over $(w, b)$, with $y_i(w \cdot x_i + b) \ge 1$.
Lagrange function: $L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_i \alpha_i (y_i (\langle w, x_i \rangle + b) - 1)$
Dual problem: maximize $L(w, b, \alpha)$ over $\alpha$; the optimum in $(w, b)$ is a saddle point with $\alpha$.

49 Support Vector Machine, with slack
Primal problem: minimize $\frac{\lVert w \rVert^2}{2} + C \sum_i \xi_i$ over $(w, b)$, with $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Lagrange function:
$L(w, b, \xi, \alpha, \eta) = \frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i - \sum_i \alpha_i (y_i (\langle x_i, w \rangle + b) + \xi_i - 1) - \sum_i \eta_i \xi_i$
Dual problem: maximize over $(\alpha, \eta)$; the optimum in $(w, b, \xi)$ is a saddle point with $(\alpha, \eta)$.

50 Support Vector Machine, problem
Derivatives in $w$, $b$, $\xi$ need to vanish:
$\partial_w L(w, b, \xi, \alpha, \eta) = w - \sum_i \alpha_i y_i x_i = 0$
$\partial_b L(w, b, \xi, \alpha, \eta) = -\sum_i \alpha_i y_i = 0$
$\partial_{\xi_i} L(w, b, \xi, \alpha, \eta) = C - \alpha_i - \eta_i = 0$, i.e. $\eta_i = C - \alpha_i$
Dual problem: maximize over $\alpha$
$-\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$
under the constraints $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.

51 Support Vectors
Support vectors: $w = \sum_i y_i \alpha_i x_i$
Karush-Kuhn-Tucker (KKT) optimality condition: Lagrange multiplier times constraint equals zero,
$\alpha_i (y_i (\langle w, x_i \rangle + b) + \xi_i - 1) = 0$
$\eta_i \xi_i = 0$, i.e. $(C - \alpha_i) \xi_i = 0$
Thus,
$\alpha_i = 0 \Rightarrow y_i (\langle w, x_i \rangle + b) \ge 1$
$0 < \alpha_i < C \Rightarrow y_i (\langle w, x_i \rangle + b) = 1$
$\alpha_i = C \Rightarrow y_i (\langle w, x_i \rangle + b) \le 1$

52 Support Vector Machine, loss functions
Primal problem: minimize $\frac{\lVert w \rVert^2}{2} + C \sum_i \xi_i$ over $(w, b)$, with $y_i(w \cdot x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Equivalently, with a loss function: minimize $\frac{\lVert w \rVert^2}{2} + C \sum_i \max(0, 1 - y_i(w \cdot x_i + b))$ over $(w, b)$;
here, $\mathrm{loss}(x_i, y_i) = \max(0, 1 - y_i(w \cdot x_i + b)) = \max(0, 1 - y_i f(x_i))$ with $f(x) = w \cdot x + b$.

53 Support Vector Machine, common loss functions
hinge loss ($L_1$-loss): $\max(0, 1 - y_i(w \cdot x_i + b))$
squared hinge ($L_2$-loss): $\max(0, 1 - y_i(w \cdot x_i + b))^2$
logistic loss: $\log(1 + \exp(-y_i(w \cdot x_i + b)))$

54 (figure slide)

55 Experiments with different values for C: SVMvaryingC.ipynb#Varying-C-parameter
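A toy sketch of the effect of C on hypothetical blob data (small C tolerates slack and keeps a wide margin, large C penalizes violations harshly):

    from sklearn.datasets import make_blobs
    from sklearn.svm import SVC

    X, y = make_blobs(n_samples=100, centers=2, random_state=0)

    for C in [0.01, 1, 100]:
        clf = SVC(kernel='linear', C=C).fit(X, y)
        print(C, clf.n_support_)   # number of support vectors per class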

56 Non linearly separable data

57-58 Non linearly separable data, $\Phi(x) = (x, x^2)$ (figure slides)

59 Linear case
Primal problem: minimize $\frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i$ over $(w, b)$, subject to $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Dual problem: maximize $-\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i$ over $\alpha$, subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.
Support vector expansion: $f(x) = \sum_i \alpha_i y_i \langle x_i, x \rangle + b$

60 With a transformation $\Phi : x \mapsto \Phi(x)$
Primal problem: minimize $\frac{1}{2} \lVert w \rVert^2 + C \sum_i \xi_i$ over $(w, b)$, subject to $y_i(\langle w, \Phi(x_i) \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$.
Dual problem: maximize $-\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle \Phi(x_i), \Phi(x_j) \rangle + \sum_i \alpha_i$ over $\alpha$, subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.
Support vector expansion: $f(x) = \sum_i \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle + b$

61 The kernel trick
Kernel function: $k(x, x') = \langle \Phi(x), \Phi(x') \rangle$. We just need to compute the dot product in the new space.
Dual problem: maximize $-\frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j k(x_i, x_j) + \sum_i \alpha_i$ over $\alpha$, subject to $\sum_i \alpha_i y_i = 0$ and $0 \le \alpha_i \le C$.
Support vector expansion: $f(x) = \sum_i \alpha_i y_i k(x_i, x) + b$

62 Kernels
Kernel functions:
linear: $k(x, x') = \langle x, x' \rangle$
polynomial: $k(x, x') = (\gamma \langle x, x' \rangle + r)^d$
rbf: $k(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$

63 The RBF kernel implies an infinite-dimensional space
Here we're in dimension 1, $x \in \mathbb{R}$, with $\gamma = 1$:
$k(x, x') = \exp(-(x - x')^2) = \exp(-x^2) \exp(-x'^2) \exp(2 x x')$
With a Taylor expansion of $\exp(2 x x')$,
$k(x, x') = \exp(-x^2) \exp(-x'^2) \sum_{k=0}^{\infty} \frac{2^k x^k x'^k}{k!} = \left\langle \left(\dots, \sqrt{\tfrac{2^k}{k!}}\, e^{-x^2} x^k, \dots\right), \left(\dots, \sqrt{\tfrac{2^k}{k!}}\, e^{-x'^2} x'^k, \dots\right) \right\rangle$

64 Experiments with different kernels
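A quick sketch on hypothetical concentric-circle data, where only a non-linear kernel can separate the classes (gamma='scale' is the default in recent scikit-learn versions):

    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

    for kernel in ['linear', 'poly', 'rbf']:
        clf = SVC(kernel=kernel, gamma='scale').fit(X, y)
        print(kernel, clf.score(X, y))   # training accuracy per kernel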

65 SVM in multiclass
one-vs-the-rest: $N_C$ binary classifiers (but each trained on the whole dataset). At prediction time, choose the class with the maximum decision value.
one-vs-one: $\frac{N_C (N_C - 1)}{2}$ binary classifiers. At prediction time, vote.

66 SVM in scikit-learn
SVC: Support Vector Classification.
sklearn.svm.LinearSVC: based on the Liblinear library; strategy: one-vs-the-rest; only linear kernel; loss can be hinge or squared hinge.
sklearn.svm.SVC: based on libsvm; multiclass strategy: one-vs-one; kernel can be linear, polynomial, RBF, sigmoid, or precomputed; only hinge loss.
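A sketch of LinearSVC on 20 newsgroups, chained with tfidf in a Pipeline (the score depends on the version and parameters):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    # liblinear, one-vs-the-rest, linear kernel only
    svm = Pipeline([('tfidf', TfidfVectorizer()),
                    ('clf', LinearSVC(C=1.0))])
    svm.fit(train.data, train.target)
    print(svm.score(test.data, test.target))   # mean accuracy on the test set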

67 Contents
1 Problem definition
2 Extracting features from text files
3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters: Cross validation, Grid search
4 Conclusion

68 Cross validation I
Overfitting: estimating parameters on the test set can lead to overfitting: the parameters are the best for this test set but not in the general case.
Train, test and validation datasets. A solution: tweak the parameters on a separate validation dataset and keep the test set for the final evaluation. Drawback: only few data remain in the training dataset.

69 Cross validation II
k-fold cross validation: split the training data into k partitions of the same size; train the model on k - 1 partitions, then evaluate on the k-th partition; rotate so that each partition is used once for evaluation, and average the k scores.
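A sketch with scikit-learn (cross_val_score lives in sklearn.model_selection in recent versions, sklearn.cross_validation in older ones):

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer

    train = fetch_20newsgroups(subset='train')
    X = TfidfVectorizer().fit_transform(train.data)

    # 5-fold cross validation: 5 train/evaluate rounds on the training set only
    scores = cross_val_score(MultinomialNB(), X, train.target, cv=5)
    print(scores.mean(), scores.std())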

70 Cross validation III

71 Grid Search
Grid search: test each value of each parameter; a brute-force algorithm to find the best value for each parameter.
In scikit-learn: automatically runs one training per fold and per combination of parameter values, and keeps the best model. Demo with scikit-learn.
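A sketch of GridSearchCV over a hypothetical parameter grid (2 ngram_range values x 3 C values x 3 folds = 18 trainings):

    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset='train')

    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
    params = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'clf__C': [0.1, 1, 10]}

    search = GridSearchCV(pipe, params, cv=3)   # cross-validates every combination
    search.fit(train.data, train.target)
    print(search.best_params_, search.best_score_)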

72 Contents
1 Problem definition
2 Extracting features from text files
3 Algorithms for classification
4 Conclusion: Methodology

73
1 Problem definition: Supervised classification; Evaluation metrics
2 Extracting features from text files: Bag of words model; Term frequency inverse document frequency (tfidf)
3 Algorithms for classification: Naïve Bayes; Support Vector Machine (SVM); Tuning parameters: Cross validation, Grid search
4 Conclusion: Methodology

74 Methodology
To solve a problem using Machine Learning, you have to:
1 Understand the data
2 Choose an evaluation measure
3 Be able to test the model
4 Find the main features
5 Try the algorithms, with different parameters

75 Conclusion
Machine Learning has a lot of applications.
With libraries like scikit-learn, no need to implement algorithms yourself.

76 Questions?

77 References
Machine Learning in Python:
Alex Smola's very good lecture on Machine Learning at CMU:
Kernels:
SVM:

78 Bernoulli Naïve Bayes
Features: $x_i = 1$ iff word $i$ is present in the document, else $x_i = 0$. The number of occurrences of word $i$ doesn't matter.
Bernoulli: for each feature $i$,
$P(x_i \mid y = k) = P(i \mid y = k)\, x_i + (1 - P(i \mid y = k))(1 - x_i)$
The absence of a feature is explicitly taken into account.
Estimation of $P(i \mid y = k)$ (with add-one smoothing):
$P(i \mid y = k) = \frac{1 + \text{nb of documents in } k \text{ that contain word } i}{2 + \text{nb of documents in } k}$
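A toy sketch of the Bernoulli variant in scikit-learn (BernoulliNB can also binarize counts itself via its binarize parameter):

    from sklearn.naive_bayes import BernoulliNB
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["the cat sat", "the dog ate", "cat and dog"]
    labels = [0, 1, 1]

    X = CountVectorizer(binary=True).fit_transform(docs)  # presence/absence features
    clf = BernoulliNB(alpha=1.0).fit(X, labels)           # alpha: add-one smoothing
    print(clf.predict(X))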
