Introduction to Machine Learning

1 Introduction to Machine Learning
Balázs Kégl
Département d'informatique et recherche opérationnelle, Université de Montréal
September 10-17, 2004

2 Outline
- Introduction, motivating examples
- The supervised learning model
- Brief history
- Algorithms and principles
- Unsupervised learning

3 Introduction
First example: handwritten character recognition
- design a function that maps every binary image to one of the 26 classes a, b, ..., z
- a lookup database? far too many entries!!
- manual design? soft specification, not enough knowledge
Learning to the rescue
- given n (image, label) pairs, where n is much smaller than the number of possible images
- learn a function that generalizes to unseen examples

4 Introduction
Second example: gene expression analysis
- design a function that maps every gene expression pattern (a real vector of length 30,000) to one of the 2 classes cancerous/healthy
- a lookup database? far too many entries!!
- manual design? soft specification, not enough knowledge
Learning to the rescue
- given n (gene expression, label) pairs, where n is much smaller than the number of possible patterns
- learn a function that generalizes to unseen examples

5 Introduction
Traditional applications: pattern recognition
- speech, fingerprint, face, object
New domains: data mining
- finance, bioinformatics, text processing (internet!)
Challenges
- tasks that have never been done by humans
- enormous data size, dimensionality, increased ambiguity

6 Introduction
Some unconventional applications
- personalized wireless portals
- improving opponent intelligence in video games
- user identification by mouse movements
- learning to fly
- football player recognition
- emotion classification in internet chat
- hardwood log defect detection

7 Introduction
Machine learning at the crossroads
- cognitive science, neurosciences
- statistics
- probability theory
- optimization theory, numerical algebra
- information theory, signal processing

16 The supervised learning model
[Figure: an example observation vector and its class label, a]
- observation vector: $x \in \mathbb{R}^d$
- class label: $y \in \{-1, 1\}$ (binary classification)
- classifier: $g : \mathbb{R}^d \to \{-1, 1\}$
- discriminant function: $f : \mathbb{R}^d \to [-1, 1]$, with $g(x) = 1$ if $f(x) \ge 0$ and $g(x) = -1$ if $f(x) < 0$
- decision border: $\{x : f(x) = 0\}$
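
The relation between the discriminant function f and the classifier g can be sketched in a few lines of Python. This is only an illustration: `make_classifier` and `f_example` are hypothetical names, and the example discriminant is an arbitrary choice clipped to [-1, 1].

```python
from typing import Callable, Sequence

Vector = Sequence[float]

def make_classifier(f: Callable[[Vector], float]) -> Callable[[Vector], int]:
    """Turn a discriminant function f into g(x) = 1 if f(x) >= 0, else -1."""
    def g(x: Vector) -> int:
        return 1 if f(x) >= 0 else -1
    return g

# Hypothetical discriminant on R^2, clipped so that its values lie in [-1, 1].
def f_example(x: Vector) -> float:
    return max(-1.0, min(1.0, x[0] - x[1]))

g_example = make_classifier(f_example)
```

Points with f(x) = 0 lie on the decision border; the definition on the slide assigns them to class 1.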

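24 The supervised learning model
[Figure: the teacher (nature) supplies (image, label) pairs with labels a, b, b, a, from which f is learned]
Learning by experience, with a teacher
- training sample: $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$
- function set: $\mathcal{F}$
- learning algorithm: $\mathrm{ALGO} : (\mathbb{R}^d \times \{-1, 1\})^n \to \mathcal{F}$, $\mathrm{ALGO}(D_n) = f$
- goal: small generalization error $P[\mathrm{ALGO}(D_n, X) \ne Y]$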
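
A minimal sketch of these ingredients, assuming one-dimensional inputs and a function set F made of threshold classifiers. The name `algo`, the candidate thresholds, and the toy data are all illustrative assumptions; the held-out error is used here as a simple estimate of the generalization error.

```python
from typing import Callable, List, Tuple

Sample = List[Tuple[float, int]]  # (x, y) pairs with x in R (d = 1), y in {-1, +1}

def error_rate(g: Callable[[float], int], sample: Sample) -> float:
    """Fraction of pairs (x, y) in the sample with g(x) != y."""
    return sum(1 for x, y in sample if g(x) != y) / len(sample)

def algo(train: Sample, thresholds: List[float]) -> Callable[[float], int]:
    """Hypothetical ALGO: pick the threshold classifier in F with the smallest training error."""
    def make(theta: float) -> Callable[[float], int]:
        return lambda x: 1 if x >= theta else -1
    function_set = [make(theta) for theta in thresholds]
    return min(function_set, key=lambda g: error_rate(g, train))

train = [(0.1, -1), (0.4, -1), (0.6, 1), (0.9, 1)]
held_out = [(0.2, -1), (0.45, -1), (0.55, 1), (0.8, 1)]  # unseen examples
g = algo(train, thresholds=[0.0, 0.25, 0.5, 0.75, 1.0])
```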

25 The supervised learning model
Research questions
- What is a good f?
- Given the function class and the training data, how can we find a good f?

27 Brief history
Algorithms
- 1958: Perceptron [Rosenblatt, 58], [Minsky Papert, 69]
- 1986: Multilayer neural networks and backpropagation [Rumelhart Hinton Williams, 86]
- 1995: Support Vector Machines [Boser Guyon Vapnik, 92], [Cortes Vapnik, 95]
- 1997: Boosting, AdaBoost [Freund, 95], [Freund Schapire, 97]

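29 The perceptron
Linear discriminant functions: $f(x) = \sum_{i=0}^{d} w^{(i)} x^{(i)} = w^t x$, with $x^{(0)} = 1$
[Figure: a single summation unit with inputs $x^{(0)} = 1, x^{(1)}, \ldots, x^{(d)}$, weights $w^{(0)}, \ldots, w^{(d)}$, and output $f(x)$]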

30 The perceptron
Learning algorithm: PERCEPTRON
  w ← w_0
  do
    find a misclassified training point x_i
    w ← w + y_i x_i
  until convergence
- converges to a classifier with zero training error if the training data is linearly separable
- learning principle: find $f \in \mathcal{F}$ that minimizes the training error
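
The perceptron loop can be sketched in plain Python as follows. This sketch assumes labels in {-1, +1}, starts from w = 0, and multiplies each update by the label; `max_epochs` is a safety cap added for illustration and is not part of the slide.

```python
from typing import List, Sequence, Tuple

def perceptron(data: Sequence[Tuple[Sequence[float], int]], max_epochs: int = 1000) -> List[float]:
    """Perceptron for labels y in {-1, +1}; the constant input x^(0) = 1 is prepended."""
    d = len(data[0][0])
    w = [0.0] * (d + 1)
    for _ in range(max_epochs):  # safety cap (assumption), in case the data is not separable
        mistakes = 0
        for x, y in data:
            xa = [1.0, *x]
            activation = sum(wi * xi for wi, xi in zip(w, xa))
            if y * activation <= 0:  # misclassified, or exactly on the border
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:  # a full pass without mistakes: zero training error
            return w
    return w

def predict(w: Sequence[float], x: Sequence[float]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, [1.0, *x])) >= 0 else -1

# Linearly separable toy data: logical AND with labels in {-1, +1}.
and_data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), 1)]
w_and = perceptron(and_data)
```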

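31 Multilayer neural nets
$f(x) = \sum_{j=1}^{T} \alpha^{(j)} \sigma(w_j^t x) = \sum_{j=1}^{T} \alpha^{(j)} h^{(j)}(x)$
[Figure: inputs $x^{(0)} = 1, x^{(1)}, \ldots, x^{(d)}$ feed T hidden units $h^{(1)}, \ldots, h^{(T)}$ through weights $w_1, \ldots, w_T$; the hidden outputs are combined with weights $\alpha^{(1)}, \ldots, \alpha^{(T)}$ into $f(x)$]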
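
The forward pass of this model can be written directly from the formula. The sketch below assumes tanh as the sigmoid and uses made-up parameters for T = 2 hidden units; none of these choices come from the slides.

```python
import math
from typing import Sequence

def sigmoid(z: float) -> float:
    """Smooth squashing function with values in (-1, 1); tanh is one common choice."""
    return math.tanh(z)

def hidden_unit(w: Sequence[float], x: Sequence[float]) -> float:
    """h^(j)(x) = sigma(w_j^t x), with the constant input x^(0) = 1 prepended."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, [1.0, *x])))

def network(alphas: Sequence[float], weights: Sequence[Sequence[float]], x: Sequence[float]) -> float:
    """f(x) = sum_j alpha^(j) * sigma(w_j^t x)."""
    return sum(a * hidden_unit(w, x) for a, w in zip(alphas, weights))

# Hypothetical parameters for T = 2 hidden units on inputs in R^2.
alphas = [0.5, -0.25]
weights = [[0.0, 1.0, 1.0], [0.0, 1.0, -1.0]]
```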

32 Multilayer neural nets
Algorithm: standard gradient descent optimization
- the hard threshold function is replaced by a smooth sigmoid
- error minimization is replaced by minimizing a soft quadratic error $(f(x) - y)^2$
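
To illustrate the two replacements, here is a sketch of gradient descent on the quadratic error for a single smooth unit. A full multilayer network would need backpropagation; this deliberately simplified example assumes tanh as the sigmoid, a fixed learning rate, and a fixed number of steps, all of which are illustrative choices.

```python
import math
from typing import List, Sequence, Tuple

Data = Sequence[Tuple[Sequence[float], int]]

def output(w: Sequence[float], x: Sequence[float]) -> float:
    """Smooth unit f(x) = tanh(w^t x), with x^(0) = 1 prepended."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, [1.0, *x])))

def quadratic_error(w: Sequence[float], data: Data) -> float:
    """Mean of (f(x) - y)^2 over the sample."""
    return sum((output(w, x) - y) ** 2 for x, y in data) / len(data)

def gradient_descent(data: Data, rate: float = 0.1, steps: int = 2000) -> List[float]:
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            f = output(w, x)
            # d/dw (f - y)^2 = 2 (f - y) (1 - f^2) x, because tanh' = 1 - tanh^2
            common = 2.0 * (f - y) * (1.0 - f * f) / len(data)
            for k, xk in enumerate([1.0, *x]):
                grad[k] += common * xk
        w = [wk - rate * gk for wk, gk in zip(w, grad)]
    return w

and_data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), 1)]
w_start = [0.0, 0.0, 0.0]
w_trained = gradient_descent(and_data)
```

Because the sigmoid is differentiable, the gradient exists everywhere, which is what the hard threshold of the perceptron lacks.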

33 Support vector networks
Find the unique linear separator exactly in the middle between the two classes
[Figure: two classes R_1 and R_2 in the (y_1, y_2) plane, separated by the optimal hyperplane with maximum margin b on either side]

34 Support vector networks
Algorithm: standard quadratic optimization
- can be extended to the non-separable case
- can be used to optimize non-linear discriminants
Model: $f(x) = \sum_{j=1}^{T} \alpha^{(j)} y^{(j)} K(x^{(j)}, x)$
- $K(\cdot, \cdot)$ is a similarity function (kernel)
- the $x^{(j)}$ are the support vectors
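
Evaluating the kernel model is straightforward once the support vectors and coefficients are known. The sketch below assumes a Gaussian kernel and uses hand-picked support vectors and coefficients; it does not perform the quadratic optimization that would produce them.

```python
import math
from typing import Sequence

def gaussian_kernel(u: Sequence[float], v: Sequence[float], width: float = 1.0) -> float:
    """One common similarity function: K(u, v) = exp(-||u - v||^2 / (2 width^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq / (2.0 * width ** 2))

def kernel_discriminant(alphas, labels, support_vectors, x, kernel=gaussian_kernel) -> float:
    """f(x) = sum_j alpha^(j) y^(j) K(x^(j), x)."""
    return sum(a * y * kernel(sv, x) for a, y, sv in zip(alphas, labels, support_vectors))

# Hypothetical support vectors, labels, and coefficients (not the result of training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]
```

With these values, points near (2, 2) get a positive score, points near (0, 0) a negative one, and the midpoint (1, 1) lies on the decision border.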

35 AdaBoost
Model: $f(x) = \sum_{j=1}^{T} \alpha^{(j)} h^{(j)}(x)$
- $h^{(j)}$ can be any weak classifier: extremely versatile
- interpretation: weighted vote of weak experts

36 AdaBoost
Algorithm
- extremely simple: no gradient descent, no quadratic optimization
- add experts (features h) one at a time
- set each expert's weight according to its weighted error (the lower the error, the larger the weight)
- train the next expert on the data points that were missed by the previous experts
- automatically concentrates on training points that are hard to learn

37 Generalized linear discriminant functions
Model: $f(x) = \sum_{j=1}^{N} \alpha^{(j)} h^{(j)}(x)$
- $h^{(j)} : X \to [-1, 1]$: simple classifiers, features, experts
- $\alpha^{(j)} \in \mathbb{R}^+$: weight of expert $h^{(j)}$ in the final vote

38 Generalized linear discriminant functions
Model: $f(x) = \sum_{j=1}^{N} \alpha^{(j)} h^{(j)}(x)$
Example: classifying periods (".")
- x = context, $y \in \{1$: end of sentence, $-1$: not end of sentence$\}$
- $h^{(1)}(x) \equiv 1$, $\alpha^{(1)} = 0.3$
- $h^{(2)}(x)$: if the previous word is in the dictionary then +1, otherwise -1; $\alpha^{(2)} = 0.5$
- $h^{(3)}(x)$: if the previous word begins with a lowercase letter then +1, otherwise -1; $\alpha^{(3)} = 0.4$
- etc.
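
The weighted vote of this example can be spelled out as follows. The three experts and their weights are taken from the slide; the `Context` type and the toy `DICTIONARY` are hypothetical stand-ins for whatever context representation and word list would be used in practice.

```python
from typing import Callable, List, NamedTuple, Set, Tuple

class Context(NamedTuple):
    previous_word: str  # the word immediately before the period

DICTIONARY: Set[str] = {"house", "ran", "dog"}  # hypothetical toy dictionary

def h1(x: Context) -> int:
    return 1  # constant expert

def h2(x: Context) -> int:
    return 1 if x.previous_word in DICTIONARY else -1

def h3(x: Context) -> int:
    return 1 if x.previous_word[:1].islower() else -1

EXPERTS: List[Tuple[float, Callable[[Context], int]]] = [(0.3, h1), (0.5, h2), (0.4, h3)]

def vote(x: Context) -> float:
    """f(x) = sum_j alpha^(j) h^(j)(x)."""
    return sum(alpha * h(x) for alpha, h in EXPERTS)

def is_sentence_end(x: Context) -> int:
    return 1 if vote(x) >= 0 else -1
```

For the context "dog." all three experts vote +1, so f = 1.2 and the period is classified as a sentence end. For "Dr." the last two experts vote -1, so f = 0.3 - 0.5 - 0.4 = -0.6 and the period is classified as not ending the sentence.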

39 Generalized linear discriminant functions
Classical neural networks
- the $h^{(j)}(x)$ must be differentiable
Support vector machines (SVM)
- $h^{(j)}(x) = y_j K(x_j, x)$
- a sophisticated nearest-neighbor method
AdaBoost
- no restriction on the form of $h^{(j)}(x)$!!

40 AdaBoost
The intuition behind the algorithm
- add one expert at a time
- choose the expert that is best on the training points misclassified by the previous experts
- set its weight according to its error (the lower the error, the larger the weight)

41 AdaBoost
Weights on the training points
- $w_1, \ldots, w_n$
- normalized weights: $\sum_{i=1}^{n} w_i = 1$
- initialize uniformly: $w = (1/n, \ldots, 1/n)$
- if $x_i$ is misclassified by $h^{(j)}$, increase $w_i$; otherwise, decrease $w_i$
- as the algorithm proceeds, the difficult points receive large weights

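42 AdaBoost
Choosing the expert
- binary experts: $h^{(j)} : X \to \{-1, 1\}$
- weighted error: $\varepsilon = \varepsilon(h) = \sum_{i=1}^{n} w_i I\{h(x_i) \ne y_i\}$
- choose the expert that minimizes the weighted error: $h^{(t)} = \arg\min_h \varepsilon(h)$
- set its weight to $\alpha^{(t)} = \frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon}$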

43 AdaBoost
Reweighting the points
- if $h(x_i) \ne y_i$ then $w_i \leftarrow w_i \frac{1}{2\varepsilon}$
- if $h(x_i) = y_i$ then $w_i \leftarrow w_i \frac{1}{2(1 - \varepsilon)}$
Convergence
- suppose $\varepsilon \le \frac{1}{2} - \delta$ in every iteration: h is slightly better than a random decision
- training error = 0 after $\frac{\ln n}{2\delta^2} + 1$ iterations
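
As a check on the reweighting rule, the following short derivation (a sketch, writing $\varepsilon$ for the weighted error of h before reweighting) shows that the new weights are again normalized, and that the rule agrees with the exponential update $w_i e^{-\alpha h(x_i) y_i} / Z$ with $\alpha = \frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon}$.

```latex
\begin{align*}
\sum_{i:\,h(x_i) \ne y_i} \frac{w_i}{2\varepsilon} &= \frac{\varepsilon}{2\varepsilon} = \frac{1}{2},
&
\sum_{i:\,h(x_i) = y_i} \frac{w_i}{2(1 - \varepsilon)} &= \frac{1 - \varepsilon}{2(1 - \varepsilon)} = \frac{1}{2},
\\
Z = \varepsilon e^{\alpha} + (1 - \varepsilon) e^{-\alpha} &= 2\sqrt{\varepsilon(1 - \varepsilon)},
&
\frac{e^{\alpha}}{Z} = \frac{1}{2\varepsilon}, \quad \frac{e^{-\alpha}}{Z} &= \frac{1}{2(1 - \varepsilon)}.
\end{align*}
```

The two halves sum to one, so the weights stay normalized, and under the new weights the expert just chosen has weighted error exactly 1/2, which forces the next expert to differ from it.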

44 AdaBoost
ADABOOST(D_n, BASE(·,·), T)
  w^(1) ← (1/n, ..., 1/n)                                  initial weights
  for t ← 1 to T
    h^(t) ← BASE(D_n, w^(t))                               choose expert
    ε^(t) ← Σ_{i: h^(t)(x_i) ≠ y_i} w_i^(t)                weighted error
    if ε^(t) ≥ 1/2 then
      return f^(t-1)(·) = Σ_{j=1}^{t-1} α^(j) h^(j)(·)
    α^(t) ← (1/2) ln((1 - ε^(t)) / ε^(t))                  weight of h^(t)
    for i ← 1 to n                                         reweight the points
      if h^(t)(x_i) ≠ y_i then
        w_i^(t+1) ← w_i^(t) / (2 ε^(t))
      else
        w_i^(t+1) ← w_i^(t) / (2 (1 - ε^(t)))
  return f^(T)(·) = Σ_{t=1}^{T} α^(t) h^(t)(·)
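
A runnable sketch of this pseudocode in Python, for one-dimensional data. The base learner `stump_base` (threshold stumps) is an assumption chosen for illustration, as is the handling of a zero-error expert, which the pseudocode does not address.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, int]   # (x, y) with x in R and y in {-1, +1}
Expert = Callable[[float], int]

def stump_base(data: Sequence[Point], w: Sequence[float]) -> Expert:
    """Hypothetical BASE learner: the threshold stump with the smallest weighted error."""
    xs = sorted({x for x, _ in data})
    thresholds = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    best, best_err = None, float("inf")
    for theta in thresholds:
        for sign in (1, -1):
            h = lambda x, theta=theta, sign=sign: sign if x > theta else -sign
            err = sum(wi for (x, y), wi in zip(data, w) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
    return best

def adaboost(data: Sequence[Point], base, T: int) -> Callable[[float], float]:
    n = len(data)
    w = [1.0 / n] * n                                   # initial weights
    experts: List[Tuple[float, Expert]] = []
    for _ in range(T):
        h = base(data, w)                               # choose expert
        eps = sum(wi for (x, y), wi in zip(data, w) if h(x) != y)  # weighted error
        if eps >= 0.5:
            break                                       # no better than chance: stop
        if eps == 0.0:
            experts.append((1.0, h))                    # assumption: finite weight for a perfect expert
            break
        alpha = 0.5 * math.log((1.0 - eps) / eps)       # weight of h
        experts.append((alpha, h))
        w = [wi / (2.0 * eps) if h(x) != y else wi / (2.0 * (1.0 - eps))
             for (x, y), wi in zip(data, w)]            # reweight the points
    return lambda x: sum(a * h(x) for a, h in experts)  # f(x) = sum_t alpha^(t) h^(t)(x)

# Toy data that no single stump classifies perfectly: labels +, -, + along the line.
toy = [(0.0, 1), (1.0, 1), (2.0, -1), (3.0, -1), (4.0, 1), (5.0, 1)]
f = adaboost(toy, stump_base, T=3)
```

On this toy sample the weighted errors of the three chosen stumps are 1/3, 1/4, and 1/6, and the sign of the combined vote matches every training label even though no individual stump does.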

45 AdaBoost
Extension 1: ternary experts, $h^{(j)} : X \to \{-1, 0, 1\}$
- experts can abstain
- local experts
- specialist experts

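46 AdaBoost
Choosing the expert
- error: $\varepsilon_- = \varepsilon_-(h) = \sum_{i=1}^{n} w_i I\{h(x_i) y_i = -1\}$
- correct classification rate: $\varepsilon_+ = \varepsilon_+(h) = \sum_{i=1}^{n} w_i I\{h(x_i) y_i = 1\}$
- abstention rate: $\varepsilon_0 = \varepsilon_0(h) = \sum_{i=1}^{n} w_i I\{h(x_i) y_i = 0\}$
- choose the expert that minimizes $\varepsilon_0 + 2\sqrt{\varepsilon_+ \varepsilon_-}$
- set its weight to $\alpha^{(t)} = \frac{1}{2} \ln \frac{\varepsilon_+}{\varepsilon_-}$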

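47 AdaBoost
Reweighting the points
- if $h(x_i) y_i = -1$ then $w_i \leftarrow w_i \frac{1}{2\varepsilon_- + \varepsilon_0 \sqrt{\varepsilon_- / \varepsilon_+}}$
- if $h(x_i) y_i = 1$ then $w_i \leftarrow w_i \frac{1}{2\varepsilon_+ + \varepsilon_0 \sqrt{\varepsilon_+ / \varepsilon_-}}$
- if $h(x_i) y_i = 0$ then $w_i \leftarrow w_i \frac{1}{\varepsilon_0 + 2\sqrt{\varepsilon_+ \varepsilon_-}}$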
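
The three-case update for ternary experts can be checked numerically. This sketch assumes the rates are written $\varepsilon_-$, $\varepsilon_+$, $\varepsilon_0$ for error, correct classification, and abstention, and that both $\varepsilon_+$ and $\varepsilon_-$ are positive; the function name `ternary_update` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def ternary_update(w: Sequence[float], margins: Sequence[int]) -> Tuple[float, List[float]]:
    """margins[i] = h(x_i) * y_i in {-1, 0, +1}. Returns (alpha, new weights).

    Assumes eps_plus > 0 and eps_minus > 0."""
    eps_minus = sum(wi for wi, m in zip(w, margins) if m == -1)  # error
    eps_plus = sum(wi for wi, m in zip(w, margins) if m == 1)    # correct classification rate
    eps_zero = sum(wi for wi, m in zip(w, margins) if m == 0)    # abstention rate
    alpha = 0.5 * math.log(eps_plus / eps_minus)
    denom = {
        -1: 2.0 * eps_minus + eps_zero * math.sqrt(eps_minus / eps_plus),
        1: 2.0 * eps_plus + eps_zero * math.sqrt(eps_plus / eps_minus),
        0: eps_zero + 2.0 * math.sqrt(eps_plus * eps_minus),
    }
    return alpha, [wi / denom[m] for wi, m in zip(w, margins)]

# Four equally weighted points: two classified correctly, one wrongly, one abstention.
w = [0.25, 0.25, 0.25, 0.25]
alpha, w_next = ternary_update(w, margins=[1, 1, -1, 0])
```

The new weights still sum to one; the misclassified point gains the most weight, the point on which the expert abstained gains a little, and the correctly classified points lose weight.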

48 AdaBoost
Extension 2: experts with confidences, $h^{(j)} : X \to [-1, 1]$
- $\mathrm{sign}(h^{(j)}(x))$ is the expert's classification of x
- $|h^{(j)}(x)|$ is the expert's confidence
Choosing the expert
- choose the expert that maximizes $\sum_{i=1}^{n} w_i h(x_i) y_i$
- set its weight to $\alpha^{(t)} = \arg\min_{\alpha} \sum_{i=1}^{n} w_i e^{-\alpha h^{(t)}(x_i) y_i}$

49 AdaBoost
Reweighting the points
$w_i^{(t+1)} \leftarrow \frac{w_i^{(t)} e^{-\alpha^{(t)} h^{(t)}(x_i) y_i}}{\sum_{j=1}^{n} w_j^{(t)} e^{-\alpha^{(t)} h^{(t)}(x_j) y_j}}$
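
For confidence-rated experts the weight $\alpha$ generally has no closed form, so it has to be found numerically. The sketch below minimizes the convex normalizer by bisection on its derivative; the search interval and iteration count are illustrative assumptions, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

def z(alpha: float, w: Sequence[float], u: Sequence[float]) -> float:
    """Z(alpha) = sum_i w_i exp(-alpha u_i), where u_i = h(x_i) y_i in [-1, 1]."""
    return sum(wi * math.exp(-alpha * ui) for wi, ui in zip(w, u))

def best_alpha(w: Sequence[float], u: Sequence[float],
               lo: float = -20.0, hi: float = 20.0, iters: int = 100) -> float:
    """Minimize the convex function Z by bisection on its derivative (interval is an assumption)."""
    def dz(alpha: float) -> float:
        return -sum(wi * ui * math.exp(-alpha * ui) for wi, ui in zip(w, u))
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if dz(mid) < 0:
            lo = mid   # Z still decreasing: the minimum lies to the right
        else:
            hi = mid
    return (lo + hi) / 2.0

def reweight(w: Sequence[float], u: Sequence[float], alpha: float) -> List[float]:
    norm = z(alpha, w, u)
    return [wi * math.exp(-alpha * ui) / norm for wi, ui in zip(w, u)]

# Binary special case: one of four equally weighted points is misclassified (error 1/4).
w = [0.25, 0.25, 0.25, 0.25]
u = [1.0, 1.0, 1.0, -1.0]
alpha = best_alpha(w, u)
w_next = reweight(w, u, alpha)
```

In the binary special case shown, the numerical minimizer agrees with the closed form $\frac{1}{2} \ln \frac{1 - \varepsilon}{\varepsilon}$, and the misclassified point ends up with weight 1/2.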

50 AdaBoost
Conclusion
- extremely simple learning: no gradient descent, no complicated numerical optimization
- intuitive interpretation: weighted vote of experts
- no restriction on the form of the experts
- state-of-the-art generalization performance

52 Unsupervised learning
Objective
- learn something from unlabeled data $X = \{X_1, \ldots, X_n\}$
Taxonomy
- density estimation: find the generating probability distribution
- clustering: find a good partitioning
- dimension reduction: find a good and efficient representation

53 Dimension reduction
Objective
- learn a projection $f : \mathbb{R}^d \to \mathbb{R}^D$ from $X = \{X_1, \ldots, X_n\}$
- projected data: $X' = \{X'_1, \ldots, X'_n\} = \{f(X_1), \ldots, f(X_n)\}$
Good representation
- small expected distortion (as in data compression)
- preserving pairwise distances (multidimensional scaling)
Efficient representation
- $D \ll d$

54 Dimension reduction
Applications
- data compression
- data mining: understanding the data
- visualization (D = 2, 3)
- fighting the curse of dimensionality: feature extraction

55 Dimension reduction
[Figures on slides 55 to 57: examples from [Teh Roweis, 02] and [Roweis Saul, 00]]

58 Dimension reduction
Data model
- embedded in a high-dimensional space
- concentrated on a manifold that is low-dimensional, non-linear, and smooth

59 Non-linearity

60 Smoothness
- local linearity

61 Dimension reduction
Iterative methods
- self-organizing maps, multidimensional scaling, auto-associative networks
- problem: local minima
Methods based on local linearity
- local linear embedding (LLE), ISOMAP

62 Dimension reduction
Local linear embedding [Roweis Saul, 00]
[Figure from the paper; the surviving text states the constraints on the reconstruction weights: $W_{ij} = 0$ if $X_j$ is not a neighbor of $X_i$, and each row sums to one, $\sum_j W_{ij} = 1$]

63 Dimension reduction
ISOMAP [Tenenbaum Silva Langford, 00]
- Step 1: construct a neighborhood graph
- Step 2: find the shortest path between each pair of points
- Step 3: use linear multidimensional scaling
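
The first two ISOMAP steps can be sketched in plain Python. This sketch assumes a symmetric k-nearest-neighbor graph and uses the Floyd-Warshall algorithm for the shortest paths; both are illustrative choices, and step 3 (multidimensional scaling on the resulting distance matrix) is omitted because it needs an eigendecomposition.

```python
import math
from typing import List, Sequence

def neighborhood_graph(points: Sequence[Sequence[float]], k: int) -> List[List[float]]:
    """Step 1: symmetric k-nearest-neighbor graph; edge weights are Euclidean distances."""
    n = len(points)
    dist = [[math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))) for q in points] for p in points]
    graph = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: (dist[i][j], j))[:k]
        for j in nearest:
            graph[i][j] = graph[j][i] = dist[i][j]
    return graph

def shortest_paths(graph: List[List[float]]) -> List[List[float]]:
    """Step 2: all-pairs shortest paths (Floyd-Warshall), approximating geodesic distances."""
    n = len(graph)
    d = [row[:] for row in graph]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

# Points along an L-shaped curve in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
geodesic = shortest_paths(neighborhood_graph(points, k=2))
```

For the two endpoints of the L-shaped curve, the graph distance is 4, the length along the curve, whereas the straight-line distance is only about 2.83. This gap is what lets ISOMAP unfold a non-linear manifold before applying linear multidimensional scaling.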
