Machine Learning in a Nutshell. Data Model Performance Measure Machine Learner

Size: px

Start display at page:

Download "Machine Learning in a Nutshell. Data Model Performance Measure Machine Learner"

Merry Todd
6 years ago
Views:

1 Machine Learning 1

2 Machine Learning in a Nutshell Data Model Performance Measure Machine Learner 2

3 Machine Learning in a Nutshell Data Model Performance Measure Machine Learner Data with a5ributes ID A1 Reflex RefLow RefHigh Label Normal No Normal No Normal Yes Elevated No Normal No Normal Yes Decreased Yes Normal Yes Instance x i X with label y i Y 3

Machine Learning in a Nutshell Data Model Performance Measure Machine Learner Data with a5ributes ID A1 Reflex RefLow RefHigh Label 1 5.6 Normal 3.4 7 No 2 5.5 Normal 2.4 5.7 No 3 5.3 Normal 2.4 5.7 Yes 4 5.

4 Machine Learning in a Nutshell Data Model Performance Measure Machine Learner Data with a5ributes ID A1 Reflex RefLow RefHigh Label Normal No Normal No Normal Yes Elevated No Normal No Normal Yes Decreased Yes Normal Yes Instance x i X with label y i Y Model LogisHc regression Support vector machines f : X Y Mixture Models Hierarchical Bayesian Networks x 1 x 2 x 3 x 4 x 5 4

5 Machine Learning in a Nutshell Data Model Performance Measure Machine Learner Data with a5ributes ID A1 Reflex RefLow RefHigh Label Normal No Normal No Normal Yes Elevated No Normal No Normal Yes Decreased Yes Normal Yes Instance x i X with label y i Y Model LogisHc regression Support vector machines f : X Y Mixture Models Hierarchical Bayesian Networks x 1 x 2 x 3 x 4 x 5 EvaluaHon Measure predicted labels vs actual labels on test data Performance Learning Curve # Training Examples 5

6 A training set 6

7 ID3-induced decision tree 7

8 I Model spaces I Decision tree + - I Nearest neighbor + + Version space 8

9 Decision tree-induced partition example Color green red blue Size + Shape big small square round - + Size + big small - + I 9

10 k-nearest Neighbor Instance-Based Learning Some material adapted from slides by Andrew Moore, CMU. Visit for Andrew s repository of Data Mining tutorials. 10

11 1-Nearest Neighbor n One of the simplest of all machine learning classifiers n Simple idea: label a new point the same as the closest known point Label it red.! 11

12 1-Nearest Neighbor n A type of instance-based learning n Also known as memory-based learning n Forms a Voronoi tessellation of the instance space 12

13 Distance Metrics n Different metrics can change the decision surface Dist(a,b) =(a 1 b 1 ) 2 + (a 2 b 2 ) 2! Dist(a,b) =(a 1 b 1 ) 2 + (3a 2 3b 2 ) 2! n Standard Euclidean distance metric: n Two-dimensional: Dist(a,b) = sqrt((a 1 b 1 ) 2 + (a 2 b 2 ) 2 ) n Multivariate: Dist(a,b) = sqrt( (a i b i ) 2 ) Adapted from Instance-Based 13 Learning! lecture slides by Andrew Moore, CMU.!

14 Four Aspects of an Instance-Based Learner: 1. A distance metric 2. How many nearby neighbors to look at? 3. A weighting function (optional) 4. How to fit with the local points? Adapted from Instance-Based Learning! lecture slides by Andrew Moore, 14 CMU.!

15 1-NN s Four Aspects as an Instance-Based Learner: 1. A distance metric n Euclidian 2. How many nearby neighbors to look at? n One 3. A weighting function (optional) n Unused 4. How to fit with the local points? n Just predict the same output as the nearest neighbor. Adapted from Instance-Based Learning! lecture slides by Andrew Moore, 15 CMU.!

Zen Gardens Mystery of renowned zen garden revealed [CNN Article] Thursday, September 26, 2002 Posted: 10:11 AM EDT (1411 GMT) LONDON (Reuters) -- For centuries visitors to the renowned Ryoanji

The five sparse clusters on a rectangle of raked gravel are said to be pleasing to the eyes of the hundreds of thousands of tourists who visit the garden each year.

16 Zen Gardens Mystery of renowned zen garden revealed [CNN Article] Thursday, September 26, 2002 Posted: 10:11 AM EDT (1411 GMT) LONDON (Reuters) -- For centuries visitors to the renowned Ryoanji Temple garden in Kyoto, Japan have been entranced and mystified by the simple arrangement of rocks. The five sparse clusters on a rectangle of raked gravel are said to be pleasing to the eyes of the hundreds of thousands of tourists who visit the garden each year. Scientists in Japan said on Wednesday they now believe they have discovered its mysterious appeal. "We have uncovered the implicit structure of the Ryoanji garden's visual ground and have shown that it includes an abstract, minimalist depiction of natural scenery," said Gert Van Tonder of Kyoto University. The researchers discovered that the empty space of the garden evokes a hidden image of a branching tree that is sensed by the unconscious mind. "We believe that the unconscious perception of this pattern contributes to the enigmatic appeal of the garden," Van Tonder added. He and his colleagues believe that whoever created the garden during the Muromachi era between knew exactly what they were doing and placed the rocks around the tree image. By using a concept called medial-axis transformation, the scientists showed that the hidden branched tree converges on the main area from which the garden is viewed. The trunk leads to the prime viewing site in the ancient temple that once overlooked the garden. It is thought that abstract art may have a similar impact. "There is a growing realisation that scientific analysis can reveal unexpected structural features hidden in controversial abstract paintings," Van Tonder said Adapted from Instance-Based Learning lecture slides by Andrew Moore, CMU.! 16

17 k Nearest Neighbor n Generalizes 1-NN to smooth away noise in the labels n A new point is now assigned the most frequent label of its k nearest neighbors Label it red, when k = 3! Label it blue, when k = 7! 17

18 k-nearest Neighbor (k = 9) Appalling behavior! Loses all the detail that 1-nearest neighbor would give. The tails are horrible!! A magnificent job of noise smoothing. Three cheers for 9- nearest-neighbor.! But the lack of gradients and the jerkiness isn t good.! Fits much less of the noise, captures trends. But still, frankly, pathetic compared! with linear regression.! Adapted from Instance-Based Learning! lecture slides by Andrew Moore, 18 CMU.!

19 The Naïve Bayes Classifier Some material adapted from slides by Tom Mitchell, CMU. 19

20 20 The Naïve Bayes Classifier ) ( ) ( ) ( ) ( j i j i j i X P Y X P Y P X Y P = n Recall Bayes rule: n Which is short for: n We can re-write this as: ) ( ) ( ) ( ) ( j i j i j i x X P y Y x X P y Y P x X y Y P = = = = = = = = = = = = = = = = k k k j i j i j i y Y P y Y x X P y Y x X P y Y P x X y Y P ) ( ) ( ) ( ) ( ) (

21 Deriving Naïve Bayes n Idea: use the training data to directly estimate: P( X Y ) and P(Y ) n Then, we can use these values to estimate P Y X ) using Bayes rule. ( new n Recall that representing the full joint probability P X, X,, X ) is not practical. ( 1 2 n Y 21

22 Deriving Naïve Bayes n However, if we make the assumption that the attributes are independent, estimation is easy! n = i P ( X, 1, X Y ) P( X Y ) i n In other words, we assume all attributes are conditionally independent given Y. n Often this assumption is violated in practice, but more on that later 22

23 Deriving Naïve Bayes n Let X = X,, 1 X n and label Y be discrete. n Then, we can estimate P( X i Yi ) and directly from the training data by counting! P( Y i ) Sky Temp Humid Wind Water Forecast Play? sunny warm normal strong warm same yes sunny warm high strong warm same yes rainy cold high strong warm change no sunny warm high strong cool change yes P(Sky = sunny Play = yes) =? P(Humid = high Play = yes) =? 23

24 24 The Naïve Bayes Classifier n Now we have: which is just a one-level Bayesian Network n To classify a new point X new : ) ( i H P Attributes (evidence) Labels (hypotheses) 1 n i j X X X Y ) ( j Y P ) ( j X i Y P = = = = = = k i k i k j i i j n j y Y X P y Y P y Y X P y Y P X X y Y P ) ( ) ( ) ( ) ( ),, ( 1 = = i k i k y new y Y X P y P Y Y k ) ( ) ( argmax

25 The Naïve Bayes Algorithm n For each value y k n Estimate P(Y = y k ) from the data. n For each value x ij of each attribute X i n Estimate P(X i =x ij Y = y k ) n Classify a new point via: Y new argmax P( Y y k = y k ) P( Xi Y = i y k ) n In practice, the independence assumption doesn t often hold true, but Naïve Bayes performs very well despite it. 25

Naïve Bayes Applications n Text classification n Which e-mails are spam? n Which e-mails are meeting notices?

26 Naïve Bayes Applications n Text classification n Which s are spam? n Which s are meeting notices? n Which author wrote a document? n Classifying mental states Learning P(BrainActivity WordCategory) Pairwise Classification Accuracy: 85% People Words Animal Words 26

27 Polynomial Curve Fitting Slides adapted from Pa5ern RecogniHon and Machine Learning by Christopher Bishop 27

28 Polynomial Curve FiUng

29 Sum- of- Squares Error FuncHon

30 0 th Order Polynomial

31 1 st Order Polynomial

32 3 rd Order Polynomial

33 9 th Order Polynomial

34 Over- fiung Error

35 Polynomial Coefficients

36 Data Set Size: 9 th Order Polynomial

37 Data Set Size: 9 th Order Polynomial

38 RegularizaHon Penalize large coefficient values Ẽ(w) = 1 2 N {y(x n, w) t n } 2 + λ 2 (w 2) 2 n=1 L 2 Norm w 2 = i w 2 i Measures the complexity of w

39 RegularizaHon Penalize large coefficient values regularization parameter higher à more regularization Ẽ(w) = 1 2 Ẽ(w) = 1 2 N {y(x n, w) t n } 2 + λ 2 (w 2) 2 n=1 N {y(x n, w) t n } 2 + λ 2 n=1 i w 2 i L 2 Norm w 2 = i w 2 i Measures the complexity of w

40 RegularizaHon: = 1.5E- 8 = 1.5E- 8

41 RegularizaHon: = 1 = 1

42 RegularizaHon: vs.

43 Polynomial Coefficients

44 Learning via Gradient Descent Ẽ(w) = 1 2 j Ẽ(w) = N n=1 N n=1 x j n Choose w randomly, where w j =w j α j Ẽ(w) 2 λ y(x n, w) t n + 2 y(x n, w) t n + λw j Repeat unhl w converges (i.e., w w old < ε ) w old = w For j = 0... M: w j N(0,σ 2 ) j w 2 j

45 The Gaussian DistribuHon

46 The MulHvariate Gaussian

47 Gaussian Parameter EsHmaHon Likelihood funchon

48 Maximum (Log) Likelihood

49 Curve FiUng Re- visited

50 Maximum Likelihood Determine by minimizing sum- of- squares error,.

51 PredicHve DistribuHon

52 MAP: A Step towards Bayes Determine by minimizing regularized sum- of- squares error,.

53 Bayesian Curve FiUng

54 Bayesian PredicHve DistribuHon

55 Model SelecHon Cross- ValidaHon

56 Curse of Dimensionality

57 Curse of Dimensionality Polynomial curve fiung, M = 3 Gaussian DensiHes in higher dimensions

Generative v. Discriminative classifiers Intuition

Logistic Regression Machine Learning 070/578 Carlos Guestrin Carnegie Mellon University September 24 th, 2007 Generative v. Discriminative classifiers Intuition Want to Learn: h:x a Y X features Y target